Ingesting Databases

Let’s delve into loading information from a database.

Spark and databases

Though many different data sources feed big data pipelines, relational databases remain a de facto choice as a data repository, particularly where the business domain requires data normalization, relationships between domain models, strongly consistent transactions, and so on.

Spark can interact with an RDBMS to load a whole table, or to load a fraction of a table by executing a query against it. It can also apply operations such as filtering and aggregation while ingesting from the database, which helps minimize the volume of data retrieved.
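As a minimal sketch of loading a whole table, the following reads a table into a DataFrame over JDBC. The connection URL, table name, and credentials are placeholders for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcIngestionApp {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Database ingestion")
                .master("local[*]")
                .getOrCreate();

        // Load an entire table into a DataFrame over JDBC.
        // URL, table, and credentials below are illustrative placeholders.
        Dataset<Row> customers = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://localhost:5432/sales_db")
                .option("dbtable", "public.customers")
                .option("user", "spark_user")
                .option("password", "secret")
                .load();

        customers.show(5);
        spark.stop();
    }
}
```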

With regard to minimizing the amount of data retrieved, one sensible strategy is to filter at the database level while querying tables, whenever possible, as the following sketch illustrates. The immediate benefit of this strategy is a reduction in the volume of data transferred from the source.
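Continuing with the SparkSession from the sketch above, one way to push filtering down to the source is to supply a query instead of a table name (the `query` option is available from Spark 2.4 onward; the `orders` table and its columns are hypothetical):

```java
// Hand the database a query so the filtering (and any projection)
// happens at the source, not in Spark. Table and column names are
// illustrative.
Dataset<Row> recentOrders = spark.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/sales_db")
        .option("query",
                "SELECT id, customer_id, amount "
              + "FROM orders WHERE order_date >= DATE '2021-01-01'")
        .option("user", "spark_user")
        .option("password", "secret")
        .load();
```

On older Spark versions, the same effect can be achieved by passing a parenthesized subquery with an alias to the `dbtable` option.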

Note: We should always think in big data terms, assume that the volume of information is huge, and aim for efficiency in every possible corner of our application.

Java applications use a dialect when interacting with a database, and Spark incorporates the same principle: it establishes the connection to the database through Java's JDBC driver for the particular vendor (such as PostgreSQL or MySQL).

For this reason, the JDBC driver for the particular database vendor (or dialect) must be available on the worker nodes of the cluster, or on the local classpath when developing in standalone mode.
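As an illustration, assuming a PostgreSQL source, the driver can be supplied at submission time; the JAR path, version number, and application name below are placeholders:

```sh
# Ship the driver JAR to the driver and executors explicitly...
spark-submit --jars /path/to/postgresql-42.2.24.jar my-app.jar

# ...or let Spark resolve it from its Maven coordinates.
spark-submit --packages org.postgresql:postgresql:42.2.24 my-app.jar
```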

[Illustration: Spark's interaction with the database through the vendor's JDBC driver]
