How to Ingest Files: Part I

Learn how to ingest information from files in several formats using Spark.

Ingesting data from files in Spark

Ingesting data is the first stage of a Big Data pipeline and, in many cases, the initial step of a typical batch process. Spark offers developers a wide range of options for ingesting files in different formats.

The formats studied in this lesson are exemplified with a single project; having a separate project for each of them would be overkill, since the operations are fairly concise.

Spark performs the loading internally through parsers, so let’s highlight the essential steps of using a parser:

  1. The input of a parser is the path of a file. This path can include wildcards (a glob pattern), which lets the developer load multiple files at once.

  2. Parsers take options as extra arguments, which we showcase in our examples. The values of these options are case sensitive, so “myPath” and “mypath” are not considered the same (see the sketch after this list).

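To make these two points concrete, here is a minimal sketch in Scala that reads CSV files through Spark’s DataFrameReader. The path `data/*.csv`, the option values, and the application name are assumptions chosen for this illustration; they are not files or settings from the lesson’s project.

```scala
import org.apache.spark.sql.SparkSession

object CsvIngestionSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for experimentation; adjust appName/master as needed.
    val spark = SparkSession.builder()
      .appName("CSV ingestion example")
      .master("local[*]")
      .getOrCreate()

    // Because the path contains a wildcard, every matching file is loaded
    // into a single DataFrame. "data/*.csv" is a placeholder path.
    val df = spark.read
      .format("csv")
      .option("header", "true")      // first line contains column names
      .option("inferSchema", "true") // let Spark guess the column types
      .load("data/*.csv")

    df.show(5)
    df.printSchema()

    spark.stop()
  }
}
```

The same reader calls (`format`, `option`, `load`) are available in Java and Python with equivalent signatures.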
The widget below contains the project for this lesson:
