How to Ingest Files: Part II

Explore methods to ingest various file formats such as XML, raw text, and Parquet using the Spark Java API. Understand how to configure Spark for XML parsing, handle raw text ingestion, and work with columnar Parquet files. Gain practical skills for integrating diverse data formats into big data batch applications.

Ingestion of XML files

Extensible Markup Language (XML) files are still widely used as a data format.

These files are structured, extensible, and self-describing (easy for us humans to read), and they can be validated against an XML Schema Definition (XSD) file.

Note: For more information on the XML format, please refer to: https://www.w3.org/XML/
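To make the validation point concrete, here is a minimal sketch that checks an XML document against an XSD using only the JDK's built-in javax.xml.validation package, before any Spark ingestion takes place. The class name XsdValidationSketch, the books/book/title/year schema, and the sample documents are all hypothetical and chosen purely for illustration; they are not part of the lesson's project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class XsdValidationSketch {

    // Hypothetical schema: a <books> root holding one or more <book> elements,
    // each with a string <title> and an integer <year>.
    public static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "  <xs:element name=\"books\">"
      + "    <xs:complexType>"
      + "      <xs:sequence>"
      + "        <xs:element name=\"book\" maxOccurs=\"unbounded\">"
      + "          <xs:complexType>"
      + "            <xs:sequence>"
      + "              <xs:element name=\"title\" type=\"xs:string\"/>"
      + "              <xs:element name=\"year\" type=\"xs:int\"/>"
      + "            </xs:sequence>"
      + "          </xs:complexType>"
      + "        </xs:element>"
      + "      </xs:sequence>"
      + "    </xs:complexType>"
      + "  </xs:element>"
      + "</xs:schema>";

    // Conforms to the schema above.
    public static final String VALID_XML =
        "<books><book><title>Dune</title><year>1965</year></book></books>";

    // Violates the schema: <year> is not an integer.
    public static final String INVALID_XML =
        "<books><book><title>Dune</title><year>unknown</year></book></books>";

    // Returns true if the XML conforms to the XSD. Returns false on any
    // SAXException, which covers a non-conforming or malformed document
    // as well as a malformed schema.
    public static boolean isValid(String xml, String xsd) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            // Not expected for in-memory strings; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(VALID_XML, XSD));
        System.out.println(isValid(INVALID_XML, XSD));
    }
}
```

In a real batch application, the XML and XSD would more likely be read from files, and a check of this kind could run as a guard before handing the file to Spark.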

On the downside, they tend to be quite verbose and, depending on the complexity of their structure, can be very hard to read. Nonetheless, this format is widely used, and Spark can parse it for us.

The project ...