Case Study
Explore alternatives to CSV format, like dictionaries and JSON, for flexible data serialization and processing.
We'll cover the following
In the case study present in the previous chapters, we’ve been skirting an issue that arises frequently when working with complex data. Files have both a logical layout and a physical format. We’ve been laboring under a tacit assumption that our files are in CSV format, with a layout defined by the first line of the file. In Chapter 2, we touched on file loading. In Chapter 6, we revisited loading data and partitioning it into training and testing sets.
In both previous chapters, we trusted that the data would be in a CSV format. This isn’t a great assumption to make. We need to look at the alternatives and elevate our assumptions into a design choice. We also need to build in the flexibility to make changes as the context for using our application evolves.
It’s common to map complex objects to dictionaries, which have a tidy JSON representation. For this reason, the Classifier
web application makes use of dictionaries. We can also parse CSV data into dictionaries. The idea of working with dictionaries provides a kind of grand unification of CSV, Python, and JSON. We’ll start by looking at the CSV format before moving on to some alternatives for serialization, like JSON.
CSV format designs
We can make use of the csv
module to read and write files. CSV stands for Comma-Separated Values, designed (originally) to export and import data from a spreadsheet.
The CSV format describes a sequence of rows. Each row is a sequence of strings. That’s all there is, and it can be a bit of a limitation.
The comma in CSV is a role, not a specific character. The purpose of this character is to separate the columns of data. For the most part, the role of the comma is played by the literal “,
”. But other actors can fill this role. It’s common to see the tab character, written as "\t"
or "\x09"
, fill the role of the comma.
The end-of-line is often the CRLF sequence, written as "\r\n"
or \x0d\x0a
. On macOS X and Linux, it’s also possible to use a single newline character, \n
, at the end of each row. Again, this is a role, and other characters could be used.
In order to contain the comma character within a column’s data, the data can be quoted. This is often done by surrounding a column’s value with the "
character. It’s possible to specify a different quote character when describing a CSV dialect.
Because CSV data is simply a sequence of strings, any other interpretation
of the data requires processing by our application. For example, within
the TrainingSample
class, the load()
method includes processing like the following:
Get hands-on with 1200+ tech skills courses.