Directory Structure

Why good organization is needed

Good organization is a prerequisite for a readable and maintainable piece of software. How we organize code, configuration, and other files will affect how easily another person can understand and modify our code.

We can write all our code in a single file and hardcode configuration values and magic numbers, but this makes it extremely difficult for someone new to the project to parse what we’ve written. Most data scientists work in a team environment, so our objective is to write code that is not only functional and free of defects but also readable and maintainable by others. A logically sound directory structure is the ideal starting point for good code organization.

A well-designed directory structure achieves the following:

  • It makes it easy for others, especially those new to the project, to understand the architecture of the system.

  • It separates logically distinct units. For example, data processing code goes in one subdirectory, modeling code in another, configuration files in a directory of their own, and unit test cases apart from the rest of the code.

  • It separates code, configuration, and documentation.

  • It prevents confusion and makes the code easy to maintain.

The directory structure of the pipeline code

We’ll organize our directories in the following manner:

  • ml_pipeline_tutorial/

    • config/

      • projects/

    • data/

    • ml_pipeline/

      • datasets/

      • mixins/

      • models/

      • tests/

    • tests/

The ml_pipeline_tutorial directory is the top-level directory that contains everything we create in this course. We can create this anywhere in our file system.
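The structure above can be created with a few shell commands (a minimal sketch; run it from whichever parent directory you choose):

```shell
# Create the tutorial's directory tree; -p creates missing parents
# and does nothing if a directory already exists
mkdir -p ml_pipeline_tutorial/config/projects
mkdir -p ml_pipeline_tutorial/data
mkdir -p ml_pipeline_tutorial/ml_pipeline/datasets
mkdir -p ml_pipeline_tutorial/ml_pipeline/mixins
mkdir -p ml_pipeline_tutorial/ml_pipeline/models
mkdir -p ml_pipeline_tutorial/ml_pipeline/tests
mkdir -p ml_pipeline_tutorial/tests
```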

The config directory contains all configuration (config) files. These files hold, for example, the path to the data or the uniform resource identifier (URI) of the remote data source, data-related parameters, the names of the features to use for training the model, the name of the target variable, and the type of model we’d like to use. They also hold some hyperparameters and the operations to perform on the data during the pipeline run. Why store these values in config files rather than in code? First, it reduces the chance of breaking the program when updating frequently changed values, and second, it allows flexibility. We’ll discuss config files in more detail in another chapter.
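To make this concrete, here is a rough sketch of what such a file could contain. The file name, keys, and values below are hypothetical illustrations of the kinds of settings just described, not the course’s actual config:

```yaml
# config/projects/example_project.yaml -- hypothetical illustration
data:
  path: data/example.csv        # local flat file, or a remote URI instead
features:                       # columns used to train the model
  - age
  - income
target: churn                   # name of the target variable
model:
  type: random_forest           # the kind of model to train
  hyperparameters:
    n_estimators: 100
    max_depth: 5
pipeline:
  operations:                   # operations applied during the pipeline run
    - impute_missing
    - scale_features
```

Keeping these values in a file like this means they can be changed without touching, and possibly breaking, the pipeline code itself.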

The data directory contains all data files. As mentioned earlier, for the purpose of keeping things simple in this course, we’ll work with flat files on the local disk rather than files fetched from a cloud service or data loaded from databases. Note that the contents of the data directory are never checked in to version control. In Git, for example, you would create a .gitignore file under ml_pipeline_tutorial and enter the following to prevent data from being checked in to our repository:
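The entry itself (reconstructed here, since the snippet was not included in this extract) would simply name the data directory so Git ignores everything inside it:

```
# .gitignore -- keep data files out of version control
data/
```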
