Novice Elixir users might struggle with how to organize applications: when to use servers, how supervisors fit in, and how many applications to write.

For Elixir, the following approach might help us think through the above issues.

We should ask ourselves these five questions:

  • What’s the environment, and what are its constraints?
  • What are the obvious focal points?
  • What are the runtime characteristics?
  • What do we protect from errors?
  • How do we get this thing running?

What we’ll show in this chapter is just an ad hoc approach. Please don’t take it as a methodology. But if we’re feeling overwhelmed when designing a new Elixir system, these steps might help.

Let’s write a simple application to illustrate what we mean.

Duper

We have loads of duplicate files littering our computers. In an effort to tame this, let’s write a duplicate file finder. We’ll call it Duper so we can later create a paid version called SuperDuper. It’ll work by scanning all the files in a directory tree and calculating a hash for each. If two files have the same hash, we’ll report them as duplicates.
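
Before we start asking questions, it may help to see the core idea as a minimal, single-process sketch. The module and function names below are illustrative, not part of Duper’s eventual design, and reading each file whole is a shortcut the real design can’t afford:

```elixir
defmodule DuperSketch do
  # Naive version of the whole job: walk a directory tree, hash every
  # regular file, and group paths by hash. It reads each file whole,
  # so it only suits small files; the real design must do better.
  def duplicates_in(root) do
    root
    |> Path.join("**/*")
    |> Path.wildcard()
    |> Enum.filter(&File.regular?/1)
    |> Enum.group_by(fn path -> :crypto.hash(:md5, File.read!(path)) end)
    |> Map.values()
    |> Enum.filter(fn paths -> length(paths) > 1 end)
  end
end
```

Each inner list that comes back holds the paths of files whose contents hash identically.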

Let’s start asking the questions.

Q1: What’s the environment, and what are its constraints?

We’re going to run this on a typical computer. It’ll have roughly two orders of magnitude more file storage than main memory. Files will range in size from 10⁰ to 10¹⁰ bytes, and there will be roughly 10⁷ of them.

What this means:

We need to allow for the fact that although we have to load files into memory to determine their hashes, it’s possible we won’t have enough memory to load the largest files as a whole. We definitely won’t be able to process all the files at once.

It also means that our design will need to cater to both big and small files. Big files will take more time to read into memory than small files, and they will also take longer to hash.
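
Because the largest files won’t fit in memory, the hash will have to be computed incrementally. Here’s a sketch of what that could look like using Erlang’s :crypto digest functions; the 1 MB chunk size is an arbitrary assumption:

```elixir
defmodule FileHash do
  @chunk 1024 * 1024   # read 1 MB at a time; an arbitrary choice

  # Fold the file through a running digest so that only one chunk
  # is ever held in memory at a time, however big the file is.
  def hash_of_file_at(path) do
    File.stream!(path, [], @chunk)
    |> Enum.reduce(:crypto.hash_init(:md5), fn chunk, digest ->
      :crypto.hash_update(digest, chunk)
    end)
    |> :crypto.hash_final()
  end
end
```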

Q2: What are the focal points?

Each focal point represents a responsibility of the application. By identifying the focal points now, we can reduce coupling in the application as a whole: each focal point can be tightly coupled internally but loosely coupled to the others. This coupling can be both structural (for example, the representation of data) and temporal (for example, the sequence in which things happen).

In Duper, we can easily identify some key focal points:

  • We need to have a place where we collect results. We are calculating a hash value for each file, so this results store will need to hold all of these. Because we’re looking for duplicate hashes, it would make sense for this to be some kind of key-value store internally, where the key is the hash and the value is a list of all files with that hash. However, this is an implementation detail, and the implementation shouldn’t leak through our API.

  • We need to have something that can traverse the filesystem, returning each path just once.

  • We need to have something that can take a path and calculate the hash of the corresponding file. Because an individual file may be too big to fit in memory, we’ll have to read it in chunks, calculating the hash incrementally.

  • Because we know we’ll need to process multiple files concurrently in order to maximize our use of the CPU and IO bandwidth, we’ll need something that orchestrates the overall process.

This list may well change as we start to write code, but it’s good enough to get us to the next step.

What this means:

At the very least, each focal point we identify becomes an Elixir module. It’s wise to assume that most, if not all, will end up being servers. Many could even be separate Elixir applications, but that’s not something we’ll dig into here. Our code will be structured into four servers. Although we could do it with fewer, using four means we can give each one its own specific characteristics. The four are as follows:

  • The Results server is the most important server because it holds the results of the scan in memory. We need it to be reliable, so we won’t put much code in it.

  • The PathFinder server is responsible for returning the paths to each file in the directory tree, one at a time.

  • The Worker servers (run under a WorkerSupervisor) each ask the PathFinder for a path, calculate the hash of the corresponding file’s contents, and pass the result to the Gatherer.

  • The Gatherer server is the server that both gets everything going and determines when things have been completed. When they have, it fetches the results and reports on them.
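
To give a feel for how little code the Results server needs, here’s one possible shape for it, assuming the state is a map from hash to list of paths. The function names are our invention, not a fixed API:

```elixir
defmodule Duper.Results do
  use GenServer

  # Client API (names are illustrative)
  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  def add_hash_for(path, hash), do: GenServer.cast(__MODULE__, {:add, path, hash})
  def find_duplicates, do: GenServer.call(__MODULE__, :find_duplicates)

  # Server side: the state is a map of hash => [paths]
  def init(state), do: {:ok, state}

  def handle_cast({:add, path, hash}, results) do
    {:noreply, Map.update(results, hash, [path], &[path | &1])}
  end

  def handle_call(:find_duplicates, _from, results) do
    duplicates =
      results
      |> Map.values()
      |> Enum.filter(&(length(&1) > 1))

    {:reply, duplicates, results}
  end
end
```

Because the server does nothing but maintain the map, there’s very little in it that can crash.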

Q3: What are the runtime characteristics?

Our application is going to spend the vast majority of its time in the workers because this is where we read the files and calculate the hash values. Our goal is to keep both the processors and the IO bus as busy as possible in order to maximize performance.

If we had just one worker, it would read a file, hash it, read the next, hash it, and so on. We’d alternate between being IO bound and CPU bound, which doesn’t come close to maximizing our performance.

On the other hand, if we had one worker for each file, they could all be reading and hashing at the same time. However, we’d run out of memory, because we’d effectively be trying to load the entire filesystem into RAM.

The sweet spot lies in between.

One approach is to create n workers and then divide the work equally among them. This is the typical push model: plan the work upfront and let it execute. The problem with this approach is that it assumes each file is about the same size. If that’s not the case, it would be possible to give one worker mostly small files and another mostly large files. The first would finish early and then sit idle while the second chewed through its workload.

The approach we prefer in this scenario is what we call a hungry consumer. It’s a pull model where each worker asks for the next thing to do, processes it, and then asks for more work. In this scheme, a worker that has a small file to process will get it done quickly, then ask for more work. One with a bigger file will take more time. There’ll never be an idle worker until we get to the very last files.
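
A hungry-consumer worker could be sketched as a GenServer that sends itself a message whenever it’s ready for more work. The Duper.PathFinder.next_path/0 and Duper.Gatherer calls below are placeholders for whatever API we settle on, and FileHash is the chunked-hashing sketch from earlier:

```elixir
defmodule Duper.Worker do
  use GenServer, restart: :transient

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :no_args)

  def init(:no_args) do
    send(self(), :do_one_file)   # ask for work as soon as we start
    {:ok, nil}
  end

  def handle_info(:do_one_file, state) do
    case Duper.PathFinder.next_path() do          # placeholder API
      nil ->
        Duper.Gatherer.done()                     # no paths left
        {:stop, :normal, state}

      path ->
        Duper.Gatherer.result(path, FileHash.hash_of_file_at(path))
        send(self(), :do_one_file)                # pull the next piece of work
        {:noreply, state}
    end
  end
end
```

Because each worker pulls its next path only when it’s free, big and small files interleave naturally, with no upfront planning.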

The following sequence diagram shows how messages flow in this system. Notice that we have a mixture of synchronous messaging (the pairs of arrows going in opposite directions) and asynchronous messaging.
