Fetch Large Datasets With Streams

Learn how to fetch large datasets with streams.

What are streams

Streams are a core part of Elixir. We use streams for:

  • Processing data lazily
  • Avoiding loading large amounts of data into memory at once
  • Working with infinite data sources
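
To make the list above concrete, here is a minimal sketch that uses only the standard Stream and Enum modules. The variable names are ours and are purely illustrative. It builds an infinite sequence, transforms it lazily, and then pulls only the five values we actually need.

```elixir
# An infinite, lazy sequence: nothing is computed when the pipeline is built.
squares_of_evens =
  Stream.iterate(1, &(&1 + 1))
  |> Stream.filter(&(rem(&1, 2) == 0))
  |> Stream.map(&(&1 * &1))

# Enum.take/2 pulls just enough elements to produce five results, then stops.
first_five = Enum.take(squares_of_evens, 5)
IO.inspect(first_five)
# => [4, 16, 36, 64, 100]
```

Because the work happens only inside Enum.take/2, the infinite source never has to be held in memory.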

Many of Elixir’s concurrency constructs, such as the Task.async_stream function and the GenStage and Flow packages, build on top of streams.

Concurrency is especially valuable when working with databases: much of the time spent executing a query goes to waiting on network I/O, and the CPU is free to do other work in the meantime.
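
As a rough illustration of this point, the sketch below uses Task.async_stream to overlap several slow operations. The fake_query function is a hypothetical stand-in of our own: it only sleeps to simulate waiting on network I/O and does not talk to a real database.

```elixir
# Hypothetical stand-in for a query: sleeps to simulate waiting on network I/O.
fake_query = fn id ->
  Process.sleep(50)
  {id, "row #{id}"}
end

# Run up to four "queries" at a time. While one task is waiting,
# the others can make progress.
results =
  1..8
  |> Task.async_stream(fake_query, max_concurrency: 4, timeout: 5_000)
  |> Enum.map(fn {:ok, row} -> row end)

IO.inspect(results)
```

By default Task.async_stream emits results in the same order as the input, so the rows come back as 1 through 8 even though they were processed concurrently.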

We’ve used Repo.all to fetch data from the database throughout much of the course. Its stream-based counterpart is Repo.stream. It returns a lazy stream that can work with a database as its source.

Like other Elixir streams, it won’t start loading data until it is enumerated, and we can combine it with other functions in the Stream module. It fetches rows from the database only as they are needed. By default, it fetches them in chunks of 500, which we can change with the :max_rows option.
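
To get a feel for this chunked, on-demand behavior without a database, here is a small simulation built on Stream.resource. The ChunkedSource module is a hypothetical stand-in that we made up for illustration. It is not how Ecto implements Repo.stream, but it follows the same idea of fetching a fixed-size chunk only when the consumer runs out of rows.

```elixir
defmodule ChunkedSource do
  # Hypothetical stand-in for a table with `total` rows (NOT Ecto itself).
  # Each "fetch" returns up to `chunk_size` rows and sends a message to the
  # caller so we can see how many fetches were actually made.
  def stream(total, chunk_size \\ 500) do
    parent = self()

    Stream.resource(
      fn -> 1 end,
      fn
        offset when offset > total ->
          {:halt, offset}

        offset ->
          last = min(offset + chunk_size - 1, total)
          send(parent, {:fetched, offset, last})
          {Enum.to_list(offset..last), last + 1}
      end,
      fn _offset -> :ok end
    )
  end
end

# Helper that lists the fetches recorded in this process's mailbox so far.
fetches = fn ->
  {:messages, msgs} = Process.info(self(), :messages)
  for {:fetched, first, last} <- msgs, do: {first, last}
end

rows = ChunkedSource.stream(10_000)
before_use = fetches.()

# Taking 750 rows needs only the first two chunks of 500.
first_rows = Enum.take(rows, 750)
after_use = fetches.()

IO.inspect(before_use)
# => []
IO.inspect(after_use)
# => [{1, 500}, {501, 1000}]
```

Creating the stream triggers no fetch at all, and the remaining 9,000 simulated rows are never loaded.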

Example

Let’s look at an example of how to use Repo.stream to process many records. Say that we want to dump all of our artist records to a file on our local filesystem. Assume for the moment that save_artist_record is a function that writes a record to a file.
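
One possible shape for this is sketched below. Everything named here is an assumption made for illustration: ArtistDump, FakeRepo, the sample artists, and the body of save_artist_record are all hypothetical. FakeRepo is a stand-in that lets the sketch run without a database. With a real Ecto repo we would pass our own Repo module and a query instead. The stream is enumerated inside a transaction because SQL adapters such as Postgres and MySQL can only enumerate a Repo.stream inside one.

```elixir
defmodule ArtistDump do
  # `repo` is any module exposing transaction/1 and stream/1, as Ecto repos do.
  # `save_fun` plays the role of save_artist_record from the text.
  def run(repo, queryable, save_fun) do
    repo.transaction(fn ->
      queryable
      |> repo.stream()
      |> Enum.each(save_fun)
    end)
  end
end

# Stand-in repo so the sketch runs without a database (NOT Ecto itself).
defmodule FakeRepo do
  def transaction(fun), do: {:ok, fun.()}
  def stream(rows), do: Stream.map(rows, & &1)
end

path = Path.join(System.tmp_dir!(), "artists_dump.txt")
File.rm(path)

# Hypothetical implementation: appends one line per artist to the file.
save_artist_record = fn artist ->
  File.write!(path, "#{artist.id},#{artist.name}\n", [:append])
end

# Made-up sample data standing in for rows from an artists table.
artists = [%{id: 1, name: "Artist A"}, %{id: 2, name: "Artist B"}]

{:ok, :ok} = ArtistDump.run(FakeRepo, artists, save_artist_record)
dumped = File.read!(path)
IO.puts(dumped)
```

Because each record is written as soon as it arrives from the stream, the full result set never has to sit in memory at once.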
