Generating Functions from External Data

Explore how Elixir combines macros and compile-time code generation to transform external data files into efficient function definitions. Understand the use of UnicodeData.txt in the String.Unicode module for advanced pattern matching and performance. Discover how these techniques reduce boilerplate and enable robust internationalization and MIME-type validation libraries.

We’ve performed compile-time code generation through careful use of macros. In this chapter, we will exploit Elixir’s module system. With advanced metaprogramming, we can embed data and behavior within modules directly from outside sources of information.

By leaning on Elixir’s module system, we can remove countless lines of boilerplate while still producing highly optimized programs. First, we’ll explore how Elixir embeds an entire Unicode database at compile time for its robust Unicode support.

Next, we’ll build MIME-type validation and internationalization libraries, while applying compile-time optimizations that aren’t possible in many languages.

Note: Knowing when and where to use this technique will allow us to construct fast, maintainable programs in strikingly few lines of code.

What is MIME?

MIME is short for Multipurpose Internet Mail Extensions. It is an Internet standard that extends the format of email messages to support text in character sets other than ASCII and attachments of audio, video, images, and application programs.

String Unicode example

Turning raw data into code might sound impractical, but it’s a very practical solution to a number of problems. Elixir’s own Unicode handling is one of the best examples of metaprogramming to date. When compiled, the String.Unicode module of the standard library dynamically generates thousands of function definitions from external data. These generated functions pattern match on all known Unicode characters to achieve some of the best Unicode support in languages today.

Let’s look more closely at the String.Unicode module to understand how Elixir makes this happen. Instead of manually mapping tens of thousands of Unicode code points into an Elixir data structure, a UnicodeData.txt file containing every known Unicode code-point mapping is checked into the Elixir source repository. This dataset is read in at compile time to produce function definitions that handle Unicode conversions.

Here’s an overview of how it works:

Shell
## UnicodeData.txt snippet:
...
00C7;LATIN CAPITAL LETTER C WITH CEDILLA;Lu;0;L;0043 0327;...
00C8;LATIN CAPITAL LETTER E WITH GRAVE;Lu;0;L;0045 0300;...
00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;...
00CA;LATIN CAPITAL LETTER E WITH CIRCUMFLEX;Lu;0;L;0045 0302;...
00CB;LATIN CAPITAL LETTER E WITH DIAERESIS;Lu;0;L;0045 0308;...
...

The UnicodeData.txt file contains 27,000 lines of these semicolon-delimited code-point mappings. The String.Unicode module opens the file at compile time and parses each line into function definitions. The final expansion contains one function definition per code point for case conversions and other string transformations.
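To make the parsing step concrete, here is a minimal sketch of the idea rather than the actual String.Unicode source. The module name UnicodeSketch and the functions name/1 and category/1 are hypothetical, and the sample data is inlined instead of read from a file. It keeps only the leading fields shown in the snippet above and drops the elided trailing ones.

```elixir
defmodule UnicodeSketch do
  # Hypothetical inline sample: only the leading fields visible in the
  # UnicodeData.txt snippet are kept; the elided trailing fields are omitted.
  @lines """
  00C7;LATIN CAPITAL LETTER C WITH CEDILLA;Lu;0;L;0043 0327
  00C8;LATIN CAPITAL LETTER E WITH GRAVE;Lu;0;L;0045 0300
  00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301
  """

  # Parse every line once, while the module is being compiled.
  entries =
    for line <- String.split(@lines, "\n", trim: true) do
      [hex, char_name, category | _rest] = String.split(line, ";")
      codepoint = String.to_integer(hex, 16)
      {<<codepoint::utf8>>, char_name, category}
    end

  # One name/1 clause per code point, matching on the literal character.
  for {char, char_name, _category} <- entries do
    def name(unquote(char)), do: unquote(char_name)
  end

  def name(_other), do: nil

  # One category/1 clause per code point.
  for {char, _char_name, category} <- entries do
    def category(unquote(char)), do: unquote(category)
  end

  def category(_other), do: nil
end

IO.inspect(UnicodeSketch.name("É"))     # "LATIN CAPITAL LETTER E WITH ACUTE"
IO.inspect(UnicodeSketch.category("Ç")) # "Lu"
```

The comprehensions run in the module body, so the parsing cost is paid once at compile time, and the compiled module contains only plain function clauses.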

Let’s look at a cross-section of String.Unicode after its functions have been generated. It will help us understand how generating functions from data files opens up unique pattern-matching possibilities.

Elixir
defmodule String.Unicode do
  ...
  def upcase(string), do: do_upcase(string) |> IO.iodata_to_binary()
  ...
  # One generated clause per code point that has an uppercase mapping
  defp do_upcase("é" <> rest) do
    :binary.bin_to_list("É") ++ do_upcase(rest)
  end

  defp do_upcase("ć" <> rest) do
    :binary.bin_to_list("Ć") ++ do_upcase(rest)
  end

  defp do_upcase("ü" <> rest) do
    :binary.bin_to_list("Ü") ++ do_upcase(rest)
  end
  ...
  # Fallback: any other byte passes through unchanged. The left side of <>
  # must be a literal in a pattern, so a binary pattern is used here.
  defp do_upcase(<<char, rest::binary>>) do
    [char | do_upcase(rest)]
  end
  ...
end

The compiled module contains thousands of these definitions. When converting a string like “Thanks José!” to uppercase, String.Unicode calls do_upcase/1 recursively for each code point in the string. When “é” is encountered, the generated function for that code point is matched and returns the uppercase version.
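The clauses in the cross-section above do not have to be written by hand. The following sketch shows one way such clauses could be generated from data at compile time. It is an illustration, not the standard library's implementation: the module name UpcaseSketch and the three hand-written pairs in @pairs are assumptions standing in for mappings that would really come from UnicodeData.txt.

```elixir
defmodule UpcaseSketch do
  # Hypothetical sample pairs; a real implementation would derive
  # these from the data file instead of listing them by hand.
  @pairs [{"é", "É"}, {"ć", "Ć"}, {"ü", "Ü"}]

  def upcase(string), do: string |> do_upcase() |> IO.iodata_to_binary()

  # Generates one do_upcase/1 clause per pair while the module compiles.
  for {lower, upper} <- @pairs do
    defp do_upcase(unquote(lower) <> rest) do
      [unquote(upper) | do_upcase(rest)]
    end
  end

  # ASCII lowercase letters are shifted to their uppercase counterparts.
  defp do_upcase(<<char, rest::binary>>) when char in ?a..?z do
    [char - 32 | do_upcase(rest)]
  end

  # Any other valid UTF-8 code point passes through unchanged.
  defp do_upcase(<<char::utf8, rest::binary>>) do
    [<<char::utf8>> | do_upcase(rest)]
  end

  defp do_upcase(""), do: []
end

IO.puts(UpcaseSketch.upcase("Thanks José!")) # THANKS JOSÉ!
```

The sketch builds iodata with list cons instead of `++`, which avoids copying the accumulated list on every step. It does not handle invalid UTF-8 input; such a binary would raise a FunctionClauseError.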

It’s an extremely elegant solution to an otherwise difficult problem. Let’s break it down to see how the algorithm works.

Using the Erlang Virtual Machine’s pattern-matching engine, Elixir gains performant string manipulation with a fraction of the code that would otherwise have to be written by hand. The beauty of this technique is that new Unicode characters can be supported in the future by updating UnicodeData.txt and running mix compile.
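In a project of our own, the same "update the file and recompile" workflow can be sketched with the `@external_resource` module attribute, which marks a file as a compile-time dependency of the module so that Mix recompiles it when the file changes. This is a sketch under stated assumptions, not a description of how String.Unicode itself is built. The module name ExternalDataSketch, the function upcase_char/1, the file name, and its simplified `lower;upper` format are all hypothetical. The file is written to a temporary directory only so the example runs on its own.

```elixir
# Hypothetical data file, created here so the sketch is self-contained.
path = Path.join(System.tmp_dir!(), "case_pairs_sketch.txt")
File.write!(path, "é;É\nü;Ü\n")

defmodule ExternalDataSketch do
  # Tells the compiler that this module depends on the data file.
  @external_resource path

  # Read the file at compile time and emit one clause per line.
  for line <- File.stream!(path) do
    [lower, upper] = line |> String.trim() |> String.split(";")
    def upcase_char(unquote(lower)), do: unquote(upper)
  end

  # Characters missing from the data file are returned unchanged.
  def upcase_char(other), do: other
end

IO.puts(ExternalDataSketch.upcase_char("é")) # É
```

Adding a line to the data file and recompiling is all it takes to support another character; the module source never changes.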

Now that we’ve had a glimpse of how Elixir takes advantage of code generation from external data, we will apply this technique to our own MIME-type and internationalization libraries in the next lesson.