PostgreSQL knows how to deal with characters and text, and it implements several data types for that, all documented in the character types chapter of the documentation.

About the data type itself, it must be noted that text and varchar are the same things as far as PostgreSQL is concerned, and character varying is an alias for varchar. When using varchar(15), we’re basically telling PostgreSQL to manage a text column with a check constraint of 15 characters.

Yes, PostgreSQL knows how to count characters even with Unicode encoding.

Text processing in PostgreSQL

There’s a very rich set of PostgreSQL functions to process the text—we can find them all in the String Functions and Operators documentation chapter—with functions such as overlay(), substring(), position(), or trim(). Or aggregates such as string_agg(). There are also regular expression functions, including the very powerful regexp_split_to_table().

For more about PostgreSQL regular expressions, read the main documentation about them in the chapter on pattern matching.

Pattern matching

Additionally, to the classic like and ilike patterns and to the SQL standard similar to operators, PostgreSQL embeds support for a full-blown regular expression matching engine. The main operator implementing REGEXP is ~, and then we find the derivatives for not matching and match either case. In total, we have four operators: ~, !~, ~*, and !~*.

Regular expressions

The regular expression split functions are powerful in many use cases. In particular, they’re very helpful when we have to work with a messy schema in which a single column represents several bits of information in a pseudo-specified way. An example of such a dataset is available in open data: the Archives de la Planète or “planet archives.” The data is available as CSV and, once loaded, looks like the following, all in French (but it doesn’t matter very much for our purposes here):

─[ RECORD 1 ]──────────────────────────────────────────────
id          │ IF39599
inventory   │ A 2 037
orig_legend │ Serbie, Monastir Bitolj, Un Turc
legend      │ Un Turc
location    │ Monastir (actuelle Bitola), Macédoine
date        │ mai 1913
operator    │ Auguste Léon
themes      │ Habillement > Habillement traditionnel,Etres …
            │…humains > Homme,Etres humains > Portrait,Rela…
            │…tions internationales > Présence étrangère
collection  │ Archives de la Planète

Note: PostgreSQL also supports indexing for regular expressions thanks to its trigram extension: pg_trgm.

You can see that the themes column contains several categories for a single entry, separated by a comma. Within that comma-separated list, we find another classification, this time separated with a greater than sign, which looks like a hierarchical categorization of the themes.

The value IF39599 from the id column is relevant to that series of themes:

   id    │           cat1            │           cat2           
 IF39599 │ Habillement               │ Habillement traditionnel
 IF39599 │ Etres humains             │ Homme
 IF39599 │ Etres humains             │ Portrait
 IF39599 │ Relations internationales │ Présence étrangère
(4 rows)

The question is, how do we get that information? Also, is it possible to have an idea of the distribution of the whole dataset in relation to the categories embedded in the themes column?

With PostgreSQL, this is easy enough to achieve.

Pattern matching and regular expressions

Let’s use the techniques available in PostgreSQL to achieve the required output.

Using split in PostgreSQL

We’re going to split the themes column using a regular expression:

Get hands-on with 1200+ tech skills courses.