Character and Text
Understand PostgreSQL's character and text data types, including text and varchar equivalence. Learn to process text using built-in functions, manage delimiters with split_part, and apply powerful regular expressions. Explore practical examples, including splitting and normalizing complex text data, and get introduced to full-text search capabilities for advanced text querying.
PostgreSQL knows how to deal with characters and text, and it implements several data types for that, all documented in the character types chapter of the documentation.
About the data type itself, it must be noted that text and varchar are the same thing as far as PostgreSQL is concerned, and character varying is an alias for varchar. When using varchar(15), we’re basically telling PostgreSQL to manage a text column with a check constraint of 15 characters.
Yes, PostgreSQL knows how to count characters even with Unicode encoding.
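As a minimal sketch, assuming a database with UTF-8 encoding, we can contrast the character count with the byte count of a string containing accented characters:

```sql
-- char_length() counts characters; octet_length() counts bytes.
-- In UTF-8, 'é' takes two bytes, so the two results differ.
select char_length('été')  as characters,  -- 3
       octet_length('été') as bytes;       -- 5 in a UTF-8 encoded database
```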
Text processing in PostgreSQL
There’s a very rich set of PostgreSQL functions to process text—we can find them all in the String Functions and Operators documentation chapter—with functions such as overlay(), substring(), position(), or trim(), and aggregates such as string_agg(). There are also regular expression functions, including the very powerful regexp_split_to_table().
For more about PostgreSQL regular expressions, read the main documentation about them in the chapter on pattern matching.
Pattern matching
In addition to the classic like and ilike patterns and to the SQL standard similar to operator, PostgreSQL embeds support for a full-blown regular expression matching engine. The main regular expression operator is ~, and then we find its derivatives for not matching and for case-insensitive matching. In total, we have four operators: ~, !~, ~*, and !~*.
Regular expressions
The regular expression split functions are powerful in many use cases. In particular, they’re very helpful when we have to work with a messy schema in which a single column represents several bits of information in a pseudo-specified way. An example of such a dataset is available in open data: the Archives de la Planète or “planet archives.” The data is available as CSV and, once loaded, looks like the following, all in French (but it doesn’t matter very much for our purposes here):
─[ RECORD 1 ]──────────────────────────────────────────────
id │ IF39599
inventory │ A 2 037
orig_legend │ Serbie, Monastir Bitolj, Un Turc
legend │ Un Turc
location │ Monastir (actuelle Bitola), Macédoine
date │ mai 1913
operator │ Auguste Léon
...
themes │ Habillement > Habillement traditionnel,Etres …
│…humains > Homme,Etres humains > Portrait,Rela…
│…tions internationales > Présence étrangère
...
collection │ Archives de la Planète
...
Note: PostgreSQL also supports indexing for regular expressions thanks to its trigram extension: pg_trgm.
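As a sketch of what that looks like, assuming sufficient privileges to install the extension and using the opendata.archives_planete table and themes column from this dataset:

```sql
-- Install the trigram extension, then build a GIN index that can
-- speed up like/ilike and regular expression matches on the column.
create extension if not exists pg_trgm;

create index archives_planete_themes_trgm
    on opendata.archives_planete
 using gin (themes gin_trgm_ops);
```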
You can see that the themes column contains several categories for a single entry, separated by a comma. Within that comma-separated list, we find another classification, this time separated with a greater than sign, which looks like a hierarchical categorization of the themes.
The entry with the value IF39599 in the id column maps to this series of themes:
id │ cat1 │ cat2
═════════╪═══════════════════════════╪══════════════════════════
IF39599 │ Habillement │ Habillement traditionnel
IF39599 │ Etres humains │ Homme
IF39599 │ Etres humains │ Portrait
IF39599 │ Relations internationales │ Présence étrangère
(4 rows)
The question is, how do we get that information? Also, is it possible to have an idea of the distribution of the whole dataset in relation to the categories embedded in the themes column?
With PostgreSQL, this is easy enough to achieve.
Pattern matching and regular expressions
Let’s use the techniques available in PostgreSQL to achieve the required output.
Using split in PostgreSQL
We’re going to split the themes column using a regular expression:
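A sketch of that query, assuming the CSV was loaded into the opendata.archives_planete table referenced later in this lesson:

```sql
-- regexp_split_to_table() returns one row per comma-separated theme.
select id,
       regexp_split_to_table(themes, ',')
  from opendata.archives_planete
 where id = 'IF39599';
```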
We get the following table:
id │ regexp_split_to_table
═════════╪════════════════════════════════════════════════
IF39599 │ Habillement > Habillement traditionnel
IF39599 │ Etres humains > Homme
IF39599 │ Etres humains > Portrait
IF39599 │ Relations internationales > Présence étrangère
(4 rows)
We have a table with an entry per theme for the same document; we can further split each entry into a two-level category. We do that this time with regexp_split_to_array() so as to keep the categories together:
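A sketch of that step, splitting each theme on the ' > ' separator seen in the data (the inner subquery repeats the previous per-theme split):

```sql
-- The inner query produces one row per theme; the outer query splits
-- each theme into an array of categories, keeping them together.
select id,
       regexp_split_to_array(theme, ' > ') as categories
  from (select id,
               regexp_split_to_table(themes, ',') as theme
          from opendata.archives_planete
         where id = 'IF39599') as t;
```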
We have the following:
id │ categories
═════════╪════════════════════════════════════════════════════
IF39599 │ {Habillement,"Habillement traditionnel"}
IF39599 │ {"Etres humains",Homme}
IF39599 │ {"Etres humains",Portrait}
IF39599 │ {"Relations internationales","Présence étrangère"}
(4 rows)
Using split_part() in PostgreSQL
When working with delimited text, we may need only one part of the value instead of splitting it into multiple rows. PostgreSQL provides the split_part() function for this purpose. It returns the field at a given position after splitting a string using a specified delimiter. PostgreSQL defines it as split_part(string, delimiter, n), where fields are counted starting from 1. If the requested field does not exist, the function returns an empty string. PostgreSQL also allows negative positions, which count from the end of the string.
The syntax is shown below:
split_part(string, delimiter, n)
- string is the input text.
- delimiter is the character or substring used to split the text.
- n is the field number to return.
The example below extracts different parts from the same input value:
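This is the same statement used in the split-part.sql playground file at the end of this lesson:

```sql
select split_part('red,green,blue', ',', 2)  as second_part,   -- green
       split_part('red,green,blue', ',', 5)  as missing_part,  -- empty string
       split_part('red,green,blue', ',', -1) as last_part;     -- blue
```

Note that negative field positions require PostgreSQL 14 or later.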
In the query above:
- split_part('red,green,blue', ',', 2) returns green.
- split_part('red,green,blue', ',', 5) returns an empty string because that field does not exist.
- split_part('red,green,blue', ',', -1) returns blue, because negative positions count from the end.
Use split_part() when the delimiter is known and you need only one field from the input text. This makes it a simple option for extracting values such as the first name from a full name, the domain from an email address, or a segment from a structured identifier. If the goal is to split a string into multiple rows, functions such as regexp_split_to_table() are more suitable.
Normalize the output
We’re almost there. For the content to be normalized, we want to have the categories in their own separate columns—say, category and subcategory:
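A sketch of that query, indexing into the array built in the previous step (PostgreSQL arrays are 1-based):

```sql
-- categories[1] is the top-level category, categories[2] the subcategory.
select id,
       categories[1] as category,
       categories[2] as subcategory
  from (select id,
               regexp_split_to_array(theme, ' > ') as categories
          from (select id,
                       regexp_split_to_table(themes, ',') as theme
                  from opendata.archives_planete
                 where id = 'IF39599') as t1) as t2;
```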
Let’s make sense of the open data:
id │ category │ subcategory
═════════╪═══════════════════════════╪══════════════════════════
IF39599 │ Habillement │ Habillement traditionnel
IF39599 │ Etres humains │ Homme
IF39599 │ Etres humains │ Portrait
IF39599 │ Relations internationales │ Présence étrangère
(4 rows)
As a side note, cleaning up a dataset after we’ve imported it into PostgreSQL makes clear the difference between classic ETL jobs (extract, transform, load) and the powerful ELT jobs (extract, load, transform), where we can transform our data using a data processing language: SQL.
So, now that we know how to have a clear view of the dataset, let’s inquire about the categories used in our dataset:
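A sketch of that roll-up query over the whole dataset, assuming the same opendata.archives_planete table (the ¤ rows in the output are how psql displays the NULLs produced by the rollup subtotals):

```sql
-- rollup(a, b) produces per-(category, subcategory) counts,
-- per-category subtotals, and a grand total.
select categories[1] as category,
       categories[2] as subcategory,
       count(*)
  from (select id,
               regexp_split_to_array(theme, ' > ') as categories
          from (select id,
                       regexp_split_to_table(themes, ',') as theme
                  from opendata.archives_planete) as t1) as t2
 group by rollup(categories[1], categories[2])
 order by category nulls last, subcategory nulls last;
```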
That query returns 175 rows, so here’s only the extract:
category │ subcategory │ count
════════════════════════╪══════════════════════════════╪═══════
Activite économique │ Agriculture / élevage │ 138
Activite économique │ Artisanat │ 81
Activite économique │ Banque / finances │ 2
Activite économique │ Boutique / magasin │ 39
Activite économique │ Commerce ambulant │ 5
Activite économique │ Commerce extérieur │ 1
Activite économique │ Cueillette / chasse │ 9
...
Art │ Peinture │ 15
Art │ Renaissance │ 52
Art │ Sculpture │ 87
Art │ Théâtre │ 7
Art │ ¤ │ 333
...
Habillement │ Uniforme scolaire │ 1
Habillement │ Vêtement de travail │ 3
Habillement │ ¤ │ 163
Habitat / Architecture │ Architecture civile publique │ 37
Habitat / Architecture │ Architecture commerciale │ 24
Habitat / Architecture │ Architecture de jardin │ 31
...
Vie quotidienne │ Vie domestique │ 8
Vie quotidienne │ Vie rurale │ 5
Vie quotidienne │ ¤ │ 64
¤ │ ¤ │ 4449
(175 rows)
Each subcategory always appears within the same category, and we’ve chosen to do a roll-up analysis of our dataset here. Other grouping sets are available, such as cube, or we can manually spell out the dimensions we’re interested in.
In an ELT assignment, we would create a new categories table containing each entry we saw in the roll-up query only once, as a catalog, and an association table between the main opendata.archives_planete table and this categories catalog, where each archive entry might have several categories and subcategories assigned, and each category, of course, might have several archive entries assigned.
Here, the topic is about text function processing in PostgreSQL, so we just run the query against the base dataset.
Full-text search in PostgreSQL
When mentioning advanced string matching and regular expressions, we must also mention PostgreSQL’s implementation of full-text search, with support for documents, advanced text search queries, ranking, highlighting, pluggable parsers, dictionaries and stemmers, synonyms, and thesauruses. Additionally, it’s possible to configure all those pieces. For further information about advanced searching of the documents we manage in PostgreSQL, please refer to the documentation.
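As a minimal sketch of the building blocks, a document is parsed into a tsvector and matched against a tsquery with the @@ operator; the words used here are illustrative:

```sql
-- to_tsvector() stems and indexes the words of the document;
-- to_tsquery() builds a search query ('&' means AND).
select to_tsvector('english', 'PostgreSQL implements full-text search')
       @@ to_tsquery('english', 'implement & search') as matches;  -- true
```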
Try it yourself
We’ll perform five tasks in the following playground.
- The query to split a table using regexp is given in the regexp-split.sql file. Execute it using the following command:
- The query to split the string in a table using regexp is given in the regexp-split-array.sql file. Execute it using the following command:
- The query to split a column using regexp is given in the regexp-split-column.sql file. Execute it using the following command:
- The query to retrieve the categories available in the archives_planete data set is given in the regexp-rollup.sql file. Execute it using the following command:
- The query to split a string using split_part() is given in the split-part.sql file. Execute it using the following command:
Click the “Run” button to start:
select
split_part('red,green,blue', ',', 2) as second_part,
split_part('red,green,blue', ',', 5) as missing_part,
split_part('red,green,blue', ',', -1) as last_part;
Tip: Press the up and down arrow keys to move between the rows and q to quit. Type exit or \q to log out from the database.