Loading the Data

Get an introduction to another dataset and load the related data.

We’re going to look at the LastFm dataset, the official song tag, and the song similarity dataset of the Million Song Dataset (MSD). Quoting from the link above:

“The MSD team is proud to partner with Last.fm in order to bring you the largest research collection of song-level tags and precomputed song-level similarity. All the data is associated with MSD tracks, which makes it easy to link it to other MSD resources: audio features, artist data, lyrics, etc.”

Importing the dataset

First, we need to import this dataset into a PostgreSQL database. The dataset is offered both as a SQLite database and a JSON file. Loading the SQLite database is easy thanks to pgloader:

$ curl -L -o /tmp/lastfm_tags.db
       http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_tags.db

$ pgloader /tmp/lastfm_tags.db pgsql://appdev@localhost/appdev

We get the following output, meaning the data is now available in our PostgreSQL database for further indexing:

             table name     errors       read   imported      bytes      total time
-----------------------  ---------  ---------  ---------  ---------  --------------
                  fetch          0          0          0                     0.000s
        fetch meta data          0          8          8                     0.028s
         Create Schemas          0          0          0                     0.000s
       Create SQL Types          0          0          0                     0.006s
          Create tables          0          6          6                     0.031s
         Set Table OIDs          0          3          3                     0.009s
-----------------------  ---------  ---------  ---------  ---------  --------------
                   tids          0     505216     505216     9.2 MB          1.893s
                   tags          0     522366     522366     8.6 MB          1.781s
                tid_tag          0    8598630    8598630   135.7 MB         32.614s
-----------------------  ---------  ---------  ---------  ---------  --------------
COPY Threads Completion          0          4          4                    34.366s
         Create Indexes          0          5          5                  2m14.346s
 Index Build Completion          0          5          5                    36.976s
        Reset Sequences          0          0          0                     0.054s
           Primary Keys          0          0          0                     0.000s
    Create Foreign Keys          0          0          0                     0.000s
        Create Triggers          0          0          0                     0.001s
       Install Comments          0          0          0                     0.000s
-----------------------  ---------  ---------  ---------  ---------  --------------
      Total import time          ✓    9626212    9626212   153.4 MB       3m25.743s

Here, pgloader extracted the table and index definitions from the SQLite database using the sqlite_master catalog and the PRAGMA table_info() commands, and it migrated the data in a streaming fashion to PostgreSQL using the COPY protocol.

Querying the data

Having a look at the demo_tags.py script from the Last.fm project, we can see how to use relations here, and we realize they are using the 64-bit signed integer ROWID system column. We need something comparable to be able to make sense of the data:

Get hands-on with 1200+ tech skills courses.