Unirank: Data Preview (head, pipe and csvlook)

Ranking universities has become a common task for many institutions, each of which proposes a different ranking based on several weighted categories. Examples of those rankings are: Webometrics Ranking of World Universities, THES - QS World Universities Rankings and Academic Ranking of World Universities. The first measures the visibility of universities and their global performance on the web. The last two attempt to measure the performance of universities based on categories such as prizes received by members, citations, and publications. Employers, especially multinational organisations, use rankings to find universities from which to source graduates, so attending a high-ranking university can help in a competitive job market.

In this lesson we will use a simple, publicly available dataset obtained from data.world called US News Universities Rankings 2017 edition. Using Bash, we will explore different features of this data and finally find an interesting fact about the correlation between tuition fees and university rank. The dataset contains the following fields:

  • Name - institution name
  • Location - City, State where located
  • Rank - read methodology here.
  • Description - a snippet of text overview from U.S. News.
  • Tuition and fees - combined tuition and fees.
  • Undergrad Enrollment - number of enrolled undergraduate students

In each project described in this book, we will attempt to learn a few Bash commands and tricks.

Learning objectives

By completing this lesson, you will learn to use the following Bash commands:

  • head – output the first part of files
  • tail – output the last part of files (the opposite of head)
  • cat – concatenate and print files
  • sort – sort file contents
  • uniq – remove adjacent duplicate lines (typically used after sort)

Data download

Download the data from here (we have slightly simplified it) and save it as unirank.csv. I believe this course would be incomplete without video demos of the commands shown here, so I will add a video demo to each lesson. Watch and enjoy it before you proceed to the next part:

Video Lecture (Unirank.csv dataset introduction and data preview)

Dataset preview (use the ‘head’ command)

This is a small (toy) dataset, and we could in principle open it in a text editor or in Excel. However, real-world datasets are often larger and cumbersome to open in their entirety. Let’s pretend it is Big Data (and unstructured) and that we want a sneak peek at it. Previewing is often the first thing to do when you get your hands on new data: it is important to get a sense of what the data contains, how it is organized, and whether it makes sense in the first place.

To get a preview of the data, we can use the head command which, as the name suggests, shows the first few lines of a file (the unirank.csv dataset has already been stored in the course storage space, thanks to the educative.io team). Simply press “Run” and see the output!

#!/bin/bash
head unirank.csv

However, you will find that the raw output is not very readable, so we install csvkit, a suite of command-line tools for converting to and working with CSV (install: sudo pip install csvkit).

This will greatly help our future analyses. After installing csvkit, we re-run the head command, but this time pipe its output through the csvlook command from the csvkit suite. The pipe (|) feeds the output of the first command into the input of the next; we’ll learn more about it soon:

#!/bin/bash
head unirank.csv | csvlook
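To make the idea of a pipe more concrete, here is a minimal sketch that does not depend on the course files. The file sample.csv, its contents and the variable names are made up for illustration; only head, cut, cat and command are standard tools. The last stage uses csvlook only if it is installed and falls back to cat otherwise.

```bash
#!/bin/bash
# Hypothetical sample data (NOT the real unirank.csv), written to a temp dir.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.csv"
printf '%s\n' 'Name,Rank' 'Alpha University,1' 'Beta College,2' 'Gamma Institute,3' > "$sample"

# Two-stage pipeline: head emits the first 2 lines, cut keeps only column 1.
first_names=$(head -n 2 "$sample" | cut -d, -f1)
echo "$first_names"

# Choose the last stage: csvlook if it is installed, plain cat otherwise.
if command -v csvlook >/dev/null 2>&1; then
  pretty=csvlook
else
  pretty=cat
fi

# Three-stage pipeline: first 3 lines -> first column -> table (or raw text).
preview=$(head -n 3 "$sample" | cut -d, -f1 | "$pretty")
echo "$preview"

# Remove the temporary sample file.
rm -rf "$tmpdir"
```

Each | passes the text printed by the command on its left to the command on its right, so stages can be added or removed freely. With the real file, head unirank.csv | csvlook from above works the same way.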

You should see the first 10 lines of the file printed on the screen. To see more than the first 10 lines, e.g. the first 25, use the -n option:

#!/bin/bash
head -n 25 unirank.csv | csvlook

Here, the dataset name unirank.csv is a command-line argument given to the head command, and -n is an option that lets us override the 10-line default. Such command-line options are typically specified with a dash followed by a letter, a space, and the value of the option (e.g. -n 25). However, many options don’t require a value and instead toggle a feature on or off; for example, top -h shows the help page for the top command, which displays the running processes.
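The difference between an option that takes a value and one that only toggles a feature can be tried out with a small sketch. The file sample.csv and its rows are invented for this example and are not part of the course data; head -n and sort -r are standard options.

```bash
#!/bin/bash
# Hypothetical sample data (NOT the real unirank.csv), written to a temp dir.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.csv"
printf '%s\n' 'Name,Rank' 'Alpha University,1' 'Beta College,2' 'Gamma Institute,3' > "$sample"

# Option WITH a value: -n 2 tells head to print 2 lines instead of the default 10.
first_two=$(head -n 2 "$sample")
echo "$first_two"

# Option WITHOUT a value: -r just switches sort to reverse order.
# Note that sort treats the header like any other line.
reversed=$(sort -r "$sample")
echo "$reversed"

# Remove the temporary sample file.
rm -rf "$tmpdir"
```

In both calls the file name is the argument; only the options change how the command behaves.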

From the first 25 lines of the file, we can infer that the data is formatted as a file with delimiter-separated values (comma-separated, as the .csv extension suggests). From the first line (often called a header line) and the first few lines of data, we can infer the column contents: Name, City, State, Tuition and fees, Undergrad Enrollment and Rank.
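Reading the column names off the header line can also be done on the command line. The following is a minimal sketch on an invented sample.csv (not the real dataset); it assumes that no column name contains a comma inside quotes, because tr splits on every comma.

```bash
#!/bin/bash
# Hypothetical sample data (NOT the real unirank.csv), written to a temp dir.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.csv"
printf '%s\n' 'Name,City,State,Rank' 'Alpha University,Springfield,IL,1' > "$sample"

# Keep only the header line, then turn every comma into a newline
# so that each column name ends up on its own line.
columns=$(head -n 1 "$sample" | tr ',' '\n')
echo "$columns"

# Remove the temporary sample file.
rm -rf "$tmpdir"
```

If csvkit is installed, csvcut -n reports similar information and also handles quoted column names.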

Do you want to know more?

Read the attached man pages:

'head' man page
'csvlook' man page
'pipe' tldp man page