Unirank: Data Preview (head, pipe and csvlook)

Ranking universities has become a common task performed by many institutions, each of which proposes a different ranking based on several weighted categories. Examples of these rankings are the Webometrics Ranking of World Universities, the THES - QS World Universities Rankings and the Academic Ranking of World Universities. The first measures the visibility of universities and their global performance on the web. The last two attempt to measure the performance of universities based on categories such as prizes received by members, citations, and publications. Employers, especially multinational organisations, use rankings to find universities from which to source graduates, so attending a high-ranking university can help in a competitive job market.

In this lesson we will use a simple, publicly available dataset from data.world called US News Universities Rankings 2017 edition. Using Bash, we will explore different features of this data and finally find an interesting fact about the correlation between tuition fees and university rank. This simple dataset contains the following fields:

Name - institution name
Location - City, State where located
Rank - read methodology here.
Description - a snippet of text overview from U.S. News.
Tuition and fees - combined tuition and fees.
Undergrad Enrollment - number of enrolled undergraduate students

In each project described in this book, we will attempt to learn a few Bash commands and tricks.

Learning objectives

By completing this lesson, you will learn to use the following Bash commands:

head – output the first part of files
tail – output the last part of files
cat – concatenate and print files
sort – sort file contents
uniq – remove adjacent duplicate lines (usually used after sort)
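
As a rough illustration of what these commands do, here is a minimal sketch on a tiny stand-in file. The file name states.txt and its contents are placeholders invented for this example; they are not taken from the course dataset.

```shell
# Tiny stand-in file with repeated lines (placeholder content, not course data).
printf '%s\n' NJ MA NJ CT MA > states.txt

cat states.txt            # print the whole file
tail -n 2 states.txt      # print only the last two lines
sort states.txt           # print the lines in sorted order
sort states.txt | uniq    # sorting first makes duplicates adjacent, so uniq can drop them
```

Note that uniq on its own only collapses repeated lines that sit next to each other, which is why it is normally paired with sort.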

Data download

Download the data from here (we have slightly simplified it) and save it as unirank.csv. I believe this course would be incomplete without video demos of the commands shown here, so I will add a video demo to each lesson. Watch and enjoy it before you proceed to the next part.

Dataset preview (use the ‘head’ command)

This dataset is small (a toy dataset), and we could in principle open it in a text editor or in Excel. However, real-world datasets are often larger and cumbersome to open in their entirety. Let’s treat it as if it were Big Data (and unstructured) and assume we want a sneak peek of it. Previewing is often the first thing to do when you get your hands on new data; it is important to get a sense of what it contains, how it is organized, and whether the data makes sense in the first place.

To preview the data, we can use the command head, which, as the name suggests, shows the first few lines of a file (10 by default). The unirank.csv dataset has already been stored in the course storage space, thanks to the educative.io team. Simply press “Run” and see the output!
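
Outside the interactive environment, the same preview can be sketched as follows. Because the real unirank.csv may not be at hand, the snippet first writes a small stand-in file; the name sample_unirank.csv and every row in it are placeholders made up for illustration, not real ranking data.

```shell
# Build a stand-in CSV with the lesson's column names.
# The rows are invented placeholders, NOT real ranking data.
sample=sample_unirank.csv
echo "Name,Location,Rank,Description,Tuition and fees,Undergrad Enrollment" > "$sample"
for i in $(seq 1 15); do
  echo "University $i,\"City $i, ST\",$i,Placeholder description,$((30000 + i * 100)),$((1000 * i))" >> "$sample"
done

# With no options, head prints the first 10 lines: the header plus 9 rows.
head "$sample"
```

With the actual dataset in the current directory, the command would simply be head unirank.csv.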

However, you will find that the raw output is not very readable at first, so we install csvkit, a suite of command-line tools for converting to and working with CSV (install: sudo pip install csvkit).

This will greatly help our future analyses. After installing csvkit, we re-run the head command, but with its output piped through the csvlook command from the csvkit suite. The pipe (|) chains the output of the first command to the input of the next; we’ll learn more about it soon.
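
A minimal sketch of such a pipeline is shown below. It again uses a small stand-in file (sample_unirank.csv, with invented placeholder rows and only three of the columns) and checks whether csvlook is available before calling it, since csvkit is a separate install.

```shell
# Stand-in CSV with a few placeholder rows (not real ranking data).
sample=sample_unirank.csv
printf '%s\n' \
  'Name,Rank,Tuition and fees' \
  'University A,1,30100' \
  'University B,2,30200' \
  'University C,3,30300' \
  'University D,4,30400' > "$sample"

# The pipe hands head's output to the next command as its input.
# Here wc -l simply counts the lines that head passed along.
head -n 3 "$sample" | wc -l

# csvlook renders CSV as an aligned table; it is only present if csvkit is installed.
if command -v csvlook >/dev/null 2>&1; then
  head -n 3 "$sample" | csvlook
else
  echo "csvlook not found; install csvkit first"
fi
```

With the real file, the equivalent would be something like head unirank.csv | csvlook.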

Here, the dataset name unirank.csv is a command-line argument given to the command head, and -n is an option that lets us override the 10-line default. Such command-line options are typically specified with a dash followed by a letter or string, a space, and the value of the option (e.g. -n 25). However, options often don’t require a value and instead toggle a feature on or off; for example, top -h shows the help page for the command top, which displays the running processes.
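
The difference between an argument, an option with a value, and an option without one can be sketched on a throwaway file. The name sample_lines.txt and its 40 numbered lines are placeholders for this example only.

```shell
# Stand-in input: 40 numbered lines (placeholder content, not the real dataset).
seq 1 40 | sed 's/^/row /' > sample_lines.txt

# sample_lines.txt is the argument; with no options head prints 10 lines.
head sample_lines.txt | wc -l

# -n takes a value and overrides the 10-line default.
head -n 25 sample_lines.txt | wc -l

# -l on wc is an option without a value: it switches on line counting only.
head -n 3 sample_lines.txt
```

The first two commands print line counts rather than the lines themselves, which makes the effect of -n easy to see.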
