Reading The Data and Getting Top Rated movies

Learn how to read plaintext and merge data over multiple files.

We'll cover the following

The data in the movie lens database will be in plaintext separated by |, it will not have column headers like Excel files. This means that we’ll have to provide our own column headers. pandas has a function that reads plaintext files, which is similar to NumPy’s read file functions that we covered in a previous lesson.

Let’s look at the code.

user_columns = ['user_id', 'age', 'sex']

We’ll declare user_columns for the three entries we want to read from u.user.

Note: u.user has more than three entries per row, but we’ll ignore the others.

users = pd.read_csv('u.user', sep='|', names=user_columns, usecols=range(3))

We’ll use the read_csv() function to read the file, even though it isn’t technically a CSV (Comma Separated Values) file.

We have four arguments:

  1. File name.

  2. Our values are separated by |.

  3. The third argument merely passes in our user_columns as column names.

  4. The final argument says that we only need to read three values per row. Without this, pandas will try to read the whole line and cause a mess.

In exactly the same way, we’ll load the other values as well:

rating_columns = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('u.data', sep='\t', names=rating_columns, usecols=range(3))
movie_columns = ['movie_id', 'title']
movies = pd.read_csv('u.item', sep='|', names=movie_columns, usecols=range(2))

Now comes the critical part. The data in the movie_lens dataset is spread over multiple files. But that is not good for us. We need to merge it so that we can analyze it in one go. If we have used SQL, we’ll know it has a JOIN function to join tables. pandas has something similar.

movie_ratings = pd.merge(movies, ratings)
movie_data = pd.merge(movie_ratings, users)

The merge() function will do the same thing. It will take two different DataFrames and merge them. We can specify an index to merge them on, but pandas is smart enough to find the standard index and merge them there.

So, for the first merge:

movie_ratings = pd.merge(movies, ratings)

pandas will see that the movie_id is common between movies and ratings, so it will merge on that. We’ll then merge this newly created movie_ratings with users:

movie_data = pd.merge(movie_ratings, users)

Again, pandas will figure out what user_id is common and merge around that.

The final result is a dataset containing all the info that we need to start working on it.

Printing top rated-movies

The first thing that we’re going to do is find movies that have the most ratings.

## Top rated movies
print("Top rated movies (overall):\n", movie_data.groupby('title').size().sort_values(ascending=False)[:20]

The code above may look intimidating, but this is an actual feature of Python. It essentially allows us to chain multiple commands. Let’s break it down.

movie_data.groupby('title')

The groupby() function allows us to group the data by a chosen column.

Note: Remember, that the data is normally printed by index. Here, we’ll arrange the data by title, not index.

movie_data.groupby('title').size()

After we’ve ordered the movies by title, we’ll call the size() function to arrange them by size. Normally, this is in ascending order, so:

movie_data.groupby('title').size().sort_values(ascending=False)[:20]

We’ll call the sort_values(ascending=False) function, which arranges the data in descending order (so movies with more ratings appear at the top). Finally, we’ll use [:20]_ to only display the top 20 movies.

All of the above code has been added in the code widget below.

Get hands-on with 1200+ tech skills courses.