Creating and Updating Columns
Explore the “one true way” to create and update columns in pandas.
Loading the data
We’ll be looking at a dataset of Python users from
This is a pretty good dataset. It has over 50,000 rows and 264 columns. However, we’ll need to clean it up to perform exploratory analysis.
Some of the columns have a dummy-like encoding. For example, each column whose name starts with “database.” ends with a database name, and the values in that column contain the same database name. Since a user might use multiple databases, this is a mechanism for encoding multiple answers. However, it also creates many columns, one per database. To keep the data manageable, we’re going to filter out columns like the database columns.
Below is code that determines whether a feature can have multiple values (like database) and removes those:
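As a minimal sketch of this kind of filter, the snippet below groups column names by everything before their final period and keeps only the columns whose prefix occurs once. The tiny DataFrame, its column names, and the variable names `groups`, `uniq_cols`, and `slim` are made up for illustration. The rule for what counts as a shared prefix is also an assumption and may differ from the one used on the real survey data.

```python
import collections

import pandas as pd

# Hypothetical stand-in for the survey data (not the real dataset).
df = pd.DataFrame({
    "age": ["21-29", "30-39", None],
    "are.you.datascientist": ["Yes", "No", "Yes"],
    "database.MySQL": ["MySQL", None, "MySQL"],
    "database.PostgreSQL": [None, "PostgreSQL", "PostgreSQL"],
})

# Group columns by the part of the name before the final period.
groups = collections.defaultdict(list)
for col in df.columns:
    prefix = col.rsplit(".", 1)[0]
    groups[prefix].append(col)

# A prefix shared by several columns marks a multi-valued feature,
# so keep only the columns whose prefix appears exactly once.
uniq_cols = [cols[0] for cols in groups.values() if len(cols) == 1]
slim = df[uniq_cols]
print(slim.columns.tolist())
```

Here the two “database.” columns share a prefix and are dropped, while the remaining columns survive.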
Note that these column names have a period in them. We’re going to replace each period with an underscore so that we can access the columns as attributes (using dot notation).
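One way to do the renaming, sketched on a small made-up frame, is to pass a function to `rename`. The column names and the variable `renamed` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with periods in the column names.
df = pd.DataFrame({
    "age": ["21-29", "30-39"],
    "are.you.datascientist": ["Yes", "No"],
})

# rename calls the function on each column name and returns a new frame.
renamed = df.rename(columns=lambda name: name.replace(".", "_"))
print(renamed.columns.tolist())

# With underscores, attribute access works.
print(renamed.are_you_datascientist)
```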
Let’s look at the “age” column:
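A quick way to inspect a column like this is to count its distinct values, including the missing ones. The values below are made up to resemble age ranges and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical age ranges with one missing value.
df = pd.DataFrame({"age": ["21-29", "30-39", "21-29", None, "18-20"]})

# dropna=False keeps the missing entries in the tally.
counts = df.age.value_counts(dropna=False)
print(counts)
```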
We’re going to pull out the first two characters from each value in the “age” column and convert them to numbers. We’ll have to convert to float because there are missing values in the column, and a missing value is represented as NaN, which is a float:
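A minimal sketch of that conversion, assuming age values shaped like the made-up ranges below:

```python
import pandas as pd

# Hypothetical age ranges with one missing value.
age = pd.Series(["21-29", "30-39", None, "18-20"], name="age")

# Keep the first two characters, then convert to float.
# The missing entry stays missing and becomes NaN.
age_num = age.str.slice(0, 2).astype(float)
print(age_num)
```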
We can also write .str.slice(0,2) as .str[0:2].
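The two spellings can be compared side by side on a small made-up series:

```python
import pandas as pd

# Hypothetical age ranges with one missing value.
age = pd.Series(["21-29", "30-39", None, "18-20"], name="age")

sliced = age.str.slice(0, 2)   # method form
indexed = age.str[0:2]         # slice-syntax form

# equals treats missing values in the same positions as equal.
print(sliced.equals(indexed))
```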
Note: Currently, pandas can’t convert strings directly to Int64...
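Whether a direct cast from strings to the nullable integer type works may depend on the pandas version. As a sketch of one route that avoids the question, the values can go through float first and then to `Int64`. The series below is made up for illustration.

```python
import pandas as pd

# Hypothetical age ranges with one missing value.
age = pd.Series(["21-29", "30-39", None, "18-20"], name="age")

# Strings -> float (missing becomes NaN) -> nullable Int64 (NaN becomes <NA>).
age_int = age.str.slice(0, 2).astype(float).astype("Int64")
print(age_int)
```

The result holds whole numbers and still marks the missing entry as missing.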