Concatenation

pandas provides some useful utilities to combine data from multiple files. Let's see how to use them.

append() with same column

In real life, data comes from different sources, combining them together into a single DataFrame is very useful.

Let’s begin with a very simple scenario.

There are two DataFrame structures with the same column names. Your job is to combine these two DataFrames into one. Before trying it in pandas, do you remember how to do it using the native Python list? We would use append() to add items to a Python list. In pandas, the function is also append(). It concatenates the two DataFrame objects along axis=0, namely the index or column.

Notice:

  • In the example, the function only accepts one DataFrame. However, multiple DataFrames can be passed as a list.
  • The index can be duplicated.
  • ignore_index=True could be passed if the index of this DataFrame is meaningless.
append DataFrame
append DataFrame
import pandas as pd
d1 = {"a": [1, 2], "b": [2, 4]}
df1 = pd.DataFrame(d1)
print("The first DataFrame")
print(df1)
print("------------------------")
d2 = {"a": [3, 4], "b": [6, 8]}
df2 = pd.DataFrame(d2)
print("The second DataFrame")
print(df2)
df2 = pd.DataFrame(d2)
print("------------------------")
df1 = df1.append(df2)
print("Append the second DataFrame to the first one")
print(df1)
print("Append the second DataFrame to the first one and set ignore_index=True")
df1 = df1.append(df2, ignore_index=True)
print(df1)

Two DataFrame objects are created from line 3 to line 12.

Line 15 shows the basic usage of append. From the output of line 16, you may notice that both DataFrame objects keep the original indexes. You can see the duplicated index.

ignore_index=True is passed to append on Line 20, which would remove all indexex of original DataFrame objects.

append() with different column

Not all DataFrames have the same column names. If two DataFrames have different column names, what would happen if append() is called? These DataFrames would be joined where the column names are the same and the remaining positions would be filled by NaN.

Append with different columns
Append with different columns
import pandas as pd
d1 = {"b": [1, 2], "c": [2, 4]}
df1 = pd.DataFrame(d1)
print("The first DataFrame")
print(df1)
print("------------------------")
d2 = {"c": [3, 4], "d": [6, 8]}
df2 = pd.DataFrame(d2)
print("The second DataFrame")
print(df2)
df2 = pd.DataFrame(d2)
print("------------------------")
df1 = df1.append(df2)
print("Append the second DataFrame to the first one")
print(df1)

Two DataFrame objects are created from line 3 to line 12. Then df2 is appended to df1 on line 15.

From the output from line 17, you would find that the columns of these two objects are different.

concat with outer join

concat() is a much more powerful function to join DataFrames, which can combine DataFrame objects along both rows and columns, unlike append which does it only along the rows.

Its default behavior is to join DataFrames and columns, just like append(). axis=1 can be passed to join DataFrames along with rows.

join is a parameter that specifies how to handle indexes on another axis. "outer" for union and "inner" for the intersection.

Below is an example of an outer join. The first DataFrame has an index [1, 2], the second DataFrame has an index [2, 3], they have one overlapping index, 2. The rows with index=2 in both DataFrames would join. Meanwhile, other rows would be kept separate. The empty location would be filled by NaN. After concat, the total row number is 3.

Concat two DataFrame objects with axis=1 and outer join
Concat two DataFrame objects with axis=1 and outer join
import pandas as pd
d1 = {"a": [1, 2], "b": [2, 4]}
df1 = pd.DataFrame(d1, index=[1, 2])
print("The first DataFrame")
print(df1)
print("------------------------")
d2 = {"c": [3, 4], "d": [6, 8]}
df2 = pd.DataFrame(d2, index=[2, 3])
print("The second DataFrame")
print(df2)
print("------------------------")
print("The outer join DataFrame")
df3 = pd.concat([df1, df2], axis=1)
print(df3)

Two DataFrame objects are created from line 3 to line 12. On line 14, we see how to concat these two objects using outer join.

concat with inner join

inner join handles the index for intersection. In the example below, these two DataFrames share only one common index, 2. So, the final DataFrame has only one row.

Concat two DataFrame objects with inner join
Concat two DataFrame objects with inner join
import pandas as pd
d1 = {"a": [1, 2], "b": [2, 4]}
df1 = pd.DataFrame(d1, index=[1, 2])
print("The first DataFrame")
print(df1)
print("------------------------")
d2 = {"c": [3, 4], "d": [6, 8]}
df2 = pd.DataFrame(d2, index=[2, 3])
print("The second DataFrame")
print(df2)
print("------------------------")
print("The inner join DataFrame")
df3 = pd.concat([df1, df2], axis=1, join="inner")
print(df3)

Two DataFrame objects are created from line 3 to line 12. Then line 14 shows how to concat these two objects with inner join.

Comparison with SQL join

If you are familiar with SQL, you can check the comparison table below.

concat join type SQL
inner INNER JOIN
outer FULL OUTER JOIN
right RIGHT OUTER JOIN
left LEFT OUTER JOIN