Importing Geospatial Data

Learn how to import geospatial data into a GeoDataFrame.

Overview

The GeoDataFrame is the data structure provided by GeoPandas to store tabular and geographic data in a unified schema. It consists of a tabular data structure (a pandas DataFrame) with one or more columns of geometry data types, typically stored as a GeoSeries. The geometry column is what distinguishes a GeoDataFrame from a standard pandas DataFrame and enables the geospatial capabilities.
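To make that structure concrete, here is a minimal sketch that builds a GeoDataFrame by hand instead of loading a file. The column names, labels, and coordinates are made-up placeholders and do not come from any dataset used in this lesson.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical records: one attribute column plus one geometry column
sites = gpd.GeoDataFrame(
    {"label": ["site_a", "site_b"]},
    geometry=[Point(0.0, 0.0), Point(1.0, 1.0)],
    crs="EPSG:4326",  # WGS 84 longitude/latitude
)

# A GeoDataFrame is still a pandas DataFrame
print(isinstance(sites, pd.DataFrame))

# The active geometry column is a GeoSeries
print(type(sites.geometry).__name__)
print(sites)
```

Everything we know from pandas (indexing, filtering, grouping) still applies to the attribute columns, while the geometry column carries the spatial information.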

In this lesson, we'll learn how to load some data into a GeoDataFrame and see its internal structure.

The datasets

For this first example, we are going to use one of the datasets (Natural Earth) that come with the geodatasets library. The geodatasets library provides several toy datasets for GIS processing:

  • Natural Earth lowres: A low-resolution representation of the world's countries (polygons).

  • Natural Earth cities: A sample with the center points of 243 major cities in the world (points).

  • New York Borough Boundaries: A high-resolution representation of the 5 boroughs of New York City (Bronx, Manhattan, Queens, Brooklyn, Staten Island).

The full list of datasets and their descriptions can be seen in the table produced by the following code:

import geodatasets
import pandas as pd
datasets = pd.DataFrame(geodatasets.data.flatten()).T
print(datasets[['name', 'geometry_type', 'description']].to_html())

To retrieve a dataset and save it locally, we can use the get_path function. Because the storage location varies with the installation and environment, get_path returns the full path where the dataset is stored locally. In the following code snippet, we retrieve two sample datasets (Natural Earth land and New York Borough Boundaries) and print their local paths:

import geodatasets
natural_earth = geodatasets.get_path('naturalearth.land')
print(f'Natural Earth location: {natural_earth}\n')
nybb = geodatasets.get_path('ny.bb')
print(f'New York Borough Boundaries location: {nybb}')

As we can see, a dataset path can point to the main file (e.g., .shp) or to a compressed (ZIP) archive. In the latter case, GeoPandas will uncompress it and load the dataset automatically. If the .zip file contains multiple datasets or folders, we can specify the folder and the filename by appending !folder/filename to the path, like so:

zipfile = 'zip:///local_path/zippedfile.zip!folder/filename'

The .read_file() method

In GeoPandas, we primarily use .read_file() to read geospatial data. This method can read various geospatial data file formats, including shapefiles (.shp), GeoJSON files (.geojson), and many others.

The .read_file() method reads the geospatial data file into a GeoDataFrame object, which is a specialized pandas DataFrame object that can store and manipulate geospatial data. The method automatically detects the file format and reads it accordingly.

For example, to read a shapefile into a GeoDataFrame using .read_file(), we can use the following code:

import geopandas as gpd
# read the shapefile
gdf = gpd.read_file('path/to/shapefile.shp')

The GeoDataFrame structure

Let's open the Natural Earth dataset and analyze its structure:

import geopandas as gpd
import geodatasets
# open the Natural Earth dataset
n_earth = geodatasets.get_path('naturalearth.land')
gdf = gpd.read_file(n_earth)
# apply the function to trim the geometry (for display purpose)
gdf['geometry_str'] = gdf.geometry.map(lambda x: str(x)[:50])
# preview the dataframe
print(gdf.head().to_html())

  • Lines 1–2: We import the geopandas and geodatasets libraries.

  • Line 4: We retrieve the file path for the Natural Earth dataset using the get_path() function.

  • Line 5: We read the contents of the file at the specified path using the read_file() function.

  • Line 7: We add a new column called geometry_str, which contains a truncated string representation of the geometry column for visualization purposes.

  • Line 9: We preview the dataset as HTML using the .to_html() method.

Here, we can observe that each row represents a feature (in this dataset, a land polygon such as an island) with its corresponding attributes. The geometry of each record (row) is stored in a special column called geometry, which holds Shapely geometries and makes it possible for GeoPandas to render them and perform spatial operations on them.

Note: The geometry column can have any name. In fact, a GeoDataFrame can have multiple columns with geographic information, but only one can be active at a time. To set a column as the active geometry, we can use the command below. By default, set_geometry returns a new GeoDataFrame, so we assign the result.

gdf = gdf.set_geometry('column name')
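The following minimal sketch illustrates switching the active geometry. The polygons, labels, and the centroid column name are invented for the illustration.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two hypothetical square polygons
squares = gpd.GeoDataFrame(
    {"label": ["unit", "double"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
    ],
)

# Add a second geometry column holding each polygon's centroid
squares["centroid"] = squares.geometry.centroid
print(squares.geometry.name)  # name of the currently active geometry column

# set_geometry returns a new GeoDataFrame unless inplace=True is passed
by_centroid = squares.set_geometry("centroid")
print(by_centroid.geometry.name)
print(by_centroid.geom_type.tolist())
```

Spatial methods and plotting always operate on the active column, so the original frame still works with polygons while by_centroid works with points.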

Other data formats (Fiona)

Besides traditional shapefiles, which are one of the most used data formats for vector geometries, GeoPandas is able to open other types of geographic data. For that, it uses Fiona underneath, which is built on top of GDAL. The good news is that we don't need to learn GDAL's cumbersome API bindings. Fiona provides an elegant interface for reading and writing vector data in standard Python IO style.

Therefore, besides shapefiles, GeoPandas can read several vector-based data formats without additional configuration. For a full list of Fiona's pre-installed drivers, we can check the supported_drivers dictionary, like so:

import fiona
# Get the supported drivers
print(fiona.supported_drivers)
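The dictionary maps each driver name to a short mode string, where the letters r, a, and w stand for read, append, and write. A small sketch for inspecting a few drivers follows, assuming Fiona is installed in the environment. The three driver names checked are simply common examples, and the result depends on the local GDAL build.

```python
import fiona

# Collect the mode string for a few common drivers.
# An empty string means the driver is not listed in this installation.
modes = {
    name: fiona.supported_drivers.get(name, "")
    for name in ("ESRI Shapefile", "GeoJSON", "GPKG")
}

for name, mode in modes.items():
    print(f"{name}: read={'r' in mode}, write={'w' in mode}")
```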

The most important file formats are supported by default, such as:

  • GeoJSON

  • GeoPackage

  • ESRI file geodatabase

  • MapInfo TAB

  • DXF

All these file types will be treated automatically by the .read_file() function, as we will see in the following example. Additional formats can be supported, depending on the GDAL/OGR installation.

Reading from HTTP

One great feature of the .read_file() function, besides the ability to read distinct data formats, is its capacity to load data directly from the internet over HTTP. As we don't have anything downloaded besides the internal datasets (which come as shapefiles), let's grab something directly from the internet.

It's also possible to download remote assets with wget, for example, but we'll pass the URL directly to GeoPandas. Let's try opening a dataset with US state boundaries. In this example, the geometries are provided as .geojson:

import geopandas as gpd
gdf = gpd.read_file('https://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_admin_1_states_provinces_shp.geojson')
ax = gdf.plot(column='iso_3166_2', figsize=(7, 5))
ax.set_ylabel('Latitude (degrees)')
ax.set_xlabel('Longitude (degrees)')
ax.figure.savefig('output/states.png')

  • Line 2: We read the .geojson geometries of the US states.

  • Line 3: We plot the GeoDataFrame, specifying the column iso_3166_2 to automatically create a choropleth map with distinct colors by state.

  • Line 6: We save the figure to the output folder for visualization.
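The plotting steps do not depend on where the data came from. As an offline sketch of the same choropleth pattern, the snippet below colors three made-up rectangular regions by a hypothetical code column and saves the figure to a temporary directory. The region shapes, codes, and output file name are assumptions.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required

import geopandas as gpd
from shapely.geometry import box

# Three hypothetical side-by-side regions, each with its own code
regions = gpd.GeoDataFrame(
    {"code": ["R1", "R2", "R3"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)

# A text column is treated as categorical, giving one color per value
ax = regions.plot(column="code", figsize=(7, 5))
ax.set_ylabel("Latitude (degrees)")
ax.set_xlabel("Longitude (degrees)")

out_path = Path(tempfile.mkdtemp()) / "regions.png"
ax.figure.savefig(out_path)
print(out_path)
```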