
Plotting the Data

Explore how to plot data by merging mental illness and GDP datasets using Python. Understand the visual relationships between wealth and the prevalence of anxiety, schizophrenia, and pancreatic cancer through scatter plots and regression trends.

After loading the libraries, we’ll merge our anxiety disorder data frame (anx) with the GDP data using pd.merge so that we can generate a scatter plot. We’ll do the same for our schizophrenia data (sch).
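As a minimal sketch of that merge step, the example below joins two small, made-up data frames on a shared Country column. The values are invented for illustration only; the column names Value (GDP) and Val (prevalence) mirror the ones used in this lesson. By default, pd.merge performs an inner join, so a country that appears in only one frame is dropped from the result.

```python
import pandas as pd

# Toy data, invented for illustration (not real figures)
gdp = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Value": [1000.0, 25000.0, 60000.0],   # per-capita GDP
})
anx = pd.DataFrame({
    "Country": ["A", "B", "D"],
    "Val": [4100.0, 3900.0, 4500.0],       # prevalence per 100k
})

# Inner join (the default): keep only countries present in both frames
merged = pd.merge(gdp, anx, on="Country")
print(merged)
```

Countries C and D each appear in only one frame, so neither survives the merge. With real data, this is worth checking, because differently spelled country names silently fall out in the same way.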

Plot prevalence of anxiety disorder

Plotting the graph won’t involve anything we haven’t seen before in these projects. The labels argument is used to make the plot more readable by assigning descriptive axis labels.

Here’s the code that produces the plot:

Python 3.5
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px

# Load per-capita GDP and anxiety prevalence data
gdp = pd.read_csv('WorldBank_PerCapita_GDP.csv')
anx = pd.read_csv('anxiety.csv')
anx = anx[['Country', 'Val']]

# Merge on country; 'Value' is GDP, 'Val' is prevalence
merged_data_anx = pd.merge(gdp, anx, on='Country')
merged_data_anx = merged_data_anx[['Country', 'Value', 'Val']]

# Drop rows with any missing value
merged_data_anx = merged_data_anx.dropna()

# Scatter plot with an OLS trendline (the trendline needs statsmodels)
fig = px.scatter(merged_data_anx, x="Val", y="Value",
                 trendline="ols", log_x=True,
                 labels={
                     "Value": "GDP (in dollars)",
                     "Val": "Prevalence of Anxiety Disorders (/100k)"
                 },
                 hover_data=["Country", "Val"])

# Save a static image (needs an image export engine such as kaleido)
fig.write_image("output/graph.png")

While some may find this surprising, the regression line is nearly flat, indicating that prevalence differs little, on average, between rich and developing nations. There is, however, quite a cluster of high-prevalence countries at the right edge of the x-axis along the very bottom (i.e., among the least wealthy countries). So although the overall trend is weak, the plot hints that some of the highest anxiety rates are found in low-GDP countries.
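To make "nearly flat" concrete: trendline="ols" asks Plotly to fit an ordinary least squares line, and the slope of that line is what we read off the plot. The sketch below fits such a line with np.polyfit on invented numbers (not this lesson's data). It is only meant to show how a slope near zero differs from a clearly positive one.

```python
import numpy as np

# Invented values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_flat = np.array([10.0, 10.0, 10.0, 10.0, 10.0])   # no relationship with x
y_rising = np.array([2.0, 4.0, 6.0, 8.0, 10.0])     # exactly y = 2x

# A degree-1 polyfit is a least squares line: y = slope * x + intercept
slope_flat, intercept_flat = np.polyfit(x, y_flat, 1)
slope_rising, intercept_rising = np.polyfit(x, y_rising, 1)

print("flat slope:", slope_flat)
print("rising slope:", slope_rising)
```

A slope close to zero means the fitted line says almost nothing about how one variable changes with the other, which is the situation described for the anxiety plot.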

Plot prevalence of schizophrenia

Here’s how we plot the prevalence of schizophrenia:

Python 3.5
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px

# Load per-capita GDP and schizophrenia prevalence data
gdp = pd.read_csv('WorldBank_PerCapita_GDP.csv')
sch = pd.read_csv('schizophrenia.csv')
sch = sch[['Country', 'Val']]

# Merge on country; 'Value' is GDP, 'Val' is prevalence
merged_data_sch = pd.merge(gdp, sch, on='Country')
merged_data_sch = merged_data_sch[['Country', 'Value', 'Val']]

# Drop rows with any missing value
merged_data_sch = merged_data_sch.dropna()

# Scatter plot with an OLS trendline (the trendline needs statsmodels)
fig = px.scatter(merged_data_sch, x="Val", y="Value",
                 trendline="ols", log_x=True,
                 labels={
                     "Value": "GDP (in dollars)",
                     "Val": "Prevalence of Schizophrenia (/100k)"
                 },
                 hover_data=["Country", "Val"])

# Save a static image (needs an image export engine such as kaleido)
fig.write_image("output/graph.png")

Here we see another trend that some may find surprising: the prevalence of schizophrenia increases noticeably as we move up the y-axis towards higher GDP. The lowest rate among “developed” nations was Denmark’s, at 286 cases per 100,000.

1.

How much of the U.S. population is affected by schizophrenia?


Plot prevalence of pancreatic cancer

As a check on our findings, let’s take a disease that should have, at most, minimal cultural connections. Perhaps this condition will show minimal variation between countries. Using all the same code, here’s how that turned out for pancreatic cancer:

Python 3.5
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px

# Load per-capita GDP and pancreatic cancer prevalence data
gdp = pd.read_csv('WorldBank_PerCapita_GDP.csv')
pan = pd.read_csv('pancreatic.csv')
pan = pan[['Country', 'Val']]

# Merge on country; 'Value' is GDP, 'Val' is prevalence
merged_data_pan = pd.merge(gdp, pan, on='Country')
merged_data_pan = merged_data_pan[['Country', 'Value', 'Val']]

# Drop rows with any missing value
merged_data_pan = merged_data_pan.dropna()

# Scatter plot with an OLS trendline (the trendline needs statsmodels)
fig = px.scatter(merged_data_pan, x="Val", y="Value",
                 trendline="ols", log_x=True,
                 labels={
                     "Value": "GDP (in dollars)",
                     "Val": "Prevalence of Pancreatic Cancer (/100k)"
                 },
                 hover_data=["Country", "Val"])

# Save a static image (needs an image export engine such as kaleido)
fig.write_image("output/graph.png")

This is another result that some may find surprising. The clear upward slope of the regression line (or trendline) shows an unmistakable correlation between higher wealth and higher prevalence of pancreatic cancer. What’s going on here? We’ll explore this in the next lesson.
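Since the three scripts in this lesson differ only in the input file and the axis label, the shared merge-and-clean steps could be pulled into one helper. The function name merge_with_gdp and the toy frames below are hypothetical, introduced only for this sketch; in the lesson, the frames come from pd.read_csv.

```python
import pandas as pd

def merge_with_gdp(gdp, condition):
    """Join a prevalence frame to the GDP frame and drop incomplete rows.

    Hypothetical helper: assumes both frames have a 'Country' column,
    GDP is stored in 'Value', and prevalence is stored in 'Val'.
    """
    merged = pd.merge(gdp, condition[["Country", "Val"]], on="Country")
    merged = merged[["Country", "Value", "Val"]]
    return merged.dropna().reset_index(drop=True)

# Toy frames standing in for the CSV files (invented values)
gdp = pd.DataFrame({"Country": ["A", "B", "C"],
                    "Value": [1000.0, None, 60000.0]})
pan = pd.DataFrame({"Country": ["A", "B", "C"],
                    "Val": [3.0, 5.0, 9.0]})

merged_data_pan = merge_with_gdp(gdp, pan)
print(merged_data_pan)
```

Country B has no GDP value in the toy data, so dropna removes it, which is the same cleanup each script above performs before plotting.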

1.

Write code to draw a scatter plot with a regression line.


1.

What is the use of trendline="ols"?
