Data science works with many kinds of data. Some data takes the form of numbers, while other data exists as categories. Different types of data need to be represented differently.
Data can be broadly classified into two main categories: quantitative data and qualitative data.
Quantitative data exists in the form of quantities or numbers. This includes the population of a country, the weight of a person, or the number of days in a week.
Qualitative data exists as categories. It is non-numerical in nature. This includes categories of gender, daily weather, or types of degrees in a university.
Both qualitative and quantitative data can be sub-divided into further categories.
Quantitative data can be sub-divided into two main categories: discrete and continuous data.
Discrete data consists of countable values, typically whole numbers. It can take only certain fixed values. Examples include the number of days in a month, the age of a person in completed years, or the number of siblings.
Continuous data can take any value within a range, including decimal values. Examples include the GPA of a student or the speed of a car.
Qualitative data can be sub-divided into two main categories: nominal and ordinal data.
Nominal data has no inherent order. It cannot be ranked in any way. This includes categories of gender or race.
Ordinal data has an inherent order. It can be ranked from high to low, good to bad, or vice versa. This includes levels of education in school, survey responses on a Likert scale, or Yelp ratings.
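The difference between ordinal and nominal data can be sketched in a few lines of Python. The example below is only an illustration: the Likert labels, their order, and the sample values are all made up. It shows that ordinal values can be sorted and have a meaningful middle value, while nominal values can only be counted.

```python
from collections import Counter

# Hypothetical Likert-scale labels, listed from lowest to highest (an assumption).
LIKERT_ORDER = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
rank = {label: position for position, label in enumerate(LIKERT_ORDER)}

# Ordinal data: made-up survey responses that can be ranked.
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]
sorted_responses = sorted(responses, key=lambda label: rank[label])
median_response = sorted_responses[len(sorted_responses) // 2]

# Nominal data: made-up weather observations with no order, so we only count them.
weather = ["sunny", "rainy", "sunny", "cloudy", "sunny"]
most_common_weather, most_common_count = Counter(weather).most_common(1)[0]

print(sorted_responses)
print(median_response)
print(most_common_weather, most_common_count)
```

Sorting the weather list alphabetically would be possible, but the resulting order would carry no meaning, which is what makes the data nominal.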
The illustration below summarizes types of data:
Data can be represented using visualizations. Visualizations help in providing an overview of the data along with summary statistics. Different types of data are represented using different visualizations.
We will discuss some prominent visualizations below:
A bar chart can be used to represent the counts of qualitative data. It can also be used to represent quantitative values that are grouped by category.
A bar chart has categories on the x-axis and counts or values on the y-axis. It is used to compare different values, items, and categories of data.
For example, a bar chart can show the number of students of different genders in a university. Genders will be on the x-axis as categories. Counts will be on the y-axis. A bar chart can also be used to show voting results for a particular questionnaire, as shown on the right.
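As a minimal sketch of the idea above, we can count how often each category occurs and pass those counts to a plotting library. The votes are made-up data, and `plot_bar` is a hypothetical helper that assumes matplotlib is installed. The counting part uses only the standard library.

```python
from collections import Counter

# Made-up answers to a single poll question (qualitative data).
votes = ["yes", "no", "yes", "undecided", "yes", "no"]

# Each category becomes one bar; its count becomes the bar height.
counts = Counter(votes)


def plot_bar(category_counts):
    """Draw categories on the x-axis and counts on the y-axis (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    categories = list(category_counts)
    heights = [category_counts[c] for c in categories]
    fig, ax = plt.subplots()
    ax.bar(categories, heights)
    ax.set_xlabel("Response")
    ax.set_ylabel("Count")
    plt.show()


print(dict(counts))
```

Calling `plot_bar(counts)` would draw the chart.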
A pie chart is used to represent proportions of different categories of qualitative data. A pie (circle) is divided into different segments where each segment represents a category. The size of the segment is based on the proportion of actual data.
A pie chart shows what percentage of the whole is made up of each category. It is used to indicate the composition of the data, rather than its spread.
A pie chart can be used to represent the percentage of male and female students in a class. It can also be used to show the proportion of responses in a survey questionnaire, as shown on the right.
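The segment sizes of a pie chart come from converting counts into percentages of the whole. The sketch below uses made-up survey responses. `plot_pie` is a hypothetical helper that assumes matplotlib is available.

```python
from collections import Counter

# Made-up survey responses (qualitative data).
responses = ["agree"] * 5 + ["disagree"] * 3 + ["neutral"] * 2

counts = Counter(responses)
total = sum(counts.values())

# Each segment's size is the category's share of the whole, in percent.
percentages = {category: 100 * n / total for category, n in counts.items()}


def plot_pie(shares):
    """Draw one segment per category, sized by its share (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the arithmetic works without it

    fig, ax = plt.subplots()
    ax.pie(list(shares.values()), labels=list(shares), autopct="%1.1f%%")
    plt.show()


print(percentages)
```

Because the segments always sum to 100%, a pie chart suits data where every value belongs to exactly one category.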
A histogram is used to represent quantitative continuous data. It represents a distribution, which means the counts of all columns add up to the total number of values in the data. The figure on the right shows the distribution of heights of students. We can find the total number of students by summing the counts of all columns.
Since histograms represent quantitative continuous data, the data is grouped into ranges. Each column has a lower bound and an upper bound. For example, the figure on the right groups heights into ranges of 5 cm. The height of each column shows the number of values that fall within its range.
A histogram can be used to show the heights or weights of a group of students.
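The grouping into ranges can be sketched as follows. The heights are made-up values, and `bin_counts` and `plot_histogram` are hypothetical helpers. We assume each range includes its lower bound and excludes its upper bound, which is one common convention.

```python
# Made-up student heights in centimetres (quantitative continuous data).
heights_cm = [151.2, 153.8, 156.0, 158.4, 159.9, 161.5, 162.3, 164.0, 167.7, 171.1]
BIN_WIDTH = 5  # each column covers a 5 cm range


def bin_counts(values, width):
    """Count the values in each range [lower, lower + width)."""
    counts = {}
    for value in values:
        lower = int(value // width) * width  # lower bound of the value's range
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


def plot_histogram(counts, width):
    """Draw one column per range, starting at its lower bound (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so binning works without it

    fig, ax = plt.subplots()
    ax.bar(list(counts), list(counts.values()), width=width, align="edge")
    ax.set_xlabel("Height (cm)")
    ax.set_ylabel("Number of students")
    plt.show()


counts = bin_counts(heights_cm, BIN_WIDTH)
print(counts)
```

Summing the column counts gives back the number of students, which is the property that makes a histogram a distribution of the data.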
A scatter plot is used to represent the relationship between two quantitative variables. It is often used to show a trend.
A scatter plot places one variable on each axis. It shows how the second variable changes as the first variable increases. Similarly, it can be used to show a trend over time. In this case, time is our first variable. Each circle represents one subject or observation.
A scatter plot can be used to show population growth over time or how revenue changes with the number of units sold.
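A scatter plot needs the data as pairs of values, one pair per circle. The sketch below uses made-up population figures. As an addition that is not part of the discussion above, it also computes the Pearson correlation coefficient, which is one way to put a number on how strong a linear trend is. `pearson` and `plot_scatter` are hypothetical helpers.

```python
from math import sqrt

# Made-up observations: each (year, population) pair becomes one circle.
years = [2000, 2005, 2010, 2015, 2020]
population_millions = [10.0, 11.2, 13.1, 14.3, 16.0]


def pearson(xs, ys):
    """Pearson correlation: close to +1 for a rising linear trend, -1 for a falling one."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def plot_scatter(xs, ys):
    """Draw the first variable on the x-axis and the second on the y-axis (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the arithmetic works without it

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("Year")
    ax.set_ylabel("Population (millions)")
    plt.show()


r = pearson(years, population_millions)
print(round(r, 3))
```

A value of `r` near 1 matches what the plot would show visually: the circles rise from left to right along a nearly straight line.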
A box plot is used to highlight summary statistics of quantitative data. A box plot shows the median, the quartiles (the 25th and 75th percentiles), and any outliers in a data set.
Outliers are values that lie unusually far from the rest of the data. They can be caused by incorrect measurement or recording of data values, although they can also be genuine extreme values.
For example, box plots can be used to summarize baby weights, heights of trees, or heartbeat rates.
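The statistics behind a box plot can be computed directly. The sketch below uses made-up baby weights. It assumes the common convention that a value is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. That rule is not stated above, but it is the default in matplotlib's `boxplot`. `plot_box` is a hypothetical helper.

```python
from statistics import quantiles

# Made-up baby weights in kilograms (quantitative data).
weights_kg = [2.9, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.8, 5.9]

# Quartiles: the 25th percentile, the median, and the 75th percentile.
q1, median, q3 = quantiles(weights_kg, n=4, method="inclusive")

# Assumed convention: outliers lie more than 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [w for w in weights_kg if w < lower_fence or w > upper_fence]


def plot_box(values):
    """Draw the box (quartiles), median line, whiskers, and outliers (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the statistics work without it

    fig, ax = plt.subplots()
    ax.boxplot(values)
    ax.set_ylabel("Weight (kg)")
    plt.show()


print(q1, median, q3)
print(outliers)
```

Whether a flagged value such as 5.9 kg is a recording error or a genuinely heavy baby cannot be decided from the plot alone. It only marks the value as worth checking.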