What Is Data Science?
Explore the fundamentals of data science, understanding how data from various sources is transformed through a pipeline into actionable insights. This lesson covers problem definition, data collection and cleaning, exploratory analysis, modeling, deployment, and the role of data science in industries such as tech, healthcare, and retail. Gain foundational knowledge to start working on data science projects effectively.
Data science
Data science is a multidisciplinary field that involves studying and analyzing large sets of data to uncover valuable insights for businesses. It combines principles from mathematics, statistics, artificial intelligence, and computer engineering. Data scientists use various methods and technologies to process both structured and unstructured data, drawing on what has happened in the past to generate predictions and recommendations for the future. It’s all about extracting meaningful knowledge from data through a scientific approach.
In simple terms, data science means digging deep into a huge pool of data to discover valuable insights. For example, given textual data in its raw form, we can build a bigram language model that learns which words tend to follow which, and use those patterns to generate new text.
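To make the bigram idea concrete, here is a minimal sketch in plain Python. The corpus string and the function names are illustrative, not from this lesson: we count which word follows which, then use those counts to predict a likely next word.

```python
from collections import defaultdict, Counter

def build_bigram_model(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        model[w1][w2] += 1
    return model

def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical toy corpus for illustration.
corpus = "the cat sat on the mat and the cat ran"
model = build_bigram_model(corpus)
print(most_likely_next(model, "the"))  # "cat" follows "the" most often
```

A real language model would also smooth the counts and sample from the probability distribution rather than always picking the most frequent follower, but the counting step above is the core of the bigram approach.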
Data science pipeline
So, how does data science transform the data into insights? Let’s learn about this process with the data science pipeline. The data science pipeline or data science workflow is a systematic process to extract insights and value from data. The pipeline consists of problem formulation, data collection, data cleaning, data exploration, modeling and evaluation, getting interpretations, and deployment of models. Here’s a typical data science pipeline broken down into key steps.
Problem definition
Every data science project starts with a clear understanding of the problem we want to solve and the questions our data should answer. This step defines the objectives, scope, and constraints of the project, and it provides a foundation for establishing key performance indicators (KPIs).
Data collection
Data collection involves gathering different types of data from various places. This data can be structured (like customer information) or unstructured (like videos and social media posts). It can be extracted manually or with scripts that automate the process. The data can be historical (records accumulated over time) or current (even live data). It can be text, images, videos, audio, transcriptions, reviews, survey findings, and so on.
Data cleaning
After data collection, data cleaning and preprocessing are crucial parts of data science projects. We clean and transform raw data to get it ready for analysis. Several preprocessing techniques can be applied to raw data, including correcting errors, combining data from different sources, normalization, fixing formatting issues, handling missing values, dimensionality reduction (reducing the number of features to be analyzed), and more.
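A few of these cleaning steps can be sketched with pandas. The small customer table below is hypothetical, invented purely to demonstrate removing duplicates, fixing formatting, dropping rows with a missing key field, and imputing missing values:

```python
import pandas as pd

# Hypothetical raw customer data with a duplicate row, missing values,
# and inconsistent name formatting.
raw = pd.DataFrame({
    "name":  ["Alice ", "bob", "Alice ", None],
    "age":   [34, None, 34, 29],
    "spend": [120.0, 85.5, 120.0, 60.0],
})

clean = (
    raw.drop_duplicates()                                         # remove repeated rows
       .assign(name=lambda d: d["name"].str.strip().str.title())  # fix formatting
       .dropna(subset=["name"])                                   # drop rows missing a key field
)
clean["age"] = clean["age"].fillna(clean["age"].median())         # impute missing ages
print(clean)
```

Which techniques apply, and in what order, depends entirely on the dataset; this is just one plausible sequence.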
Exploratory data analysis
Exploratory data analysis involves examining data to uncover patterns, trends, and data characteristics. This process is critical for generating hypotheses and assessing suitability for modeling and analysis. It involves visualization and statistical analysis of data to identify outliers, missing values, skewness, and so on. Ultimately, it helps us make informed business decisions and spot potential growth opportunities.
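As a small illustration of this step, the sketch below summarizes a hypothetical series of daily sales figures and flags outliers with the common interquartile-range (IQR) rule; the numbers are invented for the example:

```python
import pandas as pd

# Hypothetical daily sales figures with one suspicious spike.
sales = pd.Series([120, 135, 128, 142, 980, 131, 138])

# Basic statistical summary: count, mean, std, quartiles, min/max.
print(sales.describe())

# Flag outliers with the interquartile-range (IQR) rule.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(outliers.tolist())  # the 980 spike stands out
```

In practice, exploratory analysis also leans heavily on plots (histograms, scatter plots, box plots) to reveal skewness and trends that summary statistics alone can hide.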
Modeling and analysis
Modeling and analysis is itself a pipeline that consists of feature engineering, appropriate model selection and training, model evaluation, and interpretation. First, we transform the clean, processed data into features to train and test our selected models. Then, we evaluate the model’s performance with various statistical techniques to ensure it generalizes to unseen data. Finally, we can interpret and visualize the important results.
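The train/evaluate loop can be sketched with scikit-learn. The data here is synthetic (randomly generated to mimic a simple bill-to-tip relationship), so the numbers carry no real-world meaning; the point is the split-fit-score pattern:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data: one feature (e.g., a bill amount) and a noisy target.
rng = np.random.default_rng(0)
X = rng.uniform(10, 50, size=(200, 1))
y = 0.15 * X[:, 0] + rng.normal(0, 0.5, 200)

# Hold out a test set so evaluation reflects generalization, not memorization.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(f"held-out R^2: {score:.2f}")
```

Evaluating on held-out data, rather than the data the model was trained on, is what lets us claim the model will generalize.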
Model deployment
If the model performs well and meets the project objectives that we defined in the problem definition, we deploy it for real-world use. This step involves integration into different production systems, such as web and mobile applications, and decision support tools. We also continuously monitor the model’s performance and update it as necessary.
Communication of insights
We document and present findings, insights, and recommendations to stakeholders. We can convey the information effectively by using data visualization and storytelling techniques. It also helps to collaborate with nontechnical team members to ensure understanding.
Feedback and revision
Finally, we collect feedback from stakeholders and end users. Based on it, we iterate on the data, model, or analysis and do the revisions accordingly. The specific steps and tools used in a data science project may vary depending on its complexity, the availability of data, and the goals of the analysis. Data science is an iterative process, and data scientists often iterate back through these stages as new insights and challenges arise.
Applications of data science
Big companies like Google, Facebook, and Amazon use data science techniques to improve their services and products. Google examines our search behavior to show us ads that are most likely to match our interests, which benefits both us and the businesses. Facebook and Netflix also use data science to improve our experience: they look at what we do on their platforms to show us things we’ll be interested in.
Amazon utilizes data science in various ways. It tracks our purchases and predicts what we might want in the future, making our online shopping experience more personalized. Data is also used to make sure products are always available and to make the delivery process more efficient. The following illustration describes an example of a recommender system that learns from a user’s behavior and suggests products to other, similar users.
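The idea behind such a recommender can be sketched in a few lines of plain Python. The users, items, and ratings below are entirely made up: we find the user most similar to a target user (by cosine similarity over shared ratings) and suggest items that similar user liked but the target hasn’t seen.

```python
import math

# Hypothetical user -> {item: rating} data for illustration.
ratings = {
    "ana":  {"laptop": 5, "mouse": 4, "desk": 1},
    "ben":  {"laptop": 4, "mouse": 5, "lamp": 4},
    "cara": {"desk": 5, "lamp": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Suggest items rated by the most similar user but unseen by `user`."""
    others = [(cosine_similarity(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return [item for item in ratings[nearest] if item not in ratings[user]]

print(recommend("ana", ratings))  # items ana's nearest neighbor liked
```

Production recommenders at companies like Amazon work at a vastly larger scale with far richer signals, but this user-to-user similarity idea is one of the classic starting points (collaborative filtering).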
Data science is used in many different fields because most industries generate a lot of data they can use. As businesses grow, or new businesses start, they need data to create personalized, targeted approaches, so they need data science. Data science helps machines make smart choices and makes many processes more efficient. We’ll expand on the industry usage of data science in the next lesson.
Important industries, like healthcare, finance, online shopping, and automotive, already use data science to improve their operations. People have been using data to make predictions for a long time, for example, in business forecasting with time series data. More recently, generative AI has become the new buzz: tools such as ChatGPT are used in many digital tasks. As time passes, data science continues to expand and develop, becoming more important and useful in various areas.
What will we build in this course?
As we proceed in the course, we’ll explore different aspects of data science. Going ahead, we’ll tackle the challenge of predicting restaurant tips using a public dataset with details such as the total bill, the time and day of the bill payment, and so on. We’ll use various data science techniques and compare the results.
Here, we have included a sneak peek of some of the visualizations we derived using various data science techniques:
This is just a quick look; in the upcoming lessons, we’ll dive deep into the code to explore how the data has been analyzed to enhance decision-making.