Datasets and preprocessing in Julia
Julia is a high-level language designed for efficient numerical computation. It is easy to use, like Python, and well-written Julia code can approach the performance of compiled languages like C++ and Fortran. Its ecosystem includes many packages for multiprocessing, machine learning, and data preprocessing. In this Answer, we will discuss some of the datasets available through Julia packages and how we can perform preprocessing on them.
RDatasets
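RDatasets is a Julia package that provides access to many of the datasets distributed with R and its packages. It is not possible to cover all of them in one Answer. Therefore, we will only work with the datasets mentioned below. Note that MNIST is the exception: it comes from the MLDatasets package rather than RDatasets.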
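To see what is available before picking a dataset, the package can list its own contents. The snippet below is a minimal sketch that assumes RDatasets and DataFrames are installed; the variable names are illustrative only.

```julia
using RDatasets, DataFrames

# One row per R package that RDatasets mirrors
pkgs = RDatasets.packages()

# One row per dataset across all packages
all_sets = RDatasets.datasets()

# Only the datasets that belong to the R package named "datasets"
base_sets = RDatasets.datasets("datasets")

println("Number of packages: ", nrow(pkgs))
println("Number of datasets: ", nrow(all_sets))
println(first(base_sets, 5))
```

The two strings passed to `dataset(...)` in the sections below are a package name and a dataset name taken from this listing.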
MNIST
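The MNIST digit dataset consists of an extensive collection of images of handwritten digits. It is a popular dataset for classification tasks. This dataset is present in the MLDatasets package. The following lines load it.
using MLDatasets

# Load the training and test splits as (images, labels) pairs.
# Newer MLDatasets releases deprecate these calls in favor of
# MNIST(split=:train)[:] and MNIST(split=:test)[:].
train_x, train_y = MNIST.traindata()
test_x, test_y = MNIST.testdata()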
Iris
This dataset contains sepal and petal measurements for three species of iris flowers. We use it in machine learning to illustrate the concepts of regression and classification.
using RDatasets

iris = dataset("datasets", "iris")
Boston Housing
The Boston Housing dataset records median house values in the Boston area together with features that affect them, such as crime rate and number of rooms. The following code loads it.
using RDatasets

boston = dataset("MASS", "Boston")
Titanic
As its name suggests, this dataset contains information about the passengers aboard the Titanic. The version in RDatasets is a summary table: each row gives a combination of class, sex, age group, and survival status, together with the number of passengers in that group.
using RDatasets

titanic = dataset("datasets", "Titanic")
Breast cancer
We use this dataset to train a machine learning model to predict breast cancer. Its attributes describe characteristics of cell nuclei from breast tissue samples, and each record carries a diagnosis of benign or malignant.
using RDatasets

# Assumes the installed RDatasets release includes the "mlbench" package;
# RDatasets.datasets("mlbench") lists what is available.
cancer = dataset("mlbench", "BreastCancer")
So far, we have discussed various datasets present in the packages of Julia. Now we are going to discuss data preprocessing in Julia.
Data preprocessing in Julia
Data preprocessing mainly consists of the following processes:
Data cleaning
Feature scaling
Encoding categorical values
Dimensionality reduction
Let us discuss all these processes individually.
Data cleaning
Data cleaning is preparing and transforming the raw data into a structured and clean format to perform operations on the data. The raw data may contain missing values and duplicates. We can drop the data points with missing values or replace them with appropriate ones. Below is the code where we are cleaning the iris dataset.
using RDatasets, DataFrames

# Load the Iris dataset
iris = dataset("datasets", "iris")

# Display the first few rows of the dataset
println("Before Data Cleaning:")
println("The dimensions of the dataset are: ", size(iris))
println(first(iris, 5))

# Handle missing values by dropping the rows that contain them
cleaned_data = dropmissing(iris)

# Remove duplicate rows
cleaned_data = unique(cleaned_data)

# Display the cleaned dataset
println("\nAfter Data Cleaning:")
println("The dimensions of the dataset are: ", size(cleaned_data))
println(first(cleaned_data, 5))
The iris data contains no missing values, so dropmissing leaves it unchanged. It does contain one duplicated row, which unique removes, so the reported size changes from (150, 5) before cleaning to (149, 5) after cleaning.
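Dropping rows is only one of the two options mentioned above; the other is to replace missing values with an appropriate one. The following is a minimal sketch of mean imputation using only the standard library. The column `petal_length` and its values are hypothetical and are not taken from the iris dataset.

```julia
using Statistics

# Hypothetical column with two missing entries
petal_length = [1.4, missing, 4.7, 5.1, missing, 1.5]

# Mean of the observed values only; skipmissing ignores the missing entries
observed_mean = mean(skipmissing(petal_length))

# coalesce returns its first non-missing argument, so each missing entry
# is replaced by the mean and every other value is left unchanged
filled = coalesce.(petal_length, observed_mean)

println("Mean of observed values: ", observed_mean)
println("Filled column: ", filled)
```

Whether the mean, the median, or some other value is appropriate depends on the data; the mean is used here only to keep the sketch short.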
Feature scaling
Feature scaling is the process of normalizing the data. It standardizes the range of features by mapping them to a standard scale, and doing this helps machine learning algorithms to converge efficiently. There are various methods of doing feature scaling, like standardization, min-max scaling, robust scaling, etc.
The following code applies min-max scaling to a sample dataset.
using ScikitLearn
using DataFrames

# Import the MinMaxScaler from scikit-learn through ScikitLearn.jl
@sk_import preprocessing: MinMaxScaler

# Create a sample DataFrame
df = DataFrame(
    age = [25, 30, 35, 40, 45],
    income = [50000, 60000, 70000, 80000, 90000],
    credit_score = [600, 700, 750, 800, 850],
)

# Convert to a numeric matrix (rows are observations, columns are features)
X = Matrix{Float64}(df)

# Create an instance of the MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler and transform the data; each column is mapped to [0, 1]
scaled_data = fit_transform!(scaler, X)
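To make the arithmetic behind these methods concrete, here is a minimal sketch of min-max scaling and standardization written with the standard library only. The helper names `minmax_scale` and `standardize` are hypothetical and do not belong to any package.

```julia
using Statistics

# Min-max scaling: map a column linearly onto the interval [0, 1]
minmax_scale(x) = (x .- minimum(x)) ./ (maximum(x) - minimum(x))

# Standardization: shift to zero mean and scale to unit standard deviation
# (std uses the n - 1 denominator by default)
standardize(x) = (x .- mean(x)) ./ std(x)

age = [25, 30, 35, 40, 45]
income = [50000, 60000, 70000, 80000, 90000]

println("Min-max scaled age:    ", minmax_scale(age))
println("Min-max scaled income: ", minmax_scale(income))
println("Standardized age:      ", standardize(age))
```

After min-max scaling, `age` and `income` take identical values even though their original ranges differ by three orders of magnitude, which is the point of putting features on a common scale. Both helpers divide by zero on a constant column, so a real pipeline would need to guard against that case.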
Encoding categorical values
Encoding the categorical values transforms the categorical or textual data into a representation that machine learning algorithms can use to interpret the information better. Some techniques include one-hot encoding, label encoding, ordinal encoding, etc. The code to perform one-hot encoding on a dataset is as follows:
using MLJ
using DataFrames

# Create a sample DataFrame with categorical variables
df = DataFrame(
    category = ["A", "B", "C", "A", "B", "C"],
    gender = ["Male", "Female", "Male", "Female", "Male", "Female"],
)

# MLJ treats strings as Textual; coerce the columns to Multiclass
# so that OneHotEncoder recognizes them as categorical
df = coerce(df, :category => Multiclass, :gender => Multiclass)

# Specify the categorical columns to encode
categorical_features = [:category, :gender]

# Create a machine that binds the encoder to the data
encoder = machine(OneHotEncoder(features = categorical_features), df)

# Fit the encoder
fit!(encoder)

# Transform the data into one-hot encoded columns
transformed_df = MLJ.transform(encoder, df)
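The idea behind one-hot and label encoding can also be shown without any package. The sketch below uses plain Julia; the function `onehot` is a hypothetical helper written for illustration and is not part of MLJ.

```julia
# Return the sorted levels and a Bool matrix with one column per level
function onehot(values::AbstractVector)
    levels = sort(unique(values))
    # Entry (i, j) is true exactly when observation i has level j
    encoded = [v == l for v in values, l in levels]
    return levels, encoded
end

category = ["A", "B", "C", "A", "B", "C"]

# One-hot encoding: one indicator column per level
levels, encoded = onehot(category)

# Label encoding: replace each value by the index of its level
labels = [findfirst(==(v), levels) for v in category]

println("Levels: ", levels)
println("One-hot matrix size: ", size(encoded))
println("Label encoding: ", labels)
```

Each row of `encoded` contains exactly one `true`. Label encoding is more compact but imposes an order on the levels, which is why one-hot encoding is usually preferred for unordered categories.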
Dimensionality reduction
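Dimensionality reduction is the process of reducing the number of features or variables in the dataset. We keep the features that carry the most information and help our model learn efficiently. There are two approaches to dimensionality reduction: feature selection, which keeps a subset of the original features, and feature extraction, which builds new features from combinations of the original ones. The code below uses principal component analysis (PCA), a feature extraction method.
using MultivariateStats
using DataFrames

# Create a sample dataset with floating-point values
df = DataFrame(
    X1 = [1.0, 2.0, 3.0, 4.0, 5.0],
    X2 = [2.0, 4.0, 6.0, 8.0, 10.0],
    X3 = [3.0, 6.0, 9.0, 12.0, 15.0],
)

# MultivariateStats expects one observation per column, so transpose
# the 5×3 table into a 3×5 matrix of Float64
X = permutedims(Matrix{Float64}(df))

# Fit PCA, keeping at most two principal components
pca_result = fit(PCA, X; maxoutdim = 2)

# Project the data onto the principal components
# (older MultivariateStats releases name this function `transform`)
transformed_data = predict(pca_result, X)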
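For readers who want to see what PCA computes, the following is a minimal sketch based on the singular value decomposition, using only the standard library. It is not how MultivariateStats implements PCA internally; the data values are hypothetical, and here rows are observations and columns are features.

```julia
using LinearAlgebra, Statistics

# Hypothetical data: 5 observations (rows) of 2 features (columns)
X = [2.5 2.4;
     0.5 0.7;
     2.2 2.9;
     1.9 2.2;
     3.1 3.0]

# Center each feature at zero mean
Xc = X .- mean(X, dims = 1)

# SVD of the centered data: the columns of F.V are the principal directions
F = svd(Xc)

# Variance captured by each component and its share of the total
explained = F.S .^ 2 ./ (size(X, 1) - 1)
ratio = explained ./ sum(explained)

# Project the data onto the first k principal components
k = 1
scores = Xc * F.V[:, 1:k]

println("Explained variance ratio: ", ratio)
println("Size of the reduced data: ", size(scores))
```

The entries of `ratio` indicate how much information is lost by dropping a component, which is a common way to choose `k`.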
Conclusion
In this Answer, we discussed several datasets that are available through Julia packages. These datasets help us build and study machine learning models for logistic regression, linear regression, classification, and image recognition. It is equally important to learn how to prepare a dataset using techniques such as data cleaning, feature scaling, encoding categorical values, and dimensionality reduction.
How does feature scaling help machine learning algorithms?
It normalizes data, making it easier for algorithms to converge efficiently.
It transforms textual data into a binary representation.
It reduces the number of features in the dataset.