Datasets and preprocessing in Julia

Julia is a high-level language designed for efficient computation. It is as easy to use as Python, and its performance is comparable to that of compiled languages such as C++ and Fortran. Its ecosystem includes many packages for multiprocessing, machine learning, and data preprocessing. In this Answer, we will discuss some of the datasets available through Julia packages and how we can preprocess them.

RDatasets

RDatasets is a Julia package that bundles many of the datasets shipped with R and its packages. It is not possible to cover all of them in one Answer. Therefore, we will only work with the datasets mentioned below, together with MNIST, which comes from the MLDatasets package.

MNIST

The MNIST digit dataset consists of an extensive collection of handwritten digits. It is a popular dataset for classification tasks. This dataset is available in the MLDatasets package. Use the following lines to load it.

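using MLDatasets
# Load the training and test splits (MLDatasets v0.7+ API;
# older releases used MNIST.traindata() and MNIST.testdata())
train_x, train_y = MNIST(split=:train)[:]
test_x, test_y = MNIST(split=:test)[:]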

Iris

This dataset contains sepal and petal measurements for three species of iris flowers. It is widely used in machine learning to introduce the concepts of regression and classification.

using RDatasets
iris = dataset("datasets", "iris")

Boston Housing

The Boston Housing dataset provides information about house prices in the Boston area, along with the features that affect those prices. Use the following code to load it.

using RDatasets
boston = dataset("MASS", "Boston")

Titanic

As its name suggests, this dataset contains information about the passengers aboard the Titanic. It includes features like age, gender, and survival status.

using RDatasets
titanic = dataset("datasets", "Titanic")

Breast cancer

We use this dataset to train machine learning models that predict breast cancer. Its attributes describe breast cancer cell nuclei, along with the corresponding diagnosis (benign or malignant).

using RDatasets
cancer = dataset("mlbench", "BreastCancer")

So far, we have discussed various datasets available through Julia packages. Next, we will discuss data preprocessing in Julia.

Data preprocessing in Julia

Data preprocessing mainly consists of the following steps:

  • Data cleaning

  • Feature scaling

  • Encoding categorical values

  • Dimensionality reduction

Let us discuss each of these steps individually.

Data cleaning

Data cleaning is the process of preparing and transforming raw data into a structured, clean format so that we can perform operations on it. Raw data may contain missing values and duplicates. We can drop the data points with missing values or replace the missing values with appropriate ones. The code below cleans the iris dataset.

using RDatasets, DataFrames
# Load the Iris dataset
iris = dataset("datasets", "iris")
# Display the dimensions and the first few rows of the dataset
println("Before Data Cleaning:")
println("The dimensions of the dataset are: ", size(iris))
println(first(iris, 5))
# Handling missing values: drop rows that contain any missing entry
cleaned_data = dropmissing(iris)
# Removing duplicates: keep only the first occurrence of each row
cleaned_data = unique(cleaned_data)
# Display the cleaned dataset
println("\nAfter Data Cleaning:")
println("The dimensions of the dataset are: ", size(cleaned_data))
println(first(cleaned_data, 5))

The code prints the dimensions and the first five rows of the dataset before and after cleaning.

The iris dataset has no missing values, so dropmissing leaves it unchanged. However, it contains one duplicated row, which unique removes, so the dimensions change from (150, 5) to (149, 5).
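Dropping rows is only one way to handle missing values. As mentioned above, we can also replace them with appropriate values. The following is a minimal sketch of mean imputation that uses only Julia's standard library. The column name and its values are hypothetical and are not taken from any of the datasets above.

```julia
using Statistics

# Hypothetical column with missing entries (illustrative values only)
sepal_length = [5.1, missing, 4.7, 5.0, missing, 5.4]

# Mean of the observed (non-missing) values
observed_mean = mean(skipmissing(sepal_length))

# Replace each missing entry with that mean;
# coalesce returns its first non-missing argument
imputed = coalesce.(sepal_length, observed_mean)

println("Mean of observed values: ", observed_mean)
println("Imputed column: ", imputed)
```

Mean imputation keeps every row but reduces the variance of the column, so whether it is appropriate depends on the dataset and on how many values are missing.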

Feature scaling

Feature scaling is the process of normalizing the data. It maps all features to a common scale so that features with large numeric ranges do not dominate the others, which helps many machine learning algorithms converge efficiently. There are various methods of feature scaling, such as standardization, min-max scaling, and robust scaling.

The following code applies min-max scaling to a sample dataset.

using ScikitLearn
using DataFrames
# Create a sample DataFrame
df = DataFrame(age = [25, 30, 35, 40, 45],
               income = [50000, 60000, 70000, 80000, 90000],
               credit_score = [600, 700, 750, 800, 850])
# Import the MinMaxScaler from scikit-learn through ScikitLearn.jl
@sk_import preprocessing: MinMaxScaler
# Create an instance of the MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the data using the scaler
# (the DataFrame is converted to a plain matrix before it is passed to scikit-learn)
scaled_data = fit_transform!(scaler, Matrix(df))
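To make the arithmetic behind these methods concrete, here is a minimal sketch of min-max scaling and standardization written with Julia's standard library only. The helper names minmax_scale and standardize are hypothetical and are not part of any package. The sketch assumes the column is not constant, because a constant column would cause a division by zero.

```julia
using Statistics

# Min-max scaling: map values linearly onto the range [0, 1]
minmax_scale(x) = (x .- minimum(x)) ./ (maximum(x) - minimum(x))

# Standardization (z-score): zero mean and unit standard deviation.
# Note: std uses the sample standard deviation (n - 1 in the denominator).
standardize(x) = (x .- mean(x)) ./ std(x)

# Sample column (same values as the income column above, as floats)
income = [50000.0, 60000.0, 70000.0, 80000.0, 90000.0]

income_minmax = minmax_scale(income)
income_z = standardize(income)

println("Min-max scaled: ", income_minmax)
println("Standardized:   ", income_z)
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, whereas standardization does not bound the values to a fixed range.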

Encoding categorical values

Encoding categorical values transforms categorical or textual data into a numerical representation that machine learning algorithms can process. Common techniques include one-hot encoding, label encoding, and ordinal encoding. The code to perform one-hot encoding on a dataset is as follows:

using MLJ
using DataFrames
# Create a sample DataFrame with categorical variables
df = DataFrame(category = ["A", "B", "C", "A", "B", "C"],
               gender = ["Male", "Female", "Male", "Female", "Male", "Female"])
# OneHotEncoder only encodes columns with a Finite scientific type,
# so coerce the string columns to Multiclass first
df = coerce(df, :category => Multiclass, :gender => Multiclass)
# Specify the categorical columns to encode
categorical_features = [:category, :gender]
# Create a machine that binds the one-hot encoder to the data
encoder = machine(OneHotEncoder(features = categorical_features), df)
# Fit the encoder
fit!(encoder)
# Transform the data to obtain the one-hot encoded table
transformed_df = MLJ.transform(encoder, df)
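To show what these encodings produce, the following is a minimal sketch of label encoding and one-hot encoding in plain Julia, without any packages. The variable names are hypothetical, and the levels are assumed to be ordered alphabetically, which is only one possible convention.

```julia
# Hypothetical categorical column
category = ["A", "B", "C", "A", "B", "C"]

# Sorted list of the distinct levels
level_names = sort(unique(category))

# Label encoding: map each level to its integer position in level_names
level_index = Dict(level => i for (i, level) in enumerate(level_names))
label_encoded = [level_index[c] for c in category]

# One-hot encoding: one indicator column per level (rows are observations)
one_hot = [c == level ? 1 : 0 for c in category, level in level_names]

println("Levels: ", level_names)
println("Label encoded: ", label_encoded)
println("One-hot matrix size: ", size(one_hot))
```

Label encoding is compact but implies an order among the levels, so one-hot encoding is usually the safer choice for categories that have no natural order.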

Dimensionality reduction

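Dimensionality reduction is the process of reducing the number of features or variables in the dataset. We keep the features that carry the most information and help our model learn efficiently. There are two main approaches to dimensionality reduction: feature selection, which keeps a subset of the original features, and feature extraction, which derives new features from the original ones. The following code uses principal component analysis (PCA), a feature extraction technique.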

using MultivariateStats
using DataFrames
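# Create a sample dataset with floating-point values
df = DataFrame(X1 = [1.0, 2.0, 3.0, 4.0, 5.0], X2 = [2.0, 4.0, 6.0, 8.0, 10.0], X3 = [3.0, 6.0, 9.0, 12.0, 15.0])
# Convert the DataFrame to a Float64 matrix and transpose it, because
# MultivariateStats expects each column to be one observation
X = permutedims(Matrix{Float64}(df))
# Perform PCA, keeping at most two principal components
pca_result = fit(PCA, X; maxoutdim = 2)
# Transform the data using the PCA model
# (predict is used in recent MultivariateStats releases; older ones use MultivariateStats.transform)
transformed_data = predict(pca_result, X)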
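For readers who want to see what PCA computes, the following is a minimal sketch that performs the same kind of projection with Julia's standard library, using a singular value decomposition of the centered data. The sample values are hypothetical. Unlike the MultivariateStats convention above, this sketch assumes that rows are observations and columns are features.

```julia
using LinearAlgebra, Statistics

# Rows are observations, columns are features (hypothetical values)
X = [2.5 2.4 1.2;
     0.5 0.7 0.3;
     2.2 2.9 1.1;
     1.9 2.2 0.9;
     3.1 3.0 1.6]

# Center each feature at zero mean
Xc = X .- mean(X, dims = 1)

# Singular value decomposition of the centered data
F = svd(Xc)

# Keep the first k principal directions (columns of V)
k = 2
components = F.V[:, 1:k]

# Project the centered data onto those directions
scores = Xc * components

# Fraction of the total variance captured by each component
explained = (F.S .^ 2) ./ sum(F.S .^ 2)

println("Reduced data size: ", size(scores))
println("Explained variance ratio: ", explained)
```

The explained variance ratio is a common guide for choosing k: we keep as many components as are needed to capture most of the variance.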

Conclusion

In this Answer, we discussed several datasets available through Julia packages. These datasets help us build machine learning models for learning about logistic regression, linear regression, classification, and image recognition. It is equally important to learn how to prepare a dataset using techniques like data cleaning, feature scaling, categorical encoding, and dimensionality reduction.

Q

How does feature scaling help machine learning algorithms?

A) It normalizes data, making it easier for algorithms to converge efficiently.

B) It transforms textual data into a binary representation.

C) It reduces the number of features in the dataset.
