T-distributed stochastic neighbor embedding (t-SNE) is a machine learning technique that helps us see and understand data better. It was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008. The algorithm turns high-dimensional data into a simpler picture, usually in 2D or 3D. The goal is to make the data visually simple while keeping the important relationships between the points.
The algorithm works by converting the distances between high-dimensional data points into probabilities: similar points get a high probability of being neighbors, and dissimilar points get a low one. Each piece of information becomes a dot on a map in the simple picture, and the algorithm builds a matching set of neighbor probabilities there, using a Student's t-distribution. It then moves the dots around to make the two sets of probabilities agree as closely as possible (by minimizing their Kullback-Leibler divergence). This way, we can look at the data more simply and still see the important groups or patterns.
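The high-dimensional half of this idea can be sketched in a few lines. The snippet below computes the Gaussian neighbor probabilities for a tiny made-up dataset; for simplicity it uses a single fixed bandwidth `sigma`, whereas the real algorithm tunes the bandwidth per point to match a target perplexity:

```python
import numpy as np

# Toy data: 4 points in 3 dimensions (hypothetical example);
# points 0/1 form one tight cluster, points 2/3 another
X = np.array([[0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0],
              [5.0, 5.0, 5.0],
              [5.1, 5.0, 5.0]])

sigma = 1.0  # fixed bandwidth; real t-SNE tunes this per point via perplexity

# Gaussian conditional probabilities P[i, j] = p(j | i) in the original space
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
P = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(P, 0.0)           # a point is not its own neighbor
P /= P.sum(axis=1, keepdims=True)  # each row sums to 1

# Nearby points get high probability, distant points nearly zero
print(P[0, 1] > P[0, 2])  # → True
```

t-SNE would then place points in 2D, compute analogous probabilities there with a Student's t-distribution, and adjust the 2D positions to minimize the divergence between the two distributions.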
T-SNE is particularly useful for visualizing high-dimensional data such as images, text, and audio. It is used in many fields, including natural language processing, computer vision, and bioinformatics.
The example below uses the Iris dataset, which is available in scikit-learn. It applies t-SNE to reduce the data's dimensionality to two components and then plots the result, with a different color representing each class of the Iris dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Standardize the feature matrix
X_std = StandardScaler().fit_transform(X)

# Apply t-SNE to reduce the data to two components
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_std)

# Plot the results
plt.figure(figsize=(8, 6))

# Scatter plot with different colors for each class
for i in range(len(np.unique(y))):
    plt.scatter(X_tsne[y == i, 0], X_tsne[y == i, 1], label=f'Class {i}')

plt.title('t-SNE Visualization of Iris Dataset')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
Lines 1–5: We import the required libraries.
Line 8: We load the Iris dataset, which is part of the sklearn library.
Lines 9–10: We assign the feature matrix to X and the class labels to y.
Line 13: We standardize the feature matrix to have zero mean and unit variance.
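We can verify this claim directly. The quick check below (a standalone sketch, separate from the main script) confirms that after StandardScaler each feature column has approximately zero mean and unit variance:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)

# Each of the 4 feature columns now has ~zero mean and unit variance
print(np.allclose(X_std.mean(axis=0), 0))  # → True
print(np.allclose(X_std.std(axis=0), 1))   # → True
```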
Line 16: We create a t-SNE model with two components (dimensions) in the lower-dimensional space. The random_state parameter ensures the reproducibility of the results.
Line 17: We fit the t-SNE model to the standardized data (X_std) and transform it into the lower-dimensional space (X_tsne).
Lines 23–24: We plot the data points in the lower-dimensional space (X_tsne), coloring each point according to its class label (y); Matplotlib assigns a distinct color to each class automatically.
Lines 26–30: We add a title, axis labels, and a legend, then display the plot, giving us a 2D view of the high-dimensional data. Keep in mind that t-SNE results can vary with different random seeds and perplexity values.
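To see this variability for yourself, you can rerun the embedding with a few different perplexity values. The sketch below (plotting omitted for brevity) produces one 2D embedding per setting; the cluster shapes and distances will differ between runs:

```python
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)

# Run t-SNE with several perplexity values; each yields a different layout,
# even though every embedding has the same shape (150 samples, 2 components)
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    X_tsne = tsne.fit_transform(X_std)
    print(perplexity, X_tsne.shape)
```

Lower perplexity values emphasize local structure (small, tight clusters), while higher values emphasize more global structure; there is no single "correct" setting, so it is worth inspecting a few.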