DBSCAN in R

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together data points that lie close to each other in dense regions and treats points in low-density regions as noise. In R, we can use the dbscan package to implement DBSCAN.

The steps to implement DBSCAN in R are discussed below:

Step 1: Create a sample dataset

First, we'll generate some data points to create a sample dataset to which we can apply the DBSCAN algorithm:

# Create some sample data
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)
data

Here, we use the rnorm function to generate 100 normally distributed random numbers and arrange them in a matrix with two columns (i.e., ncol = 2), which gives 50 data points in 2D space.
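As a quick sanity check, the following sketch confirms the shape of the sample dataset. It uses only base R, and the seed and matrix call are repeated from the code above so that the snippet runs on its own.

```r
# Recreate the sample data from Step 1
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)

# 100 values spread over 2 columns give 50 rows, i.e., 50 points
dim(data)

# Show only the first few points instead of printing the whole matrix
head(data)
```

Each row of the matrix is one point, and the two columns are its coordinates.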

Step 2: Run DBSCAN

Next, we move on to the major step of this Answer, which is to run the DBSCAN algorithm itself using the dbscan package. For this coding example, we'll use the parameter values eps = 0.5 (the radius within which neighboring points are searched around a given data point) and minPts = 5 (the minimum number of points, counting the point itself, needed to form a dense region).

# Load the dbscan package
library(dbscan)
# Set up DBSCAN parameters
eps <- 0.5 # Radius for neighborhood search
minPts <- 5 # Minimum number of points to form a dense region
# Run DBSCAN
dbscan_result <- dbscan(data, eps = eps, minPts = minPts)
# Convert the result to a data frame
result_df <- data.frame(data, cluster = as.factor(dbscan_result$cluster))
# Print the generated results
result_df

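Finally, we combine the data and the cluster labels into a data frame so that the result can be plotted using the ggplot2 package in the step below. Because the matrix has no column names, data.frame names the two coordinate columns X1 and X2. The printed results show the cluster label that dbscan assigned to each data point; the label 0 marks noise points that belong to no cluster.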
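To make the roles of eps and minPts more concrete, here is a minimal base R sketch of the density test that DBSCAN builds on. It is only an illustration and not the dbscan package's implementation: the names dist_matrix, neighbor_counts, and is_core are introduced here for illustration, and the sketch assumes that each point counts as its own neighbor. It covers only the first step (finding dense "core" points), not the expansion of clusters from them.

```r
# Recreate the sample data and parameters so the sketch runs on its own
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)
eps <- 0.5
minPts <- 5

# Pairwise Euclidean distances between all points (50 x 50 matrix)
dist_matrix <- as.matrix(dist(data))

# Number of points within eps of each point; the point itself is
# included because its distance to itself is 0
neighbor_counts <- rowSums(dist_matrix <= eps)

# A point is a core point if its eps-neighborhood holds at least minPts points
is_core <- neighbor_counts >= minPts

# How many points pass the density test
sum(is_core)
```

Raising eps or lowering minPts makes the test easier to pass, so more points become core points and fewer end up as noise.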

Step 3: Plot the generated clusters

Finally, we'll visualize the results of DBSCAN via the ggplot2 package as a scatter plot in which each point is colored by the cluster it was assigned to:

library(ggplot2) # Load ggplot2 to plot the results
plot <- ggplot(result_df, aes(x = X1, y = X2, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "DBSCAN Clustering in R",
       x = "Variable 1", y = "Variable 2") +
  theme_minimal() +
  scale_color_discrete(name = "Cluster")
print(plot) # Display the plot

Code explanation

The line-by-line explanation of the code above is given below:

  • Line 2: Here, we pass result_df, the data frame built from the DBSCAN output, to ggplot. Inside aes, the x-axis is set to the first column of the dataset, the y-axis to the second column, and the point color to the cluster label.

  • Line 3: We set the size of each point to 3 via geom_point(size = 3) for greater legibility.

  • Lines 4-5: We set the title of the graph to DBSCAN Clustering in R and the axis labels to Variable 1 for the x-axis and Variable 2 for the y-axis.

  • Line 6: Here, the theme of the plot is set to a minimal theme, which removes the gray panel background and other non-essential elements.

  • Line 7: This sets the color scale for the cluster variable and names the color legend Cluster.

Conclusion

Overall, we learned how to run the DBSCAN algorithm on a random dataset and how to visualize the resulting clusters with the help of a scatter plot. The DBSCAN parameters (eps and minPts) and the plot styling can be adjusted as needed for our specific dataset and preferences.
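When adjusting the parameters, one common heuristic for picking eps (sketched here under the assumption that minPts is already fixed) is to look at each point's distance to its k-th nearest neighbor, with k = minPts - 1. The names k, k_dist, and k_dist_sorted are introduced here for illustration, and the sketch uses base R only.

```r
# Recreate the sample data so the sketch runs on its own
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)
minPts <- 5
k <- minPts - 1 # Neighbors to look for, not counting the point itself

# Pairwise Euclidean distances between all points
dist_matrix <- as.matrix(dist(data))

# For each point, sort its distances in increasing order; position 1 is
# the point itself (distance 0), so the k-th neighbor sits at position k + 1
k_dist <- apply(dist_matrix, 1, function(d) sort(d)[k + 1])

# Sorted k-NN distances; a sharp bend ("knee") in this curve is a
# candidate value for eps
k_dist_sorted <- sort(k_dist)
summary(k_dist_sorted)
```

Calling plot(k_dist_sorted, type = "l") draws the curve so that the knee can be read off visually. The dbscan package also provides a kNNdistplot function for this kind of plot.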


Copyright ©2024 Educative, Inc. All rights reserved