DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together data points that are packed closely together and separates these groups by regions of lower point density, whose points are treated as noise. In R, we can use the dbscan package to implement DBSCAN.
# Load the required packages
library(dbscan)
library(ggplot2)
# Create some sample data
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)
data
Here, we use the rnorm function to generate 100 normally distributed random numbers and arrange them into a two-column matrix (i.e., ncol = 2), which gives us 50 points in 2D space. Calling set.seed(123) first makes the random numbers reproducible.
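As a quick sanity check on the shape of the sample data, the short sketch below recreates the matrix from the step above and inspects its dimensions. It uses only base R, and nothing in it goes beyond the code already shown.

```r
# Recreate the sample data exactly as in the step above
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)

# 100 values split across 2 columns give 50 rows, i.e., 50 points
dim(data)

# Peek at the first few points (column 1 = x, column 2 = y)
head(data, 3)
```

Each row of the matrix is one point, so DBSCAN will assign one cluster label per row.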
Next, we move on to the main step of this Answer, which is to run the DBSCAN algorithm itself using the dbscan package. For this coding example, we'll use the parameter values eps = 0.5 (the radius within which we search for neighboring points around a given data point) and minPts = 5 (the minimum number of points required to form a dense region).
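To make the two parameters more concrete, here is a minimal base R sketch of the density test they describe: count how many points fall within eps of each point and compare that count against minPts. The names d, neighbor_count, and is_core are illustrative only, and the sketch assumes that a point counts as its own neighbor. It is not how the dbscan package is implemented, only an illustration of the idea.

```r
# Recreate the sample data used in this Answer
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)

eps <- 0.5   # neighborhood radius
minPts <- 5  # density threshold

# Pairwise Euclidean distances between all 50 points
d <- as.matrix(dist(data))

# Number of points within eps of each point
# (assumption: the point itself is counted, since its distance to itself is 0)
neighbor_count <- rowSums(d <= eps)

# A point is a "core point" when its neighborhood is dense enough
is_core <- neighbor_count >= minPts

# How many core vs. non-core points there are
table(is_core)
```

Core points seed the clusters. The full algorithm additionally links core points that lie within eps of each other and attaches nearby non-core points to them, which is the part we leave to the dbscan package in the code below.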
# Set up DBSCAN parameters
eps <- 0.5   # Radius for neighborhood search
minPts <- 5  # Minimum number of points to form a dense region
# Run DBSCAN
dbscan_result <- dbscan(data, eps = eps, minPts = minPts)
# Convert the result to a data frame
result_df <- data.frame(data, cluster = as.factor(dbscan_result$cluster))
# Print the generated results
result_df
In the end, we convert the result of the DBSCAN algorithm into a data frame so that it can be plotted using the ggplot2 package in the step below. The printed results show the cluster label that dbscan assigned to each data point. A label of 0 marks a point as noise, i.e., a point that does not belong to any cluster.
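Reading 50 printed rows is tedious, so a compact summary of the labels can help. The sketch below assumes the dbscan package is installed and reruns the same clustering as above. The names label_counts, n_clusters, and n_noise are introduced here for illustration.

```r
library(dbscan)

# Recreate the data and the clustering from the steps above
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)
dbscan_result <- dbscan(data, eps = 0.5, minPts = 5)

# Count how many points received each label (0 = noise)
label_counts <- table(dbscan_result$cluster)
label_counts

# Number of clusters found, excluding the noise label
n_clusters <- length(setdiff(unique(dbscan_result$cluster), 0))
n_clusters

# Number of points labeled as noise
n_noise <- sum(dbscan_result$cluster == 0)
n_noise
```

If most points end up with the label 0, the chosen eps is probably too small (or minPts too large) for the scale of the data.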
Finally, we'll visualize the results of DBSCAN with the ggplot2 package as a scatter plot in which each point is colored by the cluster it was assigned to:
# Plot the results using ggplot2
plot <- ggplot(result_df, aes(x = X1, y = X2, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "DBSCAN Clustering in R",
       x = "Variable 1", y = "Variable 2") +
  theme_minimal() +
  scale_color_discrete(name = "Cluster")
# Display the plot
plot
The line-by-line explanation of the code above is given below:
Line 2: Here, we take result_df, the data frame that holds the DBSCAN result, and plot it with the x-axis set to the first column of the dataset and the y-axis set to the second column of the dataset (inside aes). Each point is colored by its cluster label.
Line 3: We set the size of each point to 3 via geom_point(size = 3) for greater legibility.
Lines 4-5: We set the title of the graph to DBSCAN Clustering in R and the axis labels to Variable 1 for the x-axis and Variable 2 for the y-axis.
Line 6: Here, the theme of the plot is set to a minimal theme, which removes the panel background, the border, and other non-essential elements.
Line 7: This sets the color scale for the cluster variable and names the color legend Cluster.
Overall, we learned how to run the DBSCAN algorithm on a random dataset and how to visualize the resulting clusters with a scatter plot. The parameters eps and minPts, as well as the styling of the plot, can be adjusted as needed for our specific dataset and preferences.
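One simple way to explore the effect of adjusting the parameters is to rerun the clustering for a few candidate radii and compare the outcomes. The sketch below assumes the dbscan package is installed. The values in eps_values are arbitrary examples, not recommendations, and comparison is a name introduced here for illustration.

```r
library(dbscan)

# Recreate the sample data used in this Answer
set.seed(123)
data <- matrix(rnorm(100), ncol = 2)

# Hypothetical set of radii to compare; pick values suited to your data's scale
eps_values <- c(0.3, 0.5, 0.8)

# For each eps, record the number of clusters and noise points (label 0)
comparison <- do.call(rbind, lapply(eps_values, function(e) {
  labels <- dbscan(data, eps = e, minPts = 5)$cluster
  data.frame(
    eps = e,
    clusters = length(setdiff(unique(labels), 0)),
    noise_points = sum(labels == 0)
  )
}))

comparison
```

With minPts held fixed, a larger eps generally leaves fewer points labeled as noise and tends to merge neighboring clusters, so a table like this can guide the choice before plotting.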