

Customer segmentation is a cornerstone of modern business strategy. By dividing a diverse customer base into distinct, homogeneous groups, companies can tailor their marketing efforts, product development, and customer service to maximize impact and efficiency. While traditional methods like K-Means are popular, they often struggle with the messy reality of real-world data, particularly when customer segments exhibit varying densities or contain significant outliers. Density-based methods like DBSCAN and HDBSCAN offer powerful alternatives.
In this comprehensive guide, we’ll walk through an end-to-end customer segmentation project using the popular Kaggle Mall Customer dataset, applying three prominent clustering algorithms: K-Means, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).
Dataset Overview: Mall Customer Dataset
The Mall Customer Dataset contains 200 customer entries with the following attributes:
- CustomerID: Unique identifier for each customer
- Genre: Gender of the customer
- Age: Age of the customer
- Annual Income (k$): Annual income in thousands of dollars
- Spending Score (1–100): A score assigned by the mall, reflecting customer behavior and spending patterns
Our primary objective is to identify distinct customer groups based on Age, Annual Income, and Spending Score.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neighbors import NearestNeighbors
import hdbscan
from kneed import KneeLocator
import warnings
import matplotlib.cm as cm

warnings.filterwarnings('ignore')

# Load and examine the data
df = pd.read_csv('Mall_Customers.csv')
print("Original Data Head:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nMissing Values:")
print(df.isnull().sum())

Initial Findings: The dataset is clean with no missing values, making it ideal for clustering analysis.
Feature Scaling: Essential for Distance-Based Clustering
Clustering algorithms that rely on distance metrics (K-Means, DBSCAN, and HDBSCAN) are highly sensitive to feature scales. Features with larger numerical ranges can disproportionately influence distance calculations. To ensure equal contribution from each feature, we standardize our numerical features.
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = df[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("\nScaled Data Sample (first 5 rows):")
print(pd.DataFrame(X_scaled, columns=features).head())

K-Means Clustering: The Centroid-Based Approach
K-Means partitions data into k clusters by iteratively assigning points to the nearest centroid and updating centroids based on cluster means.
Key Concepts:
- Centroid: The center of a cluster, calculated as the mean of all points in that cluster
- k: The number of clusters (must be specified beforehand)
- init='k-means++': Intelligent initialization strategy for better convergence
Determining Optimal k: The Elbow Method
Since K-Means requires pre-specifying k, we use the Elbow Method to find the optimal number of clusters by plotting Within-Cluster Sum of Squares (WCSS) against k values.
# Determine the optimal number of clusters using the Elbow Method
wcss = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss, marker='o', linestyle='--')
plt.title('K-Means Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (Within-Cluster Sum of Squares)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Interpretation: The elbow typically occurs around k=5 for this dataset, indicating five distinct customer segments.
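Since kneed is already imported, we can also confirm the elbow programmatically instead of reading it off the plot. A minimal sketch, reusing the k_range and wcss variables from the elbow code above:

# Locate the elbow on the WCSS curve programmatically
# (the WCSS-vs-k curve is convex and decreasing)
elbow = KneeLocator(list(k_range), wcss, curve='convex', direction='decreasing')
print(f"Elbow detected at k = {elbow.elbow}")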
K-Means Implementation and Evaluation
# Perform K-Means with the optimal number of clusters
optimal_kmeans_clusters = 5
kmeans_model = KMeans(n_clusters=optimal_kmeans_clusters, init='k-means++', random_state=42, n_init=10)
kmeans_clusters = kmeans_model.fit_predict(X_scaled)

# Evaluate clustering quality
kmeans_silhouette_avg = silhouette_score(X_scaled, kmeans_clusters)
print(f"K-Means Silhouette Score (k={optimal_kmeans_clusters}): {kmeans_silhouette_avg:.4f}")

# Add cluster labels to the original data
df['KMeans_Cluster'] = kmeans_clusters

# Calculate cluster characteristics
kmeans_cluster_means = df.groupby('KMeans_Cluster')[features].mean()
print("\nK-Means Cluster Characteristics:")
print(kmeans_cluster_means)
print(f"\nCluster Distribution:\n{df['KMeans_Cluster'].value_counts()}")

Results: The Silhouette Score of 0.4166 indicates moderate but meaningful cluster separation.
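The silhouette_samples import from the setup code lets us go a step further and inspect per-cluster silhouette averages, which can reveal weak clusters hiding behind a decent overall score. A short sketch under that assumption:

# Per-cluster silhouette averages for the K-Means solution
sample_silhouettes = silhouette_samples(X_scaled, kmeans_clusters)
for c in range(optimal_kmeans_clusters):
    cluster_vals = sample_silhouettes[kmeans_clusters == c]
    print(f"Cluster {c}: mean silhouette = {cluster_vals.mean():.4f} (n = {len(cluster_vals)})")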
K-Means Visualizations
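The cluster plots can be reproduced along these lines; this is a minimal sketch (the original notebook may style the figures differently), projecting the segments onto the Annual Income vs. Spending Score plane:

# Scatter plot of the K-Means segments with centroids mapped back to the original scale
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='KMeans_Cluster', palette='tab10')
centroids = scaler.inverse_transform(kmeans_model.cluster_centers_)
plt.scatter(centroids[:, 1], centroids[:, 2], s=200, c='black', marker='X', label='Centroids')
plt.title('K-Means Clusters: Annual Income vs. Spending Score')
plt.legend()
plt.show()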
K-Means Customer Segmentation Analysis
Based on comprehensive analysis of age, income, and spending behavior patterns, the K-means clustering algorithm identified five distinct customer segments with clear behavioral and demographic characteristics.
Cluster Profiles
Cluster 0 — “Financially Constrained Middle-Aged” Age: 46.2 years | Income: $26.8k | Spending Score: 18.3
This segment represents middle-aged customers facing significant financial constraints. With the lowest income level and minimal spending scores, these individuals likely include those experiencing economic hardship, working in lower-wage positions, or living on fixed incomes. Their purchasing behavior is driven primarily by necessity rather than discretionary wants, focusing on essential goods and services while avoiding non-essential expenditures.
Cluster 1 — “Young Lifestyle Enthusiasts” Age: 25.2 years | Income: $41.1k | Spending Score: 62.2
Young adults who demonstrate high spending propensity despite moderate income levels characterize this segment. These early-career professionals and young consumers prioritize lifestyle, experiences, and brand-conscious purchases. They represent a paradoxical spending pattern where discretionary purchases take precedence over traditional financial prudence, likely influenced by social media, peer pressure, and investment in personal image and experiences.
Cluster 2 — “Affluent High-Performers” Age: 32.9 years | Income: $86.1k | Spending Score: 81.5
This segment consists of successful young professionals in their prime earning years who combine high income with equally high spending behavior. These individuals have achieved significant career success early and possess both substantial disposable income and the confidence to spend freely. They represent ideal customers for premium products, luxury services, and high-value discretionary purchases across multiple categories.
Cluster 3 — “Disciplined High Earners” Age: 39.9 years | Income: $86.1k | Spending Score: 19.4
Despite having identical high income levels to Cluster 2, this segment demonstrates remarkably conservative spending patterns. These middle-aged high earners prioritize financial security, long-term planning, and wealth accumulation over immediate consumption. Their restrained spending likely reflects responsibilities such as mortgage payments, children’s education savings, retirement planning, or debt reduction strategies.
Cluster 4 — “Mature Pragmatic Consumers” Age: 55.6 years | Income: $54.4k | Spending Score: 48.9
The oldest demographic segment displays balanced, measured consumption patterns that reflect mature financial decision-making. With moderate income and spending levels, these consumers demonstrate practical purchasing behavior that balances current lifestyle needs with long-term financial stability. Their spending patterns likely reflect established preferences, quality-focused purchases, and consideration of approaching retirement.
Strategic Implications
These five clusters effectively segment the customer base along three critical dimensions: life stage progression, financial capacity, and spending behavioral patterns. Each segment represents distinct consumer archetypes with unique motivations, constraints, and purchasing triggers, providing valuable insights for targeted marketing strategies, product development, and customer relationship management initiatives.
The segmentation reveals that spending behavior cannot be predicted solely by income level, as evidenced by the stark contrast between Clusters 2 and 3, both high earners with completely different spending philosophies. Age and life stage considerations play equally important roles in determining consumer behavior patterns.
DBSCAN: Density-Based Clustering with Noise Detection
DBSCAN identifies clusters based on density and can discover arbitrarily shaped clusters while automatically detecting outliers.
Key Parameters:
- eps (ε): Maximum distance between two samples to be considered neighbors
- min_samples: Minimum number of samples required to form a dense region
Point Classifications:
- Core Point: Has at least min_samples within eps radius
- Border Point: Within eps of a core point but not core itself
- Noise Point: Neither a core nor a border point (the sketch after this list shows how to count each type)
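scikit-learn exposes these classifications on a fitted model: core points via core_sample_indices_, noise as the -1 label, and border points as everything else. A quick illustrative sketch (eps=0.5 here is an arbitrary value, not the tuned epsilon we derive next):

# Count core / border / noise points for an illustrative eps
db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = db.labels_
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True  # core points
noise_mask = labels == -1                  # noise points
border_mask = ~core_mask & ~noise_mask     # in a cluster, but not core
print(f"Core: {core_mask.sum()}, Border: {border_mask.sum()}, Noise: {noise_mask.sum()}")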
Determining Optimal eps: The Knee Method
# Find the optimal eps using the k-distance graph
min_samples = 5
neigh = NearestNeighbors(n_neighbors=min_samples)
nbrs = neigh.fit(X_scaled)
distances, indices = nbrs.kneighbors(X_scaled)

# Sort distances to the k-th nearest neighbor
distances = np.sort(distances[:, min_samples - 1], axis=0)

plt.figure(figsize=(10, 6))
plt.plot(distances)
plt.xlabel('Points ordered by distance')
plt.ylabel(f'Distance to {min_samples}th nearest neighbor')
plt.title('Knee Method for Optimal Epsilon (DBSCAN)')

# Find the knee point
knee = KneeLocator(range(len(distances)), distances, curve='convex', direction='increasing')
plt.axvline(x=knee.knee, color='r', linestyle='--', label=f'Knee at index {knee.knee:.0f}')
plt.legend()
plt.show()

eps_knee = distances[knee.knee]
print(f"Optimal Epsilon: {eps_knee:.4f}")
The knee method suggests the optimal epsilon for min_samples = 5.

Initial DBSCAN Clustering with Knee Method Epsilon
Let’s apply DBSCAN with these parameters and observe the initial results.
# Perform DBSCAN clustering with the Knee Method suggested parameters
dbscan_knee = DBSCAN(eps=eps_knee, min_samples=5)
clusters_knee = dbscan_knee.fit_predict(X_scaled)

# Add the cluster labels to the original data
df['DBSCAN_Knee_Cluster'] = clusters_knee

# Calculate the mean values for each cluster
dbscan_knee_cluster_means = df.groupby('DBSCAN_Knee_Cluster')[features].mean()
print("\nDBSCAN Cluster Means (Knee Method based):")
print(dbscan_knee_cluster_means)
print(f"\nDBSCAN Knee Method Cluster Distribution:\n{df['DBSCAN_Knee_Cluster'].value_counts()}")

# Calculate the Silhouette Score for the Knee Method DBSCAN (excluding noise)
X_dbscan_knee_clustered = X_scaled[clusters_knee != -1]
labels_dbscan_knee_clustered = clusters_knee[clusters_knee != -1]
dbscan_knee_silhouette_score = np.nan
if len(np.unique(labels_dbscan_knee_clustered)) >= 2:
    dbscan_knee_silhouette_score = silhouette_score(X_dbscan_knee_clustered, labels_dbscan_knee_clustered)
    print(f"Silhouette Score for DBSCAN (Knee Method, excluding noise): {dbscan_knee_silhouette_score:.4f}")
else:
    print("Cannot calculate Silhouette Score for DBSCAN (Knee Method): fewer than 2 valid clusters found (excluding noise).")
Interestingly, with the knee-method epsilon, DBSCAN identifies only a single cluster plus a noise cluster (-1): 14 points are flagged as noise, and all remaining points land in cluster 0.
DBSCAN Hyperparameter Optimization
# Grid search for optimal DBSCAN parameters
eps_values = np.linspace(0.3, 1.0, num=15)
min_samples_values = range(3, 15)

best_score = -1
best_params = {}
results = []

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        clusters = dbscan.fit_predict(X_scaled)

        # Calculate silhouette score (excluding noise)
        valid_clusters = clusters[clusters != -1]
        if len(np.unique(valid_clusters)) >= 2:
            score = silhouette_score(X_scaled[clusters != -1], valid_clusters)
            results.append({
                'eps': eps,
                'min_samples': min_samples,
                'silhouette_score': score,
                'num_clusters': len(np.unique(valid_clusters)),
                'noise_points': np.sum(clusters == -1)
            })
            if score > best_score:
                best_score = score
                best_params = {'eps': eps, 'min_samples': min_samples}

print(f"Best DBSCAN Parameters: {best_params}")
print(f"Best Silhouette Score: {best_score:.4f}")


Optimal Results: eps=0.35, min_samples=8, Silhouette Score=0.7970
With these parameters, DBSCAN identifies two clusters plus a noise cluster; the noise cluster contains 175 of the 200 points. The per-cluster means are analyzed below.
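Applying the tuned parameters mirrors the knee-method step above; a minimal sketch (the DBSCAN_Best_Cluster column name is our own):

# Apply DBSCAN with the grid-search parameters
dbscan_best = DBSCAN(eps=0.35, min_samples=8)
clusters_best = dbscan_best.fit_predict(X_scaled)
df['DBSCAN_Best_Cluster'] = clusters_best  # hypothetical column name

print(df.groupby('DBSCAN_Best_Cluster')[features].mean())
print(df['DBSCAN_Best_Cluster'].value_counts())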
Looking at the DBSCAN clustering results, we can see a clear segmentation of the data into two distinct groups plus noise points. Here is the analysis:
Cluster Characteristics
Cluster -1 (Noise/Outliers)
- Age: 39.3 years (middle-aged)
- Income: $60.8k (moderate income)
- Spending Score: 50.4 (moderate spending)
- Distribution: Very wide age range (18–70) with high variability
- Interpretation: These are outlier customers who don’t fit the main behavioral patterns
Cluster 0 (Middle-aged Conservatives)
- Age: 48.5 years (older demographic)
- Income: $58.3k (moderate income)
- Spending Score: 46.5 (below-average spending)
- Distribution: Tightly clustered around age 47–50
- Interpretation: Established, financially conservative customers who spend cautiously despite having decent income
Cluster 1 (Young High-Spenders)
- Age: 21.5 years (young demographic)
- Income: $60.1k (moderate income)
- Spending Score: 51.2 (above-average spending)
- Distribution: Very tight age clustering around 18–25
- Interpretation: Young customers with relatively high disposable income who are willing to spend
Key Insights
- Age is the Primary Differentiator: The clusters are primarily separated by age rather than income. Both main clusters have similar income levels (~$58–60k) but very different ages and spending behaviors.
- Spending vs. Age Relationship: There’s a clear inverse relationship — younger customers (Cluster 1) spend more despite similar incomes, while older customers (Cluster 0) are more conservative spenders.
- Income Consistency: Interestingly, all clusters have similar income levels, suggesting that spending behavior is more influenced by life stage (age) than earning capacity.
Business Implications
- Cluster 0: Target with value-focused, practical products and conservative marketing
- Cluster 1: Target with trendy, lifestyle products and bold marketing campaigns
- Noise Points: Require individual analysis or specialized micro-segmentation strategies
The clustering successfully identified two distinct customer personas based primarily on age-driven spending behaviors rather than income levels.
HDBSCAN: Hierarchical Density-Based Clustering
HDBSCAN extends DBSCAN by building a hierarchy of clusters and selecting the most stable ones, making it robust to varying cluster densities.
Key Parameters:
- min_cluster_size: Minimum size for a cluster
- min_samples: Controls how conservative the clustering is (defaults to min_cluster_size); see the sketch after this list
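Because HDBSCAN selects clusters by stability, a fitted model exposes diagnostics that DBSCAN lacks. A brief sketch of the attributes we find most useful (min_cluster_size=10 is used here purely for illustration):

# Fit HDBSCAN and inspect its stability-related diagnostics
model = hdbscan.HDBSCAN(min_cluster_size=10)
labels = model.fit_predict(X_scaled)
print(model.cluster_persistence_)  # stability score per cluster (higher = more stable)
print(model.probabilities_[:5])    # strength of each point's cluster membership
print(model.outlier_scores_[:5])   # GLOSH outlier scores per point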
HDBSCAN Hyperparameter Tuning
# HDBSCAN parameter optimization
param_grid = {
    'min_cluster_size': [5, 10, 15, 20],
    'min_samples': [None, 3, 5, 10]
}

best_score = -1
best_params = {}
best_clusters_labels = None
results = []

for min_cluster_size in param_grid['min_cluster_size']:
    for min_samples in param_grid['min_samples']:
        hdbscan_model = hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
            min_samples=min_samples,
            prediction_data=True
        )
        clusters = hdbscan_model.fit_predict(X_scaled)

        valid_clusters = clusters[clusters != -1]
        if len(np.unique(valid_clusters)) >= 2:
            score = silhouette_score(X_scaled[clusters != -1], valid_clusters)
            results.append({
                'min_cluster_size': min_cluster_size,
                'min_samples': min_samples,
                'silhouette_score': score,
                'num_clusters': len(np.unique(valid_clusters))
            })
            if score > best_score:
                best_score = score
                best_params = {'min_cluster_size': min_cluster_size, 'min_samples': min_samples}
                best_clusters_labels = clusters  # keep the winning labels for later

print(f"Best HDBSCAN Parameters: {best_params}")
print(f"Best Silhouette Score: {best_score:.4f}")
Optimal Results: min_cluster_size=10, min_samples=None, Silhouette Score=0.6319.
Applying the Best HDBSCAN Model and Initial Cluster Overview
# --- Apply HDBSCAN with Best Parameters and Get Final Clusters ---
if best_clusters_labels is not None:
    final_hdbscan_clusters = best_clusters_labels
else:
    # Fallback: re-run the model with best_params if the winning labels were not captured
    print("\n--- Re-running HDBSCAN with best parameters to get final clusters (fallback) ---")
    final_hdbscan_model = hdbscan.HDBSCAN(
        min_cluster_size=best_params['min_cluster_size'],
        min_samples=best_params['min_samples'],  # None defaults to min_cluster_size
        prediction_data=True
    )
    final_hdbscan_clusters = final_hdbscan_model.fit_predict(X_scaled)

# Add cluster labels to the original DataFrame
df['HDBSCAN_Cluster'] = final_hdbscan_clusters
print(f"\nNumber of HDBSCAN clusters found (including noise -1): {len(np.unique(final_hdbscan_clusters))}")
print(f"Number of HDBSCAN noise points (-1 label): {np.sum(final_hdbscan_clusters == -1)}")
print(f"HDBSCAN Cluster distribution:\n{df['HDBSCAN_Cluster'].value_counts()}")

# --- Explicitly calculate the Silhouette Score for the FINAL HDBSCAN clusters ---
X_hdbscan_clustered = X_scaled[final_hdbscan_clusters != -1]
labels_hdbscan_clustered = final_hdbscan_clusters[final_hdbscan_clusters != -1]
final_hdbscan_silhouette_score = np.nan
if len(np.unique(labels_hdbscan_clustered)) >= 2:
    final_hdbscan_silhouette_score = silhouette_score(X_hdbscan_clustered, labels_hdbscan_clustered)
    print(f"\nSilhouette Score for the FINAL HDBSCAN clusters: {final_hdbscan_silhouette_score:.4f}")
else:
    print("\nCannot calculate Silhouette Score for final HDBSCAN clusters: fewer than 2 valid clusters found (excluding noise).")
Here we observe three distinct clusters (0, 1, and 2) plus a sizable -1 noise cluster. The final Silhouette Score (0.6319) matches the best score from the grid search, confirming consistency.
From the HDBSCAN clustering analysis, we can see three distinct clusters (plus a noise group) identified based on age, annual income, and spending score. Let's break down what each cluster represents:
Cluster Characteristics
Cluster 0 (Blue) — “High Spenders”
- Age: ~33 years (young professionals)
- Income: ~$79k (highest income)
- Spending Score: ~81 (highest spending)
- This represents young, high-income individuals with very high spending behavior
Cluster 1 (Orange) — “Young Moderate Spenders”
- Age: ~23 years (youngest group)
- Income: ~$59k (moderate income)
- Spending Score: ~50 (moderate spending)
- Young adults with moderate income and spending patterns
Cluster 2 (Green) — “Mature Conservative Spenders”
- Age: ~52 years (oldest group)
- Income: ~$54k (moderate income)
- Spending Score: ~48 (moderate-low spending)
- Older individuals with moderate income but lower spending
Noise Points (Gray):
The gray points scattered throughout are outliers that don’t fit the main patterns — they could be:
- Customers with unusual combinations of age/income/spending
- Data entry errors
- Genuinely unique cases that don’t follow typical patterns
Key Insights
- Age-Spending Relationship: There’s a clear inverse relationship between age and spending score. Younger customers tend to spend more freely.
- Three Distinct Customer Segments Identified:
- Cluster 0: Premium customers (high income, high spending) — most valuable segment
- Cluster 1: Young moderate spenders — potential growth segment with room for increased engagement
- Cluster 2: Mature conservative spenders — underutilizing their purchasing power despite reasonable income
Target Segments for Marketing:
- Cluster 0 represents your premium customers requiring retention strategies
- Clusters 1 and 2 could be targeted for spending increase campaigns
Cluster Quality: The scatter plot shows well-separated clusters with clear boundaries, indicating HDBSCAN successfully identified meaningful customer segments. The presence of noise points (gray) suggests either outliers with unique behaviors or potential data quality issues worth investigating.
This segmentation provides a solid foundation for targeted marketing strategies, with different approaches needed for each distinct customer group based on their age, income, and spending behaviors.
Conclusion
This comprehensive comparison demonstrates that algorithm choice significantly impacts segmentation outcomes. While HDBSCAN achieved the highest silhouette score, K-Means provided the most interpretable and actionable customer segments for business applications. The key is aligning algorithmic strengths with business objectives and data characteristics.
For customer segmentation projects, we recommend starting with K-Means for baseline insights, then exploring density-based methods when dealing with complex customer behaviors or significant outlier populations.
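To close the loop, the scores reported in each section can be gathered into one comparison table; a small convenience sketch using the numbers from above:

# Side-by-side comparison of the silhouette scores reported in this article
comparison = pd.DataFrame({
    'Algorithm': ['K-Means (k=5)', 'DBSCAN (eps=0.35, min_samples=8)', 'HDBSCAN (min_cluster_size=10)'],
    'Silhouette Score': [0.4166, 0.7970, 0.6319],
    'Notes': ['5 segments, no noise handling',
              '2 clusters, 175 noise points (score excludes noise)',
              '3 clusters plus noise (score excludes noise)']
})
print(comparison.to_string(index=False))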
Github link: https://github.com/rumsinha/Clustering