

Customer segmentation is a cornerstone of modern business strategy. By dividing a diverse customer base into distinct, homogeneous groups, companies can tailor their marketing efforts, product development, and customer service to maximize impact and efficiency. While traditional methods like K-Means are popular, they often struggle with the messy reality of real-world data, particularly when customer segments exhibit varying densities or contain significant outliers. Density-based methods like DBSCAN and HDBSCAN offer powerful alternatives.
In this comprehensive guide, we’ll walk through an end-to-end customer segmentation project using the popular Kaggle Mall Customer dataset, applying three prominent clustering algorithms: K-Means, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).
Dataset Overview: Mall Customer Dataset
The Mall Customer Dataset contains 200 customer entries with the following attributes:
- CustomerID: Unique identifier for each customer
- Genre: Gender of the customer
- Age: Age of the customer
- Annual Income (k$): Annual income in thousands of dollars
- Spending Score (1–100): A score assigned by the mall, reflecting customer behavior and spending patterns
Our primary objective is to identify distinct customer groups based on Age, Annual Income, and Spending Score.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neighbors import NearestNeighbors
import hdbscan
from kneed import KneeLocator
import warnings
import matplotlib.cm as cm

warnings.filterwarnings('ignore')

# Load and examine the data
df = pd.read_csv('Mall_Customers.csv')
print("Original Data Head:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nMissing Values:")
print(df.isnull().sum())

Initial Findings: The dataset is clean with no missing values, making it ideal for clustering analysis.
Feature Scaling: Essential for Distance-Based Clustering
Clustering algorithms that rely on distance metrics (K-Means, DBSCAN, and HDBSCAN) are highly sensitive to feature scales. Features with larger numerical ranges can disproportionately influence distance calculations. To ensure equal contribution from each feature, we standardize our numerical features.
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = df[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("\nScaled Data Sample (first 5 rows):")
print(pd.DataFrame(X_scaled, columns=features).head())

K-Means Clustering: The Centroid-Based Approach
K-Means partitions data into k clusters by iteratively assigning points to the nearest centroid and updating centroids based on cluster means.
Key Concepts:
- Centroid: The center of a cluster, calculated as the mean of all points in that cluster
- k: The number of clusters (must be specified beforehand)
- init='k-means++': Intelligent initialization strategy for better convergence
Determining Optimal k: The Elbow Method
Since K-Means requires pre-specifying k, we use the Elbow Method to find the optimal number of clusters by plotting Within-Cluster Sum of Squares (WCSS) against k values.
# Determine the optimal number of clusters using the Elbow Method
wcss = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss, marker='o', linestyle='--')
plt.title('K-Means Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (Within-Cluster Sum of Squares)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Interpretation: The elbow typically occurs around k=5 for this dataset, indicating five distinct customer segments.
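Since kneed is already imported, we can also confirm the elbow programmatically instead of reading it off the plot. A minimal sketch, reusing the k_range and wcss variables from the elbow code above:

# Locate the elbow on the WCSS curve programmatically
# (the WCSS-vs-k curve is convex and decreasing)
elbow = KneeLocator(list(k_range), wcss, curve='convex', direction='decreasing')
print(f"Elbow detected at k = {elbow.elbow}")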
K-Means Implementation and Evaluation
# Perform K-Means with the optimal number of clusters
optimal_kmeans_clusters = 5
kmeans_model = KMeans(n_clusters=optimal_kmeans_clusters, init='k-means++', random_state=42, n_init=10)
kmeans_clusters = kmeans_model.fit_predict(X_scaled)

# Evaluate clustering quality
kmeans_silhouette_avg = silhouette_score(X_scaled, kmeans_clusters)
print(f"K-Means Silhouette Score (k={optimal_kmeans_clusters}): {kmeans_silhouette_avg:.4f}")

# Add cluster labels to the original data
df['KMeans_Cluster'] = kmeans_clusters

# Calculate cluster characteristics
kmeans_cluster_means = df.groupby('KMeans_Cluster')[features].mean()
print("\nK-Means Cluster Characteristics:")
print(kmeans_cluster_means)
print(f"\nCluster Distribution:\n{df['KMeans_Cluster'].value_counts()}")

Results: The Silhouette Score of 0.4166 indicates moderate but meaningful cluster separation.
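The silhouette_samples import from the setup code lets us go a step further and inspect per-cluster silhouette averages, which can reveal weak clusters hiding behind a decent overall score. A short sketch under that assumption:

# Per-cluster silhouette averages for the K-Means solution
sample_silhouettes = silhouette_samples(X_scaled, kmeans_clusters)
for c in range(optimal_kmeans_clusters):
    cluster_vals = sample_silhouettes[kmeans_clusters == c]
    print(f"Cluster {c}: mean silhouette = {cluster_vals.mean():.4f} (n = {len(cluster_vals)})")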
K-Means Visualizations
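The cluster plots can be reproduced along these lines; this is a minimal sketch (the original notebook may style the figures differently), projecting the segments onto the Annual Income vs. Spending Score plane:

# Scatter plot of the K-Means segments with centroids mapped back to the original scale
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='KMeans_Cluster', palette='tab10')
centroids = scaler.inverse_transform(kmeans_model.cluster_centers_)
plt.scatter(centroids[:, 1], centroids[:, 2], s=200, c='black', marker='X', label='Centroids')
plt.title('K-Means Clusters: Annual Income vs. Spending Score')
plt.legend()
plt.show()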
K-Means Customer Segmentation Analysis
Based on comprehensive analysis of age, income, and spending behavior patterns, the K-means clustering algorithm identified five distinct customer segments with clear behavioral and demographic characteristics.
Cluster Profiles
Cluster 0 — “Financially Constrained Middle-Aged” Age: 46.2 years | Income: $26.8k | Spending Score: 18.3
This segment represents middle-aged customers facing significant financial constraints. With the lowest income level and minimal spending scores, these individuals likely include those experiencing economic hardship, working in lower-wage positions, or living on fixed incomes. Their purchasing behavior is driven primarily by necessity rather than discretionary wants, focusing on essential goods and services while avoiding non-essential expenditures.
Cluster 1 — “Young Lifestyle Enthusiasts” Age: 25.2 years | Income: $41.1k | Spending Score: 62.2
Young adults who demonstrate high spending propensity despite moderate income levels characterize this segment. These early-career professionals and young consumers prioritize lifestyle, experiences, and brand-conscious purchases. They represent a paradoxical spending pattern where discretionary purchases take precedence over traditional financial prudence, likely influenced by social media, peer pressure, and investment in personal image and experiences.
Cluster 2 — “Affluent High-Performers” Age: 32.9 years | Income: $86.1k | Spending Score: 81.5
This segment consists of successful young professionals in their prime earning years who combine high income with equally high spending behavior. These individuals have achieved significant career success early and possess both substantial disposable income and the confidence to spend freely. They represent ideal customers for premium products, luxury services, and high-value discretionary purchases across multiple categories.
Cluster 3 — “Disciplined High Earners” Age: 39.9 years | Income: $86.1k | Spending Score: 19.4
Despite having identical high income levels to Cluster 2, this segment demonstrates remarkably conservative spending patterns. These middle-aged high earners prioritize financial security, long-term planning, and wealth accumulation over immediate consumption. Their restrained spending likely reflects responsibilities such as mortgage payments, children’s education savings, retirement planning, or debt reduction strategies.
Cluster 4 — “Mature Pragmatic Consumers” Age: 55.6 years | Income: $54.4k | Spending Score: 48.9
The oldest demographic segment displays balanced, measured consumption patterns that reflect mature financial decision-making. With moderate income and spending levels, these consumers demonstrate practical purchasing behavior that balances current lifestyle needs with long-term financial stability. Their spending patterns likely reflect established preferences, quality-focused purchases, and consideration of approaching retirement.
Strategic Implications
These five clusters effectively segment the customer base along three critical dimensions: life stage progression, financial capacity, and spending behavioral patterns. Each segment represents distinct consumer archetypes with unique motivations, constraints, and purchasing triggers, providing valuable insights for targeted marketing strategies, product development, and customer relationship management initiatives.
The segmentation reveals that spending behavior cannot be predicted solely by income level, as evidenced by the stark contrast between Clusters 2 and 3, both high earners with completely different spending philosophies. Age and life stage considerations play equally important roles in determining consumer behavior patterns.
DBSCAN: Density-Based Clustering with Noise Detection
DBSCAN identifies clusters based on density and can discover arbitrarily shaped clusters while automatically detecting outliers.
Key Parameters:
- eps (ε): Maximum distance between two samples to be considered neighbors
- min_samples: Minimum number of samples required to form a dense region
Point Classifications:
- Core Point: Has at least min_samples within eps radius
- Border Point: Within eps of a core point but not core itself
- Noise Point: Neither a core nor a border point (the sketch after this list shows how to count each type)
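scikit-learn exposes these classifications on a fitted model: core points via core_sample_indices_, noise as the -1 label, and border points as everything else. A quick illustrative sketch (eps=0.5 here is an arbitrary value, not the tuned epsilon we derive next):

# Count core / border / noise points for an illustrative eps
db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = db.labels_
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True  # core points
noise_mask = labels == -1                  # noise points
border_mask = ~core_mask & ~noise_mask     # in a cluster, but not core
print(f"Core: {core_mask.sum()}, Border: {border_mask.sum()}, Noise: {noise_mask.sum()}")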
Determining Optimal eps: The Knee Method
# Find the optimal eps using the k-distance graph
min_samples = 5
neigh = NearestNeighbors(n_neighbors=min_samples)
nbrs = neigh.fit(X_scaled)
distances, indices = nbrs.kneighbors(X_scaled)

# Sort distances to the k-th nearest neighbor
distances = np.sort(distances[:, min_samples - 1], axis=0)

plt.figure(figsize=(10, 6))
plt.plot(distances)
plt.xlabel('Points ordered by distance')
plt.ylabel(f'Distance to {min_samples}th nearest neighbor')
plt.title('Knee Method for Optimal Epsilon (DBSCAN)')

# Find the knee point
knee = KneeLocator(range(len(distances)), distances, curve='convex', direction='increasing')
plt.axvline(x=knee.knee, color='r', linestyle='--', label=f'Knee at index {knee.knee:.0f}')
plt.legend()
plt.show()

eps_knee = distances[knee.knee]
print(f"Optimal Epsilon: {eps_knee:.4f}")
The knee method suggests the optimal epsilon for min_samples = 5.

Initial DBSCAN Clustering with Knee Method Epsilon
Let’s apply DBSCAN with these parameters and observe the initial results.
# Perform DBSCAN clustering with the Knee Method suggested parameters
dbscan_knee = DBSCAN(eps=eps_knee, min_samples=5)
clusters_knee = dbscan_knee.fit_predict(X_scaled)

# Add the cluster labels to the original data
df['DBSCAN_Knee_Cluster'] = clusters_knee

# Calculate the mean values for each cluster
dbscan_knee_cluster_means = df.groupby('DBSCAN_Knee_Cluster')[features].mean()
print("\nDBSCAN Cluster Means (Knee Method based):")
print(dbscan_knee_cluster_means)
print(f"\nDBSCAN Knee Method Cluster Distribution:\n{df['DBSCAN_Knee_Cluster'].value_counts()}")

# Calculate the Silhouette Score for the Knee Method DBSCAN (excluding noise)
X_dbscan_knee_clustered = X_scaled[clusters_knee != -1]
labels_dbscan_knee_clustered = clusters_knee[clusters_knee != -1]
dbscan_knee_silhouette_score = np.nan
if len(np.unique(labels_dbscan_knee_clustered)) >= 2:
    dbscan_knee_silhouette_score = silhouette_score(X_dbscan_knee_clustered, labels_dbscan_knee_clustered)
    print(f"Silhouette Score for DBSCAN (Knee Method, excluding noise): {dbscan_knee_silhouette_score:.4f}")
else:
    print("Cannot calculate Silhouette Score for DBSCAN (Knee Method): fewer than 2 valid clusters found (excluding noise).")
Interestingly, with the knee-method epsilon, DBSCAN identifies only a single cluster plus a noise cluster (-1): 14 points are flagged as noise, and all remaining points land in cluster 0.
DBSCAN Hyperparameter Optimization
# Grid search for optimal DBSCAN parameters
eps_values = np.linspace(0.3, 1.0, num=15)
min_samples_values = range(3, 15)

best_score = -1
best_params = {}
results = []

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        clusters = dbscan.fit_predict(X_scaled)

        # Calculate silhouette score (excluding noise)
        valid_clusters = clusters[clusters != -1]
        if len(np.unique(valid_clusters)) >= 2:
            score = silhouette_score(X_scaled[clusters != -1], valid_clusters)
            results.append({
                'eps': eps,
                'min_samples': min_samples,
                'silhouette_score': score,
                'num_clusters': len(np.unique(valid_clusters)),
                'noise_points': np.sum(clusters == -1)
            })
            if score > best_score:
                best_score = score
                best_params = {'eps': eps, 'min_samples': min_samples}

print(f"Best DBSCAN Parameters: {best_params}")
print(f"Best Silhouette Score: {best_score:.4f}")


Optimal Results: eps=0.35, min_samples=8, Silhouette Score=0.7970
With these parameters, DBSCAN identifies two clusters plus a noise cluster; the noise cluster contains 175 of the 200 points. The per-cluster means are analyzed below.
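Applying the tuned parameters mirrors the knee-method step above; a minimal sketch (the DBSCAN_Best_Cluster column name is our own):

# Apply DBSCAN with the grid-search parameters
dbscan_best = DBSCAN(eps=0.35, min_samples=8)
clusters_best = dbscan_best.fit_predict(X_scaled)
df['DBSCAN_Best_Cluster'] = clusters_best  # hypothetical column name

print(df.groupby('DBSCAN_Best_Cluster')[features].mean())
print(df['DBSCAN_Best_Cluster'].value_counts())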
Looking at the DBSCAN clustering results, we can see a clear segmentation of the data into two distinct groups plus noise points. Here is the analysis:
Cluster Characteristics
Cluster -1 (Noise/Outliers)
- Age: 39.3 years (middle-aged)
- Income: $60.8k (moderate income)
- Spending Score: 50.4 (moderate spending)
- Distribution: Very wide age range (18–70) with high variability
- Interpretation: These are outlier customers who don’t fit the main behavioral patterns
Cluster 0 (Middle-aged Conservatives)
- Age: 48.5 years (older demographic)
- Income: $58.3k (moderate income)
- Spending Score: 46.5 (below-average spending)
- Distribution: Tightly clustered around age 47–50
- Interpretation: Established, financially conservative customers who spend cautiously despite having decent income
Cluster 1 (Young High-Spenders)
- Age: 21.5 years (young demographic)
- Income: $60.1k (moderate income)
- Spending Score: 51.2 (above-average spending)
- Distribution: Very tight age clustering around 18–25
- Interpretation: Young customers with relatively high disposable income who are willing to spend
Key Insights
- Age is the Primary Differentiator: The clusters are primarily separated by age rather than income. Both main clusters have similar income levels (~$58–60k) but very different ages and spending behaviors.
- Spending vs. Age Relationship: There’s a clear inverse relationship — younger customers (Cluster 1) spend more despite similar incomes, while older customers (Cluster 0) are more conservative spenders.
- Income Consistency: Interestingly, all clusters have similar income levels, suggesting that spending behavior is more influenced by life stage (age) than earning capacity.
Business Implications
- Cluster 0: Target with value-focused, practical products and conservative marketing
- Cluster 1: Target with trendy, lifestyle products and bold marketing campaigns
- Noise Points: Require individual analysis or specialized micro-segmentation strategies
The clustering successfully identified two distinct customer personas based primarily on age-driven spending behaviors rather than income levels.
HDBSCAN: Hierarchical Density-Based Clustering
HDBSCAN extends DBSCAN by building a hierarchy of clusters and selecting the most stable ones, making it robust to varying cluster densities.
Key Parameters:
- min_cluster_size: Minimum size for a cluster
- min_samples: Controls how conservative the clustering is (defaults to min_cluster_size); see the sketch after this list
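Because HDBSCAN selects clusters by stability, a fitted model exposes diagnostics that DBSCAN lacks. A brief sketch of the attributes we find most useful (min_cluster_size=10 is used here purely for illustration):

# Fit HDBSCAN and inspect its stability-related diagnostics
model = hdbscan.HDBSCAN(min_cluster_size=10)
labels = model.fit_predict(X_scaled)
print(model.cluster_persistence_)  # stability score per cluster (higher = more stable)
print(model.probabilities_[:5])    # strength of each point's cluster membership
print(model.outlier_scores_[:5])   # GLOSH outlier scores per point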
HDBSCAN Hyperparameter Tuning
# HDBSCAN parameter optimization
param_grid = {
    'min_cluster_size': [5, 10, 15, 20],
    'min_samples': [None, 3, 5, 10]
}

best_score = -1
best_params = {}
best_clusters_labels = None
results = []

for min_cluster_size in param_grid['min_cluster_size']:
    for min_samples in param_grid['min_samples']:
        hdbscan_model = hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
            min_samples=min_samples,
            prediction_data=True
        )
        clusters = hdbscan_model.fit_predict(X_scaled)

        valid_clusters = clusters[clusters != -1]
        if len(np.unique(valid_clusters)) >= 2:
            score = silhouette_score(X_scaled[clusters != -1], valid_clusters)
            results.append({
                'min_cluster_size': min_cluster_size,
                'min_samples': min_samples,
                'silhouette_score': score,
                'num_clusters': len(np.unique(valid_clusters))
            })
            if score > best_score:
                best_score = score
                best_params = {'min_cluster_size': min_cluster_size, 'min_samples': min_samples}
                best_clusters_labels = clusters  # keep the winning labels for later

print(f"Best HDBSCAN Parameters: {best_params}")
print(f"Best Silhouette Score: {best_score:.4f}")
Optimal Results: min_cluster_size=10, min_samples=None, Silhouette Score=0.6319.
Applying the Best HDBSCAN Model and Initial Cluster Overview
# --- Apply HDBSCAN with Best Parameters and Get Final Clusters ---
if best_clusters_labels is not None:
    final_hdbscan_clusters = best_clusters_labels
else:
    # Fallback: re-run the model with best_params if the winning labels were not captured
    print("\n--- Re-running HDBSCAN with best parameters to get final clusters (fallback) ---")
    final_hdbscan_model = hdbscan.HDBSCAN(
        min_cluster_size=best_params['min_cluster_size'],
        min_samples=best_params['min_samples'],  # None defaults to min_cluster_size
        prediction_data=True
    )
    final_hdbscan_clusters = final_hdbscan_model.fit_predict(X_scaled)

# Add cluster labels to the original DataFrame
df['HDBSCAN_Cluster'] = final_hdbscan_clusters
print(f"\nNumber of HDBSCAN clusters found (including noise -1): {len(np.unique(final_hdbscan_clusters))}")
print(f"Number of HDBSCAN noise points (-1 label): {np.sum(final_hdbscan_clusters == -1)}")
print(f"HDBSCAN Cluster distribution:\n{df['HDBSCAN_Cluster'].value_counts()}")

# --- Explicitly calculate the Silhouette Score for the FINAL HDBSCAN clusters ---
X_hdbscan_clustered = X_scaled[final_hdbscan_clusters != -1]
labels_hdbscan_clustered = final_hdbscan_clusters[final_hdbscan_clusters != -1]
final_hdbscan_silhouette_score = np.nan
if len(np.unique(labels_hdbscan_clustered)) >= 2:
    final_hdbscan_silhouette_score = silhouette_score(X_hdbscan_clustered, labels_hdbscan_clustered)
    print(f"\nSilhouette Score for the FINAL HDBSCAN clusters: {final_hdbscan_silhouette_score:.4f}")
else:
    print("\nCannot calculate Silhouette Score for final HDBSCAN clusters: fewer than 2 valid clusters found (excluding noise).")
Here we observe three distinct clusters (0, 1, and 2) plus a sizable -1 noise cluster. The final Silhouette Score (0.6319) matches the best score from the grid search, confirming consistency.
From the HDBSCAN clustering analysis, we can see three distinct clusters (plus a noise group) identified based on age, annual income, and spending score. Let's break down what each cluster represents:
Cluster Characteristics
Cluster 0 (Blue) — “High Spenders”
- Age: ~33 years (young professionals)
- Income: ~$79k (highest income)
- Spending Score: ~81 (highest spending)
- This represents young, high-income individuals with very high spending behavior
Cluster 1 (Orange) — “Young Moderate Spenders”
- Age: ~23 years (youngest group)
- Income: ~$59k (moderate income)
- Spending Score: ~50 (moderate spending)
- Young adults with moderate income and spending patterns
Cluster 2 (Green) — “Mature Conservative Spenders”
- Age: ~52 years (oldest group)
- Income: ~$54k (moderate income)
- Spending Score: ~48 (moderate-low spending)
- Older individuals with moderate income but lower spending
Noise Points (Gray):
The gray points scattered throughout are outliers that don’t fit the main patterns — they could be:
- Customers with unusual combinations of age/income/spending
- Data entry errors
- Genuinely unique cases that don’t follow typical patterns
Key Insights
- Age-Spending Relationship: There’s a clear inverse relationship between age and spending score. Younger customers tend to spend more freely.
- Three Distinct Customer Segments Identified:
- Cluster 0: Premium customers (high income, high spending) — most valuable segment
- Cluster 1: Young moderate spenders — potential growth segment with room for increased engagement
- Cluster 2: Mature conservative spenders — underutilizing their purchasing power despite reasonable income
Target Segments for Marketing:
- Cluster 0 represents your premium customers requiring retention strategies
- Clusters 1 and 2 could be targeted for spending increase campaigns
Cluster Quality: The scatter plot shows well-separated clusters with clear boundaries, indicating HDBSCAN successfully identified meaningful customer segments. The presence of noise points (gray) suggests either outliers with unique behaviors or potential data quality issues worth investigating.
This segmentation provides a solid foundation for targeted marketing strategies, with different approaches needed for each distinct customer group based on their age, income, and spending behaviors.
Conclusion
This comprehensive comparison demonstrates that algorithm choice significantly impacts segmentation outcomes. While HDBSCAN achieved the highest silhouette score, K-Means provided the most interpretable and actionable customer segments for business applications. The key is aligning algorithmic strengths with business objectives and data characteristics.
For customer segmentation projects, we recommend starting with K-Means for baseline insights, then exploring density-based methods when dealing with complex customer behaviors or significant outlier populations.
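To close the loop, the scores reported in each section can be gathered into one comparison table; a small convenience sketch using the numbers from above:

# Side-by-side comparison of the silhouette scores reported in this article
comparison = pd.DataFrame({
    'Algorithm': ['K-Means (k=5)', 'DBSCAN (eps=0.35, min_samples=8)', 'HDBSCAN (min_cluster_size=10)'],
    'Silhouette Score': [0.4166, 0.7970, 0.6319],
    'Notes': ['5 segments, no noise handling',
              '2 clusters, 175 noise points (score excludes noise)',
              '3 clusters plus noise (score excludes noise)']
})
print(comparison.to_string(index=False))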
Github link: https://github.com/rumsinha/Clustering