# From Clusters to Customers: Supercharging Segmentation with Generative AI
## The “Consultant’s Confession”

If you’ve spent any time in pharma consulting, you know the drill. We live in a world of high-stakes “Patient Journeys” and “HCP Target Lists.” I’ve spent more hours than I care to admit staring at spreadsheets of patient data, trying to find that “Aha!” moment. (The article uses pharma examples throughout, but the method adapts easily to any consulting domain.)

In the “old days” — about two years ago — a segmentation project meant running a K-Means clustering model and surfacing four or five neat little groups. But let’s be honest: nothing kills a “Data-Driven Culture” faster than a slide labeled Cluster 0 through Cluster 4. You’ve spent weeks cleaning data, tuning hyperparameters, and debating “Elbow vs. Silhouette” curves. You proudly present your findings, only for the Head of Marketing to ask the one question that makes your heart sink: “Great, but… what do I actually say to the people in Cluster 2?”

Suddenly, your precise mathematical groups feel like abstract art — pretty to look at, but impossible to use. We’ve been stuck in this gap between Math and Meaning for years. But what if the “missing link” isn’t more data, but a better translator? Imagine pairing the overworked analyst (currently vibrating on overpriced espresso and billable-hour anxiety) with an LLM sidekick that turns cold centroids into a warm, actionable business playbook.

In this article, I’m pulling back the curtain on a project I built in Google Colab. I’ll show you how I moved from raw (synthetic) migraine patient data to a full-blown strategic playbook using a blend of K-Means clustering and Google’s Gemini. We aren’t just grouping data anymore; we’re automating the “Strategic Soul” of the business.

## Step 1: The Foundation (Synthetic Data)

I’d probably be sacked faster than a non-compliant sales rep if I used actual IQVIA data for this article, so let’s stay on the safe side. To keep my career intact while still proving the point, we’ll create synthetic data that mimics real-world dynamics without the legal paperwork. In the snippet below, I avoid generating patient- or claim-level records. Instead, I create aggregated features — the kind directly derivable from patient data — that are highly relevant to segmentation.

```python
# Step 1: GenAI-generated code for creating a synthetic dataset for the segmentation process
import numpy as np
import pandas as pd

np.random.seed(42)
n_hcps = 50000

# 1. Core identifiers and attributes
hcp_ids = [f"HCP_{i+1}" for i in range(n_hcps)]

specialties = np.random.choice(
    ["Neurologist", "PCP", "Other"],
    size=n_hcps,
    p=[0.25, 0.5, 0.25]  # tweak as needed
)

states = np.random.choice(
    ["CA", "TX", "NY", "FL", "IL", "PA", "OH", "GA", "NC", "MI"],
    size=n_hcps
)

top_payer_type = np.random.choice(
    ["Commercial", "Medicare", "Medicaid", "Other"],
    size=n_hcps,
    p=[0.5, 0.25, 0.15, 0.10]  # commercial-heavy mix
)

# 2. Access score (1-10, higher = better)
# Let Neurologists have slightly better access on average
base_access = np.random.normal(loc=6.5, scale=2.0, size=n_hcps)
base_access += np.where(specialties == "Neurologist", 0.8, 0.0)
base_access += np.where(top_payer_type == "Medicaid", -0.7, 0.0)
access_score = np.clip(np.round(base_access), 1, 10).astype(int)

# 3. Class TRx (100-500 migraine market TRx, higher for Neuro)
base_class_trx = np.random.normal(loc=260, scale=60, size=n_hcps)
base_class_trx += np.where(specialties == "Neurologist", 60, 0)
base_class_trx += np.where(specialties == "PCP", -30, 0)
class_trx = np.clip(np.round(base_class_trx), 100, 500).astype(int)

# 4. Brand share and brand TRx
# Target ~45% overall market share, modulated by access & specialty
raw_share = (
    0.45
    + 0.03 * (access_score - 5)                            # better access -> higher share
    + np.where(specialties == "Neurologist", 0.05, 0.0)
    + np.where(top_payer_type == "Medicaid", -0.05, 0.0)
    + np.random.normal(0, 0.05, n_hcps)                    # noise
)
brand_share = np.clip(raw_share, 0.05, 0.9)
brand_trx = np.round(class_trx * brand_share).astype(int)

# Check overall market share (optional sanity check)
overall_share = brand_trx.sum() / class_trx.sum()
print(f"Overall synthetic market share: {overall_share:.3f}")

# 5. Channel engagement: digital (emails etc.) and rep calls
# Let access & specialty influence channel mix
digital_base = 4 + 0.6 * (access_score - 5)
digital_base += np.where(specialties == "Neurologist", 2, 0)
digital_base += np.where(top_payer_type == "Commercial", 1, 0)
digital_engagement = np.random.poisson(lam=np.clip(digital_base, 0.5, 20))

rep_base = 3 + 0.4 * (access_score - 5)
rep_base += np.where(specialties == "Neurologist", 1.5, 0)
rep_base += np.where(top_payer_type == "Medicaid", -0.5, 0)
rep_calls = np.random.poisson(lam=np.clip(rep_base, 0.2, 20))

# New vs existing patients mix (0-1; higher = more new starts)
new_start_ratio = np.clip(
    np.random.beta(a=2, b=3, size=n_hcps) + 0.05 * (access_score - 5) / 5,
    0, 1
)

# Chronic burden index (proxy for comorbidity severity, 0-5)
chronic_burden_index = np.clip(
    np.random.normal(loc=2.5, scale=1.0, size=n_hcps)
    + np.where(top_payer_type == "Medicare", 0.8, 0.0),
    0, 5
)

# Adherence proxy (proportion of patients with MPR > 80%)
adherence_rate = np.clip(
    0.7 + 0.02 * (access_score - 5) + np.random.normal(0, 0.05, n_hcps),
    0.3, 0.95
)

# Build the final DataFrame
df_hcp = pd.DataFrame({
    "hcp_id": hcp_ids,
    "specialty": specialties,
    "access_score": access_score,
    "class_trx": class_trx,
    "brand_trx": brand_trx,
    "digital_engagement": digital_engagement,
    "rep_calls": rep_calls,
    "state": states,
    "top_payer_type": top_payer_type,
    "brand_share": brand_trx / class_trx,
    "new_start_ratio": new_start_ratio,
    "chronic_burden_index": chronic_burden_index,
    "adherence_rate": adherence_rate
})

df_hcp.head()
```

## Step 2: The Logic (K-Means & The Mathematical Anchor)

This is the “classic” part of the process — the ML foundation that remains the bedrock of any solid analysis. In pharma, segmentation isn’t just about grouping; it’s about finding distinct, non-overlapping profiles that respond differently to treatment or messaging.

We use K-Means, an unsupervised learning algorithm that groups data points by minimizing the distance between each point and its cluster center (the centroid). However, K-Means is a bit like a GPS: it will take you wherever you ask, but you have to tell it how many stops to make. To avoid picking a number out of thin air, we use two diagnostic tools:

- The Elbow Method: We look for the “bend” where adding more clusters stops providing significant improvements in tightness (inertia).
- The Silhouette Score: This measures how well each point fits its own cluster versus the neighboring one. We want high scores — meaning “tight” clusters and “clear” boundaries.
The following snippet runs this diagnostic marathon to find the “Goldilocks” number of clusters:

```python
## Importing the basics required for the K-Means clustering algo
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Select numeric features for clustering
# These capture HCP behavior, intensity, and potential
# Exclude IDs and pure categoricals
numeric_features = [
    "access_score",           # Access to therapy (1-10)
    "class_trx",              # Total migraine TRx volume
    "brand_trx",              # Brand-specific TRx
    "brand_share",            # Brand penetration (0-1)
    "digital_engagement",     # Email/webinar engagement
    "rep_calls",              # Field rep interactions
    "new_start_ratio",        # New prescriptions vs refills
    "chronic_burden_index",   # Patient complexity proxy
    "adherence_rate"          # Persistence quality
]
print(f"Selected {len(numeric_features)} features for clustering")

# Extract features as a numpy array
X = df_hcp[numeric_features].values
print("Feature matrix shape:", X.shape)
print("Sample feature values (first 5 HCPs):")
print(X[:5])

# Standardize features (CRITICAL for clustering)
# K-means uses Euclidean distance; unscaled features will dominate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Scaling complete. Mean and std of scaled features:")
print(pd.DataFrame(X_scaled, columns=numeric_features).describe().round(3))

# Elbow method + Silhouette score to choose k
ks = range(2, 11)  # Test k=2 to 10
inertias = []
sil_scores = []

for k in ks:
    # Fit K-means for this k
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = km.fit_predict(X_scaled)

    # Store metrics
    inertias.append(km.inertia_)  # Within-cluster sum of squares
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot both diagnostics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(ks, inertias, marker="o", linewidth=2)
ax1.set_xlabel("Number of clusters (k)")
ax1.set_ylabel("Inertia (within-cluster SSE)")
ax1.set_title("Elbow Method")
ax1.grid(True, alpha=0.3)

ax2.plot(ks, sil_scores, marker="o", linewidth=2, color="orange")
ax2.set_xlabel("Number of clusters (k)")
ax2.set_ylabel("Average Silhouette Score")
ax2.set_title("Silhouette Analysis")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print results table
results_table = pd.DataFrame({
    "k": list(ks),
    "inertia": np.round(inertias, 0),
    "silhouette": np.round(sil_scores, 3)
})
print("Diagnostic results:")
print(results_table)
```

*Figure: output of the diagnostic snippet (Elbow and Silhouette curves) used to finalize k.*

With the optimal “K” value identified as 4 (if you would rather pick k programmatically than eyeball the plots, see the short sketch at the end of this section), we processed our 50,000 HCP data points through the K-Means model, turning the raw data into the structured cluster visualization shown below.

So, what’s next? Am I expected to run descriptive analytics on each cluster across all the features we used, invest hours crafting marketing-friendly buzzwords for them, and then also develop a strategic playbook tailored to each cluster?

What if all the heavy lifting — descriptive analytics, buzzword crafting, cluster playbooks — could be jump-started by one code snippet? That means less time on mechanics, more time on strategy.
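A quick aside before we bring in the LLM: if you want a programmatic starting point for k rather than squinting at two plots, a tiny helper like the sketch below does the job. It simply reuses the `ks` and `sil_scores` lists from the diagnostic loop above; the `suggest_k` name is my own illustrative addition, not part of the original notebook, and its output should be treated as an input to judgment rather than a verdict.

```python
# Illustrative sketch (not from the original notebook): suggest k from the
# silhouette results computed in the diagnostic loop above.
def suggest_k(ks, sil_scores):
    """Return the k with the highest average silhouette score."""
    best_k, best_score = max(zip(ks, sil_scores), key=lambda pair: pair[1])
    return best_k, best_score

candidate_k, candidate_score = suggest_k(ks, sil_scores)
print(f"Silhouette suggests k = {candidate_k} (score = {candidate_score:.3f})")
# The final call in this article (k = 4) also weighs the elbow curve and,
# frankly, how many segments a field force can realistically act on.
```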
You start with a strong foundation and then tailor it into a roadmap for each client.

```python
## Setting up the Gemini LLM model
!pip install -q google-generativeai

import google.generativeai as genai
from google.colab import userdata

genai.configure(api_key=userdata.get('Default'))
```

```python
## Let's try the LLM way now
def run_clustering_with_gemini_descriptions(df_hcp, k_final=4):
    """
    Full pipeline: K-means clustering → profiles → Gemini Pro descriptions
    """
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    import seaborn as sns
    from IPython.display import display, Markdown

    # 1. CLUSTERING
    numeric_features = [
        "access_score", "class_trx", "brand_trx", "brand_share",
        "digital_engagement", "rep_calls", "new_start_ratio",
        "chronic_burden_index", "adherence_rate"
    ]

    X = df_hcp[numeric_features].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit K-means
    kmeans = KMeans(n_clusters=k_final, random_state=42, n_init="auto")
    df_hcp["cluster_ml"] = kmeans.fit_predict(X_scaled)

    # 2. PROFILES
    cluster_profile = df_hcp.groupby("cluster_ml")[numeric_features].mean().round(2)
    cluster_profile["size"] = df_hcp["cluster_ml"].value_counts().sort_index().values
    cluster_profile["pct_total"] = (cluster_profile["size"] / len(df_hcp) * 100).round(1)

    print("✅ ML Clusters created!")
    print(cluster_profile)

    # 3. PCA VISUALIZATION
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(X_scaled)
    df_hcp["pc1"] = pcs[:, 0]
    df_hcp["pc2"] = pcs[:, 1]

    plt.figure(figsize=(8, 6))
    sns.scatterplot(
        data=df_hcp.sample(5000, random_state=42),
        x="pc1", y="pc2",
        hue="cluster_ml",
        palette="tab10",
        alpha=0.7, s=30
    )
    plt.title(f"HCP Clusters (k={k_final})")
    plt.legend(title="Cluster")
    plt.show()

    # 4. GEMINI PRO DESCRIPTIONS
    descriptions = generate_gemini_cluster_stories(cluster_profile, df_hcp)

    display(Markdown("## 🤖 Gemini Pro: HCP Cluster Personas"))
    display(Markdown(descriptions))

    return df_hcp, cluster_profile


def generate_gemini_cluster_stories(cluster_profile, df_hcp):
    """
    Use Gemini Pro to generate business-ready cluster descriptions
    """
    # Prepare data for Gemini
    table_md = cluster_profile.to_markdown()

    # Get categorical insights
    specialty_dist = pd.crosstab(df_hcp["cluster_ml"], df_hcp["specialty"],
                                 normalize="index").round(2).to_markdown()
    payer_dist = pd.crosstab(df_hcp["cluster_ml"], df_hcp["top_payer_type"],
                             normalize="index").round(2).to_markdown()

    prompt = f"""
You are a pharma commercial excellence consultant analyzing HCP segments for a migraine brand.

ML CLUSTER PROFILES (means per cluster):
{table_md}

SPECIALTY MIX:
{specialty_dist}

PAYER MIX:
{payer_dist}

TASK: For each cluster (0-{len(cluster_profile)-1}), create:

## Cluster X: [2-3 word business name]

Profile (2-3 bullets):
• Key characteristics in plain business language
• What makes this HCP unique

Priority (High/Med/Low):
• Why this segment matters for brand growth

Engagement playbook (3 tactics):
• Rep strategy
• Digital strategy
• Access/messaging focus

Format as clean markdown. Make it actionable for field force leaders.
"""

    # Generate with Gemini
    ## Please make sure the credentials/keys are added and
    ## you select a model that is available for your user
    model = genai.GenerativeModel('gemini-2.5-flash')
    response = model.generate_content(prompt)

    return response.text


# RUN THE FULL PIPELINE
df_with_clusters, cluster_profiles = run_clustering_with_gemini_descriptions(df_hcp, k_final=4)
```

Voila!
In minutes, raw HCP-level data turned into actionable personas like “Elite Advocates” and “High-Potential Expanders” — work that used to take weeks now runs end-to-end in a free Colab notebook.

This gives you a strong starting point with clear field playbooks, but as the analyst you still need to validate each cluster’s story: does “Elite Advocates” really show the brand loyalty and engagement the name implies? Do “High-Potential Expanders” have the class volume and access to justify the growth opportunity? Cross-check against revenue concentration, payer mix alignment, and field feedback before deployment (a minimal sanity-check sketch is included as a P.S. below). AI handles the heavy lifting while you bring the expertise to make it production-ready.

Check out my Colab notebook for the full walkthrough: https://colab.research.google.com/drive/1wuit_8PXH2ykGhDubG65poYIBLRo2Igq?usp=sharing

— From someone who knows exactly how it feels when your carefully planned three-week segmentation timeline becomes a 72-hour fire drill. (This may or may not be an exaggeration.)
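P.S. If you want a quick programmatic gut-check before trusting the personas, here is a minimal sketch. It is not part of the original notebook: it simply re-aggregates columns we already built (brand TRx, brand share, access, channel activity, payer mix) from the `df_with_clusters` DataFrame returned by the pipeline above, and the `sanity_check_clusters` helper name is mine. Map the persona names Gemini produced for your run to their cluster IDs before reading the output.

```python
# Illustrative sanity-check sketch (not from the original notebook).
# Assumes `df_with_clusters` from the pipeline above; the persona-to-cluster
# mapping depends on your particular Gemini run.
import pandas as pd

def sanity_check_clusters(df):
    # 1. Revenue concentration: what share of total brand TRx sits in each cluster?
    revenue = df.groupby("cluster_ml")["brand_trx"].sum()
    concentration = (revenue / revenue.sum()).round(3).rename("brand_trx_share")

    # 2. Behavioral signature: do the cluster means back up the persona story?
    signature = df.groupby("cluster_ml")[
        ["brand_share", "class_trx", "access_score",
         "digital_engagement", "rep_calls", "adherence_rate"]
    ].mean().round(2)

    # 3. Payer mix: is a "growth" cluster actually stuck behind restrictive coverage?
    payer_mix = pd.crosstab(df["cluster_ml"], df["top_payer_type"],
                            normalize="index").round(2)

    return pd.concat([concentration, signature], axis=1), payer_mix

profile_check, payer_check = sanity_check_clusters(df_with_clusters)
print(profile_check)
print(payer_check)
```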