Nonparametric methods don’t always get the credit they deserve. Methods like k-nearest neighbors (k-NN) and kernel density estimators are sometimes dismissed as simple or old-fashioned, but their real strength is in estimating conditional relationships directly from data, without imposing a fixed functional form. This flexibility makes them interpretable and powerful, especially when data are limited or when we want to incorporate domain knowledge.
In this article, I’ll show how nonparametric methods provide a unified foundation for conditional inference, covering regression, classification, and even synthetic data generation. Using the classic Iris dataset as a running example, I’ll illustrate how to estimate conditional distributions in practice and how they can support a wide range of data science tasks.
Estimating Conditional Distributions
The key idea is simple: instead of predicting just a single number or class label, we estimate the full range of possible outcomes for a variable given some other information. In other words, rather than focusing only on the expected value, we capture the entire probability distribution of outcomes that could occur under similar conditions.
To do this, we look at data points close to the situation we are interested in; that is, those with conditioning variables near our query point in feature space. Each point contributes to the estimate, with its influence weighted by similarity: points closer to the query have more impact, while more distant points count less. By aggregating these weighted contributions, we obtain a smooth, data-driven estimate of how the target variable behaves across different contexts.
This approach allows us to go beyond point predictions to a richer understanding of uncertainty, variability, and structure in the data.
Continuous Target: Conditional Density Estimation
To make this concrete, let’s take two continuous variables from the Iris dataset: sepal length (x1) as the conditioning variable and petal length (y) as the target. For each value of x1, we look at nearby data points and form a density over their y-values by centering small, weighted kernels on them, with weights reflecting proximity in sepal length. The result is a smooth estimate of the conditional density p(y ∣ x1).
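Here is a minimal sketch of this kernel-weighted estimate, assuming Gaussian kernels; the bandwidths (bw_x, bw_y) and the query value of 6.0 are illustrative choices, not settings taken from the figure below.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x1 = iris.data[:, 0]  # sepal length (conditioning variable)
y = iris.data[:, 2]   # petal length (target)

def conditional_density(y_grid, x1_query, bw_x=0.3, bw_y=0.3):
    # Similarity weights: how close each training point is to the query in x1
    w = np.exp(-0.5 * ((x1 - x1_query) / bw_x) ** 2)
    w = w / w.sum()
    # Weighted sum of Gaussian kernels centered on the neighbours' y-values
    k = np.exp(-0.5 * ((y_grid[:, None] - y[None, :]) / bw_y) ** 2) / (bw_y * np.sqrt(2 * np.pi))
    return k @ w  # estimate of p(y | x1 = x1_query) on y_grid

y_grid = np.linspace(y.min(), y.max(), 200)
density = conditional_density(y_grid, x1_query=6.0)
print("mode of p(y | x1 = 6.0):", y_grid[np.argmax(density)])
```

The mode of this estimated density is the value traced by the mode regression curve discussed next.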
Figure 1 shows the resulting conditional distribution. At each value of x1, a vertical slice through the color map represents p(y ∣ x1). From this distribution we can compute statistics such as the mean or mode; we can also sample a random value, a key step for synthetic data generation. The figure also shows the mode regression curve, which passes through the peaks of these conditional distributions. Unlike a traditional least-squares fit, this curve comes directly from the local conditional distributions, naturally adapting to nonlinearity, skew, or even multimodal patterns.
Figure 1. Conditional distribution and mode regression curve of petal length given sepal length for the Iris dataset (Image by Author).
What if we have more than one conditioning variable? For example, suppose we want to estimate p(y ∣ x1, x2).
Rather than treating (x1, x2) as a single joint input and applying a two-dimensional kernel, we can construct this distribution sequentially:
p(y ∣ x1, x2) ∝ p(y ∣ x2) p(x2 ∣ x1),
which effectively assumes that once x2 is known, y depends primarily on x2 rather than directly on x1. This step-by-step approach captures the conditional structure gradually: dependencies among the predictors are modeled first, and these are then linked to the target.
Similarity weights are always computed in the subspace of the relevant conditioning variables. For example, if we were estimating p(x3 ∣ x1, x2), similarity would be determined using x1 and x2. This ensures that the conditional distribution adapts precisely to the chosen predictors.
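As a small illustration of subspace weighting, the sketch below computes similarity weights for an estimate of p(x3 ∣ x1, x2) using only the first two Iris columns; the choice of columns, the bandwidth, and the query point are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
x12, x3 = X[:, :2], X[:, 2]   # (x1, x2) = conditioning subspace, x3 = target

def subspace_weights(query_12, bw=0.4):
    # Squared Euclidean distance computed only in the (x1, x2) subspace
    d2 = np.sum((x12 - query_12) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bw ** 2)
    return w / w.sum()

w = subspace_weights(np.array([6.0, 3.0]))
print("weighted conditional mean of x3 given (x1, x2) = (6.0, 3.0):", (w * x3).sum())
```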
Categorical Target: Conditional Class Probabilities
We can apply the same principle of conditional estimation when the target variable is categorical. For example, suppose we want to predict the species y of an Iris flower given its sepal length (x1) and petal length (x2). For each class y = c, we estimate the joint distribution p(x1, x2 ∣ y = c) using the same sequential approach. These joint distributions are then combined using Bayes’ theorem to obtain the conditional probabilities p(y = c ∣ x1, x2), which can be used for classification or stochastic sampling.
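A hedged sketch of this construction: for each class we factor p(x1, x2 ∣ y = c) sequentially as p(x1 ∣ c) · p(x2 ∣ x1, c) using one-dimensional kernel estimates, then combine the factors with Bayes’ theorem. The bandwidths and the query point are illustrative, not the settings behind Figure 2.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x1, x2, labels = iris.data[:, 0], iris.data[:, 2], iris.target  # sepal length, petal length, species

def kde_at(values, point, weights=None, bw=0.3):
    # Weighted 1-D Gaussian kernel density estimate evaluated at a single point
    w = np.ones_like(values) if weights is None else weights
    w = w / w.sum()
    k = np.exp(-0.5 * ((point - values) / bw) ** 2) / (bw * np.sqrt(2 * np.pi))
    return float(k @ w)

def class_posterior(q1, q2):
    scores = []
    for c in np.unique(labels):
        m = labels == c
        prior = m.mean()                                 # p(y = c)
        p_x1 = kde_at(x1[m], q1)                         # p(x1 | y = c)
        w = np.exp(-0.5 * ((x1[m] - q1) / 0.3) ** 2)     # similarity in x1 within the class
        p_x2 = kde_at(x2[m], q2, weights=w + 1e-12)      # p(x2 | x1, y = c)
        scores.append(prior * p_x1 * p_x2)
    scores = np.array(scores)
    return scores / scores.sum()                         # Bayes' theorem: p(y = c | x1, x2)

posterior = class_posterior(6.0, 4.5)
print(dict(zip(iris.target_names, posterior.round(3))))
```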
Figure 2, panels 1–3, show the estimated joint distributions for each species. From these, we can classify by selecting the most probable species or generate random samples according to the estimated probabilities. The fourth panel displays the predicted class boundaries, which appear smooth rather than abrupt, reflecting uncertainty where species overlap.
Figure 2. Class probability landscape for the Iris dataset. Panels 1–3 show the estimated joint distributions for each species: Setosa, Versicolor, and Virginica. Panel 4 displays the predicted class boundaries. (Image by Author)
Synthetic Data Generation
Nonparametric conditional distributions do more than support regression or classification. They also let us generate entirely new datasets that preserve the structure of the original data. In the sequential approach, we model each variable based on the ones that come before it, then draw values from these estimated conditional distributions to build synthetic records. Repeating this process gives us a full synthetic dataset that maintains the relationships among all the attributes.
The procedure works as follows (see the code sketch after the list):
- Start with one variable and sample from its marginal distribution.
- For each subsequent variable, estimate its conditional distribution given the variables already sampled.
- Draw a value from this conditional distribution.
- Repeat until all variables have been sampled to form a complete synthetic record.
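The sketch below follows these steps for the four continuous Iris attributes, drawing each value from a kernel-smoothed conditional distribution (a weighted, smoothed bootstrap); the bandwidth rule and random seed are illustrative assumptions, not the settings used to produce Figure 3.

```python
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
X = load_iris().data          # columns: sepal length/width, petal length/width
n, d = X.shape
bw = 0.15 * X.std(axis=0)     # per-attribute kernel bandwidths (illustrative choice)

def synthetic_record():
    record = np.empty(d)
    # Step 1: sample the first attribute from its (smoothed) marginal
    record[0] = rng.choice(X[:, 0]) + rng.normal(0, bw[0])
    # Steps 2..d: sample each attribute from its conditional distribution given
    # the attributes already generated, weighting by similarity in that subspace
    for j in range(1, d):
        dist2 = np.sum(((X[:, :j] - record[:j]) / bw[:j]) ** 2, axis=1)
        w = np.exp(-0.5 * dist2)
        w = w / w.sum()
        i = rng.choice(n, p=w)                      # pick a neighbour by weight
        record[j] = X[i, j] + rng.normal(0, bw[j])  # smooth around its value
    return record

synthetic = np.array([synthetic_record() for _ in range(150)])
print(synthetic[:3].round(2))
```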
Figure 3 shows the original (left) and synthetic (right) Iris datasets in the original measurement space. Only three of the four continuous attributes are displayed to fit the 3D visualization. The synthetic dataset closely reproduces the patterns and relationships in the original, showing that nonparametric conditional distributions can effectively capture multivariate structure.
Figure 3. Original and synthetic Iris data in original space (three continuous attributes shown) (Image by Author).
Although we’ve illustrated the approach with the small, low-dimensional Iris dataset, this nonparametric framework scales naturally to much larger and more complex datasets, including those with a mix of numerical and categorical variables. By estimating conditional distributions step by step, it captures rich relationships among many features, making it broadly useful across modern data science tasks.
Handling Mixed Attributes
So far, our examples have considered conditional estimation with continuous conditioning variables, even though the target may be either continuous or categorical. In these cases, Euclidean distance works well as a measure of similarity. In practice, however, we often need to condition on mixed attributes, which requires a suitable distance metric. For such datasets, measures like Gower distance can be used. With an appropriate similarity metric, the nonparametric framework applies seamlessly to heterogeneous data, maintaining its ability to estimate conditional distributions and generate realistic synthetic samples.
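As a rough illustration, the following sketch computes a Gower-style distance by hand for two hypothetical mixed records (two numeric measurements plus a species label): numeric attributes contribute a range-scaled absolute difference, categorical attributes a 0/1 mismatch, and the terms are averaged. The attribute ranges are approximate Iris values; in practice a library implementation could be used instead.

```python
import numpy as np

def gower_distance(a, b, num_idx, cat_idx, num_range):
    terms = []
    for j, r in zip(num_idx, num_range):
        terms.append(abs(a[j] - b[j]) / r)          # scaled numeric difference
    for j in cat_idx:
        terms.append(0.0 if a[j] == b[j] else 1.0)  # simple categorical mismatch
    return float(np.mean(terms))

# Two hypothetical records: (sepal length, petal length, species)
rec1 = (5.1, 1.4, "setosa")
rec2 = (6.3, 4.9, "versicolor")
d = gower_distance(rec1, rec2, num_idx=[0, 1], cat_idx=[2],
                   num_range=[3.6, 5.9])            # approximate attribute ranges in Iris
print(round(d, 3))
```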
Advantages of the Sequential Approach
An alternative to sequential estimation is to model distributions jointly over all conditioning variables. This can be done using multidimensional kernels centered at the data points, or through a mixture model, for example representing the distribution with N Gaussians, where N is much smaller than the number of data points. While this works in low dimensions (it would work for the Iris dataset), it quickly becomes data-hungry and computationally costly as the number of variables increases, because the data grow sparse in higher-dimensional space; the problem is worse still when predictors include both numeric and categorical types. The sequential approach sidesteps these issues by modeling dependencies step by step and computing similarity only in the relevant subspace, improving efficiency, scalability, and interpretability.
Conclusion
Nonparametric methods are flexible, interpretable, and efficient, making them ideal for estimating conditional distributions and generating synthetic data. By focusing on local neighborhoods in the conditioning space, they capture complex dependencies directly from the data without relying on strict parametric assumptions. You can also bring in domain knowledge in subtle ways, such as adjusting similarity measures or weighting schemes to emphasize important features or known relationships. This keeps the model primarily data-driven while guided by prior insights, producing more realistic outcomes.
💡 Interested in seeing these ideas in action? I’ll be sharing a brief LinkedIn post in the coming days with key examples and insights. Connect with me here: https://www.linkedin.com/in/andrew-skabar/