A Machine Learning Approach for Predicting Health Scores from Lifestyle Data

How AI learns from your sleep, diet, and exercise to measure wellness before you even visit a doctor.

Each morning, millions of individuals wear fitness trackers, tally steps, and log their sleep, all in pursuit of a simple question:

“How healthy am I, really?”

Modern life puts more health information than ever in front of us (calories, steps, hours slept), but few people understand what it means. Numbers that lack context can be overwhelming rather than empowering.

Photo by Gabin Vallet on Unsplash

This curiosity led to this project:

Could we train a machine to recognize how lifestyle habits, the daily decisions we make, shape our overall health?

The vision was not to replace doctors and healthcare providers but to address the gap between lifestyle data and actionable insight: a model that learns from behavior and translates it into a single Health Score.

Describing the Problem

Our fast-paced lifestyles are increasingly defined by unhelpful schedules, poor eating patterns, variable sleep, and lack of exercise. Although healthcare will always be paramount in disease diagnosis and treatment, preventative health assessment (understanding how day-to-day behaviours impact long-term health) is more or less absent from it.

People can collect a myriad of health-related data from fitness devices, apps, and standardized questionnaires, but they use it in a vacuum, without much purpose or meaning. Without analytic capability, it is hard to know exactly how habits (exercise, diet, sleep) translate into a measurable health outcome.

The problem this project chooses to define is:

How can we leverage machine learning to predict a person’s overall health score based on their daily lifestyle factors?

By building a predictive model, the project aims to see whether we can bridge the gap between raw lifestyle information and health insight that informs the actions we take (or don’t take), including purchasing decisions, with a focus on enhancing health. The project aims to quantify how habits translate into health, so that both individuals and healthcare professionals can benefit.

Dataset

The dataset comes from Kaggle (source: https://www.kaggle.com/datasets/pratikyuvrajchougule/health-and-lifestyle-data-for-regression/data). It has 1000 rows and 8 columns.

The dataset represents common lifestyle attributes found in health surveys or wearable records: Age, BMI, Exercise_Frequency, Diet_Quality, Sleep_Hours, Alcohol_Consumption, and Smoking_Status, together with the Health Score target.

Each feature carries both physiological and behavioral meaning. Together, they describe the tension between what we can control (habits) and what we can’t (age, baseline biology).

Visualizing the Story of the Data

Visualization serves as the common ground between machine insights and human understanding.

Health Score vs Age:

While the negative slope illustrates age-related decline, some individuals aged 60–80 still had strong Health Scores. This suggests that positive behaviors can at least partly attenuate the effects of biological aging.

BMI vs Health Score

The trend line slopes downward: a lower BMI is associated with a higher health score, consistent with the importance of energy balance.

Health Score vs Exercise Frequency

A smooth ridge on the plot shows that improvements in sleep and diet go together with better outcomes, a reminder of the restorative role of nutrition as well.

Diet quality vs Health Score

The scatter plot indicates a strong relationship between health score and dietary quality. As dietary quality goes up, health score consistently increases. An overwhelming majority of individuals with a high diet quality score also had a health score above 80. This trend suggests that individuals achieve greater overall health by continuing to eat a balanced, nutritious diet.

Sleep vs Health score

The scatter plot indicates moderate support for a relationship between health score and hours slept. People who sleep around 7 to 8 hours tend to obtain higher health scores, in line with the expectation that adequate rest matters. Individuals who slept too little, as well as those who slept too much, scored slightly lower. Poor sleep quality and inconsistent sleep may be more powerful indicators of lower wellness than sleep length.

Smoking Status vs Health Score

The plot shows a clear separation between smokers and non-smokers in health outcomes. Non-smokers (0) typically achieve higher health scores, while smokers (1) generally show lower overall health. This underlines the negative influence of smoking, which remains a strong predictor of lower wellness.

Alcohol Consumption vs Health Score

The scatter plot suggests that alcohol consumption is associated with lower health scores, although the association is not perfectly linear. Individuals with low to moderate alcohol intake tend to cluster at the upper end of the health score range. At higher levels of drinking, health scores become much more variable and show a clear downward trend, indicating that heavy drinking may have negative health effects overall.

Correlation Heatmap Analysis

Correlation Heatmap

The correlation heatmap summarizes the pairwise relationships between lifestyle factors and the Health Score. Diet Quality (0.68) had a strong positive relationship with the Health Score, while Sleep Hours (0.27) and Exercise Frequency (0.25) had weaker positive ones. This means that a better diet, balanced sleep, and consistent physical activity are associated with a higher health score. The two factors with a negative correlation, and therefore associated with lower health outcomes when higher, were BMI (-0.42) and Alcohol Consumption (-0.14). Overall, the heatmap reinforces which habits are most impactful on overall wellness.
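As a rough sketch of how the correlations behind a heatmap like this can be computed, the snippet below builds a small synthetic table and ranks features by their Pearson correlation with the target. The data, the coefficients used to generate it, and the column name Health_Score are assumptions for illustration; they are not the Kaggle values, so the printed numbers will not match the figures quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle data (assumption: not the real values).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Diet_Quality": rng.uniform(0, 100, n),
    "BMI": rng.normal(25, 4, n),
    "Sleep_Hours": rng.normal(7, 1, n),
})

# Hypothetical target: rises with diet quality, falls with BMI, plus noise.
df["Health_Score"] = (
    50 + 0.4 * df["Diet_Quality"] - 1.0 * (df["BMI"] - 25) + rng.normal(0, 5, n)
)

# Pearson correlation of each feature with the target.
corr = df.corr()["Health_Score"].drop("Health_Score")

# Order by absolute strength so the most influential habits come first.
corr_with_target = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_target.round(2))
```

Passing the full `df.corr()` matrix to a plotting library is what produces the heatmap itself; the ranking above is just the Health Score column of that matrix.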

Transformation of Lifestyle To Machine Readable Signals

Before a model can learn, the data needs to be prepared: cleaned, encoded, and scaled, with the binary (two-valued) fields handled appropriately.

1. Missingness:

Lifestyle data is notoriously sparse: people forget to enter data or report daily habits inaccurately. For numerical fields, filling gaps with the median kept the average from being inflated or deflated by extreme values. For categorical fields, filling gaps with the mode preserved the existing balance of categories.

2. Feature Scaling:

Fields such as BMI and Sleep Hours are measured on different scales. Standardizing them puts them in the same range, so that each behavior carries comparable weight.

3. Numerical Feature:

I defined a list of numerical columns containing six features: ‘Age’, ‘BMI’, ‘Exercise_Frequency’, ‘Diet_Quality’, ‘Sleep_Hours’, and ‘Alcohol_Consumption’. The numerical_cols list identifies all the columns in the dataset that contain numerical data to be processed.

4. Categorical features

In contrast, categorical features refer to variable values of a qualitative or label-based nature. The Smoking_Status feature is categorical in this dataset, indicating whether an individual smokes or not (e.g., Yes/No or 0/1).

I also built a numeric_transformer pipeline using the Pipeline class from scikit-learn. It has a single transformation step called ‘scaler’, which uses StandardScaler to transform each numerical column to a mean of zero and a standard deviation of one, so that all features are on a common scale for the various machine learning algorithms.

This phase, which is often neglected, is where much of the science takes place. Preprocessing the data properly helps ensure the model learns the health relationships, and not the artifacts of inconsistent lifestyle data.
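The preprocessing steps described above can be sketched roughly as follows. The column names and the numeric_transformer with its ‘scaler’ step come from the article; placing the median and mode imputation inside the same pipelines, the ColumnTransformer wiring, and the tiny demo frame are assumptions made for illustration and may differ from the actual project code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numerical_cols = ["Age", "BMI", "Exercise_Frequency",
                  "Diet_Quality", "Sleep_Hours", "Alcohol_Consumption"]
categorical_cols = ["Smoking_Status"]

# Numeric branch: median imputation, then zero-mean / unit-variance scaling.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),   # assumed step
    ("scaler", StandardScaler()),
])

# Categorical branch: fill gaps with the most frequent value (the mode).
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # assumed step
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

# Tiny made-up frame with a few gaps, only to exercise the pipeline.
demo = pd.DataFrame({
    "Age": [25, 40, np.nan, 60],
    "BMI": [22.0, 27.5, 31.0, np.nan],
    "Exercise_Frequency": [3, 0, 5, 2],
    "Diet_Quality": [80, 55, 70, 65],
    "Sleep_Hours": [7.5, 6.0, 8.0, 7.0],
    "Alcohol_Consumption": [1, 4, 0, 2],
    "Smoking_Status": [0, 1, np.nan, 0],
})

X_demo = preprocessor.fit_transform(demo)
print(X_demo.shape)                      # rows x (6 scaled + 1 categorical)
print(X_demo[:, :6].mean(axis=0).round(6))  # scaled columns average to ~0
```

Keeping the imputers and the scaler inside one fitted object means the same medians, modes, and scaling constants learned from training data can be reused on new records.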

Splitting the Dataset

Once I had prepared and cleaned the data, I divided it into a training set and a testing (or validation) set. The purpose of machine learning is to have the model learn patterns from one dataset (the training set) and then apply that learning to separate, unseen data (the testing set), so I wanted to know how well the model generalized to new situations. I used the train_test_split() function in scikit-learn with 70% of the data for training and 30% for testing, and set random_state=42 so that the results were reproducible each time. The resulting split looked like this:

X_train: 700 rows and 7 columns

X_test: 300 rows and 7 columns

y_train: 700 entries

y_test: 300 entries

This way, the model learns from a robust share of the data but is still tested on unseen examples, which gives a more honest estimate of how well it predicts health score from lifestyle factors.
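A minimal sketch of the split, assuming placeholder arrays with the shapes reported above (1000 rows, 7 features). The values are random and stand in for the real data; only train_test_split, the 70/30 ratio, and random_state=42 come from the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the article's shapes; the values are random.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 7))
y = rng.uniform(0, 100, size=1000)

# 70% for training, 30% held out; a fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```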

Selecting the Ideal Predictive Engine

Relationships in human health are rarely perfectly linear. Sleep is beneficial until it becomes excessive. Exercise is advantageous until it can no longer be sustained.

To explore the complexity of these relationships, three models with different philosophies were applied.

Linear Regression — The Transparent Thinker

This model assumes proportional relationships: if you double the exercise, you double the improvement. The math is simple and easy to understand, so it acts as a reference point for comparison.

Random Forest Regressor — The Ensemble Learner

The ensemble of decision trees can capture non-linear and conditional effects. For example, it can identify that diet matters more when exercise frequency is low, or that good sleep can compensate for moderate levels of alcohol consumption.

Support Vector Regression — The Margin Optimizer

This method fits the best possible regression line within a margin of error. By design, support vector regression is fairly robust to outliers, a useful property when analyzing health behaviors, where some habits vary widely by nature.

Assessment of Models

Models were trained on the 70% training split and evaluated on the held-out 30%. Performance was assessed using two metrics:

Mean Squared Error (MSE): The average squared difference between predicted and actual scores; lower is better.

R² Score: The proportion of variance in Health Score explained by the model; closer to 1 is better.
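To make the two metrics concrete, here is a small worked example that computes them by hand and checks the result against scikit-learn. The four actual and predicted scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted scores, purely for illustration.
y_true = np.array([70.0, 85.0, 60.0, 90.0])
y_pred = np.array([72.0, 80.0, 65.0, 88.0])

# MSE: the average of the squared prediction errors.
mse = np.mean((y_true - y_pred) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE by hand: {mse:.2f}, sklearn: {mean_squared_error(y_true, y_pred):.2f}")
print(f"R2 by hand:  {r2:.3f}, sklearn: {r2_score(y_true, y_pred):.3f}")
```

Because the errors are squared, one prediction that is off by 5 points costs as much as about six predictions that are each off by 2.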

Linear Regression: Mean Squared Error 31.91, R² Score 0.815

Random Forest: Mean Squared Error 27.83, R² Score 0.839

Support Vector: Mean Squared Error 36.51, R² Score 0.789

The Random Forest Regressor produced the best results: the lowest MSE and the highest R².

This observation mirrors reality: health is not a straight line; it is a system of overlapping consequences. Tree-based methods are very effective when interactions are more predictive than single variables.
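The comparison above can be sketched along the following lines. This is an illustration on synthetic data with an assumed non-linear term, and the hyperparameters (100 trees, default SVR settings) are assumptions, since the article does not state them; the numbers it prints will not reproduce the reported scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data with a non-linear "sweet spot" term (assumption, not the Kaggle data).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 7))
y = 70 + 8 * X[:, 0] - 5 * X[:, 1] - 4 * X[:, 2] ** 2 + rng.normal(0, 3, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Support Vector": SVR(),  # default RBF kernel; the article's settings are not stated
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.2f}, R2={results[name]['r2']:.3f}")
```

Scoring every model on the same held-out rows is what makes the MSE and R² values directly comparable.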

Evaluating Final Model Performance on Unseen Test Dataset

To evaluate the Random Forest model on unseen data, I tested it on the reserved test dataset. The first plot, Residuals vs. Predicted Values, shows how the prediction errors (residuals) are distributed across the predicted health scores. Most residuals cluster around zero, indicating accurate predictions; a few outliers are expected in real-world data.

Residuals vs. Predicted Values

Distribution of Residuals

The second plot displays the Distribution of Residuals, that is, how the errors are distributed. The residuals are approximately normally distributed and centered around zero, indicating that the model does not consistently over- or under-predict health scores. Together, the two plots suggest that the Random Forest model generalizes well and is stable when predicting health for new, unseen individuals.
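A rough sketch of the residual check behind these two plots, without the plotting itself. The data is synthetic and the model settings are assumed; the point is only to show how residuals are derived and summarized.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 7 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 7))
y = 70 + 8 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 3, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Residual = actual minus predicted; values near zero mean accurate predictions.
predicted = model.predict(X_test)
residuals = y_test - predicted

# A mean close to zero suggests no systematic over- or under-prediction.
print(f"mean residual: {residuals.mean():.2f}")
print(f"residual std:  {residuals.std():.2f}")

# Bin counts approximate the "Distribution of Residuals" histogram.
counts, bin_edges = np.histogram(residuals, bins=10)
print(counts)
```

Plotting `predicted` against `residuals` gives the first figure, and a histogram of `residuals` gives the second.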

Anticipated Health Scores based on Test Data

Once the Random Forest model had been trained and evaluated, I produced predictions on the test dataset. The figure shows the predicted Health Scores for individuals based on their lifestyle characteristics, including age, BMI, diet, exercise, sleep, alcohol consumption, and smoking.

Health score result on unseen test data

Most of the predicted scores fall within a believable range of 60–100, indicating that the model captures general health patterns. The values vary with each individual's lifestyle: higher scores generally correspond to positive lifestyle choices (balanced diet, exercise, sleep), and lower scores to negative ones (increased alcohol consumption, smoking, poor diet). This suggests that the model can reasonably estimate health outcomes for new individuals from lifestyle-based inputs.

Interpreting What the Model Learned

Once trained, the model revealed which features mattered most. Feature-importance analysis ranked predictors as follows:

Exercise: The Foundation

Consistent exercise was dominant over all other influences. Even low levels of activity had significantly better predicted scores than no activity at all.

Diet Quality: The Multiplier

Diet quality magnified the benefits of exercise. Poor nutritional choices suppressed any gains from exercise and reinforced the notion that health is synergistic.

Sleep Hours: The Hidden Moderator

7–8 hours of sleep was the sweet spot; less or more decreased the predicted scores slightly, consistent with the recovery window that our biology prefers.

BMI and Age: The Moderators

Age and BMI, the less behavioral influences, accounted for some of the baseline wellness. Lifestyle choices moderated their effect; active older adults occasionally had better predicted scores than sedentary younger adults.

Alcohol and Smoking: The Detractors

While moderate alcohol consumption had only a slight effect, smoking had a substantial negative influence on predicted scores. Overall, the model confirmed what hundreds of years of medicine have learned.
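A ranking like the one discussed above can be read from a fitted random forest, roughly as sketched here. The feature names follow the article, but the data is synthetic and the weights that generate the target are hypothetical, so the ordering printed here illustrates the mechanism and is not the project's result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = ["Age", "BMI", "Exercise_Frequency", "Diet_Quality",
                 "Sleep_Hours", "Smoking_Status", "Alcohol_Consumption"]

# Synthetic features; the target weights below are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 7)), columns=feature_names)
y = (
    10 * X["Exercise_Frequency"]
    + 6 * X["Diet_Quality"]
    - 4 * X["BMI"]
    + rng.normal(0, 2, 500)
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances sum to 1; larger means the trees relied on it more.
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances.round(3))
```

These importances describe how much the trees used each feature, not the direction of its effect, so they are best read alongside the scatter plots and correlations.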

What the Numbers Explain

In addition to having improved accuracy, the model taught us three important lessons:

1. Health is explainable: We are able to use data to measure what we intuitively know, namely how our habits interact to create health benefits (or not).

2. Interdependence is better than independence: A single behavior (e.g., exercise) without balance (e.g., sleep, diet) will not yield a substantial result.

3. Behavioral data is predictive: Even without clinical measures of traditional biomarkers, such as cholesterol or blood pressure, lifestyle data alone was predictive of meaningful outcomes.

This provides an avenue for early, data-informed assessments of health and wellness — particularly for individuals or communities with limited access to health care facilities.

The Larger AI Context

Predicting health scores is an application of supervised learning, in which systems learn from labeled examples.

However, health data can bring unique challenges:

· Noise & Subjectivity: “Diet quality” or “exercise frequency” typically relies on self-reported measures, so the model needs to learn how to handle this uncertainty.

· Non-stationarity: Habits may change, and what is predictive today may not be tomorrow.

· Bias & Fairness: A model trained on a population with particular characteristics may underperform when applied to populations with different characteristics.

· Interpretability: In a healthcare scenario, transparency is not optional; a prediction is only usable if the practitioner can explain its rationale to the patient, who is inherently trusting them under a social contract.

This means model accuracy is a step, and only a step; responsible AI in health requires, at minimum, interpretability, fairness, and ethical stewardship.

The Importance of Machine Learning in Health Analytics

Conventional health assessments are based on infrequent medical tests.

Machine learning, on the other hand, provides a complementary perspective — assessing health continuously based on behavior.

This will enable:

· Detection of abnormalities before symptoms appear.

· Encouragement of preventive versus reactive care.

· Personalized feedback to individuals.

· Contextual information to clinicians beyond clinical data — lifestyle information.

Imagine a system that reads your weekly data on movement, diet, and sleep, and alerts, “Your predicted health score dropped 5 % this week. Your sleep decreased, and alcohol consumption increased.”

This level of micro-feedback could positively move public health from crisis-oriented to wellbeing-oriented.

Insights from a Data Science Perspective

Developing the model also taught some more general lessons about the craft of machine learning.

1. Data Context is Everything

If the model loses sight of the biological significance of its features, it may learn spurious correlations. The role of a data scientist in healthcare is to combine an understanding of the underlying biology with algorithmic skills.

2. Graphs Provide Harder Intuition

I found that exploratory plots revealed relationships I could not see from metrics alone. Because of this, the visualizations became the most important means of interpretation.

3. Simplicity Scales Better

More complicated models are not always the best. Linear regression may feel too simple, but it gives us a lot of interpretability, which is worth a great deal when sharing findings with non-technical audiences.

4. Metrics Need Meaning

An R² of 0.84 is only impressive if you can do something with the model's findings. The model needs predictive power that transfers to human impact.

Practical Use across a variety of fields would include:

· Preventative Health: Hospitals could utilize AI screening based on lifestyle characteristics to identify at-risk patients.

· Wearable Devices: Smartwatches could report a Health Score based on all daily activity, combined with sleep data that would be personalized on a day-to-day basis.

· Employer Wellbeing: Employers could offer a dashboard that anonymously reports healthy characteristics and behaviors.

· Personal Advisors: Platforms driven by AI could provide tailored health recommendations based on data and the patient’s Health Score.

None of these possibilities is far-fetched. They are quite feasible with responsible applications of machine learning, as described in this project.

Future Outlook

This was only a pilot project; the larger vision includes:

· Incorporate Real Data: Leverage large-scale public-facing data (e.g. NHANES, WHO) to test the findings.

· Broaden Variables: Include variables such as stress, hydration, heart rate variability, or social activities.

· Temporal Models: Include recurrent neural networks (RNNs/LSTMs) to study outcomes across timeframes.

· Build a Dashboard: Design an interactive application in which the user inputs their lifestyle data and the model returns information.

· Create Explainable AI (XAI): Use SHAP or LIME to study the effect of each variable on the model's outputs.

Ultimately, it is about producing personalized health intelligence, a living model of sorts that continues to learn in tandem with each unique person or patient.

Ethical Considerations

When you utilize predictive health analytics, you have an obligation.

· Privacy: Health information is incredibly sensitive. Systems must be designed so that anonymization is automatic and that data will remain secure.

· Bias: A model trained on one demographic may not predict accurately for another. Therefore, checking for bias should be an ongoing task.

· Over-reliance: Predictions should inform, not drive, decisions and actions in healthcare; they should always supplement the judgement of human professionals rather than replace it.

Ethical AI is not only about protecting the individual — it’s about ensuring the public maintains its trust in the role of technology in healthcare.

Final Thoughts

Health is an intensely personal experience; however, data, while individual, can reveal shared patterns that go beyond the individual lens.

In my work, I came to realize that every data point (every missed meal, every hour of sleep) conveyed a piece of a bigger story. Machine learning is at its best when it converts that story into knowledge. Machine learning does not replace human wisdom; it enhances it.

When we provide AI with the right data, AI does not just predict human health; it supports understanding how to sustain it.

The future of wellness will be shaped by this emerging partnership: human intuition augmented with machine precision.

You can find this full project on https://github.com/puspitachy/Health-score-prediction-from-Lifestyle-data. Thanks for reading.

References:

Khera, A. V., et al. (2016). Genetic Risk, Adherence to a Healthy Lifestyle, and Coronary Disease. New England Journal of Medicine, 375(24), 2349–2358. https://www.nejm.org/doi/full/10.1056/NEJMoa1605086
Schwingshackl, L., et al. (2021). Adherence to a Healthy Lifestyle and Mortality in Adults: A Systematic Review and Meta-Analysis. BMJ, 375, e068302. https://www.bmj.com/content/375/bmj-2021-068302
Kaggle Dataset: Lifestyle and Health Data for Predictive Modeling https://www.kaggle.com/
