Introduction
The study of diet and its relationship with health outcomes is fundamental in epidemiological research. Evidence consistently shows that adults adhering to a healthy diet enjoy longer lifespans and reduced risks of chronic diseases1,2,3. Traditionally, dietary data have been captured using methods like Food Frequency Questionnaires or 24-h dietary recalls. However, these approaches are prone to underreporting, as they rely on memory and self-estimation, which can lead to inaccuracies in portion size assessment and recall bias4.
In recent years, dietary assessment has increasingly turned to technology, particularly diet-tracking mobile apps, which have gained traction in both academic and clinical settings5,6. These tools allow for real-time diet logging and have demonstrated promising results in improving health outcomes7. Quality-focused diet logging has been linked to cardiovascular benefits and the reliable collection of dietary data for analysis against energy expenditures5,7.
These methods rely on connecting a person’s food intake to nutritional information. Food composition databases (FCDBs) are a common tool leveraged for this purpose, providing detailed nutrient profiles8,9. These databases provide standardized nutrient values for thousands of food items, enabling researchers to analyze dietary patterns with greater precision. Intake-logging platforms such as ASA24 automate the step from self-reported 24-h recalls to nutrient totals by mapping each food entry onto several reference FCDBs, streamlining large-scale dietary assessment for researchers10. Similarly, FoodRepo integrates barcoded food logging with FCDBs for large-scale digital nutrition studies11.
However, a major hurdle in this integration is the incompleteness of FCDBs12. Many of these databases lack comprehensive nutrient data, requiring researchers to infer missing values from similar foods in other FCDBs13. Additionally, regional variability in nutrient values poses another challenge, as differences in soil composition, climate, agricultural practices, and fortification policies can lead to inconsistencies in nutrient profiles14. Despite these challenges, previous research has demonstrated that, for the majority of nutrient features, intake estimates derived from FCDBs across different regions remain largely comparable, supporting their use in cross-regional dietary assessments15,16,17.
Several approaches have been developed to address the challenge of missing or incomplete nutrient data in FCDBs. The United States Department of Agriculture (USDA) employs standardized procedures to estimate nutrient values when analytical data are unavailable, ensuring more complete profiles for dietary assessment18. Other efforts include statistical frameworks such as MIGHT, which selects optimal entries across FCDBs to improve imputation reliability, and machine learning methods like denoising autoencoders, which predict missing values with high accuracy12,19. A notable example is the use of word embedding models to align food composition data between FCDBs, capturing semantic relationships between food items to address inconsistencies in naming and description20. Building on these ideas, we introduce NutriMatch, a method that similarly leverages text embeddings and similarity metrics to align food entries across FCDBs, but offers key advantages over prior approaches. First, NutriMatch supports multi-language alignment, enabling harmonization of food databases in different languages without requiring translation. Second, while previous approaches rely on manual validation by nutrition experts20, NutriMatch incorporates a large language model (LLM) to serve as an automated judge, verifying the contextual equivalence of matched items.
NutriMatch semantically aligns food items across multiple FCDBs in parallel, merging complementary information from every source. When one database omits a nutrient, the method taps partner databases that do report it, directly imputing the gap and expanding each item’s nutrient panel. We applied NutriMatch to several FCDBs, including USDA SR Legacy, USDA FNDDS, Tzameret (Israel), MEXT (Japan), the Bahrain Food Composition Table, and AUSNUT (Australia)21,22,23,24,25,26. By enriching the reference FCDB through semantic alignment, NutriMatch supports more complete nutrient profiling and enables robust, cross-regional diet-health analyses.
To demonstrate the utility of NutriMatch in a large, real-world setting, we first applied it to the Human Phenotype Project (HPP)—a longitudinal study of >10,000 Israeli adults (40–70 y) who log their diet continuously via a mobile app linked to an Israel-specific FCDB. NutriMatch expanded the database from 21 to 151 nutrients, enabling finer-grained analyses of diet–phenotype relationships. As HPP now recruits additional cohorts in Japan, Abu Dhabi, and the United States, NutriMatch will be used prospectively to harmonize local FCDBs and fill nutrient gaps, thereby preserving comparability as the project becomes multinational.
We then evaluated how well NutriMatch generalizes by reproducing the entire pipeline in the Australian PREDICT trial27—a randomized study of 138 adults with prediabetes or early-stage type 2 diabetes. Specifically, we (i) enriched PREDICT diet logs with the same 151 nutrients, (ii) quantified imputation accuracy against AUSNUT reference values, and (iii) tested whether prediction models trained in HPP transfer to PREDICT without retraining. This external validation provides a stringent assessment of NutriMatch’s portability across both geography and clinical context.
Results
Novel alignment methodology
To enable enrichment and cross-database harmonization, NutriMatch uses a semantically aware alignment methodology that captures the contextual meaning of food items. It begins with an extraction phase, gathering descriptive data—names, categories, descriptions, and nutrient profiles—in multiple languages and formats from various FCDBs (Fig. 1a). Next, in the standardization phase, these data are processed by an LLM that produces structured outputs, unifying database formats and ensuring consistent naming, categorization, and descriptions across datasets (Fig. 1b). This consistency reduces errors in subsequent matching.
Fig. 1: Overview of the NutriMatch methodology.
a Extraction of food item descriptions, including names, categories and nutrient profiles, from multiple food composition databases (FCDBs). b Standardization of extracted data using a large language model (LLM) to ensure consistency across sources. c Embedding of food items into a semantic space, visualized using UMAP, showing clustering of conceptually similar items (for example, “Courgette” and “Zucchini”). d Zoomed-in view of panel c, showing matching and nutrient imputation based on cosine similarity between embeddings, validated by an LLM. A missing Vitamin D₂ value for “Zucchini” is imputed from its match “Courgette,” resulting in a more complete nutrient profile. e Illustrative example of daily mean nutrient intakes for four individual participants from different cohorts (nutrients were drawn from the Tzameret FCDB), where darker blue indicates a higher absolute intake of that nutrient by the participant, while lighter blue/white indicates a lower intake; nutrients imputed by NutriMatch are highlighted in green. f SHAP beeswarm plot for the prediction of folic acid levels in blood tests from the HPP cohort (Methods). Each dot is one participant, where dots to the right raise the prediction, dots to the left lower it, and blue-to-red shows low-to-high intake. Nutrients labeled in green were imputed by NutriMatch rather than measured directly.
Following standardization, the next step converts the standardized textual information for each food item into numerical embeddings that represent the semantic meaning of the item (Fig. 1c). These embeddings enable comparisons based on broader descriptions rather than exact text matches. Afterwards, in the matching and validation phase, embeddings are compared using cosine similarity to identify similar food items across databases. An LLM is employed again to validate these matches, ensuring their contextual relevance and accuracy. For example, “Courgette” is matched with “Zucchini,” and missing nutritional information (e.g., Vitamin D2) is imputed (Methods) based on the matched item’s profile (Fig. 1d).
This multi-step pipeline aligns food items across databases and enriches their profiles through semantic imputation. We applied NutriMatch to augment FCDBs linked to dietary records, yielding extended nutrient features per participant. Figure 1e shows an illustrative heatmap of daily mean nutrient intakes for four participants, with imputed values highlighted in green to denote NutriMatch-enriched entries. These expanded profiles are then used to predict clinical phenotypes, such as folic acid levels in blood tests, from the dietary data of the Israeli HPP cohort. A SHAP beeswarm plot (Fig. 1f) displays the relative importance of both measured and imputed nutrients, with Vitamin C emerging as the dominant feature. Low intakes (blue, negative SHAP values) markedly depress the model-predicted serum folate, whereas higher intakes (pink/red, positive SHAP values) elevate it, consistent with both observational and randomized evidence that vitamin-C supplementation increases circulating folate28,29. Conversely, elevated dihydrophylloquinone (Vitamin K (dK)) intake tends to be associated with reduced predicted serum folate in the model.
Validation of imputation and nutrient concordance across databases
We validated NutriMatch’s imputation quality and assessed nutrient concordance across matched foods from multiple FCDBs. Initially, we masked each nutrient in the AUSNUT database, imputed its value with NutriMatch, and then compared the imputed values to the original measurements, demonstrating strong correlation (Pearson R 0.83 ± 0.12) and minimal deviation (Supplementary Fig. S1). We further evaluated nutrient agreement across three databases (AUSNUT, Tzameret, and USDA SR Legacy) by computing correlations for the 37 nutrients they share, specifically analyzing matched food items without applying imputation (Supplementary Fig. S2). Pair-wise nutrient-intake correlations (Fig. S2, left) were modest: ρ = 0.64 for PREDICT ↔ Tzameret, 0.83 for Tzameret ↔ SR Legacy, and 0.70 for SR Legacy ↔ PREDICT. All three fall below an empirically derived “best-case” reproducibility band (Fig. S2, right; ρ ≈ 0.81–0.99) based on replicate analyses from USDA’s Foundation Foods dataset (Methods). Such discrepancies arise naturally due to inherent analytical variability (e.g., differences in laboratory detection limits), methodological variations in database compilation (e.g., recipe formulations, salt inclusion), and biological differences across food samples (e.g., cultivar variations, animal feed, regional soil and climate)15,16.
Two-dimensional density plots illustrate one notable scenario leading to nutrient mismatches—differences in detection limits between databases (Supplementary Fig. S3). Foods rich in specific nutrients exhibit strong agreement across different databases. However, AUSNUT reports nutrient values down to lower detection thresholds compared with Tzameret, causing discrepancies at the low end of the nutrient range for EPA and vitamin B₁₂.
To further investigate extreme discrepancies, we identified matched foods exhibiting the largest absolute nutrient differences between AUSNUT and Tzameret, cross-validating with USDA SR Legacy. Supplementary Tables S4–S6 summarize examples of top discrepancies for EPA, vitamin B₁₂, and iron. Most discrepancies reflect human transcription errors, genuine variations due to regional recipes and fortification differences, or analytical inconsistencies. Manual inspection by dieticians occasionally detects incorrect automatic matches, but given the high correlation coefficients (Supplementary Fig. S2), such mismatches have a relatively minor effect.
Validation of 2-week diet logging as a reliable indicator of average nutrient intake
Participants in the HPP attend clinic visits every 2 years for comprehensive clinical assessments. After each visit, 2 weeks of continuous glucose monitoring (CGM), sleep data, and diet logging data are collected via a dedicated app (Fig. 2a), which links dietary data to the HPP FCDB. Using NutriMatch, we align the HPP FCDB with three external databases (SR Legacy, Tzameret, and FNDDS) to enrich nutrient profiles. Before using the nutrient intake data from participants in prediction tasks, we first validate that the 2-week dietary snapshot reliably reflects participants’ long-term dietary patterns.
Fig. 2: Patterns and correlations in Diet Logging data.
a Daily diet logging patterns for an individual participant, showing food items recorded, logging timestamps, and nutrient composition breakdown. b Radar plot showing high correlation of nutrient intake across days within the same participant (green, same visit) and across visits in the same participant (orange) highlighting the stability of individual dietary patterns. The series for different participants is omitted here because correlations are effectively zero and not visually distinguishable. c Density histograms depict Pearson correlation coefficients calculated between pairs of vectors of normalized mean daily intakes for all assayed nutrients. Blue bars correspond to comparisons between two different participants, orange bars to repeat study visits from the same participant, and green bars to different reporting weeks within a single visit for the same participant. Median correlations are highest for week-to-week comparisons within a visit (r = 0.67), intermediate across visits in the same individual (r = 0.48), and are virtually uncorrelated across different individuals (r = 0.0007), illustrating how effective 2 weeks of dietary logging is at capturing reproducible nutrient-intake patterns within individuals.
We analyzed dietary data from 10,197 participants, focusing on their intake of 20 common nutrients over a continuous 14-day period (Supplementary Table S1). We segmented the data for each participant into two groups: week 1 and week 2. This segmentation allowed us to assess the consistency of nutrient intake within individuals over the 2 weeks, compare it across different visits, and examine it relative to the intake of other participants. Our analysis revealed high consistency in the intake of specific nutrients (including protein, carbohydrate, lipid, sodium, dietary fiber, and alcohol) for the same participant in the same visit, as illustrated by the radar plot (Fig. 2b). This stability extended to years apart, indicating enduring dietary behaviors. Complementing this nutrient-level view, correlations between dietary profiles, represented by overall nutrient intake, were significantly higher within the same participant (both within a visit and across visits years apart) compared to correlations between different participants, as shown by the density histograms (Fig. 2c). Collectively, these findings indicate that two-week dietary loggings provide a robust snapshot of typical nutrient intake, effectively capturing stable long-term dietary behaviors.
Application of methodology: prediction with extrapolated nutrients
Using NutriMatch, we expanded the HPP FCDB from its original 21 nutrients to a total of 151 nutrients by harmonizing features from multiple external sources (148 from SR Legacy, 65 from FNDDS, 77 from Tzameret, and 43 from AUSNUT), thereby substantially enhancing its nutritional granularity. To assess the impact of this enrichment, we developed predictive models to examine the relationship between dietary intake and a range of phenotypes, testing three model configurations: (1) a baseline model using only age as a predictor, (2) an extended model incorporating basic nutrients (macronutrients and sodium) available in the original HPP FCDB, and (3) a fully enriched model integrating the additional nutrient features introduced through NutriMatch. This comparative framework allowed us to quantify the contribution of NutriMatch-derived features in improving diet-based phenotype predictions.
Prediction accuracy improved significantly as more nutrients were included, with the largest gains observed for body-fat indices, waist circumference, and serum folate concentration (Fig. 3a). We also analyzed the correlation between Nightingale blood biomarkers and relative nutrient consumption30. HDL biomarkers correlated with various fat-related nutrients, ketone bodies correlated strongly with total fat intake, and amino acids and renal function metabolites correlated strongly with protein groups of nutrients (Fig. 3b).
Fig. 3: Validation with extrapolated nutrients.
a Predictive model performance across multiple phenotypic traits, stratified by gender. Three models are compared: age-only (blue), age + basic nutrients (orange), and age + all nutrients (green). b Correlation between Nightingale blood biomarkers (X axis) and relative nutrient consumption (Y axis), values below 0.1 were masked. c Obesity prediction at 2-year follow-up. ROC curves compare the age-only model (AUC = 0.55 ± 0.02), basic nutrient model (AUC = 0.63 ± 0.03), and full nutrient model (AUC = 0.67 ± 0.03).
Additionally, we evaluated the ability to predict overweight or obesity status two years after the baseline meeting using nutrient intake data. Predictive performance improved when incorporating all nutrients (ROC AUC: 0.67 ± 0.02) versus basic nutrients only (ROC AUC: 0.63 ± 0.03; Fig. 3c). These findings highlight the value of the enriched dataset in improving predictive accuracy and uncovering novel insights into diet-health relationships.
Validation in the Australian PREDICT Cohort
To assess the generalizability of our methodology, we applied NutriMatch to the Australian PREDICT cohort, a randomized controlled trial examining personalized diet interventions in individuals with prediabetes or early-stage type 2 diabetes (T2DM) commencing metformin treatment. This cohort (N = 138) includes detailed dietary logging and clinical measurements27. Using NutriMatch, we harmonized and extended the nutrient features in the PREDICT cohort’s diet loggings, creating an enriched nutritional profile. Consistent with earlier validation in this cohort, NutriMatch-imputed values remained highly concordant with their AUSNUT-reported counterparts (Fig. 4, Fig. S1, Table S7). This allowed for direct comparison with the HPP cohort, as the datasets share common phenotypic and dietary intake information.
Fig. 4: Correlation between imputed and original nutrient values for 5277 food items in the AUSNUT database.
Spearman rank correlations were calculated between the imputed nutrient values (obtained from external FCDBs with NutriMatch, excluding AUSNUT during imputation) and the original AUSNUT values for nutrients that appeared in ≥ 20 food items. Only correlations that were statistically significant (two-tailed p < 0.05) and had a Spearman coefficient |ρ | ≥ 0.40 are displayed.
Then we compared the distributions of enriched nutrient data between the matched Israeli HPP cohort and the Australian PREDICT cohort (Methods). We observed high agreement in macronutrient distributions with minor variations between populations (Fig. 5a).
Fig. 5: Cross-cohort nutrient comparisons and model performance.
a Ternary plot comparing the distribution of protein, fat, and carbohydrate intake in the Israeli HPP cohort (blue) and the Australian PREDICT cohort (red). b Box plot comparing fatty acid intake as a percentage of total energy between the Israeli (blue) and Australian (red) cohorts (p-values 0.008, 0.002 and <0.000001, respectively). c Pearson correlations between predicted and observed values for various body composition traits in male and female participants.
Next, we analyzed the dietary fatty acid composition (Fig. 5b) and found notable differences between the two cohorts, reflecting expected variations in culturally related preferences for dietary fat sources (i.e., more animal sources rich in saturated fat and less plant-based MUFA in the Australian cohort than in the Israeli cohort). Although the cohorts were matched by age, sex, and BMI, the PREDICT cohort consists of individuals with prediabetes and early-stage diabetes, whereas the HPP cohort is primarily healthy (Methods).
Finally, we applied HPP-trained predictive models to the Australian PREDICT cohort, adjusting for age and gender. The extended nutritional profile outperformed models using only basic nutrients (Methods) and covariates, demonstrating higher Pearson correlations across multiple phenotypic predictions (Fig. 5c). These results indicate that including a broader set of nutrients enhances model performance and improves generalizability across populations while maintaining key diet-health associations.
These findings highlight the feasibility of cross-cohort nutrient harmonization and the potential of integrated dietary models in diverse populations. The results suggest that the nutrient extension captures some generalizable underlying biological or behavioral signals.
Discussion
FCDBs are crucial resources for nutritional epidemiology, yet issues such as missing data, incompatible formats, and variations in nutrient measurement methodologies limit their effectiveness. NutriMatch addresses these challenges through a twofold use of LLMs: semantic alignment across multiple languages and databases, and validation of matched food items before nutrient imputation. This approach also enables NutriMatch to handle linguistic and conceptual nuances in food descriptions, challenges typically unresolvable by keyword-based matching, such as subtle linguistic variations, synonyms, and culturally specific food terms.
Two-week diet loggings demonstrated high reproducibility within individuals across weeks and clinic visits, allowing us to confidently use short-term dietary data as reliable proxies for long-term dietary patterns. Leveraging this stable estimation, combined with the expanded nutrient profile (from 21 to 151 nutrients) provided by NutriMatch, enabled models to significantly outperform simpler models that used only basic nutrients (macronutrients and sodium, adjusted for age and gender). Enhanced predictions were particularly evident for body-composition indices, continuous glucose monitoring metrics, and blood biomarkers.
Generalizability was demonstrated by applying HPP-trained models directly to the Australian PREDICT cohort without retraining. Despite differences in dietary habits, cultural contexts, and health conditions, NutriMatch-enriched models maintained or improved predictive accuracy across multiple phenotypes. This highlights that harmonized nutrient profiles can capture generalizable biological and behavioral signals across diverse populations.
While NutriMatch enhances FCDB interoperability and enriches existing dietary studies, its effectiveness depends on both the quality of NutriMatch’s automated matches, which may occasionally result in mismatches where expert oversight could be beneficial, and the underlying quality of the source FCDBs.
Factors such as analytical detection limits, regional fortification practices, agricultural policies, food processing, differences in recipes (quantities and precise ingredients), and seasonal variations in nutrient content can introduce biases. Another core limitation of food logging that can introduce bias is the oversimplification of food items, which is necessary for an effective diet log (e.g., pizza or lasagna logged without the ingredients and their proportions). Addressing these inherent biases through targeted chemical validations or independent validation cohorts remains an important area for future research. The rapid advancements in large language models suggest that the accuracy of semantic matching and nutrient imputations will continue to improve, further enhancing integration and accuracy across international FCDBs.
The transparent and interpretable framework of NutriMatch can enhance existing dietary studies, facilitate cross-cohort analyses, and support precision-nutrition pipelines by providing harmonized and comprehensive nutrient data. Improved local FCDBs resulting from NutriMatch implementation could bolster nutritional surveillance, refine dietary guidelines, and assist public health nutrition policies.
Overall, NutriMatch offers a practical and evolving approach for harmonizing dietary databases, helping strengthen the foundation for diet–health research and informing future public health efforts.
Methods
Data collection
We utilized data from the 10K Project, a prospective human cohort study involving over 10,000 healthy participants aged 40–70 at recruitment. The study focuses on in-depth clinical, physiological, behavioral, and multi-omic profiling. Specific exclusion criteria were applied to maintain the relevance and homogeneity of the cohort31.
Dietary data were collected via continuous real-time diet logging. Participants recorded daily food and beverage consumption using a dedicated mobile app for a continuous 2-week period. The HPP FCDB linked to this app contains 7765 unique food items, categorized into 33 distinct food categories and associated with 718 short food names for high-level grouping.
As part of our external validation, we utilized data from the Australian PREDICT cohort27. It is a randomized controlled trial of personalized diet interventions in individuals with prediabetes or early-stage T2DM on metformin (N = 138). Detailed dietary logging and clinical measurements were collected using a dedicated mobile app, as previously described27.
Ethical approval
All participants signed an informed consent form upon arrival at the research site. All identifying details of the participants were removed prior to the computational analysis. The 10K cohort study is conducted according to the principles of the Declaration of Helsinki and was approved by the Institutional Review Board of the Weizmann Institute of Science.
External databases
Our alignment process involved matching the HPP FCDB with several key external FCDBs. These databases were selected to provide comprehensive coverage of regional and global dietary habits:
USDA SR Legacy, a comprehensive source of nutritional data for U.S. foods that provides detailed profiles of macronutrients, vitamins, minerals, and bioactive compounds and is widely used in diet-related research21;
USDA FNDDS, primarily used for dietary intake surveys in the U.S., offers nutrient content, serving sizes, and food descriptions, frequently updated for public health research21;
Tzameret, an Israeli FCDB focused on nutrient data for locally consumed foods, essential for studying Israeli dietary patterns23;
MEXT (Japan) provides nutrient profiles of Japanese foods, reflecting regional dietary habits, and commonly used in studies of Japanese diets24;
Bahrain Food Database, developed by Bahrain’s Ministry of Health, provides essential nutritional data on local foods to support public health and dietary research25;
AUSNUT, the Australian food composition database, was developed for the 2011–2013 Australian Health Survey (AHS), providing detailed nutrient profiles for foods and dietary supplements consumed in Australia26.
Alignment methodology
Our alignment methodology follows four stages:
Dataset Standardization: We used structured outputs from LLMs to classify food item names and categories consistently across all datasets. This ensured uniformity in food classifications.
Embedding Projections: We converted food items into semantic embeddings using a model from OpenAI (https://platform.openai.com/docs/guides/embeddings). We used the “text-embedding-3-large” model to represent each food item as a vector of 3072 dimensions.
Matching: We employed cosine similarity as the distance metric to compare and match food items from different databases.
Validation with LLM: Finally, we used a prompt-based approach with an LLM to validate that the matched food items were indeed equivalent. The validation focused on ensuring that nutrients from one food item could be accurately imputed to the matched item.
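The matching stage above can be illustrated with a minimal sketch. It assumes embeddings have already been computed and stored as plain lists of floats; the toy three-dimensional vectors, the food names, and the helper names `cosine_similarity` and `top_matches` are hypothetical and stand in for the real 3072-dimensional embeddings. The LLM validation step is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_matches(query_vec, candidates, k=5):
    """Return the names of the k candidates most similar to query_vec.

    `candidates` maps a food name to its embedding vector.
    """
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in candidates.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy vectors (hypothetical values) standing in for real embeddings.
toy_db = {
    "Courgette, raw": [0.9, 0.1, 0.0],
    "Cucumber, raw": [0.6, 0.4, 0.1],
    "Beef, minced": [0.0, 0.2, 0.9],
}
zucchini = [0.88, 0.12, 0.02]
print(top_matches(zucchini, toy_db, k=2))
```

In this toy setup the query is closest to "Courgette, raw"; in a full pipeline the ranked candidates would then be passed to the LLM validation stage.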
Imputation methodology
To address missing nutrient data in FCDBs, NutriMatch employs a structured imputation strategy that integrates embedding-based matching and LLM-assisted validation. This approach ensures that missing nutrients are inferred based on the most robust and validated sources while maintaining transparency in decision-making.
Embedding-Based Candidate Selection: for each food item requiring nutrient imputation, we first identify the top 5 closest matches based on their embeddings. These embeddings, derived from a deep-learning model trained on food descriptions and nutrient compositions, enable semantic comparisons beyond simple keyword matching.
LLM Validation of Food Equivalence: the match between the original food item and the closest matches is then evaluated using an LLM. The LLM is prompted with structured queries to determine whether the candidate foods are nutritionally equivalent to the target food item (in our case, the standardized food item structure follows that of SR Legacy, containing a description and a category). If the LLM confirms equivalence, these matches are flagged as valid references for nutrient imputation.
While this automated validation reduces the need for manual expert review and enables greater scalability, occasional mismatches may still arise in edge cases where domain expertise could offer added value.
Hierarchical Dataset Ranking for Selection: we prioritize FCDBs based on their validation rigor and data robustness. Databases with stringent quality control measures—such as USDA Standard Reference (SR Legacy) and USDA FNDDS—are given higher priority over sources with less validation, such as Tzameret. This ranking ensures that imputed values are derived from the most reliable sources whenever possible.
Selecting the Closest Match for Nutrient Imputation: once the top LLM-validated match is identified, nutrient values are imputed sequentially, starting from the highest-ranked database. If a match is found within a highly validated FCDB, its nutrient composition is directly transferred. Otherwise, the best available match in the embedding space is selected to provide the missing values.
Post-Imputation Matching for Unresolved Cases: for food items without an exact LLM-confirmed match, we leverage the embedding space to identify the most similar food and assign its nutrient values. This ensures that all food items receive a complete nutrient profile, even when exact database matches are unavailable.
This systematic imputation methodology makes NutriMatch fully explainable, as every imputed nutrient can be traced back to a specific food item in a known FCDB. By combining semantic embeddings, LLM validation, and dataset prioritization, we enhance the completeness and reliability of dietary data while maintaining methodological transparency.
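The ranked, traceable imputation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the priority list `DB_PRIORITY`, the record layout, the nutrient keys, and all numeric values are hypothetical, and the embedding-space fallback for items without an LLM-confirmed match is omitted.

```python
# Hypothetical priority order: earlier entries are preferred as donors.
DB_PRIORITY = ["SR Legacy", "FNDDS", "Tzameret"]


def impute_missing(target, validated_matches):
    """Fill nutrients that are None in `target` from LLM-validated matches.

    target: dict mapping nutrient name -> value (None means missing).
    validated_matches: list of dicts with keys "db", "food", "nutrients".
    Returns (filled_profile, provenance), where provenance maps each
    imputed nutrient to the (database, food) it was taken from.
    """
    rank = {db: i for i, db in enumerate(DB_PRIORITY)}
    # Unknown databases sort after all ranked ones.
    ordered = sorted(validated_matches,
                     key=lambda m: rank.get(m["db"], len(rank)))
    filled = dict(target)
    provenance = {}
    for nutrient, value in target.items():
        if value is not None:
            continue  # measured values are never overwritten
        for match in ordered:
            donor_value = match["nutrients"].get(nutrient)
            if donor_value is not None:
                filled[nutrient] = donor_value
                provenance[nutrient] = (match["db"], match["food"])
                break
    return filled, provenance


# Toy example with made-up values.
zucchini = {"protein_g": 1.2, "vitamin_d2_ug": None, "iron_mg": None}
matches = [
    {"db": "Tzameret", "food": "Zucchini",
     "nutrients": {"vitamin_d2_ug": 0.1, "iron_mg": 0.4}},
    {"db": "SR Legacy", "food": "Courgette",
     "nutrients": {"vitamin_d2_ug": 0.0}},
]
filled, provenance = impute_missing(zucchini, matches)
print(filled)
print(provenance)
```

The provenance dictionary is what makes each imputed value traceable to a specific food item in a known FCDB, which is the explainability property the text emphasizes.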
Quantifying intra‑ and inter‑FCDB nutrient variability
We assessed the inter-database correlations using the shared nutrients. The three study databases, AUSNUT (PREDICT cohort), Tzameret, and SR Legacy, share 37 nutrients (non-imputed). After NutriMatch alignment, we retained every nutrient represented by at least 50 food items in each comparison (all 37 met this criterion). Match counts were 1964 foods for AUSNUT ↔ SR Legacy, 4132 for AUSNUT ↔ Tzameret, and 3409 for Tzameret ↔ SR Legacy. Because some of these nutrients have a large tail of zero values, Pearson correlations were computed nutrient-wise on log-transformed values (clipped at a minimum of 1e-5) for each two-way combination and are displayed in Supplementary Fig. S2.
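A minimal sketch of a log-scale Pearson correlation with a clipping floor, as described above, is given here. The function names and the toy values are hypothetical; the base of the logarithm is an assumption, although it does not affect the result because Pearson correlation is invariant to rescaling.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def log_pearson(x, y, floor=1e-5):
    """Pearson correlation on log10 values, clipping inputs at `floor`.

    Clipping keeps zero-valued nutrient entries finite on the log scale.
    """
    log_x = [math.log10(max(v, floor)) for v in x]
    log_y = [math.log10(max(v, floor)) for v in y]
    return pearson(log_x, log_y)


# Toy values for one nutrient in matched foods from two databases.
db_a = [0.0, 1.0, 10.0, 100.0]
db_b = [0.0, 2.0, 20.0, 200.0]
print(round(log_pearson(db_a, db_b), 4))
```

Working on the log scale prevents a few nutrient-dense foods from dominating the correlation, while the floor lets foods reported as zero remain in the comparison.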
To estimate the upper bound of reproducibility expected under ideal conditions, we used the Foundation Foods subset of USDA FoodData Central21, which includes repeated analytical measurements of the same food items, taken within the same country and with the same laboratory methods. Even in this best-case scenario, where all external sources of variability are minimized, nutrient values still vary because of intrinsic measurement noise. Of the 37 nutrients analyzed in our inter-database comparison, 25 were represented in Foundation Foods with ≥4 replicate determinations, yielding 10,076 food–nutrient pairs. Since only summary statistics (minimum, maximum, etc.) were available, we approximated the within-food standard deviation as σ ≈ (max − min)/4. This is the range rule of thumb, a widely used range-based estimator when an empirical variance is unavailable; it treats the observed range as spanning roughly ±2σ (for comparison, a uniform (rectangular) distribution would give σ = (max − min)/√12). We drew 100 pseudo‑observations from N(mean, σ²) for each pair and calculated log Spearman correlations across all non-identical food pairs. The 0.05/0.95 percentiles of this distribution (ρ ≈ 0.81–0.99) define an empirical “best-case” reproducibility band against which inter-database correlations were compared.
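One plausible reading of this simulation is sketched below for a single nutrient. It is an interpretation, not the authors' code: it assumes that each of the pseudo-observation draws forms one simulated replicate vector across foods and that correlations are taken between every pair of distinct vectors. All function names are hypothetical. Because Spearman correlation depends only on ranks, the log transform itself is omitted and only the clipping step is kept.

```python
import random

def ranks(values):
    """Average ranks (tied values share the mean of their positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def reproducibility_band(foods, n_draws=100, clip=1e-5, seed=0):
    """foods: list of (mean, minimum, maximum) for one nutrient.

    Returns simple lower-rank 5th and 95th percentiles of the Spearman
    correlations between all pairs of distinct simulated replicate vectors.
    """
    rng = random.Random(seed)
    draws = [
        # sigma from the range rule of thumb; draws are clipped from below
        [max(rng.gauss(m, (hi - lo) / 4), clip) for m, lo, hi in foods]
        for _ in range(n_draws)
    ]
    rhos = sorted(
        spearman(draws[i], draws[j])
        for i in range(n_draws) for j in range(i + 1, n_draws)
    )
    last = len(rhos) - 1
    return rhos[int(0.05 * last)], rhos[int(0.95 * last)]
```

With 100 draws this evaluates 4,950 correlations per nutrient, which is cheap for a few hundred foods but would benefit from a vectorized implementation at larger scale.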
Machine learning models
For regression and classification tasks, we utilized the LightGBM library, implementing a fivefold cross-validation approach to evaluate model performance. Dietary log data was preprocessed by including only days with a recorded intake of at least 800 kcal.
We compared three hierarchical feature subsets in our predictive models: (1) age and sex only, (2) basic nutrients (macronutrients and sodium) along with age and sex, and (3) all nutrients, including the basic set, expanded by NutriMatch imputation. Each subsequent subset fully contains the previous one, allowing clear assessment of incremental predictive value from additional nutrient features.
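The preprocessing and evaluation design can be sketched without committing to a particular model. The column names below are hypothetical, the fold assignment by index is a simplification chosen for illustration, and the model-fitting step is deliberately left out.

```python
MIN_KCAL = 800  # daily intake threshold stated in the text

# Hypothetical column names; each subset contains the previous one.
DEMOGRAPHICS = ["age", "sex"]
BASIC = DEMOGRAPHICS + ["protein_g", "fat_g", "carbohydrate_g", "sodium_mg"]
ALL_NUTRIENTS = BASIC + ["iron_mg", "vitamin_c_mg"]  # plus imputed nutrients
FEATURE_SETS = {"demographics": DEMOGRAPHICS, "basic": BASIC, "all": ALL_NUTRIENTS}

def valid_days(days):
    """Keep only logged days with at least MIN_KCAL of recorded intake."""
    return [d for d in days if d["energy_kcal"] >= MIN_KCAL]

def five_fold_indices(n, k=5):
    """Yield (train, test) index lists; sample i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test
```

In each fold, the training rows restricted to one entry of `FEATURE_SETS` would be passed to a LightGBM estimator (for example `lightgbm.LGBMRegressor`), and scores would be compared across the three subsets on the same folds.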
To compare macronutrient and micronutrient consumption between the Australian and Israeli cohorts, participants were matched based on age, gender, and BMI using propensity score matching.
Propensity score matching
Propensity score matching balances baseline covariates by pairing participants with similar estimated probabilities of group assignment based on age, gender, and BMI. Matching was carried out via nearest-neighbor selection without replacement to create comparable groups. The matched cohort, with aligned distributions of age, gender, and BMI, was then used for downstream effect estimation.
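A minimal sketch of greedy 1:1 nearest-neighbor matching without replacement follows. It assumes the propensity scores have already been estimated (for instance by a logistic regression on age, gender, and BMI). The function name, the processing order, and the optional `caliper` argument are assumptions, since the text does not specify them.

```python
def match_without_replacement(treated, controls, caliper=None):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    treated, controls: dict participant id -> propensity score
    caliper: optional maximum allowed score difference (an assumption;
             the text does not mention one)
    """
    available = dict(controls)
    pairs = []
    # Process treated participants in order of increasing score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is None or abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs
```

Greedy matching depends on the order in which participants are processed; optimal matching algorithms avoid this at higher computational cost.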
SHAP
For model interpretability, SHAP (SHapley Additive exPlanations) decomposes individual predictions into per-feature contributions, quantifying the extent to which each variable shifts the prediction from its baseline. Positive and negative SHAP values indicate upward or downward effects on the model output, respectively. Contribution distributions are summarized with a beeswarm plot: features are ordered by mean absolute SHAP value, each point represents a sample’s SHAP value for that feature, horizontal position denotes effect size and direction, and color encodes the raw feature value. This visualization simultaneously conveys feature importance and inter-sample variability in effect magnitude and direction.
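To make the decomposition concrete, the sketch below computes exact Shapley values for one sample by enumerating all feature subsets. It illustrates the underlying definition only: the `shap` library uses faster, model-specific algorithms and typically a background dataset rather than the single `baseline` vector assumed here, and the function name is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    predict:  function taking a list of feature values
    x:        the sample being explained
    baseline: reference values substituted for "absent" features
    """
    n = len(x)

    def value(subset):
        # Features outside `subset` are replaced by their baseline value.
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

The contributions sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the additivity property the beeswarm plot relies on. The enumeration grows exponentially with the number of features, so it is practical only for small examples.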
Data availability
The individual user data used in this paper are part of the Human Phenotype Project (HPP) and are accessible to researchers from universities and other research institutions at: https://humanphenotypeproject.org/data-access. Interested bona fide researchers should contact info@pheno.ai to obtain instructions for accessing the data. The embeddings for the food items and the matched items across the different FCDBs are available in the GitHub repository.
Code availability
Implementation of NutriMatch is available at: https://github.com/TalShor/NutriMatch.
References
Neuhouser, M. L. The importance of healthy dietary patterns in chronic disease prevention. Nutr. Res. 70, 3–6 (2019).
Bermingham, K. M. et al. Effects of a personalized nutrition program on cardiometabolic hea