ABSTRACT
The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models that automatically generates structured data tables according to users’ queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates SciDaSynth’s effectiveness in producing high-quality structured data more efficiently than baseline methods. We discuss design implications for human–AI collaborative systems supporting data extraction tasks.
1 Introduction
The rapid advancement of scientific research has led to unprecedented growth in research literature across various disciplines. Extracting and synthesizing structured knowledge from this vast information landscape has become increasingly crucial for advancing scientific understanding and informing evidence-based decision-making. Within this process, data extraction—the identification and structuring of relevant information from scientific literature—is a critical stage where efficiency and precision are paramount (Taylor et al. 2021), particularly in time-sensitive domains. A relevant example is the early COVID-19 pandemic, where researchers urgently needed to determine the safety of breastfeeding for women with COVID-19 (World Health Organization 2020). This required the rapid and accurate extraction of data on experimental conditions (e.g., population demographics, study settings) and health outcomes from a rapidly expanding body of literature.
The structured data resulting from this process, often organized into tables, is essential for systematic comparison across studies, quantitative meta-analyses, and drawing comprehensive conclusions from diverse sources of evidence. Such data is crucial for organizations like the World Health Organization (WHO) in developing and disseminating timely, evidence-based guidelines (World Health Organization 2014).
Despite its importance, data extraction remains a cognitively demanding and time-consuming task. Researchers often need to manually distill relevant information from multiple papers, switching between different documents and data entry tools. This process is not only inefficient but also prone to inconsistencies and errors, highlighting a critical need to streamline the data extraction process. Addressing this need presents several challenges: (1) Multimodal information in literature. Scientific papers often contain diverse modalities of information, such as text, tables, and figures. The multimodality adds complexity to identifying the relevant information within each modality scattered throughout a paper and integrating it into a coherent and structured format. (2) Variation and inconsistencies across literature. The style, structure, and presentation of the papers can significantly vary from one to another. The variation and inconsistencies make it difficult to standardize the information. For example, the same concepts may be described using different terminologies or measurement units. (3) Flexibility and domain adaptation. Users may have varying research questions for a collection of papers, and these papers can span across different domains. Therefore, the system must be flexible enough to adapt to the diverse data needs of different users and domains.
Existing approaches to address these challenges have shown promise but face several key limitations. Tools and methods (Lo et al. 2023; Beltagy et al. 2019; Kang, Sun, et al. 2023; Nye et al. 2020; Lehman et al. 2019) focusing on extracting keywords, tables, and figures from documents help narrow the scope of relevant information but lack the flexibility to accommodate diverse extraction needs. Users still need to manually explore, select, and integrate relevant data. Question-answering systems (Open AI 2024; Anthropic 2024) have improved flexibility by allowing users to formulate information needs as queries about document content. However, these systems often produce unstructured text outputs, requiring significant effort to organize into desired structures. While some systems (Elicit 2023) present information in tabular formats, they fall short in standardizing the data and resolving cross-document inconsistencies.
To address these limitations, we present SciDaSynth, an interactive system designed to empower researchers to efficiently and reliably extract and structure data from the scientific literature, especially after completing a paper search and screening. By leveraging large language models (LLMs) within a retrieval-augmented generation framework (RAG) (Lewis, Perez, et al. 2020), the system interprets users’ queries, extracts relevant information from diverse modalities in scientific documents, and generates structured tabular output. Unlike standard prompting, which relies solely on a model’s pretrained knowledge (and can be outdated due to the LLM’s training cutoff), RAG dynamically retrieves and integrates up-to-date, domain-specific information into prompts. By injecting the retrieved information into the generation process, RAG reduces hallucinations and improves factual accuracy (Ji et al. 2023). To further ensure data accuracy, the system incorporates a user interface that establishes and maintains connections between the extracted data and the original literature sources, enabling users to iteratively validate, correct, and refine the data. Additionally, SciDaSynth offers multi-faceted visual summaries of data dimensions and subsets, highlighting variations and inconsistencies across both qualitative and quantitative data. The system also supports flexible data grouping based on semantics and quantitative values, enabling users to standardize data by manipulating these groups and performing data coding or editing at the group level. In addition, follow-up query instructions can be applied to specific data groups for further refinement. We conduct a within-subject study with researchers from nutrition and natural language processing (NLP) domains to evaluate the efficiency and accuracy of SciDaSynth for data extraction from research literature. The quantitative analyses show that using SciDaSynth, participants could produce high-quality data in a much shorter time than the baseline methods. Moreover, we discuss user-perceived benefits and limitations.
In summary, our major contributions are:
SciDaSynth,1 an interactive system that integrates LLMs to assist researchers in extracting and structuring multimodal scientific data from the extensive literature. The system combines flexible data queries, multi-faceted visual summaries, and semantic grouping in a cohesive workflow, enabling efficient cross-document data validation, inconsistency resolution, and refinement.
The quantitative and qualitative results of our user study reveal the effectiveness and usability of SciDaSynth for data extraction from the scientific literature.
Implications for future system designs of human–AI interaction for data extraction and structuring.
2 Related Work
2.1 Structured Information Extraction From Literature
The exponential growth of scientific papers has generated large-scale data resources for building LLMs and applications for information extraction tasks, such as named entity recognition and relation extraction in scientific domains. These models fall into two broad categories: encoder-only (non-generative) LLMs and generative (autoregressive) LLMs. Encoder-only models (often called “autoencoding” models) are trained to produce compact vector representations of input text, which are useful for downstream classification tasks (Lewis, Ott, et al. 2020). For example, SciBERT (Beltagy et al. 2019) adapts the BERT architecture by pre-training on millions of scientific abstracts and full-text papers; it excels at classification, entity recognition, and retrieval tasks but is not designed to generate new text. Variants like OpticalBERT and OpticalTable-SQA (Zhao, Huang, et al. 2023) fine-tune such models on domain-specific corpora (e.g., biomedical or materials science) to boost performance on specialized extraction tasks. Generative, or autoregressive, LLMs predict the next word in a sequence, enabling them to create fluent text and even structured outputs directly from user prompts. Models such as GPT-4 (OpenAI 2024) (and earlier InstructGPT) have hundreds of billions of parameters and have been further refined with techniques like instruction tuning and reinforcement learning from human feedback. This training paradigm allows zero-shot or few-shot prompting: users can describe an extraction task in natural language and receive structured results (e.g., JSON or CSV) without any additional fine-tuning. In materials science, Dagdelen et al. (2024) demonstrated how GPT-4 could extract entities and relations and output them as JSON records with high fidelity. In this paper, we chose GPT-4 series generative models over open-weight alternatives (e.g., Llama 3/4) for two key reasons. First, they are more reliable and accurate in following users’ instructions (e.g., producing structured output) and handling complex, domain-specific queries. Second, unlike open-weight models that usually need data-intensive fine-tuning for domain adaptation, commercial LLMs like GPT-4 work well out-of-the-box in low-resource settings across diverse domains, offering great flexibility.
Data embedded in the scientific literature is another particular focus for extraction. Such data is usually stored in tables and figures within the PDFs of research papers, and many toolkits are available to parse PDF documents, such as PaperMage (Lo et al. 2023), GROBID (GROBID 2008), Adobe Extract API (Adobe Inc. n.d.), CERMINE (Tkaczyk et al. 2015), GeoDeepShovel (Zhang et al. 2023), and PDFFigures 2.0 (Clark and Divvala 2016). Here, we leverage an off-the-shelf toolkit to parse PDF text, tables, and figures. Beyond these research tools, Elicit (Elicit 2023) is commercial software that facilitates systematic reviews. It enables users to describe the data to be extracted and creates a data column to organize the results. However, it does not provide an overview of the extracted knowledge to help users handle variation and inconsistencies across different research literature. Here, we also formulate the knowledge as structured data tables. Moreover, we provide multi-faceted visual and text summaries of the data tables to help users understand the research landscape, inspect nuances between different papers, and verify and refine the data tables interactively.
2.2 Tools for Literature Reading and Comprehension
Research literature reading and comprehension is cognitively demanding, and many systems have been developed to facilitate this process (Head et al. 2021; August et al. 2023; Lee et al. 2016; Fok et al. 2023, 2024; Kang et al. 2022; Kang, Wu, et al. 2023; Kim et al. 2018; Chen et al. 2023; Peng et al. 2022; Jardim et al. 2022). One line of research aims to improve the comprehension and readability of individual research papers. To reduce barriers to domain knowledge, ScholarPhi (Head et al. 2021) provided in-situ support for definitions of technical terms and symbols within scientific papers. PaperPlain (August et al. 2023) helped healthcare consumers understand medical research papers with AI-generated questions and answers and in-situ text summaries of every section. EvidenceMap (Kang, Sun, et al. 2023) leverages three-level abstractions to support medical evidence comprehension. Some work (Ponsard et al. 2016; Chau et al. 2011) designed interactive visualizations to summarize and group different papers and guide exploration. Other systems support fast skimming of paper content. For example, Spotlight (Lee et al. 2016) extracted visually salient objects in a paper and overlaid them on top of the viewer during scrolling. Scim (Fok et al. 2023) enabled faceted highlighting of salient paper content. To support scholarly synthesis, Threddy (Kang et al. 2022) and Synergi (Kang, Wu, et al. 2023) facilitated personalized organization of research papers in threads. Synergi further synthesized research threads with hierarchical LLM-generated summaries to support sensemaking. To address personalized information needs for a paper, Qlarify (Fok et al. 2024) provided paper summaries by recursively expanding the abstract. Other studies (Jardim et al. 2022; Marshall et al. 2016) developed and evaluated machine learning methods for risk-of-bias assessment in clinical trial reports. Although these systems help users digest research papers and distill knowledge with guidance, we take a step further by converting unstructured knowledge and research findings scattered within research papers into structured data tables with a standardized format.
2.3 Document QA Systems for Information Seeking
People often express their information needs and interests in documents using natural language questions (ter Hoeve et al. 2020). Many researchers have been working on building question-answering models and benchmarks (Dasigi et al. 2021; Krithara et al. 2023; Jin et al. 2019; Vilares and Gómez-Rodríguez 2019; Ruggeri et al. 2023) for scientific documents. With recent breakthroughs in LLMs, LLM-infused chatbots, such as ChatDoc (ChatDoc n.d.), ChatPDF (ChatPDF n.d.), ChatGPT (Open AI 2024), and Claude (Anthropic 2024), are becoming increasingly popular among people with analytic needs over very long documents. However, LLMs can produce unreliable, hallucinated answers (Ji et al. 2023; Khullar et al. 2024). It is therefore important to attribute generated results to the source (or context) of the knowledge (Wang et al. 2024); automated algorithms or human raters can then examine whether the reference source really supports the generated answers using different criteria (Gao et al. 2023; Yue et al. 2023; Rashkin et al. 2023; Bohnet et al. 2023; Menick et al. 2022). In our work, we utilize RAG techniques (Lewis, Perez, et al. 2020) to improve the reliability of LLM output by grounding it in relevant supporting evidence from the source documents. We then use quantitative metrics, such as context relevance, to evaluate answer quality and prioritize users’ attention on checking and fixing low-quality answers.
3 Formative Study
We aim to develop an interactive system that helps researchers distill, synthesize, and organize structured data from scientific literature in a systematic, efficient, and scalable way.2 To better understand the current practice and challenges they face during the process, we conducted a formative interview study.
3.1 Participants and Procedures
3.1.1 Participants
12 researchers (P1–P12; five females, seven males; ages: three from 18 to 24, nine from 25 to 34) were recruited from different disciplines, including medical and health sciences, computer science, social science, natural sciences, and mathematics. Nine held PhD degrees, and three were PhD researchers. All of them had extracted data (e.g., interventions and outcomes) from the literature, and 10 had further statistically analyzed or narratively synthesized the data. Seven rated themselves as very experienced, having led or been involved in the extraction and synthesis of both quantitative and qualitative data across multiple types of reviews. Five had expert-level understanding and usage of computer technology for research purposes, and seven rated themselves at moderate levels.
3.1.2 Procedures
Before the interviews, we asked participants to complete a pre-task survey collecting their demographics, experience with literature data extraction and synthesis, and understanding and usage of computer technology. Then, we conducted 50-min interviews with individuals over Zoom. During the interviews, we inquired about (1) participants’ general workflow for data extraction from the literature and their desired data organization format; (2) the tools they used for data extraction and synthesis and their limitations; and (3) their expectations and concerns about computer and AI support.
3.2 Findings and Discussions
3.2.1 Workflow and Tools
After getting the final pool of included papers, participants first created a data extraction form (e.g., fields) to capture relevant information related to their research questions, such as data, methods, interventions, and outcomes. Then, they went through individual papers, starting with a high-level review of the title and abstract. Afterward, participants manually distilled and synthesized the relevant information required on the form. The data synthesis process often involved iterative refinement, where participants might go back and forth between different papers to update the extraction form or refine previous extraction results.
Common tools used by participants included Excel (9/12) and Covidence or Revman (4/12) for organizing forms and results of data extraction. Some participants also used additional tools like Typora, Notion, Python, or MATLAB for more specialized tasks or to enhance data organization. The final output of this process was structured data tables in CSV and XLSX format that provided a comprehensive representation of the knowledge extracted from the literature.
3.2.2 Challenges
Time-consuming to manually retrieve and summarize relevant data within the literature. Participants found it time-consuming to extract different types of data, including both qualitative and quantitative data, located in different parts of the papers, such as text snippets, figures, and tables. P1 commented, “Sometimes, numbers and their units are separated out at different places.” The time cost further increases when facing “many papers” (7/12) to be viewed, “long papers” (5/12), or papers targeting very specialized domains they are not so familiar with (5/12). P3 added, “When information is not explicit, such as limitations, I need to do reasoning myself.” P5 said, “It takes much time for me to understand, summarize, and categorize qualitative results and findings.”
Tedious and repetitive manual data entry from literature to data tables. After locating the facts and relevant information, participants need to manually input them into the data tables, which is quite low-efficiency and tedious. P3 pointed out, “…the data is in a table (of a paper), I need to memorize the numbers, then switch to Excel and manually log it, which is not efficient and can cause errors.” P4 echoed, “Switching between literature and tools to log data is tedious, especially when dealing with a large number of papers, which is exhausting.”
Significant workload to resolve data inconsistencies and variations across the literature. Almost all participants mentioned the great challenges of handling inconsistencies and variations in data, such as terminologies, abbreviations, measurement units, and experiment conditions, across multiple papers. It was hard for them to standardize the language expressions and quantitative measurements. P7 stated, “Papers may not use the same terms, but they essentially describe the same things. And it takes me lots of time to figure out the groupings of papers.” P9 said, “I always struggle with choosing what words to categorize papers or how to consolidate the extracted information.”
Inconvenient to maintain connections between extracted data and the origins in literature. The process of data extraction and synthesis often requires iterative review and refinement, such as resolving uncertainties and addressing missing information by revisiting original sources. However, when dealing with numerous papers and various types of information, the links between the data and their sources can easily be lost. Participants commonly relied on memory to navigate specific parts of papers containing the data, which is inefficient, unscalable, and error-prone. P8 admitted, “I can easily forget where I extract the data from. Then, I need to do all over again.”
3.2.3 Expectations and Concerns About AI Support
Participants anticipated that AI systems could automatically extract relevant data from literature based on their requests (7/12) and organize it into tables (9/12). They desired quick data summaries and standardization to facilitate synthesis (6/12). Additionally, they wanted support for categorizing papers based on user-defined criteria (4/12) and for efficient review and editing in batches (4/12). Participants also expected computer support to be easy to learn and to adapt flexibly to their data needs. Many stated that existing tools, like Covidence and Revman, were somewhat complex, especially for new users who may struggle to understand their functionalities and interface interactions.
Due to the intricate nature of scientific research studies, participants shared concerns about the accuracy and reliability of AI-generated results. They worried that AI lacks sufficient domain knowledge and may generate results based on the wrong tables/text/figures. P12 demanded that AI systems should highlight uncertain and missing information. Many participants requested validation of AI results.
3.3 Design Goals
Based on the current practice and challenges identified in our formative study and the specific needs of researchers engaged in data extraction, we distilled the following design goals:
DG1. Support flexible and comprehensive data extraction and structuring. The system should enable users to customize data extraction queries for diverse data dimensions and measures. To reduce manual effort, it should automate the extraction of both qualitative and quantitative data from various modalities such as text, tables, and figures. The extracted data should be organized into structured tables, providing a solid foundation for further refinement and analysis.
DG2. Enable multi-faceted data summarization and standardization. To address inconsistencies and variations across the literature, the system should provide an overview of key patterns and discrepancies in the extracted data regarding different dimensions and measures. It should also assist in standardizing data derived across multiple documents, such as terminologies, measurements, and categorizations.
DG3. Support efficient data validation and refinement. The system should address the need to ensure the accuracy and reliability of extracted data:
- 3.1. Provide a preliminary evaluation of automatically extracted data. This helps users identify data errors and prioritize their validation effort.
- 3.2. Facilitate easy comparison of extracted data against original sources. This enables users to trace data origins and verify data accuracy.
- 3.3. Enable efficient batch editing and refinement of data. It should support data entry for data subsets (e.g., records sharing the same dimension values).
4 System
Here, we introduce the design and implementation of SciDaSynth. First, we provide an overview of the system workflow (Figure 1). Then, we describe the technical pipeline of data extraction and structuring. Finally, we elaborate on the user interface designs and interactions.
System workflow of SciDaSynth: (1) Retrieval augmented generation (RAG) based technical framework for extracting and structuring data from figures, text, and tables in scientific documents using LLMs. (2) The user interface then allows for data extraction via question-answering, data validation, correction, summarization, standardization, and database updates through an iterative refinement process.
4.1 System Workflow
After uploading the PDF files of research literature, users can interact with SciDaSynth with natural language questions (e.g., “What are the task and accuracy of different LMs?”) or customized data extraction forms in the chat interface (Figure 1). The system then processes this question and presents the user with a text summary and a structured data table (DG2). This interaction directly addresses the data needs without requiring tedious interface drag and drop (DG1).
The data table provided by SciDaSynth includes specific dimensions related to the user’s question, such as “Model,” “Task,” and “Accuracy,” along with corresponding values extracted from the literature. To guide users’ attention to areas needing validation, the system highlights missing values (“Empty” cells) and records with low relevance scores (DGs 3.1, 3.2). To validate and refine data records, users can view the relevant contexts used by the LLM, with important text spans highlighted (Figure 2 – middle). They can also access the original PDF.
User interface of SciDaSynth. The interface features: (A) A query panel for users to input natural language questions or select specific data attributes; (B) A data table displaying extracted information with highlighting of potentially problematic records; (C) Context menu options to validate data by examining relevant document snippets; (D) PDF viewer for accessing original sources; (E) Data standardization panel with multi-level and multi-faceted data summarization and standardization support.
To handle data inconsistencies across papers, the system provides a multi-level and multi-faceted data standardization interface (DG2). First, users can gain an overview of data attributes and their consistency information (Figure 2 – right). Upon selecting specific attributes, the system performs semantic grouping of attribute values to help users identify contextual patterns and distributions of potential inconsistencies (e.g., full form vs. abbreviation). Next, based on the grouped attribute values and their visual summary, users can create, modify, rename, or merge the groups, effectively categorizing the data. Once satisfied with their groupings, users can apply standardization results to instantly update the main data table (DG3.3). Throughout this process, the system provides real-time updates to charts and statistics to show the impact of standardization efforts.
Once satisfied with the data quality, users can add the table to the database, where it is automatically merged with existing data. This process can be repeated with new queries to incrementally build a comprehensive database (Figure 2 – middle). Finally, users can export the entire database in CSV format for further analysis or reporting.
4.2 Data Extraction and Structuring
We leverage LLMs3 to extract and structure data from scientific literature based on user questions (DG1). To mitigate hallucination issues and facilitate user validation of LLM-generated answers, we adopt the RAG framework by grounding LLMs in relevant information from the papers (as shown in Figure 1).
The process begins with parsing the paper PDF collection into tables, text snippets (e.g., segmented by length and sections), and figures using a state-of-the-art toolkit for processing scientific papers (Lo et al. 2023). To enhance the process, we generate data insights embedded in the figures using large vision-language models (i.e., GPT-4o in this paper). Due to the complexity of modern table structures, we also integrate LLMs to infer, parse out, and standardize the table structure (to CSV format) from table strings in PDF texts via few-shot prompting (details in Supporting Information: S1A).
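For illustration, below is a minimal sketch of the table-normalization step, assuming the OpenAI Python client; the few-shot example, prompt wording, and function names are illustrative and not the exact prompt from Supporting Information S1A.

```python
# Sketch: normalize a flattened table string extracted from a PDF into CSV via few-shot prompting.
# Assumes the OpenAI Python client; `raw_table` is a hypothetical garbled table string from the parser.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """Convert the raw table text into clean CSV.
Example input:
Crop Fe (mg/100g) Zn (mg/100g) Maize 2.1 1.8 Beans 5.4 3.2
Example output:
Crop,Fe (mg/100g),Zn (mg/100g)
Maize,2.1,1.8
Beans,5.4,3.2
"""

def table_string_to_csv(raw_table: str, model: str = "gpt-4o") -> str:
    """Infer the table structure from a flattened PDF table string and return CSV text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "You convert flattened PDF table text to CSV."},
            {"role": "user", "content": FEW_SHOT + "\nInput:\n" + raw_table + "\nOutput:"},
        ],
    )
    return response.choices[0].message.content
```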
Vector database construction. We construct a vector database for efficient retrieval and question-answering using the processed text, figures, and tables. For each component, we generate a concise text summary using LLMs to index the original, verbose content. This approach reduces noise and improves RAG quality. The text summary is then transformed into embeddings.
Question-based retrieval. When a user poses a question, we encode it as a vector and perform a similarity-based search to identify relevant original content by comparing it with the summary vectors of figures, tables, and text snippets in the database. The retrieved original content is later used for LLM’s generation.
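A minimal sketch of this summary-indexed retrieval, assuming OpenAI embeddings and an in-memory index; the `summaries` and `originals` lists are hypothetical stand-ins for the parsed text, table, and figure content.

```python
# Sketch: index concise LLM summaries of text/table/figure chunks, retrieve originals by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with an OpenAI embedding model (illustrative model choice)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Hypothetical parsed content: each summary indexes one original chunk (text, table CSV, or figure insight).
summaries = ["Table 2: iron retention in boiled sweet potato ...", "Figure 3: zinc loss under drying ..."]
originals = ["<full table CSV>", "<full figure caption + generated insight>"]
summary_vecs = embed(summaries)

def retrieve(question: str, k: int = 5) -> list[str]:
    """Return the k original chunks whose summaries are most similar to the question."""
    q = embed([question])[0]
    sims = summary_vecs @ q / (np.linalg.norm(summary_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [originals[i] for i in top]
```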
Data table structure inference. Based on the user’s question, we prompt LLMs to infer and design the structure of the data tables, including column names and value descriptions. This guides the LLM in formulating scoped, consistent, and standardized responses across different papers.
Data extraction and structuring. The user question, retrieved document snippets, and the inferred data structure are fed into LLMs to produce the final data table and an associated summary. We instruct the LLMs to answer questions solely based on the provided contexts and to output “Empty” for values that cannot be determined from the given information. The organization of structured data tables facilitates convenient human validation and refinement (DG3).
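The following sketch illustrates this generation step, assuming the OpenAI client with JSON-mode output; the schema shown is a shortened, hypothetical version of an inferred table structure.

```python
# Sketch: generate a structured record per paper from the inferred schema and retrieved context.
import json
from openai import OpenAI

client = OpenAI()

SCHEMA = {
    "language_model_name": "String: Name of the language model (e.g., GPT-3, BERT)",
    "accuracy_metric": "String: Accuracy metric used (e.g., F1, BLEU)",
    "accuracy_value": "Float: Numerical accuracy value (0-100)",
}

def extract_record(question: str, context: str) -> dict:
    """Ask the LLM to fill the schema strictly from the retrieved context; unknowns become 'Empty'."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Return a JSON object. Answer only from the provided context; "
                        "use 'Empty' for values you cannot determine."},
            {"role": "user",
             "content": f"Question: {question}\nSchema: {json.dumps(SCHEMA)}\nContext:\n{context}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```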
Data quality evaluation. To assess the quality of the RAG process, we implement several unsupervised metrics computed by LLMs to judge answer quality regarding retrieved contexts and original questions (Es et al. 2024; Yu et al. 2024): answer relevancy measures how well the answer addresses the user’s question; context relevancy evaluates the pertinence of the retrieved context to the question; and faithfulness assesses the degree to which the answer can be justified by the retrieved context. Additionally, we compute data missingness by tracking the proportion of “Empty” values in the generated table. This metric alerts users to insufficient or missing information in the original source (DG3.1).
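As a concrete illustration, data missingness can be computed directly from the generated table, and a faithfulness-style check can be delegated to an LLM judge; this is a simplified sketch rather than the exact metrics of Es et al. (2024).

```python
# Sketch: simple data-quality signals over a generated table held in a pandas DataFrame.
import pandas as pd

def data_missingness(table: pd.DataFrame) -> float:
    """Proportion of cells the LLM marked as 'Empty' (could not be determined from context)."""
    return float((table == "Empty").to_numpy().mean())

def faithfulness_prompt(answer: str, context: str) -> str:
    """Illustrative LLM-judge prompt: does the retrieved context support the generated answer?"""
    return (
        "On a scale of 0-1, how well is the following answer supported by the context?\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\nReturn only a number."
    )

# Hypothetical generated table with one missing cell per column.
table = pd.DataFrame({"crop": ["maize", "Empty"], "fe_mg_100g": [2.1, "Empty"]})
print(data_missingness(table))  # 0.5
```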
Data Table Structure Inference Example
Q: What are the tasks and accuracy of different LMs?
Inferred structure:
{"language_model_name": "String: Name of the language model (e.g., GPT-3, BERT)",
"tasks_supported": "String: List of tasks the language model can perform (e.g., text generation, summarization, translation)",
"accuracy_metric": "String: Description of the accuracy metric used (e.g., F1, BLEU)",
"accuracy_value": "Float: Numerical value for the accuracy (0-100)",
"accuracy_source": "String: Source of the accuracy data (e.g., research paper, benchmark test)"}
4.3 User Interface
Building upon the RAG-based technical framework, the user interface streamlines the data extraction, structuring, and refinement from scientific documents. Based on the LLM-generated data tables, users can perform iterative data validation and refinement by pinpointing and correcting error-prone data records and by resolving data inconsistencies via flexible data grouping regarding specific attributes. Finally, users can add the quality-ready data tables into the database. Here, we will introduce the system designs and interactions following the user workflow.
4.3.1 Flexibly Specify Information Needs
Upon uploading the scientific documents into SciDaSynth, users can formulate their questions in the Query tab by typing the natural language questions in the chat input (DG1, Figure 2). Alternatively, users can select and add specific data attributes and their detailed explanations and requirements in a form box to query some complex terminologies. This form includes suggested starting queries such as study summary, results, and limitations. Then, the system will respond to users’ questions with a text summary and present a structured data table in the “Current Result” tab (DG2).
4.3.2 Guided Data Validation and Refinement
To make users aware of and validate potentially problematic results (DG3.1), SciDaSynth highlights error-prone data records and values. Specifically, the system highlights “Empty” cells and error-prone table records (marked with a counter at the top) based on the unsupervised metrics (Section 4.2). Then, users can right-click specific table records or cells to access a context menu, which provides the option to view the relevant contexts used by the LLMs for data generation. Within these contexts, important text spans that exactly match the generated data are highlighted for quick reference. If users find the contexts irrelevant or suspect LLM hallucination, they can easily access the original PDF content or parsed figures and tables in the right panel (Figure 2) for verification (DG3.2). After identifying errors, users can double-click cells to edit the values and clear the alert icons in the rows.
4.3.3 Multi-Level and Multi-Faceted Data Summarization and Standardization
To resolve data inconsistencies across different literature, the system first presents an overview of data attributes, data types, and inconsistency4 information (Figure 2 – right).
Dimension-guided data exploration. (DG2) After selecting data attributes, the system performs visual semantic data grouping based on attribute values. Specifically, each record (row) of the selected attributes in the table is transformed into a text description (“column_name: value”), encoded as a vector, and projected onto the 2D plane as a dot5 whose size correlates with frequency. Users can hover over individual dots to see the column values and their group labels. In addition, they can select a group of dots to examine the full data records in the “Current Result” table. Variations and similarities among dimension values across rows are reflected in the distribution of dot clusters, computed with KMeans clustering. To concretize the data variations, each cluster is associated with a text label generated by LLM summarization. For example, the scatter plot groups “crops” values into colored clusters, such as sweet potatoes and maize. Users can also select multiple attributes at once, such as “nutrient value” and “measurement units,” to gain contextual insights into discrepancies in measurement units.
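A minimal sketch of this grouping step, assuming scikit-learn and placeholder embeddings of the “column_name: value” strings; the actual system may use a different 2D projection, and the cluster labels would come from an LLM summarization call.

```python
# Sketch: cluster "column_name: value" embeddings and project them to 2D for the scatter view.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical embeddings of strings like "crops: orange-fleshed sweet potato" (one row per record).
rng = np.random.default_rng(0)
value_vecs = rng.normal(size=(200, 384))  # stand-in for real text embeddings

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(value_vecs)
coords = PCA(n_components=2).fit_transform(value_vecs)  # 2D positions for plotting

for cluster_id in np.unique(kmeans.labels_):
    members = np.where(kmeans.labels_ == cluster_id)[0]
    # In SciDaSynth, an LLM would summarize the member values into a short cluster label.
    print(cluster_id, len(members), coords[members].mean(axis=0))
```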
Group-based standardization. (DG2) After developing some high-level understanding of data variations, users can proceed to standardization of the selected data attributes (Figure 3). Users can start with an overview of total and unique values for each major group within the attribute, visualized through bar charts. Below the charts, the system displays individual group cards, each representing a cluster of similar values. These cards are color-coded based on the frequency of occurrences (high, medium, low), allowing users to quickly identify prevalent and rare entries. Within each card, the system lists all unique value variations, along with their frequency counts. This granular view enables users to easily spot inconsistencies, misspellings, or variations in terminology.
Group standardization process. Users start with the major groups’ statistics within selected data attributes. Then, they can edit individual groups by changing the group labels and removing irrelevant values. Finally, they can apply edited group results to the data table.
The interface supports the following interactive standardization:
Users can create new groups or rename existing ones to better categorize the data.
Users drag-and-drop individual value entries between groups, facilitating the consolidation of similar terms.
Inline editing tools enable users to modify group names or individual value entries directly.
For each individual group card, users can apply the standardization to the data table and then track and view the applied results.
As users make changes, the system provides real-time updates to the overview charts and group statistics, offering immediate feedback on the impact of standardization efforts.
4.3.4 Iterative Table Editing and Database Construction
When satisfied with the quality of the current table, users can add it to the database, where the system automatically merges it with existing data using outer joins on document names. This process can be repeated with new queries, allowing for the incremental construction of a comprehensive database. Once the data extraction is complete, users can download the entire database in CSV format for further analysis.
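A minimal sketch of the merging step, assuming pandas and hypothetical column names; each new query’s table is outer-joined with the existing database on the document name.

```python
# Sketch: merge a newly extracted table into the existing database with an outer join on document name.
import pandas as pd

database = pd.DataFrame({"document": ["paper_01.pdf", "paper_02.pdf"],
                         "crop": ["maize", "beans"]})
new_table = pd.DataFrame({"document": ["paper_02.pdf", "paper_03.pdf"],
                          "fe_mg_100g": [5.4, 2.1]})

# The outer join keeps every paper seen so far; papers missing a column get NaN until a later query fills them.
database = database.merge(new_table, on="document", how="outer")
print(database)
```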
5 Evaluation Design
We employed a within-subjects design where participants used both SciDaSynth and a baseline system to extract data from two paper collections. The study aimed to answer the following research questions:
Effectiveness of data extraction:
- Data quality: How does SciDaSynth impact the quality of extracted data?
- Efficiency: How does SciDaSynth affect the speed of data extraction?
User perceptions: What are the perceived benefits and limitations of the system designs and workflows?
5.1 Experiment Settings
5.1.1 Participants
We recruited a diverse group of 20 researchers:
Group A: Nutrition Science (12 participants; P1–P12) This group (8 females, 4 males, aged 18–44) specialized in nutritional sciences, including food science and technology, human nutrition, medical and health sciences, and life sciences. There were five postdoctoral fellows and seven PhD students, all actively engaged in research. Their technical expertise varied: five were expert users who regularly coded and programmed, while seven were intermediate users who coded as needed.
Group B: NLP/ML Researchers (8 participants; P13–P20) This group (5 males, 3 females, aged 18–33) included researchers in NLP, machine learning, and artificial intelligence. There were two postdocs, five PhD students, and one MPhil student. All were expert computer programmers.
All participants were familiar with the target dataset dimensions through their own research or literature reading, and all had experience in data extraction and synthesis for research studies.
5.2 Datasets
We collected and processed two domain datasets based on the recent survey papers. The surveys cover diverse paper types and formats, such as randomized trials, peer-reviewed articles, meeting abstracts and posters, and doctoral theses:
Dataset I (Nutrition Science) is based on a systematic review in Nature Food (Huey et al. 2023), focusing on micronutrient retention in biofortified crops through various processing methods.
Dataset II (Large Language Models) is derived from a recent LLM survey (Zhao, Zhou, et al. 2023), covering various aspects of LLM development and applications.
To build each dataset, we conveniently sampled 20 publications out of the original pool of included papers in each corresponding review. These papers had sufficient length and complexity and incorporated different modalities of information (text, tables, and figures) (see Table 1), while also maintaining a manageable level of task complexity for the user study. After that, we pre-processed them as described in Section 4.2. The corresponding data tables from the original reviews served as ground truth for our evaluation.
Table 1. Statistics of papers in Datasets I and II.
| | Page # | Character # | Figure # | Table # |
|---|---|---|---|---|
| Dataset I | 9.40 (2.97) | 34,842 (14,569) | 3.6 (1.95) | 5.2 (3.35) |
| Dataset II | 24.10 (7.50) | 74,882 (22,077) | 7.00 (3.97) | 13.60 (4.30) |
- Note: Values are presented as mean (standard deviation).
5.3 Baseline Implementation
Baseline A (Human). A simplified version of SciDaSynth without automated data extraction or structuring, designed to replicate current manual practices. It includes (1) a PDF viewer with annotation, highlighting, and searching capabilities; (2) automatic parsing of paper metadata, tables, and figures; (3) a data entry interface for manual table creation; and (4) side-by-side views of PDFs and data tables. This baseline allows us to assess the impact of SciDaSynth’s automated features and standardization support on efficiency and data quality.
Baseline B (Automated GPT). We developed a fully automated system based on GPT-3.5/4 to generate data tables according to specified data dimensions. This baseline was intended to evaluate the accuracy of our technical framework for automatic data table generation. The implementation followed the data extraction and structuring approach of SciDaSynth (described in Section 4.2). To produce the data tables for comparison, we input the data attributes and their descriptions in JSON format as queries into the system and generated two data tables for the two splits (i.e., four data points) of each dataset.
5.4 Tasks
We designed tasks to simulate real-world data extraction scenarios while allowing for controlled evaluation across two distinct domains. Each participant was assigned to work with one complete dataset (either Dataset I or Dataset II), consisting of 20 papers in total.
Nutrition science researchers (P1–P12) worked with Dataset I, extracting “crops (types),” “micronutrients (being retained),” “absolute nutrient raw value,” and “raw value measurement units.”
NLP/ML researchers (P13–P20) worked with Dataset II, extracting “model name,” “model size,” “pretrained data scale,” “hardware specifications (GPU/TPU).”
These dimensions covered both qualitative and quantitative measurements, requiring engagement with various parts of the papers.
To ensure a robust comparison between the two interactive systems (SciDaSynth and Baseline A, the manual system), we split each dataset into two subsets of 10 papers each, ensuring a balanced distribution of paper characteristics such as paper length and the number of figures and tables. Each participant extracted all four data dimensions for every paper in their assigned dataset (I or II), using one system for each subset and thus producing two data tables in total. Baseline B (the automated baseline) was not operated by participants; it was run separately to compare automated performance against the human-in-the-loop systems. To mitigate ordering effects (e.g., learning effects), we counterbalanced the order of system usage (SciDaSynth first or Baseline A first) and dataset splits (Split 1 first or Split 2 first), resulting in a 2 (system order) × 2 (data split order) within-subjects design. Participants organized the extracted data into tables and downloaded them from the systems. The scenario was framed as “working with colleagues to conduct a systematic review.”
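For concreteness, the counterbalancing can be expressed as a simple rotation through the four order conditions; the assignment below is an illustrative sketch with hypothetical participant IDs, not the exact allocation used in the study.

```python
# Sketch: rotate participants through the 2 (system order) x 2 (split order) conditions.
from itertools import cycle, product

conditions = list(product(["SciDaSynth first", "Baseline A first"],
                          ["Split 1 first", "Split 2 first"]))  # 2 x 2 = 4 orderings
participants = [f"P{i}" for i in range(1, 13)]  # e.g., the 12 nutrition researchers

for pid, cond in zip(participants, cycle(conditions)):
    print(pid, cond)
```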
Following the structured tasks, participants engaged in an open-ended exploration of the whole dataset with SciDaSynth, allowing for insights into the system’s full capacity in research use.
5.5 Procedure
We conducted the experiment remotely via Zoom, with both Baseline A and SciDaSynth deployed on a cloud server. The study followed a structured procedure: pre-study setup and briefing (10 min), system tutorials (10 min each), main tasks with both systems and dataset splits (in counterbalanced order), post-task surveys, free exploration of SciDaSynth (15 min), and a concluding interview.
Participants first provided consent and background information. They then received tutorials on each system before performing data extraction tasks. After completing structured tasks with both systems, participants freely explored SciDaSynth while thinking aloud. The study concluded with a semi-structured interview gathering feedback on system designs, workflow, and potential use cases.
Each session lasted approximately 2.5 h, with participants compensated $30 USD. This procedure enabled collection of comprehensive quantitative and qualitative data on SciDaSynth’s performance and user experience across different research domains.
5.6 Measurements
We evaluated SciDaSynth using quantitative and qualitative measures focusing on effectiveness, efficiency, and user perceptions, with separate analyses for Datasets I and II.
Effectiveness of data extraction was assessed by evaluating data quality and task completion time. For data quality, we compared the data tables generated by participants using SciDaSynth, Baseline A, and the automated GPT baseline (Baseline B) against the original data tables from the systematic review. For each dataset, two expert raters who were blind to the system conditions independently scored the extracted data on a 3-point scale: 0 (Not Correct), 1 (Partially Correct),6 and 2 (Correct), based on accuracy and completeness. Inter-rater agreement on individual dimensions was measured by Cohen’s $\kappa$; the two raters generally had good agreement ($\kappa > 0.7$; see Table S4). Disagreements were resolved through discussion to reach consensus scores. For SciDaSynth and Baseline A, we calculated participants’ scores for the corresponding dataset (one per participant, each ranging from 0 to 20). Then, paired Student’s t-tests were performed to compare the average scores of SciDaSynth and Baseline A. Baseline B yielded two scores per dataset split, compared using Mann–Whitney U-tests (Kang, Wu, et al. 2023).
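As an illustration of these analyses, the sketch below runs the same statistical tests on hypothetical score arrays using scipy and scikit-learn; none of the numbers are study results.

```python
# Sketch: the statistical comparisons described above, applied to hypothetical scores.
import numpy as np
from scipy.stats import ttest_rel, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

rater1 = [2, 1, 2, 0, 2, 1]          # per-value scores from rater 1 (0/1/2 scale)
rater2 = [2, 1, 2, 1, 2, 1]          # per-value scores from rater 2
print("Cohen's kappa:", cohen_kappa_score(rater1, rater2))

scidasynth_scores = np.array([18, 17, 19, 16, 18, 20])   # one total score (0-20) per participant
baseline_a_scores = np.array([15, 14, 16, 13, 17, 18])   # same participants with Baseline A
print("Paired t-test:", ttest_rel(scidasynth_scores, baseline_a_scores))

baseline_b_scores = np.array([16, 15])                    # automated baseline scores for one dataset
print("Mann-Whitney U:", mannwhitneyu(scidasynth_scores, baseline_b_scores))
```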
For task efficiency, we measured task completion time from the moment the PDFs were uploaded to the system to the moment the final data table was downloaded. The task completion times for SciDaSynth and Baseline A were compared using paired Student’s t-tests.
User perceptions were evaluated through post-task questionnaires and interviews. We used the NASA Task Load Index (6 items) to assess perceived workload, and an adapted Technology Acceptance Model (5 items) to measure system compatibility and adaptability, both using 7-point scales (Kang, Wu, et al. 2023; Wu and Wang 2005). Custom items gauged perceived utility in areas such as paper overview