An open dataset of Chinese duration expressions

Background & Summary

The digitalization of information has made massive amounts of textual data available in the form of books, news articles, blogs and tweets1. Currently, advances in automated text analysis techniques have enabled researchers to efficiently process textual data at scale and extract meaningful information for predicting individual characteristics2, revealing organizational phenomena3 and understanding culture, social processes and other important aspects of social life1. Extracting information from textual data often relies on the establishment of corresponding lexicons1. With the development of text analysis techniques, an abundance of lexicons focusing on different domains have been established, such as the WordNet lexicon4 for cognitive synonyms, the NRC lexicon5 for sentiment, and the Linguistic Inquiry and Word Count (LIWC)[6](https://www.nature.com/articles/s41597-025-06016-2#ref-CR6 “Boyd, R. L., Ashokkumar, A., Seraj, S. & Pennebaker, J. W. The development and psychometric properties of LIWC-22, https://doi.org/10.13140/RG.2.2.23890.43205

(2022).“), which comprises a wide array of psychosocial constructs. Among the various domains covered in the literature, there has been much interest in time-related information in recent years[7](https://www.nature.com/articles/s41597-025-06016-2#ref-CR7 “Jensen, C. S. & Snodgrass, R. T. Temporal database. in Encyclopedia of Database Systems (eds. Liu, L. & Özsu, M. T.) 3945–3949. https://doi.org/10.1007/978-1-4614-8265-9_395

(Springer New York, New York, NY, 2018).“). Time-related information is important because it is essential to how we interpret, organize, and respond to complex events – ranging from understanding antecedents and consequences in organizational processes8 to enabling effective diagnosis and treatment in the medical domain9. In everyday life, time-related information is expressed through a variety of linguistic expressions, such as “in 2 weeks”, “in a while”, and “now”. Recent advances in the study of linguistic expressions of time have primarily focused on developing lexicons for time orientations (i.e., words that indicate the past, present or future, such as “when”, “now” and “going to”, e.g., LIWC[6](https://www.nature.com/articles/s41597-025-06016-2#ref-CR6 “Boyd, R. L., Ashokkumar, A., Seraj, S. & Pennebaker, J. W. The development and psychometric properties of LIWC-22, https://doi.org/10.13140/RG.2.2.23890.43205

(2022).“) and TextMind10). However, the literature has yet to provide lexicons on an important aspect of time-related information11 – duration – which is crucial for capturing the onset and progression of events in text11 and for understanding how humans experience and evaluate events12. Furthermore, existing databases of time-related expressions often lack information about word frequency – a robust predictor of information processing13,14,15 and a standard component of lexical databases16,17. To address these gaps, the current study aims to construct an open dataset on duration expressions that includes frequency information.

Duration refers to a period of time with known length[18](https://www.nature.com/articles/s41597-025-06016-2#ref-CR18 “Jensen, C. S. & Snodgrass, R. T. Time interval. in Encyclopedia of Database Systems (eds. Liu, L. & Özsu, M. T.) 1–2, https://doi.org/10.1007/978-1-4899-7993-3_1423-2

(Springer New York, New York, NY, 2016).“) and is typically expressed in the form of duration expressions such as “a while” or “1 week”. Duration expressions are pervasive in a wide array of circumstances. In everyday communications, people may make appointments to meet each other “in a while”. When making decisions, the reward for choosing an option may be paid “1 week” later. How people map these expressions to corresponding numerical durations has been of substantial interest to researchers in various disciplines. For example, researchers in the natural language processing community have focused on establishing methods to capture and extract temporal information from natural language texts19,20. Furthermore, time series forecasting in domains such as energy, finance and health also requires that time-related data be first transformed into a numerical form before it can be analyzed or used for prediction21. Converting duration expressions to their corresponding numerical durations is thus a crucial step when extracting and analyzing time-related information in text.

Duration expressions comprise two subcategories: numeric duration expressions (presented in the form of “number + time unit”, e.g., “1 week”) and verbal duration expressions (e.g., “a while”). The conversion from numeric duration expressions to corresponding numerical durations is well determined; for example, when “day” is the target time unit, the expression “1 week” would be transformed to “7 days”. While recent advances in natural language processing have enabled the automatic identification of numeric duration expressions in text20,[22](https://www.nature.com/articles/s41597-025-06016-2#ref-CR22 “Strötgen, J., Gertz, M. & Popov, P. Extraction and exploration of spatio-temporal information in documents. in Proceedings of the 6th Workshop on Geographic Information Retrieval 1–8, https://doi.org/10.1145/1722080.1722101

(ACM, Zurich Switzerland, 2010).“), there is currently a lack of established open-access lexicons to support the broader reuse of such expressions in other research contexts. Furthermore, existing studies have yet to provide word frequency information for the numeric duration expressions, which limits their utility for researchers.

Verbal duration expressions are usually ambiguous and thus do not support direct conversion into an equivalent numerical duration. Despite their prevalence in natural language, little is known thus far about how verbal expressions can be converted into their numerical equivalents. This knowledge gap limits the capacity of researchers to analyze temporal information conveyed through ambiguous languages. A similar issue exists in studies on probability expressions, where numerical probability expressions are precise, while verbal probability expressions are usually ambiguous23. To examine whether people understand probabilistic opinions expressed by ambiguous verbal expressions, an established method in prior research is to ask participants to estimate the numerical values of verbal probability24,25,26. The descriptive statistics of participants’ estimates (e.g., mean, median, standard deviation and interquartile ranges) can be used to map ambiguous verbal expressions to numerical probabilities. Gaining insights from these studies on verbal probability, the current study aims to obtain numerical equivalents for verbal duration expressions using a similar method and construct a Chinese lexicon that covers both numeric and verbal duration expressions and their corresponding numerical durations.

In summary, this article reports a new open dataset on duration expressions in Chinese. The dataset contains 2,101 duration expressions and their corresponding numerical durations. To aid in further exploration of the database, we also provide the frequency and adjusted frequency of use for each verbal and numeric duration expression.

Methods

The duration expressions contained in the current dataset consists of two parts: verbal duration expressions and numeric duration expressions. For verbal duration expressions, we first identified the candidate expressions from reliable existing lexicons. We then recruited a sample of participants to assess whether each phrase could be used to convey duration information and to estimate the numerical equivalents for each expression. For numeric duration expressions, we generated commonly used expressions ranging from 1 second to 100 years and converted them to corresponding numerical durations. Finally, we obtained the frequency of each verbal and numeric duration expression from the BLCU Corpus Center (BCC) Corpus27. Figure 1 depicts the construction pipeline of the dataset. The steps in constructing the dataset are described in detail below. The study protocols were approved by the Institutional Review Board of Institute of Psychology, Chinese Academy of Sciences [No. H23074].

Fig. 1

Construction pipeline of the dataset.

Obtaining verbal duration expressions and their numerical equivalents

Participants

Prior to data collection, all participants received a detailed explanation of the experiments’ purpose, procedure, potential risks and benefits, and gave informed consent through an online consent form to take part in the experiment voluntarily. The consent form informed the participants that (a) the de-identified data may be published in books or academic journals; (b) their anonymized data may be shared via a publicly available data archive; and (c) the researchers are committed to ensuring that all personally identifiable information is strictly protected at all times. All participants are required to actively agree to the terms and confirm their understanding before they could proceed with the questionnaire.

In accordance with previous research28, we recruited 55 participants, including 20 males and 35 females, with an average age of 21.62 years (SD = 4.74 years). Participants were recruited through advertisements in a local university, and all participants completed two tasks. All participants were paid ¥20 for their involvement. Three participants failed the attention check in Task 1 and thus were excluded. Five additional participants failed the comprehension check in Task 2 and thus were excluded from the analysis of Task 2.

Selection of candidate verbal duration expressions

Candidate duration expressions were collected from the following sources: (1) The TextMind lexicon10. TextMind consists of two components: (a) the Simplified Chinese version of the LIWC dictionary, and (b) the 5,000 most frequent words collected from the posts of 20,000 active Sina Weibo users. (2) The Lexicon of Common Words in Contemporary Chinese29. This lexicon contains 56,008 Chinese words that are frequently used in daily life.

Three researchers independently examined both lexicons and selected candidate duration expressions. To avoid the possibility of neglecting any potential expression that could convey duration information, the researchers were requested to select all expressions related to temporal information. In total, we selected 180 candidate verbal duration expressions.

Procedure

The in-lab experiment consisted of two tasks conducted in the form of questionnaires on a computer. Task 1 is an expression judgement task. All participants were presented with the 180 candidate verbal duration expressions. For each verbal duration expression, participants made judgements on whether it could be used to express duration (response options: no/unsure/yes). We also presented participants with 5 irrelevant expressions as attention-check questions (i.e., 记者 <journalist>, 国家 <nation> , 棉袄 <padded jacket>, 打扫 <clean> , and 专辑 <album>). Participants who failed to provide correct responses (i.e., no) were excluded from the analysis. The verbal duration expressions were presented in a random order for each participant. Expressions that were judged as “no” (i.e., cannot express duration) by fewer than 20% of the participants were included in the dataset.

Task 2 is a duration estimation task. All participants were asked to estimate the numerical equivalent of each expression that they judged as capable of expressing duration in Task 1. For each expression, participants were first asked to choose an appropriate time unit, including second, minute, hour, day, month, and year, and then to input their estimates. To check whether the participants understood the experimental procedure, we presented all participants with a comprehension check, in which they were asked to estimate the duration of the phrase 两天 <two days> (correct response: 2). Those who failed to provide the correct response were excluded from the analysis. The expressions were presented in a random order for each participant.

Data analysis

In total, 84 verbal duration expressions were judged as “no” (i.e., cannot express duration) by fewer than 20% of the participants and were included in the dataset. To ensure consistency, we converted all numerical equivalents of duration expressions into days, regardless of the original time units selected by the participants (e.g., years, months). The average estimates from all participants were then used as the numerical equivalent for each duration expression. To provide more detailed information about the distribution of participants’ evaluations of verbal duration expressions, we calculated additional statistical measures for each verbal expression, including standard deviation (SD), median (Mdn), interquartile range (IQR), and coefficient of variation (CV). These measures can facilitate further utilization of the database by future researchers. Supplementary Table S1 presents the summary statistics for all verbal duration expressions included in the lexicon. We provide English translations of all verbal duration expressions generated by DeepSeek V3[30](https://www.nature.com/articles/s41597-025-06016-2#ref-CR30 “DeepSeek-AI et al. DeepSeek-V3 technical report. Preprint at https://doi.org/10.48550/ARXIV.2412.19437

(2024).“), a leading large language model (LLM) introduced by DeepSeek in 202431. All translations were manually checked by the authors. Figure 2 shows density plots of the duration estimates for each verbal duration expression, calculated by kernel density estimation using the R package ggridges32.

Fig. 2

Density plots of duration estimate for verbal duration expressions with a mean estimated duration (a) of less than 2 minutes (unit: minutes), (b) ranging between 2 minutes and 30 minutes (unit: minutes), (c) ranging between 30 minutes and 1 day (unit: days), (d) ranging between 1 and 365 days (unit: days), (e) ranging between 1 and 100 years (unit: years), and (f) greater than 100 years (unit: years). Duration estimates that exceed the 95% quantile were not shown to improve readability of the plot. In f, the duration estimates were log10-transformed.

Obtaining numeric duration expressions and their corresponding numerical durations

To incorporate numeric durations into the dataset, we generated numeric duration expressions in the form of “number + time unit”, such as 1天 <1 day> and 1年 <1 year> . To capture a broad range of durations encountered in everyday life, our database includes commonly used expressions spanning from 1 second to 100 years. It is worth noting that some numerical durations can be expressed in various ways in Chinese; for example, 2 小时, 2个小时, 两小时, and 两个小时 are all used to express 2 hours. These different expressions were all included in the lexicon. Overall, a total of 2,017 numeric duration expressions were generated and included in the dataset. The corresponding numerical durations of numeric duration expressions can be easily converted by extracting the number and time unit in each expression. For example, the expression 24 小时 <24 hours> is equivalent to a numerical duration of 24 hours and thus 1 day. To ensure consistency, we converted all numeric duration expressions to their corresponding numerical durations in days.

Estimating the frequency of each verbal and numeric duration expression

Word frequency is a basic and crucial information when conducting text analysis13,15. To aid in further exploration of the database, we provide the frequency of use for each verbal and numeric duration expression in the dataset. Word frequency values were obtained from the BLCU Corpus Center (BCC) Corpus (bcc.blcu.edu.cn27). The BCC Corpus encompasses a wide range of Chinese texts from diverse sources, including newspapers, literature, academic publications, and social media, totaling approximately 10 billion characters27. Given its rich volumes and comprehensive sources, the BCC Corpus is a reliable source for obtaining the frequencies of duration expressions.

Note that some expressions in our lexicon may be used to convey meanings other than duration due to phenomena such as polysemy. For example, 马上 might mean <immediately> or could also mean <on the horse> . To account for this, we analyzed a random sample of fifty occurrences (or all occurrences if fewer than 50 were available) in the BCC Corpus for each duration expression to estimate the proportion of cases where the expression was used to express duration rather than alternative meanings. Since only twenty random occurrences were used to estimate the adjusted frequencies in a previous study on the frequencies of probability phrases28, we believed that a random sample of fifty occurrences would provide a stable estimation. For each expression in our lexicon, we calculated the adjusted frequency by multiplying the frequency of each expression and the proportion it was used to express duration. The adjusted frequency offers a more accurate estimate of the frequency with which each expression is used to express duration.

Data Record

The dataset is available at the repository of the Science Data Bank at https://doi.org/10.57760/sciencedb.28888[33](https://www.nature.com/articles/s41597-025-06016-2#ref-CR33 “Zhang, S.-Q., Niu, J.-W., Liu, X., Sui, X.-Y. & Rao, L.-L. Data and code for ‘An Open Dataset of Chinese Duration Expressions’. Science Data Bank https://doi.org/10.57760/sciencedb.28888

(2025).“).

Data Record 1 – the main dataset

The main dataset is lexicon.csv. This file contains all verbal and numeric duration expressions included in the dataset. The column “expressions” provides the duration expressions. The column “translation” provides English translations for each duration expression, generated by DeepSeek V3 and manually checked by the authors. The column “Type” denotes whether the duration expression is verbal or numeric. The column “num_duration” provides the numerical duration for each duration expression in days. For verbal duration expressions, the numerical equivalent for each expression was determined by averaging the estimates provided by all participants. For numeric expressions, the numerical duration is converted by extracting the number and time unit in each expression. The column “freq” represents the raw frequency of use for each duration expression obtained from the BCC Corpus. The column “adjusted_freq” refer to the adjusted frequency of use for each duration expression.

Data Record 2 – raw rating data

The subfolder “1 verbal_expression_processing” contains the raw data and code for the expression judgement and duration estimation tasks. The “task1.csv” file contains the raw data of Task 1 (the expression judgement task). The column “ID” is an identifier distinguishing the study participants. Columns 2–181 refer to participants’ judgements of whether the 180 candidate verbal duration expression could be used to express duration (response options: 1 = yes; 2 = unsure; 3 = no). Columns 182–186 refer to participants’ judgements of whether the five irrelevant expressions could be used to express duration (response options: 1 = yes; 2 = unsure; 3 = no). English translations for all expressions are provided in the second row of the table. The “task2.csv” file contains the raw data of Task 2 (the duration estimation task). The column “ID” is an identifier distinguishing the study participants. The second column “两天” <two days> refer to participants’ judgements of the comprehension check (correct response: 2). Columns 3–86 refer to participants’ duration estimates of the 84 verbal duration expressions that were included in the main dataset. A missing value indicates that the participant judged the expression unsuitable for describing duration and thus did not provide a duration estimate. English translations for all expressions are provided in the second row of the table. The “judgment + estimation.R” file contains the code for calculating the numeric equivalent and other summary statistics (i.e., standard deviation, median, interquartile range, and coefficient of variation) for each verbal duration expression and the code for reproducing the density plots.

Data Record 3 – validation analysis

The subfolder “2 validation” contains the code for conducting the technical validations (see the Technical Validation section for details).

Technical Validation

To demonstrate the validity and utility of the current dataset, we provide evidence from the following two perspectives.

First, the construction of the dataset has followed standard practices of lexicon establishment, which ensure that it is rigorously constructed and carefully curated. The duration expressions reported in the dataset were collected from multiple reliable sources (TextMind and the Lexicon of Common Words in Contemporary Chinese). The lexicon covers a wide range of time units and incorporates different expressions for the same numeric expression. All verbal expressions in the dataset were validated by human participants, which provided evidence for the quality of the lexicon34. The word frequency data was obtained from a large corpus based on diverse sources (the BCC Corpus), which ensures the reliability of the frequency measure14. Furthermore, we calculated the adjusted frequency for each expression to better reflect the real-world usage of each expression in the temporal-related context.

Second, by providing an effective tool for the conversion between duration expressions and numerical durations, the current dataset has a high potential for reuse. To demonstrate the validity and utility of the dataset, we tested whether it could be utilized to approximate people’s subjective valuations of temporal delays in a way consistent with predictions from the decision by sampling (DbS) theory28. Decision by sampling theory is a theory about how the environment shapes the decision individuals make35. According to the DbS, the subjective value of a temporal delay is determined by its relative rank within a random sample of delays retrieved from memory28. Specifically, the relative rank of each delay is calculated using the formula r = (R − 1)/(N − 1), where N is the total sum of the frequencies of all delays, and R is the rank of a particular delay within N when sorted in descending order. The distribution of delays stored in memory reflects the distribution in the real world. Stewart et al.28 collected English numeric duration expressions ranging from 1 day and 1 year and assessed their real-world frequencies through the number of hits produced by Google. Consistent with the predictions of decision by sampling theory, Stewart et al.28 found that the tendency of people to discount delayed outcomes according to a hyperbolic discounting function arises from the distribution of relative ranks of delays in the real world. To verify the quality of our dataset, we used the delays and frequencies data in our dataset to calculate the relative ranks of delays following Stewart et al.28, in such a way that high-quality data are necessary to obtain meaningful outcomes.

For comparison with the Stewart et al.28 study, we included only expressions with numerical durations ranging from 1 day to 1 year for subsequent analysis. Following the approach of Stewart et al.28, we first conducted an analysis using only numeric duration expressions. To further demonstrate the validity of the verbal duration expression in our dataset, we then extended the analysis to include both numeric and verbal duration expressions in the dataset.

For analysis with only numeric duration expressions, given that some numeric duration expressions may correspond to the same delay, we first accumulated the adjusted frequencies of all numeric duration expressions equivalent to the same delay to determine the frequency of that delay. Next, we sorted delays in descending order and calculated the relative rank of each delay with the formula r = (R − 1)/(N − 1) as mentioned above. The analysis with numeric + verbal duration expressions was conducted under the same process as the analysis with numeric duration expressions, except that we used the mean value of participants’ estimates for each verbal duration expression as its corresponding delay. The relative rank (r) was used as the DV of the modeling of temporal discounting.

We fitted the distribution of the relative ranks of delays with eight common forms of temporal discounting function, including the power function f (D) = cD1−τ*/(1 − τ)28, the hyperbola-like function f (D) = 1/(1 + kD)s 36, the exponential function f (D) = e−kD 37, the simple hyperbolic function f (D) = 1/(1 + kD)38, the hyperboloid function f (D) = 1/(1 + kD**s)39, the constant sensitivity function f ((D)={e}^{-({kD}{)}{s}})40, the quasi-hyperbolic function f (D) = βe*−kD, when D > 041, and the double exponential function f (D) = ae−bD + (1 − a)e−cD 42,43. D in the functions represents the corresponding delay of each duration expressions, and c, τ, k, s, β, a and b are free parameters. Following He et al.44 and Stewart et al.28, we also provided the parameter bounds for each function in Table 1. The parameters of each function were estimated using the least-squares method. Then, we evaluated the performance of each model by inspecting model selection criteria, including the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The AIC and BIC values were calculated using the standard formulas in the case of least squares estimation[45](https://www.nature.com/articles/s41597-025-06016-2#ref-CR45 “Burnham, K. P. & Anderson, D. R. Model Selection and Multimodel Inference. https://doi.org/10.1007/b97636

(Springer New York, New York, NY, 2002).“): AIC = n (\log \left(\frac{{SSE}}{n}\right)+2k); BIC = n (\log \left(\frac{SSE}{n}\right)+k,\log (n)), where n is the sample size, SSE is the sum of squared errors, and k is the number of free parameters in the model. The analysis was performed using R 4.5.1.

Results for distribution based on numeric duration expressions

We first performed model fitting on the distribution of the relative ranks of delays based on numeric duration expressions. The model fitting results are presented in Table 1. The results showed that the power function f (D) = cD1−τ*/*(1 − τ) provided the best fit (AIC = −3362.87, BIC = −3354.83). Figure 3a illustrates the fit of the best-fitting function as well as the classical exponential and simple hyperbolic functions.

Fig. 3

The distribution of relative ranks of durations. The curved lines represent alternative forms of the temporal discounting function fit to the data, including the best-fitting function, the exponential function, and the simple hyperbolic function. (a) The results of model fitting based on numeric duration expression data. (b) The results of model fitting based on data from both numeric and verbal duration expressions.

Results for distribution based on all duration expressions

The model fitting results for the distribution based on all duration expressions are presented in Table 1. The hyperbola-like function f (D) = 1/(1 + kD)s provided the best fit (AIC = −3440.01, BIC = −3431.91) to the distribution (see Figure 3b).

In conclusion, with regard to the analysis using only numeric duration expressions, we found that the best-fitting function was the power function (f (D) = cD1−τ/(1 − τ)), which is consistent with previous research28. This indicates that the current dataset can successfully capture the distributional properties of temporal delays, providing evidence for its validity.

With regard to the analysis incorporating both numeric and verbal duration expressions from the dataset, we found that the hyperbola-like function (f (D) = 1/ (1 + kD)s) provided the best fit. This pattern suggests a faster decline in the subjective value for shorter delays and a slower decline for longer delays when both numeric and verbal duration expressions are considered. Notably, these findings were consistent with previous research showing that the hyperbola-like function can better describe people’s actual temporal discounting behavior than can other functions46,47,48. The convergence between our results and prior studies suggests that incorporating verbal duration expressions may provide a better proxy for the real-world distribution of temporal delays than relying on numeric duration expressions alone. Therefore, these results not only support the validity of our dataset but also highlight the advantage of incorporating verbal duration expressions in the lexicon.

Usage Notes

This dataset is intended for researchers interested in investigating time-related textual data. The dataset can be applied to a range of analyses, including but not limited to:

The automatic extraction and conversion of duration expressions in natural language processing studies. While prior studies have provided tools for extracting and identifying duration expressions from text20,[22](https://www.nature.com/articles/s41597-025-06016-2#ref-CR22 “Strötgen, J., Gertz, M. & Popov, P. Extraction and exploration of spatio-temporal information in documents. in Proceedings of the 6th Workshop on Geographic Information Retrieval 1–8, https://doi.org/10.1145/1722080.1722101

(ACM, Zurich Switzerland, 2010).“), these studies have largely focused on numeric duration expressions, for which the conversion to the corresponding numerical duration is precise and well determined. However, people’s everyday communication about time comprises not only precise numeric duration expressions but also a large variety of ambiguous verbal duration expressions49. By constructing an annotated lexicon that covers verbal duration expressions, our dataset provides support for studying ambiguous language on duration and thus expands the scope of analysis in time-related textual data.

Investigating word frequency effect in temporal expressions. Word frequency is a central variable in many psycholinguistic research and plays a key role in explaining individuals’ visual word recognition50. Word frequency effect, a widely studied phenomenon in the past few decades, refers to a tendency that frequently-occurred words are processed easier and more efficiently than low-frequency words14. By utilizing the current dataset, researchers can investigate individuals’ processing of temporal information during a lexical decision or naming task. Potential research questions include, but are not limited to: (1) individual and group differences in the processing of time-related information, and (2) the influence of temporal expression processing on real-life behaviors such as temporal discounting.

Selection and designation of research materials in behavioral experiments. Many studies in psychology and behavioral sciences, such as those on intertemporal choices and goal pursuit, incorporate temporal information as a key component of their stimuli. The current dataset provides numerical duration estimates and word frequency information for each duration expression, allowing researchers to select and construct materials that meet the desired experimental requirements, and to systematically control for linguistic features such as frequency of use or familiarity. Notably, the verbal duration expressions in our dataset can be used to describe future outcomes with ambiguous or imprecise delays, which have received growing attention in recent years (e.g., “You will get $30 later”51).

Data availability

The data is accessible without restrictions at the repository of the Science Data Bank at https://doi.org/10.57760/sciencedb.28888[33](https://www.nature.com/articles/s41597-025-06016-2#ref-CR33 “Zhang, S.-Q., Niu, J.-W., Liu, X., Sui, X.-Y. & Rao, L.-L. Data and code for ‘An Open Dataset of Chinese Duration Expressions’. Science Data Bank https://doi.org/10.57760/sciencedb.28888

(2025).“).

Code availability

The code for data preprocessing, analysis and technical validation are available at https://doi.org/10.57760/sciencedb.28888[33](https://www.nature.com/articles/s41597-025-06016-2#ref-CR33 “Zhang, S.-Q., Niu, J.-W., Liu, X., Sui, X.-Y. & Rao, L.-L. Data and code for ‘An Open Dataset of Chinese Duration Expressions’. Science Data Bank https://doi.org/10.57760/sciencedb.28888

(2025).“).

References

Berger, J. & Packard, G. Using natural language processing to understand people and culture. Am. Psychol. 77, 525–537 (2022).

Article PubMed Google Scholar 1.

Eichstaedt, J. C. et al. Closed- and open-vocabulary approaches to text analysis: A review, quantitative comparison, and recommendations. Psychol. Methods 26, 398–427 (2021).

Article PubMed Google Scholar 1.

Mathieu, J. E. et al. Indexing dynamic collective constructs using computer-aided text analysis: Construct validity evidence and illustrations featuring team processes. J. Appl. Psychol. 107, 533–559 (2022).

Article PubMed Google Scholar 1.

McCrae, J. P. et al. The global wordnet formats: Updates for 2020. in Proceedings of the 11th Global Wordnet Conference 91–99 (2021). 1.

Moha

Background & Summary

Background & Summary

Methods

Obtaining verbal duration expressions and their numerical equivalents

Participants

Selection of candidate verbal duration expressions

Procedure

Data analysis

Obtaining numeric duration expressions and their corresponding numerical durations

Estimating the frequency of each verbal and numeric duration expression

Data Record

Data Record 1 – the main dataset

Data Record 2 – raw rating data

Data Record 3 – validation analysis

Technical Validation

Results for distribution based on numeric duration expressions

Results for distribution based on all duration expressions

Usage Notes

Data availability

Code availability

References

Similar Posts