Introduction
Creativity is a multifaceted construct at the crossroads of individual expression, problem solving, and innovation. Human creativity is pivotal in shaping cultures and has undergone continuous transformation across historical epochs. Our understanding of this ability is now influencing the landscape of artificial intelligence and cognitive systems1,2,3,4,5. In the past few years, the advent of sophisticated Large Language Models (LLMs) has spurred considerable interest in evaluating their capabilities and apparent human-like traits6, particularly in terms of their impacts on human creative processes7,8. Despite a growing interest in evaluating the creative quality of LLM-generated outputs9,10,11,12, current benchmarking approaches have yet to systematically compare LLMs to human performance on tasks that are suitable for both.
Although the ability to generate novel and aesthetically pleasing artifacts has long been considered a uniquely human attribute, this view has been challenged by recent advances in generative AI. This technological progress has ignited discussions surrounding the creative capabilities of machines13,14,15,16, ushering in the emerging field of computational creativity—a multidisciplinary domain that explores the potential of artificial systems to exhibit creativity in a manner analogous to human cognition.
The release of GPT-4 was marked by an exceptional gain in performance across various standardized benchmarks17. Demonstrating its versatility in language- and vision-based tasks, GPT-4 has successfully passed a uniform bar examination, the SAT, and multiple AP exams, transcending the boundaries of traditional AI capabilities. However, it is important to keep in mind that such benchmarks can be achieved through non-human processes such as data contamination and storage, rather than genuine reasoning or understanding. The model’s web page (openai.com/gpt-4) touts its creative prowess, spurring a fresh examination of the creativity of state-of-the-art LLMs. The stance taken by OpenAI has sparked debates on the extent to which the creativity of LLMs is poised to rival human capabilities.
These advancements raise pivotal questions for the science of creativity: Are these models genuinely evolving to become more creative, and to what extent do they approach human-level creativity? The exploration of these inquiries not only deepens our understanding of artificial creativity but also provides valuable insights into the role that language abilities play in creativity.
Here, we leverage recent computational advances in the field of creativity science in order to quantify creativity across state-of-the-art LLMs and in a massive data set of 100,000 human participants. By scrutinizing these models through the lens of distributional semantics, we probe and compare their potential to generate original linguistic and narrative content.
Numerous definitions and frameworks have been proposed to describe human creativity, encompassing convergent and divergent thinking, as well as variation-selection paradigms2,8,15,18,19,20. Divergent thinking, characterized by the ability to generate novel and diverse solutions to open-ended problems, has gained widespread recognition as a robust and widely-accepted index of creative cognition21. This aspect of cognitive creativity is particularly tied to the initial phase of the creative process (i.e., variation/exploration), where many ideas are produced before the most useful and novel ones are selected.
To quantify divergent thinking, researchers have employed various tools, such as the Alternative Uses Task (AUT), in which people generate novel uses for common objects. Recently, the creativity of LLMs has been probed using the AUT, yielding mixed results; while there were no overall significant differences between LLMs and humans, discrepancies emerged in specific items22,23. The results might be explained by inherent challenges in the methodology24. The AUT’s validity remains contentious25, and chatbot responses might inadvertently draw from online test materials. Additionally, the practice in these studies of eliciting multiple responses from chatbots has raised concerns over the significance of fluency metrics. This aligns with broader critiques of the AUT, highlighting its cumbersome and subjective rating process26, even if recent work has shown promising approaches using LLMs to automatically score the AUT. We acknowledge that subjectivity is intrinsic to creativity assessment; when greater objectivity is desired, semantic-distance scoring provides a validated AUT method27,28, complemented by recent LLM-based automated scoring29.
More recently, semantic distance has increasingly been probed as a key component of creative thought30. This emphasis dovetails with classic and contemporary views that creativity relies on associative thinking—traversing and combining remote regions of semantic memory to yield novel connections31. Recent methodological advances include, for instance, the Divergent Association Task (DAT), in which people are asked to generate a list of 10 words that are as semantically distant from one another as possible32. Individuals who are more creative tend to cover a larger semantic repertoire, resulting in a larger mean semantic distance between the words. DAT scores show positive associations with established creativity assessments—including the AUT and the Bridge-the-Associative-Gap (BAG) task—as well as with convergent (Compound Remote Associates), insight, and analytical problems32. Together, these findings support its reliability as a brief index of divergent (associative) thinking in humans32,33,34,35,36.
The speed and unambiguous scoring of the DAT make it appropriate for large-scale evaluations. The DAT may be useful to assess both LLMs and human creativity, as it is a straightforward task that probes creative potential through language production, a domain accessible to both entities. This commonality facilitates a concise and direct comparison of creative output between LLM models and humans, enabling an in-depth examination of their respective creative capacities. Further, the DAT uses computational scoring to assess semantic distance between all word pairs, allowing the comparison of large samples without additional bias from human raters. Semantic distance is derived from the mean cosine similarity value between pairs of word embeddings—matrix-based representations of words. These embeddings are produced by a language model that is trained to consider word co-occurrences, a characteristic often termed as context-independent word embeddings37.
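To make the scoring concrete, a minimal sketch of a DAT-style computation is given below. It assumes context-independent vectors (e.g., GloVe-style embeddings) loaded into a plain Python dictionary and reports the mean pairwise cosine distance scaled to a 0–100 range; the exact embedding model, preprocessing, and word-validation steps used in this study are described in Methods.

```python
from itertools import combinations
import numpy as np

def dat_score(words, embeddings):
    """Mean pairwise cosine distance between word vectors, scaled to 0-100.

    `embeddings` is assumed to be a dict mapping lowercase words to 1-D numpy
    arrays (e.g., GloVe vectors); words missing from the vocabulary are skipped.
    """
    vectors = [embeddings[w.lower()] for w in words if w.lower() in embeddings]
    distances = []
    for v1, v2 in combinations(vectors, 2):
        cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        distances.append(1.0 - cos_sim)        # cosine distance for one word pair
    return 100.0 * float(np.mean(distances))   # average over all pairs

# Hypothetical usage with a preloaded embedding dictionary `glove`:
# score = dat_score(["arm", "eyes", "feet", "hand", "head",
#                    "leg", "body", "ocean", "tree", "stamp"], glove)
```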
An alternative method for evaluating creativity is through the examination of creative writing. Recent investigations have used a quantitative approach similar to that taken by the DAT to assess the semantic distance covered by sentence-based texts38. Divergent Semantic Integration (DSI) is a measure of cosine similarity between pairs of word-level embeddings present in a textual narrative. This approach was implemented in light of more recent advances in language modeling allowing the computation of context-dependent word embeddings, which take the entire surrounding sentence into account39. DSI has been found to correlate strongly with human ratings of perceived creativity in short narratives38.
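For concreteness, a schematic sketch of a DSI-style computation is shown below. It is a deliberate simplification that uses a single BERT layer and all token pairs; the exact layer choices and word filtering used in this study follow the original DSI procedure38 and our Methods.

```python
# Illustrative DSI sketch (assumptions: bert-base-uncased, last hidden layer,
# all token pairs). Requires the `torch` and `transformers` packages.
import torch
from itertools import combinations
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def dsi(text):
    """Mean pairwise cosine distance between contextual token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state.squeeze(0)  # tokens x dims
    distances = []
    for i, j in combinations(range(hidden.shape[0]), 2):
        cos = torch.nn.functional.cosine_similarity(hidden[i], hidden[j], dim=0)
        distances.append(1.0 - cos.item())      # contextual distance for one pair
    return sum(distances) / len(distances)      # mean over all pairs = DSI
```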
The research community has recently delved into investigating the creative behavior of LLMs7,22,40,41,42,43,44,45,46 and exploring the potential interactions between human and machine creativity24,47,48,49,50,51,52. Recent studies have further advanced this field by evaluating creative writing in LLMs from diverse perspectives—comparing GPT-4 to award-winning novelist Patricio Pron in a human–machine creative writing contest53, demonstrating that LLM productions can match human-level creativity on certain humor and epicness dimensions54, and introducing novel automated methods for analyzing story arcs, turning points, and affective dynamics55—which we complement by directly comparing both DAT scores and performance on diverse creative writing tasks. However, a comprehensive benchmark analysis comparing creativity, measured by semantic divergence, across state-of-the-art LLMs and human performance is lacking. Our study not only seeks to fill this gap empirically but also to discuss the potential implications of applying creativity measures to AI productions on our understanding of human cognition and creative potential.
This paper provides a thorough examination of the ability of LLMs to mimic human creativity by comparing their performance to that of humans using established creativity measurements. Our goals are threefold: (i) benchmark multiple LLMs against a large human cohort (N = 100,000) on the DAT using identical scoring; (ii) manipulate model outputs via prompt strategies and hyperparameters (temperature) to test whether semantic (associative) creativity can be tuned; and (iii) evaluate generalization by testing whether higher DAT performance predicts greater divergence in creative writing (haikus, synopses, flash fiction) relative to human-written corpora, quantified with automated metrics.
The LLMs assessed in this study were not selected with the intent of conducting a comprehensive and competitive comparison of the best models available. The sheer pace of current LLM development would render such an approach quickly obsolete. Instead, we chose a wide range of models that vary in characteristics such as size, popularity, training, and license, hoping to provide a general framework to assess creativity in LLMs as compared to human participants. Throughout the manuscript, we use the term ‘LLM creativity’ to refer specifically to the divergent, associative aspect of semantic creativity, i.e. the ability to produce highly dissimilar sets of words, or in the case of story-writing, to integrate diverse ideas, objects, etc. into a narrative. As demonstrated by previous research using the DAT and DSI, this dimension of creativity shows a strong correlation with other facets of creative processes in humans32,38. Accordingly, we do not assume that LLMs achieve comparable performance via human-like mechanisms; instead, we present a human–AI benchmarking framework for these tests that can support more granular analyses of the underlying processes.
Results
Comparing large language models (LLMs) and human creativity using the divergent association task
To benchmark the divergent creativity of humans and different LLMs, we compared the mean of their respective DAT scores (see Methods). As depicted in Fig. 1A, GPT-4 surpasses human scores by a statistically significant margin, followed by GeminiPro, which is statistically indistinguishable from human performance. Interestingly, Vicuna, a drastically smaller model, performs significantly better than some of its larger counterparts. Apart from the Humans/GeminiPro, GeminiPro/Claude3 and Vicuna/GPT-3.5 contrasts, all other pairwise contrasts of mean DAT score are statistically significant (Fig. 1B). Importantly, a later release from OpenAI, GPT-4-turbo, demonstrates a notable decline in performance when compared to its predecessor, GPT-4. A comprehensive analysis across all versions of the GPT-4 models, as illustrated in Figure S2, indicates that newer iterations of the model do not consistently enhance performance on the DAT.
Notably, models with lower scores exhibit greater variability (Fig. 1C), often coinciding with a greater tendency to fail to comply with the instruction (as depicted by the pie charts).
The word-count analysis (Fig. 1D) revealed that GPT-4-turbo showed the highest degree of word repetition across responses, with the word "ocean" occurring in more than 90% of the word sets. The best-performing model, GPT-4, also showed a high degree of word repetition, with 70% of responses containing the word "microscope", followed by "elephant" (60%). The latter was ranked first in GPT-3.5’s responses, while the most frequent words chosen by humans were "car" (1.4%), followed by "dog" (1.2%) and "tree" (1.0%).
Fig. 1
Comparing LLMs and humans on the divergent association task (DAT). Summary of DAT performance across LLM and human samples. (A) Mean DAT score and 95% confidence intervals. (B) Heatmap of all contrasts, generated using two-sided independent t-tests, sorted by their correlation with the highest performing model, GPT-4. (C) Distribution for each model using a ridge plot of smoothed kernel density estimates. Black vertical lines represent the mean, and the small black/gray pie charts show the models’ prompt adherence (i.e. the proportion of valid responses). (D) Most frequent words across responses. The percentages represent the proportion of response sets (10 words) that include these words. *: p < .05, **: p < .01, ***: p < .001.
Fig. 2
Mean creativity scores for a wide range of large language models (LLMs) and human samples on the Divergent Association Task (DAT). Models are ranked from lowest to highest mean score, with error bars indicating 95% confidence intervals. For humans, each bar represents the mean of a random subsample of 500 responses (n = 500), drawn either from the full distribution (N = 100,000) or restricted to the top 50% (N = 50,000), 25% (N = 25,000), or 10% (N = 10,000) of responses. For LLMs, each bar represents the mean of 500 model-generated responses.
To further contextualize these findings, Fig. 2 presents a comprehensive comparison of creativity scores across an expanded set of LLMs released between January 2023 and June 2025, alongside different segments of the human population taken from our sample. Consistent with our main analyses, several leading LLMs now reliably exceed the average score of the general population. However, the most creative humans—those in the top decile, top quartile, and above the median—still achieve higher DAT scores than any model in our curated list (see supplementary Figures S5 and S6 and Table S1 for more details on statistical significance, response distributions across a wider range of models, and model specifications). This result underscores a persistent gap between artificial and human divergent thinking at the highest levels, despite rapid advancements in LLM design.
Assessing the validity of the DAT across LLMs
To validate the models’ compliance with the DAT instructions and to ensure their responses were not arbitrary word distributions, we compared their performance to a control condition, which entailed prompting the LLMs to generate a list of 10 words without specifying a need for maximal difference between the words. The findings, illustrated in Fig. 3, reveal that, when prompted with DAT instructions, every model significantly outperformed the control condition. This result was taken as evidence for the adherence of the LLMs to the task of producing a maximally divergent set of words.
Fig. 3
DAT compared to the control condition across LLMs. Performance of each model when being prompted with the original DAT instructions versus when being prompted to write a generic list of ten words. Each contrast is sorted in ascending order based on their mean performance in responding to the DAT instructions. ***: p < .001.
The effect of model temperature on creativity scores
In order to evaluate the potential for modulating LLMs’ creative performance via hyperparameter tuning, we explored the impact of adjusting the temperature value in GPT-4, the top-performing model. The underlying premise is that increased temperature would result in less deterministic responses, thereby yielding higher creativity scores. In line with this hypothesis, we observed a significant rise in DAT scores as a function of temperature (Fig. 4A), with a mean score of 85.6 achieved in the highest temperature condition (Fig. 4B). This mean score was higher than 72% of the human scores.
Notably, we found a reduced frequency of word repetitions as temperature increased, corroborating the notion that higher temperatures facilitate more diverse word sampling, whereas lower temperatures give rise to more deterministic responses (Fig. 4C). Interestingly, this pattern suggests that the superior performance of the top model is not simply attributable to the repetition of a well-optimized set of words (reflected in a high word count), but rather to its ability to generate a more diverse range of responses.
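For readers who wish to reproduce this manipulation, a minimal sketch of a temperature sweep is shown below. It assumes the openai Python client; the exact DAT prompt wording, model snapshot, and number of generations per condition (n = 500) are specified in Methods, and the prompt is therefore left as a placeholder here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DAT_PROMPT = "..."  # placeholder for the original DAT instructions (see Methods)

def generate_word_sets(model="gpt-4", temperature=1.0, n=500):
    """Collect n completions for the DAT prompt at a given temperature."""
    responses = []
    for _ in range(n):
        completion = client.chat.completions.create(
            model=model,
            temperature=temperature,  # Low: 0.5, Mid: 1.0, High: 1.5
            messages=[{"role": "user", "content": DAT_PROMPT}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

# One batch of generations per temperature condition, e.g.:
# low, mid, high = (generate_word_sets(temperature=t) for t in (0.5, 1.0, 1.5))
```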
Fig. 4
GPT-4 creativity scores across temperature levels. Varying performance across temperature levels in GPT-4 using the original DAT instructions. Each condition includes n = 500 generations. (A) Distributions of scores for each temperature level (Low: 0.5, Mid: 1.0, High: 1.5). Black vertical lines represent the median. (B) Barplot of the mean scores for each temperature level with results of the two-sided independent t-tests for each contrast. (C) Qualitative summary of the responses showing the 10 most frequent words across repetitions within each temperature condition. ***: p < .001.
Exploring strategies to manipulate LLMs’ performance
We found that imposing specific strategies influenced LLM performance on the task, as illustrated by the performance-based ranking of strategies (Fig. 5). To prompt the model to adopt different strategies in answering the DAT, we added a specification of the strategy to use at the end of the instructions, using the following sentence structure: “[…] using a strategy that relies on meaning opposition | using a thesaurus | varying etymology”. All differences in means were statistically significant, with the exception of the contrast between the Thesaurus and Basic Instructions, highlighting the impact of strategy variations on LLM creativity scores. Interestingly, we observed that the Etymology strategy outperformed the original DAT prompt for both GPT-3.5 and GPT-4. This finding implies that these models exhibit higher DAT scores when explicitly prompted to use “a strategy that relies on varying etymology.” Although the strategy trends were similar across GPT-3.5 and GPT-4, we also noticed subtle differences between the two; specifically, the Thesaurus strategy also outperformed the original DAT prompt in GPT-4.
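As an illustration, the strategy conditions can be built by appending the corresponding suffix to the base instructions; the sketch below is purely schematic, and both the base prompt and the exact suffix wording follow the sentence structure quoted above and the Methods.

```python
# Hypothetical construction of the strategy prompts; DAT_INSTRUCTIONS stands in
# for the original task wording described in Methods.
DAT_INSTRUCTIONS = "..."

STRATEGY_SUFFIXES = {
    "opposition": " using a strategy that relies on meaning opposition",
    "thesaurus": " using a strategy that relies on using a thesaurus",
    "etymology": " using a strategy that relies on varying etymology",
}

# One prompt per strategy condition, each ending with its strategy specification.
strategy_prompts = {name: DAT_INSTRUCTIONS + suffix
                    for name, suffix in STRATEGY_SUFFIXES.items()}
```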
Fig. 5
Comparison of DAT scores for GPT-3.5 and GPT-4 across different linguistic strategies. (A, D) Distribution for each strategy using a ridge plot of smoothed kernel density estimates for the two models. Black vertical lines represent the median. (B, E) Mean DAT score and 95% confidence intervals. (C, F) Heatmap of all contrasts, arranged in comparison to the highest performing strategy. *: p < .05, **: p < .01, ***: p < .001.
Investigating LLMs’ performance on creative writing tasks
Our exploration of LLMs’ ability to produce creative-like outputs extended beyond the DAT to a range of creative writing tasks designed to further interrogate the models’ creative capabilities in relation to human-generated corpora. These tasks, including the generation of haikus (three-line poems), movie synopses, and flash fiction (brief narratives), were employed as complementary investigations to corroborate the DAT findings and provide broader evidence of the creative capacities of the examined LLMs. The three models that scored highest in the DAT (GPT-3.5, Vicuna, and GPT-4) were used to generate creative writing samples. In analyzing these creative outputs, we employed Divergent Semantic Integration (DSI) to measure divergence across sentences, Lempel-Ziv complexity to assess unpredictability and diversity, and Principal Component Analysis (PCA) of text embeddings to understand thematic coherence and variance (see Methods).
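To illustrate the complexity measure, a minimal sketch of a Lempel-Ziv-style phrase count on a tokenized text is given below. This is a simplified LZ78-type parse with a naive length normalization introduced here for illustration; the exact implementation and normalization used in this study are described in Methods.

```python
def lz_complexity(tokens):
    """Count the number of new phrases in an LZ78-style parse of the sequence."""
    seen = set()
    phrase = ()
    count = 0
    for tok in tokens:
        phrase += (tok,)
        if phrase not in seen:  # a previously unseen phrase ends here
            seen.add(phrase)
            count += 1
            phrase = ()
    return count + (1 if phrase else 0)  # count a trailing partial phrase

def normalized_lz(tokens):
    """Naive length normalization (an assumption for this sketch)."""
    return lz_complexity(tokens) / max(len(tokens), 1)

# Example: normalized_lz("a silent pond a frog jumps in the sound of water".split())
```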
Fig. 6
Creative assessment of LLMs and human generated synopses. Overview of the level of semantic divergence in synopses generated by humans and high-performing LLMs using different methodologies. (A) Distributions of DSI values across all models and human participants. (B) Scatterplot of the two-dimensional PCA performed on all synopses’ embeddings. (C) Distributions of DSI values across temperature levels for GPT-4. (D) Distribution of normalized LZ complexity across models and human participants. *: p < .05, **: p < .01, ***: p < .001.
Fig. 7
Creative assessment of LLMs generated flash fiction. Overview of the level of semantic divergence in flash fiction generated by high-performing LLMs using different methodologies. (A) Distributions of DSI values across all models. (B) Scatterplot of the two-dimensional PCA performed on all flash fiction embeddings. (C) Distributions of DSI values across temperature levels for GPT-4. (D) Distribution of normalized LZ complexity across models. *: p < .05, **: p < .01, ***: p < .001.
Fig. 8
Assessment of creativity on LLM and human generated haikus. Overview of the level of semantic divergence in haikus generated by humans and high-performing LLMs using different methodologies. (A) Distributions of DSI values across all models and human participants. (B) Scatterplot of the two-dimensional PCA performed on all haikus embeddings. (C) Distributions of DSI values across temperature levels for GPT-4. (D) Distribution of normalized LZ complexity across models and human participants. *: p < .05, **: p < .01, ***: p < .001.
Our results indicate that GPT-4 consistently outperforms GPT-3.5 across all three categories of creative writing, as evaluated by Divergent Semantic Integration (DSI) (Figs. 6A, 7A and 8A). Despite this, human-written samples maintain a significant edge in creativity over both language models. We also observe that the temperature parameter in GPT-4 strongly influences the DSI for synopses and flash fiction, with higher temperature settings correlating with increased creativity scores (Figs. 6C and 7C), but not for haikus (Fig. 8C). In other words, while temperature does not significantly affect the creativity scores of haikus, it plays a more prominent role in longer writing formats, which show more pronounced differences in DSI scores in response to changes in temperature. Although the overall DSI variation across temperature settings in synopses appears modest, the differences are statistically significant (p < .001) and become more pronounced in less structured formats such as flash fiction (see Fig. 6). This suggests that task constraints modulate the impact of temperature on creative divergence.
A two-dimensional PCA embedding revealed distinct patterns, particularly when contrasting human responses with those of language models. In the case of both haikus and synopses, PCA reveals a clear separation between the embeddings of human-generated texts and those generated by LLMs. The clusters for different LLMs also occupy distinct regions in the embedding space. Additionally, when PCA is applied to the flash fiction data, it effectively distinguishes the three different LLMs, as depicted in Figs. 6B, 7B, and 8B.
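As an illustration of this analysis, the sketch below projects text embeddings onto two principal components; it assumes sentence-level embeddings from the sentence-transformers library with a hypothetical model name, whereas the embedding procedure actually used in this study is described in Methods.

```python
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Assumed encoder for illustration only; see Methods for the embeddings used here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def project_2d(texts):
    """Embed each text, then reduce the embedding matrix to two principal components."""
    embeddings = encoder.encode(texts)           # shape: (n_texts, n_dims)
    return PCA(n_components=2).fit_transform(embeddings)

# coords = project_2d(human_haikus + gpt4_haikus)  # hypothetical corpora
# Each row of `coords` gives one text's position in the 2-D PCA space (cf. Figs. 6B-8B).
```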
In relation to Lempel-Ziv complexity scores, the pattern in most cases mirrors the performance order indicated by the DSI (Figs. 6D, 7D and 8D). Humans exhibit higher scores than LLMs for haikus, which is consistent with the DSI findings. However, humans’ LZ scores are significantly lower than those of LLMs for synopses, in contrast to the DSI results.
Taken together, these results suggest that the DAT is a useful tool for quantifying associative thinking across different LLMs and conditions. However, establishing its full psychometric properties and interpreting these scores in terms of ‘creativity’ or ‘divergent thinking’ analogous to humans requires further investigation. Investigating the underlying mechanisms and latent structures, which likely differ significantly between humans and LLMs even when they produce similar outputs, is crucial for validating the DAT’s broader implications in evaluating LLMs’ potential to generate truly original text.
Discussion
The aim of the present paper was to benchmark the performance of a wide range of LLMs on a straightforward and validated creativity test, while comparing their scores to a large cohort of human responses (N = 100,000). Additionally, we aimed to modulate the creative performance of the highest-scoring models by adjusting the temperature level and the strategic approach employed by the LLMs in response to the DAT instruction. State-of-the-art LLMs exhibited remarkable proximity to human performance levels in the creativity assessment; the DAT scores of GeminiPro were statistically close to human performance, while GPT-4 exceeded it. It is important to note that this finding is nontrivial, as LLMs do not directly access all semantic distances between word pairs; instead, they depend on iterative transformations of latent representations, which differ from those used in the DAT computations.
Our results illustrate how targeted prompt design allows for the manipulation of LLMs’ creative outputs, as assessed by the DAT. To strengthen our findings, we also demonstrated that performance on the DAT aligns with creative scores across multiple writing formats, as measured through DSI and LZ. This suggests that the chosen metrics have potential for broad applications in assessing other types of creative outputs, either through matrix operations (cosine similarity) for assessing semantic distance or compression algorithms for assessing redundancy.
LLMs surpass the population average—but not the most creative humans
A key finding is that several LLMs, including GPT-4, surpass the population-average DAT score from our sample of 100,000 humans; however, even the best-performing models do not exceed the mean of the top 50% of human responses, and the upper human deciles still define a clear gap. Although our human benchmark is age- and sex-balanced and lacks occupational labels, it is plausible that the upper tail includes individuals with sustained practice in language-rich domains (e.g., writers, poets, editors, humanities scholars, creative-industry professionals); this remains speculative and not directly testable with our metadata. Taken together, these results support the claim made by OpenAI that GPT-4 is more creative than its predecessor, but they also challenge the assumption that language-based tasks are sufficient to understand human creativity in general. Moreover, the significantly lower performance of GPT-4-turbo relative to its predecessor, GPT-4, indicates that efficiency improvements or cost reductions might come at the expense of increased redundancy across the model’s responses, suggesting a trade-off between diversity and resource optimization in the development of language models.

Recent investigations have contrasted human and artificial creativity employing the Alternative Uses Task (AUT), revealing for instance that humans surpass GPT-3.5 in creative output22. In contrast, another study using the same task but with a different scoring approach found that both GPT-3.5 and GPT-4 outperform humans on average42. A separate study evaluating multiple models found that their scores on the AUT are similar to human performance, with some evidence that GPT-4 can exceed human originality24. A classical battery of creativity tests, the Torrance Tests of Creative Thinking, was also used to benchmark GPT-4 performance and found that it scored within the top 1% for originality and fluency41. One study also assessed the DAT in GPT-3.5 and GPT-4 compared to a human sample, showing that both models outperform humans on average40.

Our findings expand upon these insights by (i) juxtaposing human responses with a more extensive array of LLMs, (ii) exploring multiple creativity-related metrics which show potential for comparing LLMs and humans (DAT, DSI, and LZ complexity), (iii) comparing DAT benchmarking with performance on several creative writing tasks, providing convergent evidence for its validity as a proxy for creative writing evaluation in LLMs, (iv) using an unprecedentedly large human dataset (N = 100,000) of English speakers balanced for age and sex, (v) verifying adherence to the DAT instructions through comparison with a control condition, (vi) exploring the effect of hyperparameter tuning (temperature) and prompt design strategies, and (vii) sharing code that both makes direct calls to the APIs of all closed-source models and provides scripts to run open-source LLMs locally.

Despite widespread concern that AI could imminently replace creative professionals (writers, for instance), our results suggest that such fears remain premature. The persistent gap between the best-performing humans and even the most advanced LLMs indicates that the most demanding creative roles in industry are unlikely to be supplanted by current artificial intelligence systems.
LLM creativity can be manipulated through prompt design and hyperparameter settings
Our comparison of the DAT versus control conditions reinforces this observation, with all tested LLMs demonstrating a significant increase in DAT scores when instructed explicitly to generate a list of maximally different words compared to merely listing random words. This distinction underscores the sensitivity of LLMs to the nuances of task instructions and their capability to adjust their output based on these specifications. Moreover, the performance of LLMs varied markedly when exposed to different strategies. As expected, when prompted to use the opposition strategy, the models’ performance significantly decreased, as opposing words (e.g. “light” and “darkness”) have a relatively low semantic distance. We also found that when explicitly prompted to use words with varying etymology, both GPT-3.5 and GPT-4 outperformed the original DAT prompts, suggesting the potential for enhancing semantic divergence by referring to the roots of words. These observations align with recent findings showing significant increases in GPT-3.5 performance on the AUT when prompted to adopt a two-phase approach of brainstorming followed by selection, surpassing human creativity scores in some instances52. Thus, our results, in concert with these findings, indicate that manipulating prompts can be a powerful tool for modulating the creative performance of LLMs.

The efficacy of specifying strategies raises intriguing questions about potential parallels in human creative processes. It is plausible that humans, while responding to the DAT, implicitly or explicitly employ a mix of strategies to generate their responses. Future research would benefit from exploring this dimension, systematically comparing human strategic approaches with those we can program into LLMs. For example, studies could verify whether changing the instructions given to humans or LLMs results in similar changes in performance. Such comparative analyses could further our understanding of how strategy manipulation can be leveraged to enhance the creative performance of both LLMs and humans.
In addition to prompting strategies, hyperparameter tuning was found to significantly bolster the performance of LLMs, particularly GPT-4. An increase in temperature led to a substantial rise in DAT scores, with the highest temperature condition surpassing the mean creativity score of a significant portion of human participants. This increase in semantic divergence aligns with the concurrent decrease in word repetition frequency, suggesting that higher temperatures indeed promote more diverse word sampling.