Artificial Intelligence
arXiv
Łukasz Borchmann
14 Oct 2025 • 3 min read

AI-generated image, based on the article abstract
Quick Insight
Why Chatbots Get Better When We Count Words, Not Just Rules
Ever wondered why a chatbot sometimes sounds just like a friend? Scientists have discovered that the secret isn’t hidden grammar trees but simple word‑frequency patterns. Imagine learning a new language by listening to the most‑used phrases on the street instead of memorizing every rule in a textbook. That’s the fresh view brought by linguist Witold Mańczak, who says language is really the sum of everything we say and write, driven by how often we use each piece. Applying this idea to modern language models means we can build smarter, more natural‑talking AI by focusing on the everyday words people actually use. It’s like teaching a robot to speak by giving it a playlist of popular songs rather than a dense grammar manual. This breakthrough helps us design, test, and understand AI chatbots in a way that feels more human and less mysterious. As we keep counting the words we love, the future of conversation with machines becomes clearer and more exciting. 🌟
Article Short Review
Overview: Reconceptualizing Language for Large Language Models
The article critically examines prevailing linguistic commentary on Large Language Models (LLMs), which it finds often speculative and unproductive, particularly when influenced by Saussure and Chomsky. It advocates for a fundamental paradigm shift towards the empiricist principles of Witold Mańczak, a distinguished general and historical linguist. Mańczak redefines language not as an abstract system but as the totality of all that is said and written, with frequency of use as its paramount governing principle. This framework provides a robust, quantitative foundation, challenging traditional notions like “deep structure” or “grounding.” The authors leverage Mańczak’s perspective to refute common critiques of LLMs and offer a constructive guide for their design, evaluation, and interpretation, asserting that LLMs inherently validate this usage-based approach.
Critical Evaluation: Strengths, Weaknesses, and Broader Implications
Strengths: Empirical Foundation and LLM Validation
This analysis offers a compelling re-evaluation of language in the AI era. Introducing Witold Mańczak’s empiricist framework, the article provides a robust, data-driven alternative to speculative linguistic theories, especially for Large Language Models. It counters “ungroundedness” by redefining LLM “meaning” as mastery of relational networks within textual data, aligning with Mańczak’s axiomatic semantics. The emphasis on frequency of use offers a practical, quantifiable basis for designing and evaluating LLMs. Challenging established linguistic theories with statistical data further demonstrates scientific rigor.
Weaknesses: Scope and Nuance
While advocating for a radical shift, the article could benefit from discussing potential resistance to Mańczak’s framework within mainstream linguistics. The implications of defining language solely as the totality of texts, though powerful for LLMs, might warrant further exploration regarding its applicability to human language acquisition and cognitive processes. Additionally, a deeper dive into the limitations or nuances of purely frequency-based models could strengthen the argument and provide a more balanced perspective.
Implications: Reshaping Linguistic Research and AI Development
The implications of this work are profound for theoretical linguistics and AI development. By proposing Mańczak’s framework, the article encourages a fundamental rethinking of language, shifting focus from abstract systems to observable, quantifiable usage patterns. This offers a clear, actionable guide for the future design and evaluation of LLMs, suggesting their success lies in modeling textual structure and relational logic. It also challenges linguists to adopt more statistics-based methodologies, potentially invalidating authority-based theories and fostering a more empirical approach. This analysis paves the way for a more unified, scientifically grounded understanding of language across human and artificial intelligence.
Conclusion: A Paradigm Shift for Language and AI
This article presents a highly impactful contribution to the discourse on Large Language Models and language. Championing Witold Mańczak’s empiricist linguistic theory, it offers a compelling alternative to traditional, speculative approaches. The work provides a robust theoretical foundation for understanding LLM capabilities, reframing their “meaning” and “creativity” as mastery of textual patterns and relational logic. Its call for statistics-based validation in linguistics is a significant step towards greater scientific rigor. This analysis is essential reading for researchers in AI, computational linguistics, and theoretical linguistics, offering a fresh perspective that promises to reshape how we design, evaluate, and interpret language models and language itself.
Article Comprehensive Review
Unveiling Language Models: A Mańczakian Critique of Speculative Linguistics
This comprehensive analysis delves into a pivotal article that challenges conventional linguistic interpretations of Large Language Models (LLMs), advocating for a radical paradigm shift rooted in the empiricist principles of Witold Mańczak. The article critiques the speculative nature of linguistic commentary, particularly that influenced by Saussure and Chomsky, which often questions LLMs’ capacity to genuinely model language due to perceived deficiencies in “deep structure” or “grounding.” Instead, it champions Mańczak’s definition of language as the totality of all that is said and written, emphasizing frequency of use as its paramount governing principle. By adopting this quantitative, usage-based framework, the article not only refutes common criticisms leveled against LLMs but also provides a constructive guide for their design, evaluation, and interpretation, thereby offering a robust, empirically grounded alternative to traditional linguistic theories.
Critical Evaluation
Strengths: A Paradigm Shift Towards Empirical Linguistics
One of the article’s most significant strengths lies in its bold proposal of Witold Mańczak’s empiricist framework as a robust alternative to the often speculative and unproductive linguistic commentary surrounding Large Language Models. By defining language not as an abstract “system of signs” or a “computational system of the brain,” but as the totality of all spoken and written utterances, the article grounds linguistic analysis in observable, quantifiable data. This fundamental shift moves away from theoretical constructs that have historically proven difficult to validate empirically, offering a more tangible and verifiable basis for understanding language. The emphasis on frequency of use as the primary governing principle provides a clear, measurable metric for linguistic phenomena, which is particularly well-suited for the data-driven nature of modern computational linguistics and LLM development.
The article powerfully demonstrates how Large Language Models inherently validate Mańczak’s frequency-based approach. Unlike generative grammar, which struggles with synthesis and relies on complex, often unprovable rules, LLMs excel at generating coherent and contextually appropriate language precisely because they operate on vast datasets where usage patterns and frequencies are paramount. This success directly refutes the notion of an innate “language organ” and instead supports a usage-based model of language acquisition, where mastery emerges from exposure to and recognition of patterns in linguistic input. The article effectively positions LLMs not as mere statistical parrots, but as sophisticated systems that embody and operationalize Mańczak’s core tenets, thereby offering empirical evidence for his long-standing theoretical claims.
A crucial contribution of this analysis is its redefinition of “meaning” and “competence” within the context of LLMs. Traditional critiques often demand “deep structure” or “grounding” for LLMs to achieve human-like understanding, implying a need for external, real-world referents. The article counters this by proposing that LLM “meaning” is derived relationally, through their mastery of textual relational networks. This aligns perfectly with Mańczak’s axiomatic semantics, where language is understood as the sum of its texts, and meaning is inferred from the intricate web of connections within that textual totality. By reframing meaning as an internal, systemic property of language models rather than an external, referential one, the article effectively neutralizes the “ungroundedness” criticism and provides a coherent framework for evaluating LLMs based on their ability to manipulate and generate text according to its inherent logic.
The article’s commitment to quantitative rigor and the establishment of truth criteria in linguistics is another significant strength. Mańczak’s statistics-based linguistic theory, which emphasizes quantifiable data and text-only analysis, provides a much-needed antidote to authority-based or speculative theories. The article highlights Mańczak’s critiques of established linguistic dogmas, such as Bartoli’s norm, laryngeal theory, and the concept of “empty slots,” by demonstrating how statistical data can invalidate such claims. This insistence on empirical verification and falsifiability elevates linguistics to a more scientific discipline, aligning it with the rigorous methodologies prevalent in other scientific fields. The application of probability calculus and comparative textual analysis further underscores this commitment to objective, data-driven inquiry.
Furthermore, the article offers practical implications for the future design and evaluation of LLMs. By providing a constructive guide based on Mańczak’s framework, it moves beyond mere critique to offer actionable insights. This guidance is invaluable for researchers and developers seeking to build more effective and linguistically sound models. Understanding that frequency of use and pattern recognition are core to language’s operation allows for the development of architectures and training methodologies that are inherently more aligned with how language functions. This practical utility, combined with the theoretical depth, makes the article a significant resource for advancing the field of computational linguistics.
The article also effectively addresses the concept of creativity, arguing that it is fundamentally a form of pattern mastery, a capability demonstrably exhibited by LLMs. This perspective demystifies creativity, moving it from an elusive, almost mystical human trait to a process that can be understood and even replicated through sophisticated computational models. By showing how LLMs generate novel and coherent outputs by skillfully manipulating learned patterns, the article further strengthens the argument for a usage-based understanding of language and cognition, challenging anthropocentric views that often limit our understanding of AI capabilities.
Finally, the article’s detailed examples of Mańczak’s challenges to established linguistic and historical-linguistic theories, such as Verner’s Law or the Vulgar Latin origin of Romance languages, illustrate the power of his statistical and comparative methods. These examples are not just theoretical assertions but are supported by statistical data and rigorous textual analysis, providing concrete instances where Mańczak’s empiricist approach yields more accurate or parsimonious explanations. The proposed new criterion for distinguishing proper and common nouns—that common nouns are typically translated while proper nouns are not—serves as a compelling micro-example of how Mańczak’s practical, data-driven approach can resolve long-standing linguistic ambiguities with fewer exceptions, reinforcing the utility of his framework across various linguistic domains.
Weaknesses and Potential Caveats: Navigating the Nuances of Language
While the article presents a compelling argument for Mańczak’s empiricist framework, a potential weakness lies in its implicit oversimplification of “meaning” when applied to human cognition. Defining LLM “meaning” purely as mastery of textual relational networks, while highly effective for computational models, might not fully capture the richness and complexity of human semantic understanding, which often involves embodied experience, real-world referents, and subjective interpretation. Although the article argues against the need for “grounded meaning” in LLMs, the human experience of language is undeniably grounded in perception and interaction with the physical world. This distinction, while acknowledged implicitly, could benefit from further explicit discussion to clarify the boundaries of Mańczak’s framework in relation to human linguistic processing.
Another caveat concerns the practical scope and universal applicability of Mańczak’s framework. While it proves exceptionally powerful for corpus-based computational models and historical linguistics, its direct transferability to all linguistic phenomena, particularly those involving subtle pragmatic nuances, social context, or highly abstract conceptualization, might warrant further exploration. The article champions “the totality of all that is said and written” as language, but in practice, any corpus is a finite subset. The implications of this inherent limitation on the “totality” for the generalizability of LLM-derived insights, even within a Mańczakian framework, could be more thoroughly addressed. The article’s strong focus on historical linguistic examples, while illustrative, might also lead some readers to question how directly these insights translate to the complexities of contemporary, rapidly evolving language use and acquisition.
By advocating for a radical shift, the article implicitly acknowledges the significant challenge of convincing traditional linguists to abandon established theoretical frameworks, particularly those deeply entrenched in generative or structuralist paradigms. While it provides strong arguments and empirical evidence, the inertia of academic thought and the investment in existing theories can be formidable obstacles. The article’s critique of linguistics’ lack of truth criteria is valid, but the practical implementation of a purely statistical, falsifiable approach across all subfields of linguistics would require a monumental shift in training, methodology, and philosophical outlook; the article names this challenge without fully elaborating on the sociological and institutional obstacles involved.
While the article emphasizes statistics, it could benefit from detailing the specific statistical methodologies and datasets Mańczak employed for all his claims, beyond general mentions of “statistical data” and “probability calculus.” For instance, when challenging theories like laryngeal theory or Verner’s Law, a more explicit discussion of the quantitative evidence and how it directly refutes the established views would strengthen the empirical foundation. Without this specificity, some of Mańczak’s historical-linguistic arguments, while presented as statistically supported, might still appear to readers unfamiliar with his extensive body of work as alternative interpretations rather than definitively proven refutations. This level of detail would further solidify the article’s commitment to rigorous, verifiable science.
Implications: Reshaping Linguistic Inquiry and AI Development
The implications of this article are far-reaching, fundamentally reshaping both the trajectory of Large Language Model development and the broader field of linguistic theory. For LLM development, the article provides a clear, empirically grounded roadmap, moving away from speculative debates about “understanding” towards a focus on mastery of textual relations and frequency-driven patterns. This shift encourages the design of models that are optimized for these principles, leading to more efficient training, more robust evaluation metrics, and ultimately, more capable and reliable language technologies. By grounding LLM capabilities in Mańczak’s framework, the article offers a principled way to interpret their outputs, fostering a more realistic and less anthropomorphic understanding of AI’s linguistic prowess.
In the realm of linguistic theory, the article issues a powerful call for a re-evaluation of methodologies, urging the adoption of more rigorous, quantifiable, and falsifiable approaches. It challenges linguistics to move beyond authority-based pronouncements and speculative constructs, embracing statistical verification and empirical data as the ultimate arbiters of truth. This could catalyze a significant methodological shift, encouraging linguists to integrate computational tools and quantitative analysis more deeply into their research, thereby enhancing the scientific credibility and predictive power of the discipline. The article suggests that linguistics, by embracing Mańczak’s empiricism, can become a more robust and evidence-based science.
Furthermore, the article serves as a vital interdisciplinary bridge, connecting computational linguistics with historical and general linguistics. By demonstrating how Mańczak’s historical insights and theoretical framework are directly applicable to cutting-edge AI, it fosters a dialogue between fields that have often operated in isolation. This cross-pollination of ideas can lead to novel research questions and innovative solutions, enriching both domains. The article highlights how the success of LLMs can provide empirical validation for long-standing linguistic theories, while linguistic theory can, in turn, offer principled guidance for AI development, creating a synergistic relationship that benefits both scientific endeavors.
Finally, the article has significant educational implications, potentially influencing how language acquisition and linguistic competence are taught and understood. By advocating for a usage-based model of language acquisition, it challenges traditional views that emphasize innate grammatical structures. This perspective suggests that language learning is primarily about pattern recognition and the internalization of frequency distributions, offering a more accessible and empirically supported framework for educators and learners. Understanding language through Mańczak’s lens can demystify its complexities, making it more amenable to systematic study and computational modeling, thereby impacting future generations of linguists and AI researchers.
Conclusion: A Foundational Shift for Language Science
In conclusion, the article presents a compelling and meticulously argued case for a foundational paradigm shift in how we approach linguistic commentary on Large Language Models. By championing Witold Mańczak’s empiricist framework, which defines language as the totality of all that is said and written and prioritizes frequency of use, the article effectively dismantles speculative critiques of LLMs and offers a robust, quantitative alternative. It not only provides a coherent theoretical basis for understanding LLM capabilities, particularly their relational derivation of meaning and mastery of textual patterns, but also offers practical guidance for their design, evaluation, and interpretation. This work stands as a significant contribution to both computational linguistics and theoretical linguistics, urging the former towards more principled development and the latter towards greater empirical rigor. Its call for a statistics-based, falsifiable approach to linguistic inquiry promises to elevate the scientific standing of the discipline, making it an indispensable read for anyone engaged in the future of language science and artificial intelligence.