Research Motivation
This project is an implementation of research conducted at the FSU NLP Lab. It is motivated by three simple observations:
- AI systems exhibit a distinct language style;
- Millions of people interact with this AI-generated language every day; and
- Sustained exposure to this language may influence how humans write and speak.
One major goal of our research was to develop a fully automated procedure for identifying words that are systematically overused by large language models. Automation is crucial: it allows us to scale the analysis across many languages, domains, and model architectures without manual curation. The Explorer makes this procedure accessible and interactive. For a detailed explanation of the methodology, see the video below.
Visualising the Method
How we track word frequency using windowed prevalence across documents:
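As a rough illustration of the idea named above, the sketch below computes one plausible reading of windowed prevalence: the share of documents in each time window that contain a given word at least once. The function name `windowed_prevalence`, the `(year, text)` input format, the fixed-width windows, and the simple whitespace tokenisation are all assumptions made for this example. They are not taken from the study, whose actual procedure may differ.

```python
from collections import defaultdict


def windowed_prevalence(docs, word, window):
    """Share of documents per time window that contain `word` at least once.

    docs:   iterable of (year, text) pairs (hypothetical input format)
    word:   the target word, matched case-insensitively
    window: window width in years (assumed fixed-width, non-overlapping)
    """
    totals = defaultdict(int)  # documents seen per window
    hits = defaultdict(int)    # documents containing the word per window
    target = word.lower()
    for year, text in docs:
        start = (year // window) * window  # first year of this window
        totals[start] += 1
        # Naive tokenisation: split on whitespace, strip common punctuation.
        tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
        if target in tokens:
            hits[start] += 1
    return {start: hits[start] / totals[start] for start in sorted(totals)}


# Toy data, invented purely for illustration.
docs = [
    (2020, "We examine the data."),
    (2021, "Results are shown below."),
    (2022, "We delve into the details."),
    (2023, "Let us delve deeper; we delve again."),
]
print(windowed_prevalence(docs, "delve", window=2))
```

Counting documents rather than raw tokens keeps a single long or repetitive document from dominating a window, which is one reason a prevalence-style measure might be preferred over a plain frequency count.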
The “Why”
Our research goes beyond identifying which words are overused. It asks why large language models develop language styles that diverge so strongly from human baselines.
A substantial part of this divergence appears to be linked to Reinforcement Learning from Human Feedback (RLHF). However, many aspects remain underexplored, for example the role of annotator demographics, the effect of task framing, and how different optimisation objectives shape lexical behaviour. Understanding these mechanisms is essential: as millions of people are exposed to AI output, we must ensure that this output is aligned with our expectations, linguistically and beyond.
Key Findings
Model idiolects. Overuse patterns vary markedly across architectures (for example, GPT versus Claude), indicating that different models carry distinct idiolects in the words they favour. This is a statement about model output, not about human language. Read the analysis in Scientific American.
A syntactic fingerprint. A high proportion of the overused terms are function words (connectors and the like), which suggests a model's lexical signature rests heavily on syntax rather than topic alone. See the related research.
Papers
The Explorer accompanies the following study:
Further related work:
- Anderson, B., Galpin, R., & Juzek, T. S. (2025). Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1), 179–191. https://ojs.aaai.org/index.php/AIES/article/view/36540/38678
- Juzek, T. S., & Ward, Z. B. (2025). Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback. To appear in the BIAS25 proceedings. https://arxiv.org/pdf/2508.01930
- Galpin, R., Anderson, B., & Juzek, T. S. (2025). Exploring the Structure of AI-Induced Language Change in Scientific English. The International FLAIRS Conference Proceedings, 38. https://journals.flvc.org/FLAIRS/article/view/138958/144064
- Juzek, T. S., & Ward, Z. B. (2025). Why does ChatGPT “Delve” so much? Exploring the sources of lexical overrepresentation in Large Language Models. Proceedings of the 31st International Conference on Computational Linguistics, 6397–6411. https://aclanthology.org/2025.coling-main.426.pdf
Cite this Work
@misc{juzek2026lexical,
  title         = {AI-Associated Lexical Shifts Across 34 Languages:
                   Cross-Lingual Convergence and Diachronic Uptake in News Writing},
  author        = {Juzek, Thomas Stephan},
  year          = {2026},
  eprint        = {2605.25358},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.25358}
}
Get in Touch
Contact: Thomas Stephan (Tommie) Juzek