
Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses

Dong Nguyen¹ (corresponding author), Laura Rosseel²
¹ Utrecht University, The Netherlands
² Vrije Universiteit Brussel, Belgium

Abstract

Spelling variation (e.g. funnnn vs. fun) can influence the social perception of texts and their writers: we often have various associations with different forms of writing (is the text informal? does the writer seem young?). In this study, we focus on the social perception of spelling variation in online writing in English and study to what extent this perception is aligned between humans and large language models (LLMs). Building on sociolinguistic methodology, we compare LLM and human ratings on three key social attributes of spelling variation (formality, carefulness, age). We find generally strong correlations in the ratings between humans and LLMs. However, notable differences emerge when we analyze the distribution of ratings and when comparing between different types of spelling variation.

1 Introduction

Language is not only used to communicate information about the world, it is also a social tool that speakers can employ to construct a social identity (Eckert, 2012). To that end, language users can draw on the social meaning of language, which refers to social attributes that can be associated with linguistic forms and their users (Walker et al., 2014). As an example, one could write ‘c u tonite’ as a variant of ‘see you tonight’. Both texts carry the same referential meaning; that is, they communicate the same message that the writer will see the addressee the same evening. However, the social attributes the reader connects to the writer and context of each of these texts can be quite different. Perhaps the first version is seen as more informal and the writer of the second version as more serious. Or maybe the writer of the former is perceived as younger than the latter.

Sociolinguistic research has shown that language variants, i.e. different linguistic forms that have an equivalent referential meaning (cf. tonite vs. tonight; Labov 1972), can carry multiple social meanings. This goes for both written and spoken language. For instance, variation in the pronunciation of word-final -ing (either as velar -ing [ɪŋ] as in working or as alveolar -in [ɪn] as in workin) in American English has been shown to be associated with social attributes like perceived intelligence (the velar pronunciation sounding more intelligent than the alveolar one), as well as regional background (workin sounding Southern) and sexual orientation (working being linked with gay speakers) (Campbell-Kibler, 2014, 2009, 2010a). Thus, social meaning is a key part of language understanding and use. Although sociolinguistics has a rich tradition in studying the social meaning of language variation (Campbell-Kibler, 2010b), it has been understudied in NLP (Nguyen, Rosseel, and Grieve, 2021).

Read the tweets below focussing on the words that are between <W> and </W>.

Tweet 1: Not sure but <W>nvr</W> mind

Tweet 2: Not sure but <W>never</W> mind

Formality: Then rate how (in)formal you find each tweet, with a number from 0 to 100 (0=very informal, 100=very formal).

Carefulness: Then rate how careless/careful you find each writer, with a number from 0 to 100 (0=very careless, 100=very careful).

Age: Then give your best estimation of how old the writer of each tweet is, in number of years (whole numbers).

Table 1: Both participants and LLMs are presented with tweet pairs. To collect formality and carefulness ratings, LLMs provided numerical ratings, while the human participants used a slider. The text shown above is part of the prompts presented to LLMs; the prompts are based on the instructions that humans received.
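As a rough illustration of how the prompt fragments in Table 1 could be assembled for one tweet pair, consider the following minimal sketch. The instruction texts are copied from Table 1. The function names, the whitespace tokenization, and the choice to query one attribute per prompt are assumptions made for illustration only, and the full prompts used in the study contain more than what is shown here.

```python
# Instruction texts copied from Table 1; all names below are hypothetical.
HEADER = "Read the tweets below focussing on the words that are between <W> and </W>."
INSTRUCTIONS = {
    "Formality": "Then rate how (in)formal you find each tweet, with a number "
                 "from 0 to 100 (0=very informal, 100=very formal).",
    "Carefulness": "Then rate how careless/careful you find each writer, with a "
                   "number from 0 to 100 (0=very careless, 100=very careful).",
    "Age": "Then give your best estimation of how old the writer of each tweet "
           "is, in number of years (whole numbers).",
}


def tag_word(tweet: str, word: str) -> str:
    """Wrap the first whitespace-delimited occurrence of `word` in <W> tags."""
    tokens = tweet.split()
    if word not in tokens:
        raise ValueError(f"{word!r} is not a token of {tweet!r}")
    index = tokens.index(word)
    tokens[index] = f"<W>{word}</W>"
    return " ".join(tokens)


def build_prompt(pair, attribute: str) -> str:
    """Assemble a prompt fragment for one tweet pair and one social attribute.

    `pair` is ((tweet_1, word_1), (tweet_2, word_2)); asking about a single
    attribute per prompt is an assumption of this sketch.
    """
    (tweet_1, word_1), (tweet_2, word_2) = pair
    parts = [
        HEADER,
        f"Tweet 1: {tag_word(tweet_1, word_1)}",
        f"Tweet 2: {tag_word(tweet_2, word_2)}",
        f"{attribute}: {INSTRUCTIONS[attribute]}",
    ]
    return "\n\n".join(parts)


example_pair = (("Not sure but nvr mind", "nvr"), ("Not sure but never mind", "never"))
print(build_prompt(example_pair, "Formality"))
```

Keeping the tagging step separate from the prompt assembly makes it easy to check that both versions of a tweet differ only in the tagged word.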

Although there is a large body of work in NLP on language variation (Nguyen et al., 2016), the main focus has been on socially stratified language production. For example, studies have investigated how linguistic features in texts vary across sociodemographic groups. Often, such studies were carried out in the context of authorship profiling, e.g., can linguistic features be used to predict attributes like gender and age (Argamon et al., 2009; Schwartz et al., 2013; Flekova, Preoţiuc-Pietro, and Ungar, 2016). As another example, studies have examined how such production differences affect the fairness of models, showing how task performance varies across texts written in different language varieties (Joshi et al., 2025) or by speakers from different sociodemographic groups (Garimella et al., 2019). As a final example, there is an increasing interest in the implications for text generation, e.g., by asking LLMs (large language models) to produce text when simulating certain personas, often (partially) defined by demographic attributes (Malik, Jiang, and Chai, 2024), and to assess the quality of LLM output across different language varieties (e.g., Standard English vs. African American English; Sandoval et al. 2025; Deas et al. 2023).

Contrary to this considerable amount of work on variation in language production, research in NLP focusing on how language variation is socially perceived (in other words, what social meanings it can express) remains scarce. Although there is extensive research on the linguistic competence of LLMs in general (Waldis et al., 2024), their sociolinguistic competence has received far less attention (Duncan, 2024). Sociolinguistic competence is not only about being able to produce language that aligns with how the writer would like to position themselves socially; it also includes understanding how linguistic variation is socially perceived. Understanding what social meanings LLMs associate with language variation is essential, as it might influence their downstream behavior (Blodgett et al., 2020), such as the language they suggest or produce when used as writing tools (Goldshtein et al., 2024; Sourati et al., 2025), or in a decision-making context (Hofmann et al., 2024). Measuring what social meanings LLMs attach to linguistic variation, the focus of this study, is therefore an essential first step for assessing how these associations might shape their downstream behavior.

Moreover, following a large body of work that compares LLMs with humans in terms of internal representations, behavior and language processing (Karamolegkou, Abdou, and Søgaard, 2023; Brysbaert, Martínez, and Reviriego, 2024; Tjuatja et al., 2024; Cheung, Maier, and Lieder, 2025), we focus on comparing how language variation is socially perceived by humans and how such patterns are reflected in LLM output. With this research, we hope to contribute to discussions about the use of LLMs for sociolinguistic research (Nguyen, 2025). Taking it a step further, using LLMs to collect human-like experimental data has attracted attention in many fields, ranging from political sciences (Argyle et al., 2023) to cognitive sciences (Binz et al., 2025) to machine learning (Anthis et al., 2025), though it has also sparked criticism (Dillion et al., 2023; Aher, Arriaga, and Kalai, 2023; Agnew et al., 2024; Abdurahman et al., 2024; Wang, Morgenstern, and Dickerson, 2025). If LLMs associate similar social meanings to language variation as humans do, they could serve as a valuable tool for sociolinguistic research. For example, they could be used to uncover previously unknown potential social meanings of language variants that could then be further researched in humans, or support pilot studies to inform experimental design. However, LLM-based insights should ultimately always be validated and explored further through real human experiments.

In this study, we focus on the social perception of spelling variation in English. While social meanings can be attached to various types of linguistic forms (e.g. pronunciation, word choices, grammatical forms), we take spelling variation as a starting point for developing this new research direction for three reasons. First, by focusing on spelling variation, we can create text pairs with clearly equivalent referential meanings; that is, it is fairly straightforward to determine whether they refer to the same thing or concept (e.g., Table 1). Second, sociolinguistic studies have documented how spelling variation is used for social meaning making in online communication (Busch, 2021; Hilte, Vandekerckhove, and Daelemans, 2019; Ilbury, 2020; Russ, 2012; Squires, 2010). Third, since many LLMs are pre-trained on web data (Soldaini et al., 2024; Brown et al., 2020; Gao et al., 2020), it is likely that they have encountered many instances of spelling variation during training. We aim to answer the following research question (RQ): How do the social meanings that LLMs associate with English spelling variation compare to those of humans? We investigate this RQ by examining seven common types of spelling variation and three key social meanings (formality, carefulness, and age).

Methodology

Our setup is rooted in sociolinguistic research: We adapt the speaker evaluation paradigm, frequently used in sociolinguistics research, to the analysis of LLMs (§3). To do so, we create a highly controlled, carefully designed dataset with tweets. Due to this rigorous design, our dataset is smaller than typical NLP datasets. Following sociolinguistic principles, we then collect many human ratings per item (§4); we then compare these ratings with LLM outputs (§5).

Contributions

(i) We highlight an underexplored aspect of language use — social meaning — for the analysis of LLMs; (ii) We propose a methodology, rooted in sociolinguistic research, to examine LLMs’ associations of social attributes with spelling variation; (iii) We find surprisingly high correlations between human and LLM ratings. However, when zooming in on the rating distributions and the individual spelling variation categories we do observe notable differences; (iv) We compare two prompting strategies and find that the strategy that more closely mirrors the human setup leads to higher correlations. Our code and data will be made publicly available upon publication.

2 Related Work

2.1 Language Variation in NLP

In this subsection we discuss related work on language variation in NLP, with a focus on LLMs and research on social perceptions of language variation in text.

How language models handle language variation

Several studies have analyzed what (large) language models have learned about language variation. Nielsen, Kirov, and Roark (2023) studied whether they are consistent in producing British or American spelling. Furthermore, building on the link between language variation and sociodemographic attributes of authors, researchers have investigated whether LLMs encode sociodemographic knowledge of authors based on age and gender prediction tasks (Lauscher et al., 2022), and whether they perform differently on the cloze task using data from different demographic groups (Zhang et al., 2021). We are not aware of work focusing specifically on different types of spelling variation, besides a pre-LLM study on word embeddings (Nguyen and Grieve, 2020).

Language variation has also been considered as a source of bias in NLP models (Blodgett et al., 2020). Studies have found task performance differences between different language varieties (Ziems et al., 2023; Faisal et al., 2024; Lin et al., 2025). Furthermore, Petrov et al. (2023) found higher API costs for certain language varieties, because the texts are segmented into a greater number of tokens. In terms of text generation, studies have analyzed how the language variety of a prompt influences the generated responses (e.g., in terms of stereotyping and demeaning content (Fleisig et al., 2024)) and researchers have debated to what extent LLMs should adapt their language use (e.g., style, dialect) to that of users (Lucy et al., 2024; Sandoval et al., 2025).

Social perception of language variation in NLP

In NLP, studies on perceptions of social attributes linked to linguistic variation in text remain scarce. A few studies have investigated this topic in the context of author profiling. An early study by Nguyen et al. (2014) collected human guesses of the age and gender of Dutch Twitter users based on their tweets. These guesses were then compared with the actual Twitter users’ gender and age, to reflect on the limitations of automatically inferring such social attributes from text. Other studies have collected human guesses of users’ gender and age based on English tweets, investigating how they deviate from model predictions and ground truth data (Flekova et al., 2016) and whether tweets can be strategically selected to influence human guesses (Preoţiuc-Pietro, Chandra Guntuku, and Ungar, 2017). A recent study by Chen et al. (2025) collected human perceptions of gendered style. They did not compare the human ratings with ratings by LLMs. Importantly, none of these studies controlled for confounding factors like topics mentioned in the text (e.g., references to school, or playing soccer). As a result, factors besides language variation could have played a role in participants’ perceptions of the writer’s social attributes. As an example, Chen et al. (2025) found that emotion features were associated with perceptions of feminine style, and a feature analysis by Flekova et al. (2016) highlighted words like boyfriend or male names (e.g., joe).

Recently, Cheng, Yu, and Jurafsky (2025) developed an LLM-based approach to measure human-like tone in text and four dimensions of social perception (warmth, status, social distance, and gender), building on social psychology work. For example, gender perceptions were measured using log-likelihood ratios of texts prefixed with ‘he said’ vs. ‘she said’, and social distance by prefixing text with phrases such as ‘the stranger said’ vs. ‘my friend said’. They used these measurements to investigate whether human preferences towards LLM-generated output differed along these dimensions. Although some of their data was based on matched prompts (i.e., responses to the same input prompt), their experiments, like those of prior studies, did not isolate the effects of specific linguistic features.

The following two recent studies are most relevant to ours, as they focused on social perceptions of language variation and also used meaning-matched input texts. Hofmann et al. (2024) probe LLMs by presenting them with tweets in Standard American English (SAE) and African American English (AAE), investigating both a meaning-matched and a non-meaning-matched setting. Bui et al. (2025) presented LLMs with texts in different German dialects as well as Standard German. Like us, they considered different social attributes. There are a few key differences with our study, though: (i) We focus on different types of spelling variation. In our study, identical texts are offered in pairs with each pair differing by only a single word, while both Hofmann et al. (2024) and Bui et al. (2025) compare between language varieties; (ii) Both studies compare their findings with those obtained from humans; however, our comparison is more controlled. Hofmann et al. (2024) provide a quantitative comparison, but importantly, they present LLMs with different input than what the human participants in the original studies received. Bui et al. (2025) relate their overall findings only to general trends observed in human dialect perception research, rather than making the kind of quantitative comparison that we perform. In contrast, we aim to maximize the comparability between the two settings, by providing humans and LLMs with the exact same texts, and constructing LLM prompts that are as close as possible to the instructions that the human participants received; (iii) We collect numerical ratings, while both Hofmann et al. (2024) and Bui et al. (2025) base their analyses on tokens (e.g., adjectives, occupations, outcomes), either through token probabilities or extracted from the generated output.

Spelling variation in sociolinguistics and NLP

Although sociolinguistics has a long tradition in the study of socially meaningful language variation, it has traditionally focused on spoken language and, particularly, on sociophonetic variation. Nevertheless, there is a body of work that has documented how spelling variation can be used — just like other types of linguistic variation — for social meaning making (e.g. Androutsopoulos 2000; Hinrichs and White-Sustaíta 2011; Jaffe et al. 2012; Hilte, Vandekerckhove, and Daelemans 2019). Most of this work is situated in the field of computer mediated communication (CMC) and looks at spelling variation in digital writing (for work in other traditions see for instance Honeybone and Watson (2013) and Ganuza and Rydell (2024) on orthographic variation in literary work to represent speech, and Vosters et al. (2012) for a historical sociolinguistic account of spelling variation).

Turning to the CMC tradition, findings are quite diverse, yet overlapping for various languages (e.g. German, Androutsopoulos 2000; Dutch, Hilte, Vandekerckhove, and Daelemans 2019; English, Eisenstein 2015). First, papers focus on a variety of patterns of spelling variation, some attempting to arrive at typologies of common spelling phenomena in digital writing (e.g. Verheijen 2018). Often distinctions are made between orthographic practices that reflect variation in spoken language (e.g., g-dropping as in workin instead of working) and variation that does not have a spoken counterpart and strictly relates to the orthographic level (e.g., substituting letters with numbers as in 2night instead of tonight) with many authors emphasizing the complex relationship between orthographic variation and spoken variation (e.g., Darics 2013; Eisenstein 2015; Fuchs et al. 2019; Verheijen 2018). Second, linguists have been documenting the social variation linked to non-conventional spelling variants. Studies report non-conventional spellings to be associated with younger writers (e.g., Androutsopoulos 2000; Hilte, Vandekerckhove, and Daelemans 2019; Russ 2012; Verheijen 2018; Wong 2013), non-conformism (Androutsopoulos, 2000; Wong, 2013), informality and relaxed writing styles (Darics 2013; Eisenstein 2015; Russ 2012; Leigh 2018; Verheijen 2018; Wong 2013), gender (Hilte, Vandekerckhove, and Daelemans, 2019; Hinrichs and White-Sustaíta, 2011; Leigh, 2018), edginess (Wong, 2013) and perceived lower intelligence (Russ, 2012; Leigh, 2018).

Contrary to the sociolinguistic focus on the social meaning potential of spelling variation, NLP has often treated spelling variation as a problem that needs solving (Bamman, Eisenstein, and Schnoebelen, 2014), for example, in text normalization tasks (van der Goot et al., 2021; Lourentzou, Manghnani, and Zhai, 2019). Spelling variation also plays a role in various NLP applications involving the social variables mentioned above. In particular, there is a large body of work on formality in NLP, ranging from style transfer (Rao and Tetreault, 2018; Jin et al., 2022) and analyzing the formality of generated texts (Ersoy et al., 2023) to text classification (Pavlick and Tetreault, 2016; Kang and Hovy, 2021). Although some studies aim to create datasets with pairs of texts that are semantically equivalent but vary in formality (e.g., Rao and Tetreault 2018), in practice, often various changes are introduced, resulting in pairs that are not fully semantically equivalent. As a consequence, prior work has not isolated specific types of spelling variation and how they shape formality judgments.

2.2 Comparing LLMs and humans

Many studies, from both NLP and other fields, have investigated how LLMs process language in comparison to humans, including in terms of analogical reasoning (Sourati et al., 2024), judgements of grammatical agreement (Zacharopoulos, Desbordes, and Sablé-Meyer, 2023), and pragmatic language understanding (Hu et al., 2023). For example, researchers within psycholinguistics have compared word ratings obtained from LLMs with those of humans. Martínez et al. (2024) found that GPT-4o’s ratings of concreteness for multi-word expressions correlated well with human ratings; however, its scoring distributions differed clearly, with modes concentrated around the integer values of the Likert scale. Furthermore, Conde et al. (2025) compared human word ratings on different word norms and found mixed results; alignment was lowest on norms related to sensory experiences.

Researchers have also compared LLMs to humans in terms of the opinions and values they express, with mixed results (Santurkar et al., 2023; Argyle et al., 2023). The alignment between LLMs and humans varies across sociodemographic groups (Durmus et al., 2024; Lutz et al., 2025). To improve alignment, prompts designed to encourage LLMs to simulate certain personas have been used (e.g., Aher, Arriaga, and Kalai 2023; Beck et al. 2024). However, the specific prompting strategy (e.g., how the demographic groups are primed) matters and can influence the alignment with humans and the strength of stereotypical patterns in the generated output (Lutz et al., 2025).

Taken together, LLM output tends to broadly correlate well with human judgements (e.g., ratings). However, when more fine-grained analyses are performed, e.g., by zooming in on certain sociodemographic groups or when comparing the shape of rating distributions, results have been more mixed. To our knowledge, controlled experiments based on parallel texts (i.e., texts with equivalent meanings) that compare whether LLMs’ and humans’ social perceptions of language variation align remain unexplored, apart from Hofmann et al. (2024) and Bui et al. (2025). However, as discussed, they only compare their LLM results with findings based on human data in an indirect way.

2.3 Summary

Our work distinguishes itself from prior work in the following aspects: (1) We focus on widely attested spelling variation in English in online discourse, rather than specific language varieties; (2) In contrast to most NLP research on language variation, which focuses on production differences, we focus on social perception; (3) We use a highly controlled rating task, resembling how sociolinguists measure social meaning with human participants; (4) We compare human ratings with LLM ratings, using data and an experimental setup that aims to maximize comparability.

3 Overall Methodology: The Speaker Evaluation Paradigm in Sociolinguistics

In this study, we follow the strategy of previous work comparing LLMs and humans: we start with traditional methodology developed for human participants and then develop an LLM-suited alternative to match that methodology as closely as possible (cf. Duncan 2024; Hofmann et al. 2024; Bai et al. 2025).

3.1 Background

Sociolinguists have developed a varied toolbox of methods to study the social meaning of language variation, ranging from interview techniques and questionnaires to reaction time-based experiments adapted from social psychology (Garrett, 2010; Kircher and Zipp, 2022).

One of the most frequently used methods is the speaker evaluation paradigm, which aims to uncover social evaluations of a person using a certain type of language or language feature. In this type of study, participants are presented with a series of recordings and subsequently asked to rate the speaker on a set of social attributes (e.g. intelligence, friendliness, social attractiveness; Loureiro-Rodríguez and Acar 2022). The recorded speech samples are carefully controlled for content and linguistic characteristics so that the researcher can assume that different ratings of the speaker stem from differences in the evaluation of the language they use.

Traditionally, speaker evaluation studies have aimed to keep the participant unaware of the research aim with the goal to measure social meanings indirectly to avoid socially desirable reactions and access more privately held attitudes. More recently, however, variants of the method were introduced that do not hide from participants that the research focuses on their evaluations of different types of language use. Such studies are usually referred to as open guise studies (e.g., Soukup 2013). Which option is used depends on a variety of factors, ranging from the nature of the research question (e.g., is the researcher interested in more direct evaluations) to aspects of the design of the study (e.g., is it an option to include fillers or use a between-subject design to obscure the research purpose). Although the speaker evaluation paradigm was originally developed to study variation in spoken language (Lambert et al., 1960) and it is still predominantly used to that end, adaptations of the method exist that use written stimuli (Anderson and Toribio, 2007; Hilte, Vandekerckhove, and Daelemans, 2019; Holliday and Tano, 2021; Buchstaller, 2006).

3.2 This Study

In this study, we build on the speaker evaluation paradigm to measure the social meaning of written language.

We use an open guise setup, due to our concern regarding validity of the measurement, especially for the human participants: some spelling variation is quite subtle and could be overlooked. We want to make sure participants base their evaluations on the variation under investigation, so to that end we opt for high experimental control, thereby prioritizing construct validity over ecological validity. In our human study, the instructions at the beginning of the study make the participants aware of our interest in spelling variation. Furthermore, in each tweet presented to the human participants, we underline the words of interest. With LLMs, the prompt also makes our interest in spelling variation explicit, and the words of interest are surrounded by tags (see §5). This differs from Hofmann et al. (2024) who — inspired by the matched guise tradition — elicited social perceptions from LLMs without explicitly highlighting the variation under study in the prompt. We view these two approaches as complementary, and a direct comparison between open guise and matched guise for the study of spelling variation (with both humans and LLMs) would be an interesting avenue to pursue in future work. Although we prioritize construct validity in our design, we nonetheless attend to ecological validity, specifically in the way we select and visualize our stimuli (§4.1.1).

To mitigate additional factors that could influence the ratings such as the content of the tweet, pairs of tweets — the conventional and non-conventional versions of the tweet — are presented alongside one another. Assigning ratings to texts in isolation is highly subjective, and can make the scores between texts less comparable (Sterner and Teufel, 2025). This design is similar to the use of minimal pairs in NLP research, which has been frequently used to analyze models, for example to study grammatical acceptability (Warstadt et al., 2020) and code-switching acceptability (Sterner and Teufel, 2025). A recent study, concurrent with ours, has also used minimal pairs in prompts to study patterns of social perceptions of language variation in LLMs, focusing on German dialects (Bui et al., 2025).

Both humans and LLMs are asked about their social perceptions of the two tweets in each pair, by rating them on three key social attributes: formality, carefulness and age. We discuss the human data collection in §4 and then the LLM results and comparisons against human data in §5.

4 Human Perception Data

Although the social meaning of spelling variation has been studied in sociolinguistics (e.g., Sebba 2007), a dataset with systematic human ratings that fits our research goal is not available. We therefore collect perception ratings by human participants, who rate tweets with different types of spelling variation (see Table 2). We focus on Twitter (now X) because it is broadly familiar to the general public and has been studied extensively in both NLP and sociolinguistics (Grondelaers et al., 2023; Ilbury, 2020; Ilbury, Grieve, and Hall, 2024; Nguyen et al., 2016).

4.1 Data Collection

We first describe the tweet dataset (§4.1.1) and then the collection of the human ratings (§4.1.2).

4.1.1 Stimuli: Creation of the tweet dataset

Selection of the types of spelling variation

We first identified common types of spelling variation in social media based on the literature (e.g., Choudhury et al. 2007; Contractor, Faruquie, and Subramaniam 2010; Cook and Stevenson 2009; Liu et al. 2011; Pennell and Liu 2011; Shortis 2016; Tagg 2009; van der Goot, van Noord, and van Noord 2018; Yang and Eisenstein 2013). We grouped them into three high-level categories, building on classifications by Tagg (2009), Shortis (2016) and van der Goot, van Noord, and van Noord (2018).

The first category contains spelling variation that reflects variation in spoken language. For instance, ‘working’ vs. ‘workin’ likely represents the velar vs. alveolar pronunciation of the word-final consonant also present in spoken language. The second category contains spelling variation that has no spoken counterpart and that is likely intentional (Busch, 2021), such as number substitution, e.g., ‘wait’ vs. ‘w8’. The third category contains spelling variation that has no spoken counterpart and that is likely accidental, e.g., misspellings (van der Goot, van Noord, and van Noord, 2018). However, this distinction should be approached with caution, as it is impossible to establish a writer’s intention from only text.

For each high-level category, we include two to three common types of spelling variation from the literature, see Table 2. For example, spelling variation reflecting pronunciation differences was represented by (i) g-dropping (e.g., doing vs. doin) and (ii) lengthening (e.g., know vs. knoooow).¹

¹ Note that lengthening occurs twice in Table 2 as we distinguish between lengthening that likely mirrors spoken lengthening of sounds, and lengthening that is purely orthographic, where the nature of the sound represented by the repeated character does not lend itself to a prolonged pronunciation.

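High-level category	Variation type	Example (conventional / non-conventional)
Spelling variation reflecting variation in pronunciation	G-dropping	We’re working on it / We’re workin on it
Spelling variation reflecting variation in pronunciation	Lengthening	This was a bad idea / This was a baaaad idea
Spelling variation with no pronunciation counterpart (‘intentional’)	Lengthening	It’s so much fun / It’s so much funnnn
Spelling variation with no pronunciation counterpart (‘intentional’)	Vowel omission	I’ll call you after this weekend / I’ll call you after this wknd
Spelling variation with no pronunciation counterpart (‘intentional’)	Number substitution	I work late tomorrow / I work l8 tomorrow
Spelling variation with no pronunciation counterpart (‘non-intentional’)	Letter swap	It will certainly help / It will certianly help
Spelling variation with no pronunciation counterpart (‘non-intentional’)	Keyboard substitution	I should hope so / I shpuld hope so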

Table 2: The different types of spelling variation in our data. For each type, we include pairs of tweets with the conventional (i.e. standard) and unconventional spelling.

Lexical item selection

Each spelling variation type is exemplified by five words (e.g., working, doing, going, getting, talking for g-dropping). We selected words for which the specific variation type occurs in a Twitter corpus from the London area (May 2018–April 2019) to ensure realistic stimuli.

Tweet selection

For each selected word we included two different tweets. This was done to control for the specific contents of the tweets. Each tweet further appears in two versions: one with the conventional spelling (e.g., Just doing my job) and one with the non-conventional spelling (e.g., Just doin my job). We used a London Twitter corpus for inspiration to construct realistic stimuli by searching for common n-grams with the spelling variants. We manually selected tweets based on two criteria: (1) the tweets are short and should not contain other words that could potentially show the same type of spelling variation; (2) the tweets carry a general and neutral meaning, to minimize other factors that could influence the social perception of the tweet and its writer. We manually edited the tweets, ensuring uniform capitalization and punctuation, removing user mentions and links, and shortening the tweet. For a given tweet (e.g., with a non-conventional spelling), we manually created its counterpart (e.g., with the conventional spelling). In total, we have 70 tweet pairs consisting of a conventional and non-conventional spelling variant (Table 11, Appendix). To provide an authentic context, the texts were visually presented in the form of tweets in the rating experiment.
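The total of 70 pairs follows from the design: seven variation types, five words per type, and two tweets per word. The sketch below enumerates that grid to make the arithmetic explicit. The type labels follow Table 2, while the slot indices and the function name are hypothetical placeholders rather than the actual stimuli.

```python
from itertools import product

# Variation types as listed in Table 2 (lengthening appears in two categories).
VARIATION_TYPES = [
    "g-dropping",
    "lengthening (pronunciation)",
    "lengthening (orthographic)",
    "vowel omission",
    "number substitution",
    "letter swap",
    "keyboard substitution",
]
WORDS_PER_TYPE = 5   # five lexical items per variation type
TWEETS_PER_WORD = 2  # two different tweets per lexical item


def build_design():
    """Enumerate one entry per tweet pair (conventional + non-conventional)."""
    return [
        {"type": vtype, "word_slot": word, "tweet_slot": tweet}
        for vtype, word, tweet in product(
            VARIATION_TYPES, range(WORDS_PER_TYPE), range(TWEETS_PER_WORD)
        )
    ]


design = build_design()
print(len(design))  # 7 types x 5 words x 2 tweets = 70 tweet pairs
```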

4.1.2 Collection of the human perception data

Instrumentation and design

For every tweet pair, we collect data on three social attributes: (in)formality, care(ful/less)ness and perceived writer age. These attributes were chosen based on related work complemented with a pilot study, since sociolinguistic literature on the social meaning of spelling variation is still somewhat sparse. Our pilot took the form of a free response experiment (cf. Grondelaers et al. 2020 and Garrett, Williams, and Evans 2005), in which participants (N = 493) gave the first words that came to mind when presented with tweets containing our types of spelling variation. Formality (e.g. “informal”, “relaxed”, “unprofessional” vs. “formal”, “professional”), age (e.g. “younger”, “teenager” vs. “older”), and carefulness (e.g. “careless”, “sloppy”, “lazy”, “hurried” vs. “precise”, “proper”, “focused”) emerged as prominent social attributes. Based on the pilot and previous sociolinguistic work, (users of) non-conventional spellings are hypothesized to be perceived as more informal, less careful and more youthful (Darics, 2013; Eisenstein, 2015; Leigh, 2018; Russ, 2012; Hilte, Vandekerckhove, and Daelemans, 2019).

For the first two attributes, participants were presented with a visual analog scale (VAS), which is used in various fields (e.g., psychology, medicine) to report feelings and emotions. A VAS allows us to capture subtle variance in attitudes. VAS are usually operationalized as 100-point scales for the purpose of quantitative analyses. See Llamas and Watt (2014) for the use of VAS in sociolinguistics. Concretely, we collect perceptions using a 100-point scale (informal vs. formal and careful vs. careless) with a slider that participants can drag to the desired position (see Figure 4, Appendix). Finally, participants were asked for an age estimate of a text’s writer which they could type into a text box.

At the start, participants saw our instructions, which also made our interest in spelling variation explicit (see Appendix, Figure 3). Participants were then presented with one randomly selected tweet pair for each of the seven spelling variation types (Table 2) for each of the three social attributes, resulting in a total of 21 pairs. The tweets were presented in blocks by social attribute. Following the rating tasks, the participants answered two open questions about their views on spelling variation on Twitter and several general questions about their sociodemographic background (i.e. age, gender, region) and their experience with Twitter.

Procedure: crowdsourcing

We collected the data in September 2023 using Prolific, a crowdsourcing platform. As inclusion criteria for the participants we used Country of Birth UK, First Language English and no language-related disorders. We recruited 230 participants, who were paid £1.35 to perform the task. The median time spent was 8:54 min, resulting in an average pay of £9.10/hr. We performed quality control on the data, removing 13 users in total; see Appendix 10 for the exclusion criteria. Our final dataset contains 9,114 ratings by 217 participants. The average number of ratings per item (i.e. an individual tweet evaluated on a specific social attribute) is 21.7; this number is higher than in most NLP studies, as we follow standards from sociolinguistics. Most participants (67%) had a Twitter account. Further, 59% identified as female, 40% as male and 1% as other.
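The reported counts and hourly pay follow directly from the design. The sketch below reproduces the arithmetic from figures stated in this section; the variable names are ours, and it assumes that every retained participant contributed a complete set of ratings.

```python
# Figures taken from the text; variable names are illustrative only.
recruited, excluded = 230, 13
types, attributes, versions = 7, 3, 2  # spelling types, social attributes, tweet versions per pair
tweet_pairs = 70

participants = recruited - excluded
ratings_per_participant = types * attributes * versions  # 21 pairs, two tweets each
total_ratings = participants * ratings_per_participant

# One item = one individual tweet rated on one social attribute.
items = tweet_pairs * versions * attributes
mean_ratings_per_item = total_ratings / items

median_minutes = 8 + 54 / 60             # median completion time of 8:54 min
hourly_pay = 1.35 / median_minutes * 60  # GBP per hour at the median completion time

print(participants, total_ratings, round(mean_ratings_per_item, 1), round(hourly_pay, 2))
```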

4.2 Analysis of the human perception data

We now analyze the human perception data. Example formality ratings are in Table 3.

Main observations

The main trends align with our expectations (§4.1.2). Tweets with the conventional spelling variants are rated as more formal (Mean = 66.2, SD = 5.5) compared to the ones with the unconventional spellings (Mean = 22.0, SD = 8.8). Tweets written with the conventional spelling are also rated as more careful (Mean = 74.8, SD = 3.8 vs. Mean = 32.8, SD = 9.4) and perceived as written by older authors (Mean = 33.3 years, SD = 2.7 vs. Mean = 22.4 years, SD = 3.0). All differences are significant (p < 0.001), using the paired Wilcoxon signed-rank test. Furthermore, the tweet content also matters (e.g., compare the ratings for See you tomorrow vs. Oh how cool).

Tweet	Avg. rating	Std dev.
Oh how cool	50.4	20.0
Oh how cooool	15.4	12.8
See you tomorrow	71.2	15.4
See you 2morrow	10.7	11.1
It will certainly help	75.8	17.3
It will certianly help	34.8	21.4

Table 3: Tweets and their formality ratings by the Prolific participants (0=very informal, 100=very formal).

Variability of the ratings

To analyze the agreement among individual participants, we calculate the Spearman correlation between each participant’s ratings and the average ratings of the others, and then average over all participants: 0.787 (formality), 0.769 (carefulness) and 0.667 (age). We also consider the difference scores between ratings for the conventional and non-conventional forms (similar to sociolinguistic studies like Zenner, Rosseel, and Speelman 2021). We then obtain lower correlations: 0.394 (formality), 0.416 (carefulness) and 0.253 (age). [Footnote 2: In sociolinguistic studies, human agreement is usually not reported, as the task is considered inherently subjective. We report agreement metrics to facilitate the interpretation of the agreement numbers between humans and LLMs.]
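The leave-one-out agreement measure described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function names and the toy ratings are hypothetical, and Spearman's correlation is implemented by hand (Pearson correlation on average ranks) to keep the example dependency-free.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def leave_one_out_agreement(ratings):
    """ratings maps participant -> {item: rating}; participants may rate different items."""
    scores = []
    for person, own in ratings.items():
        xs, ys = [], []
        for item, value in own.items():
            others = [r[item] for p, r in ratings.items() if p != person and item in r]
            if others:  # skip items that nobody else rated
                xs.append(value)
                ys.append(sum(others) / len(others))
        scores.append(spearman(xs, ys))
    return sum(scores) / len(scores)


# Hypothetical toy ratings (not the study's data): three raters, three items.
toy = {
    "p1": {"a": 70, "b": 20, "c": 50},
    "p2": {"a": 80, "b": 10, "c": 40},
    "p3": {"a": 60, "b": 30, "c": 55},
}
agreement = leave_one_out_agreement(toy)
print(round(agreement, 3))
```

For the difference scores, the same function would be applied to per-pair differences (conventional minus non-conventional) instead of raw ratings.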

5 LLM Experiments

5.1 Models

We experiment with 12 models, including both open-weight and closed models with different model sizes. We only consider post-trained models, since pilot experiments with base models yielded poor performance (e.g., the models not returning ratings). We include three closed models from OpenAI: GPT-4 (gpt-4-0613), GPT-4o (gpt-4o-2024-08-06) and GPT-5 (gpt-5-2025-08-07) (OpenAI team, 2023). We also include one closed model from Anthropic: Claude-4.5-sonnet. [Footnote 3: System card: https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf] The remaining models are open-weight models. We include different model sizes of the Llama family: Llama 3 (8b and 70b) and Llama 3.1 (405b), all instruct versions (Grattafiori et al., 2024), and the more recent Llama4 Maverick (17b). [Footnote 4: https://ai.meta.com/blog/llama-4-multimodal-intelligence/] We also include Gemma 2 (2b and 9b, instruct) (Gemma Team and others, 2024). [Footnote 5: We also experimented with gemma2-27b-it, but surprisingly this model gave poor results, as it often did not respond correctly to the prompt; see Appendix A.] Finally, we include Qwen 3 (qwen3-235b-a22b-instruct-2507) and DeepSeek-V3.1 (DeepSeek-AI, 2025).

Temperature

For all models, we set the temperature to the commonly used value of 1, making the runs non-deterministic. [Footnote 6: Except GPT-5, since its API does not allow setting a temperature. For GPT-5 we set both verbosity and reasoning_effort to ‘low’. We were also not able to set the temperature for Claude-4.5-sonnet.] We use a non-zero temperature because humans also tend to vary their responses, and likely would not produce identical ratings if they were asked to repeat the task multiple times. Appendix A reports results with the temperature set to 0, which are similar to those with the temperature set to 1.

5.2 Prompting Approach

Prompt design

Our overall aim is to adhere closely to the instructions provided to human participants. We therefore do not tune the prompts. We make four modifications to make the instructions suitable for LLMs. First, to always elicit ratings, even when the models are unsure, we added “You must always respond in the format we ask. Even if you are unsure.”. Second, while human participants provided formality and carefulness ratings using a slider, LLMs are tasked with providing a numerical response ranging from 0 to 100 to align with the slider’s range. Third, while the crowd workers were presented with tweets in which the linguistic variants of interest were underlined, we surrounded the specific variant with tags (<W> and </W>). Fourth, participants provided their ratings without an explanation. We therefore ask the LLMs to also not provide an explanation by adding “without explaining your reasoning”. Because LLMs can be sensitive to the exact wording of a prompt (Sclar et al., 2024; Shu et al., 2024), we have four variants for each prompt, with small changes in terms of sentence length, phrasing and new lines.

Independent versus paired prompt setup

We experiment with two different setups; they differ in the extent to which they resemble the experimental conditions with human participants. (i) Independent, where each prompt requests a rating for one tweet. The tweet contains either a non-conventional or a conventional spelling. Thus, each tweet is rated entirely independently of the others. (ii) Paired. In the human experiments, participants were presented with pairs of tweets on each page (one version with the conventional spelling, one with the unconventional spelling, see Fig 4 in the Appendix). To simulate this setting, each prompt contains the two tweet versions. We balance the presentation order (e.g., conventional/unconventional, unconventional/conventional). The system prompts and examples of user prompts for both the independent setup and the paired setup can be found in Appendix D.
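To make the tagging and the order balancing concrete, the following sketch marks the variant of interest with <W> tags and produces both presentation orders for a tweet pair. The helper names and the printed layout are hypothetical; the actual prompt wording is the one given in Appendix D.

```python
def tag_variant(tweet, variant):
    """Wrap the first occurrence of the variant of interest in <W> tags."""
    if variant not in tweet:
        raise ValueError(f"{variant!r} does not occur in {tweet!r}")
    return tweet.replace(variant, f"<W>{variant}</W>", 1)


def paired_presentations(conventional, unconventional):
    """Return both presentation orders so that order effects can be balanced."""
    return [(conventional, unconventional), (unconventional, conventional)]


conv = tag_variant("Just doing my job", "doing")
unconv = tag_variant("Just doin my job", "doin")

for first, second in paired_presentations(conv, unconv):
    # Hypothetical layout, for illustration only.
    print(f"Tweet A: {first}\nTweet B: {second}\n")
```

In the independent setup, each tagged tweet would instead be sent in its own prompt.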

Prompting frequency

In our human data, each item (representing a rating of a specific social attribute for a tweet with either a conventional or unconventional variant) received between 19 and 24 ratings. To collect the same number of responses as with the human experiment, we prompted the models multiple times. [Footnote 7: For most models, we were able to extract the same number of ratings as we collected with the human data. For some models, we missed a small number of ratings due to extraction errors (e.g., the model not responding in the requested format). The two models for which we were able to extract the fewest ratings were Gemma-2b and GPT-5. With GPT-5, we missed 49 ratings (out of 9114) in the independent setting and 236 (out of 9114) in the paired setting, because the model refused to provide an answer, e.g., Sorry, I can’t help with that. With Gemma-2b, we missed 390 (independent) and 30 (paired) ratings.]
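Extracting a rating from a model response, and treating refusals or malformed answers as missing, could look roughly like the sketch below. The response formats and the function name are assumptions made for illustration; the extraction procedure actually used may differ.

```python
import re

# Matches the first integer or decimal number in a response.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_rating(response, low=0, high=100):
    """Return the first number in the response if it lies in [low, high], else None."""
    match = NUMBER.search(response)
    if match is None:
        return None  # e.g., a refusal that contains no number
    value = float(match.group(0))
    return value if low <= value <= high else None


print(extract_rating("72"))
print(extract_rating("Sorry, I can’t help with that."))
```

Responses for which the function returns None would be counted as missed ratings.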

5.3 Results

		GPT models
Prompt	Setting	GPT-4o	GPT-4	GPT-5
indep.	raw	0.666	0.775	0.784
indep.	diff	-0.066	0.134	0.166
paired	raw	0.897	0.885	0.901
paired	diff	0.512	0.462	0.524
		Gemma models
Prompt	Setting	Gemma2-2b	Gemma2-9b
indep.	raw	-0.032	0.752
indep.	diff	0.073	0.066
paired	raw	0.340	0.820
paired	diff	0.206	0.211
		Llama models
Prompt	Setting	Llama3-8b	Llama3-70b	Llama3.1-405b	Llama4-maverick
indep.	raw	0.678	0.812	0.721	0.680
indep.	diff	0.041	0.035	0.033	0.086
paired	raw	0.820	0.872	0.860	0.871
paired	diff	0.277	0.480	0.318	0.250
		Other
Prompt	Setting	Qwen3	Claude-4.5-sonnet	DeepSeek-V3.1
indep.	raw	0.801	0.858	0.683
indep.	diff	0.059	0.174	0.102
paired	raw	0.890	0.910	0.889
paired	diff	0.354	0.414	0.579

Table 4: Spearman correlations between the LLM responses and the human responses, averaged across the three attributes (carefulness, formality, age). The correlations are calculated by comparing the raw ratings (raw), and by comparing the ratings differences between the conventional and non-conventional versions (diff). LLMs were prompted by showing each tweet independently (indep.) and in a paired setting (paired).

Table 4 presents the main results. We calculate the Spearman correlation between the raw human ratings and LLM ratings (taking the average of all human/LLM ratings for an individual item) and report the average across the attributes. [Footnote 8: See Appendix A for more details about our choice for Spearman. We also report results with Pearson correlations, and the overall trends are very similar.] We also calculate the correlations between the rating differences for the conventional and non-conventional variants. For example, consider Table 3. For the tweet pair See you tomorrow and See you 2morrow, the rating difference for formality would be 60.5.
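As a worked example of the difference scores, the snippet below computes them for the three tweet pairs in Table 3. The average ratings are copied from that table; the data structure and names are ours.

```python
# Average human formality ratings from Table 3: (conventional, non-conventional).
table3 = {
    "Oh how cool": (50.4, 15.4),
    "See you tomorrow": (71.2, 10.7),
    "It will certainly help": (75.8, 34.8),
}

# Difference score per pair: conventional minus non-conventional rating.
differences = {
    tweet: round(conventional - unconventional, 1)
    for tweet, (conventional, unconventional) in table3.items()
}

for tweet, diff in differences.items():
    print(f"{tweet}: {diff}")
```

The 'diff' rows in Table 4 correlate such per-pair differences between humans and an LLM, rather than the raw item ratings.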

Model comparison

GPT-5 and Claude-4.5-sonnet have the highest correlations with the human ratings; the smaller models (2–9B parameters) have the lowest correlations. Furthermore, interestingly, Llama3.1-405b has lower correlations than Llama3-70b, and the newest Llama model, Llama4-maverick, has lower correlations than other recent models. The task is inherently subjective (consider the variability in human ratings, §4.2), and it remains an open question what level of correlation is desirable for different use cases (e.g., in sociolinguistic research).

Higher correlations are obtained when the prompt more closely matches the human setup

Consistently across all models, there is a substantial increase in correlation when tweets with conventional and non-conventional spellings are presented simultaneously rather than individually (Table 4, compare “independent” vs “paired”). For instance, GPT-5’s correlation increases from 0.784 to 0.901 when correlating the raw ratings, and from 0.166 to 0.524 when correlating the rating differences.

Correlations between rating differences are substantially lower

Correlating raw ratings provides a limited view. Consider a simple baseline model that randomly assigns ratings of 0 or 1 to tweets with a non-conventional spelling and 99 or 100 to those with a conventional spelling. In this scenario, the raw correlations would still appear high—e.g., running this baseline resulted in a Spearman correlation of 0.717 across attributes for a single run. However, the correlations between rating differences drop (as expected) dramatically to an average correlation of -0.059 across attributes. Thus, a model that merely distinguishes between non-conventional and conventional spellings can obtain high correlations when comparing raw ratings. However, high correlations on rating differences require the model to account for the size of the effect the spelling variation has on the perception of the writer, potentially influenced by the specific tweet content and the type of spelling variation (see Table 3 for human rating examples).
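The behaviour of such a baseline can be illustrated with a small simulation. The sketch below is not the run reported above: the "human" ratings are synthetic stand-ins drawn around the formality means from §4.2, each item receives a single baseline rating, and all names are hypothetical. It only shows why the raw correlation stays high while the correlation on differences has no reason to.

```python
import random


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def clip(value):
    """Keep a rating on the 0-100 scale."""
    return max(0.0, min(100.0, value))


rng = random.Random(0)
n_pairs = 70

# Synthetic stand-ins for average human ratings (NOT the study's data).
human_conv = [clip(rng.gauss(66.2, 5.5)) for _ in range(n_pairs)]
human_unconv = [clip(rng.gauss(22.0, 8.8)) for _ in range(n_pairs)]

# Baseline: 99 or 100 for conventional spellings, 0 or 1 for non-conventional ones.
base_conv = [rng.choice([99, 100]) for _ in range(n_pairs)]
base_unconv = [rng.choice([0, 1]) for _ in range(n_pairs)]

raw = spearman(human_conv + human_unconv, base_conv + base_unconv)
diff = spearman(
    [c - u for c, u in zip(human_conv, human_unconv)],
    [c - u for c, u in zip(base_conv, base_unconv)],
)
print(round(raw, 3), round(diff, 3))
```

Because the baseline separates the two spelling conditions perfectly but is random within each condition, only the between-condition contrast contributes to the raw correlation; the difference scores remove that contrast.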

Table 4 shows substantial drops in correlations when we compare the results for the raw ratings (‘raw’) vs. the rating differences (‘diff’). As an example, GPT-5’s correlation with human ratings decreases from 0.901 (raw ratings) to 0.524 (rating differences) in the paired setup. The decrease with Claude-4.5-sonnet is even larger: from 0.910 to 0.414. The lower correlations when comparing the rating differences align with our analysis of human agreement, where we also found that correlations between rating differences were lower than correlations between the raw ratings.

To illustrate why it is more difficult to obtain high correlations when considering rating differences, note that the rating differences between the conventional and non-conventional variants vary across individual tweets and types of spelling variation. As an example, this tweet pair (containing number substitution) had the largest drop in formality ratings by Claude-4.5-sonnet: “Forever grateful” (Claude: 69.8, humans: 73.5) vs. “4ever grateful” (Claude: 20.0, humans: 13.3). In contrast, this tweet pair (containing a letter swap) had the smallest drop in formality ratings by Claude-4.5-sonnet: “That’s exactly how I feel about that” (Claude: 50.0, humans: 68.7) vs. “That’s excatly how I feel about that” (Claude: 47.4, humans: 36.0). Although the rating difference by Claude-4.5-sonnet, like with the human data, is smaller for the second example, the magnitudes differ substantially.

[Figure panel (a): Humans: formality]

LLMs show less rating variability

The LLM ratings and human ratings also differ in terms of variability when evaluating the same item. Whereas each item was rated by multiple humans, the LLMs were prompted multiple times for each item. Overall, the standard deviations of the LLM responses are lower than those of human ratings. For instance, GPT-5’s average standard deviation is 6.7 for formality, 4.0 for carefulness and 1.5 for age; Claude-4.5-sonnet’s average standard deviation is 5.5 for formality, 4.6 for carefulness and 1.6 for age. In contrast, for humans this is 17.4 for formality, 18.2 for carefulness and 8.1 for age. We see a similar pattern with the other LLMs, where the standard deviations are lower compared to humans.

Note that the temperature influences these patterns. We find lower standard deviations with the temperature set to 0 (Table 8, Appendix). Our results align with psychological research that also found less variance in LLM responses (Abdurahman et al., 2024). Note that while we prompt LLMs many times, this is not equivalent to asking different people; instead conceptually it might be more similar to asking the same person multiple times (Abdurahman et al., 2024).

The ratings from humans and LLMs are distributed differently

Figure 1 shows the density plots of ratings by humans and by GPT-5 (paired setup). GPT-5’s formality and carefulness ratings exhibit clear peaks at certain numbers, a pattern not present in the human data. In fact, most ratings on these two attributes are multiples of 5. This difference likely stems from differences in how ratings were collected: humans used a slider, whereas LLMs provided numerical values in their responses. Another factor that might contribute to this pattern is the frequency of certain numbers in the LLMs’ training data (McCoy et al., 2024). Other models show similar trends. For example, Qwen3 and Claude-4.5-sonnet have an even stronger tendency than GPT-5 to return multiples of five and ten, and Llama3.1-405b returns only multiples of ten as ratings for formality and carefulness.

Comparing social attributes

Figure 2 shows a boxplot of the Spearman correlations between the human and LLM ratings for all models in the paired prompt setting, for each social attribute. The correlations are lowest for age, a trend that we also observed when analyzing the agreement among humans (§4.2). The median correlation is highest for carefulness. This is different from our human data, where agreement between participants was highest on the formality ratings, although agreement on the carefulness ratings was high as well.

We also calculate, for each LLM, the correlations between the social attributes. First, for each item, we take the average rating for a given social attribute. Then, we calculate the Spearman correlation between pairs of attributes. One observation that stands out is that formality and carefulness have a Spearman correlation of 0.69 in the human data. In contrast, all LLMs (except Gemma2-2b, with low correlations overall, and Claude-4.5-sonnet, with a correlation of 0.65) have higher correlations between these two attributes, suggesting that they may differentiate less between them.

Rating differences between the types of spelling variation generally match our human data

We now compare the different types of spelling variation and take GPT-5 as an example; see Table 5, which also includes results based on the human ratings.

For formality, the smallest rating drops occur for letter swapping and keyboard substitution. For instance, the formality rating by GPT-5 decreases on average by 33.18 points when the tweet contains a letter swap. This trend aligns with our human data, where these two types also show the smallest drops; and with our intuition, as such spelling variations are more likely to be perceived as typos and thus unintentional. Conversely, these two types exhibit the largest rating drops for carefulness. This trend is also present in our human data and it matches intuition, since typos can be perceived as less careful writing. There are also differences. For example, while these two types stand out with the human guesses for age (exhibiting the smallest drops in ratings), this is not the case with the GPT-5 ratings.

	Humans			GPT-5
Type	Formality	Carefulness	Age	Formality	Carefulness	Age
g-dropping	-47.47	-40.97	-11.73	-42.35	-53.09	-10.45
number subst.	-55.64	-34.36	-12.05	-48.83	-61.79	-12.87
letter swap	-29.72	-50.13	-7.56	-33.18	-72.03	-10.88
keyboard subst.	-36.13	-57.45	-7.62	-31.39	-72.73	-10.46
lengthening (pron)	-46.58	-32.15	-11.91	-37.65	-55.01	-12.69
lengthening (spel)	-45.30	-36.61	-12.94	-40.07	-59.51	-11.74
vowel omission	-48.78	-42.51	-12.56	-35.80	-60.19	-11.70

Table 5: The average drop in ratings by GPT-5 (paired setting) and humans for formality, carefulness and age, when a non-conventional spelling variant is included in the tweet, compared to the conventional (i.e., standard) variant.
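As a consistency check, the human columns of Table 5 can be related to the overall means reported in §4.2. The sketch below assumes that each type contributes the same number of tweet pairs (five words with two tweets each), so that an unweighted mean over the seven types is appropriate; the variable names are ours.

```python
# Human rating drops from Table 5: (formality, carefulness, age).
human_drops = {
    "g-dropping": (-47.47, -40.97, -11.73),
    "number subst.": (-55.64, -34.36, -12.05),
    "letter swap": (-29.72, -50.13, -7.56),
    "keyboard subst.": (-36.13, -57.45, -7.62),
    "lengthening (pron)": (-46.58, -32.15, -11.91),
    "lengthening (spel)": (-45.30, -36.61, -12.94),
    "vowel omission": (-48.78, -42.51, -12.56),
}

# Overall human means from Section 4.2: (conventional, non-conventional).
overall_means = {
    "formality": (66.2, 22.0),
    "carefulness": (74.8, 32.8),
    "age": (33.3, 22.4),
}

attributes = ["formality", "carefulness", "age"]

# Unweighted mean drop over the seven types, per attribute.
mean_drop = {
    attr: sum(row[i] for row in human_drops.values()) / len(human_drops)
    for i, attr in enumerate(attributes)
}
# Drop implied by the overall means (non-conventional minus conventional).
implied_drop = {attr: unconv - conv for attr, (conv, unconv) in overall_means.items()}

for attr in attributes:
    print(attr, round(mean_drop[attr], 2), round(implied_drop[attr], 2))
```

Under that assumption, the two sets of numbers agree up to rounding.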

The ratings of Gemma2-9b-instruct align less well; it also had a lower Spearman correlation on the rating differences. See Table 10 in the Appendix. For example, we do not observe a clear difference between the formality and carefulness ratings when comparing the (likely) intentional and non-intentional types of spelling variation.

5.4 Robustness Checks

We perform two additional experiments to rule out that the LLMs’ ratings simply reflect unfamiliarity with the non-conventional spelling variants.

Can LLMs provide the conventional spelling for the spelling variants?

We verify that the LLMs can correctly associate each spelling variant with its conventional (i.e., standard) English spelling. We prompt each LLM with “Return the standard English spelling of: [spelling variant]. Only respond with the correct answer.” for all spelling variants. All models perform well on this task. Most of the models have a 100% accuracy; the only model with a lower than 90% accuracy is Gemma2-2b-it. However, even in this case, some of the responses seem reasonable (e.g., returning wild for wld instead of would), especially considering that the spelling variants are presented in isolation without further context.

Pseudowords instead of the non-conventional spelling variants

We also experiment with the same prompts as in our main experiments; however, we replace the non-conventional spelling with another word. This experiment is inspired by Hofmann et al. (2024) and Bui et al. (2025). They randomly replace, delete and insert characters and words, with new words drawn from common English/German words. In contrast, we only change the non-conventional spelling variant, since this is the only difference between the two tweet versions. We do not apply random character-level changes, since the resulting forms could be similar to some of our spelling variants (e.g., typos). Instead, we replace our non-conventional spelling variants with pseudowords generated using Wuggy (Keuleers and Brysbaert, 2010). For each conventional spelling, we generate ten pseudowords. Then for each prompt, we randomly sample one of these pseudowords. For instance, instead of using doin (g-dropping), we use generated pseudowords for doing, which include biing, miing and baing. We use the paired prompt setup and experiment with the two models with the highest correlations (Claude-4.5-sonnet and GPT-5) and two smaller models (Llama3-8b and Gemma2-9b). The results are in Table 6.
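The substitution step can be sketched as follows. The pseudowords are assumed to have been generated beforehand (for instance with Wuggy) and stored in a lookup table; only the three examples named above are included here, and the function name and the token-based replacement are our own simplifications.

```python
import random

# Pseudowords per conventional spelling, assumed to be pre-generated.
# The study generates ten per word; only the three listed in the text are shown.
PSEUDOWORDS = {"doing": ["biing", "miing", "baing"]}


def pseudoword_version(tweet, variant, conventional, rng):
    """Replace the non-conventional variant in a tweet with a sampled pseudoword."""
    tokens = tweet.split()
    if variant not in tokens:
        raise ValueError(f"{variant!r} is not a token of {tweet!r}")
    pseudo = rng.choice(PSEUDOWORDS[conventional])
    return " ".join(pseudo if token == variant else token for token in tokens)


rng = random.Random(0)
print(pseudoword_version("Just doin my job", "doin", "doing", rng))
```

The conventional version of the tweet is left unchanged, so the pseudoword remains the only difference between the two versions.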

Prompt	Setting	Claude-4.5-sonnet	GPT-5	Llama3-8b	Gemma2-9b
paired	raw	0.782	0.828	0.765	0.805
paired	diff	0.035	-0.012	0.059	0.027

Table 6: Results for the pseudoword experiment: Spearman correlations between the LLM responses and the human responses, averaged across the three attributes.

Across all settings and models, the obtained correlations with the human ratings are—as expected—lower than those in our main experiment. Comparing raw ratings still leads to high correlations (see values in the ‘raw’ row; e.g., Claude-4.5-sonnet obtains a correlation of 0.782, compared to 0.910 in the main experiment), but the correlations disappear when comparing the rating differences (‘diff’). This is expected: when correlating the raw ratings, a model that just distinguishes between conventional spelling variants and other forms can do well.

Taken together, these robustness checks show that 1) LLMs have clear associations between the non-conventional and conventional spellings; and 2) the correlations with human ratings cannot be explained by deviations from the conventional forms alone.

6 Limitations

In this section, we discuss the limitations of our study, grouped into four aspects: scope (§6.1), experimental setup (§6.2), understanding the processes by which social meanings are learned and shaped by context (§6.3), and implications (§6.4).

6.1 Scope

We focused on seven common types of English spelling variation and three social attributes. Future work could explore other types of spelling variation, such as consonant substitution, e.g. ‘r’ for ‘are’, and omission of the first syllable, e.g., ‘bout’ for ‘about’ (Tagg, 2009). Future work could also explore other types of language variation, such as syntactic (e.g., Grondelaers et al. 2023) and lexical variation (e.g., Bamman, Eisenstein, and Schnoebelen 2014).

We had to limit the number of social attributes in our study. Even only three social attributes involved collecting over 9k ratings, due to the controlled data collection with human participants to ensure we met sociolinguistic standards. Future work could consider other social attributes (e.g., gender, education level, social class, ethnicity or whether the writer is perceived as friendly or excited, depending also on the type of variation studied). Analyzing these associations in LLMs could add an important perspective on the study of bias in NLP (Blodgett et al., 2020; Hofmann et al., 2024).

Finally, future studies should consider other languages to test the generalizability of our findings; since English is well represented in training data for LLMs, agreement between LLMs and humans may be lower for other languages.

6.2 Experiment setup

First, we collected data from crowdworkers. Our data is not a fully representative sample of the UK population; future studies could explore whether similar results will be obtained when a different data collection method is used.

Second, we aimed to mimic the human experiment as closely as possible when prompting LLMs. Both human participants and LLMs were given explicit cues to highlight the target words, but the format of these cues differed slightly: humans saw underlined words, whereas we used <W> tags when prompting LLMs. Although we expect this difference to have only a minimal impact, future research could investigate how different types of cues influence ratings. Furthermore, LLMs were more likely to respond with ratings that are multiples of 5 and 10 (§5.3). One possible explanation is the difference in input format: humans used a slider, but LLMs provided a numeric rating directly. To understand to what extent this explains the differences in rating distributions, an experiment where humans also provide numeric ratings instead of using a slider could be performed.

Third, we used a direct measurement approach by explicitly highlighting the spelling variants for both the LLMs and human participants. This allowed us to control more precisely what the ratings were based on. However, future work could explore a similar setup without drawing attention to the spelling variants, for instance by not showing the tweet versions side by side and/or not highlighting the variants of interest. Such an approach may better reflect real-world scenarios, where LLMs encounter unmarked spelling variants in texts.

6.3 Understanding how social meanings are learned and influence of context

Future research could investigate why LLMs provide these ratings. For instance, examining the internal mechanisms of LLMs could shed light on how associations between social attributes and linguistic variation are encoded. Researchers could also explore whether certain factors (e.g., frequency distributions, saliency, etc.) influence the acquisition of social meaning in LLMs in the same way as with humans (e.g. Samara et al. 2017 and Rácz, Hay, and Pierrehumbert 2017).

Furthermore, language variants can index multiple social meanings (cf. -ing example in the introduction). The full social meaning potential, i.e. all social meanings a linguistic form can potentially express, is referred to as its indexical field (Eckert, 2008). Which of these potential meanings is activated depends on the specific context of an utterance (Campbell-Kibler, 2014). Future research could thus further compare the breadth of the indexical fields of linguistic variants for LLMs and humans, and whether the same contextual factors activate specific social meanings.

Such investigations could also take into account the social characteristics of the hearer/reader. In our study, participants had a variety of social profiles, but we did not ask the LLMs to consider social information of the rater in their ratings. Perhaps if we had instructed the models to assume a similar age distribution to our human sample or certain personas (Argyle et al., 2023), the correlation between the human and model ratings would improve.

6.4 Implications

We did not explore the implications of our findings for the practical development of NLG systems. In some scenarios, it might be desirable for LLMs to mimic human associations; in other scenarios, perhaps not. For example, it may not be desirable to evaluate certain spelling variants that deviate from the norm as less careful than conventional spelling variants, even if humans might also view such spelling variants as less careful. Generally, it is an open question how LLMs should treat language variants that are (sometimes) evaluated negatively by humans. Our work is also relevant to the broader discussion around whether LLMs should adapt their response to different social groups, or whether it is more desirable to exhibit identical behavior (Lucy et al., 2024).

7 Discussion and Conclusion

When humans read texts, they form social perceptions about the author and the context in which the text was written. We focused on the social meaning of spelling variation in English along three social attributes (formality, carefulness, age). Returning to our RQ, we found notable differences in how the LLMs rated tweets with non-conventional spelling variants compared to those with a conventional spelling. LLM responses generally correlated strongly with human ratings, though differences were visible in rating distributions, across specific spelling variation types and across models.

Our methodology is more widely applicable and demonstrates how sociolinguistic methodology can be used to analyze what associations LLMs have with language variation. Future work should investigate whether these correlations vary across demographic groups. Our study also illustrates the potential of LLMs for sociolinguistic research. Although our study is only a first step, we see potential for sociolinguists to use LLMs for hypothesis generation or pre-testing of experiments.

More broadly, our study raises questions about whether LLMs should reflect human perception of language variation. While this may be beneficial for applications like sociolinguistic research, there are also scenarios (e.g., specific practical applications) where this might not be desirable.

Ethical Considerations

The Prolific participants provided explicit consent, indicating that they agreed to their data being used for research and their anonymized data being shared with the scientific community. The data collection and experiments have been approved by the Science-Geo Ethics Review Board of Utrecht University (Bèta S-20434).

One potential risk is that our research, which generally finds strong correlations between human and LLM responses, could be misinterpreted as a justification for replacing human participants with LLMs in (socio)linguistic studies. However, we emphasize that we view LLM data as a potentially interesting new resource for sociolinguistics. Furthermore, understanding the social meanings that LLMs associate with linguistic variation is important, given their increasing role in society. Nevertheless, sociolinguistic patterns identified through LLM data should be further validated using experiments with real human participants.

Acknowledgements.

We thank members of the NLP & Society lab, the broader NLP group at Utrecht University, and the DiLCo (Digital Language Variation in Context) network for feedback at different stages of the study. Dong Nguyen is funded by the Veni research programme with project number VI.Veni.192.130, which is (partly) financed by the Dutch Research Council (NWO).

\appendixsection

Additional experimental details and results

Prompting the LLMs

We used replicate.com to prompt the models.

Effect of the temperature

In the main text, we report results using a temperature of 1. Additionally, we performed experiments to investigate the impact of the temperature. We set the temperature to 0, producing deterministic outputs, and tested four LLMs in the paired prompt setting (Table 7). The Spearman correlations between the ratings obtained with the two temperatures (0 and 1) are high for all tested models: llama3-8b-instruct (0.832), llama3-70b-instruct (0.947), DeepSeek-V3.1 (0.879) and GPT-4o (0.955). When correlating the ratings with the human judgments, we find that all correlations are slightly lower than in the runs with the temperature set to 1 (cf. Table 4).

Setting	GPT-4o	Llama3-8b	Llama3-70b	DeepSeek-V3.1
raw	0.892	0.805	0.860	0.865
diff	0.459	0.197	0.462	0.440

Table 7: Results with temperature set to 0. Spearman correlations between the LLM responses and the human responses, averaged across the three attributes (carefulness, formality, age). The correlations are calculated by comparing the raw ratings (raw), and by comparing the ratings differences between the conventional and non-conventional versions (diff). LLMs were prompted in the paired prompt setting.

In Table 8 we report the standard deviation of the responses with temperatures set to 0 and 1. As expected, the standard deviations are higher with a higher temperature across the models.

	GPT-4o		Llama3-8b		Llama3-70b		DeepSeek-V3.1
Temperature	0	1	0	1	0	1	0	1
Formality	5.8	9.3	11.8	16.9	5.7	7.3	7.9	10.5
Carefulness	3.4	6.2	10.8	14.5	2.8	4.7	6.0	8.9
Age	1.9	2.4	4.1	5.8	1.6	2.3	2.3	3.4

Table 8: Standard deviations of the responses with two temperature settings (0 and 1). All runs used the paired prompt setup.

Spearman vs. Pearson correlation

We use the Spearman correlation, as it tests for a monotonic relationship and assumes neither linearity nor normally distributed data. Note that rescaling of the ratings would not affect the Spearman correlation. We also computed Pearson correlations and overall find similar results (see Table 9 for a subset of the LLMs). Thus, using the Pearson correlation would not have led to different overall conclusions.

Setting	GPT-4o (paired)	Llama3-8b (indep.)	DeepSeek-V3.1 (paired)
raw (r)	0.905	0.701	0.878
raw ( $\rho$ )	0.897	0.678	0.889
diff (r)	0.527	0.043	0.603
diff ( $\rho$ )	0.512	0.041	0.579

Table 9: A comparison between Pearson (r) and Spearman ( $\rho$ ) correlations.
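The remark that rescaling the ratings leaves the Spearman correlation unchanged can be illustrated with a minimal sketch. The ratings below are hypothetical values chosen only for illustration, and the helper functions (`pearson`, `average_ranks`, `spearman`) are our own names, not part of the study's code. The sketch applies a monotonic but non-linear rescaling to the LLM ratings and compares both coefficients before and after.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for six tweets (illustrative values only).
human = [82.0, 35.0, 60.0, 20.0, 75.0, 48.0]
llm = [90.0, 30.0, 55.0, 10.0, 60.0, 58.0]

# A monotonic but non-linear rescaling of the LLM ratings.
llm_rescaled = [(v / 100) ** 3 for v in llm]

rho_before = spearman(human, llm)
rho_after = spearman(human, llm_rescaled)
r_before = pearson(human, llm)
r_after = pearson(human, llm_rescaled)

print(f"Spearman: {rho_before:.3f} -> {rho_after:.3f}")
print(f"Pearson:  {r_before:.3f} -> {r_after:.3f}")
```

Because the rescaling preserves the order of the ratings, the ranks and therefore the Spearman coefficient stay the same, whereas the Pearson coefficient depends on the actual values and changes.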

Gemma2

Gemma2-27b-it did not respond well to our prompts. It would often ask for more information instead of providing the requested ratings, e.g., “A: Careful B: Careless It seems difficult to accurately judge someone as "careless or "careful" based solely on whether they write "talking" vs "talkin’". Please clarify the task by telling me **what exactly** I should be looking for when judging A: high care/formality vs. B:"Careless" could refer to informal spelling choices, grammar errors suggesting hurriedness, and lack of punctuation indicating casual style. Can you give me more context”.

Per category results of gemma2-9b-it

See Table 10.

Type	Formal.	Care.	Age
g-dropping	-40.77	-38.44	-6.65
number subst.	-47.94	-41.05	-7.98
letter swap	-33.84	-39.26	-6.90
keyboard subst.	-45.86	-50.82	-10.15
lengthening (pron)	-32.00	-38.12	-7.35
lengthening (spel)	-33.37	-39.77	-6.14
vowel omission	-47.92	-47.64	-8.86

Table 10: The average drop in ratings by Gemma2-9b-instruct (paired setting) for formality (formal.), carefulness (care.) and age, when a non-conventional spelling variant is included in the tweet, compared to the conventional (i.e., standard) variant.
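As a minimal sketch of how a per-category average such as those in Table 10 can be obtained, the snippet below subtracts the rating of the conventional version from the rating of the non-conventional version for each tweet pair and averages the differences per variation type. The input rows and the function name `mean_drop_by_type` are hypothetical; the values are not taken from the study.

```python
from collections import defaultdict

# Hypothetical rows: (variation type, rating of the conventional tweet,
# rating of the non-conventional tweet). Illustrative values only.
ratings = [
    ("g-dropping", 70.0, 40.0),
    ("g-dropping", 65.0, 25.0),
    ("number subst.", 72.0, 20.0),
    ("number subst.", 68.0, 30.0),
]


def mean_drop_by_type(rows):
    """Average (non-conventional minus conventional) rating per variation type."""
    diffs = defaultdict(list)
    for variation_type, conventional, non_conventional in rows:
        diffs[variation_type].append(non_conventional - conventional)
    return {t: sum(d) / len(d) for t, d in diffs.items()}


drops = mean_drop_by_type(ratings)
for variation_type, drop in drops.items():
    print(f"{variation_type}\t{drop:.2f}")
```

A negative value means that the non-conventional spelling received a lower rating on average than its conventional counterpart.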

\appendixsection

Collection of human perception data

Figure 3 shows the starting instructions that were shown to the Prolific workers. Figure 4 shows an example page with two versions of a tweet. The full screenshots of the Prolific task will be made available on our GitHub repository.

We performed quality control by checking the response times (i.e. whether participants were too fast or too slow to have taken the study seriously). Following Speed, Wnuk, and Majid (2017), we inspected participants whose completion time was more than 1.5 times the interquartile range below the first quartile or more than 1.5 times the interquartile range above the third quartile. Ten participants were removed because they took too long. The fastest participant took 4.2 minutes, which was deemed enough time to respond adequately to all questions; hence, no participants were removed for completing the study too quickly. We also checked for straightlining (i.e. always giving the same response; this was not the case for any participant) and inspected the answers to the open questions for indications that participants did not take the task seriously (this was not the case for any participant). We excluded three participants who did not meet our location inclusion criteria (not UK), e.g., those answering ‘Asia’ or ‘Here’.
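The response-time check can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure's actual code: the completion times are hypothetical, the function names are our own, and the quartile method (`"inclusive"` in Python's `statistics.quantiles`) is an assumption, since the text does not specify how the quartiles were computed. The sketch only flags candidates; whether a flagged participant is removed remains a manual decision, as described above.

```python
from statistics import quantiles


def iqr_fences(durations, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles via linear interpolation over the observed range.
    q1, _, q3 = quantiles(durations, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(durations, k=1.5):
    """Split out the durations that fall below or above the IQR fences."""
    lower, upper = iqr_fences(durations, k)
    too_fast = [d for d in durations if d < lower]
    too_slow = [d for d in durations if d > upper]
    return too_fast, too_slow


# Hypothetical completion times in minutes (illustrative values only).
times = [4.2, 6.0, 7.5, 8.0, 8.5, 9.0, 10.0, 11.0, 12.5, 45.0]

too_fast, too_slow = flag_outliers(times)
print("flagged as too fast:", too_fast)
print("flagged as too slow:", too_slow)
```

With these illustrative times, only the 45-minute response lies beyond the upper fence, and no response lies below the lower fence.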

\appendixsection

Dataset

The dataset will be made available with a CC-BY license.

word	tweet (conv)	tweet (unconv)
doing	What are you doing?	What are you doin?
	Just doing my job	Just doin my job
working	We’re working on it	We’re workin on it
	I’m working on my next project	I’m workin on my next project
going	Going there tomorrow	Goin there tomorrow
	I’m going to try this	I’m goin to try this
getting	We’re getting ready	We’re gettin ready
	I’m just getting started	I’m just gettin started
talking	Now you’re talking	Now you’re talkin
	I was talking about this yesterday	I was talkin about this yesterday
bad	Today was a bad day	Today was a baaaad day
	This was a bad idea	This was a baaaad idea
cool	This looks cool	This looks cooool
	Oh how cool	Oh how cooool
know	I know	I knoooow
	If we never try how will we know	If we never try how will we knoooow
hello	Well hello there	Well helloooo there
	Hello how are you?	Helloooo how are you?
free	I want to break free	I want to break freeee
	I’m free until Monday	I’m freeee until Monday
fun	This was fun	This was funnnn
	It’s so much fun	It’s so much funnnn
done	I’m done	I’m doneeee
	What have you done?	What have you doneeee?
help	I accidentally deleted my account help	I accidentally deleted my account helpppp
	Which album should I buy help	Which album should I buy helpppp
best	Today was the best	Today was the bestttt
	Best weekend	Bestttt weekend
not	Why not?	Why notttt?
	Could not be happier about it	Could notttt be happier about it
never	Not sure but never mind	Not sure but nvr mind
	Never been so happy to be home	Nvr been so happy to be home
weekend	Have a good weekend	Have a good wknd
	I’ll call you after this weekend	I’ll call you after this wknd
would	Why would you do this?	Why wld you do this?
	Never would have guessed	Never wld have guessed
back	Finally made it back home	Finally made it bck home
	It’s all coming back	It’s all coming bck
when	Will call you when I get home	Will call you whn I get home
	Well when you put it like that	Well whn you put it like that
great	This is great	This is gr8
	Looks great	Looks gr8
late	Staying up late	Staying up l8
	I work late tomorrow	I work l8 tomorrow
forever	Forever grateful	4ever grateful
	For now, but not forever	For now, but not 4ever
tomorrow	Have a great day tomorrow	Have a great day 2morrow
	See you tomorrow	See you 2morrow
today	A lot is happening today	A lot is happening 2day
	Today will be better	2day will be better
exactly	That’s exactly how I feel about that	That’s excatly how I feel about that
	Exactly what I did	Excatly what I did
available	I’m available today	I’m avaliable today
	Is this still available?	Is this still avaliable?
because	Because I have a presentation tomorrow	Becuase I have a presentation tomorrow
	Probably because of this bad weather	Probably becuase of this bad weather
certainly	It will certainly help	It will certianly help
	I’ll certainly be watching	I’ll certianly be watching
against	This is what we’re up against	This is what we’re up aganist
	You and me against the world	You and me aganist the world
could	I could go right now	I coukd go right now
	I wish I could say the same	I wish I coukd say the same
about	I was just about to send you this	I was just abiut to send you this
	That sounds about right	That sounds abiut right
should	They really should	They really shpuld
	I should hope so	I shpuld hope so
something	Something is happening	Sonething is happening
	I was looking for something like this	I was looking for sonething like this
happy	Happy birthday	Hapoy birthday
	I’m happy to help	I’m hapoy to help

Table 11: The tweets presented to both humans and LLMs.

\appendixsection

Prompts

References

Abdurahman et al. (2024) Abdurahman, Suhaib, Mohammad Atari, Farzan Karimi-Malekabadi, Mona J Xue, Jackson Trager, Peter S Park, Preni Golazizian, Ali Omrani, and Morteza Dehghani. 2024. Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3(7):pgae245.
Agnew et al. (2024) Agnew, William, A. Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, and Kevin R. McKee. 2024. The illusion of artificial inclusion. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, Association for Computing Machinery, New York, NY, USA.
Aher, Arriaga, and Kalai (2023) Aher, Gati V, Rosa I. Arriaga, and Adam Tauman Kalai. 2023. Using large language models to simulate multiple humans and replicate human subject studies. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 337–371, PMLR.
Anderson and Toribio (2007) Anderson, Tyler Kimball and Almeida Jacqueline Toribio. 2007. Attitudes towards lexical borrowing and intra-sentential code-switching among Spanish-English bilinguals. Spanish in Context, 4(2):217–240.
Androutsopoulos (2000) Androutsopoulos, Jannis K. 2000. Non-standard spellings in media texts: The case of German fanzines. Journal of Sociolinguistics, 4(4):514–533.
Anthis et al. (2025) Anthis, Jacy Reese, Ryan Liu, Sean M Richardson, Austin C Kozlowski, Bernard Koch, Erik Brynjolfsson, James Evans, and Michael S Bernstein. 2025. Position: LLM social simulations are a promising research method. In Forty-second International Conference on Machine Learning Position Paper Track.
Argamon et al. (2009) Argamon, Shlomo, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. 2009. Automatically profiling the author of an anonymous text. Commun. ACM, 52(2):119–123.
Argyle et al. (2023) Argyle, Lisa P., Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351.
Bai et al. (2025) Bai, Xuechunzi, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths. 2025. Explicitly unbiased large language models still form biased associations. Proceedings of the National Academy of Sciences, 122(8):e2416228122.
Bamman, Eisenstein, and Schnoebelen (2014) Bamman, David, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.
Beck et al. (2024) Beck, Tilman, Hendrik Schuff, Anne Lauscher, and Iryna Gurevych. 2024. Sensitivity, performance, robustness: Deconstructing the effect of sociodemographic prompting. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2589–2615, Association for Computational Linguistics, St. Julian’s, Malta.
Binz et al. (2025) Binz, Marcel, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian Coda-Forno, Peter Dayan, Can Demircan, Maria K Eckstein, Noémi Éltető, et al. 2025. A foundation model to predict and capture human cognition. Nature, pages 1–8.
Blodgett et al. (2020) Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Association for Computational Linguistics, Online.
Brown et al. (2020) Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Brysbaert, Martínez, and Reviriego (2024) Brysbaert, Marc, Gonzalo Martínez, and Pedro Reviriego. 2024. Moving beyond word frequency based on tally counting: AI-generated familiarity estimates of words and phrases are an interesting additional index of language knowledge. Behavior Research Methods, 57(1):28.
Buchstaller (2006) Buchstaller, Isabelle. 2006. Social stereotypes, personality traits and regional perception displaced: Attitudes towards the ‘new’ quotatives in the U.K. Journal of Sociolinguistics, 10(3):362–381.
Bui et al. (2025) Bui, Minh Duc, Carolin Holtermann, Valentin Hofmann, Anne Lauscher, and Katharina von der Wense. 2025. Large language models discriminate against speakers of German dialects. In EMNLP 2025.
Busch (2021) Busch, Florian. 2021. Enregistered spellings in interaction: Social indexicality in digital written communication. Zeitschrift für Sprachwissenschaft, 40(3):297–323.
Campbell-Kibler (2009) Campbell-Kibler, Kathryn. 2009. The nature of sociolinguistic perception. Language Variation and Change, 21(1):135–156.
Campbell-Kibler (2010a) Campbell-Kibler, Kathryn. 2010a. The effect of speaker information on attitudes toward (ing). Journal of Language and Social Psychology, 29(2):214–223.
Campbell-Kibler (2010b) Campbell-Kibler, Kathryn. 2010b. Sociolinguistics and perception. Language and Linguistics Compass, 4(6):377–389.
Campbell-Kibler (2014) Campbell-Kibler, Kathryn. 2014. Accent, (ing), and the social logic of listener perceptions. American Speech, 82(1):2–64.
Chen et al. (2025) Chen, Hongyu, Neele Falk, Michael Roth, and Agnieszka Falenska. 2025. “Feels feminine to me”: Understanding perceived gendered style through human annotations. In EMNLP 2025.
Cheng, Yu, and Jurafsky (2025) Cheng, Myra, Sunny Yu, and Dan Jurafsky. 2025. HumT DumT: Measuring and controlling human-like language in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25983–26008, Association for Computational Linguistics, Vienna, Austria.
Cheung, Maier, and Lieder (2025) Cheung, Vanessa, Maximilian Maier, and Falk Lieder. 2025. Large language models show amplified cognitive biases in moral decision-making. Proceedings of the National Academy of Sciences, 122(25):e2412015122.
Choudhury et al. (2007) Choudhury, Monojit, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition (IJDAR), 10:157–174.
Conde et al. (2025) Conde, Javier, Miguel González Saiz, María Grandury, Pedro Reviriego, Gonzalo Martínez, and Marc Brysbaert. 2025. Psycholinguistic word features: a new approach for the evaluation of LLMs alignment with humans. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 8–17, Association for Computational Linguistics, Vienna, Austria and virtual meeting.
Contractor, Faruquie, and Subramaniam (2010) Contractor, Danish, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Coling 2010: Posters, pages 189–196, Coling 2010 Organizing Committee, Beijing, China.
Cook and Stevenson (2009) Cook, Paul and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pages 71–78, Association for Computational Linguistics, Boulder, Colorado.
Darics (2013) Darics, Erika. 2013. Non-verbal signalling in digital discourse: The case of letter repetition. Discourse, Context & Media, 2(3):141–148.
Deas et al. (2023) Deas, Nicholas, Jessica Grieser, Shana Kleiner, Desmond Patton, Elsbeth Turcan, and Kathleen McKeown. 2023. Evaluation of African American language bias in natural language generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6805–6824, Association for Computational Linguistics, Singapore.
DeepSeek-AI (2025) DeepSeek-AI. 2025. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
Dillion et al. (2023) Dillion, Danica, Niket Tandon, Yuling Gu, and Kurt Gray. 2023. Can AI language models replace human participants? Trends in Cognitive Sciences.
Duncan (2024) Duncan, Daniel. 2024. Does ChatGPT have sociolinguistic competence? Journal of Computer-Assisted Linguistic Research, 8:51–75.
Durmus et al. (2024) Durmus, Esin, Karina Nguyen, Thomas Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. 2024. Towards measuring the representation of subjective global opinions in language models. In First Conference on Language Modeling.
Eckert (2008) Eckert, Penelope. 2008. Variation and the indexical field. Journal of Sociolinguistics, 12(4):453–476.
Eckert (2012) Eckert, Penelope. 2012. Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41(1):87–100.
Eisenstein (2015) Eisenstein, Jacob. 2015. Systematic patterning in phonologically-motivated orthographic variation. Journal of Sociolinguistics, 19(2):161–188.
Ersoy et al. (2023) Ersoy, Asım, Gerson Vizcarra, Tahsin Mayeesha, and Benjamin Muller. 2023. In what languages are generative language models the most formal? Analyzing formality distribution across languages. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2650–2666, Association for Computational Linguistics, Singapore.
Faisal et al. (2024) Faisal, Fahim, Orevaoghene Ahia, Aarohi Srivastava, Kabir Ahuja, David Chiang, Yulia Tsvetkov, and Antonios Anastasopoulos. 2024. DIALECTBENCH: An NLP benchmark for dialects, varieties, and closely-related languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14412–14454, Association for Computational Linguistics, Bangkok, Thailand.
Fleisig et al. (2024) Fleisig, Eve, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, and Dan Klein. 2024. Linguistic bias in ChatGPT: Language models reinforce dialect discrimination. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13541–13564, Association for Computational Linguistics, Miami, Florida, USA.
Flekova et al. (2016) Flekova, Lucie, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016. Analyzing biases in human perception of user age and gender from text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 843–854, Association for Computational Linguistics, Berlin, Germany.
Flekova, Preoţiuc-Pietro, and Ungar (2016) Flekova, Lucie, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2016. Exploring stylistic variation with age and income on Twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 313–319, Association for Computational Linguistics, Berlin, Germany.
Fuchs et al. (2019) Fuchs, Susanne, Egor Savin, Stephanie Solt, Cornelia Ebert, and Manfred Krifka. 2019. Antonym adjective pairs and prosodic iconicity: Evidence from letter replications in an English blogger corpus. Linguistics Vanguard, 5(1):20180017.
Ganuza and Rydell (2024) Ganuza, Natalia and Maria Rydell. 2024. Turning talk into text: The representation of contemporary urban vernaculars in Swedish fiction. Text & Talk, 45(3):319–339.
Gao et al. (2020) Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
Garimella et al. (2019) Garimella, Aparna, Carmen Banea, Dirk Hovy, and Rada Mihalcea. 2019. Women’s syntactic resilience and men’s grammatical luck: Gender-bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3493–3498, Association for Computational Linguistics, Florence, Italy.
Garrett (2010) Garrett, Peter. 2010. Attitudes to Language. Key Topics in Sociolinguistics. Cambridge University Press.
Garrett, Williams, and Evans (2005) Garrett, Peter, Angie Williams, and Betsy Evans. 2005. Accessing social meanings: Values of keywords, values in keywords. Acta Linguistica Hafniensia, 37(1):37–54.
Gemma Team and others (2024) Gemma Team and others. 2024. Gemma 2: Improving open language models at a practical size.
Goldshtein et al. (2024) Goldshtein, Maria, Jaclyn Ocumpaugh, Andrew Potter, and Rod D. Roscoe. 2024. The social consequences of language technologies and their underlying language ideologies. In Universal Access in Human-Computer Interaction, pages 271–290, Springer Nature Switzerland, Cham.
van der Goot, van Noord, and van Noord (2018) van der Goot, Rob, Rik van Noord, and Gertjan van Noord. 2018. A taxonomy for in-depth evaluation of normalization for user generated content. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.
van der Goot et al. (2021) van der Goot, Rob, Alan Ramponi, Arkaitz Zubiaga, Barbara Plank, Benjamin Muller, Iñaki San Vicente Roncal, Nikola Ljubešić, Özlem Çetinoğlu, Rahmad Mahendra, Talha Çolakoğlu, Timothy Baldwin, Tommaso Caselli, and Wladimir Sidorenko. 2021. MultiLexNorm: A shared task on multilingual lexical normalization. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 493–509, Association for Computational Linguistics, Online.
Grattafiori et al. (2024) Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. 2024. The Llama 3 herd of models.
Grondelaers et al. (2023) Grondelaers, Stefan, Roeland van Hout, Hans van Halteren, and Esther Veerbeek. 2023. Why do we say them when we know it should be they? Twitter as a resource for investigating nonstandard syntactic variation in The Netherlands. Language Variation and Change, 35(2):223–245.
Grondelaers et al. (2020) Grondelaers, Stefan, Dirk Speelman, Chloé Lybaert, and Paul van Gent. 2020. Getting a (big) data-based grip on ideological change. Evidence from Belgian Dutch. Journal of Linguistic Geography, 8(1):49–65.
Hilte, Vandekerckhove, and Daelemans (2019) Hilte, Lisa, Reinhild Vandekerckhove, and Walter Daelemans. 2019. Adolescents’ perceptions of social media writing: Has non-standard become the new standard? European Journal of Applied Linguistics, 7(2):189–224.
Hinrichs and White-Sustaíta (2011) Hinrichs, Lars and Jessica White-Sustaíta. 2011. Global Englishes and the sociolinguistics of spelling: A study of Jamaican blog and email writing. English World-Wide, 32(1):46–73.
Hofmann et al. (2024) Hofmann, Valentin, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. 2024. AI generates covertly racist decisions about people based on their dialect. Nature, 633(8028):147–154.
Holliday and Tano (2021) Holliday, Nicole and Marie Tano. 2021. “It’s a whole vibe”: testing evaluations of grammatical and ungrammatical AAE on Twitter. Linguistics Vanguard, 7(1):20200095.
Honeybone and Watson (2013) Honeybone, Patrick and Kevin Watson. 2013. Salience and the sociolinguistics of scouse spelling: Exploring the phonology of the contemporary humorous localised dialect literature of Liverpool. Journal of Varieties of English, 34(3):305–340.
Hu et al. (2023) Hu, Jennifer, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and Edward Gibson. 2023. A fine-grained comparison of pragmatic language understanding in humans and language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4194–4213, Association for Computational Linguistics, Toronto, Canada.
Ilbury (2020) Ilbury, Christian. 2020. “Sassy Queens”: Stylistic orthographic variation in Twitter and the enregisterment of AAVE. Journal of Sociolinguistics, 24(2):245–264.
Ilbury, Grieve, and Hall (2024) Ilbury, Christian, Jack Grieve, and David Hall. 2024. Using social media to infer the diffusion of an urban contact dialect: A case study of Multicultural London English. Journal of Sociolinguistics, 28(3):45–70.
Jaffe et al. (2012) Jaffe, Alexandra, Jannis Androutsopoulos, Mark Sebba, and Sally Johnson (eds.). 2012. Orthography as Social Action: Scripts, Spelling, Identity and Power. De Gruyter Mouton.
Jin et al. (2022) Jin, Di, Zhijing Jin, Zhiting Hu, Olga Vechtomova, and Rada Mihalcea. 2022. Deep learning for text style transfer: A survey. Computational Linguistics, 48(1):155–205.
Joshi et al. (2025) Joshi, Aditya, Raj Dabre, Diptesh Kanojia, Zhuang Li, Haolan Zhan, Gholamreza Haffari, and Doris Dippold. 2025. Natural language processing for dialects of a language: A survey. ACM Comput. Surv., 57(6).
Kang and Hovy (2021) Kang, Dongyeop and Eduard Hovy. 2021. Style is NOT a single variable: Case studies for cross-stylistic language understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2376–2387, Association for Computational Linguistics, Online.
Karamolegkou, Abdou, and Søgaard (2023) Karamolegkou, Antonia, Mostafa Abdou, and Anders Søgaard. 2023. Mapping brains with language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9748–9762, Association for Computational Linguistics, Toronto, Canada.
Keuleers and Brysbaert (2010) Keuleers, Emmanuel and Marc Brysbaert. 2010. Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3):627–633.
Kircher and Zipp (2022) Kircher, Ruth and Lena Zipp. 2022. Research Methods in Language Attitudes. Cambridge University Press.
Labov (1972) Labov, William. 1972. Sociolinguistic Patterns. University of Pennsylvania Press, Philadelphia.
Lambert et al. (1960) Lambert, Wallace E, Richard C Hodgson, Robert C Gardner, and Samuel Fillenbaum. 1960. Evaluational reactions to spoken languages. The Journal of Abnormal and Social Psychology, 60(1):44.
Lauscher et al. (2022) Lauscher, Anne, Federico Bianchi, Samuel R. Bowman, and Dirk Hovy. 2022. SocioProbe: What, when, and where language models learn about sociodemographics. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7901–7918, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates.
Leigh (2018) Leigh, Daisy. 2018. Expecting a performance: Listener expectations of social meaning in social media. Presented at New Ways of Analyzing Variation (NWAV) 47.
Lin et al. (2025) Lin, Fangru, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael J. Wooldridge, Janet B. Pierrehumbert, and Furu Wei. 2025. Assessing dialect fairness and robustness of large language models in reasoning tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6317–6342, Association for Computational Linguistics, Vienna, Austria.
Liu et al. (2011) Liu, Fei, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011. Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 71–76, Association for Computational Linguistics, Portland, Oregon, USA.
Llamas and Watt (2014) Llamas, Carmen and Dominic Watt. 2014. Scottish, English, British?: Innovations in attitude measurement. Language and Linguistics Compass, 8(11):610–617.
Loureiro-Rodríguez and Acar (2022) Loureiro-Rodríguez, Verónica and Elif Fidan Acar. 2022. The matched-guise technique. In Ruth Kircher and Lena Zipp, editors, Research Methods in Language Attitudes. Cambridge University Press, pages 185–202.
Lourentzou, Manghnani, and Zhai (2019) Lourentzou, Ismini, Kabir Manghnani, and ChengXiang Zhai. 2019. Adapting sequence to sequence models for text normalization in social media. Proceedings of the International AAAI Conference on Web and Social Media, 13(01):335–345.
Lucy et al. (2024) Lucy, Li, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, and Alexandra Olteanu. 2024. “One-size-fits-all”? Examining expectations around what constitute “fair” or “good” NLG system behaviors. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1054–1089, Association for Computational Linguistics, Mexico City, Mexico.
Lutz et al. (2025) Lutz, Marlene, Indira Sen, Georg Ahnert, Elisa Rogers, and Markus Strohmaier. 2025. The prompt makes the person(a): A systematic evaluation of sociodemographic persona prompting for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23212–23237, Association for Computational Linguistics, Suzhou, China.
Malik, Jiang, and Chai (2024) Malik, Manuj, Jing Jiang, and Kian Ming A. Chai. 2024. An empirical analysis of the writing styles of persona-assigned LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19369–19388, Association for Computational Linguistics, Miami, Florida, USA.
Martínez et al. (2024) Martínez, Gonzalo, Juan Diego Molero, Sandra González, Javier Conde, Marc Brysbaert, and Pedro Reviriego. 2024. Using large language models to estimate features of multi-word expressions: Concreteness, valence, arousal. Behavior Research Methods, 57(1):5.
McCoy et al. (2024) McCoy, R. Thomas, Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths. 2024. Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences, 121(41):e2322420121.
Nguyen (2025) Nguyen, Dong. 2025. Collaborative growth: When large language models meet sociolinguistics. Language and Linguistics Compass, 19(2):e70010.
Nguyen et al. (2016) Nguyen, Dong, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong. 2016. Computational sociolinguistics: A survey. Computational Linguistics, 42(3):537–593.
Nguyen and Grieve (2020) Nguyen, Dong and Jack Grieve. 2020. Do word embeddings capture spelling variation? In Proceedings of the 28th International Conference on Computational Linguistics, pages 870–881, International Committee on Computational Linguistics, Barcelona, Spain (Online).
Nguyen, Rosseel, and Grieve (2021) Nguyen, Dong, Laura Rosseel, and Jack Grieve. 2021. On learning and representing social meaning in NLP: a sociolinguistic perspective. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 603–612, Association for Computational Linguistics, Online.
Nguyen et al. (2014) Nguyen, Dong, Dolf Trieschnigg, A. Seza Doğruöz, Rilana Gravel, Mariët Theune, Theo Meder, and Franciska de Jong. 2014. Why gender and age prediction from tweets is hard: Lessons from a crowdsourcing experiment. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1950–1961, Dublin City University and Association for Computational Linguistics, Dublin, Ireland.
Nielsen, Kirov, and Roark (2023) Nielsen, Elizabeth, Christo Kirov, and Brian Roark. 2023. Spelling convention sensitivity in neural language models. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1334–1346, Association for Computational Linguistics, Dubrovnik, Croatia.
OpenAI team (2023) OpenAI team. 2023. GPT-4 technical report. ArXiv preprint arXiv:2303.08774.
Pavlick and Tetreault (2016) Pavlick, Ellie and Joel Tetreault. 2016. An empirical analysis of formality in online communication. Transactions of the Association for Computational Linguistics, 4:61–74.
Pennell and Liu (2011) Pennell, Deana and Yang Liu. 2011. A character-level machine translation approach for normalization of SMS abbreviations. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 974–982, Asian Federation of Natural Language Processing, Chiang Mai, Thailand.
Petrov et al. (2023) Petrov, Aleksandar, Emanuele La Malfa, Philip Torr, and Adel Bibi. 2023. Language model tokenizers introduce unfairness between languages. In Thirty-seventh Conference on Neural Information Processing Systems.
Preoţiuc-Pietro, Chandra Guntuku, and Ungar (2017) Preoţiuc-Pietro, Daniel, Sharath Chandra Guntuku, and Lyle Ungar. 2017. Controlling human perception of basic user traits. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2335–2341, Association for Computational Linguistics, Copenhagen, Denmark.
Rao and Tetreault (2018) Rao, Sudha and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, Association for Computational Linguistics, New Orleans, Louisiana.
Russ (2012) Russ, Brice. 2012. Social meaning in social media: Perceptual judgments of orthographic variation on Facebook and Twitter. The Ohio State University Second Qualifying Paper Talk.
Rácz, Hay, and Pierrehumbert (2017) Rácz, Peter, Jennifer B. Hay, and Janet B. Pierrehumbert. 2017. Social salience discriminates learnability of contextual cues in an artificial language. Frontiers in Psychology, 8.
Samara et al. (2017) Samara, Anna, Kenny Smith, Helen Brown, and Elizabeth Wonnacott. 2017. Acquiring variation in an artificial language: Children and adults are sensitive to socially conditioned linguistic variation. Cognitive Psychology, 94:85–114.
Sandoval et al. (2025) Sandoval, Sandra Camille, Christabel Acquaye, Kwesi Adu Cobbina, Mohammad Nayeem Teli, and Hal Daumé III. 2025. My LLM might mimic AAE - but when should it? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5277–5302, Association for Computational Linguistics, Albuquerque, New Mexico.
Santurkar et al. (2023) Santurkar, Shibani, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning, ICML’23, JMLR.org.
Schwartz et al. (2013) Schwartz, H. Andrew, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Stephanie M. Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin E. P. Seligman, and Lyle H. Ungar. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLOS ONE, 8(9):1–16.
Sclar et al. (2024) Sclar, Melanie, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
Sebba (2007) Sebba, Mark. 2007. Spelling and society: The culture and politics of orthography around the world. Cambridge University Press.
Shortis (2016) Shortis, Timothy Francis John. 2016. Orthographic practices in SMS text messaging as a case signifying diachronic change in linguistic and semiotic resources. Ph.D. thesis, UCL (University College London).
Shu et al. (2024) Shu, Bangzhao, Lechen Zhang, Minje Choi, Lavinia Dunagan, Lajanugen Logeswaran, Moontae Lee, Dallas Card, and David Jurgens. 2024. You don’t need a personality test to know these models are unreliable: Assessing the reliability of large language models on psychometric instruments. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5263–5281, Association for Computational Linguistics, Mexico City, Mexico.
Soldaini et al. (2024) Soldaini, Luca, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo. 2024. Dolma: an open corpus of three trillion tokens for language model pretraining research.
Soukup (2013) Soukup, Barbara. 2013. On matching speaker (dis)guises – revisiting a methodological tradition. Novus.
Sourati et al. (2024) Sourati, Zhivar, Filip Ilievski, Pia Sommerauer, and Yifan Jiang. 2024. ARN: Analogical reasoning on narratives. Transactions of the Association for Computational Linguistics, 12:1063–1086.
Sourati et al. (2025) Sourati, Zhivar, Farzan Karimi-Malekabadi, Meltem Ozcan, Colin McDaniel, Alireza Ziabari, Jackson Trager, Ala Tak, Meng Chen, Fred Morstatter, and Morteza Dehghani. 2025. The shrinking landscape of linguistic diversity in the age of large language models.
Speed, Wnuk, and Majid (2017) Speed, Laura J., Ewelina Wnuk, and Asifa Majid. 2017. Studying Psycholinguistics out of the Lab, chapter 10. John Wiley & Sons, Ltd.
Squires (2010) Squires, Lauren. 2010. Enregistering internet language. Language in Society, 39(4):457–492.
Sterner and Teufel (2025) Sterner, Igor and Simone Teufel. 2025. Minimal pair-based evaluation of code-switching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18575–18598, Association for Computational Linguistics, Vienna, Austria.
Tagg (2009) Tagg, Caroline. 2009. A corpus linguistics study of SMS text messaging. Ph.D. thesis, University of Birmingham.
Tjuatja et al. (2024) Tjuatja, Lindia, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, and Graham Neubig. 2024. Do LLMs exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011–1026.
Verheijen (2018) Verheijen, Lieke. 2018. Orthographic principles in computer-mediated communication: The SUPER-functions of textisms and their interaction with age and medium. Written Language & Literacy, 7:111–145.
Vosters et al. (2012) Vosters, Rik, Gijsbert Rutten, Marijke Van der Wal, and Wim Vandenbussche. 2012. Spelling and identity in the Southern Netherlands (1750–1830), chapter 6. De Gruyter Mouton.
Waldis et al. (2024) Waldis, Andreas, Yotam Perlitz, Leshem Choshen, Yufang Hou, and Iryna Gurevych. 2024. Holmes: A benchmark to assess the linguistic competence of language models. Transactions of the Association for Computational Linguistics, 12:1616–1647.
Walker et al. (2014) Walker, Abby, Christina García, Yomi Cortés, and Kathryn Campbell-Kibler. 2014. Comparing social meanings across listener and speaker groups: The indexical field of Spanish /s/. Language Variation and Change, 26(2):169–189.
Wang, Morgenstern, and Dickerson (2025) Wang, Angelina, Jamie Morgenstern, and John P. Dickerson. 2025. Large language models that replace human participants can harmfully misportray and flatten identity groups. Nature Machine Intelligence, 7:400–411.
Warstadt et al. (2020) Warstadt, Alex, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.
Wong (2013) Wong, Andrew D. 2013. Brand names and unconventional spelling: A two-pronged analysis of the orthographic construction of brand identity. Written Language & Literacy, 16(2):115–145.
Yang and Eisenstein (2013) Yang, Yi and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 61–72, Association for Computational Linguistics, Seattle, Washington, USA.
Zacharopoulos, Desbordes, and Sablé-Meyer (2023) Zacharopoulos, Christos, Théo Desbordes, and Mathias Sablé-Meyer. 2023. Assessing the influence of attractor-verb distance on grammatical agreement in humans and language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16081–16090, Association for Computational Linguistics, Singapore.
Zenner, Rosseel, and Speelman (2021) Zenner, Eline, Laura Rosseel, and Dirk Speelman. 2021. Starman or Sterrenman: An acquisitional perspective on the social meaning of English in Flanders. International Journal of Bilingualism, 25(3):568–591.
Zhang et al. (2021) Zhang, Sheng, Xin Zhang, Weiming Zhang, and Anders Søgaard. 2021. Sociolectal analysis of pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4581–4588, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic.
Ziems et al. (2023) Ziems, Caleb, William Held, Jingfeng Yang, Jwala Dhamala, Rahul Gupta, and Diyi Yang. 2023. Multi-VALUE: A framework for cross-dialectal English NLP. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 744–768, Association for Computational Linguistics, Toronto, Canada.