When exploring the origins of language, and how we use and process it, it is perhaps logical to assume that the vast canon of published works may provide some guidance. That was the hypothesis of a new study from the University at Buffalo, which examined around 26,000 books to explore how they can help answer a wide range of questions about language.
“Previously in linguistics it was assumed a lot of our ability to use language was instinctual and that our environmental experience lacked the depth necessary to fully acquire the necessary skills,” the researchers explain. “The models that we’re developing today have us questioning those earlier conclusions. Environment does appear to be shaping behavior.”
The research relies on advances in natural language processing to tackle questions that were previously intractable. These advances enabled the team to develop distributional models analogous to the human language learning process. In total, some 26,000 books from 3,000 authors were fed into the model.
Natural languages
The researchers were able to place each of the 26,000 books in both time and location, which allowed them to identify the impact this had on the language used and on the way language developed over time. They supplemented this with a further analysis of 10 different studies that used multiple psycholinguistic tasks.
“The question this paper tries to answer is, ‘If we train a model with similar materials that someone in the U.K. might have read versus what someone in the U.S. might have read, will they become more like these people?'” the authors say. “We found that the environment people are embedded in seems to shape their behavior.”
This was especially true of books that were largely about different cultures, which helped to explain a large share of the variance in the data.
“It’s a huge benefit to have a culture-specific corpus, and an even greater benefit to have a time-specific corpus,” the researchers say. “The differences we find in language environment and behavior as a function of time and place is what we call the ‘selective reading hypothesis.'”
Culture specific
By using machine learning to crunch such a large corpus of material, the researchers were able to gauge the importance of nature and environment to the way language evolves. The team believes this opens the door for using machine learning to improve the way we learn languages. Indeed, they believe such models can even assess someone’s language and estimate the kind of books they’ve read.
The team also believes the work could help with our understanding and treatment of Alzheimer’s. They note that slight memory loss can be a clear sign of the condition’s early stages, even when no other form of cognitive decline is evident. People showing such signs are believed to have around a 10-15% chance of developing Alzheimer’s, compared with just 2% for the general population aged over 65.
“We’re finding that people who go on to develop Alzheimer’s across time are showing specific types of language loss and production where they seem to be losing long-distance semantic associations between words, as well as low-frequency words,” they say. “Can we develop tasks and stimuli that will allow that group to retain their language ability for longer, or develop a more personalized assessment to understand what type of information they’re losing in their cognitive system?”