At Google I/O, Google’s latest annual developer conference, CEO Sundar Pichai announced the company’s latest breakthrough, the “Language Model for Dialogue Applications”, or LaMDA. LaMDA is a language AI technology that can chat about any topic. Even an ordinary chatbot can hold a conversation, so what makes LaMDA special?
Most current conversational agents, or chatbots, follow a narrow, pre-defined conversational path, while LaMDA can engage in a free-flowing, open-ended conversation much as humans do. Google plans to integrate this technology into its search engine as well as other products such as its voice assistant, Workspace, and Gmail, so that people can retrieve any kind of information, in any format (text, visual, or audio), from Google’s suite of products. LaMDA is an example of what is known as a large language model (LLM).
Introduction and Capabilities
What is a language model (LM)? A language model is a statistical tool that assigns a probability to a given sequence of words. Simply put, it is trained to predict the next word in a sentence, much like the autocomplete in a text-messaging app. Where weather models predict the 7-day forecast, language models try to find patterns in human language, one of computer science’s most difficult puzzles because languages are ever-changing and adaptable.
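To make the idea of "predicting the next word" concrete, here is a minimal sketch of a bigram language model, which estimates the probability of a word from the single word before it. It is only an illustration: the tiny corpus, the function names, and the `<s>`/`</s>` sentence markers are assumptions made for this example, and real LLMs such as LaMDA or GPT-3 use neural networks rather than simple counts.

```python
from collections import Counter, defaultdict

# Hypothetical sentence-boundary markers used only in this sketch.
START, END = "<s>", "</s>"


def tokenize(sentence):
    """Lowercase, split on whitespace, and add boundary markers."""
    return [START] + sentence.lower().split() + [END]


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    follower_counts = defaultdict(Counter)
    for sentence in sentences:
        words = tokenize(sentence)
        for previous, current in zip(words, words[1:]):
            follower_counts[previous][current] += 1
    return follower_counts


def next_word_probabilities(model, previous_word):
    """Turn the follower counts of one word into probabilities."""
    counts = model.get(previous_word.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def predict_next_word(model, previous_word):
    """Autocomplete: return the most frequent follower, or None if unseen."""
    counts = model.get(previous_word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def sequence_probability(model, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    words = tokenize(sentence)
    probability = 1.0
    for previous, current in zip(words, words[1:]):
        counts = model.get(previous)
        if not counts:
            return 0.0  # the preceding word never appeared in training
        probability *= counts[current] / sum(counts.values())
    return probability


# Toy training data, invented for this example.
corpus = [
    "the weather is sunny today",
    "the weather is rainy today",
    "the weather is sunny again",
    "the forecast is sunny",
]
model = train_bigram_model(corpus)

if __name__ == "__main__":
    print(predict_next_word(model, "is"))
    print(next_word_probabilities(model, "is"))
    print(sequence_probability(model, "the weather is sunny today"))
```

In this toy corpus, "sunny" follows "is" in three of the four sentences, so the model suggests "sunny" after "is" with probability 0.75. A word order the model has never seen gets probability zero, which is one reason practical models need far more data and smarter estimation than raw counts.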
A language model is called a large language model when it is trained on an enormous amount of data. Other examples of LLMs include Google’s BERT and OpenAI’s GPT-2 and GPT-3. GPT-3 was the largest language model known at the time, with 175 billion parameters trained on 570 gigabytes of text. These models can do everything from writing a simple essay to generating complex computer code, all with little to no supervision.
Limitations and Impact on Society
As exciting as this technology may sound, it has some alarming shortcomings.
1. Bias: Studies have shown that these models absorb racist, sexist, and otherwise discriminatory ideas. They can also produce text that encourages genocide, self-harm, and child sexual abuse. Google already uses an LLM in its search engine, so these biases can carry over into search results. Because Google serves not only as a primary knowledge base for the general public but also as information infrastructure for universities and other institutions, biased results can have very harmful consequences.
2. Environmental impact: LLMs also have an outsized impact on the environment. Training them can emit a strikingly large amount of carbon dioxide, by one estimate nearly five times the lifetime emissions of an average car, including the car’s manufacture.
3. Misinformation: Experts have also warned that these models could be used to mass-produce misinformation, because their output is fluent enough that people can mistake it for human writing. Some models have already excelled at writing convincing fake news articles.
4. Mishandling negative data: The world speaks many languages that Silicon Valley does not prioritize. These languages are largely unaccounted for in mainstream language technologies, so the communities that speak them are affected the most. When a platform automates its content moderation with an LLM that cannot handle these languages, the model struggles to contain misinformation. During extraordinary situations, such as a riot, the volume of harmful content coming in is huge, and the result is a hostile digital environment. The problem does not end there. When fake news, hate speech, and other harmful text is not filtered out, it becomes training data for the next generation of LLMs, which then parrot these toxic linguistic patterns back onto the internet.
Further Research for Better Models
Despite all these challenges, very little research is being done to understand how this technology can affect us or how better LLMs can be designed. In fact, the few big companies that have the required resources to train and maintain LLMs refuse or show no interest in investigating them. But it’s not just Google that is planning to use this technology. Facebook has developed its own LLMs for translation and content moderation while Microsoft has exclusively licensed GPT-3. Many startups have also started creating products and services based on these models.
While the big tech companies are building private and mostly inaccessible models that cannot be used for research, a New York-based startup called Hugging Face is leading a research workshop to build an open-source LLM that will serve as a shared resource for the scientific community and can be used to learn more about the capabilities and limitations of these models. This year-long project (from May 2021 to May 2022), called the ‘Summer of Language Models 21’ (‘BigScience’ for short), has more than 500 researchers from around the world working together on a volunteer basis.
The collaborative is divided into multiple working groups, each investigating a different aspect of model development. One group will work on calculating the model’s environmental impact, while another will focus on responsible ways of sourcing training data that is free from toxic language. One working group is dedicated to the model’s multilingual character, including coverage of minority languages. To start with, the team has selected eight language families, which include English, Chinese, Arabic, Indic (including Hindi and Urdu), and Bantu (including Swahili).
Hopefully, the BigScience project will help produce better tools and practices for building and deploying LLMs responsibly. The enthusiasm around large language models cannot be curbed, but it can surely be nudged in a direction that has fewer shortcomings. Soon enough, all our digital communications, be it emails, search results, or social media posts, will be filtered using LLMs. These large language models are the next frontier for artificial intelligence.