
Enhancing Language Data for Greater Precision in Natural Language Processing

Various applications, including language translation, spam detection, and digital assistants, rely heavily on language datasets. However, not all languages have high-quality datasets, so these tools are often designed for only one or a small number of languages. This scarcity leaves a significant divide between speakers whose languages such tools serve well and those whose languages they do not.


In a significant step towards bridging the gap in natural language processing (NLP) capabilities across languages, the White House Office of Science and Technology Policy (OSTP) and the National Science Foundation (NSF) are focusing on improving large AI training datasets. This initiative, spearheaded by the development of the National AI Research Resource (NAIRR), aims to provide high-quality datasets for NLP on low-resource languages.

The NAIRR, a joint program led by NSF in collaboration with OSTP and other entities, aims to offer the research community access to world-class private sector computing, AI models, data, and software resources. This access will enable researchers working on low-resource languages to train and evaluate on better datasets and models, addressing systemic issues such as a lack of usable text and low-quality labeling.
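The systemic issues named here, a lack of usable text and low-quality labeling, are described only at a high level. Purely as an illustrative sketch (the thresholds, heuristics, and sample lines below are assumptions for the example, not anything specified by NAIRR), the following Python snippet shows the kind of surface-level filtering researchers commonly apply when assembling web-scraped text for a low-resource language:

    # Illustrative corpus-cleaning heuristics for web-scraped text.
    # Thresholds below are assumptions for this sketch, not NAIRR requirements.
    import re
    from collections import OrderedDict

    def clean_corpus(lines, min_chars=20, max_chars=2000, min_letter_ratio=0.6):
        """Drop too-short, too-long, symbol-heavy, and duplicate lines."""
        kept = OrderedDict()
        for line in lines:
            text = re.sub(r"\s+", " ", line).strip()
            if not (min_chars <= len(text) <= max_chars):
                continue  # likely navigation text or extraction noise
            letters = sum(ch.isalpha() for ch in text)
            if letters / len(text) < min_letter_ratio:
                continue  # mostly digits, markup, or punctuation
            key = text.lower()
            if key not in kept:  # drop exact duplicates after normalization
                kept[key] = text
        return list(kept.values())

    if __name__ == "__main__":
        raw = [
            "Habari ya leo? Karibu kwenye ukurasa wetu.",
            "Habari ya leo?  Karibu kwenye ukurasa wetu.",  # duplicate after whitespace cleanup
            "123 456 789 0123 4567 !!!",                    # symbol-heavy residue
            "Sawa",                                          # too short to be useful
        ]
        for sentence in clean_corpus(raw):
            print(sentence)

Real pipelines layer on language identification, cross-document deduplication, and human review of labels; the point of the sketch is only that "usable text" has to be engineered rather than assumed.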

The NAIRR should also focus on languages like Indonesian, Bengali, Hindi, Urdu, Swahili, and others that are often underserved in terms of NLP resources. These languages, spoken by millions of people, are crucial for widening the scope of NLP and ensuring inclusivity.

The development of high-quality datasets for low-resource languages can help boost the use of NLP more widely, particularly in communities where tools like translation applications don't work effectively. As NLP models are the product of their training data, inaccurate data can lead to discrimination and obstacles in detecting online harms. Without better language data, a divide will persist between those with access to accurate, data-driven technologies powered by NLP and those without.

In addition to the NAIRR, the OSTP and federal offices are actively pursuing initiatives to address disparities in NLP models for low-resource languages. These initiatives include building world-class scientific datasets, encouraging collaboration with industry, fostering an inclusive AI R&D strategy, and removing regulatory barriers.

As a forum for international collaboration, the United Nations can also play a key role in building, funding, and expanding existing language datasets. The U.S. government should explore opportunities to partner with other countries to aid the development of accurate NLP-enabled tools for non-English languages.

Moreover, language identification tools often misclassify African American Vernacular English (AAVE), potentially excluding users who post online using AAVE from research datasets or limiting their ability to use NLP-powered tools. Improving the quality of language datasets can help reduce discrimination and obstacles in detecting online harms.
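To make the misclassification concern concrete, here is a minimal Python sketch of how one might audit an off-the-shelf language identifier on dialectal versus standardized English. It assumes the third-party langdetect package; the sample sentences are invented for illustration, and the predictions depend on the model, so the misclassification described above may or may not reproduce:

    # Minimal audit of an off-the-shelf language identifier on dialectal English.
    # Assumes the third-party `langdetect` package (pip install langdetect);
    # sentences are illustrative and predictions are not guaranteed to reproduce
    # the misclassification discussed in the text.
    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # make langdetect's sampling deterministic

    samples = [
        "She been working on that project all week.",      # dialectal English
        "He finna head to the store right quick.",         # dialectal English
        "She has been working on that project all week.",  # standardized English
    ]

    for text in samples:
        predictions = detect_langs(text)  # ranked guesses, e.g. [en:0.99, ...]
        top = predictions[0]
        note = "" if top.lang == "en" else "  <-- not identified as English"
        print(f"{text!r} -> {predictions}{note}")

A dataset builder who keeps only text where the top prediction is English with high confidence would silently drop whatever the identifier gets wrong, which is exactly how dialect speakers can disappear from research corpora.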

NLP holds great potential to transform society by enabling more people to connect and communicate online, access information, and participate in democracy, but only if major improvements are made in the quality of available language datasets. The OSTP, NSF, and their collaborators are working to ensure that NLP benefits everyone, regardless of the language they speak.

  1. The White House Office of Science and Technology Policy (OSTP) and the National Science Foundation (NSF), in collaboration with other entities, are jointly leading the National AI Research Resource (NAIRR), focusing on improving large AI training datasets for natural language processing (NLP) on low-resource languages.
  2. The NAIRR is designed to offer the research community access to world-class private sector computing, AI models, data, and software resources, enabling researchers to train and evaluate on better datasets and models for low-resource languages.
  3. The NAIRR should prioritize languages such as Indonesian, Bengali, Hindi, Urdu, Swahili, and others, which are often underserved in terms of NLP resources, to widen the scope of NLP and ensure inclusivity.
  4. The development of high-quality datasets for low-resource languages can address systemic issues in NLP, such as a lack of usable text and low-quality labeling, boosting the use of NLP more widely, particularly in communities where current tools are ineffective.
  5. In addition to the NAIRR, the OSTP and federal offices are pursuing initiatives to address disparities in NLP models for low-resource languages, including building world-class scientific datasets, encouraging collaboration with industry, fostering an inclusive AI R&D strategy, and removing regulatory barriers.
  6. The United Nations can play a key role in international collaboration, building, funding, and expanding existing language datasets to help ensure accurate NLP-enabled tools for non-English languages, reducing discrimination and obstacles in detecting online harms.
