MINERVA VOICES

How One Minerva Student is Preserving an Endangered Indigenous Language

Explore how Alex Korablev (M27) is using AI to help preserve Qom, an endangered Indigenous language spoken in Argentina.

June 3, 2026

Your capstone focuses on helping preserve the Qom language, which is at risk of disappearing. What first drew you to this project, and why did it feel important to you personally?

Before Minerva, I studied linguistics, one of those interests I carried without quite knowing where it would lead. Getting into computer science made NLP (Natural Language Processing) the obvious next step. When I started looking for research opportunities in Argentina, I found Professor Viviana Cotik's work at the University of Buenos Aires on NLP for low-resource indigenous languages. I reached out, and a few weeks later, I was on the team.

Around 80,000 people in Argentina self-identify as Qom, an indigenous community of the Gran Chaco region in the country's north, but only an estimated 30,000 of them speak the language fluently. That leaves more than 60 percent of the community no longer carrying the heritage language. The trend is downward: Qom is listed as endangered by UNESCO, has very limited reach in the school system, and almost no digital presence. There was no Qom on Google Translate, no AI tool that could process a sentence of it, and no NLP dataset that even acknowledged the language existed. So you have a community whose language is fading faster than the community itself, and the tools that might help haven't been built yet.

For readers who may not be familiar, can you briefly explain what makes Qom such a unique and complex language?

Qom belongs to the Guaycurú family, a small group of indigenous languages spoken in the Gran Chaco region of Argentina. It is what linguists call polysynthetic and agglutinative, meaning a single word can carry the meaning of an entire sentence in English. A verb might bundle the subject, the object, tense, evidentiality, and direction into one form. Word order also shifts depending on whether the verb is transitive or intransitive, which is unusual for speakers used to languages like Spanish or English.

What's tricky from a computer science perspective is that Qom was only recently written down. There are several dialect areas in the region, and the orthography varies between authors and communities. So the model has to learn a language whose written form is still being negotiated.

You’re applying computer science to language preservation. Can you walk us through what you’re actually building or researching in your capstone, as well as some of the biggest technical challenges you’ve faced?

The work breaks into two parts: building the first Qom–Spanish parallel corpus in a computationally usable format, which is a structured collection of sentences in both languages that machines can learn from, and training a neural translation model on top of it as a baseline that future work can improve on.

The hardest part has been data scarcity. We started with a few short stories and a handful of bilingual booklets, and we've been expanding from there, drawing on sources such as the Bible, the Universal Declaration of Human Rights, The Little Prince, and educational materials. Just getting clean text out of those sources is harder than it sounds. Most are scanned PDFs with broken encodings, and Qom uses characters like ỹ and ' (a glottal-stop apostrophe) that often don't survive a copy-paste. For the worst cases, we used a large language model to reconstruct the text from corrupted output, then corrected it manually.
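
As a rough illustration of the cleanup described above, the Python sketch below normalizes extracted text so that characters like ỹ and the glottal-stop apostrophe are stored one consistent way. The list of apostrophe look-alikes and the choice of U+0027 as the canonical form are assumptions made for this example, not the project's actual rules.

```python
import unicodedata

# Assumption: look-alike characters that PDF extraction may emit in place of
# the glottal-stop apostrophe (right/left quote, modifier apostrophe,
# acute accent, backtick).
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u00b4\u0060"
# Assumption: the corpus standardizes on the plain ASCII apostrophe.
CANONICAL_APOSTROPHE = "'"

_APOSTROPHE_TABLE = str.maketrans(
    {ch: CANONICAL_APOSTROPHE for ch in APOSTROPHE_VARIANTS}
)


def normalize_line(text: str) -> str:
    """Return text with composed characters, one apostrophe form, tidy spaces."""
    # NFC merges "y" + combining tilde (U+0303) into the single character ỹ.
    text = unicodedata.normalize("NFC", text)
    # Map every apostrophe look-alike to the canonical apostrophe.
    text = text.translate(_APOSTROPHE_TABLE)
    # Collapse runs of whitespace left behind by PDF extraction.
    return " ".join(text.split())
```

Running every source through one function like this before alignment means the same word is not counted as several different strings just because two PDFs encoded it differently.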

Then there's the alignment step. Having text in Qom and text in Spanish isn't enough. The model needs to know which segment in one language corresponds to which segment in the other. The approach we took varied by source: the Bible was aligned at the verse level using a custom program, since the structure maps cleanly across versions. The Little Prince required first reorganizing both texts to match at the paragraph level, guided by illustration placement, before doing finer alignment. The UDHR was aligned manually from the start. After alignment comes filtering: removing pairs that are too short, too long, suspiciously identical, or just unlikely to be real translations.
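
The verse-level alignment and the filtering pass described above can be sketched in a few lines of Python. This is a minimal sketch, not the team's custom program: the function names, the verse-identifier format, and every threshold are hypothetical, and the length-ratio check stands in for the "unlikely to be real translations" test.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str, str]  # (verse id, Qom text, Spanish text)


def align_by_verse(qom: Dict[str, str], spa: Dict[str, str]) -> List[Pair]:
    """Pair up verses whose identifier appears in both versions."""
    return [(vid, qom[vid], spa[vid]) for vid in qom if vid in spa]


def keep_pair(
    src: str,
    tgt: str,
    min_chars: int = 3,      # hypothetical threshold
    max_chars: int = 500,    # hypothetical threshold
    max_ratio: float = 3.0,  # hypothetical threshold
) -> bool:
    """Return True if a sentence pair passes the basic sanity filters."""
    src, tgt = src.strip(), tgt.strip()
    # Too short or too long on either side.
    if not (min_chars <= len(src) <= max_chars):
        return False
    if not (min_chars <= len(tgt) <= max_chars):
        return False
    # Suspiciously identical: the "translation" is a copy of the source.
    if src.casefold() == tgt.casefold():
        return False
    # Very different lengths suggest a misalignment.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


def build_corpus(qom: Dict[str, str], spa: Dict[str, str]) -> List[Pair]:
    """Align two verse-indexed texts, then drop pairs that fail the filters."""
    return [p for p in align_by_verse(qom, spa) if keep_pair(p[1], p[2])]
```

Keying on a shared identifier only works for sources like the Bible, where both versions carry the same verse numbering. The Little Prince and the UDHR needed the manual steps the interview describes.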

On the model side, we fine-tuned a multilingual system called NLLB-200, using Guaraní as a proxy language since it shares typological features with Qom and the model already knows it. Compute has been a constant constraint. Translation is the foundation. The longer-term hope is to extend this work to speech: automatic transcription on one end, text-to-speech on the other.

How do tools like machine learning, natural language processing, or data modeling play a role in your work?

They are the engine of the project, but they only work because of the data modeling underneath. Modern multilingual models like NLLB or mBART have already learned representations that generalize across hundreds of languages. Fine-tuning them on a small Qom corpus lets us borrow that general competence and steer it toward our specific pair, which would be impossible from scratch with the data sizes we're working with.

We also rely on subword tokenization, which breaks words into smaller pieces. That matters a lot for a polysynthetic language where a single word can be morphologically dense. On the evaluation side, we use character-level metrics like chrF++ instead of just word-level ones like BLEU, because they capture partial matches and are fairer to morphologically rich languages where a tiny suffix change shouldn't be treated as a completely wrong answer.
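
To make the point about partial matches concrete, here is a simplified character n-gram F-score in Python, set beside a strict word-match score. It is an illustrative approximation of the idea behind chrF, not the chrF++ metric itself, which also uses word n-grams and is normally computed with an established library. The function names are made up for this sketch.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over n = 1..max_n, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # one string is shorter than n characters
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def word_exact_match(hyp: str, ref: str) -> float:
    """Share of reference words reproduced exactly, position by position."""
    h, r = hyp.split(), ref.split()
    return sum(a == b for a, b in zip(h, r)) / max(len(r), 1)
```

With the Spanish forms "caminaban" as the hypothesis and "caminaba" as the reference, `word_exact_match` gives no credit at all, while `simple_chrf` gives a score strictly between 0 and 1 because most character n-grams still match.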

Beyond the technical side, what does preserving the Qom language mean for the communities who speak it?

I am not part of the Qom community, and I don't think it's my place to speak for them. What I can share is what I've learned from the linguists and members involved in the project. Language isn't just a vehicle for communication. It carries history, ways of categorizing the world, oral tradition, songs, ceremony, and humor that often don't survive translation. When a language stops being spoken at home, all of that becomes harder to pass on. The communities are not waiting to be saved by technology. They've been teaching, writing, and recording for decades, often under conditions of real economic and political pressure. What they sometimes lack are the tools that speakers of other languages take for granted. Our work is a small step toward that.

Why do you think it’s important for technologists to engage in projects like this?

A lot of the AI conversation right now is about scale: bigger models, more data, more compute. That makes sense for the languages and use cases that are already well represented online, but it means only dominant languages keep getting better tools while everything else falls further behind. There are around seven thousand languages in the world, and the vast majority of them are essentially invisible to AI.

I think technologists have a responsibility to push back against that asymmetry. Not because every language needs its own ChatGPT, but because the people who speak smaller languages deserve the same access to useful tools as anyone else. There's also a self-interested argument: low-resource problems are where you actually learn the limits of your methods. It's easy to look good on a benchmark with a million parallel sentences. The harder, more honest work happens when you have two thousand and every decision counts.

How did your time at Minerva prepare you to take on a project like this?

Minerva is structured around the idea that you learn by doing hard things in unfamiliar contexts. That environment makes you comfortable with ambiguity, which is most of what research actually is. More concretely, the curriculum gave me the technical foundation, but what Minerva really trained was the habit of asking whether what you're building is the right thing to build, and for whom. That question is unavoidable in a project like this one. You're working with a community's language, with materials that took decades to create, and with collaborators who have far more domain expertise than you do, all of it coordinated across languages, since the project runs entirely in Spanish. Knowing how to listen, how to be useful without overstepping, and how to be honest about what you don't know: those are things Minerva pushed me to develop, even when I didn't realize it at the time.

In what ways do you hope your work contributes to the broader field of language preservation or technology for social good?

Two things. First, a baseline. Until now, there's been no published machine translation system or parallel corpus in a computationally usable format for Qom, which means anyone interested in this language had to start from zero. By releasing both, we're giving future researchers, educators, and community members a starting point they can build on, critique, or replace with something better. 

Second, a proof of concept. The bigger argument I want this project to make is that low-resource language preservation is no longer out of reach. The combination of multilingual models, careful data work, and committed collaboration between linguists and computer scientists means that even tiny corpora can produce useful tools. Qom is one language. The same approach works for many others. I hope this project shows other people they don't need massive resources to start.

Quick Facts

Name
Alex Korablev
Country
Russian Federation
Class
M27
