Your capstone focuses on helping preserve the Qom language, which is at risk of disappearing. What first drew you to this project, and why did it feel important to you personally?
Before Minerva, I studied linguistics, one of those interests I carried without quite knowing where it would lead. Getting into computer science made NLP (Natural Language Processing) the obvious next step. When I started looking for research opportunities in Argentina, I found Professor Viviana Cotik's work at the University of Buenos Aires on NLP for low-resource indigenous languages. I reached out, and a few weeks later, I was on the team.
Around 80,000 people in Argentina self-identify as Qom, an indigenous community of the Gran Chaco region in the country's north, but only an estimated 30,000 of them speak the language fluently. That is roughly two-thirds of the community no longer carrying the heritage language. The trend is downward: Qom is listed as endangered by UNESCO, has very limited reach in the school system, and almost no digital presence. There was no Qom on Google Translate, no AI tool that could process a sentence of it, and no NLP dataset that even acknowledged the language existed. So you have a community whose language is fading faster than the community itself, and the tools that might help haven't been built yet.
For readers who may not be familiar, can you briefly explain what makes Qom such a unique and complex language?
Qom belongs to the Guaycurú family, a small group of indigenous languages spoken in the Gran Chaco region of Argentina. It is what linguists call polysynthetic and agglutinative, meaning a single word can carry the meaning of an entire sentence in English. A verb might bundle the subject, the object, tense, evidentiality, and direction into one form. Word order also shifts depending on whether the verb is transitive or intransitive, which is unusual for speakers used to languages like Spanish or English.
What's tricky from a computer science perspective is that Qom was only recently written down. There are several dialect areas in the region, and the orthography varies between authors and communities. So the model has to learn a language whose written form is still being negotiated.
You’re applying computer science to language preservation. Can you walk us through what you’re actually building or researching in your capstone, as well as some of the biggest technical challenges you’ve faced?
The work breaks into two parts: building the first Qom–Spanish parallel corpus in a computationally usable format, which is a structured collection of sentences in both languages that machines can learn from, and training a neural translation model on top of it as a baseline that future work can improve on.
The hardest part has been data scarcity. We started with a few short stories and a handful of bilingual booklets, and we've been expanding from there, drawing on sources such as the Bible, the Universal Declaration of Human Rights, The Little Prince, and educational materials. Just getting clean text out of those sources is harder than it sounds. Most are scanned PDFs with broken encodings, and Qom uses characters like ỹ and ' (a glottal-stop apostrophe) that often don't survive a copy-paste. For the worst cases, we used a large language model to reconstruct the text from corrupted output, then corrected it manually.
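To make the encoding problem concrete, here is a minimal Python sketch of one cleanup step that could follow text extraction. It is not the project's actual pipeline. The function name `normalize_text`, the choice of U+0027 as the canonical apostrophe, and the sample strings (character sequences for illustration, not verified Qom words) are all assumptions.

```python
import unicodedata

# Apostrophe-like characters that scanned PDFs and copy-paste often produce.
# Which code point counts as canonical is a project decision; U+0027 is used
# here purely as an assumption for this sketch.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u00b4": "'",  # acute accent used as an apostrophe
}
_TRANSLATION = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_text(text: str) -> str:
    """Return text in NFC form with apostrophe variants unified."""
    # NFC composes "y" + U+0303 (combining tilde) into the single character U+1EF9.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_TRANSLATION)


# The two spellings below look identical on screen but compare unequal
# until both are normalized.
decomposed = "ay\u0303a"
precomposed = "a\u1ef9a"
print(decomposed == precomposed)                                  # False
print(normalize_text(decomposed) == normalize_text(precomposed))  # True
print(normalize_text("a\u2019a") == "a'a")                        # True
```

Normalizing before alignment and training means the same word is not counted as several different strings just because two sources encoded it differently.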
Then there's the parallelization step. Having text in Qom and text in Spanish isn't enough. The model needs to know which segment in one language corresponds to which segment in the other. The approach we took varied by source: the Bible was aligned at the verse level using a custom program, since the structure maps cleanly across versions. The Little Prince required first reorganizing both texts to match at the paragraph level, guided by illustration placement, before doing finer alignment. The UDHR was aligned manually from the start. After alignment comes filtering: removing pairs that are too short, too long, suspiciously identical, or just unlikely to be real translations.
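The alignment and filtering steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's custom program. The helper names `align_by_key` and `keep_pair`, the segment IDs, the placeholder sentences, and every threshold are hypothetical. A simple length ratio stands in for "unlikely to be real translations".

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def align_by_key(source: Dict[str, str], target: Dict[str, str]) -> List[Pair]:
    """Pair segments that share an identifier, e.g. a verse reference."""
    shared = sorted(source.keys() & target.keys())
    return [(source[key], target[key]) for key in shared]


def keep_pair(src: str, tgt: str, min_chars: int = 3,
              max_chars: int = 500, max_ratio: float = 3.0) -> bool:
    """Apply the four filters; every threshold is an illustrative assumption."""
    src, tgt = src.strip(), tgt.strip()
    shorter, longer = sorted((len(src), len(tgt)))
    if shorter == 0 or shorter < min_chars:   # too short
        return False
    if longer > max_chars:                    # too long
        return False
    if src.casefold() == tgt.casefold():      # suspiciously identical
        return False
    if longer / shorter > max_ratio:          # lengths too mismatched
        return False
    return True


# Placeholder segments keyed by hypothetical IDs; not real corpus data.
source_side = {"v1": "source segment one", "v2": "ok", "v3": "Amen"}
target_side = {"v1": "target segment number one",
               "v2": "a much longer target segment",
               "v3": "Amen",
               "v4": "no counterpart on the source side"}

aligned = align_by_key(source_side, target_side)
kept = [pair for pair in aligned if keep_pair(*pair)]
print(len(aligned))  # 3
print(kept)          # [('source segment one', 'target segment number one')]
```

Matching on shared keys only works for sources with a common reference structure, such as verse numbers. Texts like The Little Prince need the manual or paragraph-level work described above before any automatic pairing.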
On the model side, we fine-tuned a multilingual system called NLLB-200, using Guaraní as a proxy language since it shares typological features with Qom and the model already knows it. Compute has been a constant constraint. Translation is the foundation. The longer-term hope is to extend this work to speech: automatic transcription on one end, text-to-speech on the other.
How do tools like machine learning, natural language processing, or data modeling play a role in your work?
They are the engine of the project, but they only work because of the data modeling underneath. Modern multilingual models like NLLB or mBART have already learned representations that generalize across hundreds of languages. Fine-tuning them on a small Qom corpus lets us borrow that general competence and steer it toward our specific pair, which would be impossible from scratch with the data sizes we're working with.
We also rely on subword tokenization, which breaks words into smaller pieces. That matters a lot for a polysynthetic language where a single word can be morphologically dense. On the evaluation side, we use character n-gram metrics like chrF++ instead of just word-level ones like BLEU, because they capture partial matches and are fairer to morphologically rich languages where a tiny suffix change shouldn't be treated as a completely wrong answer.
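The difference between word-level and character-level scoring is easy to see in a toy example. The Python sketch below implements a simplified character n-gram F-score. It is not chrF++ itself, which also includes word n-grams, and it will not reproduce the numbers of any official implementation. The function names and the Spanish example words are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every character n-gram of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_fscore(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score, averaged over orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is shorter than n characters
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        b2 = beta ** 2  # beta > 1 weights recall more heavily than precision
        scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def word_exact_match(hypothesis: str, reference: str) -> float:
    """Share of reference words that the hypothesis reproduces exactly."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    total = sum(ref.values())
    return sum((hyp & ref).values()) / total if total else 0.0


# One verb that differs only in its ending (Spanish, for illustration).
hypothesis, reference = "traducimos", "traduciremos"
print(word_exact_match(hypothesis, reference))  # 0.0
print(char_fscore(hypothesis, reference) > word_exact_match(hypothesis, reference))  # True
```

The word-level score treats the near miss as a total failure, while the character-level score gives credit for the shared stem. That is the behavior that matters when a single word packs in most of a sentence's meaning.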
Beyond the technical side, what does preserving the Qom language mean for the communities who speak it?
I am not part of the Qom community, and I don't think it's my place to speak for them. What I can share is what I've learned from the linguists and members involved in the project. Language isn't just a vehicle for communication. It carries history, ways of categorizing the world, oral tradition, songs, ceremony, and humor that often don't survive translation. When a language stops being spoken at home, all of that becomes harder to pass on. The communities are not waiting to be saved by technology. They've been teaching, writing, and recording for decades, often under conditions of real economic and political pressure. What they sometimes lack are the tools that speakers of other languages take for granted. Our work is a small step toward that.
Why do you think it’s important for technologists to engage in projects like this?
A lot of the AI conversation right now is about scale: bigger models, more data, more compute. That makes sense for the languages and use cases that are already well represented online, but it means only dominant languages keep getting better tools while everything else falls further behind. There are around seven thousand languages in the world, and the vast majority of them are essentially invisible to AI.
I think technologists have a responsibility to push back against that asymmetry. Not because every language needs its own ChatGPT, but because the people who speak smaller languages deserve the same access to useful tools as anyone else. There's also a self-interested argument: low-resource problems are where you actually learn the limits of your methods. It's easy to look good on a benchmark with a million parallel sentences. The harder, more honest work happens when you have two thousand and every decision counts.
How did your time at Minerva prepare you to take on a project like this?
Minerva is structured around the idea that you learn by doing hard things in unfamiliar contexts. That environment makes you comfortable with ambiguity, which is most of what research actually is. More concretely, the curriculum gave me the technical foundation, but what Minerva really trained was the habit of asking whether what you're building is the right thing to build, and for whom. That question is unavoidable in a project like this one. You're working with a community's language, with materials that took decades to create, and with collaborators who have far more domain expertise than you do, all of it coordinated across languages, since the project runs entirely in Spanish. Knowing how to listen, how to be useful without overstepping, and how to be honest about what you don't know: those are things Minerva pushed me to develop, even when I didn't realize it at the time.
In what ways do you hope your work contributes to the broader field of language preservation or technology for social good?
Two things. First, a baseline. Until now, there's been no published machine translation system or parallel corpus in a computationally usable format for Qom, which means anyone interested in this language had to start from zero. By releasing both, we're giving future researchers, educators, and community members a starting point they can build on, critique, or replace with something better.
Second, a proof of concept. The bigger argument I want this project to make is that low-resource language preservation is no longer out of reach. The combination of multilingual models, careful data work, and committed collaboration between linguists and computer scientists means that even tiny corpora can produce useful tools. Qom is one language. The same approach works for many others. I hope this project shows other people they don't need massive resources to start.