Machine Learning Translation for the Puyuma language

Overview

Language is more than a tool of communication. It is a vessel of memory and worldview. Machine Learning Translation for the Puyuma Language explores how machine learning can help preserve an endangered Formosan language spoken in southeastern Taiwan. Built on a dataset of only two thousand Mandarin–Puyuma sentence pairs, the project investigates whether a modern multilingual model can learn to translate across linguistic boundaries with limited data.

Rather than treating translation as a mechanical task, this work frames it as an act of cultural mediation: how can algorithms respect the morphology, rhythm, and semantic subtlety of an oral tradition seldom represented in digital form? The experiment became a dialogue between data scarcity and linguistic resilience, between the precision of computation and the fragility of human heritage.

Design

The system was conceived as a minimal yet rigorous framework for low-resource translation research. Every stage, from preprocessing to evaluation, was guided by the question of how to extend the expressive range of a model when data itself is scarce. The foundation is mT5, a multilingual sequence-to-sequence model fine-tuned with PyTorch in Python.
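As a minimal sketch (not the project's actual code), a single fine-tuning step with mT5 through the Hugging Face transformers library might look like this; the model size ("google/mt5-small"), the task prefix, and the placeholder sentences are illustrative assumptions.

```python
# Minimal sketch of one mT5 fine-tuning step with Hugging Face transformers.
# The model size, task prefix, and example strings are placeholders.
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

source = "translate Mandarin to Puyuma: <Mandarin sentence>"  # hypothetical prefix
target = "<Puyuma sentence>"                                  # placeholder

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(text_target=target, return_tensors="pt", truncation=True).input_ids

# Standard sequence-to-sequence objective: cross-entropy on the target
# tokens, backpropagated through the full encoder-decoder.
loss = model(**inputs, labels=labels).loss
loss.backward()
```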

The process began with careful data curation: sentences were normalized, aligned, and stripped of inconsistencies in punctuation and orthography. A lexicon-aware augmentation module was then introduced to expand training coverage while preserving grammatical integrity, allowing the model to infer unseen word forms and phrase boundaries through controlled substitution.
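To make the controlled-substitution idea concrete, here is a minimal sketch; the substitution format (aligned Mandarin–Puyuma word pairs grouped by lexical category) is an assumption about how such a lexicon could be used, not the project's actual implementation.

```python
# Sketch of lexicon-aware augmentation via controlled substitution.

def augment(zh: str, puy: str, substitutions):
    """substitutions: iterable of ((zh_old, puy_old), (zh_new, puy_new))
    tuples drawn from the same lexical category of a bilingual lexicon.
    Yields new (Mandarin, Puyuma) pairs with the aligned words swapped."""
    for (zh_old, puy_old), (zh_new, puy_new) in substitutions:
        # Substitute only when the aligned word appears on both sides,
        # so the parallel structure of the pair is preserved.
        if zh_old in zh and puy_old in puy:
            yield zh.replace(zh_old, zh_new), puy.replace(puy_old, puy_new)
```

Because each substitution swaps a whole aligned pair at once, a generated sentence stays parallel and, as long as the two entries share a lexical category, grammatically plausible.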

Training employed gradient accumulation, early stopping, and checkpoint averaging to stabilize optimization on so few samples. Evaluation combined BLEU and chrF metrics in both directions (Mandarin to Puyuma and Puyuma to Mandarin), ensuring that performance reflected both lexical accuracy and character-level fluency.
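The two less conventional pieces, checkpoint averaging and corpus-level scoring, might be sketched as follows; the checkpoint paths and the hypothesis/reference lists are placeholders, and sacrebleu is assumed as the metrics library.

```python
# Sketch of checkpoint averaging and BLEU/chrF scoring.
import torch
import sacrebleu
from transformers import MT5ForConditionalGeneration

def average_checkpoints(paths: list[str]) -> MT5ForConditionalGeneration:
    """Average the weights of several saved checkpoints to smooth out
    optimization noise from training on a very small dataset."""
    states = [MT5ForConditionalGeneration.from_pretrained(p).state_dict()
              for p in paths]
    averaged = {key: torch.stack([s[key].float() for s in states]).mean(dim=0)
                for key in states[0]}
    model = MT5ForConditionalGeneration.from_pretrained(paths[-1])
    model.load_state_dict(averaged)
    return model

def score(hypotheses: list[str], references: list[str]) -> tuple[float, float]:
    """Corpus-level BLEU and chrF for one translation direction."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    return bleu, chrf
```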

This design sought not to maximize scale but to find elegance in constraint, a study of how structure, when finely tuned, can illuminate meaning beyond abundance.

Challenges

Finding a model that could learn meaningfully from such limited data proved difficult. Because Puyuma exists only in romanized form, tokenization errors were frequent, and the absence of a fixed writing system made sentence boundaries ambiguous.

To address this, we incorporated custom preprocessing that normalized spellings, standardized punctuation, and guided segmentation using a bilingual lexicon. These adjustments allowed the model to interpret words and phrases more consistently, revealing that careful linguistic alignment can sometimes achieve what scale alone cannot.
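A simplified sketch of these adjustments is shown below, assuming a greedy longest-match strategy for lexicon-guided segmentation; the variant map and lexicon contents are hypothetical stand-ins for the project's actual resources.

```python
# Sketch of spelling normalization and lexicon-guided segmentation for
# romanized Puyuma text. The variant map and lexicon are hypothetical.
import re

VARIANTS = {"’": "'", "“": '"', "”": '"'}  # unify punctuation variants

def normalize(text: str) -> str:
    """Lowercase, unify punctuation variants, and collapse whitespace."""
    for variant, canonical in VARIANTS.items():
        text = text.replace(variant, canonical)
    return re.sub(r"\s+", " ", text).strip().lower()

def segment(token: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match split of a token into known lexicon entries;
    the token is kept whole if it cannot be fully segmented."""
    parts, i = [], 0
    while i < len(token):
        match = next((token[i:j] for j in range(len(token), i, -1)
                      if token[i:j] in lexicon), None)
        if match is None:
            return [token]  # fall back: keep the original token intact
        parts.append(match)
        i += len(match)
    return parts
```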

Impact

The project demonstrated that meaningful progress in low-resource translation does not depend on massive data or computation, but on linguistic care and methodological clarity. The system achieved a combined BLEU and chrF score of 11.38, over six times higher than the competition baseline. This result showed that thoughtful preprocessing and lexicon-based augmentation can unlock latent potential within small datasets. Beyond metrics, the work suggests a path for revitalizing underrepresented languages through accessible and reproducible machine learning methods that respect the integrity of the communities they serve.