Alta-tokenizer is a fast and accurate tokenizer for converting Kinyarwanda text into tokens. It is based on the Byte Pair Encoding (BPE) algorithm and can both encode and decode Kinyarwanda text. It can also tokenize other languages, such as English or French, but with a lower compression rate, since the tokenizer was trained on Kinyarwanda text only. The package also provides a training function for building your own custom tokenizer; this is covered in the section on training your own tokenizer, and you can use it to train a tokenizer on a dataset for a different language.
Tokenization
Tokenization is the first step in preparing raw text for processing by foundation models, especially in NLP tasks. Human language is inherently unstructured: sentences vary in length, contain punctuation, and follow complex grammatical rules. Machines, however, require structured, numerical input to process and understand language. Tokenization bridges this gap by converting unstructured text into a structured, numerical format that models can work with.
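As a minimal illustration of this gap between text and numbers (not the package's own API), the simplest possible "tokenization" maps each byte of the text to its integer value; byte-level BPE starts from exactly such tokens and then merges frequent pairs:

```python
# Minimal illustration: machines need numbers, so the simplest
# "tokenization" maps each byte of the text to its integer value.
# Byte-level BPE starts from these tokens and merges frequent pairs.
text = "Muraho!"  # a Kinyarwanda greeting, used here as a sample
tokens = list(text.encode("utf-8"))
print(tokens)                          # one integer per byte
print(bytes(tokens).decode("utf-8"))   # lossless round-trip back to text
```

The round-trip back to the original string shows why encoding and decoding can be exact at this level; BPE merges only shorten the token sequence without losing that property.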
Metrics
The metrics used to measure the quality of this tokenizer are the compression rate and the ability to encode and decode text. Compression rate is the ratio of the number of characters in the original text to the number of tokens in the encoded text.
For example, take the sentence: "Nagiye gusura abanyeshuri."
The sentence has 26 characters. Suppose it is tokenized into the following tokens: [78, 1760, 32, 5256, 32, 1845, 46]. The total number of tokens is 7, so the compression rate is 3.714X (where X indicates that the number is approximate).
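The calculation above can be sketched in a few lines of plain Python (this is just the ratio from the definition, not a function exposed by the package):

```python
def compression_rate(text, tokens):
    # Ratio of original characters to encoded tokens.
    return len(text) / len(tokens)

sentence = "Nagiye gusura abanyeshuri."
token_ids = [78, 1760, 32, 5256, 32, 1845, 46]  # the example encoding above
rate = compression_rate(sentence, token_ids)
print(f"{rate:.3f}X")  # 26 characters / 7 tokens
```

A higher rate means each token covers more characters on average, which is why a tokenizer trained only on Kinyarwanda compresses English or French text less effectively.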

Custom Training
An easy-to-use Python package with the option to train your own tokenizer. By supplying a custom dataset and using the provided training function, users can retrain the tokenizer on a custom dataset to adapt it to a specific use case.
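To make the training step concrete, here is a self-contained sketch of what BPE training does in principle: repeatedly merge the most frequent adjacent token pair into a new token. This is an illustration of the algorithm, written from scratch; the names (`train_bpe`, `num_merges`) are assumptions for this example, not the package's actual training API:

```python
from collections import Counter

def train_bpe(text, num_merges):
    # Learn merge rules from raw text: repeatedly replace the most
    # frequent adjacent token pair with a new token id.
    tokens = list(text.encode("utf-8"))
    merges = {}            # (id, id) -> new token id
    next_id = 256          # byte values occupy 0..255
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:      # no pair repeats, nothing left to merge
            break
        merges[pair] = next_id
        # Apply the new merge rule across the whole token stream.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return merges, tokens

corpus = "abana bagiye gusura abanyeshuri"  # tiny sample corpus
merges, encoded = train_bpe(corpus, num_merges=5)
print(len(corpus.encode("utf-8")), "bytes ->", len(encoded), "tokens")
```

Training on a larger Kinyarwanda corpus simply discovers merges that match the language's frequent character sequences, which is what yields the higher compression rate on Kinyarwanda text.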

Language Optimization
Built on the Byte-Pair Encoding algorithm, it efficiently processes text while preserving its meaning. Because it is designed specifically for Kinyarwanda, it leverages the language's unique linguistic patterns to outperform generic tokenizers on Kinyarwanda text.

Easy Integration
With its Python package and clear documentation, alta-tokenizer can be integrated into existing NLP pipelines, making it a versatile tool for developers and researchers.

Language Flexibility
Although optimized for Kinyarwanda, alta-tokenizer can be applied to other languages, making it ideal for multilingual projects or cases where you want a custom solution that fits your specific language data.