Harmonizing Multilingual Product Data Using Machine Learning: A Case Study of the Rwanda Revenue Authority

Kamana, Raymond

Kamana, Raymond (2025) Harmonizing Multilingual Product Data Using Machine Learning: A Case Study of the Rwanda Revenue Authority. International Journal of Innovative Science and Research Technology, 10 (8): 25aug118. pp. 79-84. ISSN 2456-2165

Abstract

This study addresses inconsistent, multilingual product names in the Rwanda Revenue Authority’s (RRA) Electronic Billing Machine (EBM) system. Because product names are entered manually, spelling variants and translations of the same item make tax data hard to track and analyze. The study applies Natural Language Processing (NLP) and Machine Learning (ML) to clean and group similar product names. A total of 4.1 million records from 2020 to 2022 were translated into English and processed. Each name was encoded as a MiniLM sentence embedding, reduced in dimensionality with UMAP, and clustered with HDBSCAN. The resulting harmonized product groups make it easier to detect possible fraud, spot underpricing, and improve the accuracy of tax reporting, which helps RRA improve data quality and tax compliance.
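The clean-and-group idea can be illustrated with a small sketch. It is not the study's pipeline: it deliberately swaps the MiniLM, UMAP, and HDBSCAN stages for character-level similarity from Python's standard `difflib` module so that it runs without extra dependencies. The function names `normalize` and `group_names`, the 0.8 threshold, and the sample product names are all hypothetical choices made for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def group_names(names, threshold=0.8):
    """Greedy single-pass grouping (a stand-in for embedding + clustering).

    Each name joins the first group whose representative string is at
    least `threshold` similar; otherwise it starts a new group.
    """
    groups = []  # list of (normalized representative, [original names])
    for raw in names:
        key = normalize(raw)
        for rep, members in groups:
            if SequenceMatcher(None, rep, key).ratio() >= threshold:
                members.append(raw)
                break
        else:
            groups.append((key, [raw]))
    return [members for _, members in groups]


# Hypothetical manually entered product names, not taken from the EBM data.
sample = ["Sugar 1kg", "sugar 1 KG", "Suger 1kg", "Cooking Oil 5L", "cooking oil 5 l"]
groups = group_names(sample)
for members in groups:
    print(members)
```

Character similarity only catches spelling and formatting variants. Grouping names that differ in wording but share a meaning is what sentence embeddings are meant to handle, which is presumably why the study uses them after translating the records into English.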
