Rare Words in Neural Machine Translation
In the paper “Continuous Learning in Neural Machine Translation using Bilingual Dictionaries”, we analysed the ability of NMT systems to translate rare terms and presented techniques to improve their ability to translate morphological variants. The proposed method is based on creating a new test set that concentrates on these terms, using a different split of the training and test data.
Ready-to-use data sets
In the paper, we split the following data sets. The splits from the paper can be downloaded here:
- TED English-German
- Europarl English-German
- Europarl English-Czech
Own data sets
You can split your own data sets with the same methods using the code in the GitHub repository.
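To give a rough idea of what a split that concentrates on rare terms can look like, here is a minimal sketch. It is not the code from the repository and not necessarily the procedure used in the paper: the function name `rare_word_split`, the `max_count` threshold, and the whitespace tokenisation are assumptions made for illustration, and the sketch ignores morphology entirely (no lemmatisation).

```python
from collections import Counter


def rare_word_split(pairs, max_count=1):
    """Put sentence pairs that contain a rare source word into the test set.

    pairs: list of (source_sentence, target_sentence) strings.
    max_count: a word is treated as rare if it occurs at most this often
               on the source side of the whole corpus (hypothetical threshold).
    """
    # Count lowercased, whitespace-separated source tokens over the corpus.
    counts = Counter(tok for src, _ in pairs for tok in src.lower().split())
    rare = {tok for tok, c in counts.items() if c <= max_count}

    train, test = [], []
    for src, tgt in pairs:
        # A pair goes to the test set as soon as one of its tokens is rare.
        if any(tok in rare for tok in src.lower().split()):
            test.append((src, tgt))
        else:
            train.append((src, tgt))
    return train, test, rare


# Toy corpus, invented for this example.
corpus = [
    ("the cat sleeps", "die Katze schläft"),
    ("the dog sleeps", "der Hund schläft"),
    ("the cat eats", "die Katze frisst"),
    ("the dog eats", "der Hund frisst"),
    ("the axolotl sleeps", "der Axolotl schläft"),
]
train, test, rare = rare_word_split(corpus, max_count=1)
print(len(train), len(test), sorted(rare))
```

In this toy corpus only "axolotl" occurs once, so the one sentence containing it ends up in the test set and the other four stay in the training set. For real data, use the repository code, which implements the splits described in the paper.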
Publication
All details of the methods can be found in the publication: Niehues, J. (2021). Continuous Learning in Neural Machine Translation using Bilingual Dictionaries. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021). Kyiv, Ukraine.