MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset
Fomicheva, Marina; Sun, Shuo; Fonseca, Erick; Zerva, Chrysoula; Blain, Frédéric; Chaudhary, Vishrav; Guzmán, Francisco; Lopatina, Nina; Martins, André F. T.; Specia, Lucia
Abstract
We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains annotations for eleven language pairs, including both high- and low-resource languages. Specifically, it is annotated for translation quality with human labels for up to 10,000 translations per language pair, in the following formats: sentence-level direct assessments and post-editing effort, and word-level binary good/bad labels. Beyond the quality scores, each source-translation sentence pair is accompanied by the corresponding post-edited sentence, the title of the article from which the sentences were extracted, and information on the neural MT models used to translate the text. We provide a thorough description of the data collection and annotation process, as well as an analysis of the annotation distributions for each language pair. We also report the performance of baseline systems trained on the MLQE-PE dataset. The dataset is freely available and has already been used for several WMT shared tasks.
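The sentence-level annotations described above are typically distributed as tab-separated files. As a rough illustration only — the column names below (`index`, `original`, `translation`, `mean`, `z_mean`) are an assumption for this sketch, not taken from the paper, so check the released files for the actual layout — a minimal parser for a direct-assessment (DA) TSV might look like:

```python
import csv
import io

# Hypothetical sample in the assumed column layout: sentence index,
# source sentence, MT output, mean DA score, z-normalized DA score.
SAMPLE_TSV = (
    "index\toriginal\ttranslation\tmean\tz_mean\n"
    "0\tDas ist ein Test.\tThis is a test.\t92.4\t0.71\n"
    "1\tGuten Morgen!\tGood morning!\t88.0\t0.35\n"
)

def load_da_annotations(tsv_text):
    """Parse sentence-level DA annotations from TSV text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "index": int(row["index"]),
            "source": row["original"],
            "translation": row["translation"],
            "da_mean": float(row["mean"]),
            "da_z": float(row["z_mean"]),
        })
    return rows

records = load_da_annotations(SAMPLE_TSV)
print(records[0]["da_mean"])  # 92.4
```

The same pattern would extend to the post-editing-effort and word-level good/bad label files, with the field names adjusted to whatever columns the released data actually uses.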
Description
Funding Information: Marina Fomicheva, Frédéric Blain and Lucia Specia were supported by funding from the Bergamot project (EU H2020 Grant No. 825303). André Martins, Chrysoula Zerva and Erick Fonseca were funded by the P2020 programs Unbabel4EU (contract 042671) and MAIA (contract 045909), by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020. We would like to thank Marina Sánchez-Torrón and Camila Pohlmann for monitoring the post-editing process. We also thank Mark Fishel from the University of Tartu for providing the Estonian reference translations. Publisher Copyright: © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
Date
2022
Publisher
European Language Resources Association (ELRA)
Keywords
direct assessments, evaluation, Machine Translation, post-edits, quality estimation
Citation
Fomicheva, M, Sun, S, Fonseca, E, Zerva, C, Blain, F, Chaudhary, V, Guzmán, F, Lopatina, N, Martins, A F T & Specia, L 2022, MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset. in N Calzolari, F Bechet, P Blache, K Choukri, C Cieri, T Declerck, S Goggi, H Isahara, B Maegaard, J Mariani, H Mazo, J Odijk & S Piperidis (eds), 2022 Language Resources and Evaluation Conference, LREC 2022, European Language Resources Association (ELRA), pp. 4963-4974, 13th International Conference on Language Resources and Evaluation, LREC 2022, Marseille, France, 20/06/22. <http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.530.pdf>
