Elephant: Sequence Labeling for Word and Sentence Segmentation

Evang, Kilian
Basile, Valerio
Chrupala, Grzegorz
Bos, Johan
Abstract
Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
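To make the idea of character-level sequence labeling concrete, the following minimal sketch shows how one label per character can be decoded into sentences and tokens. It assumes a four-label scheme (S = first character of a sentence, T = first character of a token, I = inside a token, O = outside any token); this record does not specify the label set, and the function name `decode` is hypothetical. The labels here are written by hand, whereas in a supervised setting they would be predicted by a trained model.

```python
def decode(text, labels):
    """Turn one label per character into a list of sentences (lists of tokens).

    Assumed label scheme (an illustration, not taken from this record):
      S = first character of a sentence (and of its first token)
      T = first character of a token
      I = inside a token
      O = outside any token (e.g. whitespace)
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")

    sentences = []  # finished and in-progress sentences
    token = []      # characters of the token currently being built

    def flush():
        # Close the current token, if any, and add it to the current sentence.
        if token:
            sentences[-1].append("".join(token))
            token.clear()

    for ch, lab in zip(text, labels):
        if lab == "S":
            flush()
            sentences.append([])      # open a new sentence
            token.append(ch)
        elif lab == "T":
            flush()
            if not sentences:         # tolerate a missing initial S
                sentences.append([])
            token.append(ch)
        elif lab == "I":
            if not sentences:
                sentences.append([])
            token.append(ch)          # extend the current token
        elif lab == "O":
            flush()                   # character belongs to no token
        else:
            raise ValueError(f"unknown label: {lab!r}")

    flush()
    return sentences


if __name__ == "__main__":
    print(decode("Hi there. Bye!", "SIOTIIIITOSIIT"))
```

With this framing, both token boundaries and sentence boundaries fall out of a single labeling pass, which is why no language-specific rules are needed at decoding time.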
Date
2013
Publisher
Association for Computational Linguistics (ACL)
Citation
Evang, K, Basile, V, Chrupala, G & Bos, J 2013, Elephant: Sequence Labeling for Word and Sentence Segmentation. in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), Seattle, Washington, USA, pp. 1422-1426, EMNLP 2013: Conference on Empirical Methods in Natural Language Processing, Seattle, United States, 18/10/13. <http://www.aclweb.org/anthology/D13-1146>