Normalizing tweets with edit scripts and recurrent neural embeddings
Chrupala, Grzegorz
Abstract
Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools, and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources.
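To make the notion of an edit script concrete, the following is a minimal sketch of how a character-level edit script between a noisy token and its canonical form could be derived and re-applied. It is an illustration only: the abstract does not specify how edit operations are encoded or learned, so the tuple representation, the function names `edit_script` and `apply_script`, and the use of Python's `difflib` for alignment are all assumptions, not the paper's method.

```python
from difflib import SequenceMatcher


def edit_script(source, target):
    """Return (start, end, replacement) edits that turn source into target.

    Each edit replaces source[start:end] with `replacement`. Spans that are
    already identical in both strings are omitted. This encoding is a
    hypothetical one chosen for illustration.
    """
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    return [
        (i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_script(source, script):
    """Apply edits from right to left so earlier indices stay valid."""
    out = source
    for start, end, replacement in reversed(script):
        out = out[:start] + replacement + out[end:]
    return out


if __name__ == "__main__":
    noisy, canonical = "2moro", "tomorrow"
    script = edit_script(noisy, canonical)
    print(script)
    print(apply_script(noisy, script))
```

In a learned normalizer, edits of this kind would be predicted from features of the input rather than computed from a known target; the sketch only shows what such a script looks like and that applying it reproduces the canonical form.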
Date
2014
Publisher
Association for Computational Linguistics (ACL)
Citation
Chrupala, G 2014, Normalizing tweets with edit scripts and recurrent neural embeddings. in K Toutanova & H Wu (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Short Papers edn, vol. 2, Association for Computational Linguistics (ACL), Baltimore, Maryland, pp. 680-686, The 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, United States, 22/06/14.
