
Learning English with Peppa Pig

Nikolaus, Mitja
Alishahi, Afra
Chrupała, Grzegorz
Abstract
Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of ecological validity is the training data, typically consisting of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual data. In the real world the coupling between the linguistic and the visual modality is loose, and often confounded by correlations with non-semantic aspects of the speech signal. Here we address this shortcoming by using a dataset based on the children’s cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.
Description
Funding Information: We would like to thank Nikos Papasarantopoulos and Shay B. Cohen for creating the Peppa Pig annotations and for sharing them with us. Part of the NWO/E-Science Center grant number 027.018.G03 was used to purchase the video data on DVD. Thanks to Bertrand Higy for taking care of this purchase, as well as for sharing his ideas with us in the initial stages of this work. We also thank Abdellah Fourtassi and three anonymous reviewers for their useful feedback. Funding Information: This work was supported by grants ANR-16-CONV-0002 (ILCB), AMX-19-IET-009 (Archimedes Institute), and the Excellence Initiative of Aix-Marseille University (A*MIDEX). Publisher Copyright: © 2022 Association for Computational Linguistics.
Date
2022-09-01
Keywords
Visual Modalities, Spoken Modalities, Speech, Visual Data
Citation
Nikolaus, M, Alishahi, A & Chrupała, G 2022, 'Learning English with Peppa Pig', Transactions of the Association for Computational Linguistics, vol. 10, pp. 922-936. https://doi.org/10.1162/tacl_a_00498
License
info:eu-repo/semantics/openAccess