Our Dialogue System Sucks - but Luckily we are at the Top of the Leaderboard!: A Discussion on Current Practices in NLP Evaluation

Braggaar, Anouck
He, Linwei
Wit, Jan De
Abstract
Leaderboards are currently a common way to evaluate natural language processing (NLP) systems, and large language models in particular. In this paper we argue that we should step away from leaderboards and follow a more inclusive approach in both developing and evaluating models. The focus of evaluation should be on the complete context in which the system operates. To accomplish this, researchers should take note of developments in multiple scientific fields (from NLP to communication science).
Description
Publisher Copyright: © 2024 Owner/Author.
Date
2024-07-08
Keywords
Evaluation, NLP, leaderboards, multidisciplinary research
Citation
Braggaar, A, He, L & Wit, J D 2024, Our Dialogue System Sucks - but Luckily we are at the Top of the Leaderboard!: A Discussion on Current Practices in NLP Evaluation. in Proceedings of the 6th Conference on ACM Conversational User Interfaces, CUI 2024., 48. https://doi.org/10.1145/3640794.3665889