Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Belz, Anya; Thomson, Craig; Reiter, Ehud; Abercrombie, Gavin; Alonso-Moral, Jose M.; Arvan, Mohammad; Cheung, Jackie; Cieliebak, Mark; Clark, Elizabeth; Deemter, Kees van; Dinkar, Tanvi; Dušek, Ondřej; Eger, Steffen; Fang, Qixiang; Gatt, Albert; Gkatzia, Dimitra; González-Corbelle, Javier; Hovy, Dirk; Hürlimann, Manuela; Ito, Takumi; Kelleher, John D.; Klubicka, Filip; Lai, Huiyuan; Lee, Chris van der; Miltenburg, Emiel van; Li, Yiru; Mahamood, Saad; Mieskes, Margot; Nissim, Malvina; Parde, Natalie; Plátek, Ondřej; Rieser, Verena; Romero, Pablo Mosteiro; Tetreault, Joel; Toral, Antonio; Wan, Xiaojun; Wanner, Leo; Watson, Lewis; Yang, Diyi

(2023) The Fourth Workshop on Insights from Negative Results in NLP, pp. 1-10

(Part of book)

Abstract

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to ...

Download/Full Text

Open Access version via Utrecht University Repository

Publisher version

Publisher: Association for Computational Linguistics

(Peer reviewed)
