Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain

Herrewijnen, Elize; Craandijk, Dennis F. W.

Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain

DSpace/Manakin Repository

Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain

Herrewijnen, Elize; Craandijk, Dennis F. W.

(2023)
Proceedings of the 6th Workshop on Automated Semantic Analysis of Information in Legal Text co-located with the 19th International Conference on Artificial Intelligence and Law (ICAIL 2023), Braga, Portugal, 23rd September, 2023, volume 3441, pp. 13 - 18
CEUR Workshop Proceedings, volume 3441, pp. 13 - 18
19th International Conference on Artificial Intelligence and Law, volume 3441, pp. 13 - 18

(Part of book)

Abstract

Creating meaningful text embeddings using BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain) it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and ... read more

Download/Full Text

Open Access version via Utrecht University Repository

Publisher version

Keywords: Transformers, BERT, Language Models, Legal Text Classification, ECtHR dataset, Text Embeddings

ISSN: 1613-0073

Publisher: CEUR-WS.org

(Peer reviewed)

See more statistics about this item