Workshop Program and Accepted Papers

9:00-9:10 Opening remarks

 

9:10-10:10 Morning session: New Datasets

Chair: Mariana Romanyshyn

9:10-9:25 A Contemporary News Corpus of Ukrainian: Compilation, Annotation, Publication

Stefan Fischer, Kateryna Haidarzhyi, Jörg Knappen, Olha Polishchuk, Yuliya Stodolinska and Elke Teich

9:25-9:40 Introducing the Djinni Recruitment Dataset: A Corpus of Anonymized CVs and Job Postings

Nazarii Drushchak and Mariana Romanyshyn

9:40-9:55 Creating Parallel Corpora for the Ukrainian Language: a German-Ukrainian Parallel Corpus

Maria Shvedova and Arsenii Lukashevskyi

9:55-10:10 Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian

Dmytro Chaplynskyi and Mariana Romanyshyn

 

10:10-10:30 Morning session: Invited Lightning Talks

10:10-10:20 Introducing CLARIN K-center for Ukrainian Language Research: Cooperation and Development

Olha Kanishcheva

10:20-10:30 PAWUK: Polish Automatic Web corpus of UKrainian

Witold Kieraś, Łukasz Kobyliński, Dorota Komosińska, Bartłomiej Nitoń, Michał Rudolf, Maria Shvedova and Aleksandra Zwierzchowska

 

10:30-11:00 Morning coffee break

 

11:00-12:10 Morning session: New Directions

Chair: Oleksii Ignatenko

11:00-11:20 Instant Messaging Platforms News Multi-Task Classification for Stance, Sentiment, and Discrimination Detection

Taras Ustyianovych and Denilson Barbosa

11:20-11:35 Setting up the Data Printer with Improved English to Ukrainian Machine Translation

Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus and Volodymyr Kyrylov

11:35-11:55 Automated Extraction of Hypo-Hypernym Relations for the Ukrainian WordNet

Nataliia Romanyshyn, Dmytro Chaplynskyi and Mariana Romanyshyn

11:55-12:10 Ukrainian Visual Word Sense Disambiguation Benchmark

Yurii Laba, Yaryna Mohytych, Ivanna Rohulia, Halyna Kyryleyza, Hanna Dydyk-Meush, Oles Dobosevych and Rostyslav Hryniv

12:10-13:00 Invited talk: Towards Equitable and Culturally Adapted Multilingual Dialog Systems

Ivan Vulić, University of Cambridge, UK

 

13:00-14:00 Lunch break

 

14:00-16:00 Afternoon session: LLMs for Ukrainian

Chair: Mariana Romanyshyn

14:00-14:15 The UNLP 2024 Shared Task on Fine-Tuning Large Language Models for Ukrainian

Oleksiy Syvokon, Mariana Romanyshyn and Roman Kyslyi

14:15-14:35 Fine-tuning and Retrieval Augmented Generation for Question Answering Using Affordable Large Language Models

Tiberiu Boros, Radu Chivereanu, Stefan Dumitrescu and Octavian Purcaru

14:35-14:55 From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation

Artur Kiulian, Anton Polishko, Mykola Khandoga, Oryna Chubych, Jack Connor, Raghav Ravishankar and Adarsh Shirawalmath

14:55-15:15 Spivavtor: An Instruction Tuned Ukrainian Text Editing Model

Aman Saini, Artem Chernodub, Vipul Raheja and Vivek Kulkarni

15:15-15:35 Eval-UA-tion 1.0: Benchmark for Evaluating Ukrainian (Large) Language Models

Serhii Hamotskyi, Anna-Izabella Levbarg and Christian Hänig

15:35-15:55 LiBERTa: Advancing Ukrainian Language Modeling through Pre-training from Scratch

Mykola Haltiuk and Aleksander Smywiński-Pohla

 

16:00-16:30 Afternoon coffee break

 

16:30-17:00 Afternoon session: LLMs for Ukrainian

Chair: Oleksii Ignatenko

16:30-16:45 Entity Embellishment Mitigation in LLMs Output with Noisy Dataset for Alignment

Svitlana Galeshchuk

16:45-17:00 Language-Specific Pruning for Efficient Reduction of Large Language Models

Maksym Shamrai

17:00-17:50 Invited talk: BRUK Team’s Resources for Ukrainian Corpus Creation

Vasyl Starko, Ukrainian Catholic University, Ukraine
Andriy Rysin, Independent researcher, USA

 

17:50-18:00 Closing words

Keynote Speakers

 

Ivan Vulić, University of Cambridge, UK

 

Topic: Towards Equitable and Culturally Adapted Multilingual Dialog Systems

 

The ability to converse intelligently with humans has been one of the fundamental objectives in the pursuit of artificial intelligence, and dialog systems are one of the prime user-facing applications of NLP technology. Task-oriented dialog systems in particular have been designed to assist or replace human operators in focused problems and domains with well-defined goals. However, designing and bootstrapping systems that can cover multiple languages and/or domains without performance degradation, or even just collecting data for training and evaluation, is notoriously difficult. In this talk, I will first point to the main gaps and challenges of modern task-oriented dialog systems, including the sheer lack of multilingual datasets and the lack of cultural adaptation, which result in culturally biased, inequitable, English-centric, and non-adaptive systems. I will then describe new data collection protocols that enable the creation of high-quality, culturally adapted, multi-parallel multilingual data. The new datasets unlock the potential for unprecedented quantitative and qualitative evaluation and analysis of multilingual performance disparities. Finally, I will delve deeper into current performance and (cultural) diversity gaps and disparities in multilingual multi-domain dialog systems, and provide a quick overview of the latest algorithmic developments aiming to reduce the detected gaps.

Vasyl Starko, Ukrainian Catholic University, Ukraine

Andriy Rysin, Independent researcher, USA

Topic: BRUK Team’s Resources for Ukrainian Corpus Creation

 

The talk will focus on the key resources and tools developed by the BRUK team for the automatic processing of Ukrainian texts, especially for building Ukrainian corpora. The resources include:
* BRUK (Ukrainian Brown Corpus, a projected one-million-word POS gold standard)
* VESUM (A Large Electronic Dictionary of Ukrainian, over 420,000 lemmas and counting, for POS tagging)
* USL (Ukrainian Semantic Lexicon for semantic tagging).
The tools come in the form of the NLP_UK suite for Ukrainian text tokenization, lemmatization, POS tagging, and cleaning. The application of NLP_UK to build multiple iterations of the GRAC corpus will be discussed.