MIRACL 🌍🙌🌏

Multilingual Information Retrieval Across a Continuum of Languages


Overview

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources, spanning both what we typically characterize as high-resource languages and low-resource languages. The focus of this challenge is monolingual retrieval, where the queries and the corpus are in the same language (e.g., Swahili queries searching for Swahili documents). Our goal is to spur research that will improve retrieval models across a broad continuum of languages, and thus improve information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved.

With the advent and dominance of deep learning and approaches based on neural networks (particularly transformer-based models) in information retrieval and beyond, the importance of large datasets as drivers of progress is well understood. For retrieval models in English, the MS MARCO datasets have had a transformative impact in advancing the field. To stimulate similar advances in multilingual retrieval, we have built the MIRACL 🌍🙌🌏 dataset, comprising human-annotated passage-level relevance judgments on Wikipedia for 18 languages, totaling over 600k training pairs. Along with the dataset, the WSDM 2023 Cup provides a common evaluation methodology, a venue for a competition-style event with prizes, and a leaderboard. To get participants off the ground quickly, our team will provide easy-to-reproduce baselines. There will be two tracks in this challenge: "known languages" and "surprise languages". In the first, we will provide data well in advance of the submission deadline. In the second, the identity of the languages (along with data) will only be made available at the last moment. The "surprise languages" track emphasizes the rapid development of language-specific capabilities.


News!

  • September 20, 2022: Initial announcement.
  • October 19, 2022: Released the training and development sets for the known languages.


Dataset Details

The topics and judgments in the training and development sets are now released, as well as the corpora. Check out our GitHub repository and paper for more details!

The following table lists the number of topics (queries) and relevance judgments (relevance labels) for each (language, split) combination, along with the number of passages and Wikipedia articles in each corpus.

| Lang | Train # Q | Train # J | Dev # Q | Dev # J | Test-A # Q | Test-A # J | Test-B # Q | Test-B # J | # Passages | # Articles |
|---|---|---|---|---|---|---|---|---|---|---|
| Arabic (ar) | 3,495 | 25,382 | 2,896 | 29,197 | 936 | 9,325 | 1,405 | 14,036 | 2,061,414 | 656,982 |
| Bengali (bn) | 1,631 | 16,754 | 411 | 4,206 | 102 | 1,037 | 1,130 | 11,286 | 297,265 | 63,762 |
| English (en) | 2,863 | 29,416 | 799 | 8,350 | 734 | 5,617 | 1,790 | 18,241 | 32,893,221 | 5,758,285 |
| Spanish (es) | 2,162 | 21,531 | 648 | 6,443 | 0 | 0 | 1,515 | 15,074 | 10,373,953 | 1,669,181 |
| Persian (fa) | 2,107 | 21,844 | 632 | 6,571 | 0 | 0 | 1,476 | 15,313 | 2,207,172 | 857,827 |
| Finnish (fi) | 2,897 | 20,350 | 1,271 | 12,008 | 1,060 | 10,586 | 711 | 7,100 | 1,883,509 | 447,815 |
| French (fr) | 1,143 | 11,426 | 343 | 3,429 | 0 | 0 | 801 | 8,008 | 14,636,953 | 2,325,608 |
| Hindi (hi) | 1,169 | 11,668 | 350 | 3,494 | 0 | 0 | 819 | 8,169 | 506,264 | 148,107 |
| Indonesian (id) | 4,071 | 41,358 | 960 | 9,668 | 731 | 7,430 | 611 | 6,098 | 1,446,315 | 446,330 |
| Japanese (ja) | 3,477 | 34,387 | 860 | 8,354 | 650 | 6,922 | 1,141 | 11,410 | 6,953,614 | 1,133,444 |
| Korean (ko) | 868 | 12,767 | 213 | 3,057 | 263 | 3,855 | 1,417 | 14,161 | 1,486,752 | 437,373 |
| Russian (ru) | 4,683 | 33,921 | 1,252 | 13,100 | 911 | 8,777 | 718 | 7,174 | 9,543,918 | 1,476,045 |
| Swahili (sw) | 1,901 | 9,359 | 482 | 5,092 | 638 | 6,615 | 465 | 4,620 | 131,924 | 47,793 |
| Telugu (te) | 3,452 | 18,608 | 828 | 1,606 | 594 | 5,948 | 793 | 7,920 | 518,079 | 66,353 |
| Thai (th) | 2,972 | 21,293 | 733 | 7,573 | 992 | 10,432 | 650 | 6,493 | 542,166 | 128,179 |
| Chinese (zh) | 1,312 | 13,113 | 393 | 3,928 | 0 | 0 | 920 | 9,196 | 4,934,368 | 1,246,389 |

Descriptive statistics for MIRACL 🌍🙌🌏. Lang denotes the language and its ISO 639-1 code; # Q denotes the number of queries; # J denotes the total number of relevance judgments (including both positive and negative judgments); # Passages denotes the number of passages in each language; and # Articles denotes the number of Wikipedia articles in the same language.


Challenge and Leaderboard

Our challenge follows a standard retrieval setup: test queries will be released (at different points in time for the two tracks), and participants will submit top-k results for each of the queries. These results will be primarily evaluated in terms of effectiveness (i.e., relevance of the responses). We will build a leaderboard that tracks the effectiveness of submissions. More details to follow!
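
As a rough illustration of what "top-k results for each of the queries" could look like in practice, here is a minimal Python sketch that ranks scored passages per query and formats them as run lines. The submission format has not been specified here, so the six-column TREC-style layout is an assumption, and the query ids, passage ids, scores, and run tag are all made up for the example.

```python
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (passage_id, score) pairs, best first."""
    # Sort by descending score; break ties by passage id so output is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


def format_run(all_scores: Dict[str, Dict[str, float]], k: int, tag: str) -> List[str]:
    """Build one line per (query, rank) in a six-column TREC-style layout.

    The layout "query_id Q0 passage_id rank score tag" is an assumption;
    the challenge may require a different submission format.
    """
    lines = []
    for query_id in sorted(all_scores):
        ranked = top_k(all_scores[query_id], k)
        for rank, (passage_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {passage_id} {rank} {score:.4f} {tag}")
    return lines


# Hypothetical retrieval scores for two Swahili queries (ids are illustrative only).
example_scores = {
    "sw-q1": {"p3": 0.2, "p1": 0.9, "p2": 0.5},
    "sw-q2": {"p7": 1.5, "p9": 1.5},
}

run_lines = format_run(example_scores, k=2, tag="my-baseline")
for line in run_lines:
    print(line)
```

In a real submission the scores would come from a retrieval model run over the corpus of the query's own language, and k would be whatever cutoff the organizers announce.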


Schedule

| Date | Event |
|---|---|
| Wed. Oct 19, 2022 | Data release for “known” languages |
| Wed. Nov 1, 2022 | Leaderboard open |
| Wed. Jan 18, 2023 | Data release for “surprise” languages |
| Wed. Jan 25, 2023 | Final submission deadline |
| Wed. Feb 1, 2023 | Final winners announced |


Organizers

  • Xinyu Crystina Zhang, University of Waterloo
  • Nandan Thakur, University of Waterloo
  • Odunayo Ogundepo, University of Waterloo
  • David Alfonso-Hermelo, Huawei Noah's Ark Lab
  • Ehsan Kamalloo, Huawei Noah's Ark Lab
  • Xiaoguang Li, Huawei Noah's Ark Lab
  • Jimmy Lin, University of Waterloo
  • Mehdi Rezagholizadeh, Huawei Noah's Ark Lab
  • Qun Liu, Huawei Noah's Ark Lab