MIRACL 🌍🙌🌏
Multilingual Information Retrieval Across a Continuum of Languages
Overview
MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources, including what we typically characterize as high-resource as well as low-resource languages. The focus of this challenge is monolingual retrieval, where the queries and the corpus are in the same language (e.g., Swahili queries searching for Swahili documents). Our goal is to spur research that will improve retrieval models across a broad continuum of languages, and thus improve information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved.
With the advent and dominance of deep learning and approaches based on neural networks (particularly transformer-based models) in information retrieval and beyond, the importance of large datasets as drivers of progress is well understood. For retrieval models in English, the MS MARCO datasets have had a transformative impact in advancing the field. To stimulate similar advances in multilingual retrieval, we have built the MIRACL 🌍🙌🌏 dataset, comprising human-annotated passage-level relevance judgments on Wikipedia for 18 languages, totaling over 600k training pairs. Along with the dataset, WSDM 2023 Cup provides a common evaluation methodology, a venue for a competition-style event with prizes, and a leaderboard. To get participants off the ground quickly, our team will provide easy-to-reproduce baselines. There will be two tracks in this challenge: "known languages" and "surprise languages". In the first, we will provide data well in advance of the submission deadline. In the second, the identity of the languages (along with data) will only be made available at the last moment. The "surprise languages" task emphasizes the rapid development of language-specific capabilities.
Connect with us!
- Mailing list
- Slack Workspace
- Twitter
News!
- September 20, 2022: Initial announcement.
- October 19, 2022: Released the training and development sets for known languages.
- January 5, 2023: Released surprise languages.
- January 5, 2023: Released the Test-B set for all languages.
Dataset Details
The topics and judgments for the training and development sets are now released, as well as the corpora. Check out our GitHub repository and paper for more details!
The following table provides the number of topics (= queries) and relevance judgments (= relevance labels) for each (language, split) combination, along with the number of passages and Wikipedia articles in each corpus.
Lang | Train # Q | Train # J | Dev # Q | Dev # J | Test-A # Q | Test-A # J | Test-B # Q | Test-B # J | # Passages | # Articles |
---|---|---|---|---|---|---|---|---|---|---|
Arabic (ar) | 3,495 | 25,382 | 2,896 | 29,197 | 936 | 9,325 | 1,405 | 14,036 | 2,061,414 | 656,982 |
Bengali (bn) | 1,631 | 16,754 | 411 | 4,206 | 102 | 1,037 | 1,130 | 11,286 | 297,265 | 63,762 |
English (en) | 2,863 | 29,416 | 799 | 8,350 | 734 | 5,617 | 1,790 | 18,241 | 32,893,221 | 5,758,285 |
Spanish (es) | 2,162 | 21,531 | 648 | 6,443 | 0 | 0 | 1,515 | 15,074 | 10,373,953 | 1,669,181 |
Persian (fa) | 2,107 | 21,844 | 632 | 6,571 | 0 | 0 | 1,476 | 15,313 | 2,207,172 | 857,827 |
Finnish (fi) | 2,897 | 20,350 | 1,271 | 12,008 | 1,060 | 10,586 | 711 | 7,100 | 1,883,509 | 447,815 |
French (fr) | 1,143 | 11,426 | 343 | 3,429 | 0 | 0 | 801 | 8,008 | 14,636,953 | 2,325,608 |
Hindi (hi) | 1,169 | 11,668 | 350 | 3,494 | 0 | 0 | 819 | 8,169 | 506,264 | 148,107 |
Indonesian (id) | 4,071 | 41,358 | 960 | 9,668 | 731 | 7,430 | 611 | 6,098 | 1,446,315 | 446,330 |
Japanese (ja) | 3,477 | 34,387 | 860 | 8,354 | 650 | 6,922 | 1,141 | 11,410 | 6,953,614 | 1,133,444 |
Korean (ko) | 868 | 12,767 | 213 | 3,057 | 263 | 3,855 | 1,417 | 14,161 | 1,486,752 | 437,373 |
Russian (ru) | 4,683 | 33,921 | 1,252 | 13,100 | 911 | 8,777 | 718 | 7,174 | 9,543,918 | 1,476,045 |
Swahili (sw) | 1,901 | 9,359 | 482 | 5,092 | 638 | 6,615 | 465 | 4,620 | 131,924 | 47,793 |
Telugu (te) | 3,452 | 18,608 | 828 | 1,606 | 594 | 5,948 | 793 | 7,920 | 518,079 | 66,353 |
Thai (th) | 2,972 | 21,293 | 733 | 7,573 | 992 | 10,432 | 650 | 6,493 | 542,166 | 128,179 |
Chinese (zh) | 1,312 | 13,113 | 393 | 3,928 | 0 | 0 | 920 | 9,196 | 4,934,368 | 1,246,389 |
German (de) | 0 | 0 | 305 | 3,144 | 0 | 0 | 712 | 7,317 | 15,866,222 | 2,651,352 |
Yoruba (yo) | 0 | 0 | 119 | 1,188 | 0 | 0 | 288 | 2,880 | 49,043 | 33,094 |
Descriptive statistics for 🌍🙌🌏 MIRACL. Lang denotes the language and its ISO 639-1 code; # Q denotes the number of queries; # J denotes the total number of relevance judgments (including both positive and negative judgments); # Passages denotes the number of passages in each language; and # Articles denotes the number of Wikipedia articles in the same language.
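As a quick way to read the table, the ratio # J / # Q gives the average number of judged passages per query for a given (language, split) pair. The sketch below is only illustrative arithmetic over a few rows copied from the table; the `STATS` dictionary and the `judgments_per_query` helper are hypothetical names and not part of any MIRACL tooling.

```python
# (language, split) -> (number of queries, number of judgments),
# copied from a few rows of the table above.
STATS = {
    ("ar", "train"): (3495, 25382),
    ("sw", "train"): (1901, 9359),
    ("yo", "dev"): (119, 1188),
    ("es", "test-a"): (0, 0),  # split with no queries
}


def judgments_per_query(num_queries: int, num_judgments: int) -> float:
    """Average number of relevance judgments per query (0.0 for empty splits)."""
    if num_queries == 0:
        return 0.0
    return num_judgments / num_queries


if __name__ == "__main__":
    for (lang, split), (num_q, num_j) in STATS.items():
        avg = judgments_per_query(num_q, num_j)
        print(f"{lang} {split}: {avg:.2f} judgments per query")
```

Splits that list 0 queries in the table are handled explicitly so the ratio is defined for every row.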
Challenge and Leaderboard
Our challenge follows a standard retrieval setup: test queries will be released (at different points in time for the two tasks), and participants will submit top-k results for each of the queries. These results will be primarily evaluated in terms of effectiveness (i.e., relevance of the responses). We will build a leaderboard that tracks the effectiveness of submissions. More details to follow!
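The setup above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the official submission or scoring code: the six-column TREC-style run format and the nDCG@k metric are assumptions (the text above does not specify either), and `top_k_run`, `ndcg_at_k`, the query and passage IDs, and the run tag are all hypothetical.

```python
import math
from typing import Dict, List


def top_k_run(scores: Dict[str, Dict[str, float]], k: int, tag: str = "my-run") -> List[str]:
    """Rank passages per query by descending score and emit TREC-style run lines.

    ASSUMPTION: the 'qid Q0 docid rank score tag' layout is a common convention,
    not a format confirmed by the challenge description.
    """
    lines = []
    for qid in sorted(scores):
        # Sort by score (descending); break ties by passage ID for determinism.
        ranked = sorted(scores[qid].items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docid} {rank} {score:.4f} {tag}")
    return lines


def ndcg_at_k(ranked_docids: List[str], qrels: Dict[str, int], k: int) -> float:
    """nDCG@k with linear gain; unjudged passages count as non-relevant.

    ASSUMPTION: nDCG is used here only as an example effectiveness metric.
    """
    dcg = sum(qrels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_docids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical model scores and judgments for a single query.
    scores = {"q1": {"d1": 0.2, "d2": 0.9, "d3": 0.5}}
    qrels = {"q1": {"d1": 1, "d2": 1, "d3": 0}}

    run = top_k_run(scores, k=2)
    print("\n".join(run))

    ranked = [line.split()[2] for line in run]
    print(f"nDCG@2 = {ndcg_at_k(ranked, qrels['q1'], k=2):.4f}")
```

Keeping the ranking step separate from the metric makes it easy to swap in a different cutoff or effectiveness measure once the official evaluation details are announced.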
Schedule
Date | Event |
---|---|
Wed. Oct 19, 2022 | Data release for "known" languages |
Tue. Nov 1, 2022 | Leaderboard open |
Thu. Jan 5, 2023 | Data release for "surprise" languages |
| | Final submission deadline |
| | Final winners announced |
Organizers