MIRACL 🌍🙌🌏

Multilingual Information Retrieval Across a Continuum of Languages

Overview

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources, including what we typically characterize as high-resource as well as low-resource languages. The focus of this challenge is monolingual retrieval, where the queries and the corpus are in the same language (e.g., Swahili queries searching for Swahili documents). Our goal is to spur research that will improve retrieval models across a broad continuum of languages, and thus improve information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved.

With the advent and dominance of deep learning and approaches based on neural networks (particularly transformer-based models) in information retrieval and beyond, the importance of large datasets as drivers of progress is well understood. For retrieval models in English, the MS MARCO datasets have had a transformative impact in advancing the field. To stimulate similar advances in multilingual retrieval, we have built the MIRACL 🌍🙌🌏 dataset, comprising human-annotated passage-level relevance judgments on Wikipedia for 18 languages, totaling over 600k training pairs. Along with the dataset, WSDM 2023 Cup provides a common evaluation methodology, a venue for a competition-style event with prizes, and a leaderboard. To get participants off the ground quickly, our team will provide easy-to-reproduce baselines. There will be two tracks in this challenge: "known languages" and "surprise languages". In the first, we will provide data well in advance of the submission deadline. In the second, the identity of the languages (along with data) will only be made available at the last moment. The "surprise languages" track emphasizes the rapid development of language-specific capabilities.

News!

September 20, 2022: Initial announcement.
October 19, 2022: Released the training and development sets for the known languages.
January 5, 2023: Released the surprise languages.
January 5, 2023: Released the test-b set for all languages.

Dataset Details

The topics and judgments in the training and development sets are now released, as are the corpora. Check out our GitHub repository and paper for more details!

The following table provides the number of topics (= queries) and relevance judgments (= relevance labels) for each (language, split) combination, along with the number of passages and Wikipedia articles in each corpus.

Lang	Train		Dev		Test-A		Test-B		# Passages	# Articles
Lang	# Q	# J	# Q	# J	# Q	# J	# Q	# J	# Passages	# Articles
Arabic (ar)	3,495	25,382	2,896	29,197	936	9,325	1,405	14,036	2,061,414	656,982
Bengali (bn)	1,631	16,754	411	4,206	102	1,037	1,130	11,286	297,265	63,762
English (en)	2,863	29,416	799	8,350	734	5,617	1,790	18,241	32,893,221	5,758,285
Spanish (es)	2,162	21,531	648	6,443	0	0	1,515	15,074	10,373,953	1,669,181
Persian (fa)	2,107	21,844	632	6,571	0	0	1,476	15,313	2,207,172	857,827
Finnish (fi)	2,897	20,350	1,271	12,008	1,060	10,586	711	7,100	1,883,509	447,815
French (fr)	1,143	11,426	343	3,429	0	0	801	8,008	14,636,953	2,325,608
Hindi (hi)	1,169	11,668	350	3,494	0	0	819	8,169	506,264	148,107
Indonesian (id)	4,071	41,358	960	9,668	731	7,430	611	6,098	1,446,315	446,330
Japanese (ja)	3,477	34,387	860	8,354	650	6,922	1,141	11,410	6,953,614	1,133,444
Korean (ko)	868	12,767	213	3,057	263	3,855	1,417	14,161	1,486,752	437,373
Russian (ru)	4,683	33,921	1,252	13,100	911	8,777	718	7,174	9,543,918	1,476,045
Swahili (sw)	1,901	9,359	482	5,092	638	6,615	465	4,620	131,924	47,793
Telugu (te)	3,452	18,608	828	1,606	594	5,948	793	7,920	518,079	66,353
Thai (th)	2,972	21,293	733	7,573	992	10,432	650	6,493	542,166	128,179
Chinese (zh)	1,312	13,113	393	3,928	0	0	920	9,196	4,934,368	1,246,389
German (de)	0	0	305	3,144	0	0	712	7,317	15,866,222	2,651,352
Yoruba (yo)	0	0	119	1,188	0	0	288	2,880	49,043	33,094

Descriptive statistics for MIRACL 🌍🙌🌏. Lang denotes the language and its ISO 639-1 code; # Q denotes the number of queries; # J denotes the total number of relevance judgments (including both positive and negative judgments); # Passages denotes the number of passages in each language; and # Articles denotes the number of Wikipedia articles in the same language.
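
As a quick illustration of how to read the table, the sketch below parses three of its rows (copied verbatim, tab-separated) and derives the average number of judgments per query for a split. The helper names `parse_row` and `judgments_per_query` are hypothetical and not part of any MIRACL tooling.

```python
# Column order follows the table: Train, Dev, Test-A, Test-B (each with # Q and # J),
# followed by # Passages and # Articles.
SPLITS = ("train", "dev", "test-a", "test-b")

# Three rows copied from the table above (tab-separated).
ROWS = """\
Arabic (ar)\t3,495\t25,382\t2,896\t29,197\t936\t9,325\t1,405\t14,036\t2,061,414\t656,982
Swahili (sw)\t1,901\t9,359\t482\t5,092\t638\t6,615\t465\t4,620\t131,924\t47,793
Yoruba (yo)\t0\t0\t119\t1,188\t0\t0\t288\t2,880\t49,043\t33,094
"""


def parse_row(line):
    """Turn one tab-separated table row into (language, stats dict)."""
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    # Strip thousands separators before converting to int.
    nums = [int(f.replace(",", "")) for f in fields[1:]]
    stats = {
        split: {"queries": nums[2 * i], "judgments": nums[2 * i + 1]}
        for i, split in enumerate(SPLITS)
    }
    stats["passages"] = nums[8]
    stats["articles"] = nums[9]
    return name, stats


def judgments_per_query(stats, split):
    """Average judgments per query, or None when the split has no queries."""
    queries = stats[split]["queries"]
    if queries == 0:
        return None
    return stats[split]["judgments"] / queries


table = dict(parse_row(line) for line in ROWS.splitlines())
for lang, stats in table.items():
    print(lang, judgments_per_query(stats, "dev"))
```

Splits with zero queries (for example, the training split of a language that has only development and test data) return `None` rather than raising a division error.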

Challenge and Leaderboard

Our challenge follows a standard retrieval setup: test queries will be released (at different points in time for the two tracks), and participants will submit top-k results for each of the queries. These results will be primarily evaluated in terms of effectiveness (i.e., relevance of the responses). We will build a leaderboard that tracks the effectiveness of submissions. More details to follow!
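
The "top-k results for each query" step can be sketched as follows. This is only a minimal sketch under stated assumptions: the submission file format and the value of k are not specified here, so the six-column TREC-style run line, the default `k=100`, the run tag, and the helper names `top_k` and `format_run` are all illustrative. The query and passage IDs are made up.

```python
import heapq


def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    # Sorting key: descending score, then doc_id to make ties deterministic.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))


def format_run(results, k=100, tag="my-run"):
    """Format per-query scores as TREC-style lines (an assumed format):
    query_id Q0 doc_id rank score tag
    """
    lines = []
    for qid in sorted(results):
        for rank, (doc_id, score) in enumerate(top_k(results[qid], k), start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score:.4f} {tag}")
    return lines


# Hypothetical retrieval scores: query ID -> {passage ID -> score}.
results = {
    "q1": {"p1": 0.2, "p2": 0.9, "p3": 0.5},
    "q2": {"p4": 1.0},
}
run_lines = format_run(results, k=2)
print("\n".join(run_lines))
```

Because the challenge is monolingual, a run like this would be produced separately for each language, with queries matched only against the corpus in the same language.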

Schedule

Date	Event
Wed. Oct 19, 2022	Data release for “known” languages
Tue. Nov 1, 2022	Leaderboard open
Thu. Jan 5, 2023	Data release for “surprise” languages
~~Thu. Jan 12, 2023~~ Mon. Jan 16, 2023	Final submission deadline
~~Mon. Jan 16, 2023~~ Tue. Jan 17, 2023	Final winners announced