Masader is the largest public catalogue of Arabic NLP datasets, comprising 200 datasets annotated with 25 attributes. The authors also developed a metadata annotation strategy that could be extended to other languages. This work was carried out as part of the BigScience Data Sourcing efforts.

Data Catalogue: https://arbml.github.io/masader/

Paper and contributors: "Masader: Metadata Sourcing for Arabic Text and Speech Data Resources" (Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani)
