Masader is the largest public catalogue of Arabic NLP datasets, comprising 200 datasets, each annotated with 25 attributes. The authors also develop a metadata annotation strategy that could be extended to other languages. This work was developed as part of the BigScience Data Sourcing efforts.
Data Catalogue: https://arbml.github.io/masader/
Paper and contributors: "Masader: Metadata Sourcing for Arabic Text and Speech Data Resources" (Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani)