As part of his NLP seminar at the Technion – Israel Institute of Technology, Yonatan Belinkov involved his students in BigScience. Over the course of the seminar, a dozen students interacted with chairs and working group members and contributed datasets, code, issues, pull requests and more.
Amit Alfassy added 40 new datasets to the Data Catalogue. The main goal of the Data Catalogue was to document and collect data sources for the BigScience training dataset, gathering a wide variety of resources that represent different kinds of language use: different regions, different contexts and different audiences.
Data Tooling
Adi Simhi and Efrat Levkovizh worked on a near-duplicate detection project. It treats the identification of near-duplicate sentences as a classification problem, and it contributes newly generated datasets along with a codebase for generating them.
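To make the near-duplicate task concrete, here is a minimal sketch of one common baseline: Jaccard similarity over character trigrams with a fixed threshold. This is not the method or codebase from the students' project; the function names, the trigram size and the 0.7 threshold are all assumptions chosen for illustration.

```python
from typing import Set


def shingles(sentence: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized sentence."""
    text = " ".join(sentence.lower().split())
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(s1: str, s2: str, threshold: float = 0.7) -> bool:
    """Classify a sentence pair as near-duplicate (illustrative threshold, not tuned)."""
    return jaccard(shingles(s1), shingles(s2)) >= threshold


if __name__ == "__main__":
    print(is_near_duplicate("The cat sat on the mat.", "The cat sat on the mat!"))
    print(is_near_duplicate("The cat sat on the mat.", "Completely different text here"))
```

A fixed threshold like this is only a baseline. A learned classifier, which is how the project frames the task, would need labelled sentence pairs, and that is the kind of data a dataset-generation codebase supplies.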
Omer Antverg added the ANLI dataset to the BigScience Evaluation repo, where it will be considered for use in evaluating BigScience's main language model.
Shaked Brody contributed a set of code-related prompts to PromptSource, the BigScience Prompt Engineering Working Group's toolkit for creating, sharing and using natural language prompts.
Dataset: great_code
Prompts:
Dataset: openai_humaneval
Prompts:
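For readers unfamiliar with prompt templates, the following is a minimal sketch of the idea and not one of the contributed prompts. PromptSource templates are written in Jinja and use `|||` to separate the model input from the target. This sketch imitates that convention with plain Python string formatting so that it needs no dependencies. The template wording, the `apply_template` helper and the example record are hypothetical; only the field names `prompt` and `canonical_solution` follow the openai_humaneval dataset.

```python
from typing import Dict, Tuple

# Hypothetical template: text before "|||" becomes the input, text after it the target.
TEMPLATE = (
    "Complete the following Python function.\n\n{prompt}"
    " ||| "
    "{canonical_solution}"
)


def apply_template(template: str, example: Dict[str, str]) -> Tuple[str, str]:
    """Render a template against one dataset example and return (input, target)."""
    input_part, target_part = template.split("|||", 1)
    # Strip only the template's own padding so indentation inside the fields survives.
    input_text = input_part.strip().format(**example)
    target_text = target_part.strip().format(**example)
    return input_text, target_text


if __name__ == "__main__":
    # Made-up record in the shape of an openai_humaneval example.
    example = {
        "prompt": 'def add(a, b):\n    """Return a + b."""\n',
        "canonical_solution": "    return a + b\n",
    }
    model_input, target = apply_template(TEMPLATE, example)
    print(model_input)
    print("--- target ---")
    print(target)
```

Splitting the template before filling in the fields keeps a literal `|||` inside a code sample from being mistaken for the separator.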