As part of his NLP seminar at the Technion – Israel Institute of Technology, Yonatan Belinkov involved his students in BigScience. Over the course of the seminar, a dozen students interacted with chairs and working group members and contributed datasets, code, issues, pull requests and more.

Data Sourcing

Amit Alfassy added 40 new datasets to the Data Catalogue. The main goal of the Data Catalogue is to document and collect data sources for the BigScience training dataset, gathering a wide variety of resources that represent different kinds of language use: different regions, different contexts and different audiences.

Data Tooling

Adi Simhi and Efrat Levkovizh worked on a near-duplication project that addresses the problem of classifying near-duplicate sentences. They generated new datasets for the task and provided a codebase for dataset generation.
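
To make the idea of near-duplicate sentences concrete, here is a minimal sketch of one common baseline: Jaccard similarity over word bigrams with a fixed threshold. This is not the students' method, which the post does not describe. The function names, the bigram size and the 0.4 threshold are all assumptions chosen for illustration.

```python
def shingles(sentence: str, n: int = 2) -> set:
    """Return the set of word n-grams (shingles) in a lowercased sentence."""
    tokens = sentence.lower().split()
    if len(tokens) < n:
        # Short sentences: fall back to a single shingle holding all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(s1: str, s2: str, threshold: float = 0.4, n: int = 2) -> bool:
    """Label a sentence pair as near-duplicate if shingle overlap is high enough.

    The threshold is a hypothetical value, not one taken from the project.
    """
    return jaccard(shingles(s1, n), shingles(s2, n)) >= threshold


if __name__ == "__main__":
    print(is_near_duplicate("the cat sat on the mat", "the cat sat on a mat"))
    print(is_near_duplicate("the cat sat on the mat", "stock prices fell sharply today"))
```

A learned classifier, such as the one the project's datasets are meant to support, would replace the fixed threshold with a model trained on labeled pairs.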

Evaluation

Omer Antverg added the ANLI dataset to the BigScience Evaluation repo. The dataset will be considered as part of the evaluation of the main BigScience language model.

Prompt Engineering

Code-related prompts

Shaked Brody contributed a set of code-related prompts to PromptSource, a toolkit for creating, sharing and using natural language prompts that was developed by the BigScience Prompt Engineering Working Group.

Dataset: great_code

Prompts:

  1. bug detection: Given a function, predict whether there is a bug in the code.
  2. fix buggy line: Given a function and a buggy line, fix the buggy line.
  3. function name generation: Given a function body, generate the function name.
  4. identifier prediction no choices: Given code with a masked identifier, generate the identifier (without answer choices).
  5. identifier prediction with choices: Given code with a masked identifier, predict the identifier from a list of choices.
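
As a rough illustration of what such a prompt does, the sketch below turns one example into an input/target text pair for the bug detection task. It uses plain Python string formatting rather than PromptSource's own template format. The template wording, the `render_bug_detection` helper and the `function` and `has_bug` field names are hypothetical and do not reflect the actual great_code schema or the contributed prompts.

```python
from typing import Dict, Tuple

# Hypothetical template wording; the contributed prompts may be phrased differently.
BUG_DETECTION_TEMPLATE = (
    "Here is a function:\n\n{function}\n\n"
    "Is there a bug in this code? Answer Yes or No."
)


def render_bug_detection(example: Dict) -> Tuple[str, str]:
    """Map one example to (prompt text, target text).

    Assumes hypothetical fields: 'function' (source code as a string)
    and 'has_bug' (a boolean label).
    """
    prompt = BUG_DETECTION_TEMPLATE.format(function=example["function"])
    target = "Yes" if example["has_bug"] else "No"
    return prompt, target


# A hand-written example record, not taken from the dataset.
example = {
    "function": "def add(a, b):\n    return a - b",
    "has_bug": True,
}

if __name__ == "__main__":
    prompt, target = render_bug_detection(example)
    print(prompt)
    print("Target:", target)
```

The other prompts in the list follow the same pattern: each selects some fields of an example as the input text and another field as the target text.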

Dataset: openai_humaneval

Prompts:

  1. function body generation: Given a function signature and docstring, generate the function body.
  2. test_x return value generation: Given a function body and a call, predict the return value of test x.