Intro

One of the typical constructions developed by data engineers and data scientists for data processing is the data pipeline. A pipeline consists of simple steps connected to each other, with data from the dataset passed along the connection direction. The purpose of the functions inside the pipeline may vary widely: from restoring missing values, to splitting the data into train, test, and validation samples, to training complex neural-network-based models. The design of such pipelines can be automated using advanced language models, given a description of the steps in plain English.
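The chaining of steps described above can be sketched in plain Python, without any particular framework. The step functions and the toy dataset below are illustrative assumptions, not part of the project's actual code:

```python
# A minimal pipeline sketch: each step is a function that takes a dataset
# and returns a transformed dataset; steps are applied in order.
from functools import reduce
import random

def impute_missing(rows, fill=0.0):
    """Restore missing values by replacing None with a fill constant."""
    return [[fill if v is None else v for v in row] for row in rows]

def train_test_split(rows, test_ratio=0.25, seed=42):
    """Shuffle the rows and split them into train and test samples."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def run_pipeline(data, steps):
    """Pass the dataset through each step along the connection direction."""
    return reduce(lambda d, step: step(d), steps, data)

# Toy dataset with missing values (illustrative).
data = [[1.0, None], [2.0, 3.0], [None, 4.0], [5.0, 6.0]]
train, test = run_pipeline(data, [impute_missing, train_test_split])
```

In practice each such step would carry the plain-English description that a language model could use to generate or select the corresponding snippet.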

Project Goal

The goal of the project is to collect data that covers the typical steps of data processing pipelines. Each step item includes a description in English, the corresponding snippet in Python, and some high-level tagging information that helps identify similar tasks. Such a collection should be extensible and accessible via an API.
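A step item as described above could be represented along these lines. The field names and the example content are hypothetical, not the project's actual schema:

```python
# Hypothetical record layout for one pipeline-step item: an English
# description, a Python snippet, and high-level tags for finding
# similar tasks. Serializing to JSON mimics what an API could return.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class PipelineStep:
    description: str  # plain-English description of the step
    code: str         # corresponding Python snippet
    tags: list = field(default_factory=list)  # high-level tagging info

# Illustrative example item.
step = PipelineStep(
    description="Fill missing values with the column mean",
    code="df = df.fillna(df.mean())",
    tags=["data_cleaning", "imputation"],
)
payload = json.dumps(asdict(step))  # JSON payload for an API response
```

New step items would extend the collection simply by adding records of the same shape.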

Old info

Data Corpus

Code4ML: a Large-scale Dataset of annotated Machine Learning Code

2022-2023 project team

Taxonomy class to Python Translator

RL for best solution choice

Pipeline generation


Additional links