Dataset Information

Machine learning code snippets semantic classification.


ABSTRACT: Program code has recently become a valuable data source for training various data science models, from code classification to controlled code synthesis. Annotating code snippets plays an essential role in such tasks. This article presents a novel approach that leverages CodeBERT, a transformer-based model, to automatically classify code snippets extracted from Code4ML. Code4ML is a comprehensive machine learning code corpus compiled from Kaggle, a data science competition platform. The corpus includes code snippets and information about the respective kernels and competitions, but it is limited in the quality of the tagged data, which is ~0.2%. Our method addresses the lack of labeled snippets for supervised model training by exploiting the internal ambiguity of particular labeled snippets in which multiple class labels are combined. Using a specially designed algorithm, we separate these ambiguous fragments, thereby expanding the pool of training data. This data augmentation approach greatly increases the amount of labeled data and improves the overall quality of the trained models. The experimental results show that the proposed code classifier achieves an F1 test score of ~89%. This result not only makes CodeBERT more practical for classifying code snippets but also highlights the importance of enriching large-scale annotated machine learning code datasets such as Code4ML. With a significant increase in accurately annotated code snippets, Code4ML is becoming an even more valuable resource for learning and improving various data processing models.

SUBMITTER: Berezovskiy V 

PROVIDER: S-EPMC10703005 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature

Publications

Machine learning code snippets semantic classification.

Berezovskiy Valeriy V, Gorodilova Anastasia A, Trofimova Ekaterina E, Ustyuzhanin Andrey A

PeerJ. Computer science, 2023-11-27



Similar Datasets

| S-EPMC9941804 | biostudies-literature
| S-EPMC8844421 | biostudies-literature
| S-EPMC10184910 | biostudies-literature
| S-EPMC8293838 | biostudies-literature
| S-EPMC7264183 | biostudies-literature
| S-EPMC5088188 | biostudies-literature
| S-EPMC8131987 | biostudies-literature
| S-EPMC10280557 | biostudies-literature
| S-EPMC7909418 | biostudies-literature
| S-EPMC11339331 | biostudies-literature