Dataset Information

Machine learning code snippets semantic classification.

ABSTRACT: Program code has recently become a valuable active data source for training various data science models, from code classification to controlled code synthesis. Annotating code snippets plays an essential role in such tasks. This article presents a novel approach that leverages CodeBERT, a powerful transformer-based model, to automatically classify code snippets extracted from Code4ML. Code4ML is a comprehensive machine learning code corpus compiled from Kaggle, a renowned data science competition platform. The corpus includes code snippets and information about the respective kernels and competitions, but its tagged data is limited, at ~0.2%. Our method addresses the lack of labeled snippets for supervised model training by exploiting the internal ambiguity in particular labeled snippets where multiple class labels are combined. Using a specially designed algorithm, we separate these ambiguous fragments, thereby expanding the pool of training data. This data augmentation approach greatly increases the amount of labeled data and improves the overall quality of the trained models. The experimental results show that the proposed code classifier achieves an F1 test score of ~89%. This result not only enhances the practicality of CodeBERT for classifying code snippets but also highlights the importance of enriching large-scale annotated machine learning code datasets such as Code4ML. With a significant increase in accurately annotated code snippets, Code4ML becomes an even more valuable resource for training and improving various data processing models.
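The abstract mentions separating snippets that combine several class labels into fragments that each carry one label. The record does not describe the algorithm, so the following is only a minimal sketch of the general idea, assuming a hypothetical keyword-based line labeler. The names `KEYWORD_HINTS`, `guess_label`, and `split_snippet`, and the class labels themselves, are illustrative and are not taken from Code4ML or the publication.

```python
from itertools import groupby

# Hypothetical keyword hints per class label; NOT taken from Code4ML or the paper.
KEYWORD_HINTS = {
    "data_loading": ("read_csv", "open("),
    "model_training": (".fit(",),
    "visualization": ("plt.", ".plot("),
}


def guess_label(line, candidate_labels):
    """Return the first candidate label whose keyword occurs in the line, else None."""
    for label in candidate_labels:
        if any(keyword in line for keyword in KEYWORD_HINTS.get(label, ())):
            return label
    return None


def split_snippet(snippet, candidate_labels):
    """Split a multi-label snippet into contiguous (label, fragment) pairs.

    Lines with no keyword match inherit the label of the previous line;
    unmatched lines before the first match are dropped.
    """
    labeled_lines = []
    current = None
    for line in snippet.splitlines():
        if not line.strip():
            continue  # skip blank lines
        current = guess_label(line, candidate_labels) or current
        if current is not None:
            labeled_lines.append((current, line))
    # Merge runs of consecutive lines that share a label into one fragment.
    return [
        (label, "\n".join(text for _, text in group))
        for label, group in groupby(labeled_lines, key=lambda pair: pair[0])
    ]


snippet = """df = pd.read_csv("train.csv")
df = df.dropna()
model.fit(df[features], df[target])
plt.plot(history)"""

fragments = split_snippet(snippet, ["data_loading", "model_training", "visualization"])
for label, fragment in fragments:
    print(label, "->", repr(fragment))
```

Each resulting fragment could then serve as an additional single-label training example. A real pipeline would presumably replace the keyword lookup with a learned or rule-based component of its own.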

SUBMITTER: Berezovskiy V

PROVIDER: S-EPMC10703005 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature

Publications

Machine learning code snippets semantic classification.

Berezovskiy Valeriy V, Gorodilova Anastasia A, Trofimova Ekaterina E, Ustyuzhanin Andrey A

PeerJ Computer Science, 2023-11-27

PMID: 38077565

Similar Datasets

Code-free machine learning for classification of central nervous system histopathology images. | S-EPMC9941804 | biostudies-literature

Development of a code-free machine learning model for the classification of cataract surgery phases. | S-EPMC8844421 | biostudies-literature

Virtual screening of antimicrobial plant extracts by machine-learning classification of chemical compounds in semantic space. | S-EPMC10184910 | biostudies-literature

Semantic similarity and machine learning with ontologies. | S-EPMC8293838 | biostudies-literature

Machine learning uncovers cell identity regulator by histone code. | S-EPMC7264183 | biostudies-literature

Morphological Neuron Classification Using Machine Learning. | S-EPMC5088188 | biostudies-literature

Machine learning reveals hidden stability code in protein native fluorescence. | S-EPMC8131987 | biostudies-literature

Code4ML: a large-scale dataset of annotated Machine Learning code. | S-EPMC10280557 | biostudies-literature

Breast Cancer Type Classification Using Machine Learning. | S-EPMC7909418 | biostudies-literature

Semantic segmentation in crystal growth process using fake micrograph machine learning. | S-EPMC11339331 | biostudies-literature