GitHub - DRong1121/cross_language_clone_detection: Cross-Language Semantic Clone Detection towards Java and Python

Cross-Language Code Clone Detection (XLCCD)

Introduction

This project develops a module for semantic (Type-4) clone detection between Java and Python code snippets.
The module is implemented in Python 3.10 with PyTorch 1.13.0 and can be trained with CUDA 11.4 on Ubuntu 20.04.2.
The pipeline of the module is shown in streamline.png.

Requirements

System Requirements:

Ubuntu 20.04.2

Runtime and Development Requirements:

Python 3.10.13
pip 23.3
PyTorch 1.13.0
CUDA 11.4
other dependencies: see requirements.txt

Network Requirements:

at least 10 MB/s bandwidth
access to HuggingFace

Storage Requirements:

at least 5 GB for downloaded and saved models
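
The requirements above can be checked before training with a short script. The following is a minimal sketch and not part of the repository: the function names `check_python`, `check_free_space` and `check_torch` are hypothetical, the version and storage thresholds are copied from the lists above, and the current directory is assumed to be where models will be saved.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 10)  # major.minor from the requirements list above
REQUIRED_FREE_GB = 5       # space for downloaded and saved models


def check_python(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10.x."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def check_free_space(path=".", required_gb=REQUIRED_FREE_GB):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


def check_torch():
    """Return (version, cuda_available), or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.__version__, torch.cuda.is_available()


if __name__ == "__main__":
    print("Python 3.10:", check_python())
    print("At least 5 GB free:", check_free_space())
    print("PyTorch (version, CUDA available):", check_torch())
```

The script only reports what it finds; it does not install or change anything. Network bandwidth and access to HuggingFace are not checked here.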

Datasets

The Pre-training Dataset: see the Jupyter notebook Pre-training_Dataset_Exploration.ipynb
The Fine-tuning Dataset: see the Jupyter notebook Fine-tuning_Dataset_Exploration.ipynb
Data Augmentation Technique: see the Jupyter notebook Data_Augmentation.ipynb
Code Tokenization Methodology: see the Jupyter notebook Code_Tokenization.ipynb

Scripts

For the Baseline Experiment, run the following commands:

cd /xlccd/core/scripts  
bash run_fine_tune_train.sh  
bash run_fine_tune_test.sh

For the C4 Experiment, run the following commands:

cd /xlccd/core/scripts  
bash run_c4_train.sh  
bash run_c4_test.sh

For the XLCCD Experiment (pre-training, fine-tuning), run the following commands:

cd /xlccd/core/scripts  
bash run_pre_train.sh  
bash run_fine_tune_train.sh  
bash run_fine_tune_test.sh
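
The commands above are listed in a fixed order: pre-training, fine-tuning, then testing. As a minimal sketch (not part of the repository), the same sequence could be driven from Python so that a failing stage stops the run. The helper `run_stages` and its `dry_run` flag are hypothetical; the script names and directory are taken from the commands above.

```python
import subprocess
from pathlib import Path

SCRIPTS_DIR = Path("/xlccd/core/scripts")  # directory used in the commands above
XLCCD_STAGES = [
    "run_pre_train.sh",
    "run_fine_tune_train.sh",
    "run_fine_tune_test.sh",
]


def run_stages(stages, scripts_dir=SCRIPTS_DIR, dry_run=False):
    """Run each script with bash, in order, stopping at the first failure.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    commands = [["bash", name] for name in stages]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later stages never start after a failed one.
            subprocess.run(command, cwd=scripts_dir, check=True)
    return commands


if __name__ == "__main__":
    # Print the planned commands without running them.
    for command in run_stages(XLCCD_STAGES, dry_run=True):
        print(" ".join(command))
```

Calling `run_stages(XLCCD_STAGES)` without `dry_run` would execute the scripts. The same helper could take the Baseline or C4 script names instead.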

For Data Augmentation, run the following commands (the Transcoder dependencies must be installed first):

cd /xlccd/core/scripts  
bash run_augmentation.sh

Relevant links:
Transcoder Model: https://github.com/facebookresearch/CodeGen
fastBPE: https://github.com/glample/fastBPE

Other Information

See 跨语言克隆代码检测--成果说明文档.docx (Cross-Language Code Clone Detection: Results Documentation).
