This project aims to develop a module for semantic (Type-4) clone detection in Java and Python code snippets.
The module is developed with Python 3.10 and PyTorch 1.13.0, and can be trained with CUDA 11.4 on Ubuntu 20.04.2.
The pipeline of the module is shown in streamline.png.
System Requirements:
- Ubuntu 20.04.2
Runtime and Development Requirements:
- Python 3.10.13
- pip 23.3
- PyTorch 1.13.0
- CUDA 11.4
- other dependencies: see requirements.txt
Network Requirements:
- at least 10 MB/s bandwidth
- access to HuggingFace
Storage Requirements:
- at least 5 GB for downloaded and saved models
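The requirements listed above can be checked before training with a short script. The following is a minimal sketch, not part of the project: the function names, the constants, and the choice to accept local build tags such as "+cu117" on the PyTorch version are assumptions made for illustration. It does not check the CUDA version, the network bandwidth, or access to HuggingFace.

```python
import shutil
import sys
from importlib import metadata

# Values taken from the requirements above; the names are hypothetical.
REQUIRED_PYTHON = (3, 10)
REQUIRED_TORCH = "1.13.0"
MIN_FREE_GB = 5


def check_python(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10.x."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def check_torch(installed=None):
    """Return True if the installed torch release is 1.13.0."""
    if installed is None:
        try:
            installed = metadata.version("torch")
        except metadata.PackageNotFoundError:
            return False
    # Assumption: a local build tag such as "1.13.0+cu117" is acceptable.
    return installed.split("+")[0] == REQUIRED_TORCH


def check_disk(path=".", min_free_gb=MIN_FREE_GB):
    """Return True if at least min_free_gb GiB are free on the given path."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= min_free_gb


if __name__ == "__main__":
    print("Python 3.10:", check_python())
    print("PyTorch 1.13.0:", check_torch())
    print("5 GB free:", check_disk())
```

Each check only reports True or False, so the script can be run safely on a machine that does not yet meet the requirements.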
The Pre-training Dataset: see the Jupyter notebook Pre-training_Dataset_Exploration.ipynb
The Fine-tuning Dataset: see the Jupyter notebook Fine-tuning_Dataset_Exploration.ipynb
Data Augmentation Technique: see the Jupyter notebook Data_Augmentation.ipynb
Code Tokenization Methodology: see the Jupyter notebook Code_Tokenization.ipynb
For the Baseline Experiment, run the following commands:
cd /xlccd/core/scripts
bash run_fine_tune_train.sh
bash run_fine_tune_test.sh
For the C4 Experiment, run the following commands:
cd /xlccd/core/scripts
bash run_c4_train.sh
bash run_c4_test.sh
For the XLCCD Experiment (pre-training, then fine-tuning), run the following commands:
cd /xlccd/core/scripts
bash run_pre_train.sh
bash run_fine_tune_train.sh
bash run_fine_tune_test.sh
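The three XLCCD commands above must run in order, and a later stage is not meaningful if an earlier one fails. The following is a minimal sketch of a driver that enforces this, assuming the script names and directory shown above; the names run_stages, XLCCD_STAGES, and the runner parameter are hypothetical and are not part of the project.

```python
import subprocess

# Directory and script names as given in the commands above.
SCRIPTS_DIR = "/xlccd/core/scripts"
XLCCD_STAGES = [
    "run_pre_train.sh",
    "run_fine_tune_train.sh",
    "run_fine_tune_test.sh",
]


def run_stages(stages, scripts_dir=SCRIPTS_DIR, runner=subprocess.run):
    """Run each script with bash, in order, from scripts_dir.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so later stages are skipped after a failure.
    Returns the list of stages that completed.
    """
    completed = []
    for name in stages:
        # cwd mirrors the `cd` step in the commands above.
        runner(["bash", name], cwd=scripts_dir, check=True)
        completed.append(name)
    return completed
```

Calling run_stages(XLCCD_STAGES) would start the full experiment. The runner parameter exists only so the ordering logic can be exercised without launching training.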
For Data Augmentation, run the following commands (the dependencies for Transcoder must be installed first):
cd /xlccd/core/scripts
bash run_augmentation.sh
Relevant links:
Transcoder Model: https://github.com/facebookresearch/CodeGen
fastBPE: https://github.com/glample/fastBPE