This repository contains the code and replication scripts for IntraTyper, a modified version of the deep-learning type inference tool DeepTyper by Hellendoorn et al. (2018). DeepTyper was trained and evaluated on a set of different projects, a so-called inter-project setting. In contrast, IntraTyper is trained and evaluated in an intra-project setting, i.e., it is specifically tailored to one project. As the results of the experiments show, this specific setting lets the tool excel at predicting relatively uncommon, project-specific types.
IntraTyper uses the CNTK library, so an environment that supports CNTK is required.
- Execute the bash script `data/cloner.sh`. This will clone all the repositories listed in the `data/repo-SHAs.txt` file and reset them to their commit SHAs of 28 February 2018.
- Copy the created `data/Repos` directory and name the copy `data/Repos-cleaned`.
- Run `node CleanRepos.js`. This will create corresponding tokenized data and type (`*.ttokens`) files in `Repos-cleaned`. Furthermore, it scrapes all user-added type annotations from the source code and stores them in `*.ttokens.pure` files.
- Run `node GetTypes.js`. This will create three directories. In each directory, each line corresponds to a TypeScript file and contains space-separated TypeScript tokens followed by the corresponding space-separated types; a tab separates the source tokens from the type tokens.
  - `outputs-all` contains data in which every identifier is annotated with its inferred type. This is used as training data.
  - `outputs-pure` contains only the real user-added type annotations for the TypeScript code (and `no-type` elsewhere). This is used for evaluation (GOLD data).
  - `outputs-checkjs` contains the TSc+CheckJS-inferred types for every identifier. This can be used to compare performance with TSc+CheckJS.
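The shared line layout of these output files can be parsed with a few lines of Python. This is a minimal sketch; the one-type-per-token alignment follows the description above, and the example line is illustrative, not taken from the real data:

```python
def parse_line(line):
    """Split one output line into (token, type) pairs.

    A line holds space-separated source tokens, a tab, then the
    corresponding space-separated type tokens, aligned one-to-one.
    """
    source_part, type_part = line.rstrip("\n").split("\t")
    tokens = source_part.split(" ")
    types = type_part.split(" ")
    if len(tokens) != len(types):
        raise ValueError("token/type count mismatch")
    return list(zip(tokens, types))

# Illustrative line, not taken from the real data:
pairs = parse_line("let x = 5\tno-type number no-type number")
```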
- In the following, choose between the `intra-xyz.py` and `inter-xyz.py` scripts, depending on which setting you want to build. For simplicity, the `intra` scripts are used hereafter, but they can always be replaced with the `inter` scripts.
- Run `intra_data_split.py` to create an 80% train, 10% validation and 10% test split, as well as source and target vocabularies. This will also create a txt file listing the projects (inter-project setting) or source files (intra-project setting) chosen for the test split.
- Convert the train/valid/test data to CNTK-compatible `.ctf` input files using CNTK's `txt2ctf` script:

  ```shell
  python txt2ctf.py --map data/source_wl data/target_wl --input data/train.txt --output data/train.ctf
  python txt2ctf.py --map data/source_wl data/target_wl --input data/valid.txt --output data/valid.ctf
  python txt2ctf.py --map data/source_wl data/target_wl --input data/test.txt --output data/test.ctf
  ```
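The split performed above can be sketched as follows. This is a simplified illustration of the idea, not the actual `intra_data_split.py` logic:

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle examples and partition them into 80% train, 10% valid, 10% test."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_valid = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```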
- Adjust the epoch size in `intra_infer.py` and `intra_evaluation.py` according to the output of `intra_data_split.py`, reported in the line "Overall tokens: [xyz] train".
- Run `intra_infer.py` to train the neural network over 10 epochs.
- Choose the model with the best evaluation error and assign its path to the `model_file` variable in `intra_evaluation.py`.
- Run `intra_evaluation.py` to let the model predict the corresponding types in the test data set. The results are written to a txt file in the `results` directory. The file contains four columns, defined as follows:

  `true type | prediction | confidence of prediction | rank of prediction`
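These columns make it straightforward to compute, for example, top-1 accuracy. The sketch below assumes whitespace-separated columns, which is an assumption about the file layout, so adapt the split if the real file differs:

```python
def top1_accuracy(lines):
    """Fraction of rows whose prediction equals the true type.

    Assumes one row per prediction with the four columns described
    above, separated by whitespace (an assumption about the layout).
    """
    correct = total = 0
    for line in lines:
        true_type, prediction, confidence, rank = line.split()
        total += 1
        if prediction == true_type:
            correct += 1
    return correct / total if total else 0.0
```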
- To create a plot of the resulting prediction accuracies, run the script `analyze_result.py`.
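The kind of accuracy curve plotted in the last step can be derived from the rank column of the results file; a minimal sketch follows (the real `analyze_result.py` may aggregate differently):

```python
def topk_accuracies(ranks, max_k=5):
    """For k = 1..max_k, return the fraction of predictions whose
    true type appears within the top k ranks (rank column above)."""
    n = len(ranks)
    return [sum(1 for r in ranks if r <= k) / n for k in range(1, max_k + 1)]
```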