Instructions Creators: Xiaoru Dong, Linh Hoang
Preparation date: 2018-12-14, last updated 2019-04-18
Manuscript working title: Machine classification of inclusion criteria from Cochrane systematic reviews
Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider
These instructions describe the steps needed to replicate the results in the manuscript.
-
Programming Language: Python (version 3)
-
Please make sure that you have the following programs on your machine in order to run the scripts:
- Python 3: https://www.python.org/downloads/
- Jupyter Notebook: http://jupyter.org/install
-
The Python scripts are used to generate features and to create the Weka input files corresponding to the three feature extraction and selection approaches that we implemented in this study:
- Features generated by the bag of words feature extraction strategy.
- Features selected by the information gain feature selection strategy.
- Features selected by a manual analysis feature selection strategy.
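The bag of words strategy turns each inclusion criterion into a binary word-presence vector. A minimal sketch of the idea (the sample criteria below are invented for illustration; this is not the study's actual script):

```python
def bag_of_words(texts):
    """Build a sorted vocabulary and a binary presence vector per text."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = [[1 if w in set(t.lower().split()) else 0 for w in vocab]
            for t in texts]
    return vocab, rows

criteria = ["Randomized controlled trials", "Controlled clinical trials"]
vocab, rows = bag_of_words(criteria)
# vocab → ['clinical', 'controlled', 'randomized', 'trials']
# rows  → [[0, 1, 1, 1], [1, 1, 0, 1]]
```

Each row becomes one instance in the Weka input file, with the words as attributes.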
-
Approach 1: Bag of words feature extraction
Step 1: Download the script: https://github.com/infoqualitylab/InclusionCriteria/blob/master/bag_of_words_feature_extraction.ipynb
-
Step 2: Download the input file “Inclusion_Criteria_Annotation.csv” (one of the study’s data files), which is available at: https://doi.org/10.13012/B2IDB-5958960_V2. Note where you store the file.
-
Step 3: Open the script in Jupyter Notebook. Change the “path” variable in the script to the path of your own folder where you stored the input file.
-
Step 4: Run the script to get two output files: "AllWords.csv" and "AllWords_weka_input.arff"
-
Step 5: Use the "AllWords_weka_input.arff" file as the input to run the classification model in Weka (for how to run a classification model in Weka, please read the Weka section below).
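The .arff outputs above are in Weka's plain-text ARFF input format. A minimal sketch of how such a file can be written, with invented feature names and class labels (the actual scripts' column layout may differ):

```python
import os
import tempfile

def write_arff(path, relation, features, labeled_rows, classes):
    """Write a minimal binary-feature ARFF file in the layout Weka expects."""
    with open(path, "w") as f:
        f.write("@relation %s\n\n" % relation)
        for name in features:                      # one binary attribute per word
            f.write("@attribute %s {0,1}\n" % name)
        f.write("@attribute class {%s}\n\n@data\n" % ",".join(classes))
        for values, label in labeled_rows:         # one data row per instance
            f.write(",".join(str(v) for v in values) + "," + label + "\n")

path = os.path.join(tempfile.gettempdir(), "demo_weka_input.arff")
write_arff(path, "inclusion_criteria", ["randomized", "trials"],
           [([1, 1], "included"), ([0, 1], "excluded")],
           ["included", "excluded"])
```

The relation name, feature names, and class values here are hypothetical; the scripts generate them from the annotation data.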
-
Approach 2: Information gain feature selection
Step 1: Download the first script: https://github.com/infoqualitylab/InclusionCriteria/blob/master/generate_no_redundant_Weka_input.ipynb
-
Step 2: Use the input file “Inclusion_Criteria_Annotation.csv” (downloaded in Step 2 of the bag of words approach above). Note where you store the file.
-
Step 3: Open the script in Jupyter Notebook. Change the “path” variable in the script to the path of your own folder where you stored the input file.
-
Step 4: Run the script to get two output files: "AllWord_Noredundant.csv" and "AllWord_Noredundant_weka_input.arff"
-
Step 5: Use the "AllWord_Noredundant_weka_input.arff" file as input to run information gain in Weka (for how to run information gain in Weka, please read the Weka section below). After running information gain in Weka, save the list of informative words ("InformativeWords") that Weka outputs to the same folder.
-
Step 6: Download the second script: https://github.com/infoqualitylab/InclusionCriteria/blob/master/information_gain_feature_selection.ipynb
-
Step 7: Open the script in Jupyter Notebook. Change the “path” variable in the script to the path of your own folder where you stored the input file.
-
Step 8: Run the script to get two output files: "WordsSelectedByInformationGain.csv" and "WordsSelectedByInformationGain_weka_input.arff"
-
Step 9: Use the "WordsSelectedByInformationGain_weka_input.arff" file as input to run the classification model in Weka (for how to run a classification model in Weka, please read the Weka section below).
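Weka's InfoGainAttributeEval ranks each feature by its information gain with respect to the class. The underlying computation can be sketched as follows (the example feature values and labels are invented):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of one binary feature with respect to the class labels."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature):                       # partition instances by feature value
        sub = [lab for f, lab in zip(feature, labels) if f == v]
        gain -= len(sub) / n * entropy(sub)      # subtract weighted subset entropy
    return gain

labels = ["included", "included", "excluded", "excluded"]
# A word that perfectly separates the classes has maximal gain:
print(info_gain([1, 1, 0, 0], labels))  # → 1.0
# A word unrelated to the classes has zero gain:
print(info_gain([1, 0, 1, 0], labels))  # → 0.0
```

Features with gain above a threshold are the "informative words" that the second script then turns into the reduced Weka input file.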
-
Approach 3: Manual analysis feature selection
Step 1: Download the script: https://github.com/infoqualitylab/InclusionCriteria/blob/master/manual_analysis_feature_selection.ipynb
-
Step 2: Download the input file “WordsSelectedByManualAnalysis.csv” (one of the study’s data files), which is available at: https://doi.org/10.13012/B2IDB-8659314_V1. Note where you store the file.
-
Step 3: Open the script in Jupyter Notebook. Change the “path” variable in the script to the path of your own folder where you stored the input file.
-
Step 4: Run the script to get one output file: "WordSelectedbyManualAnalysis_weka_input.arff"
-
Step 5: Use the "WordSelectedbyManualAnalysis_weka_input.arff" file as input to run the classification model in Weka (for how to run a classification model in Weka, please read the Weka section below).
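Conceptually, this approach keeps only the feature columns whose words appear in the manually curated list ("WordsSelectedByManualAnalysis.csv"). A minimal sketch, with an invented vocabulary and word list:

```python
def select_features(vocab, rows, selected_words):
    """Keep only the columns whose feature name is in the selected word list."""
    keep = [i for i, w in enumerate(vocab) if w in set(selected_words)]
    return [vocab[i] for i in keep], [[row[i] for i in keep] for row in rows]

vocab = ["clinical", "controlled", "randomized", "trials"]
rows = [[0, 1, 1, 1], [1, 1, 0, 1]]
new_vocab, new_rows = select_features(vocab, rows, ["randomized", "trials"])
# new_vocab → ['randomized', 'trials']
# new_rows  → [[1, 1], [0, 1]]
```

The reduced matrix is then written out as the ARFF input for Weka, as in the other approaches.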
-
Running classifiers and information gain in Weka:
- Please make sure that you have Weka on your machine in order to run the classifiers: https://www.cs.waikato.ac.nz/ml/weka/downloading.html
- Step 1: Open Weka on your machine, select “Explorer” mode.
- Step 2: On the “Preprocess” tab:
--> Click “Open file” and select the Weka input file that you want to run classification on. For example: if you want to train a classifier with all features, select the “AllWords_weka_input.arff” Weka input file as shown in the screenshot below.
--> Click “All” to choose all of the words and use them as features to train the classifier as shown in the screenshot below.
- Step 3: On the “Classify” tab:
--> Click “Choose” to select the algorithm that you want to run. NOTE: We used three algorithms: Random Forest, J48 and Naïve Bayes. For example: if you want to run a classifier using the “Random Forest” algorithm, select RandomForest as shown in the screenshot below:
--> Click “Percentage split” in the “Test options” panel and set it to 90% (i.e., 90% of the data set is used for training and 10% for testing).
--> Click “More Options...” and set the seed to 3.
--> Click “Start” to run the classifier:
- Step 4: Get the classifier results. Three measurements were reported in our manuscript: Precision, Recall and F-Measure as shown in the screenshot below.
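The three measurements reported (Precision, Recall, F-Measure) are standard per-class quantities computed from the confusion matrix. A sketch of the computation, with invented labels (Weka reports these per class and weighted-averaged):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F-measure for one class, from paired label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["included", "included", "excluded", "excluded"]
y_pred = ["included", "excluded", "included", "excluded"]
print(precision_recall_f1(y_true, y_pred, "included"))  # → (0.5, 0.5, 0.5)
```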
- We also used Weka to run Information Gain feature selection. To do so:
--> On the “Preprocess” tab: Click “Open file” and select the Weka input file AllWord_Noredundant_weka_input.arff
--> On the “Select attributes” tab:
Click “Choose” and select “InfoGainAttributeEval” as shown in the screenshot below.
Click “Start” to run the Information Gain feature selection.
--> Weka generated a list of informative words selected by the Information Gain feature selection strategy. We then used the Python script (above) to generate the data file “WordsSelectedByInformationGain.csv” and the Weka input file “WordsSelectedByInformationGain_weka_input.arff” accordingly.
For any questions about these instructions, please contact:
Linh Hoang - [email protected].