PanClassif: A machine learning classifier pipeline for TCGA pancancer classification
PanClassif is a complete machine learning pipeline package for working with TCGA cancer RNA-seq gene count data.
- TCGA cancer and normal samples, downloaded using TCGA2STAT
- Smoothed version of the data above, produced with knn-smoothing (Wagner et al., 2017)
- Dataset (Mendeley link)
- Dataset (Drive link)
Params (featSelect)
- homepath : (str) Path where you want to save all the generated files and folders.
- cancerpath : (str) Path to the directory containing the cancer gene expression matrices for all cancer types.
- normalpath : (str) Path to the directory containing the normal gene expression matrices for all cancer types.
- k : (int) The number of top genes to select per cancer (default: k=5). k cannot be less than 5.
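The `k` constraint and the idea of keeping the top `k` genes per cancer can be sketched as follows. This is only an illustration: the helper name `top_k_genes`, the dictionary input format, and the ranking score (absolute difference between mean cancer and mean normal expression) are assumptions, not PanClassif's actual selection method, which is not described here.

```python
from statistics import mean


def top_k_genes(cancer, normal, k=5):
    """Return the k gene names with the largest |mean(cancer) - mean(normal)|.

    cancer, normal: dict mapping gene name -> list of expression values.
    Hypothetical helper; the scoring rule is an assumption for illustration.
    """
    if k < 5:
        raise ValueError("k must be at least 5")
    shared = sorted(set(cancer) & set(normal))
    scores = {g: abs(mean(cancer[g]) - mean(normal[g])) for g in shared}
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(shared, key=lambda g: (-scores[g], g))
    return ranked[:k]


# Toy data: six genes, two samples per group (made-up numbers).
cancer = {"G1": [10, 12], "G2": [5, 5], "G3": [1, 3],
          "G4": [8, 8], "G5": [2, 2], "G6": [7, 9]}
normal = {"G1": [1, 1], "G2": [5, 5], "G3": [2, 2],
          "G4": [3, 3], "G5": [9, 9], "G6": [7, 7]}
selected = top_k_genes(cancer, normal, k=5)
```

Passing `k=4` to this sketch raises a `ValueError`, mirroring the documented lower bound.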
Params (dataProcess)
- homepath : (str) Path where you want to save all the generated files and folders.
- cancerpath : (str) Path to the directory containing the cancer gene expression matrices for all cancer types.
- names : (list) List of the cancer names returned by the featSelect function.
- smoothed_cancer : (str) Path to the directory containing the smoothed cancer gene expression matrices for all cancer types.
- smoothed_normal : (str) Path to the directory containing the smoothed normal gene expression matrices for all cancer types.
- scale_mode : (int) Data scaling method: 0 for standardization, 1 for normalization.
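The two `scale_mode` options can be illustrated with a small sketch. It assumes that "standardization" means z-scoring (zero mean, unit variance) and that "normalization" means min-max scaling to [0, 1]; the helper `scale` is hypothetical and the package itself may use different scalers.

```python
from statistics import mean, pstdev


def scale(values, scale_mode):
    """Scale a list of numbers.

    scale_mode=0: standardization (zero mean, unit variance).
    scale_mode=1: normalization, assumed here to mean min-max scaling to [0, 1].
    """
    if scale_mode == 0:
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            return [0.0 for _ in values]  # constant input: nothing to scale
        return [(v - mu) / sigma for v in values]
    if scale_mode == 1:
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]  # constant input: nothing to scale
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError("scale_mode must be 0 or 1")


counts = [2.0, 4.0, 6.0, 8.0]  # made-up expression values for one gene
standardized = scale(counts, scale_mode=0)
normalized = scale(counts, scale_mode=1)
```

Standardized values are centered on zero, while normalized values always run from 0 to 1 regardless of the original range.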
Params (upsampled, binary_merge, multi_merge)
- names : (list) List of the cancer names returned by the featSelect function.
- homepath : (str) Path where you want to save all the generated files and folders.
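Upsampling is typically used to balance classes that have very different sample counts, such as cancer versus normal samples. The sketch below shows plain random oversampling with replacement. It is an assumption about the kind of balancing involved: the helper `oversample` is hypothetical and `pc.upsampled` may work differently.

```python
import random
from collections import Counter


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


samples = ["c1", "c2", "c3", "c4", "n1"]  # made-up sample IDs
labels = ["cancer", "cancer", "cancer", "cancer", "normal"]
balanced_samples, balanced_labels = oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

After balancing, both classes in the toy data contain four samples, with the single normal sample repeated.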
Params (classification)
- homepath : (str) Path where you want to save all the generated files and folders.
- classifier : (sklearn classification model) An instance of the classification model you want to use. For example: RandomForestClassifier(n_estimators=100).
- Or, classifier : (str) To use a neural network, pass "NN". For example: classifier = "NN"
- mode : (str) One of two modes: "binary" for binary classification or "multi" for multiclass classification. (default: mode = "binary")
- save_model : (str) Optional. Use it only if you want to save the model. For example: save_model = "your_model_name"
Params (gsea)
- homepath : (str) Path where you want to save all the generated files and folders.
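The options of the classification step (a model instance or the string "NN", a mode of "binary" or "multi", and an optional model name to save) can be checked up front with a small helper. The sketch below is hypothetical: `describe_run` and `DummyModel` are not part of panclassif, and the package's own argument handling may differ.

```python
def describe_run(classifier, mode="binary", save_model=None):
    """Validate classification-style arguments and summarize the requested run.

    Hypothetical helper mirroring the documented options.
    """
    if mode not in ("binary", "multi"):
        raise ValueError('mode must be "binary" or "multi"')
    if isinstance(classifier, str):
        if classifier != "NN":
            raise ValueError('the only string accepted is "NN"')
        model_kind = "neural network"
    elif hasattr(classifier, "fit") and hasattr(classifier, "predict"):
        # Anything with fit/predict is treated like an sklearn estimator.
        model_kind = type(classifier).__name__
    else:
        raise TypeError('classifier must be "NN" or an object with fit and predict methods')
    return {"model": model_kind, "mode": mode, "save": save_model is not None}


class DummyModel:
    """Stand-in for an sklearn estimator (only the method names matter here)."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


nn_run = describe_run("NN")
rf_like_run = describe_run(DummyModel(), mode="multi", save_model="RF")
```

Checking the arguments before training starts gives an early, readable error for a misspelled mode or an unsupported model string.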
import panclassif as pc
from sklearn.ensemble import RandomForestClassifier

# Example paths; replace them with your own directories.
homepath = '/home'
cancerpath = '/home/cancer/'
normalpath = '/home/normal/'
smoothed_cancer = '/home/smoothed_cancer'
smoothed_normal = '/home/smoothed_normal'

# The functions must be called in the order below for the pipeline to work.
names = pc.featSelect(homepath, cancerpath, normalpath, k=5)
pc.dataProcess(homepath, names, cancerpath, smoothed_cancer, smoothed_normal)
pc.upsampled(names, homepath)
pc.binary_merge(names, homepath)
pc.multi_merge(names, homepath)
pc.classification(homepath, RandomForestClassifier(n_estimators=100), mode="multi", save_model="RF")
pc.gsea(homepath)
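Because every step reads from or writes to the directories defined at the top of the example, it can help to confirm that they exist before starting the pipeline. The following is a minimal sketch using only the standard library. The helper `missing_dirs` is hypothetical, and temporary directories stand in for real data folders.

```python
import tempfile
from pathlib import Path


def missing_dirs(*paths):
    """Return the given paths that are not existing directories."""
    return [str(p) for p in paths if not Path(p).is_dir()]


# Demonstration with temporary folders standing in for real data paths.
with tempfile.TemporaryDirectory() as home:
    cancer_dir = Path(home) / "cancer"
    normal_dir = Path(home) / "normal"
    cancer_dir.mkdir()  # only the cancer folder is created
    missing = missing_dirs(home, cancer_dir, normal_dir)
```

In this demonstration only the normal folder is reported as missing. With real data, the same call could be made with `homepath`, `cancerpath`, `normalpath`, `smoothed_cancer`, and `smoothed_normal`.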