PanClassif: A machine learning classifier pipeline for TCGA pancancer classification

PanClassif is a complete machine learning pipeline package for working with TCGA cancer RNA-seq gene count data.

PanClassif on PyPI

Data prerequisites

TCGA cancer and normal samples, downloaded using TCGA2STAT
A smoothed version of the same data, produced with knn-smoothing (Wagner et al., 2017)
Dataset (Mendeley)
Dataset (Drive link)

Functions

featSelect(homepath, cancerpath, normalpath, k)

Params

homepath : (str) Path where all generated files and folders are saved.
cancerpath : (str) Path to the directory that holds the cancer gene expression matrix of every cancer type.
normalpath : (str) Path to the directory that holds the normal gene expression matrix of every cancer type.
k : (int) Number of top genes to select per cancer. k cannot be less than 5. (default: k=5)
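
The constraint on k can be checked before featSelect is called. The following is a minimal sketch and is not part of PanClassif: check_featselect_args is a hypothetical helper, and the requirement that both input paths are existing directories is an assumption.

```python
from pathlib import Path


def check_featselect_args(cancerpath, normalpath, k=5):
    """Hypothetical pre-flight check; not part of the panclassif package."""
    # k cannot be less than 5 (see the parameter description).
    if not isinstance(k, int) or k < 5:
        raise ValueError(f"k must be an integer >= 5, got {k!r}")
    # Assumption: both inputs are existing directories of expression matrices.
    for label, path in (("cancerpath", cancerpath), ("normalpath", normalpath)):
        if not Path(path).is_dir():
            raise FileNotFoundError(f"{label} is not a directory: {path}")
    return True
```

Calling check_featselect_args(cancerpath, normalpath, k) first turns an invalid k or a mistyped path into an immediate, readable error.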

dataProcess(homepath, names, cancerpath, smoothed_cancer, smoothed_normal, scale_mode)

Params

homepath : (str) Path where all generated files and folders are saved.
names : (list) List of the cancer names returned by the featSelect function.
cancerpath : (str) Path to the directory that holds the cancer gene expression matrix of every cancer type.
smoothed_cancer : (str) Path to the directory that holds the smoothed cancer gene expression matrix of every cancer type.
smoothed_normal : (str) Path to the directory that holds the smoothed normal gene expression matrix of every cancer type.
scale_mode : (int) Data scaling method. Use 0 for standardization and 1 for normalization.
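
To illustrate the two scale_mode options, here is a minimal sketch in plain Python. It assumes that standardization means z-scoring (zero mean, unit variance) and that normalization means min-max scaling to [0, 1]. The function scale_column is hypothetical and is not necessarily how dataProcess is implemented.

```python
from statistics import fmean, pstdev


def scale_column(values, scale_mode):
    """Hypothetical illustration of the two scaling modes for one gene column."""
    if scale_mode == 0:
        # Standardization (assumed z-score): subtract the mean, divide by the std.
        mu, sigma = fmean(values), pstdev(values)
        if sigma == 0:
            return [0.0 for _ in values]
        return [(v - mu) / sigma for v in values]
    if scale_mode == 1:
        # Normalization (assumed min-max): map the column onto [0, 1].
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError("scale_mode must be 0 or 1")


counts = [10.0, 20.0, 30.0]
print(scale_column(counts, scale_mode=0))  # values centered on 0
print(scale_column(counts, scale_mode=1))  # [0.0, 0.5, 1.0]
```

Standardization keeps the relative spread of each gene, while min-max scaling bounds every value, which can matter for classifiers that are sensitive to feature ranges.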

upsampled(names, homepath)

binary_merge(names, homepath)

multi_merge(names, homepath)

Params

names : (list) List of the cancer names returned by the featSelect function.
homepath : (str) Path where all generated files and folders are saved.
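
How upsampled balances the classes is not described here. As a rough illustration of one common approach, the sketch below performs random oversampling with replacement. This is an assumption about the general technique, not a description of the package's method, and random_oversample is a hypothetical name.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Hypothetical sketch: duplicate minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rows = [[1.0], [2.0], [3.0], [4.0]]
labels = ["cancer", "cancer", "cancer", "normal"]
_, balanced = random_oversample(rows, labels)
print(Counter(balanced))
```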

classification(homepath, classifier, mode, save_model)

Params

homepath : (str) Path where all generated files and folders are saved.
classifier : (sklearn classification model) An instance of the classification model you want to use. For example: RandomForestClassifier(n_estimators=100).
Or, classifier : (str) To use the neural network, pass "NN". For example: classifier = "NN"
mode : (str) There are two modes: use "binary" for binary classification and "multi" for multiclass classification. (default: mode = "binary")
save_model : (str) Optional parameter. Use it only if you want to save the model. For example: save_model = "your_model_name"
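
The accepted argument combinations can be summarized in code. The sketch below is a hypothetical check, not part of PanClassif: check_classification_args and MajorityClassifier are illustrative names, and treating any object with fit and predict methods as a valid sklearn-style model is an assumption.

```python
def check_classification_args(classifier, mode="binary"):
    """Hypothetical check of the classifier and mode arguments."""
    if mode not in ("binary", "multi"):
        raise ValueError('mode must be "binary" or "multi"')
    if isinstance(classifier, str):
        # The only documented string value is "NN" (neural network).
        if classifier != "NN":
            raise ValueError('the only supported string classifier is "NN"')
        return "NN"
    # Assumption: anything else must look like a scikit-learn estimator.
    if not (hasattr(classifier, "fit") and hasattr(classifier, "predict")):
        raise TypeError("classifier needs fit() and predict() methods")
    return type(classifier).__name__


class MajorityClassifier:
    """Tiny stand-in with a scikit-learn style interface, for illustration only."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


print(check_classification_args("NN", mode="multi"))
print(check_classification_args(MajorityClassifier()))
```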

gsea(homepath)

Params

homepath : (str) Path where all generated files and folders are saved.

Example

homepath = '/home'
cancerpath = '/home/cancer/'
normalpath = '/home/normal/'

smoothed_cancer = '/home/smoothed_cancer'
smoothed_normal = '/home/smoothed_normal'

Data Load and Process Phase

import panclassif as pc
# The functions must be called in the order below for the pipeline to work properly.
names = pc.featSelect(homepath, cancerpath, normalpath, k=5)
pc.dataProcess(homepath, names, cancerpath, smoothed_cancer, smoothed_normal)
pc.upsampled(names, homepath)
pc.binary_merge(names, homepath)
pc.multi_merge(names, homepath)

Classification Phase

from sklearn.ensemble import RandomForestClassifier
pc.classification(homepath, RandomForestClassifier(n_estimators=100), mode="multi", save_model="RF")

Gene enrichment check

pc.gsea(homepath)
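
Because the call order matters, the three phases above can be wrapped in one function. This is a minimal sketch: run_pipeline is a hypothetical helper that is not part of PanClassif, and it takes the imported module as its first argument so the order can be exercised with a stand-in object.

```python
def run_pipeline(pc, homepath, cancerpath, normalpath,
                 smoothed_cancer, smoothed_normal, classifier, k=5):
    """Hypothetical wrapper that runs every step in the order shown above.

    `pc` is the imported panclassif module (import panclassif as pc).
    """
    # Data load and process phase.
    names = pc.featSelect(homepath, cancerpath, normalpath, k=k)
    pc.dataProcess(homepath, names, cancerpath, smoothed_cancer, smoothed_normal)
    pc.upsampled(names, homepath)
    pc.binary_merge(names, homepath)
    pc.multi_merge(names, homepath)
    # Classification phase (multiclass, model not saved).
    pc.classification(homepath, classifier, mode="multi")
    # Gene enrichment check.
    pc.gsea(homepath)
    return names
```

With the package installed, a call would look like run_pipeline(pc, homepath, cancerpath, normalpath, smoothed_cancer, smoothed_normal, RandomForestClassifier(n_estimators=100)).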
