Skip to content

Project of AGB class: Naive Bayes model to classify tumor types gene expression patterns

Notifications You must be signed in to change notification settings

joanmarticarreras/AGB_naive_bayes

Repository files navigation

AGB_naive_bayes

Project of AGB class: Naive Bayes model to classify tumor types gene expression patterns

INFO AT: http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/index.html

TO DO

  • Are there prior probabilities to take into account? No, all 1/8.
  • Get the sets and parse them.
  • Define and create the training sets and the test set (equal n).
  • Z-scores to descrete values.
  • Measure Likelyhoods.
  • Measure Mutual information.
  • Do we need pseudocounts? YES

Get the sets

mkdir sources

wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/brca_gene_zscore_full-filtered.txt -O ./sources/brca.tbl && \
wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/coad_gene_zscore_full-filtered.txt -O ./sources/coad.tbl && \
wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/hnsc_gene_zscore_full-filtered.txt -O ./sources/hnsc.tbl && \
wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/kirc_gene_zscore_full-filtered.txt -O ./sources/kric.tbl && \
wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/luad_gene_zscore_full-filtered.txt -O ./sources/luad.tbl && \
wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/lusc_gene_zscore_full-filtered.txt -O ./sources/lusc.tbl && \
wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/prad_gene_zscore_full-filtered.txt -O ./sources/prad.tbl && \
wget http://regulatorygenomics.upf.edu/courses/Master_AGB/Exercise_NaiveBayes/thca_gene_zscore_full-filtered.txt -O ./sources/thca.tbl

Look for funy values with AWK

gawk '{print $3}' coad.tbl | sort | uniq -c | more

gawk '{print $3}' coad.tbl | sort | uniq -c | tail

etc

Filter genes with NA or Inf in at least 1 samples

cat sources/* | egrep '\bNA\b|\bInf\b' | gawk '{print $1 }' | sort | uniq > not_valid_genes.txt

Load them into a hash, and just check in NoSoNaive.

We need to know the minumum number of samples in each cancer type, because it will be the number of samples used in our study.

gawk '{print FILENAME, NF}' f_* | sort | uniq | more

The number is 288.

Do parser.pl

  • Filter train & testing (144 each), where 288 is the minumum number of samples from the cancer type files.

    • WARNING: 144 --> 147 when applying pseudocounts.
  • The filter was pseudo-randomized (odds -> test, even -> train). Better against stratification than 0..143 vs 144..end.

  • Create a bunch of test and train files.

  • Create x2 test and train files sets.

  • Change proportion train vs test.

  • Implemented a k-fold validation, where k = 5

Do NotSoNaive.pl

  • Filter which genes have at leat 1 value as NA or Inf, and store it within a hash.
    • Filter from the test & training set those genes.
  • Calcule Liklihoods.
  • Calcule Conditional entropies.
  • Calcule entropies.
  • Information gain.
  • Use directories instead of a bunch of files.
  • Mean between sets.
  • Precision recall by cancer.
  • Calculate F.
  • Decide cut-off score/probability.

One command to rule them all

for int in 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1; do \
perl NotSoNaive.pl -dir set1/ -not not_valid_genes.txt -ig 0.5; done

for int in 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1; do \
perl -e '$int = shift @ARGV; $pos = 0; $tot = 0; while(<>) {@cols = split /\t/; if ($cols[1] eq $cols[2]) {$pos++; $tot++;} else {$tot++; }} print "$int\t", $pos / $tot, "\n";' "$int" results_${int}.tbl >> tp.tbl; done

for setnum in 0 1 2 3 4; do
    for int in 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1; do \
        perl NotSoNaive.pl -dir set${setnum}/ -not not_valid_genes.txt -ig $int >> result_${setnum}_${int}.tbl; \
    done; \
done


for setnum in 0 1 2 3 4; do
    for int in 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1; do \
        perl -e '$int = shift @ARGV; $setnum = shift @ARGV; $pos = 0; $tot = 0; while(<>) {@cols = split /\t/; if ($cols[1] eq $cols[2]) {$pos++; $tot++;} else {$tot++; }} print "$int\t$setnum\t", $pos / $tot, "\n";' "$int" "$setnum" result_${setnum}_${int}.tbl >> tp.tbl; \
    done; \
done

IG PRECISION PLOT REVERSED

I changed <= IG

for setnum in 0 1 2 3 4; do
    for int in 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1; do \
        perl bin/NotSoNaive.pl -dir set${setnum}/ -not not_valid_genes.txt -ig $int >> result_${setnum}_${int}_rev.tbl; \
    done; \
done



Information gain

  • 3rd quantile = 0.241287
  • Number of genes with IG > 3rd quantile = 4755

CANCER PLOTS

perl f_calculator.pl results/result_*_0.5.tbl > f_foreachcancer.tbl
library(ggplot2)
fcancer <- read.table(file="f_foreachcancer.tbl", header=T, sep="\t")
ggplot(fcancer) +
    geom_boxplot(aes(x=MEASURE, y=VALUE, fill=MEASURE)) +
    facet_wrap(~ CANCER) +
    theme_bw() +
    theme(axis.ticks.x=element_blank(),axis.text.x=element_blank()) +
    xlab("") + ylab("") + ylim(0,1)
ggsave(file="plots/f_foreach_cancer.svg")


# ZOOM
ggplot(fcancer) +
    geom_boxplot(aes(x=MEASURE, y=VALUE, fill=MEASURE)) +
    facet_wrap(~ CANCER) +
    theme_bw() +
    theme(axis.ticks.x=element_blank(),axis.text.x=element_blank()) +
    xlab("") + ylab("") + ylim(0.45,1)

ggsave(file="plots/f_foreach_cancer_ZOOM.svg")

IG PLOT

perl -e '
%data = ();
while(<>){
    chomp;
    ($ig, $set, $value) = split /\t/;
    $data{$ig}->{$set} = $value;
};
foreach my $ig (keys %data) {
    $mean = 0;
    $sd = 0;
    foreach my $set (keys %{$data{$ig} }) {
        $mean += $data{$ig}->{$set};
    }
    foreach my $set (keys %{$data{$ig} }) {
        $sd += (($data{$ig}->{$set} - ($mean/5)) ** 2)
    }     
    print "$ig\t", $mean / 5, "\t", sqrt($sd/4), "\n"
}' tp.tbl > tp_summary.tbl
library(ggplot2)
ig3 <- read.table(file="tp_summary.tbl")

ggplot(ig3, aes(x=V1, y=V2, group=1)) +
    stat_summary(fun.y="mean", geom="line", alpha=0.7) +
    geom_errorbar(aes(ymin=V2-V3, ymax=V2+V3)) +
    stat_summary(fun.y="mean", geom="point", alpha=0.8) +
    theme_bw() +
    ylim(0,1) +
    geom_hline(yintercept=0.125, linetype="dashed", alpha=0.7) +
    xlab("\nI.G. Threshold") + ylab("Precision\n")

Genes with IG > 0.65

AQP8|343 :1
GUCA2B|2981 :1
OLR1|4973 :1
SFTA1P|207107:1

SCORE PLOT

for i in 0 1 2 3 4; do
    gawk -v i=$i '{print i, $4}' results/result_${i}_0.5.tbl >> scores.tbl
done
ggplot(scores) +
    geom_density(aes(x=V2, fill=V1), alpha=0.7) +
    theme_bw() +
    xlab("\nScore") + ylab("density\n") +
    facet_grid( V1 ~ .) + scale_fill_discrete(name="Set")

If we set IGthreshold = 0.5 (our maximum), and we let vary SCthreshold, the precision augments.

  • Filter by threshold
mkdir kk

for thres in -200 -150 -100 -50; do
    perl kk.pl prova.tbl ${thres} > kk/prova_${thres}.tbl
done;
  • Do the analysis
perl bin/f_calculator.pl kk/prova_* > kk/f_foreachcancer.tbl

for int in 50 100 150 200; do \
perl -e '$int = shift @ARGV; $pos = 0; $tot = 0; while(<>) {@cols = split /\t/; if ($cols[1] eq $cols[2]) {$pos++; $tot++;} else {$tot++; }} print "$int\t", $pos / $tot, "\n";' "$int" prova_-${int}.tbl >> tp.tbl; done

Maybe it is better to work with scores, rather than probabilities due to perl overfloat issues.

-[x] Do the graphics

About

Project of AGB class: Naive Bayes model to classify tumor types gene expression patterns

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages