Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster genes #13

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,8 @@ media/heatmap.jpg

.vscode
.connection_details.yaml
*.code-workspace
*.code-workspace
# files created by Genecatalog
*/Genecatalog/Filtered_genes_cpm.tsv.gz
*/Genecatalog/Kegg_counts.tsv.gz
*/Genecatalog/Kegg_cpm.tsv.gz
676 changes: 676 additions & 0 deletions Other/gene_graph.ipynb

Large diffs are not rendered by default.

374 changes: 366 additions & 8 deletions Python/Genecatalog.ipynb

Large diffs are not rendered by default.

92 changes: 92 additions & 0 deletions R/Analyze_genecatalog.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -610,7 +610,99 @@ MUSiCC: A marker genes based framework for metagenomic normalization and accurat



# Cluster genes

The file `r genecatalog_files$geneinfo ` contains the links between the genecatalog and the assembly. *genes* refer here to the cluster of genes in the genecatalog. respecitively one representative. *orf* I refer to all predicted genes on any contig.

The orfs are named as `<sample>_<contigNr>_<OrfNr>`

```{r}
orf_info= arrow::read_parquet(genecatalog_files$geneinfo)

orf_info %>%
unite(ContigName, Sample, ContigNr, sep = "_", remove = FALSE) -> orf_info

head(orf_info)
```

```{r}
# Then, count unique 'GeneNr' values for each 'ContigName'
genes_per_contig <- orf_info %>%
group_by(ContigName) %>%
summarise(Unique_Gene_Count = n_distinct(GeneNr)) %>%
arrange(desc(Unique_Gene_Count))



ggplot(genes_per_contig, aes(x = Unique_Gene_Count)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
labs(
title = "Histogram of Unique Gene Counts",
x = "Unique Gene Counts",
y = "Frequency"
)

hist(genes_per_contig)
head(genes_per_contig)
````



```{r}
# Load the progress package
library(progress)

# Sort contigs by the number of genes in descending order
contig_gene_counts <- orf_info %>% group_by(ContigName) %>% summarize(NumGenes = n_distinct(GeneNr))
contig_gene_counts <- contig_gene_counts[order(-contig_gene_counts$NumGenes), ]

# Create a progress bar
pb <- progress_bar$new(
format = "[:bar] :percent Elapsed: :elapsed Estimated: :eta",
total = nrow(contig_gene_counts)
)

# Initialize an empty list to store all clustered_gene_lists
all_clustered_gene_lists <- list()

# Loop through each contig
for (contig in contig_gene_counts$ContigName) {
genes_of_contig <- orf_info %>% filter(ContigName == contig) %>% pull(GeneNr) %>% unique()

# Check if there are genes for the current contig

contig_gene_cov <- load_subset_of_genes(abundance_file, genes_of_contig) %>% t()
correlation_matrix <- cor(contig_gene_cov, method = "pearson")

correlation_matrix[is.na(correlation_matrix)]<-0

# Perform hierarchical clustering
clustered_genes <- hclust(dist(correlation_matrix), method = "single")

# Cut the dendrogram based on the threshold
gene_clusters <- cutree(clustered_genes, h = threshold)

# Get the gene names in each cluster
gene_names <- rownames(correlation_matrix)
clustered_gene_list <- lapply(unique(gene_clusters), function(cluster) {
genes_in_cluster <- gene_names[gene_clusters == cluster]
if (length(genes_in_cluster) > 1) {
return(genes_in_cluster)
}
})

# Remove null entries and append to the list
clustered_gene_list <- clustered_gene_list[!sapply(clustered_gene_list, is.null)]
all_clustered_gene_lists <- c(all_clustered_gene_lists, clustered_gene_list)


# Update the progress bar
pb$tick()
}

# Close the progress bar
pb$terminate()



````