diff --git a/VERSION b/VERSION
index 3e3c2f1e..eca07e4c 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-2.1.1
+2.1.2
diff --git a/docs/_static/tile_plot.png b/docs/_static/tile_plot.png
deleted file mode 100644
index 01a88cfb..00000000
Binary files a/docs/_static/tile_plot.png and /dev/null differ
diff --git a/docs/_static/tile_plot_dendro.png b/docs/_static/tile_plot_dendro.png
new file mode 100644
index 00000000..fd5a27b7
Binary files /dev/null and b/docs/_static/tile_plot_dendro.png differ
diff --git a/docs/_static/tile_plot_no_dendro.png b/docs/_static/tile_plot_no_dendro.png
new file mode 100644
index 00000000..63cfeab1
Binary files /dev/null and b/docs/_static/tile_plot_no_dendro.png differ
diff --git a/docs/user/PangenomeAnalyses/pangenomeCluster.md b/docs/user/PangenomeAnalyses/pangenomeCluster.md
index f543bf7b..6df623a5 100644
--- a/docs/user/PangenomeAnalyses/pangenomeCluster.md
+++ b/docs/user/PangenomeAnalyses/pangenomeCluster.md
@@ -124,17 +124,17 @@ flowchart TD
#### Specify the representative gene
-It's possible to indicate which gene is the representative gene by adding a column between the gene family name.
+It's possible to indicate which gene is the representative gene by adding a third column.
-Here is a minimal example of your clustering file:
+Here is a minimal example of your clustering file with the third column being the representative gene:
```
-Family_A Gene_2 Gene_1
+Family_A Gene_1 Gene_2
Family_A Gene_2 Gene_2
-Family_A Gene_2 Gene_3
+Family_A Gene_3 Gene_2
+Family_B Gene_4 Gene_4
Family_B Gene_5 Gene_4
-Family_B Gene_5 Gene_5
Family_C Gene_6 Gene_6
```
@@ -174,25 +174,42 @@ flowchart TD
#### Indicate fragmented gene
-It's also possible to indicate if the gene is fragmented, by adding a new column in last position. Fragmented gene are tag with an 'F' in the last column.
+You can indicate if a gene is fragmented by adding a new column. Fragmented genes are marked with an 'F' in this final column.
-You can add this column when you assume or not the representative gene. PPanGGOLiN will guess that this column is to precise the fragmented gene and assume if it must assert the representative gene
+The position of this column depends on whether you include a representative gene column:
+- Without a representative gene column, the fragmented gene column should be in the **third position**.
+- With a representative gene column, it should appear in the **fourth position**.
-Here is a minimal example of your clustering file with fragmented gene precise:
+##### Example 1: Clustering file without representative gene column (fragmented gene in 3rd column):
+```
+Family_A Gene_1
+Family_A Gene_2
+Family_A Gene_3 F
+Family_B Gene_4
+Family_B Gene_5
+Family_C Gene_6 F
+```
+##### Example 2: Clustering file with representative gene column (fragmented gene in 4th column):
```
-Family_A Gene_2 Gene_1
-Family_A Gene_2 Gene_2
-Family_A Gene_2 Gene_3 'F'
-Family_B Gene_5 Gene_4
-Family_B Gene_5 Gene_5
-Family_C Gene_6 Gene_6 'F'
+Family_A Gene_1 Gene_2
+Family_A Gene_2 Gene_2
+Family_A Gene_3 Gene_2 F
+Family_B Gene_4 Gene_4
+Family_B Gene_5 Gene_4
+Family_C Gene_6 Gene_6 F
```
```{warning}
-*The column order is
-important !* You must first provide the cluster identifier, the representative id, and then the gene id to finish with
-the fragmented status of the gene.
+*Attention: Column Order Matters!*
+
+Please ensure that your columns follow the correct order:
+1. Cluster identifier
+2. Gene ID
+3. Representative gene ID (if present)
+4. Fragmented status ('F' if the gene is fragmented, or leave blank)
+
+If no representative gene column is included, the fragmented status should be placed in the **third column**.
```
### Defragmentation
diff --git a/docs/user/PangenomeAnalyses/pangenomeFigures.md b/docs/user/PangenomeAnalyses/pangenomeFigures.md
index 0b17eecc..531164ad 100644
--- a/docs/user/PangenomeAnalyses/pangenomeFigures.md
+++ b/docs/user/PangenomeAnalyses/pangenomeFigures.md
@@ -17,26 +17,37 @@ ppanggolin draw -p pangenome.h5 --ucurve
#### Tile plot
-A tile plot is similar to a heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in a genome (the color depends on the number of gene copies) and uncolored if absent. The gene families are ordered by partition and then by a hierarchical clustering, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically two genomes that are close together in terms of gene family composition will be close together on the figure).
+A tile plot is a kind of heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in a genome (either in blue or red if the gene family has multiple gene copies) and uncolored if absent. The gene families are ordered by partition and then by their number of presences (increasing order), and the genomes are ordered by a hierarchical clustering based on their shared gene families via a Jaccard distance (basically two genomes that are close together in terms of gene family composition will be close together on the figure).
-This plot is quite helpful to observe potential structures in your pangenome, and can help you identify eventual outliers. You can interact with it, and mousing over a tile in the plot will indicate to you the gene identifier(s), the gene family and the genome corresponding to the tile.
+This plot is quite helpful to observe potential structures in your pangenome, and can help you identify eventual outliers. You can interact with it, and mousing over a tile in the plot will indicate the gene identifier(s), the gene family and the genome corresponding to the tile. As detailed below, additional metadata can be displayed.
-If you build your pangenome using a workflow subcommands (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 500 genomes, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.
+If you build your pangenome using a workflow subcommands (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 60k gene families, the plot will not be drawn; if you have more than 32 767 gene families, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise. Beyond the workflow subcommand, you can generate the plot with any number of gene families or genomes. However, no browser currently supports visualizing a tile plot containing more than 65 535 gene families or more than 65 535 genomes (for more information, refer to [this Stack Overflow discussion](https://stackoverflow.com/questions/78431835/plotly-heatmap-has-limit-on-data-size)
+ ).
-It can be generated using the 'draw' subcommand as such :
+To generate a tile plot, use the 'draw' subcommand as follows:
```bash
ppanggolin draw -p pangenome.h5 --tile_plot
```
-![tile plot figure](../../_static/tile_plot.png)
+![Tile plot figure](../../_static/tile_plot_no_dendro.png)
-and if you do not want the 'cloud' gene families as it is a lot of data and can be hard to open with a browser sometimes, you can use the following option :
+If you prefer not to include 'cloud' gene families, which can make the plot difficult to render in a browser, you can use the `--nocloud` option:
```bash
ppanggolin draw -p pangenome.h5 --tile_plot --nocloud
```
+To include a dendrogram on top of the tile plot, use the `--add_dendrogram` option:
+
+```bash
+ppanggolin draw -p pangenome.h5 --tile_plot --add_dendrogram
+```
+
+![Tile plot with dendrogram](../../_static/tile_plot_dendro.png)
+
+If you have added metadata to the gene elements of your pangenome (see [metadata documentation](../metadata.md) for details), you can display this metadata in the hover text by using the `--add_metadata` argument.
+
#### Rarefaction curve
This figure is not drawn by default in the 'workflow' subcommand as it requires a lot of computations. It represents the evolution of the number of gene families for each partition as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of the diversity that you are missing with your dataset on your taxonomic group (Tettelin et al., 2005). The idea is that if at some point when you keep adding genomes to your pangenome you do not add any more gene families, you might have access to your entire taxonomic group's diversity. On the contrary, if you are still adding a lot of genes you may be still missing a lot of gene families.
diff --git a/docs/user/PangenomeAnalyses/pangenomeStat.md b/docs/user/PangenomeAnalyses/pangenomeStat.md
index e0512997..04d1c7c2 100644
--- a/docs/user/PangenomeAnalyses/pangenomeStat.md
+++ b/docs/user/PangenomeAnalyses/pangenomeStat.md
@@ -144,10 +144,19 @@ To generate these files, use the `write_pangenome` subcommand with the `--partit
`ppanggolin write_pangenome -p pangenome.h5 --partitions`
-#### Gene Families to Genes Associations
-The `gene_families.tsv` file mirrors the format provided by [MMseqs2](https://github.com/soedinglab/MMseqs2) through its `createtsv` subcommand. This file structure comprises three columns: the gene family name in the first column, the gene names in the second, and a third column that remains empty or contains an "F" to denote potential gene fragments instead of complete genes. This indication appears only if the [defragmentation](./pangenomeCluster.md#defragmentation) pipeline has been used.
+#### Gene Families to Gene Associations
+
+The `gene_families.tsv` file consists of four columns:
+
+1. **Gene Family ID**: The identifier for the gene family.
+2. **Gene ID**: The identifier for the gene.
+3. **Local ID** (if applicable): This column is used when gene IDs from annotation files are not unique. In such cases, `ppanggolin` assigns an internal ID to genes, and this column helps map the internal ID to the local ID from the annotation file.
+4. **Fragmentation Status**: This column indicates whether the gene is fragmented. It is either empty or contains an "F" to signify potential gene fragments instead of complete genes. This status is provided only if the [defragmentation](./pangenomeCluster.md#defragmentation) pipeline has been used, which is the default behavior.
To generate this file, use the `write_pangenome` subcommand with the `--families_tsv` flag:
-`ppanggolin write_pangenome -p pangenome.h5 --families_tsv`
+```bash
+ppanggolin write_pangenome -p pangenome.h5 --families_tsv
+```
+
diff --git a/ppanggolin/RGP/rgp_cluster.py b/ppanggolin/RGP/rgp_cluster.py
index 078dc19a..fa9b7218 100644
--- a/ppanggolin/RGP/rgp_cluster.py
+++ b/ppanggolin/RGP/rgp_cluster.py
@@ -132,7 +132,7 @@ def compute_grr(rgp_a_families: Set[GeneFamily], rgp_b_families: Set[GeneFamily]
:return: GRR value between 0 and 1
"""
- grr = len((rgp_a_families & rgp_b_families)) / mode(len(rgp_a_families), len(rgp_b_families))
+ grr = len(rgp_a_families & rgp_b_families) / mode(len(rgp_a_families), len(rgp_b_families))
return grr
@@ -147,7 +147,7 @@ def compute_jaccard_index(rgp_a_families: set, rgp_b_families: set) -> float:
:return : Jaccard index
"""
- jaccard_index = len((rgp_a_families & rgp_b_families)) / len(rgp_a_families | rgp_b_families)
+ jaccard_index = len(rgp_a_families & rgp_b_families) / len(rgp_a_families | rgp_b_families)
return jaccard_index
@@ -351,7 +351,7 @@ def dereplicate_rgp(rgps: Set[Union[Region, IdenticalRegions]],
families_to_rgps = defaultdict(list)
for rgp in tqdm(rgps, total=len(rgps), unit="RGP", disable=disable_bar):
- families_to_rgps[tuple(sorted((f.ID for f in rgp.families)))].append(rgp)
+ families_to_rgps[tuple(sorted(f.ID for f in rgp.families))].append(rgp)
dereplicated_rgps = []
identical_region_count = 0
@@ -362,7 +362,7 @@ def dereplicate_rgp(rgps: Set[Union[Region, IdenticalRegions]],
families = set(rgps[0].families)
# identical regions object is considered on a contig border if all rgp are contig border
- is_contig_border = all([rgp.is_contig_border for rgp in rgps])
+ is_contig_border = all(rgp.is_contig_border for rgp in rgps)
# create a new object that will represent the identical rgps
identical_rgp = IdenticalRegions(name=f"identical_rgps_{identical_region_count}",
@@ -548,7 +548,7 @@ def cluster_rgp(pangenome, grr_cutoff: float, output: str, basename: str,
dereplicated_rgps = dereplicate_rgp(valid_rgps, disable_bar=disable_bar)
grr_graph = nx.Graph()
- grr_graph.add_nodes_from((rgp.ID for rgp in dereplicated_rgps))
+ grr_graph.add_nodes_from(rgp.ID for rgp in dereplicated_rgps)
# Get all pairs of RGP that share at least one family
diff --git a/ppanggolin/align/alignOnPang.py b/ppanggolin/align/alignOnPang.py
index 6b73fe7d..3cbbf35b 100644
--- a/ppanggolin/align/alignOnPang.py
+++ b/ppanggolin/align/alignOnPang.py
@@ -116,7 +116,7 @@ def map_input_gene_to_family_all_aln(aln_res: Path, outdir: Path,
aln_file_clean = outdir / "alignment_input_seqs_to_all_pangenome_genes.tsv" # write the actual result file
logging.getLogger("PPanGGOLiN").debug(f'Writing alignment file in {aln_file_clean}')
- with open(aln_res, "r") as alnFile, open(aln_file_clean, "w") as aln_outfl:
+ with open(aln_res) as alnFile, open(aln_file_clean, "w") as aln_outfl:
for line in alnFile:
line_splitted = line.split()
@@ -151,7 +151,7 @@ def map_input_gene_to_family_rep_aln(aln_res: Path, outdir: Path,
logging.getLogger("PPanGGOLiN").debug(f'Writing alignment file in {aln_file_clean}')
- with open(aln_res, "r") as alnFile, open(aln_file_clean, "w") as aln_outfl:
+ with open(aln_res) as alnFile, open(aln_file_clean, "w") as aln_outfl:
for line in alnFile:
line_splitted = line.split()
diff --git a/ppanggolin/annotate/annotate.py b/ppanggolin/annotate/annotate.py
index 187d4208..a0d49b52 100644
--- a/ppanggolin/annotate/annotate.py
+++ b/ppanggolin/annotate/annotate.py
@@ -73,7 +73,7 @@ def create_gene(org: Organism, contig: Contig, gene_counter: int, rna_counter: i
start, stop = coordinates[0][0], coordinates[-1][1]
- if any((dbxref.startswith('MaGe:') or dbxref.startswith('SEED:') for dbxref in dbxrefs)):
+ if any(dbxref.startswith('MaGe:') or dbxref.startswith('SEED:') for dbxref in dbxrefs):
if gene_name == "":
gene_name = gene_id
@@ -378,7 +378,7 @@ def combine_contigs_metadata(contig_to_metadata: Dict[str, Dict[str, str]]) -> T
all_tag_to_value.count(tag_and_value) == contig_count}
# Create a dictionary for shared metadata
- genome_metadata = {tag: value for tag, value in shared_tag_and_values}
+ genome_metadata = dict(shared_tag_and_values)
contig_to_uniq_metadata = {}
for contig, tag_to_value in contig_to_metadata.items():
@@ -793,7 +793,10 @@ def check_chevrons_in_start_and_stop(start: str, stop: str) -> Tuple[int, int, b
correct_putative_overlaps(org.contigs)
# GET THE FASTA SEQUENCES OF THE GENES
- if has_fasta and fasta_string != "":
+ if fasta_string == "":
+ has_fasta = False
+
+ if has_fasta:
contig_sequences = read_fasta(org, fasta_string.split('\n')) # _ is total contig length
for contig in org.contigs:
if contig.length != len(contig_sequences[contig.name]):
@@ -1073,9 +1076,10 @@ def read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, p
futures.append(future)
for future in futures:
- org, flag = future.result()
+ org, has_dna_sequence = future.result()
pangenome.add_organism(org)
- if not flag:
+
+ if not has_dna_sequence:
pangenome.status["geneSequences"] = "No"
# decide whether we use local ids or ppanggolin ids.
@@ -1093,11 +1097,11 @@ def read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, p
pangenome.parameters["annotate"]["use_pseudo"] = pseudo
pangenome.parameters["annotate"]["# read_annotations_from_file"] = True
- if any((genome.has_metadata() for genome in pangenome.organisms)):
+ if any(genome.has_metadata() for genome in pangenome.organisms):
pangenome.status["metadata"]["genomes"] = "Computed"
pangenome.status["metasources"]["genomes"].append("annotation_file")
- if any((contig.has_metadata() for contig in pangenome.contigs)):
+ if any(contig.has_metadata() for contig in pangenome.contigs):
pangenome.status["metadata"]["contigs"] = "Computed"
pangenome.status["metasources"]["contigs"].append("annotation_file")
@@ -1151,7 +1155,7 @@ def get_gene_sequences_from_fastas(pangenome: Pangenome, fasta_files: Path, disa
msg = f"Fasta file for genome {org.name} did not have the contig {contig.name} " \
f"that was read from the annotation file. "
msg += f"The provided contigs in the fasta were : " \
- f"{', '.join([contig for contig in fasta_dict[org].keys()])}."
+ f"{', '.join(fasta_dict[org].keys())}."
raise KeyError(msg)
pangenome.status["geneSequences"] = "Computed"
diff --git a/ppanggolin/cluster/cluster.py b/ppanggolin/cluster/cluster.py
index b16c0a5f..83557cc5 100644
--- a/ppanggolin/cluster/cluster.py
+++ b/ppanggolin/cluster/cluster.py
@@ -9,6 +9,7 @@
from typing import Tuple, Dict, Set
from pathlib import Path
import time
+import gzip
# installed libraries
from networkx import Graph
@@ -116,7 +117,7 @@ def read_faa(faa_file_name: Path) -> Dict[str, str]:
"""
fam2seq = {}
head = ""
- with open(faa_file_name, "r") as faaFile:
+ with open(faa_file_name) as faaFile:
for line in faaFile:
if line.startswith('>'):
head = line[1:].strip().replace("ppanggolin_", "") # remove the eventual addition
@@ -160,7 +161,7 @@ def read_tsv(tsv_file_name: Path) -> Tuple[Dict[str, Tuple[str, bool]], Dict[str
"""
genes2fam = {}
fam2genes = defaultdict(set)
- with open(tsv_file_name, "r") as tsvfile:
+ with open(tsv_file_name) as tsvfile:
for line in tsvfile:
line = line.replace('"', '').replace("ppanggolin_", "").split()
# remove the '"' char which protects the fields, and the eventual addition
@@ -188,7 +189,7 @@ def refine_clustering(tsv: Path, aln_file: Path,
simgraph.add_node(fam, nbgenes=len(genes))
# add the edges
- with open(aln_file, "r") as alnfile:
+ with open(aln_file) as alnfile:
for line in alnfile:
line = line.replace('"', '').replace("ppanggolin_", "").split() # remove the eventual addition
@@ -341,42 +342,71 @@ def mk_local_to_gene(pangenome: Pangenome) -> dict:
return {} # local identifiers are not unique.
return local_dict
-
def infer_singletons(pangenome: Pangenome):
- """Creates a new family for each gene with no associated family
+ """
+ Creates a new family for each gene with no associated family.
- :param pangenome: Input pangenome
+ :param pangenome: Input pangenome object
"""
singleton_counter = 0
for gene in pangenome.genes:
if gene.family is None:
+ # Create a new family for the singleton gene
fam = GeneFamily(family_id=pangenome.max_fam_id, name=gene.ID)
+ fam.representative = gene
fam.add(gene)
- pangenome.add_gene_family(fam)
+
+
+
+ # Try to add the new family
+ try:
+ pangenome.add_gene_family(fam)
+ except KeyError:
+ raise KeyError(
+ f"Cannot create singleton family with name='{fam.name}' for gene '{gene.ID}': "
+ f"A family with the same name already exists. Check the gene '{gene.ID}' in input cluster file."
+ )
+
singleton_counter += 1
+
logging.getLogger("PPanGGOLiN").info(f"Inferred {singleton_counter} singleton families")
def get_family_representative_sequences(pangenome: Pangenome, code: int = 11, cpu: int = 1,
tmpdir: Path = None, keep_tmp: bool = False):
+
+ logging.getLogger("PPanGGOLiN").info("Retrieving protein sequences of family representatives.")
+
tmpdir = Path(tempfile.gettempdir()) if tmpdir is None else tmpdir
with create_tmpdir(tmpdir, "get_proteins_sequences", keep_tmp) as tmp:
- repres_path = tmp / "representative.fna"
- with open(repres_path, "w") as repres_seq:
+
+ repres_path = tmp / "representative.fna.gz"
+
+ with gzip.open(repres_path, mode="wt") as repres_seq:
+
for family in pangenome.gene_families:
+
+ if family.representative.dna is None:
+ raise ValueError(f'DNA sequence of representative gene {family.representative} is None. '
+ 'Sequence may not have been loaded correctly from the pangenome file or the pangenome has no gene sequences.')
+
repres_seq.write(f">{family.name}\n")
repres_seq.write(f"{family.representative.dna}\n")
+
translate_db = translate_genes(sequences=repres_path, tmpdir=tmp, cpu=cpu,
is_single_line_fasta=True, code=code)
+
outpath = tmp / "representative_protein_genes.fna"
cmd = list(map(str, ["mmseqs", "convert2fasta", translate_db, outpath]))
run_subprocess(cmd, msg="MMSeqs convert2fasta failed with the following error:\n")
- with open(outpath, "r") as repres_prot:
+
+ with open(outpath) as repres_prot:
lines = repres_prot.readlines()
while len(lines) > 0:
family_name = lines.pop(0).strip()[1:]
family_seq = lines.pop(0).strip()
family = pangenome.get_gene_family(family_name)
+
family.add_sequence(family_seq)
@@ -399,34 +429,56 @@ def read_clustering_file(families_tsv_path: Path) -> Tuple[pd.DataFrame, bool]:
:return: The processed DataFrame and a boolean indicating if any gene is marked as fragmented.
"""
- logging.getLogger("PPanGGOLiN").info(f"Reading {families_tsv_path.name} the gene families file ...")
+ logging.getLogger("PPanGGOLiN").info(
+ f"Reading clustering file to group genes into families: {families_tsv_path.as_posix()}"
+ )
+
+ # Detect compression type if any
_, compress_type = is_compressed(families_tsv_path)
- families_df = pd.read_csv(families_tsv_path, sep="\t", header=None,
- compression=compress_type if compress_type is not None else 'infer') # Set as infer to manage other compression type
+
+ # Read the file with inferred compression if necessary
+ families_df = pd.read_csv(
+ families_tsv_path,
+ sep="\t",
+ header=None,
+ compression=compress_type if compress_type is not None else 'infer',
+ dtype=str
+ )
+
+ # Process DataFrame based on the number of columns
if families_df.shape[1] == 2:
families_df.columns = ["family", "gene"]
families_df["representative"] = families_df.groupby('family')['gene'].transform('first')
families_df["is_frag"] = False
+
elif families_df.shape[1] == 3:
- if families_df[2].dropna().unique().tolist() == ['F']:
+ # Check if the third column is 'is_frag'
+ if families_df[2].dropna().eq('F').all():
families_df.columns = ["family", "gene", "is_frag"]
- families_df["representative"] = families_df.groupby('family')['gene'].transform('first')
families_df["is_frag"] = families_df["is_frag"].replace('F', True).fillna(False)
+ families_df["representative"] = families_df.groupby('family')['gene'].transform('first')
else:
families_df.columns = ["family", "gene", "representative"]
families_df["is_frag"] = False
+
elif families_df.shape[1] == 4:
families_df.columns = ["family", "representative", "gene", "is_frag"]
+
else:
- if families_df.shape[1] == 1:
- raise ValueError("Only one column found. This might be due to "
- "no tabulation separator found in gene families file")
- else:
- raise ValueError("Too much columns. You must at least give in first column the family identifier and "
- "as second column the gene identifier. More information in the documentation")
- if families_df["gene"].unique().shape[0] < families_df["gene"].shape[0]:
+ raise ValueError(
+ f"Unexpected number of columns ({families_df.shape[1]}). The file must have 2, 3, or 4 columns."
+ )
+
+ # Ensure columns are strings
+ families_df["family"] = families_df["family"].astype(str)
+ families_df["gene"] = families_df["gene"].astype(str)
+ families_df["representative"] = families_df["representative"].astype(str)
+
+ # Check for duplicate gene IDs
+ if families_df["gene"].duplicated().any():
raise Exception("It seems that there is duplicated gene id in your clustering.")
- return families_df[["family", "representative", "gene", "is_frag"]], bool(families_df["is_frag"].any())
+
+ return families_df[["family", "representative", "gene", "is_frag"]], families_df["is_frag"].any()
def read_clustering(pangenome: Pangenome, families_tsv_path: Path, infer_singleton: bool = False,
@@ -447,11 +499,19 @@ def read_clustering(pangenome: Pangenome, families_tsv_path: Path, infer_singlet
:param disable_bar: Allow to disable progress bar
"""
check_pangenome_former_clustering(pangenome, force)
- check_pangenome_info(pangenome, need_annotations=True, need_gene_sequences=True, disable_bar=disable_bar)
+
+ if pangenome.status["geneSequences"] == "No":
+ need_gene_sequences=False
+ else:
+ need_gene_sequences = True
+
+ check_pangenome_info(pangenome, need_annotations=True, need_gene_sequences=need_gene_sequences, disable_bar=disable_bar)
families_df, frag = read_clustering_file(families_tsv_path)
+
nb_gene_with_fam = 0
local_dict = mk_local_to_gene(pangenome)
+
def get_gene_obj(identifier):
try:
gene_obj = pangenome.get_gene(identifier)
@@ -460,18 +520,32 @@ def get_gene_obj(identifier):
return gene_obj
for _, row in tqdm(families_df.iterrows(), total=families_df.shape[0], unit="line", disable=disable_bar):
- fam_id, reprez_id, gene_id, is_frag = row['family'], row['representative'], row['gene'], row['is_frag']
+
+ fam_id, reprez_id, gene_id, is_frag = str(row['family']), str(row['representative']), str(row['gene']), bool(row['is_frag'])
+
gene = get_gene_obj(gene_id)
+
if gene is not None:
nb_gene_with_fam += 1
+
try:
fam = pangenome.get_gene_family(fam_id)
+
except KeyError: # Family not found so create and add
fam = GeneFamily(pangenome.max_fam_id, fam_id)
- fam.representative = get_gene_obj(reprez_id)
+ representative_gene = get_gene_obj(reprez_id)
+ if representative_gene is None:
+ raise KeyError(f"The gene {reprez_id} associated to family {fam_id} from the clustering file is not found in pangenome.")
+
+ fam.representative = representative_gene
+
pangenome.add_gene_family(fam)
gene.is_fragment = is_frag
fam.add(gene)
+ else:
+ raise KeyError(f"The gene {gene_id} associated to family {fam_id} from the clustering file is not found in pangenome.")
+
+
if nb_gene_with_fam < pangenome.number_of_genes: # not all genes have an associated cluster
if nb_gene_with_fam == 0:
raise Exception("No gene ID in the cluster file matched any gene ID from the annotation step."
@@ -487,7 +561,10 @@ def get_gene_obj(identifier):
f"You can either update your cluster file to ensure each gene has a cluster assignment, "
f"or use the '--infer_singletons' option to automatically infer a cluster for each non-clustered gene."
)
- get_family_representative_sequences(pangenome, code, cpu, tmpdir, keep_tmp)
+ if pangenome.status["geneSequences"] == "No":
+ logging.getLogger("PPanGGOLiN").info("The pangenome has no gene sequences so it is not possible to extract sequence of family representatives.")
+ else:
+ get_family_representative_sequences(pangenome, code, cpu, tmpdir, keep_tmp)
pangenome.status["genesClustered"] = "Computed"
if frag: # if there was fragment information in the file.
diff --git a/ppanggolin/context/searchGeneContext.py b/ppanggolin/context/searchGeneContext.py
index f86baf85..71f4f644 100644
--- a/ppanggolin/context/searchGeneContext.py
+++ b/ppanggolin/context/searchGeneContext.py
@@ -168,7 +168,7 @@ def search_gene_context_in_pangenome(pangenome: Pangenome, output: Path, sequenc
if len(gene_contexts) != 0:
logging.getLogger().info(
- f"There are {sum((len(gc) for gc in gene_contexts))} families among {len(gene_contexts)} gene contexts")
+ f"There are {sum(len(gc) for gc in gene_contexts)} families among {len(gene_contexts)} gene contexts")
output_file = output / "gene_contexts.tsv"
export_context_to_dataframe(gene_contexts, family_2_input_seqid, families_of_interest, output_file)
@@ -323,7 +323,7 @@ def compute_edge_metrics(context_graph: nx.Graph, gene_proportion_cutoff: float)
mean_transitivity = sum(
(transitivity * counter for transitivity, counter in transitivity_counter.items())) / sum(
- (counter for counter in transitivity_counter.values()))
+ counter for counter in transitivity_counter.values())
data['mean_transitivity'] = mean_transitivity
@@ -527,12 +527,10 @@ def compute_gene_context_graph(families: Iterable[GeneFamily], transitive: int =
for cc in nx.connected_components(contig_graph):
# If gene families are in the same connected component for the contig graph,
# they exist in the same context in at least one genome
- combination = []
- for family in cc.intersection({gene.family for gene in genes_of_interest}):
- # Family here are family of interest for the context and in the same connected component
- combination.append(family)
+ combination = list(cc.intersection({gene.family for gene in genes_of_interest}))
+ # Family here are family of interest for the context and in the same connected component
+
combs2orgs[frozenset(combination)].add(contig.organism)
-
return context_graph, combs2orgs
diff --git a/ppanggolin/figures/drawing.py b/ppanggolin/figures/drawing.py
index 41897dfc..408f6dfc 100644
--- a/ppanggolin/figures/drawing.py
+++ b/ppanggolin/figures/drawing.py
@@ -43,7 +43,9 @@ def launch(args: argparse.Namespace):
pangenome = Pangenome()
pangenome.add_file(args.pangenome)
if args.tile_plot:
- draw_tile_plot(pangenome, args.output, args.nocloud, disable_bar=args.disable_prog_bar)
+ draw_tile_plot(pangenome, args.output, args.nocloud, draw_dendrogram=args.add_dendrogram, disable_bar=args.disable_prog_bar,
+ add_metadata=args.add_metadata,
+ metadata_sources=args.metadata_sources)
if args.ucurve:
draw_ucurve(pangenome, args.output, soft_core=args.soft_core, disable_bar=args.disable_prog_bar)
if args.draw_spots:
@@ -81,8 +83,7 @@ def parser_draw(parser: argparse.ArgumentParser):
help="Output directory")
optional.add_argument("--tile_plot", required=False, default=False, action="store_true",
help="draw the tile plot of the pangenome")
- optional.add_argument("--nocloud", required=False, default=False, action="store_true",
- help="Do not draw the cloud in the tile plot")
+
optional.add_argument("--soft_core", required=False, default=0.95, type=restricted_float,
help="Soft core threshold to use")
optional.add_argument("--ucurve", required=False, default=False, action="store_true",
@@ -91,6 +92,31 @@ def parser_draw(parser: argparse.ArgumentParser):
help="draw plots for spots of the pangenome")
optional.add_argument("--spots", required=False, default='all', nargs='+',
help="a comma-separated list of spots to draw (or 'all' to draw all spots, or 'synteny' to draw spots with different RGP syntenies).")
+
+ optional.add_argument("--nocloud", required=False, default=False, action="store_true",
+ help="Do not draw the cloud genes in the tile plot")
+ optional.add_argument(
+ "--add_dendrogram",
+ required=False,
+ default=False,
+ action="store_true",
+ help="Include a dendrogram for genomes in the tile plot based on the presence/absence of gene families."
+ )
+
+ optional.add_argument(
+ "--add_metadata",
+ required=False,
+ default=False,
+ action="store_true",
+ help="Display gene metadata as hover text for each cell in the tile plot."
+ )
+
+ optional.add_argument("--metadata_sources",
+ default=None,
+ nargs="+",
+ help="Which source of metadata should be written in the tile plot. "
+ "By default all metadata sources are included.")
+
if __name__ == '__main__':
diff --git a/ppanggolin/figures/tile_plot.py b/ppanggolin/figures/tile_plot.py
index ce3f2cf9..3c416105 100644
--- a/ppanggolin/figures/tile_plot.py
+++ b/ppanggolin/figures/tile_plot.py
@@ -4,176 +4,459 @@
import logging
from collections import defaultdict
from pathlib import Path
+from itertools import cycle
+from typing import List, Tuple, Dict, Set, Optional
# installed libraries
-import numpy
-from scipy.spatial.distance import pdist
from scipy.sparse import csc_matrix
-from scipy.cluster.hierarchy import linkage, dendrogram
import plotly.graph_objs as go
-import plotly.offline as out_plotly
-import colorlover as cl
+import plotly.figure_factory as ff
+from plotly.subplots import make_subplots
+import numpy as np
# local libraries
from ppanggolin.formats import check_pangenome_info
+from ppanggolin.genome import Organism
from ppanggolin.pangenome import Pangenome
from ppanggolin.utils import jaccard_similarities
+def draw_tile_plot(pangenome: Pangenome,
+ output: Path,
+ nocloud: bool = False,
+ draw_dendrogram: bool = False,
+ add_metadata:bool=False,
+ metadata_sources:Optional[Set[str]]=None,
+ disable_bar: bool = False):
+ """
+ Draw a tile plot from a partitioned pangenome.
-def draw_tile_plot(pangenome: Pangenome, output: Path, nocloud: bool = False, disable_bar: bool = False):
+ :param pangenome: Partitioned pangenome.
+ :param output: Path to the output directory where the tile plot will be saved.
+ :param nocloud: If True, exclude the cloud partition from the plot.
+ :param draw_dendrogram: If True, include a dendrogram in the tile plot.
+ :param disable_bar: If True, disable the progress bar during processing.
"""
- Draw a tile plot from a partitioned pangenome
+
+ # Check if the pangenome has the required information and is partitioned
+ check_pangenome_info(pangenome, need_annotations=True, need_families=True, need_graph=True, disable_bar=disable_bar, need_metadata=add_metadata, sources=metadata_sources)
+ if pangenome.status["partitioned"] == "No":
+ raise Exception("Cannot draw the tile plot as the pangenome has not been partitioned.")
+
+ # Warn if there are more than 32767 genomes, as the output might be too large for browsers to handle
+ if pangenome.number_of_organisms > 32767:
+ logging.getLogger("PPanGGOLiN").warning(
+ "You requested to draw a tile plot for a large number of genomes (>32k). "
+ "This may result in a plot that is too large for web browsers to render."
+ )
+ if pangenome.number_of_gene_families > 32767 and not nocloud:
+ logging.getLogger("PPanGGOLiN").warning(
+ "You requested to draw a tile plot for a pangenome with a large number of families (>32k). "
+ "This may result in a plot that is too large for web browsers to render."
+ "You can use the --nocloud flag to exclude cloud families from the plot. "
+ )
+
+
+ logging.getLogger("PPanGGOLiN").info("Starting the process of drawing the tile plot...")
+
+ # Prepare the data structures required for generating the tile plot
+ families, org_index = prepare_data_structures(pangenome, nocloud)
+
+ # Build the presence-absence matrix for the families and generate the dendrogram if required
+ mat_p_a = build_presence_absence_matrix(families, org_index)
+ order_organisms, dendrogram_fig = generate_dendrogram(mat_p_a, org_index)
+
+ # Process the data to be displayed in the tile plot
+ binary_data, text_data, fam_order, separators = process_tile_data(families, order_organisms)
+
+ # Create the tile plot figure with or without the dendrogram
+ fig = create_tile_plot(binary_data, text_data, fam_order, separators, order_organisms, dendrogram_fig, draw_dendrogram)
+
+ # Save the plot to the specified output directory
+ filename = output / "tile_plot.html"
+ fig.write_html(filename)
- :param pangenome: Partitioned pangenome
- :param output: Path to output directory
- :param nocloud: Do not draw the cloud partition
- :param disable_bar: Allow to disable progress bar
+ logging.getLogger("PPanGGOLiN").info(f"Tile plot successfully created and saved to: '{filename}'.")
+
+ return fig
+
+def prepare_data_structures(pangenome: Pangenome, nocloud: bool) -> Tuple[set, dict]:
"""
+ Prepare data structures required for generating the tile plot.
- check_pangenome_info(pangenome, need_annotations=True, need_families=True, need_graph=True, disable_bar=disable_bar)
- if pangenome.status["partitioned"] == "No":
- raise Exception("Cannot draw the tile plot as your pangenome has not been partitioned")
- if pangenome.number_of_organisms > 500 and nocloud is False:
- logging.getLogger("PPanGGOLiN").warning("You asked to draw a tile plot for a lot of genomes (>500). "
- "Your browser will probably not be able to open it.")
- logging.getLogger("PPanGGOLiN").info("Drawing the tile plot...")
- data = []
- all_indexes = []
- all_columns = []
- fam2index = {}
- index2fam = {}
+ :param pangenome: Partitioned pangenome containing gene families and organism data.
+ :param nocloud: If True, exclude gene families belonging to the cloud partition.
+ :return: A tuple containing a set of gene families to be plotted and a dictionary mapping organisms to their indices.
+ """
+
+ # Exclude gene families in the cloud partition if 'nocloud' is True; otherwise, include all gene families
if nocloud:
families = {fam for fam in pangenome.gene_families if not fam.partition.startswith("C")}
else:
families = set(pangenome.gene_families)
+
+ # Get the organism index mapping from the pangenome
org_index = pangenome.get_org_index()
- index2org = {}
- for org, index in org_index.items():
- index2org[index] = org
- colors = {"pangenome": "black", "exact_accessory": "#EB37ED", "exact_core": "#FF2828", "soft_core": "#c7c938",
- "soft_accessory": "#996633", "shell": "#00D860", "persistent": "#F7A507", "cloud": "#79DEFF",
- "undefined": "#828282"}
+ return families, org_index
+
+def build_presence_absence_matrix(families: set, org_index: dict) -> csc_matrix:
+ """
+ Build the presence-absence matrix for gene families.
- logging.getLogger("PPanGGOLiN").info("start with matrice")
+ This matrix indicates the presence (1) or absence (0) of each gene family across different organisms.
+ :param families: A set of gene families to be included in the matrix.
+ :param org_index: A dictionary mapping each organism to its respective index in the matrix.
+ :return: A sparse matrix (Compressed Sparse Column format) representing the presence-absence of gene families.
+ """
+
+ # Initialize lists to store matrix data in a sparse format
+ data, all_indexes, all_columns = [], [], []
+
+ # Iterate through each gene family to populate the presence-absence matrix
for row, fam in enumerate(families):
+ # Find the indices of organisms that have the current gene family
new_col = [org_index[org] for org in fam.organisms]
- all_indexes.extend([row] * len(new_col))
- all_columns.extend(new_col)
- data.extend([1.0] * len(new_col))
- index2fam[row] = fam.name
- fam2index[fam.name] = row
+ all_indexes.extend([row] * len(new_col)) # Row index repeated for each presence
+ all_columns.extend(new_col) # Corresponding column indices for the organisms
+ data.extend([1.0] * len(new_col)) # All presences are marked with 1.0
+
+ # Construct the presence-absence matrix using Compressed Sparse Column format
+ mat_p_a = csc_matrix((data, (all_indexes, all_columns)), shape=(len(families), len(org_index)), dtype='float')
+
+ return mat_p_a
+
+def generate_dendrogram(mat_p_a: csc_matrix, org_index: dict) -> Tuple[List, go.Figure]:
+ """
+ Generate the order of organisms based on a dendrogram.
- mat_p_a = csc_matrix((data, (all_indexes, all_columns)), shape=(len(families), pangenome.number_of_organisms),
- dtype='float')
- dist = pdist(1 - jaccard_similarities(mat_p_a, 0).todense())
- hc = linkage(dist, 'single')
+ :param mat_p_a: Sparse matrix representing the presence-absence of gene families.
+ :param org_index: Dictionary mapping organism names to their respective indices in the matrix.
+ :return: A tuple containing the ordered list of organisms and the dendrogram figure.
+ """
+
+ # Extract organism names from the org_index dictionary
+ genom_names = [org.name for org in org_index]
- dendro_org = dendrogram(hc, no_plot=True)
- logging.getLogger("PPanGGOLiN").info("done with making the dendrogram to order the genomes on the plot")
+ # Create a mapping from organism names to organism objects
+ name_to_org = {org.name: org for org in org_index}
- order_organisms = [index2org[index] for index in dendro_org["leaves"]]
+ # Compute the distance matrix using Jaccard similarity
+ distance_matrice = 1 - jaccard_similarities(mat_p_a, 0).todense()
- binary_data = []
- text_data = []
- fam_order = []
+ # Create a dendrogram figure using the computed distance matrix
+ dendrogram_fig = ff.create_dendrogram(distance_matrice, labels=genom_names, orientation='bottom')
+
+ # Adjust the dendrogram figure to make it match with the heatmap later on
+ for i in range(len(dendrogram_fig['data'])):
+ dendrogram_fig['data'][i]['yaxis'] = 'y2' # Aligns dendrogram data on a secondary y-axis
+ dendrogram_fig['data'][i]['showlegend'] = False # Hides legends in the dendrogram
+
+ # Extract the ordered list of organisms from the dendrogram tick labels
+ order_organisms = [name_to_org[org_name] for org_name in dendrogram_fig['layout']['xaxis']['ticktext']]
+
+ return order_organisms, dendrogram_fig
+
+
+def process_tile_data(families: set, order_organisms: List) -> Tuple[List[List[float]], List[List[str]], List[str], List[Tuple[str, float]]]:
+ """
+ Process data for each tile in the plot.
+
+ :param families: A set of gene families to be processed.
+ :param order_organisms: The ordered list of organisms for the tile plot.
+ :return: A tuple containing binary data, text data, family order, and separators for the plot.
+ """
+ binary_data, text_data, fam_order = [], [], []
partitions_dict = defaultdict(list)
shell_subs = set()
+
+ # Group families by partition and identify shell subpartitions
for fam in families:
partitions_dict[fam.partition].append(fam)
if fam.partition.startswith("S"):
- shell_subs.add(fam.partition) # number of elements will tell the number of subpartitions
- ordered_nodes_p = sorted(partitions_dict["P"], key=lambda n: n.number_of_organisms, reverse=True)
- ordered_nodes_c = sorted(partitions_dict["C"], key=lambda n: n.number_of_organisms, reverse=True)
- sep_p = len(ordered_nodes_p) - 0.5
- separators = [sep_p]
- shell_na = None
- if len(shell_subs) == 1:
- ordered_nodes_s = sorted(partitions_dict[shell_subs.pop()], key=lambda n: n.number_of_organisms, reverse=True)
- ordered_nodes = ordered_nodes_p + ordered_nodes_s + ordered_nodes_c
- separators.append(separators[len(separators) - 1] + len(ordered_nodes_s))
- separators.append(separators[len(separators) - 1] + len(ordered_nodes_c))
- else:
- ordered_nodes = ordered_nodes_p
- for subpartition in sorted(shell_subs):
- if subpartition == "S_":
- shell_na = len(separators) - 1
- ordered_nodes_s = sorted(partitions_dict[subpartition], key=lambda n: n.number_of_organisms, reverse=True)
- ordered_nodes += ordered_nodes_s
- separators.append(separators[len(separators) - 1] + len(ordered_nodes_s))
- ordered_nodes += ordered_nodes_c
- separators.append(separators[len(separators) - 1] + len(ordered_nodes_c))
-
- logging.getLogger("PPanGGOLiN").info("Getting the gene name(s) and the number for each tile of the plot ...")
+ shell_subs.add(fam.partition)
+
+ ordered_nodes, separators = order_nodes(partitions_dict, shell_subs)
+
+ # Populate binary and text data for each family
for node in ordered_nodes:
fam_order.append(node.name)
data = set(node.organisms)
- binary_data.append([len(list(node.get_genes_per_org(org))) if org in data else numpy.nan for org in order_organisms])
- text_data.append([("\n".join(map(str, node.get_genes_per_org(org))))
- if org in data else numpy.nan for org in order_organisms])
+ binary_data.append([len(list(node.get_genes_per_org(org))) if org in data else np.nan for org in order_organisms])
+ text_data.append([("\n".join(map(str, node.get_genes_per_org(org))) if org in data else np.nan) for org in order_organisms])
- xaxis_values = [org.name for org in order_organisms]
+ # Generate hover text for the heatmap
+ text_data = get_heatmap_hover_text(ordered_nodes, order_organisms)
+
+ return binary_data, text_data, fam_order, separators
+
+
+def order_nodes(partitions_dict: dict, shell_subs: set) -> Tuple[List, List[Tuple[str, float]]]:
+ """
+ Order gene families based on their partitions.
+
+ :param partitions_dict: A dictionary where keys are partition names and values are lists of gene families in each partition.
+ :param shell_subs: A set of shell subpartition names.
+ :return: A tuple containing the ordered list of gene families and a list of partition separators.
+ """
+
+ # Sort persistent and cloud partitions by the number of organisms in descending order
+ ordered_nodes_p = sorted(partitions_dict["P"], key=lambda n: n.number_of_organisms, reverse=True)
+ ordered_nodes_c = sorted(partitions_dict["C"], key=lambda n: n.number_of_organisms, reverse=True)
+
+ partition_separators = [("Persistent", len(ordered_nodes_p) - 0.5)]
+ ordered_nodes = ordered_nodes_p
+
+ # Sort shell subpartitions and add them to the ordered nodes list
+ for subpartition in sorted(shell_subs):
+ partition_name = "Shell" if len(shell_subs) == 1 else f"Shell_{subpartition}"
+ ordered_nodes_s = sorted(partitions_dict[subpartition], key=lambda n: n.number_of_organisms, reverse=True)
+ ordered_nodes += ordered_nodes_s
+ partition_separators.append((partition_name, partition_separators[-1][1] + len(ordered_nodes_s)))
+
+ # Append cloud partition to the ordered nodes list
+ ordered_nodes += ordered_nodes_c
+ partition_separators.append(("Cloud", partition_separators[-1][1] + len(ordered_nodes_c)))
+
+ return ordered_nodes, partition_separators
+
+
+def create_partition_shapes(
+ separators: List[Tuple[str, float]],
+ xval_max: float,
+ heatmap_row: int,
+ partition_to_color: Dict[str, str]
+) -> List[dict]:
+ """
+ Create the shapes for plot separators to visually distinguish partitions in the plot.
+
+ :param separators: A list of tuples containing partition names and their corresponding separator positions.
+ :param xval_max: The maximum x-value for the plot.
+ :param heatmap_row: The row number of the heatmap.
+ :param partition_to_color: A dictionary mapping partition names to their corresponding colors.
+ :return: A list of shape dictionaries for Plotly to use in the plot.
+ """
- logging.getLogger("PPanGGOLiN").info("Done extracting names and numbers. Making the heatmap ...")
-
- heatmap = go.Heatmap(z=binary_data,
- x=xaxis_values,
- y=fam_order,
- text=text_data,
- zauto=False,
- zmin=1,
- zmax=2,
- autocolorscale=False,
- colorscale=[[0.50, 'rgb(100, 15, 78)'], [1, 'rgb(59, 157, 50)']],
- colorbar=dict(title='Presence/Absence',
- titleside='top',
- tickmode='array',
- tickvals=[1, 2],
- ticktext=['Presence', 'Multicopy'],
- ticks='outside'))
- shell_color = None
- if len(shell_subs) > 1:
- if "S_" not in shell_subs:
- shell_color = cl.interp(cl.flipper()['seq']['9']['Greens'][1:7], len(shell_subs))
- else:
- shell_color = cl.interp(cl.flipper()['seq']['9']['Greens'][1:7], len(shell_subs) - 1)
shapes = []
sep_prec = 0
- for nb, sep in enumerate(separators):
- if nb == 0:
- color = colors["persistent"]
- elif nb == (len(separators) - 1):
- color = colors["cloud"]
- elif len(shell_subs) > 1:
- if shell_na is not None and nb == shell_na:
- color = colors["shell"]
- else:
- color = shell_color.pop()
- else:
- color = colors["shell"]
- shapes.append(dict(type='line', x0=-1, x1=-1, y0=sep_prec, y1=sep, line=dict(dict(width=10, color=color))))
- shapes.append(dict(type='line', x0=pangenome.number_of_organisms, x1=pangenome.number_of_organisms, y0=sep_prec, y1=sep,
- line=dict(dict(width=10, color=color))))
- shapes.append(dict(type='line', x0=-1, x1=pangenome.number_of_organisms, y0=sep, y1=sep,
- line=dict(dict(width=1, color=color))))
+ xref = 'x1'
+ yref = f'y{heatmap_row}'
+
+ for partition_name, sep in separators:
+ color = partition_to_color[partition_name]
+
+ # Left vertical line for partition separator
+ shapes.append(dict(
+ type='line', x0=-1, x1=-1, y0=sep_prec, y1=sep,
+ line=dict(width=10, color=color), xref=xref, yref=yref,
+ name=partition_name, showlegend=True, legendgroup=partition_name
+ ))
+
+ # Right vertical line for partition separator
+ shapes.append(dict(
+ type='line', x0=xval_max, x1=xval_max, y0=sep_prec, y1=sep,
+ line=dict(width=10, color=color), xref=xref, yref=yref,
+ name=partition_name, showlegend=False, legendgroup=partition_name
+ ))
+
+ # Horizontal line across the partition boundary
+ shapes.append(dict(
+ type='line', x0=-1, x1=xval_max, y0=sep, y1=sep,
+ line=dict(width=1, color=color), xref=xref, yref=yref,
+ name=partition_name, showlegend=False, legendgroup=partition_name
+ ))
+
sep_prec = sep
- layout = go.Layout(title="presence/absence matrix",
- xaxis=go.layout.XAxis(ticktext=xaxis_values,
- title='genomes',
- tickvals=xaxis_values,
- automargin=True,
- tickfont=dict(size=10)),
- yaxis=go.layout.YAxis(ticktext=fam_order,
- tickvals=fam_order,
- title='gene families',
- automargin=True,
- tickfont=dict(size=10)),
- shapes=shapes,
- plot_bgcolor='#ffffff')
- logging.getLogger("PPanGGOLiN").info("Drawing the figure itself...")
-
- fig = go.Figure(data=[heatmap])
+ return shapes
+
+
+def metadata_stringify(gene) -> str:
+ """
+ Convert gene metadata to a formatted string.
+
+ :param gene: The gene object with potential metadata.
+ :return: A formatted string containing gene metadata information.
+ """
+ metadata_str = ''
+ if gene.has_metadata():
+ metadata_str = f'
{gene.ID} metadata'
+ for metadata in gene.metadata:
+ metadata_str += f"
metadata source: {metadata.source}
"
+ metadata_dict = metadata.to_dict()
+ metadata_str += '
'.join((f"{key}: {value}" for key, value in metadata_dict.items()))
+
+ return metadata_str
+
+def get_heatmap_hover_text(ordered_families: List, order_organisms: List) -> List[List[str]]:
+ """
+ Generate hover text for the heatmap cells.
+
+ :param ordered_families: The list of ordered gene families.
+ :param order_organisms: The list of ordered organisms.
+ :return: A 2D list of strings representing hover text for each heatmap cell.
+ """
+ text_data = []
+
+ for family in ordered_families:
+ text_per_family = []
+
+ for org in order_organisms:
+ if org in family.organisms:
+ # gene_count = len(list(family.get_genes_per_org(org)))
+ genes = ";".join(map(str, family.get_genes_per_org(org)))
+ names = ";".join((gene.name for gene in family.get_genes_per_org(org) if gene.name))
+
+ # Compile additional information about genes
+ extra_gene_info = f"genes:{genes}"
+ if names:
+ extra_gene_info += f'
names:{names}'
+
+ metadata = "
".join((metadata_stringify(gene) for gene in family.get_genes_per_org(org) if gene.has_metadata()))
+ extra_gene_info += metadata
+ else:
+ # gene_count = 0
+ extra_gene_info = np.nan # Using np.nan instead of numpy.nan for consistency with numpy import
+
+ # To get a more explicit hover. But is quite heavy on the finam html
+ # gene_info = f"genome:{org.name}
family:{family.name}
gene_count:{gene_count}
{extra_gene_info}"
+ # Light version:
+ gene_info = extra_gene_info
+ text_per_family.append(gene_info)
+
+ text_data.append(text_per_family)
+
+ return text_data
+
+def create_tile_plot(
+ binary_data: List[List[float]],
+ text_data: List[List[str]],
+ fam_order: List[str],
+ partition_separator: List[tuple],
+ order_organisms: List[Organism], # Replace 'Any' with the appropriate type if available
+ dendrogram_fig: go.Figure,
+ draw_dendrogram: bool
+) -> go.Figure:
+ """
+ Create the heatmap tile plot using Plotly.
+
+ :param binary_data: The binary presence-absence matrix data.
+ :param text_data: Hover text data for each cell in the heatmap.
+ :param fam_order: List of gene family names in the desired order.
+ :param partition_separator: List of tuples containing partition names and their separator positions.
+ :param order_organisms: List of organisms in the desired order.
+ :param dendrogram_fig: Plotly figure object for the dendrogram.
+ :param draw_dendrogram: Flag indicating whether to draw the dendrogram.
+ :return: A Plotly Figure object representing the tile plot.
+ """
+
+ xaxis_values = [org.name for org in order_organisms]
+
+ heatmap_color = {"presence":"#005AB5", # blue
+ "multicopy":'#DC3220' # red
+ }
+
+ green_colors = ['rgb(35,139,69)',
+ 'rgb(65,171,93)',
+ 'rgb(116,196,118)',
+ 'rgb(161,217,155)',
+ 'rgb(199,233,192)',
+ 'rgb(229,245,224)']
+
+ shell_color_generator = cycle(green_colors)
+
+ partition_to_color = {"Persistent": "#F7A507", "Cloud": "#79DEFF", "Shell_S1": "#00D860"}
+ partition_to_color.update({partition:next(shell_color_generator) for partition, _ in partition_separator if partition not in partition_to_color})
+
+
+ heatmap = [go.Heatmap(z=binary_data,
+ x=xaxis_values,
+ y=fam_order,
+ text=text_data,
+ zauto=False,
+ zmin=0,
+ zmax=2,
+ autocolorscale=False,
+ hovertemplate = 'genome: %{x}
family: %{y}
gene_count: %{z}
%{text} ',
+ colorscale=[[0, '#ffffff'],[0.33, '#ffffff'],
+ [0.33, heatmap_color['presence']],[0.66, heatmap_color['presence']],
+ [0.66, heatmap_color['multicopy']], [1, heatmap_color['multicopy']]],
+
+ colorbar=dict(title='Presence/Absence',
+ titleside='top',
+ tickmode='array',
+ tickvals=[0.33, 1, 1.66],
+ ticktext=['Absence', 'Presence', 'Multicopy'],
+ len=0.27,
+ outlinecolor='black',
+ outlinewidth=0.5,
+ ticks=None, orientation='v'))]
+
+
+
+ if draw_dendrogram:
+ heatmap_row = 2
+ fig = make_subplots(rows=2, cols=1,
+ shared_xaxes=True,
+ vertical_spacing=0.01,
+ row_heights=[0.2, 0.8])
+
+ for data in dendrogram_fig['data']:
+ fig.add_trace(data, row=1, col=1)
+
+ else:
+ heatmap_row = 1
+ fig = make_subplots(rows=1, cols=1)
+
+
+ heatmap[0]['x'] = dendrogram_fig['layout']['xaxis']['tickvals']
+
+ for data in heatmap:
+
+ fig.add_trace(data, row=heatmap_row, col=1)
+
+ layout = go.Layout(title="Presence-Absence Matrix",
+ plot_bgcolor='#ffffff')
+
+ if draw_dendrogram:
+ layout.xaxis2 = dendrogram_fig.layout.xaxis
+ else:
+ layout.xaxis = dendrogram_fig.layout.xaxis
+
fig.update_layout(layout)
- out_plotly.plot(fig, filename=output.as_posix() + "/tile_plot.html", auto_open=False)
- logging.getLogger("PPanGGOLiN").info(f"Done with the tile plot : '{output / 'tile_plot.html'}' ")
+
+
+
+ fig.update_xaxes(
+ ticklen=0,
+ title="Genomes"
+ )
+ fig.update_yaxes(
+ ticklen=0,
+ title='Gene Families',
+ tickfont=dict(size=10),
+ automargin=True,
+ )
+ if draw_dendrogram:
+
+ fig.layout.yaxis.title = None
+ fig.layout.yaxis2.title = dict(text='Gene Families')
+ fig.layout.xaxis.title = None
+
+ xmax = dendrogram_fig['layout']['xaxis']['tickvals'][-1] + dendrogram_fig['layout']['xaxis']['tickvals'][0] + 0.5
+ shapes = create_partition_shapes(partition_separator, xmax, heatmap_row, partition_to_color)
+
+ fig.update_layout(go.Layout(shapes=shapes,
+ showlegend=True,
+ ))
+
+ fig.update_layout(legend=dict(
+ title="Family Partition",
+ traceorder="reversed",
+ ))
+
+ fig.update_layout({'width':1000, 'height':1000,
+ })
+
+ return fig
+
+
\ No newline at end of file
diff --git a/ppanggolin/figures/ucurve.py b/ppanggolin/figures/ucurve.py
index 1abd3749..ec82f367 100644
--- a/ppanggolin/figures/ucurve.py
+++ b/ppanggolin/figures/ucurve.py
@@ -60,9 +60,10 @@ def draw_ucurve(pangenome: Pangenome, output: Path, soft_core: float = 0.95, di
marker=dict(color=colors["cloud"])))
else:
text = 'undefined' if has_undefined else "pangenome"
- undefined_values = []
- for nb_org in range(1, pangenome.number_of_organisms + 1):
- undefined_values.append(count[nb_org][text])
+ undefined_values = [count[nb_org][text]
+ for nb_org
+ in range(1, pangenome.number_of_organisms + 1)]
+
data_plot.append(go.Bar(x=list(range(1, pangenome.number_of_organisms + 1)), y=undefined_values, name=text,
marker=dict(color=colors[text])))
x = pangenome.number_of_organisms * soft_core
diff --git a/ppanggolin/formats/readBinaries.py b/ppanggolin/formats/readBinaries.py
index 9d2f3609..6c911fe7 100644
--- a/ppanggolin/formats/readBinaries.py
+++ b/ppanggolin/formats/readBinaries.py
@@ -2,8 +2,9 @@
# default libraries
import logging
+from optparse import Option
from pathlib import Path
-from typing import Dict, Any, Set, List, Tuple
+from typing import Dict, Any, Set, List, Tuple, Optional
from collections import defaultdict
# installed libraries
@@ -958,7 +959,7 @@ def get_need_info(pangenome, need_annotations: bool = False, need_families: bool
def check_pangenome_info(pangenome, need_annotations: bool = False, need_families: bool = False,
need_graph: bool = False, need_partitions: bool = False, need_rgp: bool = False,
need_spots: bool = False, need_gene_sequences: bool = False, need_modules: bool = False,
- need_metadata: bool = False, metatypes: Set[str] = None, sources: Set[str] = None,
+ need_metadata: bool = False, metatypes: Optional[Set[str]] = None, sources: Optional[Set[str]] = None,
disable_bar: bool = False):
"""
Defines what needs to be read depending on what is needed, and automatically checks if the required elements
diff --git a/ppanggolin/formats/writeFlatGenomes.py b/ppanggolin/formats/writeFlatGenomes.py
index 232240c7..30c31dd3 100644
--- a/ppanggolin/formats/writeFlatGenomes.py
+++ b/ppanggolin/formats/writeFlatGenomes.py
@@ -119,7 +119,7 @@ def manage_module_colors(modules: Set[Module], window_size: int = 100) -> Dict[M
random.seed(1)
color_mod_graph = nx.Graph()
- color_mod_graph.add_nodes_from((module for module in modules))
+ color_mod_graph.add_nodes_from(module for module in modules)
contig_to_mod_genes = defaultdict(set)
gene_to_module = {}
@@ -589,6 +589,11 @@ def write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = Fa
organism_args["annotation_sources"] = {}
if table:
+ # create _genePerOrg dict with get_org_dict methodbefore the multiprocessing to prevent putative errors.
+ # As this is used in multiprocessing when computing nb_copy_in_genome.
+ for family in pangenome.gene_families:
+ family.get_org_dict()
+
organism_args.update({"need_regions": need_dict['need_rgp'],
"need_modules": need_dict['need_modules'],
"need_spots": need_dict['need_spots']})
diff --git a/ppanggolin/formats/writeMSA.py b/ppanggolin/formats/writeMSA.py
index cc72f001..c4b37414 100644
--- a/ppanggolin/formats/writeMSA.py
+++ b/ppanggolin/formats/writeMSA.py
@@ -231,7 +231,7 @@ def write_whole_genome_msa(pangenome: Pangenome, families: set, phylo_name: Path
phylo_dict[org.name] = ""
for fam in families:
observed_genomes = set()
- with open(outdir / f"{fam.name}.aln", "r") as fin:
+ with open(outdir / f"{fam.name}.aln") as fin:
genome_id = ""
seq = ""
curr_len = 0
diff --git a/ppanggolin/formats/writeMetadata.py b/ppanggolin/formats/writeMetadata.py
index 207f9a0c..4f33db22 100644
--- a/ppanggolin/formats/writeMetadata.py
+++ b/ppanggolin/formats/writeMetadata.py
@@ -85,7 +85,7 @@ def desc_metadata(max_len_dict: Dict[str, int], type_dict: Dict[str, tables.Col]
:return: Formatted table
"""
desc_dict = {attr: tables.StringCol(itemsize=max_value) for attr, max_value in max_len_dict.items()}
- desc_dict.update({attr: col_type for attr, col_type in type_dict.items()})
+ desc_dict.update(dict(type_dict.items()))
return desc_dict
diff --git a/ppanggolin/formats/writeSequences.py b/ppanggolin/formats/writeSequences.py
index cb01d704..a7023812 100644
--- a/ppanggolin/formats/writeSequences.py
+++ b/ppanggolin/formats/writeSequences.py
@@ -251,7 +251,7 @@ def write_gene_protein_sequences(pangenome: Pangenome, output: Path, proteins: s
run_subprocess(cmd, msg="MMSeqs convert2fasta failed with the following error:\n")
if compress:
with write_compressed_or_not(outpath, compress) as compress_file:
- with open(outpath, "r") as sequence_file:
+ with open(outpath) as sequence_file:
shutil.copyfileobj(sequence_file, compress_file)
outpath.unlink()
logging.getLogger("PPanGGOLiN").info(f"Done writing the gene sequences : '{outpath}.gz'")
@@ -503,11 +503,10 @@ def write_regions_sequences(pangenome: Pangenome, output: Path, regions: str, fa
org_dict[elements[0]] = organisms_file.parent.joinpath(org_dict[elements[0]])
logging.getLogger("PPanGGOLiN").info(f"Writing {regions} rgp genomic sequences...")
- regions_to_write = []
+
if regions == "complete":
- for region in pangenome.regions:
- if not region.is_contig_border:
- regions_to_write.append(region)
+ regions_to_write = (region for region in pangenome.regions
+ if not region.is_contig_border)
else:
regions_to_write = pangenome.regions
diff --git a/ppanggolin/geneFamily.py b/ppanggolin/geneFamily.py
index 71100a95..ef56700b 100644
--- a/ppanggolin/geneFamily.py
+++ b/ppanggolin/geneFamily.py
@@ -439,7 +439,7 @@ def get_genes_per_org(self, org: Organism) -> Generator[Gene, None, None]:
if len(self._genePerOrg) == 0:
_ = self.get_org_dict()
if org not in self._genePerOrg:
- raise KeyError(f"Genome does not have the gene family: {self.name}")
+ raise KeyError(f"Genome {org.name} does not have the gene family: {self.name}")
yield from self._genePerOrg[org]
diff --git a/ppanggolin/genome.py b/ppanggolin/genome.py
index d5238f99..a299d442 100644
--- a/ppanggolin/genome.py
+++ b/ppanggolin/genome.py
@@ -222,7 +222,8 @@ def add_sequence(self, sequence):
:raise AssertionError: Sequence must be a string
"""
- assert isinstance(sequence, str), f"'str' type was expected but you provided a '{type(sequence)}' type object"
+ assert isinstance(sequence, str), f"'str' type was expected for dna sequence but you provided a '{type(sequence)}' type object"
+
self.dna = sequence
def string_coordinates(self) -> str:
@@ -909,14 +910,14 @@ def number_of_genes(self) -> int:
:return: Number of genes
"""
- return sum((contig.number_of_genes for contig in self.contigs))
+ return sum(contig.number_of_genes for contig in self.contigs)
def number_of_rnas(self) -> int:
""" Get number of genes in the organism
:return: Number of genes
"""
- return sum((contig.number_of_rnas for contig in self.contigs))
+ return sum(contig.number_of_rnas for contig in self.contigs)
@property
def contigs(self) -> Generator[Contig, None, None]:
diff --git a/ppanggolin/metadata.py b/ppanggolin/metadata.py
index 90dc8c9a..361e4b5a 100644
--- a/ppanggolin/metadata.py
+++ b/ppanggolin/metadata.py
@@ -165,7 +165,7 @@ def add_metadata(self, metadata: Metadata, metadata_id: int = None) -> None:
# or they are ridden
if metadata_id is None:
if self.has_source(metadata.source):
- metadata_id = max([meta_id for meta_id in self._metadata_getter[metadata.source].keys()]) + 1
+ metadata_id = max(self._metadata_getter[metadata.source].keys()) + 1
else:
# Set first as 1 for PANORAMA
metadata_id = 1
diff --git a/ppanggolin/mod/module.py b/ppanggolin/mod/module.py
index e2d9595d..fe2df7a0 100644
--- a/ppanggolin/mod/module.py
+++ b/ppanggolin/mod/module.py
@@ -72,7 +72,7 @@ def compute_modules(g: nx.Graph, multi: set, weight: float = 0.85, min_fam: int
"""
# removing families with low presence
- removed = set([fam for fam in g.nodes if fam.number_of_organisms < min_fam])
+ removed = {fam for fam in g.nodes if fam.number_of_organisms < min_fam}
modules = set()
c = 0
diff --git a/ppanggolin/nem/partition.py b/ppanggolin/nem/partition.py
index 952ab583..f3e1f451 100644
--- a/ppanggolin/nem/partition.py
+++ b/ppanggolin/nem/partition.py
@@ -117,7 +117,7 @@ def run_partitioning(nem_dir_path: Path, nb_org: int, beta: float = 2.5, free_di
no_nem = True
index_fam = []
- with open(nem_dir_path / "nem_file.index", "r") as index_nem_file:
+ with open(nem_dir_path / "nem_file.index") as index_nem_file:
for line in index_nem_file:
index_fam.append(line.split("\t")[1].strip())
@@ -126,8 +126,8 @@ def run_partitioning(nem_dir_path: Path, nb_org: int, beta: float = 2.5, free_di
log_likelihood = None
entropy = None
try:
- with open(nem_dir_path / f"nem_file_{str(kval)}.uf", "r") as partitions_nem_file, \
- open(nem_dir_path / f"nem_file_{str(kval)}.mf", "r") as parameters_nem_file:
+ with open(nem_dir_path / f"nem_file_{str(kval)}.uf") as partitions_nem_file, \
+ open(nem_dir_path / f"nem_file_{str(kval)}.mf") as parameters_nem_file:
parameters = parameters_nem_file.readlines()
log_likelihood = float(parameters[2].split()[3])
@@ -321,7 +321,7 @@ def evaluate_nb_partitions(organisms: set, output: Path = None, sm_degree: int =
newtmpdir = tmpdir / "eval_partitions"
if len(organisms) > chunk_size:
- select_organisms = set(random.sample(set(organisms), chunk_size))
+ select_organisms = set(random.sample(list(organisms), chunk_size))
else:
select_organisms = set(organisms)
@@ -366,24 +366,24 @@ def evaluate_nb_partitions(organisms: set, output: Path = None, sm_degree: int =
chosen_k = best_k if best_k >= 3 else chosen_k
if len(all_bics) > 0 and draw_icl:
- traces = [go.Scatter(x=[key for key in sorted(all_bics.keys())],
+ traces = [go.Scatter(x=sorted(all_bics.keys()),
y=[all_bics[key] for key in sorted(all_bics.keys())],
name="BIC",
- mode="lines+markers"), go.Scatter(x=[key for key in sorted(all_icls.keys())],
+ mode="lines+markers"), go.Scatter(x=sorted(all_icls.keys()),
y=[all_icls[key] for key in sorted(all_icls.keys())],
name="ICL",
mode="lines+markers"),
- go.Scatter(x=[key for key in sorted(all_lls.keys())],
+ go.Scatter(x=sorted(all_lls.keys()),
y=[all_lls[key] for key in sorted(all_lls.keys())],
name="log likelihood",
- mode="lines+markers"), go.Scatter(x=[key for key in sorted(all_bics.keys())],
+ mode="lines+markers"), go.Scatter(x=sorted(all_bics.keys()),
y=[all_bics[key] for key in sorted(all_bics.keys())],
name="BIC",
mode="lines+markers"),
- go.Scatter(x=[key for key in sorted(all_icls.keys())],
+ go.Scatter(x=sorted(all_icls.keys()),
y=[all_icls[key] for key in sorted(all_icls.keys())],
name="ICL",
- mode="lines+markers"), go.Scatter(x=[key for key in sorted(all_lls.keys())],
+ mode="lines+markers"), go.Scatter(x=sorted(all_lls.keys()),
y=[all_lls[key] for key in sorted(all_lls.keys())],
name="log likelihood",
mode="lines+markers")]
diff --git a/ppanggolin/pangenome.py b/ppanggolin/pangenome.py
index ee12d09d..3b39eba4 100644
--- a/ppanggolin/pangenome.py
+++ b/ppanggolin/pangenome.py
@@ -129,10 +129,10 @@ def get_gene(self, gene_id: str) -> Gene:
:return: Returns the gene that has the ID `gene_id`
- :raises AssertionError: If the `gene_id` is not an integer
+ :raises AssertionError: If the `gene_id` is not a string
:raises KeyError: If the `gene_id` is not in the pangenome
"""
- assert isinstance(gene_id, str), "Gene id should be an integer"
+ assert isinstance(gene_id, str), f"The provided gene id ({gene_id}) should be a string and not a {type(gene_id)}"
try:
gene = self._gene_getter[gene_id]
@@ -232,23 +232,23 @@ def get_gene_family(self, name: str) -> GeneFamily:
def add_gene_family(self, family: GeneFamily):
"""
- Get the :class:`ppanggolin.geneFamily.GeneFamily` object that has the given `name`.
- If it does not exist, creates it.
-
- :param family: The gene family to add in pangenomes
-
- :raises KeyError: Exception if family with the same name already in pangenome
- :raises Exception: Unexpected exception
+ Adds the given gene family to the pangenome. If a family with the same name already exists, raises a KeyError.
+
+ :param family: The gene family to add to the pangenome
+ :raises KeyError: If a family with the same name already exists
+ :raises Exception: For any unexpected exceptions
"""
try:
- _ = self.get_gene_family(family.name)
+ self.get_gene_family(family.name)
except KeyError:
+ # Family does not exist, so add it
self._fam_getter[family.name] = family
self.max_fam_id += 1
except Exception as error:
- raise Exception(error)
+ raise Exception(f"An unexpected error occurred when adding family {family} to pangenome: {str(error)}") from error
else:
- raise KeyError("Gene Family already exist")
+ raise KeyError(f"Cannot add family {family.name}: A family with the same name already exists.")
+
"""Graph methods"""
@property
@@ -772,7 +772,7 @@ def has_metadata(self) -> bool:
"""
Whether or not the pangenome has metadata associated with any of its elements.
"""
- return any(( status != "No" for status in self.status['metadata'].values()))
+ return any( status != "No" for status in self.status['metadata'].values())
def select_elem(self, metatype: str):
diff --git a/ppanggolin/projection/projection.py b/ppanggolin/projection/projection.py
index 53aaf0b0..511414cc 100644
--- a/ppanggolin/projection/projection.py
+++ b/ppanggolin/projection/projection.py
@@ -426,8 +426,7 @@ def annotate_fasta_files(genome_name_to_fasta_path: Dict[str, dict], tmpdir: str
future.add_done_callback(lambda p: progress.update())
futures.append(future)
- for future in futures:
- organisms.append(future.result())
+ organisms.extend(future.result() for future in futures)
return organisms
@@ -515,7 +514,7 @@ def get_gene_sequences_from_fasta_files(organisms, genome_name_to_annot_path):
msg = f"Fasta file for genome {org.name} did not have the contig {contig.name} " \
f"that was read from the annotation file. "
msg += f"The provided contigs in the fasta were : " \
- f"{', '.join([contig for contig in org_contig_to_seq])}."
+ f"{', '.join(org_contig_to_seq)}."
raise KeyError(msg)
for gene in contig.genes:
@@ -667,7 +666,7 @@ def annotate_input_genes_with_pangenome_families(pangenome: Pangenome, input_org
# Write specific gene ids in a file
with open(org_outdir / "specific_genes.tsv", "w") as fl:
fl.write('\n'.join(
- (gene.ID if gene.local_identifier == "" else gene.local_identifier for gene in lonely_genes)) + '\n')
+ gene.ID if gene.local_identifier == "" else gene.local_identifier for gene in lonely_genes) + '\n')
input_org_to_lonely_genes_count[input_organism] = len(lonely_genes)
@@ -770,7 +769,7 @@ def retrieve_gene_sequences_from_fasta_file(input_organism, fasta_file):
msg = f"Fasta file for input genome {input_organism.name} did not have the contig {contig.name} " \
f"that was read from the annotation file. "
msg += f"The provided contigs in the fasta were : " \
- f"{', '.join([contig for contig in contig_id2seq.keys()])}."
+ f"{', '.join(contig_id2seq.keys())}."
raise KeyError(msg)
@@ -913,7 +912,7 @@ def predict_spots_in_input_organisms(
# Check congruency with already computed spot and add spot id in node attributes
check_spots_congruency(graph_spot, initial_spots)
- new_spot_id_counter = max((s.ID for s in initial_spots)) + 1 if len(initial_spots) != 0 else 1
+ new_spot_id_counter = max(s.ID for s in initial_spots) + 1 if len(initial_spots) != 0 else 1
input_org_to_spots = {}
for input_organism, rgps in input_org_2_rgps.items():
@@ -939,7 +938,7 @@ def predict_spots_in_input_organisms(
exact_match=exact_match, compress=compress)
if len(input_org_spots) > 0:
- new_spot_id_counter = max((s.ID for s in input_org_spots)) + 1
+ new_spot_id_counter = max(s.ID for s in input_org_spots) + 1
input_org_to_spots[input_organism] = input_org_spots
@@ -1037,8 +1036,7 @@ def predict_spot_in_one_organism(
spots_of_the_cc = set()
for node in comp:
if "spots" in graph_spot.nodes[node]:
- spots_of_the_cc |= {
- spot for spot in graph_spot.nodes[node]["spots"]}
+ spots_of_the_cc |= set(graph_spot.nodes[node]["spots"])
if len(spots_of_the_cc) == 0:
# no spot associated with any node of the cc
@@ -1063,7 +1061,7 @@ def predict_spot_in_one_organism(
graph_spot.nodes[node]["spots"] = spots_of_the_cc
graph_spot.nodes[node]["spot_id"] = ';'.join(
- (str(spot) for spot in spots_of_the_cc))
+ str(spot) for spot in spots_of_the_cc)
graph_spot.nodes[node]["includes_RGPs_from_the_input_genome"] = True
for spot in spots_of_the_cc:
@@ -1120,7 +1118,7 @@ def project_and_write_modules(pangenome: Pangenome, input_organisms: Iterable[Or
for mod in pangenome.modules:
module_in_input_organism = any(
- (fam in input_organism_families for fam in mod.families))
+ fam in input_organism_families for fam in mod.families)
if module_in_input_organism:
counter += 1
diff --git a/ppanggolin/region.py b/ppanggolin/region.py
index f33f0d82..a9388385 100644
--- a/ppanggolin/region.py
+++ b/ppanggolin/region.py
@@ -660,9 +660,8 @@ def borders(self, set_size: int, multigenics) -> List[List[int, List[GeneFamily]
:return: Families that bordering spot
"""
- all_borders = []
- for rgp in self.regions:
- all_borders.append(rgp.get_bordering_genes(set_size, multigenics))
+ all_borders = [rgp.get_bordering_genes(set_size, multigenics)
+ for rgp in self.regions]
family_borders = []
for borders in all_borders:
@@ -747,7 +746,7 @@ def count_uniq_content(self) -> dict:
:return: Dictionary with a representative rgp as the key and number of identical rgp as value
"""
- return dict([(key, len(val)) for key, val in self._get_content().items()])
+ return {key: len(val) for key, val in self._get_content().items()}
def count_uniq_ordered_set(self):
"""
@@ -755,7 +754,7 @@ def count_uniq_ordered_set(self):
:return: Dictionary with a representative rgp as the key and number of identical rgp as value
"""
- return dict([(key, len(val)) for key, val in self._get_ordered_set().items()])
+ return {key: len(val) for key, val in self._get_ordered_set().items()}
class Module(MetaFeatures):
diff --git a/ppanggolin/utils.py b/ppanggolin/utils.py
index 88fd963f..b07a02a6 100755
--- a/ppanggolin/utils.py
+++ b/ppanggolin/utils.py
@@ -92,7 +92,7 @@ def check_tsv_sanity(tsv: Path):
:param tsv: Path to the tsv containing organims information
"""
try:
- input_file = open(tsv, "r")
+ input_file = open(tsv)
except OSError as ios_error:
raise OSError(ios_error)
except Exception as exception_error:
@@ -270,7 +270,7 @@ def read_compressed_or_not(file_or_file_path: Union[Path, BinaryIO, TextIOWrappe
return TextIOWrapper(z.open(file_list[0], "r"))
else: # Non-compressed file
if isinstance(file_or_file_path, Path):
- return open(file_or_file_path, "r")
+ return open(file_or_file_path)
else:
return file_or_file_path
diff --git a/ppanggolin/workflow/all.py b/ppanggolin/workflow/all.py
index 5c02bf73..f3b8d922 100644
--- a/ppanggolin/workflow/all.py
+++ b/ppanggolin/workflow/all.py
@@ -196,12 +196,13 @@ def launch_workflow(args: argparse.Namespace, panrgp: bool = True,
spot_time += time.time() - start_spot_drawing
if args.draw.tile_plot:
- if 1 < pangenome.number_of_organisms < 5000:
- nocloud = args.draw.nocloud if pangenome.number_of_organisms < 500 else True
- draw_tile_plot(pangenome, args.output, nocloud=nocloud, disable_bar=args.disable_prog_bar)
+ if pangenome.number_of_organisms < 65000 or pangenome.number_of_gene_families < 65000:
+ nocloud = args.draw.nocloud if pangenome.number_of_organisms < 32767 or pangenome.number_of_gene_families < 32767 else True
+ draw_tile_plot(pangenome, args.output, nocloud=nocloud, disable_bar=args.disable_prog_bar,
+ draw_dendrogram=args.draw.add_dendrogram, add_metadata=True)
else:
logging.getLogger("PPanGGOLiN").warning(
- 'Tile plot output have been requested but there are too many genomes to produce a viewable tile plot.')
+ 'Tile plot output have been requested but there are too many genomes or families to produce a viewable tile plot.')
if args.draw.ucurve:
draw_ucurve(pangenome, args.output, disable_bar=args.disable_prog_bar, soft_core=args.draw.soft_core)
@@ -227,7 +228,7 @@ def launch_workflow(args: argparse.Namespace, panrgp: bool = True,
write_pangenome_arguments.append('spot_modules')
# check that at least one output file is requested. if not write is not call.
- if any((getattr(args.write_pangenome, arg) is True for arg in write_pangenome_arguments)):
+ if any(getattr(args.write_pangenome, arg) is True for arg in write_pangenome_arguments):
# some parameters are set to false because they have not been computed in this workflow
write_pangenome_flat_files(pangenome, args.output, cpu=args.write_pangenome.cpu,
disable_bar=args.disable_prog_bar,
@@ -248,12 +249,12 @@ def launch_workflow(args: argparse.Namespace, panrgp: bool = True,
'Writing output pangenome flat file is skipped.')
write_genomes_arguments = ['proksee', "table", "gff"]
- if any((getattr(args.write_genomes, arg) is True for arg in write_genomes_arguments)):
+ if any(getattr(args.write_genomes, arg) is True for arg in write_genomes_arguments):
write_flat_genome_files(pangenome, args.output,
proksee=args.write_genomes.proksee,
table=args.write_genomes.table,
gff=args.write_genomes.gff,
- add_metadata=False,
+ add_metadata=True,
compress=args.write_genomes.compress,
disable_bar=args.disable_prog_bar, cpu=args.write_genomes.cpu)
else:
diff --git a/ppanggolin_env.yaml b/ppanggolin_env.yaml
index 30b058fa..efed3570 100644
--- a/ppanggolin_env.yaml
+++ b/ppanggolin_env.yaml
@@ -13,7 +13,6 @@ dependencies:
- plotly=5
- gmpy2=2
- pandas=2
- - colorlover=0.3
- numpy=1
- bokeh=3
# Tool that are not in python
diff --git a/pyproject.toml b/pyproject.toml
index 7864f533..3707014e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -59,7 +59,6 @@ python_deps = [
"plotly==5.*",
"gmpy2==2.*",
"pandas==2.*",
- "colorlover==0.3",
"numpy==1.24",
"bokeh==3.*"
]
diff --git a/testingDataset/expected_info_files/myannopang_info.yaml b/testingDataset/expected_info_files/myannopang_info.yaml
index ca7dea34..03a04408 100644
--- a/testingDataset/expected_info_files/myannopang_info.yaml
+++ b/testingDataset/expected_info_files/myannopang_info.yaml
@@ -8,7 +8,7 @@ Status:
RGP_Predicted: false
Spots_Predicted: false
Modules_Predicted: false
- PPanGGOLiN_Version: 2.1.1
+ PPanGGOLiN_Version: 2.1.2
Content:
Genes: 47961
diff --git a/testingDataset/expected_info_files/mybasicpangenome_info.yaml b/testingDataset/expected_info_files/mybasicpangenome_info.yaml
index 48d59f36..466ddb63 100644
--- a/testingDataset/expected_info_files/mybasicpangenome_info.yaml
+++ b/testingDataset/expected_info_files/mybasicpangenome_info.yaml
@@ -8,7 +8,7 @@ Status:
RGP_Predicted: true
Spots_Predicted: true
Modules_Predicted: true
- PPanGGOLiN_Version: 2.1.1
+ PPanGGOLiN_Version: 2.1.2
Content:
Genes: 45429
diff --git a/testingDataset/expected_info_files/stepbystep_info.yaml b/testingDataset/expected_info_files/stepbystep_info.yaml
index 395b259a..ecebca05 100644
--- a/testingDataset/expected_info_files/stepbystep_info.yaml
+++ b/testingDataset/expected_info_files/stepbystep_info.yaml
@@ -8,7 +8,7 @@ Status:
RGP_Predicted: true
Spots_Predicted: true
Modules_Predicted: true
- PPanGGOLiN_Version: 2.1.1
+ PPanGGOLiN_Version: 2.1.2
Content:
Genes: 45429