A Python tool for processing pangenome structural variants and generating PLINK format files. This tool helps analyze structural variants in pangenomes by performing sequence alignment, k-mer window analysis, and converting results to PLINK format.
- Process VCF files containing structural variants
- Align variant sequences using MUSCLE
- K-mer based variant quantification
- Generate PLINK format files (ped/map/bfile)
- Parallel processing support
- Progress tracking and logging
# Create environment
conda create -n panherit python=3.8
conda activate panherit
# Install panherit
git clone https://github.com/PeixiongYuan/pangenome_heritability.git
cd pangenome_heritability
conda install -c conda-forge pandas numpy biopython click tqdm
pip install -e .
# Install MUSCLE and PLINK
mkdir -p ~/local/bin
cd ~/local/bin
wget https://github.com/rcedgar/muscle/releases/download/v5.3/muscle-linux-x86.v5.3
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20230116.zip
unzip plink_linux_x86_64_20230116.zip
chmod +x muscle-linux-x86.v5.3
mv muscle-linux-x86.v5.3 muscle
chmod +x plink
echo 'export PATH="$HOME/local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Install MAFFT
conda install conda-forge::mafft
The Pangenome Heritability Tool provides several commands to process variants, perform alignments, and generate PLINK files. Each step can be run independently or as part of a pipeline.
The tool provides four step commands, plus one command that runs the whole workflow:
- process-vcf: Process the VCF file and group overlapping variants
- run-alignments: Perform sequence alignments using MUSCLE
- process-kmers: Generate and analyze k-mer windows
- convert-to-plink: Convert results to PLINK format
- run-all: Run the entire workflow in one command
Important: The VCF and reference FASTA files must use numeric chromosome identifiers (e.g., 1, 2, 3) with no prefixes or suffixes (so 1, not chr1). Make sure both files follow this convention to avoid processing errors.
Example of a VCF File Header:
##source=YourTool
#CHROM POS ID REF ALT QUAL FILTER INFO
1 12345 rs123 A T 50 PASS .
2 67890 rs456 G C 99 PASS .
Example of a FASTA File:
>1
AGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
>2
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
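Before running the tool, it can help to confirm that both files follow the numeric naming convention. The sketch below is not part of panherit; the function names are hypothetical, and it only assumes standard FASTA headers (`>` followed by the sequence ID) and VCF data lines whose first column is CHROM.

```python
import re

# A chromosome ID is accepted only if it consists of digits alone.
NUMERIC_ID = re.compile(r"\d+")


def fasta_ids(lines):
    """Return the sequence IDs (first word of each '>' header line)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def vcf_chroms(lines):
    """Return the distinct CHROM values of the VCF data lines, in order of appearance."""
    seen = []
    for line in lines:
        # Skip meta lines (##...), the #CHROM header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split()[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def non_numeric(ids):
    """Return the IDs that are not purely numeric."""
    return [i for i in ids if not NUMERIC_ID.fullmatch(i)]


if __name__ == "__main__":
    fasta = [">1", "AGCTAGCT", ">chr2", "TGCATGCA"]
    vcf = [
        "##source=YourTool",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t12345\trs123\tA\tT\t50\tPASS\t.",
        "2\t67890\trs456\tG\tC\t99\tPASS\t.",
    ]
    print("Bad FASTA IDs:", non_numeric(fasta_ids(fasta)))
    print("Bad VCF CHROMs:", non_numeric(vcf_chroms(vcf)))
```

For real files, pass an open file handle instead of a list, for example `fasta_ids(open("reference.fasta"))`.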
panherit run-all \
--vcf input.vcf \
--ref reference.fasta \
--out output_directory \
--window-size 4 \
--threads 4
Options:
- --vcf: Input VCF file containing structural variants
- --ref: Reference genome FASTA file
- --out: Output directory for processed variants and FASTA files
- --window-size: Size of k-mer windows (default: 4)
- --threads: Number of parallel threads (default: 1)
# Process VCF and generate FASTA sequences
panherit process-vcf \
--vcf input.vcf \
--ref reference.fasta \
--out output_directory
Options:
- --vcf: Input VCF file containing structural variants
- --ref: Reference genome FASTA file
- --out: Output directory for processed variants and FASTA files
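To illustrate what grouping overlapping variants can mean, here is a generic interval-grouping sketch. It is an assumption about the general idea, not panherit's actual implementation: the function name, the `(chrom, start, end)` tuple layout, and the inclusive-coordinate rule are all hypothetical.

```python
def group_overlapping(variants):
    """Group (chrom, start, end) intervals that overlap on the same chromosome.

    Coordinates are treated as inclusive, so an interval that starts exactly
    where the current group ends is counted as overlapping.
    """
    groups = []
    current = []
    current_chrom = None
    current_end = None
    # Sorting puts intervals on the same chromosome next to each other,
    # ordered by start position.
    for chrom, start, end in sorted(variants):
        if current and chrom == current_chrom and start <= current_end:
            # Overlaps the running group: extend it.
            current.append((chrom, start, end))
            current_end = max(current_end, end)
        else:
            # No overlap: close the running group and start a new one.
            if current:
                groups.append(current)
            current = [(chrom, start, end)]
            current_chrom, current_end = chrom, end
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    example = [("1", 100, 200), ("1", 150, 300), ("1", 400, 500), ("2", 120, 180)]
    for group in group_overlapping(example):
        print(group)
```

In this example the first two intervals on chromosome 1 form one group, while the third interval and the one on chromosome 2 each form their own group.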
# Perform sequence alignments
panherit run-alignments \
--grouped-variants output_directory/variants.fasta \
--ref reference.fasta \
--out alignments_directory \
--threads 4
Options:
- --grouped-variants: FASTA file from previous step
- --ref: Reference genome FASTA file
- --out: Output directory for alignments
- --threads: Number of parallel threads (default: 1)
# Process k-mer windows
panherit process-kmers \
--alignments temp_alignments \
--window-size 4 \
--out kmers_directory
Options:
- --alignments: Directory containing alignment results
- --window-size: Size of k-mer windows (default: 4)
- --out: Output directory for k-mer results
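As a rough intuition for what a window size of 4 means, the sketch below slides a window of that length along two aligned sequences and flags each window where they differ. This is a generic illustration only: the function name and the 0/1 encoding are assumptions, and the way panherit actually defines and scores its windows may differ.

```python
def window_differences(ref_aln, alt_aln, window_size=4):
    """Flag each sliding window where two aligned sequences differ.

    Returns one integer per window start position: 1 if the two sequences
    differ anywhere inside the window, 0 if they are identical there.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have the same length")
    flags = []
    for i in range(len(ref_aln) - window_size + 1):
        ref_window = ref_aln[i:i + window_size]
        alt_window = alt_aln[i:i + window_size]
        flags.append(int(ref_window != alt_window))
    return flags


if __name__ == "__main__":
    # '-' marks an alignment gap in the second sequence.
    print(window_differences("ACGTACGT", "ACGTAC-T", window_size=4))
```

With these inputs, only the last two windows cover the gap, so only they are flagged.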
# Generate PLINK files
panherit convert-to-plink \
--kmer-results kmers_directory \
--grouped-variants output_directory/variants.fasta \
--out plink_files
Options:
- --kmer-results: Directory containing k-mer analysis results
- --grouped-variants: FASTA file from previous step
- --out: Output directory for PLINK files
Each step produces specific output files:
output_directory/
├── variants.fasta # Grouped variant sequences
└── variants.log # Processing log file
/path/to/output/
├── temp_alignments/
│ ├── Group_2_59_input.fasta
│ ├── Group_2_59_aligned.fasta
├── error_logs/
│ ├── Group_2_59_input_error.log
kmers_directory/
├── windows.csv # K-mer window analysis
└── comparison.log # Processing log file
plink_files/
├── variants.bed # Binary genotype file
├── variants.bim # Variant information file
└── variants.fam # Sample information file
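A quick way to confirm that the final step produced a complete PLINK binary fileset is to check for all three files. This is a small sketch, not a panherit feature; the helper name is hypothetical, and the `plink_files/variants` prefix is taken from the listing above.

```python
from pathlib import Path

# A PLINK 1 binary fileset consists of these three files sharing one prefix.
PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def missing_plink_files(prefix):
    """Return the extensions of the fileset members that do not exist."""
    return [
        ext for ext in PLINK_EXTENSIONS
        if not Path(str(prefix) + ext).is_file()
    ]


if __name__ == "__main__":
    missing = missing_plink_files("plink_files/variants")
    if missing:
        print("Missing PLINK files:", ", ".join(missing))
    else:
        print("PLINK fileset is complete")
```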
Complete pipeline example:
# 1. Process VCF
panherit process-vcf \
--vcf input.vcf \
--ref reference.fasta \
--out step1_output
# 2. Run alignments
panherit run-alignments \
--grouped-variants step1_output/variants.fasta \
--ref reference.fasta \
--out step2_output \
--threads 4
# 3. Process k-mers
panherit process-kmers \
--alignments step2_output \
--window-size 4 \
--out step3_output
# 4. Generate PLINK files
panherit convert-to-plink \
--kmer-results step3_output \
--grouped-variants step1_output/variants.fasta \
--out final_output
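The same four steps can also be driven from a Python script. The sketch below only uses the commands and flags shown in this document; the function names, the `workdir` layout, and the dry-run default are assumptions made for illustration. It requires Python 3.8+ for `shlex.join`.

```python
import shlex
import subprocess


def build_pipeline(vcf, ref, workdir, window_size=4, threads=4):
    """Return the four panherit commands as argument lists, in run order."""
    step1 = f"{workdir}/step1_output"
    step2 = f"{workdir}/step2_output"
    step3 = f"{workdir}/step3_output"
    final = f"{workdir}/final_output"
    grouped = f"{step1}/variants.fasta"
    return [
        ["panherit", "process-vcf", "--vcf", vcf, "--ref", ref, "--out", step1],
        ["panherit", "run-alignments", "--grouped-variants", grouped,
         "--ref", ref, "--out", step2, "--threads", str(threads)],
        ["panherit", "process-kmers", "--alignments", step2,
         "--window-size", str(window_size), "--out", step3],
        ["panherit", "convert-to-plink", "--kmer-results", step3,
         "--grouped-variants", grouped, "--out", final],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it as well unless dry_run is True."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status,
            # which stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline("input.vcf", "reference.fasta", "run1"))
```

Running it as written only prints the commands. Pass `dry_run=False` to execute them once panherit is installed and on the PATH.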
- Each command will create its output directory if it doesn't exist
- Log files are generated for each step
- Use --help with any command for detailed options
- For large datasets, adjust thread count based on available resources
If any step fails, the tool will:
- Display an error message
- Log the error details
- Exit with a non-zero status code
Example error checking:
panherit process-vcf --vcf input.vcf --ref ref.fa --out output || {
echo "VCF processing failed"
exit 1
}
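The same check can be made from Python by inspecting the exit status. This is a minimal sketch that relies only on the non-zero exit status described above; `run_step` is a hypothetical helper, not part of panherit.

```python
import subprocess
import sys


def run_step(cmd):
    """Run one command; return True on exit status 0, otherwise report and return False."""
    try:
        result = subprocess.run(cmd)
    except FileNotFoundError:
        # The executable itself was not found (for example, not on the PATH).
        print(f"command not found: {cmd[0]}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(f"step failed (exit {result.returncode}): {' '.join(cmd)}", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    ok = run_step(["panherit", "process-vcf", "--vcf", "input.vcf",
                   "--ref", "ref.fa", "--out", "output"])
    if not ok:
        print("VCF processing failed", file=sys.stderr)
        sys.exit(1)
```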
- Python 3.8+
- MUSCLE 5
- MAFFT v7.526
- PLINK 1.90
- External dependencies:
  - pandas
  - numpy
  - biopython
  - click
  - tqdm
- Adjust thread count based on available CPU cores
- Ensure sufficient disk space for temporary files
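One way to choose a value for --threads is to derive it from the number of CPU cores the machine reports. The helper below is a sketch under that assumption; its name and the policy of leaving one core free are illustrative choices, not recommendations from the tool itself.

```python
import os


def pick_threads(requested=None, reserve=1):
    """Pick a thread count that does not exceed the usable CPU cores.

    `reserve` cores are left free for the rest of the system; the result
    is always at least 1.
    """
    available = os.cpu_count() or 1  # cpu_count() may return None
    usable = max(1, available - reserve)
    if requested is None:
        return usable
    return max(1, min(requested, usable))


if __name__ == "__main__":
    print("Suggested --threads value:", pick_threads())
```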
Common issues and solutions are documented in our troubleshooting guide.
Contributions are welcome! Please read our contributing guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this tool in your research, please cite:
[Citation information to be added]
For questions and support:
- Open an issue on GitHub
- Email: [email protected]