GenomeSearch scans and analyses DNA files from popular family tree providers (23andMe, Ancestry.com, etc.), comparing the user's genome with published literature on the health risks and conditions that their gene variants are correlated with. SNP data comes from several sources, e.g. SNPedia, Ensembl, and GProfiler.
For security reasons, the user's patient data is not shared with the server; it remains on their machine (in the web browser's IndexedDB). The SNP data from published literature is provided by the server.
- This repo is the GenomeSearch API/Orchestrator, a FastAPI-based (Python) server application (served with Uvicorn) that provides endpoints for managing and querying genome (gene variant) data. Patient data is combined with SNP pairs data to show health risks.
- (The React-based (TypeScript) UI client application (built with Next.js), which queries the GenomeSearch API, is here: GenomeSearch UI)
- Comparison of an ancestry website's DNA report with SNPedia data: the major/minor alleles of gene variants, their associated gene, chromosome position, etc.
- Long-term: A cron job will trigger a Python script to download and parse the weekly VCF releases of CIViC and ClinVar data, and upload them to a Postgres database. Separate tables will be generated for genome builds GRCh37 and GRCh38 (Genome Reference Consortium Human genome builds 37 and 38, also known as hg19 and hg38 respectively; these builds relate to the 1000 Genomes Project), as well as for mono-allelic variants and complex multi-allelic variants. The tables will be augmented with allele frequencies from the ExAC and gnomAD datasets, as these are often consulted when analysing ClinVar variants.
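As a rough sketch of the planned split between mono-allelic and multi-allelic variants, the snippet below separates VCF data lines by whether the ALT column lists more than one allele. The function name `split_by_allele_count`, the sample records, and the decision to keep only the first five columns are illustrative assumptions, not code from this repo; a real ClinVar or CIViC import would more likely rely on a dedicated VCF parser.

```python
from typing import Dict, List, Tuple

# The first five fixed columns of a VCF data line.
VCF_COLUMNS = ["chrom", "pos", "id", "ref", "alt"]


def split_by_allele_count(
    vcf_text: str,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (mono_allelic, multi_allelic) records from VCF text."""
    mono: List[Dict[str, str]] = []
    multi: List[Dict[str, str]] = []
    for line in vcf_text.splitlines():
        # Skip blank lines, "##" meta lines, and the "#CHROM" header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:5]))
        # VCF separates multiple alternate alleles with commas.
        if "," in record["alt"]:
            multi.append(record)
        else:
            mono.append(record)
    return mono, multi


# Placeholder records for illustration only (not real variants).
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\tplaceholder_1\tA\tG\t.\t.\t.\n"
    "2\t2000\tplaceholder_2\tC\tT,G\t.\t.\t.\n"
)

mono_allelic, multi_allelic = split_by_allele_count(SAMPLE_VCF)
print(len(mono_allelic), len(multi_allelic))
```

Each list could then be loaded into its own table, once per genome build.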
- Genome assembly GRCh38.p14 (Taxon: Homo sapiens (human); Synonym: hg38; Assembly type: Haploid with alt loci; Genes: 59,715; Chromosomes: 1 to 22, X, Y, and MT.)
- NIH dbSNP gene details pages: https://www.ncbi.nlm.nih.gov/snp/[rsid]
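A small sketch of filling in the `[rsid]` placeholder of the URL pattern above. The helper name `dbsnp_url` and the validation rule (the letters `rs` followed by digits) are assumptions made for illustration; `rs1234` is just a placeholder identifier.

```python
import re

DBSNP_URL_TEMPLATE = "https://www.ncbi.nlm.nih.gov/snp/{rsid}"


def dbsnp_url(rsid: str) -> str:
    """Build the dbSNP details page URL for an identifier such as 'rs1234'."""
    rsid = rsid.strip().lower()
    # Reject anything that is not "rs" followed by one or more digits.
    if not re.fullmatch(r"rs\d+", rsid):
        raise ValueError(f"Not an rsid: {rsid!r}")
    return DBSNP_URL_TEMPLATE.format(rsid=rsid)


example_url = dbsnp_url("rs1234")
print(example_url)
```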
* Chromosomes
- This also serves to demonstrate:
- A FastAPI implementation (with Uvicorn)
- The controller/services approach.
- Pydantic typing.
- RESTful and WebSocket connections for real-time, low-latency communication. Also, Swagger documentation.
- Consideration of DRY/SOLID principles, and Gang of Four design patterns
- SQLite3Worker: This library implements a thread pool pattern with sqlite3 as the target. It creates a queue to manage the queries sent to the database. (The sqlite3 implementation cannot safely modify a sqlite3 database from multiple threads, outside of the compile-time options.) Instead of calling the sqlite3 interface directly, callers invoke the Sqlite3Worker, which inserts the query into a Queue.Queue() object. The queries are processed in the order in which they were inserted into the queue (first in, first out). To ensure that all threads are managed through the same queue, pass the same Sqlite3Worker object to each thread.
- VirtualEnv Python environment, with `.env` files.
  `python3 src/main.py` (effectively, the Uvicorn invocation within is: `python -m uvicorn main:app --reload`)
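The queue-based approach described for SQLite3Worker can be sketched with the standard library alone. The class below, `QueuedSqliteWorker`, is a hypothetical stand-in written for illustration and is not the library's actual code or API: a single thread owns the connection, callers on any thread enqueue queries, and the queries run in first-in, first-out order.

```python
import queue
import sqlite3
import threading


class QueuedSqliteWorker:
    """Hypothetical sketch: one thread owns the sqlite3 connection."""

    _STOP = object()  # Sentinel that tells the worker thread to exit.

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        self._requests: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # The connection is created and used only inside this thread.
        conn = sqlite3.connect(self._db_path)
        while True:
            item = self._requests.get()
            if item is self._STOP:
                break
            sql, params, reply = item
            try:
                rows = conn.execute(sql, params).fetchall()
                conn.commit()
                reply.put(rows)
            except sqlite3.Error as exc:
                # Hand the error back to the calling thread.
                reply.put(exc)
        conn.close()

    def execute(self, sql: str, params: tuple = ()) -> list:
        """Enqueue a query and block until the worker has processed it."""
        reply: queue.Queue = queue.Queue(maxsize=1)
        self._requests.put((sql, params, reply))
        result = reply.get()
        if isinstance(result, Exception):
            raise result
        return result

    def close(self) -> None:
        self._requests.put(self._STOP)
        self._thread.join()


# Several threads share the same worker object, and therefore the same queue.
worker = QueuedSqliteWorker(":memory:")
worker.execute(
    "CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, patient_name TEXT)"
)
threads = [
    threading.Thread(
        target=worker.execute,
        args=("INSERT INTO patients (patient_name) VALUES (?)", (f"patient-{i}",)),
    )
    for i in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
patient_count = worker.execute("SELECT COUNT(*) FROM patients")[0][0]
worker.close()
print(patient_count)
```

Because only the worker thread ever touches the connection, callers need no locking of their own.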
- Controllers:
  - Handle the HTTP requests, process input, and return HTTP responses. They should call the service layer to perform business logic.
  - Created `controllers/genome_controller.py` and `controllers/websocket_controller.py` to handle the routing logic.
  - Moved route-specific logic from `routes.py` to the respective controllers.
  - Added detailed comments to the controllers to explain the purpose and functionality of each endpoint.
- Services:
  - Contain the business logic of the application. They perform operations and call the repository layer to interact with the database.
  - Created `services/genome_service.py` and `services/websocket_service.py` to handle the business logic.
  - Moved business logic from the controllers to the respective services.
  - This separation allows for better unit testing and adherence to the Single Responsibility Principle (SRP).
- Repositories:
  - Handle direct interactions with the database. They perform CRUD operations and return data to the service layer.
  - Created `repositories/genome_repository.py` to handle database interactions.
  - Moved database interaction logic from `GenomeDatabaseManager.py` to the repository.
  - This follows the Repository Pattern, which abstracts the data access logic and provides a clean API for the domain layer.
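To make the repository layer concrete, here is a minimal sketch of a repository that wraps sqlite3 behind a few CRUD-style methods. The class name `PatientRepository`, its methods, and the `patients` table schema are assumptions for illustration; they are not the contents of `repositories/genome_repository.py`.

```python
import sqlite3
from typing import List, Optional, Tuple


class PatientRepository:
    """Hypothetical repository: the only layer that issues SQL."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS patients ("
            "patient_id INTEGER PRIMARY KEY, patient_name TEXT NOT NULL)"
        )

    def add(self, patient_name: str) -> int:
        """Insert a patient and return the generated id."""
        cursor = self._conn.execute(
            "INSERT INTO patients (patient_name) VALUES (?)", (patient_name,)
        )
        self._conn.commit()
        return cursor.lastrowid

    def get(self, patient_id: int) -> Optional[Tuple[int, str]]:
        """Return (patient_id, patient_name), or None if not found."""
        return self._conn.execute(
            "SELECT patient_id, patient_name FROM patients WHERE patient_id = ?",
            (patient_id,),
        ).fetchone()

    def list_all(self) -> List[Tuple[int, str]]:
        return self._conn.execute(
            "SELECT patient_id, patient_name FROM patients ORDER BY patient_id"
        ).fetchall()


repo = PatientRepository(sqlite3.connect(":memory:"))
new_id = repo.add("Example Patient")
print(repo.get(new_id))
```

A service can then depend on these methods without knowing which database sits underneath.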
- Models:
  - Created `models.py` to define data models using Pydantic.
  - This ensures data validation and type checking, improving the robustness of the application.
- Routes:
  - Define the API endpoints and map them to the appropriate controller functions.
- Single Responsibility Principle (SRP):
  - Each class and module has a single responsibility, making the code easier to understand and maintain.
- Repository Pattern:
  - Abstracts the data access logic, providing a clean API for the domain layer and promoting separation of concerns.
- Domain-Driven Design (DDD):
  - Organises the codebase into domains, with clear boundaries between the application, domain, and infrastructure layers.
- Dependency Injection:
  - Services and repositories are injected into controllers, promoting loose coupling and making the code more testable.
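The testability benefit of dependency injection can be sketched in plain Python: the service receives its repository through the constructor, so a test can pass in an in-memory fake instead of a database. All names here (`GenotypeRepository`, `InMemoryGenotypeRepository`, `GenomeService`, `homozygous_rsids`) and the placeholder rsids are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Protocol


class GenotypeRepository(Protocol):
    """The interface the service depends on (hypothetical)."""

    def genotypes_for(self, patient_id: int) -> Dict[str, str]:
        ...


class InMemoryGenotypeRepository:
    """A fake repository, handy for unit tests."""

    def __init__(self, data: Dict[int, Dict[str, str]]) -> None:
        self._data = data

    def genotypes_for(self, patient_id: int) -> Dict[str, str]:
        return dict(self._data.get(patient_id, {}))


class GenomeService:
    """Business logic only; the data source is injected."""

    def __init__(self, repository: GenotypeRepository) -> None:
        self._repository = repository

    def homozygous_rsids(self, patient_id: int) -> List[str]:
        """Return the rsids where both alleles are the same letter."""
        genotypes = self._repository.genotypes_for(patient_id)
        return sorted(
            rsid
            for rsid, genotype in genotypes.items()
            if len(genotype) == 2 and genotype[0] == genotype[1]
        )


# Placeholder data for illustration only.
fake_repo = InMemoryGenotypeRepository(
    {1: {"rs_placeholder_a": "AA", "rs_placeholder_b": "AG"}}
)
service = GenomeService(fake_repo)
result = service.homozygous_rsids(1)
print(result)
```

In a FastAPI application the same wiring would typically be done with the framework's dependency mechanism rather than by hand.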
Benefits:
- Maintainability: The separation of concerns makes the codebase easier to understand and maintain.
- Scalability: The modular design allows for easy addition of new features and scaling of the application.
- Testability: The clear separation of business logic, data access, and routing logic makes unit testing more straightforward.
- Robustness: Data validation using Pydantic models ensures that the data conforms to the expected schema.

Drawbacks:
- Complexity: The initial setup and understanding of the architecture may be more complex compared to a monolithic design.
- Overhead: The additional layers and abstractions may introduce some overhead, but this is often outweighed by the benefits in larger applications.
- The family tree/ancestry websites (i.e. those that provide DNA tests based on saliva samples) do not always use the same lettering as SNPedia.
- The family tree/ancestry websites test for many, but not all, gene variants.
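One common reason for differing lettering between data sources is strand orientation: the same variant reads as A/G on one DNA strand and T/C on the other. Assuming that is the mismatch in question here (the text above does not say so explicitly), the sketch below compares genotypes both directly and after complementing. The function names are hypothetical.

```python
# Map each base to its complement on the opposite strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_genotype(genotype: str) -> str:
    """Return the genotype as read on the opposite strand, e.g. 'AG' -> 'TC'."""
    return genotype.upper().translate(COMPLEMENT)


def genotypes_match(provider_genotype: str, reference_genotype: str) -> bool:
    """Compare two genotypes, ignoring allele order and strand orientation."""
    provider = sorted(provider_genotype.upper())
    reference = sorted(reference_genotype.upper())
    flipped = sorted(complement_genotype(provider_genotype))
    return provider == reference or flipped == reference


print(complement_genotype("AG"))
print(genotypes_match("AG", "CT"))
```

Note that this simple check is ambiguous for A/T and C/G variants, where a genotype and its complement use the same letters; those cases need the orientation recorded by each source.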
There are several endpoint options:
- `patients/`: Retrieves a list of the patients (that have been uploaded to the SQLite database).
- `snp_research/`: Retrieves the SNP Pairs data (from published literature, i.e. SNPedia); the underlying method is called when the Uvicorn FastAPI server is launched.
- `patient_profile/`: Retrieves the patient id and patient name.
- `patient_genome_data/`: Retrieves patient genotypes (gene variants and the associated two alleles).
- `patient_genome_data_expanded/`: Retrieves the patient profile and their genotypes, joined on `patient_id`.
- `full_report/`: Retrieves the `patient_genome_data_expanded` and published literature (joined on `rsid`).
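The join on `rsid` behind `full_report/` can be illustrated with an in-memory SQLite database. The table names, column names, and rows below are placeholders invented for this sketch; the actual schema used by the server may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patient_genome_data (patient_id INTEGER, rsid TEXT, genotype TEXT);
    CREATE TABLE snp_research (rsid TEXT, magnitude REAL, risk TEXT, notes TEXT);
    """
)

# Placeholder rows for illustration only (not real variants or findings).
conn.executemany(
    "INSERT INTO patient_genome_data VALUES (?, ?, ?)",
    [(1, "rs_placeholder_a", "AG"), (1, "rs_placeholder_b", "CC")],
)
conn.execute(
    "INSERT INTO snp_research VALUES (?, ?, ?, ?)",
    ("rs_placeholder_a", 2.0, "G", "placeholder note"),
)

# An inner join keeps only the patient variants that have published research.
report_rows = conn.execute(
    """
    SELECT p.patient_id, p.rsid, p.genotype, s.magnitude, s.risk, s.notes
    FROM patient_genome_data AS p
    JOIN snp_research AS s ON s.rsid = p.rsid
    WHERE p.patient_id = ?
    ORDER BY p.rsid
    """,
    (1,),
).fetchall()
print(report_rows)
```

A `LEFT JOIN` would instead keep every patient variant, with empty research columns where nothing has been published.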
- Clone the repository:
    git clone https://github.com/mathematicuslucian/Genome=Search-api.git
    cd Genome=Search-api
- Install the dependencies:
    pip install -r requirements.txt
  The `requirements.txt` file lists the dependencies to install (it was generated with the command `pipreqs src/ --force --ignore=tests`):
  - Python 3.8+
  - FastAPI
  - Uvicorn
  - Pydantic
  - SQLite3
  For the tests, the following dependencies are required:
  - mock==5.1.0
  - pandas==2.2.3
  - pytest==8.3.4
- Set up the environment variables:
    cp .env.example .env
- Set up the Python environment (VirtualEnv):
  - Create: `python -m venv genomebrowser`
  - Launch environment: `source genomebrowser/bin/activate`
  - Delete environment: `deactivate` and `rm -r genomebrowser`
- Run the application:
    uvicorn main:app --reload
- Load a Patient Genome: `/genome/load_patient/{genome_file_name_with_path}`
  - If you provide the string `default` for `genome_file_name_with_path`, it will load the default patient data (`genome_Lilly_Mendel_v4.txt`).
- From your UI client app, or Postman, etc., call `http://127.0.0.1:8000/api`.
(The API documentation is available at `/docs` (Swagger UI) and `/redoc` (ReDoc), i.e. open a browser and go to either http://127.0.0.1:8000/docs or http://127.0.0.1:8000/redoc.)
Run Tests:
- Ensure `pytest` is installed in your environment.
- Run the tests using the command: `pytest tests`
Details:
The `test` directory contains `pytest` test cases for the `GenomeBrowser` class methods, which include both positive and negative scenarios, as well as exception handling.
Fixtures:
- `genome_browser`: A pytest fixture to initialise the `GenomeBrowser` instance with mock data.
- Mocking the `patient_genome_df` attribute of the `GenomeBrowser` class to use mock data instead of actual data files.
Examples of the Test Cases:
- `test_retrieve_data_by_column_positive`: Positive case: valid column and key.
- `test_retrieve_data_by_column_positive_another_key`: Positive case: another valid column and key.
- `test_retrieve_data_by_column_negative_column_not_found`: Negative case: column not found.
- `test_retrieve_data_by_column_negative_key_not_found`: Negative case: key not found.
- `test_retrieve_data_by_column_no_genome_data`: Negative case: no genome data loaded.
- `test_retrieve_data_by_column_invalid_column_type`: Negative case: invalid column type.
- `test_fetch_gene_variant_positive`: Positive case: valid fetch_gene_variant.
- `test_fetch_gene_variant_negative_key_not_found`: Negative case: fetch_gene_variant key not found.
- `test_fetch_gene_variant_invalid_key_type`: Exception handling: invalid key type.
Published Genome
genome_Lilly_Mendel_v4.txt
Chromosome data from Ensembl: Source; Chromosome 1, Chromosome 2, Chromosome 3, Chromosome 4, Chromosome 5, Chromosome 6, Chromosome 7, Chromosome 8, Chromosome 9, Chromosome 10, Chromosome 11, Chromosome 12, Chromosome 13, Chromosome 14, Chromosome 15, Chromosome 16, Chromosome 17, Chromosome 18, Chromosome 19, Chromosome 20, Chromosome 21, Chromosome 22, X Chromosome, Y Chromosome
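For orientation, here is a hedged sketch of reading a raw genome file such as `genome_Lilly_Mendel_v4.txt`. It assumes the 23andMe raw data layout: `#`-prefixed comment lines followed by tab-separated `rsid`, `chromosome`, `position`, and `genotype` columns. The `Genotype` type, the `parse_raw_genome` function, and the sample rows are placeholders for illustration, not this repo's loader.

```python
from typing import Iterator, NamedTuple


class Genotype(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_raw_genome(text: str) -> Iterator[Genotype]:
    """Yield one Genotype per data line, skipping comments and blank lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield Genotype(rsid, chromosome, int(position), genotype)


# Placeholder rows for illustration only (not taken from the real file).
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs_placeholder_a\t1\t1000\tAG\n"
    "rs_placeholder_b\tX\t2000\tC\n"
)

parsed = list(parse_raw_genome(SAMPLE))
print(parsed[0])
```

Chromosome is kept as a string because values such as `X`, `Y`, and `MT` are not numeric.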
SNP Pairs Data
The major/minor alleles of gene variants, their associated gene, chromosome position, etc.
snp_data.csv
Columns: RSID, Magnitude, Risk, Notes
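A short sketch of reading a file with these four columns using the standard `csv` module. Only the column names come from the description above; the sample rows, the assumption that `Magnitude` is numeric, and the `load_snp_rows` helper are illustrative.

```python
import csv
import io
from typing import Dict, List, TextIO

# Placeholder content for illustration only (not real SNP annotations).
SAMPLE_CSV = (
    "RSID,Magnitude,Risk,Notes\n"
    "rs_placeholder_a,2.5,G,placeholder note\n"
    "rs_placeholder_b,0,,\n"
)


def load_snp_rows(handle: TextIO) -> List[Dict[str, object]]:
    """Read SNP rows, normalising the rsid and converting the magnitude."""
    rows: List[Dict[str, object]] = []
    for row in csv.DictReader(handle):
        rows.append(
            {
                "rsid": row["RSID"].strip().lower(),
                # Treat an empty magnitude as 0 (an assumption).
                "magnitude": float(row["Magnitude"] or 0),
                # Represent a missing risk allele as None.
                "risk": row["Risk"] or None,
                "notes": row["Notes"],
            }
        )
    return rows


snp_rows = load_snp_rows(io.StringIO(SAMPLE_CSV))
notable = [row["rsid"] for row in snp_rows if row["magnitude"] >= 2]
print(notable)
```

With a real file, `open("snp_data.csv", newline="")` would replace the `io.StringIO` wrapper.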
No approval is granted for third-party usage.