Better logging and file fetching
Smat26 committed Jan 7, 2025
1 parent 40bb05c commit 4d7ecd6
Showing 6 changed files with 85 additions and 78 deletions.
115 changes: 55 additions & 60 deletions README.md
@@ -1,16 +1,11 @@
# mmcif-gen

A versatile command-line tool for generating mmCIF files from various facility data sources. This tool supports both generic mmCIF file generation and specialized investigation file creation for facilities like PDBe, MAX IV, XChem, and ESRF.
A versatile command-line tool for generating any mmCIF file from various data sources. The tool can be used to create:

## Features
1. Metadata mmCIF files (to capture experimental metadata from different facilities)
2. Investigation mmCIF files (e.g. https://ftp.ebi.ac.uk/pub/databases/msd/fragment_screening/investigations/)

- Generate mmCIF files from various data sources (SQLite, JSON, CSV, etc.)
- Create standardized investigation files for facility data
- Support for multiple facilities (PDBe, MAX IV, ESRF, XChem)
- Configurable transformations via JSON definitions
- Auto-fetching of facility-specific configurations
- Modular design for easy extension to new data sources
- Data enrichment capabilities
The tool applies transformational mappings to convert data as it is stored at various facilities into the corresponding categories and items of the mmCIF format.

## Installation

@@ -27,108 +22,111 @@ The tool provides two main commands:
1. `fetch-facility-json`: Fetch facility-specific JSON configuration files
2. `make-mmcif`: Generate mmCIF files using the configurations

### Fetching Facility Configurations
### Fetching Facility JSON Files

The JSON operations files define how data is mapped from the original source and translated into mmCIF format.

These files can be written by hand, but they can also be fetched from the GitHub repository with a simple command.

```bash
# Fetch configuration for a specific facility
mmcif-gen fetch-facility-json dls-metadata

# Specify custom output directory
mmcif-gen fetch-facility-json dls-metadata -o ./configs
mmcif-gen fetch-facility-json dls-metadata -o ./mapping_operations
```
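
The fetched operations file can then be supplied to `make-mmcif` through its `--json` option, as in the examples below.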

### Generating mmCIF Files
### Generating Metadata mmCIF Files

Currently, the valid facilities for generating mmCIF files are `pdbe`, `maxiv`, `dls`, and `xchem`.

The general syntax for generating mmCIF files is:

```bash
mmcif-gen make-mmcif <facility> [options]
```

Each facility has its own set of required parameters:
Each facility has its own set of required parameters, which can be checked by running the command with the `--help` flag.


```bash
mmcif-gen make-mmcif pdbe --help
```
#### Example Usage

#### DLS (Diamond Light Source)

```bash
# Using metadata configuration
mmcif-gen make-mmcif dls --json dls_metadata.json --output-folder ./out --id id_1234 --dls-json metadata-from-isypb.json
```
### Working with Investigation Files

Investigation files are a specialized type of mmCIF file that captures metadata across multiple experiments.

Investigation files are created in a very similar way:

#### PDBe

```bash
# Using model folder
mmcif-gen make-mmcif pdbe --model-folder ./models --output-folder ./out --identifier I_1234
mmcif-gen make-mmcif pdbe --json pdbe_investigation.json --model-folder ./models --output-folder ./out --id I_1234

# Using PDB IDs
mmcif-gen make-mmcif pdbe --pdb-ids 6dmn 6dpp 6do8 --output-folder ./out
mmcif-gen make-mmcif pdbe --json pdbe_investigation.json --pdb-ids 6dmn 6dpp 6do8 --output-folder ./out

# Using CSV input
mmcif-gen make-mmcif pdbe --csv-file groups.csv --output-folder ./out
mmcif-gen make-mmcif pdbe --json pdbe_investigation.json --csv-file groups.csv --output-folder ./out
```

#### MAX IV

```bash
# Using SQLite database
mmcif-gen make-mmcif maxiv --sqlite fragmax.sqlite --output-folder ./out --identifier I_5678
mmcif-gen make-mmcif maxiv --json maxiv_investigation.json --sqlite fragmax.sqlite --output-folder ./out --id I_5678
```

#### XChem

```bash
# Using SQLite database with additional information
mmcif-gen make-mmcif xchem --sqlite soakdb.sqlite --txt ./metadata --deposit ./deposit --output-folder ./out
mmcif-gen make-mmcif xchem --json xchem_investigation.json --sqlite soakdb.sqlite --txt ./metadata --deposit ./deposit --output-folder ./out
```

#### DLS (Diamond Light Source)

## Data Enrichment

For investigation files that need enrichment with additional data (e.g., ground state information):

```bash
# Using metadata configuration
mmcif-gen make-mmcif dls --json dls_metadata.json --output-folder ./out --identifier DLS_2024
# Using the miss_importer utility
python miss_importer.py --investigation-file inv.cif --sf-file structure.sf --pdb-id 1ABC
```

## Configuration Files
## Operation JSON Files

The tool uses JSON configuration files to define how data should be transformed into mmCIF format. These files can be:

1. Fetched from the official repository using the `fetch-facility-json` command
2. Created custom for specific needs
3. Modified versions of official configurations
1. Fetched files using the `fetch-facility-json` command
2. Modified versions of official configurations

### Configuration File Structure

```json
{
    "source_category": "source_table_name",
    "target_category": "_target_category",
    "operations": [
        {
            "source_items": ["column1", "column2"],
            "target_items": ["_target.item1", "_target.item2"],
            "operation": "direct_transfer"
            "source_category": "_audit_author",
            "source_items": ["name"],
            "target_category": "_audit_author",
            "target_items": "_same",
            "operation": "distinct_union",
            "operation_parameters": {
                "primary_parameters": ["name"]
            }
        }
    ]
}
```
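
In this example, `target_items` set to `_same` reuses the source item names, and `distinct_union` merges values across the input sources, de-duplicating on the `primary_parameters`; this reading is inferred from the field names, so consult the operation implementations in `operations.py` for the authoritative behaviour. A single operation entry of this kind, written out on its own as valid JSON, looks like:

```json
{
    "source_category": "_audit_author",
    "source_items": ["name"],
    "target_category": "_audit_author",
    "target_items": "_same",
    "operation": "distinct_union",
    "operation_parameters": {
        "primary_parameters": ["name"]
    }
}
```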

## Working with Investigation Files

Investigation files are a specialized type of mmCIF file that capture metadata across multiple experiments. To create investigation files:
Refer to existing JSON files in the `operations/` directory for examples.

1. Use the appropriate facility subcommand
2. Specify the investigation ID
3. Provide the required facility-specific data source

```bash
# Example for PDBe investigation
mmcif-gen make-mmcif pdbe --model-folder ./models --identifier INV_001 --output-folder ./investigations

# Example for MAX IV investigation
mmcif-gen make-mmcif maxiv --sqlite experiment.sqlite --identifier INV_002 --output-folder ./investigations
```

## Data Enrichment

For investigation files that need enrichment with additional data (e.g., ground state information):

```bash
# Using the miss_importer utility
python miss_importer.py --investigation-file inv.cif --sf-file structure.sf --pdb-id 1ABC
```

## Development

@@ -159,9 +157,6 @@ python -m unittest discover -s tests

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

## License

[MIT License](LICENSE)

## Support

19 changes: 10 additions & 9 deletions facilities/dls.py
@@ -2,18 +2,19 @@
from investigation_io import JsonReader
from typing import List
import sys
import os
import logging

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
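# Console logging for this module; mmcif_gen.py additionally attaches a rotating file handler to the root logger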


class InvestigationDLS(InvestigationEngine):

    def __init__(self, json_path: str, investigation_id: str, output_path: str, transformation_json: str="./operations/dls/dls_metadata.json") -> None:
        logging.info("Instantiating DLS Investigation subclass")
    def __init__(self, json_path: str, id: str, output_path: str, transformation_json: str="./operations/dls/dls_metadata.json") -> None:
        logging.info("Instantiating DLS subclass")
        logging.info(f"Creating file id: {id}")
        self.json_reader = JsonReader(json_path)
        self.operation_file_json = transformation_json
        super().__init__(investigation_id, output_path)
        super().__init__(id, output_path)

    def pre_run(self) -> None:
        logging.info("Pre-running")
@@ -24,17 +25,17 @@ def dls_subparser(subparsers, parent_parser):
    parser_dls = subparsers.add_parser("dls", help="Parameter requirements for creating investigation files from DLS data", parents=[parent_parser])

    parser_dls.add_argument(
        "--json",
        help="Path to the .json file"
        "--dls-json",
        help="Path to the .json file created from ISPyB"
    )

def run(json_path: str, investigation_id: str, output_path: str) -> None:
    im = InvestigationDLS(json_path, investigation_id, output_path)
def run(dls_json_path: str, id: str, output_path: str, operation_json_path: str) -> None:
    im = InvestigationDLS(dls_json_path, id, output_path, operation_json_path)
    im.pre_run()
    im.run()

def run_investigation_dls(args):
    if not args.dls_json:
        logging.error("DLS facility requires path to --dls-json file generated from ISPyB")
        return 1
    run(args.dls_json, args.investigation_id, args.output_folder, args.json)
    run(args.dls_json, args.id, args.output_folder, args.json)
2 changes: 1 addition & 1 deletion facilities/maxiv.py
@@ -306,7 +306,7 @@ def run_investigation_maxiv(args):
    if not args.sqlite:
        logging.error("Max IV facility requires path to --sqlite file")
        return 1
    run(args.sqlite, args.investigation_id, args.output_folder, args.json)
    run(args.sqlite, args.id, args.output_folder, args.json)



2 changes: 1 addition & 1 deletion facilities/pdbe.py
@@ -490,7 +490,7 @@ def download_and_run_pdbe_investigation(pdb_ids: List[str], investigation_id: st

def run_investigation_pdbe(args):
    if args.model_folder:
        run(args.model_folder, args.investigation_id, args.output_folder, args.json)
        run(args.model_folder, args.id, args.output_folder, args.json)
    elif args.pdb_ids:
        download_and_run_pdbe_investigation(args.pdb_ids, args.investigation_id, args.output_folder, args.json)
    elif args.csv_file:
24 changes: 17 additions & 7 deletions mmcif_gen.py
@@ -5,12 +5,20 @@
import argparse
import json
import logging
from logging.handlers import RotatingFileHandler

import os
import pathlib
import requests
import sys
from typing import Dict, List, Optional

# Persist logs to a rotating file: roughly 100 kB per file, keeping three backups
file_handler = RotatingFileHandler('mmcifgen.log', maxBytes=100000, backupCount=3)
file_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
file_handler.setLevel(logging.DEBUG)

logging.getLogger().addHandler(file_handler)
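# Attached to the root logger, so records from all modules also land in mmcifgen.log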

FACILITIES_URL = "https://raw.githubusercontent.com/PDBeurope/Investigations/main/operations/fetched_list.json"

class CLIManager:
@@ -67,7 +75,8 @@ def find_local_json(self, facility: str) -> Optional[str]:
        possible_files = [
            f"{facility}_metadata.json",
            f"{facility}_metadata_hardcoded.json",
            f"{facility}_operations.json"
            f"{facility}_operations.json",
            f"{facility}_investigation.json"
        ]

        for file in possible_files:
@@ -168,20 +177,21 @@ def main():
        print(f"Using local JSON file: {local_json}")

    if args.command == "fetch-facility-json":
        json_name = args.json_name.replace('_', '-')
        json_name = args.json_name.split('.')[0]
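        # Any file extension is stripped, so "dls_metadata" and "dls_metadata.json" both match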
        available_jsons = []
        for facility, jsons in cli_manager.fetch_facilities_data().items():
            available_jsons.extend(jsons)
        matching_jsons = [j for j in available_jsons if json_name in j.replace('_', '-')]
        if not matching_jsons:

        # Compare bare file names (directory and extension stripped) for an exact match
        available_jsons_pruned = [j.split('/')[-1].split('.')[0] for j in available_jsons]
        if json_name not in available_jsons_pruned:
            print(f"No JSON found matching '{json_name}'")
            print("\nAvailable JSONs:")
            for json_path in available_jsons:
                print(f" - {os.path.basename(json_path)}")
            sys.exit(1)

        cli_manager.fetch_facility_json(matching_jsons[0], args.output_dir)

        index_of_match = available_jsons_pruned.index(json_name)
        cli_manager.fetch_facility_json(available_jsons[index_of_match], args.output_dir)

    elif args.command == "make-mmcif":
        available_facilities = cli_manager.get_available_facilities()
1 change: 1 addition & 0 deletions operations.py
@@ -566,6 +566,7 @@ def perform_operation(self, operation_data):
        target_items = operation_data.get("target_items", [])
        operation_parameters = operation_data.get("operation_parameters", {})
        jq_filter = operation_parameters.get("jq", "")
        logging.info(f"Category: {target_category}, Item(s): {target_items}, JQ Filter: {jq_filter}")

        # Get filtered data from JSON reader
        filtered_data = self.reader.jq_filter(jq_filter)
