hubmapconsortium/asct-b-generator

ASCT+B Generator

This program converts a simple CSV file into a HuBMAP CCF ASCT+B table.

The sampledata directory includes the file "demo-input.txt", which was exported from "demo-input.xlsx" using Excel (Save As "Tab delimited Text"). The program writes its output as a CSV file; the example file "demo-output.csv" was generated by the program.

Change Log

See the ChangeLog for the latest developments.

Known Issues

See the Issue Tracker for known issues.

Assumptions

The following assumptions are built into the program.

  1. The ASCT+B table format allows anatomical structures that are not "leaves" to contain biomarkers or references.
  2. All anatomical structures must be uniquely named. For example, there cannot be two structures called "ovary", but there can be "left ovary" and "right ovary".
  3. Cell type is only one level.
  4. Commas cannot be used in names for anatomical structures, cells, or features.
  5. It is assumed that the "author preferred name" is unique across anatomical structures and ontology IDs.

Data validation

The program performs the following data validation checks.

  1. Check that there is only one root to the anatomical structure.
  2. Enforce the parent requirements for anatomical structure. By default an anatomical structure can have multiple parents. For example, the primary ovarian follicle and the primordial ovarian follicle both have a granulosa cell layer. A command line argument can change this behavior such that anatomical structures can have only one parent.
  3. Check that anatomical structures, cells, biomarkers and references are appropriately defined. By default the program requires that all features be explicitly defined, although a command line argument can disable this requirement.
  4. Check that anatomical structures, cell types, biomarkers and references all have unique names.
  5. Check that names do not contain commas.
  6. Check that biomarkers and references are only applied to anatomical structures and cell types.
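
A few of the checks above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function name `validate`, the `unique_parent` parameter, and the record layout (a dict with `name`, `type`, and `children` keys) are assumptions made for the example.

```python
from collections import defaultdict


def validate(records, unique_parent=False):
    """Return a list of error messages for hypothetical record dicts.

    Each record is assumed to look like:
    {"name": "ovary", "type": "AS", "children": ["central ovary"]}
    """
    errors = []

    # Names must be unique and must not contain commas.
    seen = set()
    for record in records:
        name = record["name"]
        if name in seen:
            errors.append(f"duplicate name: {name}")
        seen.add(name)
        if "," in name:
            errors.append(f"name contains a comma: {name}")

    # Map each child anatomical structure to the set of its parents.
    structures = [r for r in records if r["type"] == "AS"]
    parents = defaultdict(set)
    for record in structures:
        for child in record["children"]:
            parents[child].add(record["name"])

    # Exactly one anatomical structure may have no parent (the root).
    roots = [r["name"] for r in structures if not parents.get(r["name"])]
    if len(roots) != 1:
        errors.append(f"expected exactly one root, found {len(roots)}: {roots}")

    # Optionally require a single parent, mirroring the -u flag's description.
    if unique_parent:
        for child, found in parents.items():
            if len(found) > 1:
                errors.append(f"{child} has multiple parents: {sorted(found)}")

    return errors
```

Returning a list of messages rather than exiting on the first problem is a design choice of this sketch; it lets all problems in an input file be reported at once.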

Requirements

This program has only been tested on macOS using Python 3, although it should also work on Linux.

The program requires the anytree Python package.

https://pypi.org/project/anytree/

The anytree package can be installed as follows.

python3 -m pip install anytree --user

Usage

usage: process.py [-h] [-m] [-u] [-d] [-v] input output

Generate ASCT+B table.

positional arguments:
  input          Input file
  output         Output file (CSV)

optional arguments:
  -h, --help     show this help message and exit
  -m, --missing  Ignore missing cell types, biomarkers and references. For example, if a cell type is marked as containing a biomarker that wasn't defined, this flag would prevent the program from exiting with an error and instead the ASCT+B table would be generated. When the flag isn't used, all features must be defined.
  -u, --unique   Make sure all anatomical structures have one and only one parent.
  -d, --dot      Output tree as a DOT file for plotting with Graphviz.
  -v, --verbose  Print the tree to the terminal.

To process the demo input file and generate a CSV file that can be opened by Excel:

process.py <input CSV file> <output CSV file>
process.py demo-input.txt demo-output.csv
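
The help text above has the shape of output produced by Python's argparse module. A parser with the same options might look like the following sketch. It is reconstructed from the help text only, so the function name `build_parser` and the shortened help strings are assumptions, not the contents of process.py.

```python
import argparse


def build_parser():
    """Build a parser matching the documented usage (a reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="process.py", description="Generate ASCT+B table."
    )
    # Boolean flags, off by default.
    parser.add_argument("-m", "--missing", action="store_true",
                        help="Ignore missing cell types, biomarkers and references.")
    parser.add_argument("-u", "--unique", action="store_true",
                        help="Require exactly one parent per anatomical structure.")
    parser.add_argument("-d", "--dot", action="store_true",
                        help="Output tree as a DOT file for plotting with Graphviz.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print the tree to the terminal.")
    # Positional arguments, in the documented order.
    parser.add_argument("input", help="Input file")
    parser.add_argument("output", help="Output file (CSV)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```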

Input file (CSV)

The comma delimited file (tab separated is also supported) must contain a header line and the following fifteen columns:

NAME (REF DOI) LABEL (REF DETAILS) ID (REF NOTES) NOTE ABBR TYPE CHILDREN CELLS GENES PROTEINS PROTEOFORMS LIPIDS METABOLITES FTUs REFERENCES

The Type value must be "AS" for anatomical structures and "CT" for cell types. Any other value can be used for the remaining items, as long as it is neither AS nor CT.

Children is a comma separated list of child objects, which must be anatomical structures (AS). The Cells, Genes, Proteins, Proteoforms, etc. fields should be comma separated lists of the appropriate objects (e.g., Cells should be a comma separated list of relevant cells). In all cases the object's Name or Ref DOI should be used.

The first line in the input file is assumed to contain a header and is ignored.
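
As an illustration of the format rules above, the following sketch reads delimited text, skips the header line, and splits a list field into items. It uses only the standard csv module. The helper names `read_rows` and `split_list` are hypothetical, and picking the delimiter by looking for a tab in the header line is an assumption of this sketch, not necessarily how the program decides.

```python
import csv
import io


def read_rows(text):
    """Parse input text into rows of fields, skipping the header line."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    # Assumption: a tab in the header line means the file is tab separated.
    delimiter = "\t" if "\t" in first_line else ","
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # the first line is a header and is ignored
    # Drop rows that are entirely blank.
    return [row for row in reader if any(field.strip() for field in row)]


def split_list(field):
    """Split a comma separated list field into stripped, non-empty names."""
    return [item.strip() for item in field.split(",") if item.strip()]


# Simplified three-column input for illustration (the real format has more columns).
sample = "NAME\tTYPE\tCHILDREN\novary\tAS\tcentral ovary, lateral ovary\n"
rows = read_rows(sample)
children = split_list(rows[0][2])
```

In a comma delimited file, a list field such as Children would need to be quoted so that its internal commas are not read as column separators; the csv module handles such quoting when reading.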

The following example is incomplete and just included to exemplify the field values and usage:

NAME (REF DOI)	LABEL (REF DETAILS)	ID (REF NOTES)	NOTE	ABBR	TYPE	CHILDREN	CELLS	GENES	PROTEINS	PROTEOFORMS	LIPIDS	METABOLITES	FTU	REFERENCES (NAME/DOI)
ovary		UBERON:0000992	AS	central ovary, lateral ovary, medial ovary, mesovarium, ovarian ligament	hilum of ovary
central ovary			AS	central inferior ovary, central superior ovary	
lateral ovary			AS	lateral inferior ovary, lateral superior ovary	
medial ovary			AS	medial inferior ovary, medial superior ovary	
mesovarium		UBERON:0001342	AS		
ovarian ligament		UBERON:0008847	AS		
hilum of ovary			AS	ovarian artery, ovarian vein, pampiniform plexus, rete ovarii	hilar cell
corona radiata		CL:0000713	CT									doi:10.1093/oxfordjournals.humrep.a136365
hilar cell		CL:0002095	CT				alkaline phosphatase, acid phosphatase, non-specific esterase, inhibin, calretinin, melan-A, cholesterol esters					McKay et al 1961, Boss et al 1965, Mills et al 2020, Jungbluth et al 1998, Pelkey et al 1998
mural granulosa cell			CT									doi:10.1093/oxfordjournals.humrep.a136365
primary oocyte		CL:0000654	CT									doi:10.1093/oxfordjournals.humrep.a136365
secondary oocyte		CL:0000655	CT									doi:10.1093/oxfordjournals.humrep.a136365
columnar ovarian surface epithelial columnar cell			CT				calretinin, mesothelin					Mills et al 2020, Reeves et al 1971, Hummitzsch et al 2013, Blaustein et al 1979, McKay et al 1961
flattened cuboidal ovarian surface epithelial cell			CT				oviduct-specific glycoprotein-1, E-cadherin					Mills et al 2020, Reeves et al 1971, Hummitzsch et al 2013, Blaustein et al 1979, McKay et al 1961
oviduct-specific glycoprotein-1			Protein									
mesothelin			Protein									
E-cadherin			Protein									
doi:10.1093/oxfordjournals.humrep.a136365	PMID: 3558758		Reference									
McKay et al 1961	McKay, D., Pinkerton, J., Hertig, A. & Danziger, S. (1961). The Adult Human Ovary: A Histochemical Study. Obstetrics & Gynecology, 18(1), 13-39. 		Reference									
