Update README.md and clean up (#13)
* added changelog.md

* updated readme.md

* Update README.md

* Update README.md

Co-authored-by: Bruce W. Herr II <[email protected]>
bhushankhope and bherr2 authored Jun 27, 2022
1 parent e1d512e commit 8b95238
Showing 10 changed files with 340 additions and 109 deletions.
58 changes: 58 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# Changelog

Changelog for the ASCT+B Generator

## 1.1.0 - 2022-06-25

* Added a feature to generate CSV files from TSV/CSV files.
* Added typings and docstrings.
* Modified the code according to PEP-8 style guide.

## 1.0.3 - 2022-05-25

### Added in 1.0.3

* Fixed a bug causing each reference to incorrectly span 5 columns. References now correctly span 3 columns.

## 1.0.2 - 2022-02-20

### Added in 1.0.2

* Added "note" and "ABBR" (abbreviation) fields for each feature. These columns are now required in the input file.
* The first 10 lines of the input file are now assumed to contain a descriptive header. This is duplicated to the output file. The 11th line in the input file is assumed to be a per-column header that is ignored.
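The header handling described above can be sketched as follows. This is a minimal illustration, not the code in `process.py`: the function name `split_input` and the constant `HEADER_LINES` are hypothetical, and the sketch assumes the input has already been read into a string.

```python
import csv
import io

# Assumption taken from the changelog entry: 10 descriptive header lines,
# followed by one per-column header line that is ignored.
HEADER_LINES = 10


def split_input(text, delimiter="\t"):
    """Split input text into (descriptive_header, data_rows).

    The per-column header on line 11 is dropped.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    descriptive_header = rows[:HEADER_LINES]
    data_rows = rows[HEADER_LINES + 1:]
    return descriptive_header, data_rows
```

The descriptive header can then be written unchanged to the output file, while only `data_rows` are parsed as features.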

## 1.0.1 - 2022-02-15

### Added in 1.0.1

* Added a test to ensure that entities (anatomical structures, cell types, biomarkers and references) don't have commas in their names.
* Added a command line argument users can set to force anatomical structures to be unique (i.e., each has one and only one parent). When set, if the same sub-structure exists in multiple parent-structures, then each child structure would need to be uniquely named. By default, a child structure can have multiple parent structures.
* Added a command line argument users can set to cause the program to automatically create missing features, when features are used. For example, if a biomarker is assigned to a cell type, by default, the biomarker must be independently defined but now users can optionally disable this requirement. Anatomical structures must always be defined.
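The comma check in the first item above presumably exists because list fields such as CHILDREN and CELLS are comma-separated in the sample input, so a comma inside a name would be ambiguous. A minimal sketch of such a check, using a hypothetical helper name that is not taken from `process.py`:

```python
def names_with_commas(names):
    """Return the entity names that contain a comma."""
    return [name for name in names if "," in name]
```

A caller could report an error and stop whenever the returned list is non-empty.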

## 1.0 - 2022-02-14

### Added in 1.0

* Removed the (erroneous) assumption that an anatomical structure can only have a single parent, added more validation of the inputs, added debugging output options, and improved the handling of command line arguments. Also fixed various bugs.
* These are the significant differences between versions v0.1 and v1.0:

1. The command line arguments have been greatly simplified.
2. The number of AS levels is automatically computed.
3. The input file has changed with this release. Cells are listed in a separate column from the children column.
4. Biomarkers and references can now be added to any anatomical structure.
5. The anytree Python module is required.
6. A header is automatically generated in the output file.
7. A DOT file can be generated to display the tree in Graphviz.
8. Lots of tests to validate the input file.

## 1.0-beta - 2022-02-11

### Added in 1.0-beta

* This is a complete rewrite of the program. This version has none of the limitations of the alpha version, includes more data validation, and requires less user intervention.

## 0.1-alpha - 2022-02-02

### Added in 0.1-alpha

* This is a proof of concept and not meant for production use.
88 changes: 12 additions & 76 deletions README.md
@@ -1,43 +1,16 @@
# ASCT+B Generator

This program converts a simple TSV file into a HuBMAP ASCT+B table.
This program converts a simple CSV file into a HuBMAP CCF ASCT+B table.

The included file "demo-input.txt" was generated by Excel using the "demo-input.xlsx" file (Save As "Tab delimited Text"). The generated output is a TSV file. The example file "demo-output.xls" was generated by the program.
In the sampledata directory, an included file "demo-input.txt" was generated by Excel using the "demo-input.xlsx" file (Save As "Tab delimited Text"). The generated output is a CSV file. The example file "demo-output.csv" was generated by the program.

## Version
## Change Log

May 25, 2022
See the [ChangeLog](CHANGELOG.md) for the latest developments.

- Fixed a bug causing each reference to incorrectly span 5 columns. References now correctly span 3 columns.
## Known Issues

Feb 20, 2022

- Added "note" and "ABBR" (abbreviation) fields for each feature. These columns are now required in the input file.
- The first 10 lines of the input file are now assumed to contain a descriptive header. This is duplicated to the output file. The 11th line in the input file is assumed to be a per-column header that is ignored.

Feb 15, 2022:

- Test to be sure entities (anatomical structures, cell types, biomarkers and references) don't have commas in their names.

- Added a command line argument users can set to force anatomical structures to be unique (i.e., each has one and only one parent). When set, if the same sub-structure exists in multiple parent-structures, then each child structure would need to be uniquely named. By default, a child structure can have multiple parent structures.
- Added a command line argument users can set to cause the program to automatically create missing features, when features are used. For example, if a biomarker is assigned to a cell type, by default, the biomarker must be independently defined but now users can optionally disable this requirement. Anatomical structures must always be defined.

**v1.0** - Feb 14, 2022: Removed the (erroneous) assumption that an anatomical structure can only have a single parent, added more validation of the inputs, added debugging output options, and better handle command line arguments. Also various bug fixes.

These are the significant differences between versions v0.1 and v1.0.

1. The command line arguments have been greatly simplified.
2. The number of AS levels is automatically computed.
3. The input file has changed with this release. Cells are listed in a separate column from the children column.
4. Biomarkers and references can now be added to any anatomical structure.
5. The anytree Python module is required.
6. A header is automatically generated in the output file.
7. A DOT file can be generated to display the tree in Graphviz.
8. Lots of tests to validate the input file

**v1.0-beta** - Feb 11, 2022: This is a complete rewrite of the program. This version has none of the limitations from the alpha version, it includes more data validation, and requires less user intervention.

**v0.1-alpha** - Feb 2, 2022: This is a proof of concept and not meant for production use.
See the [Issue Tracker](https://github.com/hubmapconsortium/asct-b-generator/issues?q=is%3Aissue+is%3Aopen+label%3A%22known+issue%22) for known issues.

## Assumptions

@@ -81,7 +54,7 @@ Generate ASCT+B table.
positional arguments:
input Input file
output Output file (TSV)
output Output file (CSV)
optional arguments:
-h, --help show this help message and exit
@@ -91,20 +64,20 @@ optional arguments:
-v, --verbose Print the tree to the terminal.
```

To process the demo input file and generate a TSV file that can be opened by Excel
To process the demo input file and generate a CSV file that can be opened by Excel

```
process.py <input TSV file> <output TSV file>
process.py <input CSV file> <output CSV file>
```


```
process.py demo-input.txt demo-output.xls
process.py demo-input.txt demo-output.csv
```

## Input file (TSV)
## Input file (CSV)

The tab delimited file must contain a header line and the following twelve columns:
The comma delimited file (tab separated is also supported) must contain a header line and the following fifteen columns:

NAME (REF DOI) LABEL (REF DETAILS) ID (REF NOTES) NOTE ABBR TYPE CHILDREN CELLS GENES PROTEINS PROTEOFORMS LIPIDS METABOLITES FTUs REFERENCES

@@ -138,40 +111,3 @@ E-cadherin Protein
doi:10.1093/oxfordjournals.humrep.a136365 PMID: 3558758 Reference
McKay et al 1961 McKay, D., Pinkerton, J., Hertig, A. & Danziger, S. (1961). The Adult Human Ovary: A Histochemical Study. Obstetrics & Gynecology, 18(1), 13-39. Reference
```

## Known problems and limitations

1. The program should validate the biomarkers using the TYPE field designation.

1. Export ASCT+B table as CSV file.

1. Need to allow for case-independence. At present if a cell type is defined with upper cases and applied to a structure in lower case then the program will consider these different entities and throw an error.

1. Need better example and docs.

1. The program should allow for non-unique "author preferred name" field values.

1. Test if a parent has a child which is actually the parent.

1. Need to strip blank space for the left/right of each text field.

1. If there are duplicate references in a comma separated list of references (column 15) this causes an error claiming a feature hasn't been defined when actually the error is about duplications in the comma separated list of features. The program needs to test for duplications in feature lists.

1. The program requires UTF-8 encoding.

1. There is a bug in Excel where by when generating TSV or CSV files, it may incorrectly include a lot of empty COLUMNS. For example, if the input file only has 15 columns of data, Excel may generate a TSV or CSV file that includes 30 columns that correctly includes the 15 columns of data and another 15 empty columns. This causes the program to error. The program needs to include a workaround for this issue.

```
ERROR: incorrect number of fields in line. The tab-delimited line should contain the following 15 fields:
name, label, ID, node, abbreviation, feature type, children, cells, genes, proteins, proteoforms, lipids, metabolites, FTU, references
Number of fields found in line:
```

1. There is a bug in Excel where by when generating TSV or CSV files, it may incorrectly include a lot of empty ROWS. For example, if the input file only has 100 rows of data, Excel may generate a TSV or CSV file that includes 500 rows that correctly includes the 100 rows of data and another 400 empty rows. This causes the program to error. The program needs to include a workaround for this issue.

```
ERROR: all features must have a unique name.
[ type:]
[ type:]
```
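One possible workaround for the Excel issues listed above (extra empty columns, extra empty rows, and stray whitespace around fields) is to normalize the rows before validating them. This is only a sketch under stated assumptions: `clean_rows` is a hypothetical helper, not part of `process.py`, and the default of 15 fields is taken from the error message quoted above.

```python
def clean_rows(rows, expected_fields=15):
    """Normalize rows parsed from an Excel-generated TSV/CSV file.

    - strips whitespace on the left and right of each field
    - drops trailing empty fields beyond the expected column count
    - skips rows in which every field is empty
    """
    cleaned = []
    for row in rows:
        fields = [field.strip() for field in row]
        # Remove surplus empty columns that Excel may append.
        while len(fields) > expected_fields and fields[-1] == "":
            fields.pop()
        # Skip surplus empty rows that Excel may append.
        if not any(fields):
            continue
        cleaned.append(fields)
    return cleaned
```

Rows that still have the wrong number of fields after this step are left untouched, so the existing field-count error would continue to catch genuinely malformed lines.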

66 changes: 33 additions & 33 deletions demo-input.txt → sampledata/demo-input.txt
@@ -1,33 +1,33 @@
"Anatomical Structures, Cell Types and Biomarkers Table for <insert organ name>"
Author Name(s):
Author ORCID(s):
Reviewer(s):
Reviewer ORCID(s):
General Publication(s):
Data DOI: Will be added after table is finalized and published
Date: 4/1/22
Version Number: v
NAME (REF DOI) LABEL (REF DETAILS) ID (REF NOTES) NOTES ABBR TYPE CHILDREN CELLS GENES PROTEINS PROTEOFORMS LIPIDS METABOLITES FTUs REFERENCES (DOI)
organ UBERON:1234 AS "sub struct 1, sub struct 2"
sub struct 1 UBERON:0001 AS "sub struct 3, sub struct 4"
sub struct 2 UBERON:0002 AS "sub struct 5, sub struct 6"
sub struct 3 as-3 UBERON:0003 AS "cell1,cell2,cell3"
sub struct 4 as-4 UBERON:0004 AS "cell1,cell2"
sub struct 5 as-5 UBERON:0005 AS "cell1,cell2,cell3"
sub struct 6 as-6 UBERON:0006 AS cell3
cell1 c1 CT gene1 "protein1,protein2" lipid1 metabolite1 ref1
cell2 c2 CT "gene1,gene2,gene3" "proteo1, proteo2" lipid2 ref2
cell3 c3 CT "gene1,gene3" protein2 proteo2 metabolite1 "ref1,ref2"
gene1 Gene
gene2 Gene
gene3 Gene
protein1 Protein
protein2 Protein
proteo1 Proteoform
proteo2 Proteoform
lipid1 Lipid
lipid2 Lipid
metabolite1 Metabolite
ref1 Reference
ref2 Reference
"Anatomical Structures, Cell Types and Biomarkers Table for <insert organ name>"

Author Name(s):
Author ORCID(s):
Reviewer(s):
Reviewer ORCID(s):
General Publication(s):
Data DOI: Will be added after table is finalized and published
Date: 4/1/22
Version Number: v
NAME (REF DOI) LABEL (REF DETAILS) ID (REF NOTES) NOTES ABBR TYPE CHILDREN CELLS GENES PROTEINS PROTEOFORMS LIPIDS METABOLITES FTUs REFERENCES (DOI)
organ UBERON:1234 AS "sub struct 1, sub struct 2"
sub struct 1 UBERON:0001 AS "sub struct 3, sub struct 4"
sub struct 2 UBERON:0002 AS "sub struct 5, sub struct 6"
sub struct 3 as-3 UBERON:0003 AS "cell1,cell2,cell3"
sub struct 4 as-4 UBERON:0004 AS "cell1,cell2"
sub struct 5 as-5 UBERON:0005 AS "cell1,cell2,cell3"
sub struct 6 as-6 UBERON:0006 AS cell3
cell1 c1 CT gene1 "protein1,protein2" lipid1 metabolite1 ref1
cell2 c2 CT "gene1,gene2,gene3" "proteo1, proteo2" lipid2 ref2
cell3 c3 CT "gene1,gene3" protein2 proteo2 metabolite1 "ref1,ref2"
gene1 Gene
gene2 Gene
gene3 Gene
protein1 Protein
protein2 Protein
proteo1 Proteoform
proteo2 Proteoform
lipid1 Lipid
lipid2 Lipid
metabolite1 Metabolite
ref1 Reference
ref2 Reference
File renamed without changes.
20 changes: 20 additions & 0 deletions sampledata/demo-output.csv
@@ -0,0 +1,20 @@
"Anatomical Structures, Cell Types and Biomarkers Table for <insert organ name>",,,,,,,,,,,,,,
,,,,,,,,,,,,,,
Author Name(s):,,,,,,,,,,,,,,
Author ORCID(s):,,,,,,,,,,,,,,
Reviewer(s):,,,,,,,,,,,,,,
Reviewer ORCID(s):,,,,,,,,,,,,,,
General Publication(s):,,,,,,,,,,,,,,
Data DOI:,Will be added after table is finalized and published,,,,,,,,,,,,,
Date:,4/1/22,,,,,,,,,,,,,
Version Number:,v,,,,,,,,,,,,,
AS/1,AS/1/LABEL,AS/1/ID,AS/1/NOTE,AS/1/ABBR,AS/2,AS/2/LABEL,AS/2/ID,AS/2/NOTE,AS/2/ABBR,AS/3,AS/3/LABEL,AS/3/ID,AS/3/NOTE,AS/3/ABBR,CT/1,CT/1/LABEL,CT/1/ID,CT/1/NOTE,CT/1/ABBR,BGene/1,BGene/1/LABEL,BGene/1/ID,BGene/1/NOTE,BGene/1/ABBR,BGene/2,BGene/2/LABEL,BGene/2/ID,BGene/2/NOTE,BGene/2/ABBR,BGene/3,BGene/3/LABEL,BGene/3/ID,BGene/3/NOTE,BGene/3/ABBR,BProtein/1,BProtein/1/LABEL,BProtein/1/ID,BProtein/1/NOTE,BProtein/1/ABBR,BProtein/2,BProtein/2/LABEL,BProtein/2/ID,BProtein/2/NOTE,BProtein/2/ABBR,BProteoform/1,BProteoform/1/LABEL,BProteoform/1/ID,BProteoform/1/NOTE,BProteoform/1/ABBR,BProteoform/2,BProteoform/2/LABEL,BProteoform/2/ID,BProteoform/2/NOTE,BProteoform/2/ABBR,BLipid/1,BLipid/1/LABEL,BLipid/1/ID,BLipid/1/NOTE,BLipid/1/ABBR,BMetabolites/1,BMetabolites/1/LABEL,BMetabolites/1/ID,BMetabolites/1/NOTE,BMetabolites/1/ABBR,REF/1,REF/1/DOI,REF/1/NOTES,REF/2,REF/2/DOI,REF/2/NOTES
organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 3,as-3,UBERON:0003,,,cell1,c1,,,,gene1,,,,,,,,,,,,,,,protein1,,,,,protein2,,,,,,,,,,,,,,,lipid1,,,,,metabolite1,,,,,ref1,,,,,,,
organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 3,as-3,UBERON:0003,,,cell2,c2,,,,gene1,,,,,gene2,,,,,gene3,,,,,,,,,,,,,,,proteo1,,,,,proteo2,,,,,lipid2,,,,,,,,,,ref2,,,,,,,
organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 3,as-3,UBERON:0003,,,cell3,c3,,,,gene1,,,,,gene3,,,,,,,,,,protein2,,,,,,,,,,proteo2,,,,,,,,,,,,,,,metabolite1,,,,,ref1,,,ref2,,
organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 4,as-4,UBERON:0004,,,cell1,c1,,,,gene1,,,,,,,,,,,,,,,protein1,,,,,protein2,,,,,,,,,,,,,,,lipid1,,,,,metabolite1,,,,,ref1,,,,,,,
organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 4,as-4,UBERON:0004,,,cell2,c2,,,,gene1,,,,,gene2,,,,,gene3,,,,,,,,,,,,,,,proteo1,,,,,proteo2,,,,,lipid2,,,,,,,,,,ref2,,,,,,,
organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 5,as-5,UBERON:0005,,,cell1,c1,,,,gene1,,,,,,,,,,,,,,,protein1,,,,,protein2,,,,,,,,,,,,,,,lipid1,,,,,metabolite1,,,,,ref1,,,,,,,
organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 5,as-5,UBERON:0005,,,cell2,c2,,,,gene1,,,,,gene2,,,,,gene3,,,,,,,,,,,,,,,proteo1,,,,,proteo2,,,,,lipid2,,,,,,,,,,ref2,,,,,,,
organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 5,as-5,UBERON:0005,,,cell3,c3,,,,gene1,,,,,gene3,,,,,,,,,,protein2,,,,,,,,,,proteo2,,,,,,,,,,,,,,,metabolite1,,,,,ref1,,,ref2,,
organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 6,as-6,UBERON:0006,,,cell3,c3,,,,gene1,,,,,gene3,,,,,,,,,,protein2,,,,,,,,,,proteo2,,,,,,,,,,,,,,,metabolite1,,,,,ref1,,,ref2,,
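The column header of the generated file follows a `PREFIX/N` and `PREFIX/N/FIELD` naming pattern (for example `AS/1`, `AS/1/LABEL`, `BGene/2/ID`). As a rough illustration of how that header can be inspected, the sketch below counts how many numbered columns exist per prefix. The function name `count_levels` is hypothetical and is not part of the generator.

```python
import csv
import io
import re

# Matches only the base columns such as "AS/1" or "BGene/3",
# not the sub-fields such as "AS/1/LABEL".
COLUMN_PATTERN = re.compile(r"^([A-Za-z]+)/(\d+)$")


def count_levels(header_line):
    """Return a dict mapping each column prefix to its highest index."""
    header = next(csv.reader(io.StringIO(header_line)))
    levels = {}
    for column in header:
        match = COLUMN_PATTERN.match(column)
        if match:
            prefix, index = match.group(1), int(match.group(2))
            levels[prefix] = max(levels.get(prefix, 0), index)
    return levels
```

Applied to the header line of `demo-output.csv` above, this reports three AS levels, one CT column, three gene columns, two protein columns, two proteoform columns, one lipid column, one metabolite column and two reference columns.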
File renamed without changes.
File renamed without changes.
File renamed without changes.