Update README.md and clean up (#13)

* added changelog.md
* updated readme.md
* Update README.md
* Update README.md

Co-authored-by: Bruce W. Herr II <[email protected]>
hubmapconsortium · Jun 27, 2022 · 8b95238
1 parent e1d512e
commit 8b95238

Showing 10 changed files with 340 additions and 109 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,58 @@
+# Changelog
+
+Changelog for the ASCT+B Generator
+
+## 1.1.0 - 2022-06-25
+
+* Added a feature to generate CSV files from TSV/CSV files.
+* Added typings and docstrings.
+* Reformatted the code to follow the PEP 8 style guide.
+
+## 1.0.3 - 2022-05-25
+
+### Added in 1.0.3
+
+* Fixed a bug causing each reference to incorrectly span 5 columns. References now correctly span 3 columns.
+
+## 1.0.2 - 2022-02-20
+
+### Added in 1.0.2
+
+* Added "note" and "ABBR" (abbreviation) fields for each feature. These columns are now required in the input file.
+* The first 10 lines of the input file are now assumed to contain a descriptive header. This is duplicated to the output file. The 11th line in the input file is assumed to be a per-column header that is ignored.
+
+## 1.0.1 - 2022-02-15
+
+### Added in 1.0.1
+
+* Added a test to ensure that entity names (anatomical structures, cell types, biomarkers and references) do not contain commas.
+* Added a command line argument users can set to force anatomical structures to be unique (i.e., each has one and only one parent). When set, if the same sub-structure exists in multiple parent structures, each child structure must be uniquely named. By default, a child structure can have multiple parent structures.
+* Added a command line argument users can set to make the program automatically create missing features when they are used. For example, if a biomarker is assigned to a cell type, the biomarker must by default be independently defined, but users can now optionally disable this requirement. Anatomical structures must always be defined.
+
+## 1.0 - 2022-02-14
+
+### Added in 1.0
+
+* Removed the (erroneous) assumption that an anatomical structure can only have a single parent, added more validation of the inputs, added debugging output options, and improved the handling of command line arguments. Also fixed various bugs.
+* The significant differences between versions v0.1 and v1.0 are:
+
+    1. The command line arguments have been greatly simplified.
+    2. The number of AS levels is automatically computed.
+    3. The input file has changed with this release. Cells are listed in a separate column from the children column.
+    4. Biomarkers and references can now be added to any anatomical structure.
+    5. The anytree Python module is required.
+    6. A header is autogenerated in the output file.
+    7. A DOT file can be generated to display the tree in Graphviz.
+    8. Lots of tests to validate the input file.
+
+## 1.0-beta - 2022-02-11
+
+### Added in 1.0-beta
+
+* This is a complete rewrite of the program. This version has none of the limitations of the alpha version, includes more data validation, and requires less user intervention.
+
+## 0.1-alpha - 2022-02-02
+
+### Added in 0.1-alpha
+
+* This is a proof of concept and not meant for production use.
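
The 1.0.2 and 1.0.1 entries above describe two input conventions: the first 10 lines are a descriptive header, the 11th line is a per-column header that is ignored, and entity names must not contain commas. The following is a minimal sketch of those two rules only, not the code in `process.py`; the names `split_input`, `names_with_commas`, and `HEADER_LINES` are hypothetical and chosen for illustration.

```python
import csv
import io

# Assumption taken from the 1.0.2 entry: 10 descriptive header lines,
# followed by one per-column header line.
HEADER_LINES = 10


def split_input(text, delimiter="\t"):
    """Split raw input text into (descriptive header, column header, data rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    descriptive = rows[:HEADER_LINES]
    column_header = rows[HEADER_LINES] if len(rows) > HEADER_LINES else []
    data = rows[HEADER_LINES + 1:]
    return descriptive, column_header, data


def names_with_commas(data):
    """Return entity names (first column) that contain a comma."""
    return [row[0] for row in data if row and "," in row[0]]


if __name__ == "__main__":
    sample = "\n".join(
        [f"header line {i}" for i in range(1, 11)]
        + ["NAME\tLABEL", "organ\t", '"cell, odd"\tc1']
    )
    descriptive, column_header, data = split_input(sample)
    print(len(descriptive), column_header, names_with_commas(data))
```

Passing `delimiter=","` would cover the comma-delimited case; how the program itself chooses between the two delimiters is not shown in this commit.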
diff --git a/README.md b/README.md
@@ -1,43 +1,16 @@
 # ASCT+B Generator
 
-This program converts a simple TSV file into a HuBMAP ASCT+B table.
+This program converts a simple CSV file into a HuBMAP CCF ASCT+B table.
 
-The included file "demo-input.txt" was generated by Excel using the "demo-input.xlsx" file (Save As "Tab delimited Text"). The generated output is a TSV file. The example file "demo-output.xls" was generated by the program.
+The sampledata directory includes "demo-input.txt", which was generated by Excel from "demo-input.xlsx" (Save As "Tab delimited Text"). The program's output is a CSV file; the example file "demo-output.csv" was generated by the program.
 
-## Version
+## Change Log
 
-May 25, 2022
+See the [ChangeLog](CHANGELOG.md) for the latest developments.
 
-- Fixed a bug causing each reference to incorrectly span 5 columns. References now correctly span 3 columns.
+## Known Issues
 
-Feb 20, 2022
-
-- Added "note" and "ABBR" (abbreviation) fields for each feature. These columns are now required in the input file.
-- The first 10 lines of the input file are now assumed to contain a descriptive header. This is duplicated to the output file. The 11th line in the input file is assumed to be a per-column header that is ignored.
-
-Feb 15, 2022: 
-
-- Test to be sure entities (anatomical structures, cell types, biomarkers and references) don't have commas in their names. 
-
-- Added a command line argument users can set to force anatomical structures to be unique (ie., each has one and only one parent). When set, if the same sub-structure exists in multiple parent-structures, then each child structure would need to be uniquely named. By default, a child structure can have multiple parent structures.
-- Added a command line argument users can set to cause the program to automatically create missing features, when features are used. For example, if a biomarker is assigned to a cell type, by default, the biomarker must be independently defined but now users can optionally disable this requirement. Anatomical structures must always be defined.
-
-**v1.0** - Feb 14, 2022: Removed the (erroneous) assumption that an anatomical structure can only have a single parent, added more validation of the inputs, added debugging output options, and better handle command line arguments. Also various bug fixes.
-
-These are the significant differences from version v0.1 and v1.0.
-
-    1. The command line arguments have been greatly simplified.
-    2. The number of AS levels is automatically computed.
-    3. The input file has changed with this release. Cells are listed in a separate column from the children column.
-    4. Biomarkers and references can now be added to any anatomical structure.
-    5. The anytree Python module is required.
-    6. A header is autogenerate in the output file.
-    7. A DOT file can be generated to display the tree in Graphviz.
-    8. Lots of tests to validate the input file
-
-**v1.0-beta** - Feb 11, 2022: This is a complete rewrite of the program. This version has none of the limitations from the alpha version, it includes more data validation, and requires less user intervention. 
-
-**v0.1-alpha** - Feb 2, 2022: This is a proof of concept and not meant for production use.
+See the [Issue Tracker](https://github.com/hubmapconsortium/asct-b-generator/issues?q=is%3Aissue+is%3Aopen+label%3A%22known+issue%22) for known issues.
 
 ## Assumptions
 
@@ -81,7 +54,7 @@ Generate ASCT+B table.
 
 positional arguments:
   input          Input file
-  output         Output file (TSV)
+  output         Output file (CSV)
 
 optional arguments:
   -h, --help     show this help message and exit
@@ -91,20 +64,20 @@ optional arguments:
   -v, --verbose  Print the tree to the terminal.
 ```
 
-To process the demo input file and generate a TSV file that can be opened by Excel
+To process the demo input file and generate a CSV file that can be opened by Excel:
 
 ```
-process.py <input TSV file> <output TSV file>
+process.py <input CSV file> <output CSV file>
 ```
 
 
 ```
-process.py demo-input.txt demo-output.xls
+process.py demo-input.txt demo-output.csv
 ```
 
-## Input file (TSV)
+## Input file (CSV)
 
-The tab delimited file must contain a header line and the following twelve columns:
+The comma-delimited file (tab-separated input is also supported) must contain a header line and the following fifteen columns:
 
 NAME (REF DOI)	LABEL (REF DETAILS)	ID (REF NOTES)	NOTE	ABBR	TYPE	CHILDREN	CELLS	GENES	PROTEINS	PROTEOFORMS	LIPIDS	METABOLITES	FTUs	REFERENCES
 
@@ -138,40 +111,3 @@ E-cadherin			Protein
 doi:10.1093/oxfordjournals.humrep.a136365	PMID: 3558758		Reference									
 McKay et al 1961	McKay, D., Pinkerton, J., Hertig, A. & Danziger, S. (1961). The Adult Human Ovary: A Histochemical Study. Obstetrics & Gynecology, 18(1), 13-39. 		Reference									
 ```
-
-## Known problems and limitations
-
-1. The program should validate the biomarkers using the TYPE field designation.
-
-1. Export ASCT+B table as CSV file.
-
-1. Need to allow for case-independence. At present if a cell type is defined with upper cases and applied to a structure in lower case then the program will consider these different entities and throw an error.
-
-1. Need better example and docs.
-
-1. The program should allow for non-unique "author preferred name" field values.
-
-1. Test if a parent has a child which is actually the parent.
-
-1. Need to strip blank space for the left/right of each text field.
-
-1. If there are dulicate references in a comma separated list of references (column 15) this causes an error claiming a feature hasn't been defined when actually the error is about duplications in the comma separated list of features. The program needs to test for duplications in feature lists.
-
-1. The program requires UTF-8 encoding.
-
-1. There is a bug in Excel where by when generating TSV or CSV files, it may incorrectly include a lot of empty COLUMNS. For example, if the input file only has 15 columns of data, Excel may generate a TSV or CSV file that includes 30 columns that correctly includes the 15 columns of data and another 15 empty columns. This causes the program to error. The program needs to include a workaround for this issue. 
-
-   ```
-   ERROR: incorrect number of fields in line. The tab-delimited line should contain the following 15 fields: 
-   	name, label, ID, node, abbreviation, feature type, children, cells, genes, proteins, proteoforms, lipids, metabolites, FTU, references
-   	Number of fields found in line: 
-   ```
-
-1. There is a bug in Excel where by when generating TSV or CSV files, it may incorrectly include a lot of empty ROWS. For example, if the input file only has 100 rows of data, Excel may generate a TSV or CSV file that includes 500 rows that correctly includes the 100 rows of data and another 400 empty rows. This causes the program to error. The program needs to include a workaround for this issue. 
-
-   ```
-   ERROR: all features must have a unique name.
-   	[ type:]
-   	[ type:]
-   ```
-
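
The removed "Known problems and limitations" list mentions three input-cleaning needs: stripping blank space around each field, and tolerating the empty trailing columns and empty rows that Excel may append. A pre-processing step along these lines might look like the sketch below. This is an illustration under stated assumptions, not the project's workaround; `normalize_rows` and `EXPECTED_FIELDS` are hypothetical names, and the value 15 is taken from the error message quoted in the removed text.

```python
# Assumption: 15 fields per line, as stated in the quoted error message.
EXPECTED_FIELDS = 15


def normalize_rows(rows, expected=EXPECTED_FIELDS):
    """Strip whitespace, drop fully empty rows, and trim empty trailing columns."""
    cleaned = []
    for row in rows:
        fields = [field.strip() for field in row]
        # Skip rows that contain no data at all (Excel padding rows).
        if not any(fields):
            continue
        # Remove empty trailing columns beyond the expected width.
        while len(fields) > expected and fields[-1] == "":
            fields.pop()
        if len(fields) != expected:
            raise ValueError(
                f"expected {expected} fields, found {len(fields)}"
            )
        cleaned.append(fields)
    return cleaned


if __name__ == "__main__":
    padded = [
        [" organ ", "", "UBERON:1234"] + [""] * 27,  # 30 columns, 15 are padding
        [""] * 30,                                   # empty padding row
    ]
    print(normalize_rows(padded))
```

Rows with fewer than the expected number of fields still raise an error here; whether such rows should instead be padded is a design decision this sketch leaves open.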
diff --git a/demo-input.txt → sampledata/demo-input.txt b/demo-input.txt → sampledata/demo-input.txt
@@ -1,33 +1,33 @@
-"Anatomical Structures, Cell Types and Biomarkers Table for <insert organ name>"														
-														
-Author Name(s):														
-Author ORCID(s):														
-Reviewer(s):														
-Reviewer ORCID(s):														
-General Publication(s):														
-Data DOI:	Will be added after table is finalized and published													
-Date:	4/1/22													
-Version Number:	v													
-NAME (REF DOI)	LABEL (REF DETAILS)	ID (REF NOTES)	NOTES	ABBR	TYPE	CHILDREN	CELLS	GENES	PROTEINS	PROTEOFORMS	LIPIDS	METABOLITES	FTUs	REFERENCES (DOI)
-organ		UBERON:1234			AS	"sub struct 1, sub struct 2"								
-sub struct 1		UBERON:0001			AS	"sub struct 3, sub struct 4"								
-sub struct 2		UBERON:0002			AS	"sub struct 5, sub struct 6"								
-sub struct 3	as-3	UBERON:0003			AS		"cell1,cell2,cell3"							
-sub struct 4	as-4	UBERON:0004			AS		"cell1,cell2"							
-sub struct 5	as-5	UBERON:0005			AS		"cell1,cell2,cell3"							
-sub struct 6	as-6	UBERON:0006			AS		cell3							
-cell1	c1				CT			gene1	"protein1,protein2"		lipid1	metabolite1		ref1
-cell2	c2				CT			"gene1,gene2,gene3"		"proteo1, proteo2"	lipid2			ref2
-cell3	c3				CT			"gene1,gene3"	protein2	proteo2		metabolite1		"ref1,ref2"
-gene1					Gene									
-gene2					Gene									
-gene3					Gene									
-protein1					Protein									
-protein2					Protein									
-proteo1					Proteoform									
-proteo2					Proteoform									
-lipid1					Lipid									
-lipid2					Lipid									
-metabolite1					Metabolite									
-ref1					Reference									
-ref2					Reference									
+"Anatomical Structures, Cell Types and Biomarkers Table for <insert organ name>"														
+
+Author Name(s):														
+Author ORCID(s):														
+Reviewer(s):														
+Reviewer ORCID(s):														
+General Publication(s):														
+Data DOI:	Will be added after table is finalized and published													
+Date:	4/1/22													
+Version Number:	v													
+NAME (REF DOI)	LABEL (REF DETAILS)	ID (REF NOTES)	NOTES	ABBR	TYPE	CHILDREN	CELLS	GENES	PROTEINS	PROTEOFORMS	LIPIDS	METABOLITES	FTUs	REFERENCES (DOI)
+organ		UBERON:1234			AS	"sub struct 1, sub struct 2"								
+sub struct 1		UBERON:0001			AS	"sub struct 3, sub struct 4"								
+sub struct 2		UBERON:0002			AS	"sub struct 5, sub struct 6"								
+sub struct 3	as-3	UBERON:0003			AS		"cell1,cell2,cell3"							
+sub struct 4	as-4	UBERON:0004			AS		"cell1,cell2"							
+sub struct 5	as-5	UBERON:0005			AS		"cell1,cell2,cell3"							
+sub struct 6	as-6	UBERON:0006			AS		cell3							
+cell1	c1				CT			gene1	"protein1,protein2"		lipid1	metabolite1		ref1
+cell2	c2				CT			"gene1,gene2,gene3"		"proteo1, proteo2"	lipid2			ref2
+cell3	c3				CT			"gene1,gene3"	protein2	proteo2		metabolite1		"ref1,ref2"
+gene1					Gene									
+gene2					Gene									
+gene3					Gene									
+protein1					Protein									
+protein2					Protein									
+proteo1					Proteoform									
+proteo2					Proteoform									
+lipid1					Lipid									
+lipid2					Lipid									
+metabolite1					Metabolite									
+ref1					Reference									
+ref2					Reference									
diff --git a/demo-input.xlsx → sampledata/demo-input.xlsx b/demo-input.xlsx → sampledata/demo-input.xlsx
diff --git a/sampledata/demo-output.csv b/sampledata/demo-output.csv
@@ -0,0 +1,20 @@
+"Anatomical Structures, Cell Types and Biomarkers Table for <insert organ name>",,,,,,,,,,,,,,
+,,,,,,,,,,,,,,
+Author Name(s):,,,,,,,,,,,,,,
+Author ORCID(s):,,,,,,,,,,,,,,
+Reviewer(s):,,,,,,,,,,,,,,
+Reviewer ORCID(s):,,,,,,,,,,,,,,
+General Publication(s):,,,,,,,,,,,,,,
+Data DOI:,Will be added after table is finalized and published,,,,,,,,,,,,,
+Date:,4/1/22,,,,,,,,,,,,,
+Version Number:,v,,,,,,,,,,,,,
+AS/1,AS/1/LABEL,AS/1/ID,AS/1/NOTE,AS/1/ABBR,AS/2,AS/2/LABEL,AS/2/ID,AS/2/NOTE,AS/2/ABBR,AS/3,AS/3/LABEL,AS/3/ID,AS/3/NOTE,AS/3/ABBR,CT/1,CT/1/LABEL,CT/1/ID,CT/1/NOTE,CT/1/ABBR,BGene/1,BGene/1/LABEL,BGene/1/ID,BGene/1/NOTE,BGene/1/ABBR,BGene/2,BGene/2/LABEL,BGene/2/ID,BGene/2/NOTE,BGene/2/ABBR,BGene/3,BGene/3/LABEL,BGene/3/ID,BGene/3/NOTE,BGene/3/ABBR,BProtein/1,BProtein/1/LABEL,BProtein/1/ID,BProtein/1/NOTE,BProtein/1/ABBR,BProtein/2,BProtein/2/LABEL,BProtein/2/ID,BProtein/2/NOTE,BProtein/2/ABBR,BProteoform/1,BProteoform/1/LABEL,BProteoform/1/ID,BProteoform/1/NOTE,BProteoform/1/ABBR,BProteoform/2,BProteoform/2/LABEL,BProteoform/2/ID,BProteoform/2/NOTE,BProteoform/2/ABBR,BLipid/1,BLipid/1/LABEL,BLipid/1/ID,BLipid/1/NOTE,BLipid/1/ABBR,BMetabolites/1,BMetabolites/1/LABEL,BMetabolites/1/ID,BMetabolites/1/NOTE,BMetabolites/1/ABBR,REF/1,REF/1/DOI,REF/1/NOTES,REF/2,REF/2/DOI,REF/2/NOTES
+organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 3,as-3,UBERON:0003,,,cell1,c1,,,,gene1,,,,,,,,,,,,,,,protein1,,,,,protein2,,,,,,,,,,,,,,,lipid1,,,,,metabolite1,,,,,ref1,,,,,,,
+organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 3,as-3,UBERON:0003,,,cell2,c2,,,,gene1,,,,,gene2,,,,,gene3,,,,,,,,,,,,,,,proteo1,,,,,proteo2,,,,,lipid2,,,,,,,,,,ref2,,,,,,,
+organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 3,as-3,UBERON:0003,,,cell3,c3,,,,gene1,,,,,gene3,,,,,,,,,,protein2,,,,,,,,,,proteo2,,,,,,,,,,,,,,,metabolite1,,,,,ref1,,,ref2,,
+organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 4,as-4,UBERON:0004,,,cell1,c1,,,,gene1,,,,,,,,,,,,,,,protein1,,,,,protein2,,,,,,,,,,,,,,,lipid1,,,,,metabolite1,,,,,ref1,,,,,,,
+organ,,UBERON:1234,,,sub struct 1,,UBERON:0001,,,sub struct 4,as-4,UBERON:0004,,,cell2,c2,,,,gene1,,,,,gene2,,,,,gene3,,,,,,,,,,,,,,,proteo1,,,,,proteo2,,,,,lipid2,,,,,,,,,,ref2,,,,,,,
+organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 5,as-5,UBERON:0005,,,cell1,c1,,,,gene1,,,,,,,,,,,,,,,protein1,,,,,protein2,,,,,,,,,,,,,,,lipid1,,,,,metabolite1,,,,,ref1,,,,,,,
+organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 5,as-5,UBERON:0005,,,cell2,c2,,,,gene1,,,,,gene2,,,,,gene3,,,,,,,,,,,,,,,proteo1,,,,,proteo2,,,,,lipid2,,,,,,,,,,ref2,,,,,,,
+organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 5,as-5,UBERON:0005,,,cell3,c3,,,,gene1,,,,,gene3,,,,,,,,,,protein2,,,,,,,,,,proteo2,,,,,,,,,,,,,,,metabolite1,,,,,ref1,,,ref2,,
+organ,,UBERON:1234,,,sub struct 2,,UBERON:0002,,,sub struct 6,as-6,UBERON:0006,,,cell3,c3,,,,gene1,,,,,gene3,,,,,,,,,,protein2,,,,,,,,,,proteo2,,,,,,,,,,,,,,,metabolite1,,,,,ref1,,,ref2,,
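
The column header in the demo output above follows a regular pattern: each anatomical structure level, cell type, and biomarker slot spans five columns (name, LABEL, ID, NOTE, ABBR), while each reference spans three (name, DOI, NOTES), which matches the 1.0.3 changelog entry. The sketch below reproduces that pattern from a set of counts. It is inferred from the sample file only and is not taken from `process.py`; `build_header` and `demo_counts` are hypothetical names.

```python
def build_header(as_levels, counts):
    """Build output column names following the pattern seen in demo-output.csv.

    as_levels: number of anatomical structure (AS) levels.
    counts: ordered mapping of column prefix to number of slots.
    """
    header = []
    groups = [("AS", as_levels)] + list(counts.items())
    for prefix, slots in groups:
        for i in range(1, slots + 1):
            base = f"{prefix}/{i}"
            if prefix == "REF":
                # References span 3 columns.
                header += [base, f"{base}/DOI", f"{base}/NOTES"]
            else:
                # Every other entity spans 5 columns.
                header += [base] + [
                    f"{base}/{suffix}"
                    for suffix in ("LABEL", "ID", "NOTE", "ABBR")
                ]
    return header


# Slot counts read off the demo output header (assumed, not computed here).
demo_counts = {
    "CT": 1,
    "BGene": 3,
    "BProtein": 2,
    "BProteoform": 2,
    "BLipid": 1,
    "BMetabolites": 1,
    "REF": 2,
}

if __name__ == "__main__":
    print(",".join(build_header(3, demo_counts)))
```

In the real program the number of AS levels is computed automatically from the input tree, according to the 1.0 changelog entry; here it is passed in by hand to keep the example small.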
diff --git a/demo-output.tree.txt → sampledata/demo-output.tree.txt b/demo-output.tree.txt → sampledata/demo-output.tree.txt
diff --git a/demo-output.xls → sampledata/demo-output.xls b/demo-output.xls → sampledata/demo-output.xls
diff --git a/demo-output.xls.dot → sampledata/demo-output.xls.dot b/demo-output.xls.dot → sampledata/demo-output.xls.dot