Updated README.

chadlaing · Oct 24, 2017 · e78e603
1 parent f94cd7b

Showing 3 changed files with 203 additions and 40 deletions.
diff --git a/.gitignore b/.gitignore
@@ -4,3 +4,4 @@
 /TAGS
 /src/TAGS
 /.gitconfig
+/output.txt
diff --git a/README.md b/README.md
@@ -1,45 +1,207 @@
-# **_feht_** -- pronounced "fate", as the "eh" is Canadian
+# **_feht_**
+## pronounced "fate", as the "eh" is Canadian
 
 A commandline program to automatically identify markers predictive of groups. Can be used with binary data, genomic (single nucleotide variant) data, or arbitrary character data.
 
-
 If you are on Windows and prefer a GUI, check out [GenomeFisher](https://bitbucket.org/peterk87/genomefisher/wiki/Home)
 
-## Quick Start
+## Description
 
+### All commandline options 
 
-## Descriptiion
+    feht - predictive marker discovery
+
+    Usage: feht (-i|--infoFile FILE) (-d|--datafile FILE)
+                [--one "Group1Name Group1Item Group1Item Group1Item"]
+                [--two "Group2Name Group2Item Group2Item Group2Item"]
+                [-l|--delimiter [',', '\t' ...], DEFAULT='\t']
+                [-m|--mode ['binary', 'snp'], DEFAULT='binary']
+                [-c|--correction ['none', 'bonferroni'], DEFAULT='bonferroni']
+                [-f|--ratioFilter [Filter results by ratio (0.00-1.0), DEFAULT=0]]
+    Predictive marker discovery for groups; binary data, genomic data (single nucleotide variants), and arbitrary character data.
+
+    Available options:
+    -i,--infoFile FILE       File of metadata information
+    -d,--datafile FILE       File of binary or single-nucleotide variant data
+    --one "Group1Name Group1Item Group1Item Group1Item"
+                           Group1 column name, followed by optional Group1
+                           labels to include as part of the group
+    --two "Group2Name Group2Item Group2Item Group2Item"
+                           Group2 column name, followed by optional Group2
+                           labels to include as part of the group
+    -l,--delimiter [',', '\t' ...], DEFAULT='\t'
+                           Delimiter used for both the metadata and data file
+    -m,--mode ['binary', 'snp'], DEFAULT='binary'
+                           Mode for program data; either 'binary' or 'snp'
+    -c,--correction ['none', 'bonferroni'], DEFAULT='bonferroni'
+                           Multiple-testing correction to apply
+    -f,--ratioFilter [Filter results by ratio (0.00-1.0), DEFAULT=0]
+                           Display only those results greater than or equal to
+                           the value
+    -h,--help              Show this help text
+
+### File format 
+
+The program takes command-line arguments, two of which are required: `-i`, which specifies the information (e.g. metadata) file, and `-d`, which specifies the data file. Both files must use the same delimiter, e.g. tab (`\t`, which is the default).
+
+The information file should be formatted with sample names in the first column, which does not require a header; sample names need to be identical in both the information and data files. All other columns require a header, and this header will be used as a metadata category, and all subsequent rows will be interpreted as values within that category. For example, the `data/test_metadata.txt` file included in this repository is as follows:
+
+    genomes	group	position
+    GenomeA	B	up
+    GenomeB	A	up
+    GenomeC	A	down
+    GenomeD	C	sideways
+    GenomeE	B	down
+    GenomeF	A	down
+    GenomeG	A	floating
+    GenomeH	A	up
+    GenomeI	B	sideways
+    GenomeJ	C	down
+    GenomeK	B	up
+
+The first column contains the sample names, `GenomeA, GenomeB ...`, and, although it is not required, also has a column header. Both `group` and `position` will be interpreted as metadata categories, with `A, B, C` as values within the metadata category `group`, and `up, down, sideways, floating` as values within the metadata category `position`.
+
+The corresponding data file `data/test_binary.txt` looks as follows:
+
+        GenomeA	GenomeB	GenomeC	GenomeD	GenomeE	GenomeF	GenomeG	GenomeH	GenomeI	GenomeJ	GenomeK	GenomeL
+    binary1	0	-	0	0	1	0	0	1	0	0	1	1
+    binary2	0	0	0	0	0	0	0	0	0	0	0	0
+    binary3	1	1	0	0	0	0	1	1	1	1	0	0
+    binary4	1	0	0	1	1	0	0	-	0	0	1	0
+    ... (truncated for space)
+
+In the data file, the sample names are the column headers, and must exactly match those provided in the information (metadata) file. The first column in the data file lacks a column header, but contains labels for the data being examined, in this case `binary1, binary2, binary3, binary4 ...`. Each row represents values of the data being examined for each sample.
+
+### Performing comparisons
+
+#### All possible pairwise comparisons
+`feht` by default will perform all possible pairwise comparisons given the categories in the information file. In our example using the `data/test_metadata.txt` file, a separate comparison within the `group` category of (`A vs. B`, `A vs. C`, `A vs. (B and C)`, `B vs. C`, `B vs. (A and C)`, `C vs. (A and B)`) will be performed, and likewise within the `position` category. With our test data, these comparisons can be run with:
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt
+
+If you wish to save the results to a file, redirect them to the filename of your choice:
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt > results.txt
+
+This produces an output file sorted from "most" to "least" discriminatory data. In our example, that looks like:
+
+    [#-
+    Group1 category: group Group1: B
+    Group2 category: group Group2: C
+    ---
+    Name	GroupOne (+)	GroupOne (-)	GroupTwo (+)	GroupTwo (-)	pValue	Ratio
+    binary21	4	0	0	2	1.0	1.0
+    binary44	3	1	0	2	1.0	0.75
+    binary42	3	1	0	2	1.0	0.75
+    binary24	1	3	1	0	1.0	-0.75
+    ...
+    -#]
+
+    [#-
+    Group1 category: position Group1: sideways
+    Group2 category: position Group2: up
+    ---
+    Name	GroupOne (+)	GroupOne (-)	GroupTwo (+)	GroupTwo (-)	pValue	Ratio
+    binary9	2	0	1	4	1.0	0.8
+    binary47	2	0	1	4	1.0	0.8
+    binary8	0	2	4	1	1.0	-0.8
+    binary49	2	0	1	4	1.0	0.8
+    binary1	0	2	3	1	1.0	-0.75
+    ...
+    -#]
+
+    ...
+
+Each output block lists the categories that are being compared, and the values within the category that constitute the group. For example, the first output block above is a comparison between `B` and `C` within the `group` category. The output consists of seven columns, the first being the data label that was compared, and the next four showing the presence and absence of that particular datum among the two groups. In the first example above, for the datum `binary21`, `GroupOne` (which is `B` from the category `group`) contained four members that were positive for `binary21` and 0 that were negative. For `GroupTwo` (which is `C` from the category `group`) there were no members that were positive for `binary21` and two members that were negative. 
+
+The next column is the P-value, which by default shows the `bonferroni` corrected value. In this example, due to the small sample size and number of comparisons, the corrected value is not significant (i.e. `1.0`).
+
+The final column contains the ratio, defined as the fraction of `GroupOne` that is positive minus the fraction of `GroupTwo` that is positive. In our example for `binary21` this is (4/4 - 0/2), which gives the result of `1.0`. The ratio provides an additional method for identifying data that are skewed between the groups under comparison. A value of `1.0` means that all of `GroupOne` was positive for the datum and all of `GroupTwo` was negative; conversely, a ratio of `-1.0` means that all of `GroupOne` was negative and all of `GroupTwo` was positive.
+
+#### Specifying groups
+By default all pairwise comparisons will be computed, but user-specified groups can be given as well. In our example, if we only wanted to compare `A` and `B` in the `group` category, we could specify both `GroupOne` and `GroupTwo` as follows:
+
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt --one "group A" --two "group B"
+
+More than one value per category can be specified, as follows:
+
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt --one "position up down" --two "position sideways floating"
+
+A one vs. all comparison can be performed by specifying only `GroupOne`, which will then be compared to a group comprising all non-specified values of the same category. For example:
+
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt --one "position up"
+
+The above will construct `GroupOne` as `up` and `GroupTwo` as `down sideways floating`.
+
+### Filtering the results
+By default, `feht` will output every result for every comparison. If you wish to limit the number of results, the `ratioFilter` can be used, where only results with a ratio greater than or equal to the value will be displayed. For example, in the first block of results above, if we set the `ratioFilter` to `1.0` as in the following:
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt -f 1
+
+Only a single result is returned:
+
+    [#-
+    Group1 category: group Group1: B
+    Group2 category: group Group2: C
+    ---
+    Name	GroupOne (+)	GroupOne (-)	GroupTwo (+)	GroupTwo (-)	pValue	Ratio
+    binary21	4	0	0	2	1.0	1.0
+    -#]
+
+### Specifying a delimiter
+
+By default the tab character ('\t') is used as a delimiter, but any single character can be used. To use the comma character (','), pass it to the `-l` argument ("el") with single quotes around the delimiter:
+
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt -l ','
+
+### Turning off multiple-testing correction
+
+If desired, the multiple-testing correction can be turned off by specifying "none" to the `-c` option. For example, to run a comparison with no correction:
+
+    feht -i data/test_metadata.txt -d data/test_binary.txt -c none
+
+### Built-in data types 
+`feht` by default operates on a table of binary data, but comes with built-in support for single-nucleotide variant (SNV) data.
+
+For each data entry, `feht` will convert the SNV into a binary comparison for all four nucleotides. Consider the provided data in `data/test_snps.txt`:
+
+        GenomeA	GenomeB	GenomeC	GenomeD	GenomeE	GenomeF	GenomeG	GenomeH	GenomeI	GenomeJ	GenomeK	GenomeL
+    snp1	T	-	T	T	C	A	A	C	A	A	C	C
+    snp2	A	T	T	A	A	A	T	A	A	T	A	T
+    snp3	C	G	T	T	A	A	G	C	C	C	A	A
+    snp4	C	T	A	C	G	T	T	-	A	A	C	A
+    ...
+
+If the following comparison is run:
+
+
+    feht -i data/test_metadata.txt -d data/test_snps.txt -m snp 
+
+The following is produced:
+
+    [#-
+    Group1 category: group Group1: B
+    Group2 category: group Group2: C
+    ---
+    Name	GroupOne (+)	GroupOne (-)	GroupTwo (+)	GroupTwo (-)	pValue	Ratio
+    snp32_a	0	4	2	0	1.0	-1.0
+    snp45_a	0	4	2	0	1.0	-1.0
+    snp41_t	0	4	2	0	1.0	-1.0
+    snp21_t	0	4	2	0	1.0	-1.0   
+    ...
+
+This shows that the data for `snp32` has been converted into binary, with the nucleotide under comparison appended. For `snp32_a` this represents all `A` characters in the data as positive, and all `C`, `T`, and `G` characters as negative. All four comparisons are carried out for each row of SNV data.
+
+### Missing data
+Within the `binary` mode, if data are not in `1` or `0` form, they will be ignored, and will not contribute to the calculations as either a positive or negative value; the total data for groups will be adjusted to accommodate the missing data. The same is true for `snp` mode data that are not `A, C, T, G`. For example:
+
+
+        GenomeA	GenomeB	GenomeC	GenomeD	GenomeE	GenomeF	GenomeG	GenomeH	GenomeI	GenomeJ	GenomeK	GenomeL
+    snp1	T	-	T	T	C	A	A	C	A	A	C	C
+
+This row contains 12 possible entries, but the one for `GenomeB` is of the form `-`. This entry will be ignored, and only 11 data points will be used. This means that if `GroupOne` normally has 4 members and contains `GenomeB`, then for the `snp1` calculations it will be as if it contained only 3 members.
 
-The program takes command line arguments, 3 of which are required. For example:
-
-    ./feht --info=data/metadata.txt --datafile=data/data.tab --mode="snp" > output.txt
-
-Files can be either tab-delimited, or comma-delimited. The default is tab, but it can be changed by including
-
-    --delimter=","
-             
-in the command line options. Both the metadata file and the datafile must use the same delimiter.
-
-The info file is the delimited metadata, where the column headers denote the metadata categories, the row labels denote the subject names (which must exactly match the names in the datafile), and the cells are values for the metadata category for a given subject.
-
-The datafile is the actual data, where the column headers denote the subject names (which must exactly match the row labels in the info file), the row labels denote the factor name (which must be unique), and the cells are the factors for a particular subject. If the factors represent binary data, this can be specified as:
-
-    --mode="binary"
-
-If the mode is set as “snp”, using:
-
-    --mode="snp"
-
- then the factors are assumed to be genetic data of A, C, T, or G, and each of A vs not-A, C vs not-C, T vs not T, G vs not-G are computed, and (_c, _t, _g, _a) is appended to the factor name in the results, to denote the comparison that was significant.
-
-There are two additional, optional arguments: `--one`, and `--two`, which are used to specify the Metadata Category and values for groups of interest. If neither `--one` nor `--two` are specified, then all possible combinations for all metadata categories will be computed eg. if there is a Province column, then AB vs. not-AB, AB vs. BC, AB vs. SK, AB vs. MB ... will be computed, and this is done for every column in the `--info` file.
-
-If only `--one` is specified, then that group is compared against all others of the same category. Using our previous Province example, to compare AB and NB to a group consisting of all other provinces, the following would be run:
-
-    ./feht --info=data/metadata.txt --datafile=data/data.tab --mode="snp" --one="Province AB NB" > output.txt
-
-The options given to --one must be quoted, and the first word must contain no spaces and be an exact match to a Column Name in the `--info` sheet, followed by space-separated values that should be included as part of the comparison group.
-
-The same applies to specifying arguments to `--two`, and if specified in addition to `--one`, only those two groups will be compared. For example:
-
-    ./feht --info=data/metadata.txt --datafile=data/data.tab --mode="snp" --one="Province AB NB" --two="Province NS QC OE" > output.txt
diff --git a/src/UserInput.hs b/src/UserInput.hs
@@ -52,12 +52,12 @@ feht = UserInput
       <> help "File of binary or single-nucleotide variant data")
   <*> strOption
       (long "one"
-      <> metavar "Group1Name Group1Item Group1Item Group1Item"
+      <> metavar "\"Group1Name Group1Item Group1Item Group1Item\""
       <> value "all all"
       <> help "Group1 column name, followed by optional Group1 labels to include as part of the group")
   <*> strOption
       (long "two"
-      <> metavar "Group2Name Group2Item Group2Item Group2Item"
+      <> metavar "\"Group2Name Group2Item Group2Item Group2Item\""
       <> value "all all"
       <> help "Group2 column name, followed by optional Group2 labels to include as part of the group")
   <*> option auto
@@ -82,7 +82,7 @@ feht = UserInput
   <*> option auto
       (long "ratioFilter"
       <> short 'f'
-      <> metavar "[Filter results by ratio (0.00-1.0), DEFAULT=0]"
+      <> metavar "[Filter results by ratio (0.00-1.00), DEFAULT=0]"
       <> value 0
       <> help "Display only those results greater than or equal to the value"
       )
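
The README added in this commit defines the `Ratio` column as the fraction of `GroupOne` that is positive minus the fraction of `GroupTwo` that is positive, with missing entries (such as `-`) excluded from the group totals. A minimal Python sketch of that arithmetic follows. It is not taken from the Haskell source: the names `count_pos_neg`, `ratio`, `BINARY`, and `NUCLEOTIDES` are hypothetical, and returning `None` when a group has no usable entries is an assumption made for illustration only.

```python
from typing import FrozenSet, Optional, Sequence, Tuple

# Hypothetical constants: the values the README says each mode accepts.
BINARY: FrozenSet[str] = frozenset({"0", "1"})
NUCLEOTIDES: FrozenSet[str] = frozenset({"A", "C", "G", "T"})


def count_pos_neg(
    values: Sequence[str],
    positive: str = "1",
    valid: FrozenSet[str] = BINARY,
) -> Tuple[int, int]:
    """Count positive and negative entries, skipping anything not in `valid`."""
    pos = sum(1 for v in values if v == positive and v in valid)
    neg = sum(1 for v in values if v in valid and v != positive)
    return pos, neg


def ratio(
    group_one: Sequence[str],
    group_two: Sequence[str],
    positive: str = "1",
    valid: FrozenSet[str] = BINARY,
) -> Optional[float]:
    """Fraction of group one positive minus fraction of group two positive."""
    p1, n1 = count_pos_neg(group_one, positive, valid)
    p2, n2 = count_pos_neg(group_two, positive, valid)
    if p1 + n1 == 0 or p2 + n2 == 0:
        # Assumption: a group with no usable data has no defined ratio.
        return None
    return p1 / (p1 + n1) - p2 / (p2 + n2)


if __name__ == "__main__":
    # The binary21 row from the README: 4 of 4 positive vs. 0 of 2 positive.
    print(ratio(["1", "1", "1", "1"], ["0", "0"]))
    # A group of four whose second entry is missing counts as three members.
    print(count_pos_neg(["T", "-", "T", "T"], positive="T", valid=NUCLEOTIDES))
```

In `snp` mode the same function would be called once per nucleotide, with `positive` set to `A`, `C`, `G`, or `T` in turn, which mirrors the `snp32_a` style of row described in the README.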