
Merge pull request #2 from zifeishan/develop
BrainDump 0.1.2

- Better layout and more information in the automatic report (README.md)
- Good-Turing estimator for extractions and features
- Examine why features get high/low weight by looking at the number of associated examples
- Refined documentation; added common diagnostics for KBP with BrainDump
zifeishan committed Jan 2, 2015
2 parents bdc0b05 + 1181797 commit e4a470c
Showing 125 changed files with 16,527 additions and 105 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
.DS_Store
util/config-generator/venv/
129 changes: 88 additions & 41 deletions README.md
@@ -1,31 +1,44 @@
BrainDump
====

Creates automatic reports after each DeepDive run.
BrainDump is the automatic report generator for [DeepDive](http://deepdive.stanford.edu/). With
BrainDump, developers get an automated report after each DeepDive
run and can delve into features, supervision, and corpus statistics,
which makes error analysis for Knowledge Base Construction
easier.

Please also refer to the [DeepDive tutorial](http://deepdive.stanford.edu/doc/basics/walkthrough/walkthrough-improve.html) that uses BrainDump.


Installation
----

### Dependencies

You need python library `click` to use the automatic configuration
functionality. Run: ```pip install click```
You can *optionally* install the Python library `click` to use the
automatic configuration generation functionality. Run:

```
pip install click
```

Alternatively, you can skip the automatic configuration functionality
and manually configure `braindump.conf`.

### Install
### Install BrainDump

Run

```
make
```

to install `braindump` into `$HOME/local/bin/`. Be sure to include that in your PATH if you haven't:
to install `braindump` into `$HOME/local/bin/`. Be sure to include that in your `PATH` if you haven't:

```
export PATH=$PATH:$HOME/local/bin/
```

`export PATH=$PATH:$HOME/local/bin/`

Configuration
----
@@ -35,9 +48,9 @@ To integrate BrainDump into your DeepDive application, first you need a `braindu
To set up `braindump.conf`, you can just run `braindump` once and it
will generate a `braindump.conf` in the current directory through an
interactive command line interface. Alternatively, modify the example
provided in `examples/spouse_custom/` --- this sample configuration
provided in `examples/tutorial_example/` --- this sample configuration
file has been configured for
`DEEPDIVE_HOME/examples/spouse_example/tsv_extractor`.
`DEEPDIVE_HOME/examples/tutorial_example`.
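
If you choose the interactive route, a first run looks roughly like this (a sketch; the path is a placeholder, and the prompts ask for the values described in the Configuration Specification below):

```
# Run once from your application directory; when no braindump.conf is
# present, braindump starts an interactive setup (requires `click`).
cd path/to/your/deepdive-app
braindump
```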


Integrating with your DeepDive application
@@ -86,24 +99,21 @@ Each automated report will be generated as a directory of files. Here is one exa
```
APP_HOME/experiment-reports/v00001/:
.
├── README.md -- A short summary of the report
├── README.md -- (USEFUL) A short summary of the report, with most of the information you need!
├── calibration -- Calibration plots
│   ├── has_spouse.is_true.png
│   └── has_spouse.is_true.tsv
├── code -- Saved code for this run
├── code -- Saved code for this run (default saves "application.conf" and "udf/")
│   ├── application.conf
│   └── udf
│   ├── ext_has_spouse.py
│   ├── ext_has_spouse_features.py
│   └── ext_people.py
├── dd-out -- A symbolink to the corresponding deepdive output directory
├── dd-out -- A symbolic link to the corresponding deepdive output directory
├── dd-timestamp -- A timestamp of this deepdive run
├── features -- Features
│   ├── counts -- A frequency histogram of the most frequent features
│   │   └── has_spouse_features.tsv
│   ├── samples -- A random sample of the feature table
│   │   └── has_spouse_features.csv
│   └── weights -- Features with highest and lowest weights. Important features.
│   └── weights -- (USEFUL) Features with highest and lowest weights.
│   ├── negative_features.tsv
│   └── positive_features.tsv
├── inference -- A sample of inference results with >0.9 expectation. Can be used for Mindtagger input when configured.
@@ -118,33 +128,70 @@ APP_HOME/experiment-reports/v00001/:
```
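
After a run, you can inspect the newest report directly from the shell, for example:

```
# `latest` points to the most recent version directory under experiment-reports/.
cd experiment-reports/latest
less README.md                               # the summary
less features/weights/positive_features.tsv  # top positive features
```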

Example Configurations and Reports
----

You can browse example configurations and reports in the `examples/` directory.

```
tutorial_example -- BrainDump config and sample report for DeepDive tutorial
kbp-small -- BrainDump config and sample report for a small KBP application
paleo-small -- BrainDump config and sample report for a small Paleo application
genomics -- BrainDump config for a genomics application
```

Common Diagnostics in KBC applications
----

After each `braindump` run, go into `experiment-reports/latest/` and run the following diagnostics:

### Diagnostics with README.md

- Look at README.md in your report directory. *(It looks better on GitHub. Try pushing it!)*
- First, understand your corpus statistics (how many documents and sentences you have). Do they look correct?
- Look at each variable's statistics. You will see statistics about mention candidates, positive and negative examples, and extracted mentions and entities. Do they look correct? Do you have enough documents and enough positive and negative examples?
- Look at the **"Good-Turing estimation"** of the probability that the *next extracted mention is unseen*. This includes an [estimator](http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation) and a [confidence interval](http://www.cs.princeton.edu/~schapire/papers/good-turing.ps.gz). If the estimator is too high (e.g. higher than 0.05), or the upper bound of the confidence interval is too high (e.g. higher than 0.1), you are far from exhausting your domain of extraction, and you may need to **add more data**.

- Look at the top positive and negative features. Do they look reasonable?
- If a positive feature looks unreasonable, why does it get a high weight? Look at its numbers of positive and negative examples. Does it have enough negative examples correlated with this feature? If not, **add more negative examples** that may carry this feature (and others), by **adding more data** or **adding additional supervision rules**.
- Similarly, if a negative feature looks unreasonable, why does it get a low weight? Can you add positive examples with this feature?

- Look at the Good-Turing estimator for features. As above, if the estimator is high, your features are quite sparse; a sketch for recomputing it by hand follows this list.
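
As a rough cross-check of the feature estimator: the Good-Turing estimate of the unseen probability mass is `N1/N`, where `N1` is the number of features seen exactly once and `N` is the total number of feature occurrences. A minimal sketch, assuming a Postgres feature table `has_spouse_features` with a `feature` column (swap in your own table and column names):

```
# Sketch: Good-Turing unseen-mass estimate N1/N for a feature table.
# Table and column names are assumptions; adjust to your schema.
psql "$DBNAME" -c "
  SELECT SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END)::float8 / SUM(cnt)
         AS gt_unseen_estimate
  FROM (SELECT feature, COUNT(*) AS cnt
        FROM has_spouse_features
        GROUP BY feature) AS histo;"
```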

### Other diagnostics

If you configured `INFERENCE_SAMPLE_SCRIPT` properly --- see the [DeepDive tutorial](http://deepdive.stanford.edu/doc/basics/walkthrough/walkthrough-improve.html#braindump) and the [Mindtagger docs](http://deepdive.stanford.edu/doc/basics/labeling.html) for how --- you can copy `inference/*.csv` into your Mindtagger task as an input file. You can do the same for supervision results. See `examples/tutorial_example/bdconfigs/sample-inference.sh` for how we configure this for the DeepDive tutorial.
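
A manual copy might look like the following (the report version, CSV name, and task directory are hypothetical; adjust them to your setup):

```
# Hypothetical file names -- check inference/ for the actual CSVs.
cp experiment-reports/latest/inference/has_spouse.is_true.csv \
   path/to/mindtagger-task/input.csv
```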

We recommend pushing the whole report to GitHub: it is not large, and GitHub visualizes it well. You can browse not only a beautified `README.md`, but also `features/weights/positive_features.tsv` and `features/weights/negative_features.tsv`, in a very friendly way, to judge whether the current features make sense and whether you need more examples (more data or more supervision rules).
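
If you prefer pushing by hand over setting `SEND_RESULT_WITH_GIT`, something like this works, assuming your application directory is already a Git repository:

```
# Manual alternative to SEND_RESULT_WITH_GIT / SEND_RESULT_WITH_GIT_PUSH.
git add experiment-reports/v00001
git commit -m "BrainDump report v00001"
git push
```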

For more diagnostics in KBC applications, please read [the feature engineering paper](http://arxiv.org/abs/1407.6439).


Configuration Specification
----

- APP_HOME: the base directory of the DeepDive application
- DD_OUTPUT_DIR: the output folder of DeepDive
- DBNAME: the name of your working database
- PGUSER: the user name to connect to database
- PGPASSWORD: the password of your database
- PGPORT: the port to connect to the database
- PGHOST: the host to connect to the database
- FEATURE_TABLES: all tables that contain features. Separated by space. e.g. "f1 f2 f3"
- FEATURE_COLUMNS: columns that contain features in the same order of FEATURE_TABLES. Separated by space. e.g. "c1 c2 c3"
- VARIABLE_TABLES: all tables that contain DeepDive variables (as defined in table schema). Separated by space. e.g. "v1 v2"
- VARIABLE_COLUMNS: variable columns in the same order of VARIABLE_TABLES. Separated by space. e.g. "v1 v2"
- VARIABLE_WORDS_COLUMNS: if the variable is a mention, specify the words / description for the mention. This is used for a statistics with naive entity linking. If empty (""), do not count deduplicated mentions for that table. Separated by space. e.g. w1 ""
- VARIABLE_DOCID_COLUMNS: specify if there is a field in the variable table that indicates doc_id. This is used to count how many documents have extractions. If empty (""), do not count for that table. Separated by space. e.g. "" did2
- CODE_CONFIG: a config file that specifies what in $APP_HOME to save as codes, one file/folder per line. Default file is saving "application.conf" and "udf".
- NUM_SAMPLED_FEATURES: the number of sampled features for each feature table specified in "FEATURE_TABLES"
- NUM_SAMPLED_SUPERVISION: the number of sampled supervision examples for each variable specified in "FEATURE_COLUMNS"
- NUM_SAMPLED_RESULT: the number of sampled inference results for each variable
- NUM_TOP_ENTITIES: the number of top extracted entities listed
- SENTENCE_TABLE: a table that contains all sentences
- SENTENCE_TABLE_DOC_ID_COLUMN: document_id column of the sentence table
- SEND_RESULT_WITH_GIT: whether to send the result with Git by creating a new commit in "experiment-reports" branch.
- SEND_RESULT_WITH_GIT_PUSH: whether to automate the push in the "experiment-reports" branch.
<!-- - SEND_RESULT_WITH_EMAIL: whether to send an email report (not implemented yet) -->
- STATS_SCRIPT: specify the path to a script to override default statistics reporting. The script will run instead of the default statistics reporting procedure.
- SUPERVISION_SAMPLE_SCRIPT: Specify a script to override default supervision sampling procedure.
- INFERENCE_SAMPLE_SCRIPT: Specify a script to override default inference result sampling procedure.
- `APP_HOME`: the base directory of the DeepDive application
- `DD_OUTPUT_DIR`: the output folder of DeepDive
- `DBNAME`: the name of your working database
- `PGUSER`: the user name used to connect to the database
- `PGPASSWORD`: the password for your database
- `PGPORT`: the port used to connect to the database
- `PGHOST`: the host used to connect to the database
- `FEATURE_TABLES`: all tables that contain features, separated by spaces. e.g. "f1 f2 f3"
- `FEATURE_COLUMNS`: the columns that contain features, in the same order as FEATURE_TABLES and separated by spaces. e.g. "c1 c2 c3"
- `VARIABLE_TABLES`: all tables that contain DeepDive variables (as defined in the table schema), separated by spaces. e.g. "v1 v2"
- `VARIABLE_COLUMNS`: the variable columns, in the same order as VARIABLE_TABLES and separated by spaces. e.g. "v1 v2"
- `VARIABLE_WORDS_COLUMNS`: if a variable is a mention, the column holding the words / description of the mention. This is used for statistics with naive entity linking. If empty (""), deduplicated mentions are not counted for that table. Separated by spaces. e.g. w1 ""
- `VARIABLE_DOCID_COLUMNS`: the column in each variable table that holds the doc_id, if there is one. This is used to count how many documents have extractions. If empty (""), the count is skipped for that table. Separated by spaces. e.g. "" did2
- `CODE_CONFIG`: a config file that specifies what in $APP_HOME to save as code, one file/folder per line. The default saves "application.conf" and "udf".
- `NUM_SAMPLED_FEATURES`: the number of sampled features for each feature table specified in "FEATURE_TABLES"
- `NUM_SAMPLED_SUPERVISION`: the number of sampled supervision examples for each variable specified in "VARIABLE_TABLES"
- `NUM_SAMPLED_RESULT`: the number of sampled inference results for each variable
- `NUM_TOP_ENTITIES`: the number of top extracted entities listed
- `SENTENCE_TABLE`: the table that contains all sentences
- `SENTENCE_TABLE_DOC_ID_COLUMN`: the document-id column of the sentence table
- `SEND_RESULT_WITH_GIT`: whether to save the result with Git by creating a new commit on the "experiment-reports" branch.
- `SEND_RESULT_WITH_GIT_PUSH`: whether to push automatically on the "experiment-reports" branch.
- `STATS_SCRIPT`: the path to a script that runs instead of the default statistics reporting procedure.
- `SUPERVISION_SAMPLE_SCRIPT`: a script that overrides the default supervision sampling procedure.
- `INFERENCE_SAMPLE_SCRIPT`: a script that overrides the default inference result sampling procedure.
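
For concreteness, here is a minimal `braindump.conf` sketch along these lines; all table and column names are illustrative assumptions (see `examples/` for real configurations):

```
# Minimal sketch -- adapt every name below to your application.
export APP_HOME=$WORKING_DIR
export DD_OUTPUT_DIR="$HOME/deepdive/out"
export DBNAME=deepdive_spouse
export FEATURE_TABLES=(has_spouse_features)
export FEATURE_COLUMNS=(feature)
export VARIABLE_TABLES=(has_spouse)
export VARIABLE_COLUMNS=(is_true)
export VARIABLE_WORDS_COLUMNS=("")
export VARIABLE_DOCID_COLUMNS=("")
export SENTENCE_TABLE=sentences
export SENTENCE_TABLE_DOC_ID_COLUMN=document_id
export SEND_RESULT_WITH_GIT=false
```
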
41 changes: 23 additions & 18 deletions braindump.sh
Original file line number Diff line number Diff line change
@@ -27,41 +27,46 @@ for ((verNumber=1; ;verNumber+=1)); do
# if directory doesn't exist.
if [ ! -d "$REPORT_DIR/$versionName" ]; then

echo "Saving into $REPORT_DIR/$versionName/"
echo "[`date`] Saving into $REPORT_DIR/$versionName/"

mkdir -p $REPORT_DIR/$versionName/
cd $REPORT_DIR/$versionName/

echo "Saving mapping with DeepDive output directory..."
ln -s $DD_THIS_OUTPUT_DIR ./dd-out
echo $DD_TIMESTAMP > ./dd-timestamp
# Skip if last DeepDive run is not finished
if [ -d $DD_THIS_OUTPUT_DIR/calibration ]; then
echo "[`date`] Saving mapping with DeepDive output directory..."
ln -s $DD_THIS_OUTPUT_DIR ./dd-out
echo $DD_TIMESTAMP > ./dd-timestamp

echo "Saving code..."
mkdir -p code
bash $UTIL_DIR/save.sh $APP_HOME ./code/
echo "[`date`] Saving code..."
mkdir -p code
bash $UTIL_DIR/save.sh $APP_HOME ./code/

echo "Saving calibration..."
cp -r $DD_THIS_OUTPUT_DIR/calibration ./
echo "[`date`] Saving calibration..."
cp -r $DD_THIS_OUTPUT_DIR/calibration ./
else
echo "WARNING: last deepdive run $DD_THIS_OUTPUT_DIR seems incomplete, skipping..."
fi

echo "Saving statistics..."
echo "[`date`] Saving statistics..."
bash $UTIL_DIR/stats.sh $VARIABLE_TABLES $VARIABLE_COLUMNS $VARIABLE_WORDS_COLUMNS

echo "Diffing against last version..."
echo "[`date`] Diffing against last version..."
if [ $verNumber -gt 1 ]; then
mkdir -p changes
lastVersionName=v`printf "%05d" $(expr $verNumber - 1)`
bash $UTIL_DIR/diff.sh ../$lastVersionName/code/ ./code/ changes/code.diff
bash $UTIL_DIR/diff.sh ../$lastVersionName/stats/ ./stats/ changes/stats.diff
fi

echo "Saving features..."
echo "[`date`] Saving features..."
mkdir -p features

mkdir -p features/weights/
bash $UTIL_DIR/feature/feature_weights.sh features/weights/

num_features=${#FEATURE_TABLES[@]}
echo "Examining $num_features feature tables..."
echo "[`date`] Examining $num_features feature tables..."
for (( i=0; i<${num_features}; i++ )); do
table=${FEATURE_TABLES[$i]}
column=${FEATURE_COLUMNS[$i]}
@@ -74,14 +79,14 @@ for ((verNumber=1; ;verNumber+=1)); do
bash $UTIL_DIR/feature/feature_counts.sh $table $column features/counts/
done

echo "Saving supervision & inference results..."
echo "[`date`] Saving supervision & inference results..."
mkdir -p inference
mkdir -p supervision

if [[ -z "$SUPERVISION_SAMPLE_SCRIPT" ]]; then
# Supervision sample script not set, use default
num_variables=${#VARIABLE_TABLES[@]};
echo "Examining $num_variables variable tables...";
echo "[`date`] Examining $num_variables variable tables for supervision...";
for (( i=0; i<${num_variables}; i++ )); do
table=${VARIABLE_TABLES[$i]}
column=${VARIABLE_COLUMNS[$i]}
@@ -98,9 +103,9 @@ for ((verNumber=1; ;verNumber+=1)); do
fi

if [[ -z "$INFERENCE_SAMPLE_SCRIPT" ]]; then
# Supervision sample script not set, use default
# Inference sample script not set, use default
num_variables=${#VARIABLE_TABLES[@]};
echo "Examining $num_variables variable tables...";
echo "[`date`] Examining $num_variables variable tables for inference...";
for (( i=0; i<${num_variables}; i++ )); do
table=${VARIABLE_TABLES[$i]}
column=${VARIABLE_COLUMNS[$i]}
Expand All @@ -113,7 +118,7 @@ for ((verNumber=1; ;verNumber+=1)); do
bash $INFERENCE_SAMPLE_SCRIPT
fi

echo "Generating README.md..."
echo "[`date`] Generating README.md..."
bash $UTIL_DIR/generate_readme.sh $REPORT_DIR/$versionName/

if [[ "$SEND_RESULT_WITH_GIT" = "true" ]]; then
Empty file added examples/empty-code.conf
Empty file.
87 changes: 87 additions & 0 deletions examples/genomics/braindump.conf
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@

########## Conventions. Changing these is not recommended. ###########

# Set the utility files dir
export UTIL_DIR="$HOME/local/braindump"

# Report folder: use current
export REPORT_DIR="$WORKING_DIR/experiment-reports"


########## User-specified configurations ###########

# Directories

# Use absolute path if possible.
# Avoid using "pwd" or "dirname $0", they don't work properly.
# $WORKING_DIR is set to be the directory where braindump is running.
# (the directory that contains braindump.conf)
export APP_HOME=$WORKING_DIR

# Specify deepdive out directory (DEEPDIVE_HOME/out)
export DD_OUTPUT_DIR="$HOME/repos/deepdive/out"

# Database Configuration
export DBNAME=$DBNAME
export PGUSER=${PGUSER:-`whoami`}
export PGPASSWORD=${PGPASSWORD:-}
export PGPORT=${PGPORT:-5432}
export PGHOST=${PGHOST:-localhost}

# Specify all feature tables.
# e.g. FEATURE_TABLES=(f1 f2 f3)
export FEATURE_TABLES=(dd_query_classify_gene_hpoterm_relations_features)
export FEATURE_COLUMNS=(feature)

# Specify all variable tables
export VARIABLE_TABLES=(hpoterm_mentions gene_mentions gene_hpoterm_relations)
export VARIABLE_COLUMNS=(is_correct is_correct is_correct)
export VARIABLE_DOCID_COLUMNS=(doc_id doc_id doc_id)
export VARIABLE_WORDS_COLUMNS=(words words "words_1,words_2")
# Assume that in DeepDive, inference result tables will be named as [VARIABLE_TABLE]_[VARIABLE_COLUMN]_inference

# If the variable is a mention, specify the words / description for the mention.
# This is used for statistics with naive entity linking. If empty, do not count deduplicated mentions.
# e.g. export VARIABLE_WORDS_COLUMNS=(w1 "" w3)
# In the examples above, the second element is left empty
#export VARIABLE_WORDS_COLUMNS=("word1, word2,rel")

# Set variable docid columns to count distinct documents that have extractions
# export VARIABLE_DOCID_COLUMNS=(doc_id)

# Code configs to save
export CODE_CONFIG="$WORKING_DIR/../empty-code.conf"

# Number of samples
export NUM_SAMPLED_FEATURES=100
export NUM_SAMPLED_SUPERVISION=500
export NUM_SAMPLED_RESULT=1000
export NUM_TOP_ENTITIES=50

# Specify some tables for statistics
export SENTENCE_TABLE=sentences
export SENTENCE_TABLE_DOC_ID_COLUMN=doc_id

# Define how to send result. use "true" to activate.
export SEND_RESULT_WITH_GIT=false
# If true, push after committing the report
export SEND_RESULT_WITH_GIT_PUSH=false
export SEND_RESULT_WITH_EMAIL=false

######## CUSTOM SCRIPTS ###########
# Leave blank for default stats report.
# Set to a location of a script (e.g. $APP_HOME/your_script) to use it instead of default

# Self-defined scripts for stats.
export STATS_SCRIPT=
export SUPERVISION_SAMPLE_SCRIPT=
export INFERENCE_SAMPLE_SCRIPT=

########## Conventions. Changing these is not recommended. ###########

# Hack: use the last DD run as output dir
# Suppose out/ is under $DEEPDIVE_HOME/
# You may need to manually change it based on need
export DD_TIMESTAMP=`ls -t $DD_OUTPUT_DIR/ | head -n 1`
export DD_THIS_OUTPUT_DIR=$DD_OUTPUT_DIR/$DD_TIMESTAMP

12 changes: 12 additions & 0 deletions examples/genomics/env.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
#! /bin/bash

export DBNAME=genomics
export PGUSER=senwu
export PGPASSWORD=${PGPASSWORD:-}
export PGHOST=raiders2.stanford.edu
export PGPORT=6432

export GPHOST=${GPHOST:-localhost}
export GPPORT=${GPPORT:-15433}
export GPPATH=${GPPATH:-/tmp}
# . /lfs/local/0/senwu/software/greenplum/greenplum-db/before_greenplum.sh