
Merge pull request #2 from zifeishan/develop
BrainDump 0.1.2

- Better layout and more information in the automatic report (README.md)
- Good-Turing estimator for extractions and features
- Examine why features get high/low weight by looking at the number of associated examples
- Refined documentation; added common diagnostics for KBP with BrainDump
zifeishan committed Jan 2, 2015
2 parents bdc0b05 + 1181797 commit e4a470c
Showing 125 changed files with 16,527 additions and 105 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
.DS_Store
util/config-generator/venv/
129 changes: 88 additions & 41 deletions README.md
@@ -1,31 +1,44 @@
BrainDump
====

Creates automatic reports after each DeepDive run.
BrainDump is the automatic report generator for [DeepDive](http://deepdive.stanford.edu/). With
BrainDump, developers get an automated report after each DeepDive
run and can delve into features, supervision, and corpus statistics,
which makes error analysis for Knowledge Base Construction
easier.

Please also refer to the [DeepDive tutorial](http://deepdive.stanford.edu/doc/basics/walkthrough/walkthrough-improve.html) that uses BrainDump.


Installation
----

### Dependencies

You need python library `click` to use the automatic configuration
functionality. Run: ```pip install click```
You can *optionally* install the Python library `click` to use the
automatic configuration generation functionality. Run:

```
pip install click
```

Alternatively, you can skip the automatic configuration functionality
and manually configure `braindump.conf`.

### Install
### Install BrainDump

Run

```
make
```

to install `braindump` into `$HOME/local/bin/`. Be sure to include that in your PATH if you haven't:
to install `braindump` into `$HOME/local/bin/`. Be sure to include that in your `PATH` if you haven't:

```
export PATH=$PATH:$HOME/local/bin/
```

`export PATH=$PATH:$HOME/local/bin/`

Configuration
----
@@ -35,9 +48,9 @@ To integrate BrainDump into your DeepDive application, first you need a `braindu
To set up `braindump.conf`, you can just run `braindump` once and it
will generate a `braindump.conf` in the current directory through an
interactive command line interface. Alternatively, modify the example
provided in `examples/spouse_custom/` --- this sample configuration
provided in `examples/tutorial_example/` --- this sample configuration
file has been configured for
`DEEPDIVE_HOME/examples/spouse_example/tsv_extractor`.
`DEEPDIVE_HOME/examples/tutorial_example`.
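
If you choose the interactive route, a first run looks roughly like this (a sketch; the path is a placeholder, and the prompts ask for the values described in the Configuration Specification below):

```
# Run once from your application directory; when no braindump.conf is
# present, braindump starts an interactive setup (requires `click`).
cd path/to/your/deepdive-app
braindump
```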


Integrating with your DeepDive application
@@ -86,24 +99,21 @@ Each automated report will be generated as a directory of files. Here is one exa
```
APP_HOME/experiment-reports/v00001/:
.
├── README.md -- A short summary of the report
├── README.md -- (USEFUL) A short summary of the report, with most of the information you need!
├── calibration -- Calibration plots
│   ├── has_spouse.is_true.png
│   └── has_spouse.is_true.tsv
├── code -- Saved code for this run
├── code -- Saved code for this run (default saves "application.conf" and "udf/")
│   ├── application.conf
│   └── udf
│   ├── ext_has_spouse.py
│   ├── ext_has_spouse_features.py
│   └── ext_people.py
├── dd-out -- A symbolink to the corresponding deepdive output directory
├── dd-out -- A symbolic link to the corresponding deepdive output directory
├── dd-timestamp -- A timestamp of this deepdive run
├── features -- Features
│   ├── counts -- A frequency histogram of the most frequent features
│   │   └── has_spouse_features.tsv
│   ├── samples -- A random sample of the feature table
│   │   └── has_spouse_features.csv
│   └── weights -- Features with highest and lowest weights. Important features.
│   └── weights -- (USEFUL) Features with highest and lowest weights.
│   ├── negative_features.tsv
│   └── positive_features.tsv
├── inference -- A sample of inference results with >0.9 expectation. Can be used for Mindtagger input when configured.
@@ -118,33 +128,70 @@ APP_HOME/experiment-reports/v00001/:
```
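
After a run, you can inspect the newest report directly from the shell, for example:

```
# `latest` points to the most recent version directory under experiment-reports/.
cd experiment-reports/latest
less README.md                               # the summary
less features/weights/positive_features.tsv  # top positive features
```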

Example Configurations and Reports
----

You can browse example configurations and reports in the `examples/` directory.

```
tutorial_example -- BrainDump config and sample report for DeepDive tutorial
kbp-small -- BrainDump config and sample report for a small KBP application
paleo-small -- BrainDump config and sample report for a small Paleo application
genomics -- BrainDump config for a genomics application
```

Common Diagnostics in KBC applications
----

After each `braindump` run, go into `experiment-reports/latest/` and run the following diagnostics:

### Diagnostics with README.md

- Look at README.md in your report directory. *(It looks better on GitHub. Try pushing it!)*
- First, understand your corpus statistics (how many documents and sentences you have). Do they look correct?
- Look at each variable's statistics. You will see statistics about mention candidates, positive and negative examples, and extracted mentions and entities. Do they look correct? Do you have enough documents and enough positive and negative examples?
- Look at the **"Good-Turing estimation"** of the probability that the *next extracted mention is unseen*. This includes an [estimator](http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation) and a [confidence interval](http://www.cs.princeton.edu/~schapire/papers/good-turing.ps.gz). If the estimator is too high (e.g. higher than 0.05), or the upper bound of the confidence interval is too high (e.g. higher than 0.1), you are far from exhausting your domain of extraction, and you may need to **add more data**.

- Look at the top positive and negative features. Do they look reasonable?
- If a positive feature looks unreasonable, why does it get a high weight? Look at its numbers of positive and negative examples. Does it have enough negative examples correlated with this feature? If not, **add more negative examples** that may carry this feature (and others), by **adding more data** or **adding additional supervision rules**.
- Similarly, if a negative feature looks unreasonable, why does it get a low weight? Can you add positive examples with this feature?

- Look at the Good-Turing estimator for features. As above, if the estimator is high, your features are quite sparse; a sketch for recomputing it by hand follows this list.
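
As a rough cross-check of the feature estimator: the Good-Turing estimate of the unseen probability mass is `N1/N`, where `N1` is the number of features seen exactly once and `N` is the total number of feature occurrences. A minimal sketch, assuming a Postgres feature table `has_spouse_features` with a `feature` column (swap in your own table and column names):

```
# Sketch: Good-Turing unseen-mass estimate N1/N for a feature table.
# Table and column names are assumptions; adjust to your schema.
psql "$DBNAME" -c "
  SELECT SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END)::float8 / SUM(cnt)
         AS gt_unseen_estimate
  FROM (SELECT feature, COUNT(*) AS cnt
        FROM has_spouse_features
        GROUP BY feature) AS histo;"
```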

### Other diagnostics

If you configured `INFERENCE_SAMPLE_SCRIPT` properly --- see the [DeepDive tutorial](http://deepdive.stanford.edu/doc/basics/walkthrough/walkthrough-improve.html#braindump) and the [Mindtagger docs](http://deepdive.stanford.edu/doc/basics/labeling.html) for how --- you can copy `inference/*.csv` into your Mindtagger task as an input file. You can do the same for supervision results. See `examples/tutorial_example/bdconfigs/sample-inference.sh` for how we configure this for the DeepDive tutorial.
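
A manual copy might look like the following (the report version, CSV name, and task directory are hypothetical; adjust them to your setup):

```
# Hypothetical file names -- check inference/ for the actual CSVs.
cp experiment-reports/latest/inference/has_spouse.is_true.csv \
   path/to/mindtagger-task/input.csv
```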

We recommend pushing the whole report to GitHub: it is not large, and GitHub visualizes it well. You can browse not only a beautified `README.md`, but also `features/weights/positive_features.tsv` and `features/weights/negative_features.tsv`, in a very friendly way, to judge whether the current features make sense and whether you need more examples (more data or more supervision rules).
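
If you prefer pushing by hand over setting `SEND_RESULT_WITH_GIT`, something like this works, assuming your application directory is already a Git repository:

```
# Manual alternative to SEND_RESULT_WITH_GIT / SEND_RESULT_WITH_GIT_PUSH.
git add experiment-reports/v00001
git commit -m "BrainDump report v00001"
git push
```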

For more diagnostics in KBC applications, please read [the feature engineering paper](http://arxiv.org/abs/1407.6439).


Configuration Specification
----

- APP_HOME: the base directory of the DeepDive application
- DD_OUTPUT_DIR: the output folder of DeepDive
- DBNAME: the name of your working database
- PGUSER: the user name to connect to database
- PGPASSWORD: the password of your database
- PGPORT: the port to connect to the database
- PGHOST: the host to connect to the database
- FEATURE_TABLES: all tables that contain features. Separated by space. e.g. "f1 f2 f3"
- FEATURE_COLUMNS: columns that contain features in the same order of FEATURE_TABLES. Separated by space. e.g. "c1 c2 c3"
- VARIABLE_TABLES: all tables that contain DeepDive variables (as defined in table schema). Separated by space. e.g. "v1 v2"
- VARIABLE_COLUMNS: variable columns in the same order of VARIABLE_TABLES. Separated by space. e.g. "v1 v2"
- VARIABLE_WORDS_COLUMNS: if the variable is a mention, specify the words / description for the mention. This is used for a statistics with naive entity linking. If empty (""), do not count deduplicated mentions for that table. Separated by space. e.g. w1 ""
- VARIABLE_DOCID_COLUMNS: specify if there is a field in the variable table that indicates doc_id. This is used to count how many documents have extractions. If empty (""), do not count for that table. Separated by space. e.g. "" did2
- CODE_CONFIG: a config file that specifies what in $APP_HOME to save as codes, one file/folder per line. Default file is saving "application.conf" and "udf".
- NUM_SAMPLED_FEATURES: the number of sampled features for each feature table specified in "FEATURE_TABLES"
- NUM_SAMPLED_SUPERVISION: the number of sampled supervision examples for each variable specified in "FEATURE_COLUMNS"
- NUM_SAMPLED_RESULT: the number of sampled inference results for each variable
- NUM_TOP_ENTITIES: the number of top extracted entities listed
- SENTENCE_TABLE: a table that contains all sentences
- SENTENCE_TABLE_DOC_ID_COLUMN: document_id column of the sentence table
- SEND_RESULT_WITH_GIT: whether to send the result with Git by creating a new commit in "experiment-reports" branch.
- SEND_RESULT_WITH_GIT_PUSH: whether to automate the push in the "experiment-reports" branch.
<!-- - SEND_RESULT_WITH_EMAIL: whether to send an email report (not implemented yet) -->
- STATS_SCRIPT: specify the path to a script to override default statistics reporting. The script will run instead of the default statistics reporting procedure.
- SUPERVISION_SAMPLE_SCRIPT: Specify a script to override default supervision sampling procedure.
- INFERENCE_SAMPLE_SCRIPT: Specify a script to override default inference result sampling procedure.
- `APP_HOME`: the base directory of the DeepDive application
- `DD_OUTPUT_DIR`: the output folder of DeepDive
- `DBNAME`: the name of your working database
- `PGUSER`: the user name used to connect to the database
- `PGPASSWORD`: the password for your database
- `PGPORT`: the port used to connect to the database
- `PGHOST`: the host used to connect to the database
- `FEATURE_TABLES`: all tables that contain features, separated by spaces. e.g. "f1 f2 f3"
- `FEATURE_COLUMNS`: the columns that contain features, in the same order as FEATURE_TABLES and separated by spaces. e.g. "c1 c2 c3"
- `VARIABLE_TABLES`: all tables that contain DeepDive variables (as defined in the table schema), separated by spaces. e.g. "v1 v2"
- `VARIABLE_COLUMNS`: the variable columns, in the same order as VARIABLE_TABLES and separated by spaces. e.g. "v1 v2"
- `VARIABLE_WORDS_COLUMNS`: if a variable is a mention, the column holding the words / description of the mention. This is used for statistics with naive entity linking. If empty (""), deduplicated mentions are not counted for that table. Separated by spaces. e.g. w1 ""
- `VARIABLE_DOCID_COLUMNS`: the column in each variable table that holds the doc_id, if there is one. This is used to count how many documents have extractions. If empty (""), the count is skipped for that table. Separated by spaces. e.g. "" did2
- `CODE_CONFIG`: a config file that specifies what in $APP_HOME to save as code, one file/folder per line. The default saves "application.conf" and "udf".
- `NUM_SAMPLED_FEATURES`: the number of sampled features for each feature table specified in "FEATURE_TABLES"
- `NUM_SAMPLED_SUPERVISION`: the number of sampled supervision examples for each variable specified in "VARIABLE_TABLES"
- `NUM_SAMPLED_RESULT`: the number of sampled inference results for each variable
- `NUM_TOP_ENTITIES`: the number of top extracted entities listed
- `SENTENCE_TABLE`: the table that contains all sentences
- `SENTENCE_TABLE_DOC_ID_COLUMN`: the document-id column of the sentence table
- `SEND_RESULT_WITH_GIT`: whether to save the result with Git by creating a new commit on the "experiment-reports" branch.
- `SEND_RESULT_WITH_GIT_PUSH`: whether to push automatically on the "experiment-reports" branch.
- `STATS_SCRIPT`: the path to a script that runs instead of the default statistics reporting procedure.
- `SUPERVISION_SAMPLE_SCRIPT`: a script that overrides the default supervision sampling procedure.
- `INFERENCE_SAMPLE_SCRIPT`: a script that overrides the default inference result sampling procedure.
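
For concreteness, here is a minimal `braindump.conf` sketch along these lines; all table and column names are illustrative assumptions (see `examples/` for real configurations):

```
# Minimal sketch -- adapt every name below to your application.
export APP_HOME=$WORKING_DIR
export DD_OUTPUT_DIR="$HOME/deepdive/out"
export DBNAME=deepdive_spouse
export FEATURE_TABLES=(has_spouse_features)
export FEATURE_COLUMNS=(feature)
export VARIABLE_TABLES=(has_spouse)
export VARIABLE_COLUMNS=(is_true)
export VARIABLE_WORDS_COLUMNS=("")
export VARIABLE_DOCID_COLUMNS=("")
export SENTENCE_TABLE=sentences
export SENTENCE_TABLE_DOC_ID_COLUMN=document_id
export SEND_RESULT_WITH_GIT=false
```
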
41 changes: 23 additions & 18 deletions braindump.sh
Original file line number Diff line number Diff line change
@@ -27,41 +27,46 @@ for ((verNumber=1; ;verNumber+=1)); do
# if directory doesn't exist.
if [ ! -d "$REPORT_DIR/$versionName" ]; then

echo "Saving into $REPORT_DIR/$versionName/"
echo "[`date`] Saving into $REPORT_DIR/$versionName/"

mkdir -p $REPORT_DIR/$versionName/
cd $REPORT_DIR/$versionName/

echo "Saving mapping with DeepDive output directory..."
ln -s $DD_THIS_OUTPUT_DIR ./dd-out
echo $DD_TIMESTAMP > ./dd-timestamp
# Skip if last DeepDive run is not finished
if [ -d $DD_THIS_OUTPUT_DIR/calibration ]; then
echo "[`date`] Saving mapping with DeepDive output directory..."
ln -s $DD_THIS_OUTPUT_DIR ./dd-out
echo $DD_TIMESTAMP > ./dd-timestamp

echo "Saving code..."
mkdir -p code
bash $UTIL_DIR/save.sh $APP_HOME ./code/
echo "[`date`] Saving code..."
mkdir -p code
bash $UTIL_DIR/save.sh $APP_HOME ./code/

echo "Saving calibration..."
cp -r $DD_THIS_OUTPUT_DIR/calibration ./
echo "[`date`] Saving calibration..."
cp -r $DD_THIS_OUTPUT_DIR/calibration ./
else
echo "WARNING: last deepdive run $DD_THIS_OUTPUT_DIR seems incomplete, skipping..."
fi

echo "Saving statistics..."
echo "[`date`] Saving statistics..."
bash $UTIL_DIR/stats.sh $VARIABLE_TABLES $VARIABLE_COLUMNS $VARIABLE_WORDS_COLUMNS

echo "Diffing against last version..."
echo "[`date`] Diffing against last version..."
if [ $verNumber -gt 1 ]; then
mkdir -p changes
lastVersionName=v`printf "%05d" $(expr $verNumber - 1)`
bash $UTIL_DIR/diff.sh ../$lastVersionName/code/ ./code/ changes/code.diff
bash $UTIL_DIR/diff.sh ../$lastVersionName/stats/ ./stats/ changes/stats.diff
fi

echo "Saving features..."
echo "[`date`] Saving features..."
mkdir -p features

mkdir -p features/weights/
bash $UTIL_DIR/feature/feature_weights.sh features/weights/

num_features=${#FEATURE_TABLES[@]}
echo "Examining $num_features feature tables..."
echo "[`date`] Examining $num_features feature tables..."
for (( i=0; i<${num_features}; i++ )); do
table=${FEATURE_TABLES[$i]}
column=${FEATURE_COLUMNS[$i]}
@@ -74,14 +79,14 @@ for ((verNumber=1; ;verNumber+=1)); do
bash $UTIL_DIR/feature/feature_counts.sh $table $column features/counts/
done

echo "Saving supervision & inference results..."
echo "[`date`] Saving supervision & inference results..."
mkdir -p inference
mkdir -p supervision

if [[ -z "$SUPERVISION_SAMPLE_SCRIPT" ]]; then
# Supervision sample script not set, use default
num_variables=${#VARIABLE_TABLES[@]};
echo "Examining $num_variables variable tables...";
echo "[`date`] Examining $num_variables variable tables for supervision...";
for (( i=0; i<${num_variables}; i++ )); do
table=${VARIABLE_TABLES[$i]}
column=${VARIABLE_COLUMNS[$i]}
@@ -98,9 +103,9 @@ for ((verNumber=1; ;verNumber+=1)); do
fi

if [[ -z "$INFERENCE_SAMPLE_SCRIPT" ]]; then
# Supervision sample script not set, use default
# Inference sample script not set, use default
num_variables=${#VARIABLE_TABLES[@]};
echo "Examining $num_variables variable tables...";
echo "[`date`] Examining $num_variables variable tables for inference...";
for (( i=0; i<${num_variables}; i++ )); do
table=${VARIABLE_TABLES[$i]}
column=${VARIABLE_COLUMNS[$i]}
Expand All @@ -113,7 +118,7 @@ for ((verNumber=1; ;verNumber+=1)); do
bash $INFERENCE_SAMPLE_SCRIPT
fi

echo "Generating README.md..."
echo "[`date`] Generating README.md..."
bash $UTIL_DIR/generate_readme.sh $REPORT_DIR/$versionName/

if [[ "$SEND_RESULT_WITH_GIT" = "true" ]]; then
Empty file added examples/empty-code.conf
Empty file.
87 changes: 87 additions & 0 deletions examples/genomics/braindump.conf
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@

########## Conventions. Changing these is not recommended. ###########

# Set the utility files dir
export UTIL_DIR="$HOME/local/braindump"

# Report folder: use current
export REPORT_DIR="$WORKING_DIR/experiment-reports"


########## User-specified configurations ###########

# Directories

# Use absolute path if possible.
# Avoid using "pwd" or "dirname $0", they don't work properly.
# $WORKING_DIR is set to be the directory where braindump is running.
# (the directory that contains braindump.conf)
export APP_HOME=$WORKING_DIR

# Specify deepdive out directory (DEEPDIVE_HOME/out)
export DD_OUTPUT_DIR="$HOME/repos/deepdive/out"

# Database Configuration
export DBNAME=$DBNAME
export PGUSER=${PGUSER:-`whoami`}
export PGPASSWORD=${PGPASSWORD:-}
export PGPORT=${PGPORT:-5432}
export PGHOST=${PGHOST:-localhost}

# Specify all feature tables.
# e.g. FEATURE_TABLES=(f1 f2 f3)
export FEATURE_TABLES=(dd_query_classify_gene_hpoterm_relations_features)
export FEATURE_COLUMNS=(feature)

# Specify all variable tables
export VARIABLE_TABLES=(hpoterm_mentions gene_mentions gene_hpoterm_relations)
export VARIABLE_COLUMNS=(is_correct is_correct is_correct)
export VARIABLE_DOCID_COLUMNS=(doc_id doc_id doc_id)
export VARIABLE_WORDS_COLUMNS=(words words "words_1,words_2")
# Assume that in DeepDive, inference result tables will be named as [VARIABLE_TABLE]_[VARIABLE_COLUMN]_inference

# If the variable is a mention, specify the words / description for the mention.
# This is used for statistics with naive entity linking. If empty, do not count deduplicated mentions.
# e.g. export VARIABLE_WORDS_COLUMNS=(w1 "" w3)
# In the examples above, the second element is left empty
#export VARIABLE_WORDS_COLUMNS=("word1, word2,rel")

# Set variable docid columns to count distinct documents that have extractions
# export VARIABLE_DOCID_COLUMNS=(doc_id)

# Code configs to save
export CODE_CONFIG="$WORKING_DIR/../empty-code.conf"

# Number of samples
export NUM_SAMPLED_FEATURES=100
export NUM_SAMPLED_SUPERVISION=500
export NUM_SAMPLED_RESULT=1000
export NUM_TOP_ENTITIES=50

# Specify some tables for statistics
export SENTENCE_TABLE=sentences
export SENTENCE_TABLE_DOC_ID_COLUMN=doc_id

# Define how to send result. use "true" to activate.
export SEND_RESULT_WITH_GIT=false
# If true, push after committing the report
export SEND_RESULT_WITH_GIT_PUSH=false
export SEND_RESULT_WITH_EMAIL=false

######## CUSTOM SCRIPTS ###########
# Leave blank for default stats report.
# Set to a location of a script (e.g. $APP_HOME/your_script) to use it instead of default

# Self-defined scripts for stats.
export STATS_SCRIPT=
export SUPERVISION_SAMPLE_SCRIPT=
export INFERENCE_SAMPLE_SCRIPT=

########## Conventions. Changing these is not recommended. ###########

# Hack: use the last DD run as output dir
# Suppose out/ is under $DEEPDIVE_HOME/
# You may need to manually change it based on need
export DD_TIMESTAMP=`ls -t $DD_OUTPUT_DIR/ | head -n 1`
export DD_THIS_OUTPUT_DIR=$DD_OUTPUT_DIR/$DD_TIMESTAMP

12 changes: 12 additions & 0 deletions examples/genomics/env.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
#! /bin/bash

export DBNAME=genomics
export PGUSER=senwu
export PGPASSWORD=${PGPASSWORD:-}
export PGHOST=raiders2.stanford.edu
export PGPORT=6432

export GPHOST=${GPHOST:-localhost}
export GPPORT=${GPPORT:-15433}
export GPPATH=${GPPATH:-/tmp}
# . /lfs/local/0/senwu/software/greenplum/greenplum-db/before_greenplum.sh