Filter Expressions Testing for Train Dataset or Eval Dataset

Hu Zhanghao edited this page Jul 3, 2019 · 4 revisions

In Shifu, filter expressions can be used to filter the training dataset and the eval datasets. Filter expressions follow the standard JEXL syntax: http://commons.apache.org/proper/commons-jexl/reference/syntax.html. However, an expression cannot be verified until the user runs a step such as stats, norm, or eval. If the expression is malformed, or a variable in the expression does not exist, it may produce unexpected results. For example, the user may find logs like the following, where zero records are stored:

Output(s):
Successfully stored 0 records (2180 bytes) in: "hdfs://.../..."

Counters:
Total records written : 0
Total bytes written : 2180
...
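One way such a log can arise is a misspelled variable name. The snippet below is a minimal Python sketch of that failure mode, not Shifu or JEXL code: it assumes the expression engine treats an unknown variable as null instead of raising an error, and the records, the column names, and the `keep` helper are all hypothetical.

```python
# Hypothetical records; the column names are illustrative, not from Shifu.
records = [
    {"country": "US", "amount": 120.0},
    {"country": "DE", "amount": 35.5},
    {"country": "US", "amount": 8.0},
]


def keep(record, column, expected):
    """Lenient equality filter: a missing column yields None, not an error."""
    return record.get(column) == expected


# Correctly spelled column: some records pass the filter.
correct = [r for r in records if keep(r, "country", "US")]

# Misspelled column: every lookup yields None, so no record passes
# and nothing signals that the expression itself is wrong.
misspelled = [r for r in records if keep(r, "contry", "US")]

print(len(correct), len(misspelled))
```

Under that assumption the misspelled filter fails silently: the job still succeeds, but it writes no records.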

Since shifu-0.12.x, a test command has been available to test the filters for the training dataset and the eval datasets. The command is:

  • $ shifu test -filter [EvalSetNames] [-n numOfRecords]
    • If no EvalSetNames is specified, the command tests the filter for the training dataset.
    • To test the filters for multiple eval sets, specify the eval set names separated by commas, for example EvalTest1,EvalTest2,EvalTest3.
    • By default, the test command tests the filter expression against 100 records. To test against more records, use -n to change the number.
    • * can be used as EvalSetNames. In that case, Shifu tests all possible filters in ModelConfig.json.
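The idea of checking a filter against a limited number of records can be sketched in Python as follows. This is only an illustration under stated assumptions, not how Shifu implements the test command or what it prints: `check_filter`, its parameters, and the sample data are hypothetical, and the default of 100 records simply mirrors the default described above.

```python
from itertools import islice


def check_filter(records, predicate, columns, n=100):
    """Apply predicate to at most n records and summarize the outcome."""
    checked = passed = 0
    missing = set()
    for record in islice(records, n):
        checked += 1
        # Track variables the expression needs but the record lacks.
        missing.update(c for c in columns if c not in record)
        try:
            if predicate(record):
                passed += 1
        except (KeyError, TypeError):
            # A record the expression cannot evaluate counts as not passing.
            pass
    return {"checked": checked, "passed": passed, "missing": sorted(missing)}


# Hypothetical sample data: 250 records with a single "amount" column.
sample = ({"amount": i} for i in range(250))
report = check_filter(sample, lambda r: r["amount"] >= 50, ["amount"])
print(report)  # {'checked': 100, 'passed': 50, 'missing': []}
```

A report with zero passing records, or with a non-empty list of missing variables, points to a problem in the expression before any long-running step is started.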

By leveraging the shifu test command, filter expressions can be validated at a very early stage.
