wu haifeng edited this page Aug 2, 2020 · 80 revisions

Welcome to the Shifu wiki!

Shifu is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop. It is designed for data scientists and simplifies the life cycle of building machine learning models. Although originally built for fraud modeling, Shifu has been generalized to many other modeling domains.

Shifu provides a simple command-line interface for each step of the model-building process, including:

  • Statistics calculation and variable selection to determine the most predictive variables in your data
  • Variable normalization
  • Distributed variable selection based on sensitivity analysis
  • Distributed neural network model training
  • Distributed tree ensemble model training
  • Post-training analysis and model evaluation
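
To make the idea of variable selection by sensitivity analysis more concrete, here is a minimal single-machine sketch in Python. It is not Shifu's code and makes no claim about how Shifu implements this step. It assumes one common form of the technique: replace each input column with its mean, measure how much the model's mean squared error grows, and rank the columns by that growth. The names `sensitivity_ranking` and `toy_model` and the sample data are hypothetical.

```python
from statistics import mean


def sensitivity_ranking(model, rows, targets):
    """Rank column indices from most to least influential.

    A column's score is the increase in mean squared error observed
    when that column is replaced by its mean value in every row.
    """
    def mse(data):
        return mean((model(r) - t) ** 2 for r, t in zip(data, targets))

    baseline = mse(rows)
    scores = {}
    for j in range(len(rows[0])):
        col_mean = mean(r[j] for r in rows)
        # Copy each row with column j flattened to its mean.
        perturbed = [r[:j] + [col_mean] + r[j + 1:] for r in rows]
        scores[j] = mse(perturbed) - baseline
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy model: strong dependence on column 0,
# weak dependence on column 1, none on column 2.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]


rows = [[0.0, 1.0, 5.0], [1.0, 0.0, 2.0], [2.0, 3.0, 9.0], [3.0, 2.0, 1.0]]
targets = [toy_model(r) for r in rows]
ranking = sensitivity_ranking(toy_model, rows, targets)
```

Columns at the end of `ranking` change the error least when flattened, so they are the natural candidates to drop. A distributed version would compute the per-column error sums in parallel over data partitions, but the ranking logic stays the same.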

Shifu’s Hadoop-based, distributed training of neural networks, logistic regression, and gradient boosted trees can reduce model training time from days to hours on terabyte-scale data sets. Shifu integrates with Pig/MapReduce workflows on Hadoop, and Shifu-trained models can be integrated into production code either in the standard PMML format or in Shifu's native format through a simple Java API. Shifu leverages Hadoop, Pig, Akka, Encog, and other open-source projects.

Documents
