
Shifu 0.2.5 Stats Step Scalability Improvement

Zhang Pengshan (David) edited this page Apr 3, 2015 · 1 revision

Stats in Shifu 0.2.4

The default stats algorithm in Shifu 0.2.4 is 'SPDT'. On large data sets, for example 100MM records with 1800 variables, the stats job fails. The reason is that the last Hadoop job does not scale out well.

Stats in Shifu 0.2.5

In Shifu 0.2.5, the new stats algorithm 'SPDTI' scales much better. For 22MM records, the running time is 50 minutes in Shifu 0.2.4 and 20 minutes in Shifu 0.2.5. A data set of 100MM records with 1800 variables was also tested with Shifu 0.2.5, and the running time was only 30 minutes.
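
To put the quoted timings in perspective, the short Python sketch below works out the speedup and the approximate throughput for each run. It is illustrative arithmetic only, not part of Shifu: the helper names `records_per_minute` and `speedup` are hypothetical, and because the variable count of the 22MM run is not stated, the throughput figures are not directly comparable across data sets.

```python
# Running times quoted on this page: (Shifu version, algorithm, records in MM, minutes).
RUNS = [
    ("0.2.4", "SPDT", 22, 50),
    ("0.2.5", "SPDTI", 22, 20),
    ("0.2.5", "SPDTI", 100, 30),
]


def records_per_minute(records_mm, minutes):
    """Throughput in millions of records per minute (hypothetical helper)."""
    return records_mm / minutes


def speedup(old_minutes, new_minutes):
    """How many times faster the new run is than the old one (hypothetical helper)."""
    return old_minutes / new_minutes


if __name__ == "__main__":
    for version, algorithm, records_mm, minutes in RUNS:
        rate = records_per_minute(records_mm, minutes)
        print(f"Shifu {version} ({algorithm}): {records_mm}MM records "
              f"in {minutes} min, about {rate:.2f}MM records/min")
    # Same 22MM data set, 0.2.4 versus 0.2.5.
    print(f"Speedup on 22MM records: {speedup(50, 20):.1f}x")
```

On the 22MM data set this gives a 2.5x speedup from 0.2.4 to 0.2.5.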
