MMLSpark v1.0.0-rc2
Highlights
New Features
Isolation Forest on Spark ⛺️
- Added LinkedIn's Isolation Forest outlier detection algorithm
- Read the original work for more info
CyberML 🧙♂️
- CyberML aims to provide open source tools for distributed cybersecurity workflows. This first release includes an algorithm that learns user-resource access patterns in order to flag anomalous access. For more information, see the docs
Cognitive Services for Big Data 🧠
- Added the `SpeechToTextSDK` transformer. This new transformer transcribes raw audio files and live audio streams into text. Transcription supports real-time audio streaming, automatic splitting into utterances, and profanity detection. It supports several languages and Custom Speech Models.
- Added the `TextSentimentV3` transformer to leverage the new Cognitive Services v3 API
- add save and load methods to AccessAnomalyModel (#905)
- stream robustness, output audio stream to file, and custom speech
- Add m3u8 streaming for `SpeechToTextSDK`
- enable mp3 file streaming in stt sdk (#822)
Conditional K-Nearest Neighbors 🏡🏡
- Added the `ConditionalKNN` estimator and model for efficient search of high-dimensional KNNs with conditional predicates.
- Added Conditional KNN demo here
- Find hidden artistic connections with the Mosaic application.
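To make "KNN with conditional predicates" concrete, here is a minimal brute-force sketch in plain Python. It is not the MMLSpark `ConditionalKNN` API and does not use a ball tree: the function name `conditional_knn`, the sample points, and the labels are all hypothetical and chosen only to illustrate the idea of restricting a nearest-neighbor query to items whose label satisfies a condition.

```python
import heapq
import math


def conditional_knn(points, query, k, allowed_labels):
    """Brute-force conditional KNN (illustrative only, not the library API).

    points: list of (vector, label) pairs.
    Returns up to k (distance, index, label) triples, nearest first,
    considering only points whose label is in allowed_labels.
    """
    candidates = []
    for index, (vector, label) in enumerate(points):
        if label not in allowed_labels:
            continue  # the conditional predicate: skip disallowed labels
        candidates.append((math.dist(vector, query), index, label))
    return heapq.nsmallest(k, candidates)


# Hypothetical data: 2-D feature vectors tagged with an artwork type.
points = [
    ((0.0, 0.0), "painting"),
    ((1.0, 0.0), "sculpture"),
    ((0.0, 2.0), "painting"),
    ((3.0, 3.0), "sculpture"),
]

# The closest point overall is a painting, but the query is restricted
# to sculptures, so only sculptures can be returned.
result = conditional_knn(points, query=(0.0, 0.0), k=2, allowed_labels={"sculpture"})
for distance, index, label in result:
    print(index, label, round(distance, 3))
```

A linear scan like this costs one distance computation per matching point per query; the release's estimator is described as an efficient search for high-dimensional data, which this sketch does not attempt to reproduce.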
HTTP on Spark 🌐
- Added integration with Python Requests, so that Requests-based code can be accelerated with HTTP on Spark!
- Optimized HTTP on Spark asynchronous performance
Vowpal Wabbit on Spark 🐇
- add barrier mode support for VW (#832)
- add support for VW readable model, invert hash and re-using a previously trained VW Spark model (#821)
- support generic numeric types for weights and labels (#817)
LightGBM on Spark 🌳
- add featuresShapCol to LightGBMClassifierModel (#863)
- Expose parameter bin_construct_sample_cnt in spark for LightGBM (#780)
- add interface function for updating learning_rate per each iteration in LightGBMDelegate (#849)
- add delegate to monitor training (#847)
- Add the option to get Feature Contributions in LightGBMBooster used by LightGBMRanker (#791)
- Add option to add tolerance to improvement in metric evolution (#786)
- added pred leaf index for LightGBMClassifier
- Adding a new param for explicitly setting slot names. (#752)
- added the top_k param for voting parallel (#762)
- Adding a feature for positive and negative bagging fraction params. (#754)
Learn More
MosAIc Finds Hidden Connections in World Art (Article, Demo, Webinar) | Watch the Spark Summit Europe Keynote on MMLSpark | Learn about AI for Good and MMLSpark on the MSR Podcast |
New Docs for the Cognitive Services for Big Data | Read our New Paper on Conditional KNN Trees | Read our New Paper on Microservices in Databases |
Bug Fixes 🐞
- Updating regular Docker Images for helm chart. (#885)
- improve error message for invalid slot names (#897)
- categorical parameter regression on dense dataset caused by missing whitespace (#909)
- fix cyberml test imports
- add "s" to failing publicwasb download
- spark.executor.cores' default value based on master when counting workers (#855)
- fix flakiness in BiLSTM notebook
- make file type case insensitive
- Add support for URI parameters and default filetypes
- remove save_resume/preserve_performance_counters options as it breaks SGD/BFGS chaining (#828)
- fix optional parsing for the CustomOutputParser (#835)
- Fix flakiness in io tests
- Improve codegen readability and added getters and setters to generated models
- move tests to a separate package and refactor common code
- added multiclass init score support (#805)
- LightGBMRanker should repartition by grouping column (#778)
- Possible multithreading issue when two scores may come in parallel they may not safely fill pointer values (#799)
- Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance (#792)
- Fix subtle bug in reverse index creation
- add cap on max allowed port in network init (#759)
- added min_data_in_leaf parameter (#760)
- Reorder ADB Status Checks to fix flakiness
- increase library install timeout (#763)
- Fix an issue with the sparkContext not being instantiated at eval time
- Fix GH release badge display
- Codegen dataframe param fixes
Build 🏭
- bump version
- Ignore existing installation when running installPipPackageTask (#895)
- update ffmpeg on build server
- make python test loop easier:
- updating lightgbm to 2.3.180 (#850)
- split cog services on spark tests
- Split e2e and publishing (#836)
- Add Caching to build pipeline
- added isolation forest test to build pipeline (#800)
- exclude scala from fat jar
Code Style 🎶
- Removing redundant file in the root directory: sp.txt (#796)
- ball tree style fixes
Documentation 📘
- Adding section to readme for installing with apache livy (#785)
- Add fix for maven resolver
- Added two classification examples using Vowpal Wabbit (#733)
Maintenance 🔧
- add Roy to CODEOWNERS
- fix flaky analyze image test
- move build to new subscription (#888)
- Update codeowners file to fix helm owners
- remove flaky lightGBM test and add retries to Cog service tests
- Update CODEOWNERS (#831)
- Add time in httpv2 tests to reduce flakiness on build VMs
- fixes to improve test flakiness
- updated lightgbm to 2.3.150 (#757)
- improve efficiency of lightgbm tests
- Add more cluster status checks
- fix flakiness in IdentifyFacesSuite
- bump heap size in build
- add default UA
Acknowledgements 🙌
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.
- Ilya Matiach @imatiach-msft
- Markus Cosowicz @eisber
- Lucy Zhang @zhang-lucy
- Roy Levin @rolevin
- Keunhyun Oh @ocworld
- James Verbus
- Christina Lee
- Anand Raman
- William T Freeman
- Lei Zhang
- Rohit Agrawal
- Nisheet Jain
- Chris Hoder
- Chris Templeman
- Chenhui Hu @chenhuims
- Ryan Hurey
- Jun Ki Min @loomlike
- Dotan Patrich
- Addy Santo
- Anil Francis Thomas
- Amrit Bhattacharya
- Moshe Israel
- Dalitso Banda
- Joan Fontanals @JoanFM
- Jack Gerrits @jackgerrits
- Akshaya Annavajhala
- Heiko Rahmel
- Felix Tran @felixtran39
- Stephanie Fu
- Parker Levy
- Casey Hillenburg
- Vick Wowo
- Brendan Walsh
- Nick Gonsalves
- Mindren Lu
- Nurudín Álvarez
- Guolin Ke
- Chris Smith @chris-smith-zocdoc
- David Lacalle Castillo @WaterKnight1998
- Fokko Driesprong @Fokko
- Diego Mazon
- Tommy Li @tommyzli
- Azure CAT
- Vowpal Wabbit Team
- Light GBM Team
- MSFT Garage Team
- MSR Outreach Team
- Speech SDK Team
Changes:
- 81e73a2 chore: add Roy to CODEOWNERS
- b12be50 build: bump version
- b431a61 fix: Updating regular Docker Images for helm chart. (#885)
- 96f0b77 fix: improve error message for invalid slot names (#897)
- 95c1f8a fix: categorical parameter regression on dense dataset caused by missing whitespace (#909)
- 040ad34 feat: add save and load methods to AccessAnomalyModel (#905)
- 8f8c504 fix: fix cyberml test imports
- 9aed004 chore: fix flaky analyze image test
- 826cfc2 fix: add "s" to failing publicwasb download
- 22e19e5 feat: CyberML (#890)
- 54a623d build: Ignore existing installation when running installPipPackageTask (#895)
- f1b4a94 chore: move build to new subscription (#888)
- f07e558 Merge pull request #882 from ocworld/fix-rename-clusterutils-numcores
- e741993 build: update ffmpeg on build server
- 9f9ae53 feat: stream robustness, output audio stream to file, and custom speech
- 0319650 build: make python test loop easier:
- 65a13bc chore: Update codeowners file to fix helm owwners
- 7409ba5 Add num tasks override parameter for LightGBM learners (#881)
- 64481e9 fix: spark.executor.cores' default value based on master when counting workers (#855)
- 4ae0fe8 reduce network communication overhead cost on reduce step for LightGBM learners (#869)
- b413749 fixed shap values shape for multiclass case and improved pyspark API (#870)
- 840781a unify APIs across LightGBM learner types and add SHAP feature importances to regressor (#864)
- 84b392c re-disable flaky test (#866)
- d86a937 build: updating lightgbm to 2.3.180 (#850)
- 6bb4a45 feat: add featuresShapCol to LightGBMClassifierModel (#863)
- 82e7a8e Bump Apache Spark to 2.4.5
- a0db5b3 build: split cog services on spark tests
- 537b611 1) add functions for before/after batch training (#852)
- ed435b8 feat: Add m3u8 streaming for `SpeechToTextSDK`
- 4d99879 feat: add interface function for updating learning_rate per each iteration in LightGBMDelegate (#849)
- be366c5 feat: add delegate to monitor training (#847)
- c695d7a add option for driver listen port
- 99795bc fic: Codegen dataframe param fixes
- 37e336e feat: add barrier mode support for VW (#832)
- 9c9a93b fix: fix flakiness in BiLSTM notebook
- 5d9410a fix: make file type case insensitive
- 55765f8 chore: remove flaky lightGBM test and add retries to Cog service tests
- b1e3797 fix: Add support for URI parameters and default filetypes
- 5ae664a improvement: support numeric types (not just double) for weight/label (#817)
- 9f15b6c feat: add support for VW readable model, invert hash and re-using a previous… (#821)
- 038b26b fix: remove save_resume/preserve_performance_counters options as it breaks SGD/BFGS chaining (#828)
- 7dd4670 build: Split e2e and publishing (#836)
- ca05d1b extended test case to validate duplicate passes parameter (#834)
- 2ff6a36 fix: fix optional parsing for the CustomOutputParser (#835)
- f9a56e8 chore: Update CODEOWNERS (#831)
- c79dd12 chore: Add time in httpv2 tests to reduce flakiness on build VMs
- c7eed5a build: Add Caching to build pipeline
- c5b8b15 fix: Fix flakiness in io tests
- 3abd9b4 chore:Split up io tests into 2 sections
- 5489271 fix:remove error prone IO from notebook tests
- b4a60e5 fix:remove error prone IO from notebook tests
- 2455cbe chore: fixes to improve test flakiness
- 6d7cfb5 fix: Improve codegen readability and added getters and setters to generated models
- 015d4ea fix: move tests to a separate package and refactor common code
- 6b2edc3 feat: enable mp3 file streaming in stt sdk (#822)
- 8005c17 feat: Add `TextSentimentV3` Transformer (#812)
- df0244c fix: added multiclass init score support (#805)
- e745784 fix: LightGBMRanker should repartition by grouping column (#778)
- f702921 feat: Add the option to get Feature Contributions in LightGBMBooster used by LightGBMRanker (#791)
- 875f89d build: added isolation forest test to build pipeline (#800)
- 290f5cf fix: Possible multithreading issue when two scores may come in parallel they may not safely fill pointer values (#799)
- fb3ac99 docs: Adding section to readme for installing with apache livy (#785)
- 7b8efa5 fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance (#792)
- 4c812d7 style: Removing redundant file in the root directory: sp.txt (#796)
- bd2f71e feat: Integration of LinkedIn's Isolation Forest (#781)
- 9c61053 feat: Add option to add tolerance to improvement in metric evolution (#786)
- dbb2818 feat: Expose parameter bin_construct_sample_cnt in spark for LightGBM (#780)
- fde2d3c fix: Fix subtle bug in reverse index creation
- 4b4af04 feat: add demo for `ConditionalKNN`
- cf48d53 chore: remove keys from demo
- 2618422 feat: Add `SpeechToTextSDK` Transformer
- 4da1ff2 style: ball tree style fixes
- 849527d feat: Add python bindings for `ConditionalBallTree`
- d4d4ca8 feat: Add KNN and ConditionalKNN Estimators
- 134ddb5 fix bug in serialization
- a00c141 fix review points
- 9cf33ce feat: added pred leaf index for LightGBMClassifier
- 461d27d feat: added pred leaf index for LightGBMClassifier
- 3a7a813 feat: added pred leaf index for LightGBMClassifier
- f3d624d feat: Adding a new param for explicitly setting slot names. (#752)
- 280cab7 Expose dump model method on MMLSpark-LightGBM so that models can be saved as json.
- 3da5d4f fix: add cap on max allowed port in network init (#759)
- 91652f2 fix: added min_data_in_leaf parameter (#760)
- 6bb0429 chore: updated lightgbm to 2.3.150 (#757)
- 344dbbd feat: added the top_k param for voting parallel (#762)
- ae63497 chore: improve efficiency of lightgbm tests
- d9568dc chore: Add more cluster status checks
- a9b05b9 chore: fix flakiness in IdentifyFacesSuite
- 988403f fix: Reorder ADB Status Checks to fix flakiness
- e1dc2b3 fix: increase library install timeout (#763)
- a47922f change labelGain description
- 43b4e63 feat: Adding a feature for positive and negative bagging fraction params. (#754)
- 087f290 docs: Add fix for maven resolver
- 3da1d14 docs: Added two classification examples using Vowpal Wabbit (#733)
- dece5ae chore: bump heap size in build
- 8bb7d86 build: exclude scala from fat jar
- 2465d4e fix: Fix an issue with the sparkContext not being instantiated at eval time
- d091b37 chore: add default UA
- 614a444 perf: remove async bottlenecks from HTTP on Spark
- 3caf8f0 feat: Add wrappers for integrating with python Requests
- 2fdfe3e added max_bin_by_feature, min_gain_to_split, max_delta_step parameters (#712)
- 95b7ef0 Fix scalastyle
- 5604602 Fix default case check. Add test cases for countCardinality
- 491c01c change getTrainingCols from Option[DataType] -> Seq[DataType]
- 25425a0 Use a case class instead of anonymous tuple
- c58b216 Support the group column being a string
- f22aa73 Fix: Fix GH release bade display

This list of changes was auto generated.