SynapseML v0.11.0
Building production-ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.11.0 (previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML frameworks and new Microsoft algorithms in a single, scalable API that’s usable across Python, R, Scala, Java, .NET, C#, and F#.
Highlights
| ChatGPT and GPT-4 at Scale | Simple Deep Learning | LightGBM v2 |
| --- | --- | --- |
| Intelligent chat and embeddings. Simplified Prompting APIs. | Train custom image and text classifiers with ease | Higher performance, >10x lower memory footprint, same API |
| View Notebook | Learn More | Try an example |

| ONNX Model Hub | Causal Learning | Vowpal Wabbit v2 |
| --- | --- | --- |
| Embed >150 state of the art deep networks into your pipelines | Discover and measure causal treatment effects | New second generation integration |
| Learn More | View Docs | Explore Samples |
New Features
General ✨
- R Support is no longer Beta! (#1586)
- Support for Spark 3.2.3
OpenAI 🤖
- Add OpenAI Prompt Template support (#1843)
- Add Azure OpenAI embedding support (#1832)
- Add Azure Active Directory authentication for OpenAI (#1829)
- Add Null-value handling for OpenAI models (#1854)
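
To illustrate what prompt templating with null handling means in practice, here is a minimal plain-Python sketch. It is not the SynapseML API: the `fill_prompt` helper, the column names, and the template text are all hypothetical, and it assumes that a row with a missing value should yield no prompt rather than an error.

```python
import string
from typing import Optional


def fill_prompt(template: str, row: dict) -> Optional[str]:
    """Fill `template` from `row`; return None if a referenced value is missing.

    Hypothetical helper for illustration only, not part of SynapseML.
    """
    # Collect the placeholder names that the template actually references.
    fields = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    # Null handling (assumed behavior): skip rows with a missing value.
    if any(row.get(name) is None for name in fields):
        return None
    return template.format(**row)


# Hypothetical rows standing in for DataFrame records.
rows = [
    {"text": "The sky is blue", "category": "colors"},
    {"text": None, "category": "colors"},
]
template = "Is the sentence '{text}' about {category}? Answer yes or no."
prompts = [fill_prompt(template, row) for row in rows]
```

Each row produces one prompt string, so the same template can be applied column-wise to an entire dataset, and rows with nulls simply pass through as nulls.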
Deep Learning 🕸
- Remove CNTK functionality and replace with ONNX (#1593)
- Add the `DeepTextClassifier`, a simple API for fine-tuning a wide array of Hugging Face 🤗 text transformers using PyTorch Lightning (#1591)
- Add the `DeepVisionClassifier`, a simple API for deep transfer learning and fine-tuning of a variety of vision backbones (#1518)
Azure Cognitive Services for Big Data 🧠
- Add `SpeakerEmotionInference` transformer to generate emotion annotation tags for emotive reading in `SpeechToText` (#1691)
- Add new AnalyzeText API (#1760)
- Support Azure Active Directory (AAD) authentication for the cognitive services (#1778, #1797)
- Move different cognitive services into sub packages (#1746)
- Add audiobook generation example (#1852)
- Add a notebook for advanced cognitive service usage (#1825)
- Upgrade MVAD to v1.1 (#1788)
- Remove MVAD's dependence on hardwired credentials and azure SDKs (#1629)
- Add word-level timing to `SpeechToTextSDK` and `ConversationTranscription` (#1801)
- Add the `descriptionExcludes` parameter to AnalyzeImage (#1590)
Causal Learning 📈
- Add the causal `DoubleMLEstimator` for learning causal treatment effects from data (#1715)
- Add a DoubleMLEstimator document and sample notebook (#1730)
- Fix DML regression bug: both treatment and outcome columns are now removed from the feature columns (#1820)
- Add TreatmentCol type checking (#1816)
- Update test to validate that the ATE value is positive (#1821)
- Fix issue with missing causal test coverage (#1799)
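
For readers new to double machine learning, the following plain-Python sketch shows the general idea behind estimating a treatment effect in the presence of a confounder. It is not the `DoubleMLEstimator` implementation: the helpers `fit_line` and `double_ml_effect`, the two-fold split, the linear nuisance models, and the synthetic data (with an assumed true effect of 2.0) are all illustrative assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def double_ml_effect(x, t, y):
    """Cross-fitted partialling-out estimate of the effect of t on y.

    Hypothetical helper: nuisance models are fit on one fold and used to
    compute residuals on the other, then residuals are regressed on each other.
    """
    n = len(x)
    folds = [list(range(0, n, 2)), list(range(1, n, 2))]
    res_t, res_y = [], []
    for k in (0, 1):
        train, test = folds[1 - k], folds[k]
        a_t, b_t = fit_line([x[i] for i in train], [t[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            res_t.append(t[i] - (a_t + b_t * x[i]))  # treatment residual
            res_y.append(y[i] - (a_y + b_y * x[i]))  # outcome residual
    return sum(rt * ry for rt, ry in zip(res_t, res_y)) / sum(rt * rt for rt in res_t)


# Synthetic data (assumption): x confounds both treatment t and outcome y.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(2000)]
t = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [2.0 * ti + 3.0 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]

effect = double_ml_effect(x, t, y)  # should land near the true effect of 2.0
naive = fit_line(t, y)[1]           # ignores x, so it is biased upward here
```

Comparing `effect` with `naive` shows why the confounder has to be partialled out: regressing the outcome on the treatment alone attributes part of the confounder's influence to the treatment.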
LightGBM 🌳
- Add LightGBM streaming execution mode for more reliable performance with orders of magnitude less memory. (#1580)
- Add maxNumClasses param to LightGBMClassifier for multi-class (#1841)
- Add the `passThroughArgs` feature, which allows users to set low-level LGBM parameters before they are wrapped in SparkML (#1749)
Vowpal Wabbit 🐇
- Vowpal Wabbit v2 (#1579):
  - Support Vowpal Wabbit input format using VowpalWabbitGeneric model
  - Support additional algorithms & label types (multi-class, cost sensitive one against all): sample notebook
  - Progressive validation (aka 1-step ahead) using VowpalWabbitGenericProgressive
  - New Contextual Bandit Offline Policy Evaluation Notebook
  - Data parallel training independent of cluster size
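
As a rough illustration of progressive validation (1-step ahead), the plain-Python sketch below scores each example with the current model before learning from it. It does not use Vowpal Wabbit or SynapseML: the `progressive_validation` function, the one-feature linear model, the learning rate, and the synthetic data are assumptions made for the example.

```python
def progressive_validation(examples, learning_rate=0.1):
    """Return per-example squared errors, each measured before the update.

    Hypothetical helper: a one-feature linear model trained with online SGD.
    """
    w, b = 0.0, 0.0
    losses = []
    for x, y in examples:
        pred = w * x + b                 # 1. predict with the current model
        err = pred - y
        losses.append(err * err)         # 2. record the loss on unseen data
        w -= learning_rate * err * x     # 3. only then learn from the example
        b -= learning_rate * err
    return losses


# Synthetic stream (assumption): y = 2x + 1, replayed for 20 passes.
one_pass = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
examples = one_pass * 20
losses = progressive_validation(examples)

early = sum(losses[:21]) / 21   # average loss over the first pass
late = sum(losses[-21:]) / 21   # average loss over the last pass
```

Because every prediction is made before the corresponding update, the running loss behaves like a held-out estimate without needing a separate validation set.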
Additional Updates
Bug Fixes 🐞
- Support grayscale images in `toNDArray` (#1592)
- Adjust learning rate in VW example notebook (#1853)
- Correct copy/paste error in acr cleanup (#1838)
- Fix synapse test config, and isolation forest notebook (#1833)
- Add spark config to fix ArrayStoreException (#1757)
- Fix breeze NoSuchMethodError (#1807)
- Fix `modelVersion` param in TextAnalytics (#1756)
- Make logging infrastructure consistent and add logging checks (#1755)
- Fix website sidebars and vulnerabilities in packages (#1753)
- Remove Vowpal Wabbit exclusion, add Interpretability exclusion (#1708)
- Update isolation forest notebook (#1696)
- Remove error on invalid columns in DropColumns (#1695)
- Fix PyArrow failure in deeplearning test (#1689)
- Fix linked service setters on cog service base class (#1685)
- Fix KernelSHAP error thrown when the key type in the ZipMap output is LongType (#1656)
- Fix flaky translate tests (#1643)
- Fix speechToTextSuite serialization Fuzzing failure (#1626)
- Fix translator endpoint and update all endpoints for gov regions (#1623)
- Fix Finder runtime issues (#1598)
- Clean up cluster if Databricks tests pass (#1599)
- Fix deep-learning test flakiness (#1600)
- Update `DotnetTestBase` assembly version (#1601)
- Fix flaky forms test (#1584)
- Fix namespace import for Experimental (#1780)
Build 🏭
- Automate cleanup of Azure Container Registry images (#1787, #1751, #1735, #1814)
- Add bot to remove stale issues (#1602)
- Return values from TaskKeys (#1775)
- Remove unnecessary SbtPlugin settings (#1771)
- Simplify E2E test pipeline with a test matrix
- Add welcome message to new PRs/Issues (#1573, #1583)
- Add workflow to label new and reopened issues (#1571)
- Add a secret scanner (#1724)
- Add workflow to open GitHub issues after a comment (#1676)
- Add workflow to remove awaiting-response issue label on comment (#1674)
- Publish test jars so downstream projects can depend on test configurations and utilities
- Make build secrets optional and cached to remove 1 min latency on sbt commands (#1726)
- Add nightly build to catch flakes early (#1774)
- Automatically delete accumulated models in build (#1758, #1729, #1759)
- Add Dependabot for updating GitHub actions (#1608)
- Update build pipeline to Ubuntu 20.04 (#1624)
Documentation 📘
- Add a hyperparameter tuning sample with HyperOpt (#1828)
- Add docs for new LightGBM `executionMode` parameter (#1779)
- Add additional ONNX docs for model hub and slicing (#1781)
- Improve OpenAI notebook (#1596)
- Add dotnet installation & examples (#1567, #1570)
- Update deep vision docs (#1752)
- Add custom search engine creation video to website (#1581)
- Replace Boston housing dataset with California housing dataset (#1856)
- Improve overview section in README
- Fix typo in old versions of Interpretability - Explanation Dashboard.md (#1846)
- Add versioned docs (#1858, #1566)
- Fix Synapse installation instructions for Spark 3.2 (#1815)
- Update required Spark and Python versions on website doc (#1812)
- Fix LaTeX rendering issue in Data Balance Analysis (#1796)
- Fix Acrolinx issues (#1792, #1793, #1808, #1794)
- Pin binder to latest released version
- Improve Python env creation instructions in developer readme (#1693)
- Remove unused docs and fix links
- Improve example notebooks
- Fix command to launch Jupyter notebook (#1649)
- Add documentation for MLFlow logging and loading (#1641)
- Update Spark version in README
- Fix .NET logo on website (#1604)
- Update v0.10.0 installation guidance (#1578)
Maintenance 🔧
- Bump Spark to 3.2.3 (#1744)
- Update Scalatest and Scalactic dependencies (#1706)
- Keep GitHub actions up to date (#1839, #1766, #1737, #1680, #1688, #1777, #1773, #1770, #1786, #1610, #1611, #1612)
- Keep website up to date (#1826, #1809, #1785, #1719, #1709, #1609, #1740)
- Update container registry and service principal connections
- Maintain Synapse tests (#1835, #1823, #1810, #1803, #1658)
- Fix codecov.yaml
- Fix conda env creation flakiness in build (#1748)
- Bump SynapseML Version (#1857, #1738, #1628)
- Improve code style (#1736, #1716, #1790)
- Pin az and python versions in build (#1705)
- Fix ado issue linker configurations (#1704)
- Add `synapse-internal` to platform detector function (#1651)
- Update OpenAI service to official deployment (#1619)
- Pin binder version (#1607)
Deprecations and Removals 🗑️
- Deprecate old TextAnalytics APIs (#1627)
- Remove old TextAnalytics APIs (#1622)
- Remove deprecated LIME APIs (#1620)
- Deprecate CNTK classes and ModelDownloader (#1712)
- Delete CNTK and related utils (#1743)
- Move ImageFeaturizer to `onnx` namespace (#1711)
Testing 💚
- Add additional E2E testing infrastructure (#1727, #1769)
- Improve ONNX test reliability (#1713)
- Stabilize flaky tests (#1576, #1842)
- Remove Synapse E2E test exclusions (#1757, #1699, #1798, #1698, #1837)
- Add automated tests for getters and setters and improve test coverage (#1631)
Contributor Spotlight
We are excited to highlight the contributions of the following SynapseML contributors:
| Scott Votaw | Serena Ruan | Haizhou (Dylan) Wang |
| --- | --- | --- |
| Scott Votaw is a Principal Engineer on the SynapseML team who has solved some of SynapseML’s toughest challenges in record time. In this release, Scott contributed the new LightGBM streaming execution mode and fully replaced our deep learning stack with the ONNX Runtime. These efforts were massive lifts, including huge changes to the LightGBM native libraries and complex dependency management jujitsu, respectively. Scott brings his love for the craft to every project he works on, so keep your eyes peeled for more amazing feats of engineering from him in future releases. | Serena is a Software Engineer II on the SynapseML team and operates on a different plane of existence from the rest of us mere mortals. Following up on prior major contributions like .NET support, form recognition, translation, and creating the SynapseML website, Serena contributed the Simple Deep Learning package for this release. This package makes it easy to train modern deep text and vision networks from Hugging Face and torchvision on Spark clusters. Serena seeks only the most difficult engineering challenges, and her contributions have laid the groundwork for many more deep-learning-based algorithms in SynapseML. | Haizhou (Dylan) is a Senior Software Engineer on the CSX Data team and a first-time contributor to the SynapseML library. Dylan contributed the new SynapseML causal learning package for the v0.11 release. This package helps users discover the effectiveness of things like medical treatments or economic policies even without controlled experiments. With his elegant contributions, Dylan has laid the foundation for more causal collaborations with the EconML library. |
| Markus Cozowicz | Brendan Walsh | Jessica Wang |
| --- | --- | --- |
| Markus is a Principal Applied Scientist who (just!) joined the SynapseML team. Despite only recently coming on board officially, Markus has long been a prolific contributor to the library and built the Vowpal Wabbit and Isolation Forest integrations. In this release, Markus contributed the second generation of the Vowpal Wabbit integration, improving its generality and applicability. He also expanded the OpenAI integration to support embeddings and simplified prompt templating. Our team is incredibly lucky to have such a consistent and thoughtful collaborator. | Brendan is a Senior Engineer on the SynapseML team who recently joined after a long tenure on the Cognitive Services team, where he developed their containerized cognitive service effort and co-authored the SynapseML publication on large-scale microservices. Brendan used this expertise to onboard Emotion Detection for text-to-speech models. He then went on to use this new emotive reading capability to create and donate thousands of audiobooks to the open source community. You can learn more about Brendan’s awesome technical philanthropy efforts at https://aka.ms/audiobook. | Jessica is a Software Engineer who recently joined the SynapseML team. Already, Jessica has grown into the role of the SynapseML benevolent “doc”tator. In this release, Jessica has worked hard to ensure that the SynapseML notebooks work across a wide variety of Spark platforms and are easy to get started with. This work requires knowledge of the entire library’s surface area, and we are thankful Jessica has worked so hard to learn this breadth of content. If you have been following notebook examples from https://aka.ms/spark, you have Jessica to thank! |
| Kyle Rush | Avrilia Floratou | Jason Wang |
| --- | --- | --- |
| Kyle is a Senior Software Engineer on the SynapseML team with a penchant for architecture and a streak of taking on big responsibility behind the scenes. Kyle has been instrumental in expanding our testing infrastructure to new platforms so that the lights stay on even as the number of contributions increases. This often requires nontrivial code and delicate cross-team collaboration, and Kyle has both the engineering might and the charismatic finesse to make sure these systems can be spun up successfully. | Avrilia is a Principal Scientist Manager in the Gray Systems Lab, a first-time SynapseML contributor, and a delightful collaborator. In this release, Avrilia contributed the first prototype of the simplified OpenAI prompting transformer. This contribution makes it easy to ask ChatGPT and other LLMs questions about large datasets and to create new LLM-derived columns in databases. You can learn more about her work through the OpenAI Docs and prompting demo. | Jason Wang is a Principal Software Engineer on the CSX Data team and has a long history of not only contributing huge features to SynapseML, but actively maintaining his contributions. In this release, Jason’s work on the ONNX model hub protocol enables quick access to over 150 pretrained deep networks from the Java and Scala ecosystems. Jason has also been instrumental in fixing the most difficult and arduous bugs, some even stemming from the core Spark runtime. Finally, we deeply appreciate Jason’s leadership in the community: he consistently encourages and helps others contribute, and his impact extends far beyond his own personal contributions. |
Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:
Eric Dettinger, Markus Weimer, Serena Ruan @serena-ruan, Scott Votaw @svotaw, Haizhou (Dylan) Wang @dylanw-oss, Puneet Pruthi @ppruthi, Markus Cozowicz @eisber, Brendan Walsh @BrendanWalsh, Jessica Wang @JessicaXYWang, Kyle Rush @k-rush, Avrilia Floratou, Jason Wang @memoryz, Mark Niehaus @niehaus59, Keerthi Yanda @KeerthiYandaOS, Ilya Matiach @imatiach-msft, Kashyap Patel @ms-kashyap, Martha Laguna @martthalch @marthalc, Sarah Shy @sarahshy, @ocworld, @adityakode, @nightscape, Alexandra Savelieva @alsavelv, Tom Finley, Jeff Zheng, James Verbus @jverbus, Chris Hoder, Misha Desai, Nellie Gustafsson, Eren Orbey, Beverly Kodhek, Louise Han @jr-MS, Raj Rikhy, Marcos Campos, Mike Estee, Brice Chung, Justyna Lucznik, Kim Manis, Mitrabhanu Mohanty, Bogdan Crivet, Anand Raman, William T. Freeman, Akshaya Annavajhala (AK), Guolin Ke, Spark.NET Team, ONNX Team, Azure Global, Vowpal Wabbit Team, LightGBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team, MLflow Team
Learn More
Changes:
- 7b23764 docs: make v0.11.0 docs (#1858)
- 795bbec docs: Add audiobook generation example (#1852)
- 8f9e970 chore: bump version to 0.11.0 (#1857)
- 09648a3 docs: replace Boston housing dataset with California housing dataset (#1856)
- 2b2419f fix: OpenAI completion & prompting null handling (#1854)
- 66f30e3 fix: adjust learning rate in VW example notebook (#1853)
- d64563f chore: Upgrading ONNX version to fix assembly bug (#1849)
- a64b6e0 chore: fix mcr connection
- d23dd46 chore: update build service principal
- 6876d53 chore: Update the build pipeline
- c1aa5a5 docs: Fix type in old version of Interpretability - Explanation Dashboard.md (#1846)
- 48f9c4c feat: OpenAI Prompt Template support (#1843)
- fbbb433 chore: fix form recognition tests (#1842)
- b90425c build: bump amannn/action-semantic-pull-request from 5.0.2 to 5.1.0 (#1839)
- 31c4ea3 fix: correct copy/paste error in acr cleanup (#1838)
- c99796f feat: add maxNumClasses param to LightGBMClassifier for multi-class (#1841)
- eb0bbe3 fix: modify synapse test config, modify isolation forest notebook for testing (#1833)
- 8af5112 docs: remove hyperOpt exclusion - mlflow on synapse (#1837)
- d66796d build: bump http-cache-semantics from 4.1.0 to 4.1.1 in /website (#1826)
- 7038e1d feat: add Azure OpenAI embedding support (#1832)
- 6c6d89b chore: turn off failing synapse tests (#1835)
- eb35581 docs: add hyperopt sample (#1828)
- 2262a9b feat: add aad auth for openAI (#1829)
- a7e20ce docs: Add a notebook for advanced cognitive service usage (#1825)
- 3eed94c test: Update test to validate ATE value should be positive with the test data (#1821)
- 8aa4ae1 chore: re-enable E2E tests for Synapse-Extension (#1823)
- 1dcb588 docs: update spark3.2 installation on Synapse (#1815)
- e36643f fix: Fix DML regression bug, should remove both treatment and outcome columns as feature columns (#1820)
- 4ef8e30 fix: Add spark config to fix ArrayStoreException in Synapse - Add back HyperparameterTuning nbs to test pipeline (#1757)
- eb40372 fix: Add TreatmentCol Type check at the very beginning (#1816)
- af0a218 docs: update required spark and python version on website doc (#1812)
- e0c5364 build: bump ua-parser-js from 0.7.31 to 0.7.33 in /website (#1809)
- e212f5d chore: add retry to commands (#1814)
- bd1e0a6 fix: breeze NoSuchMethodError (#1807)
- 13ff346 chore: Disable synapse-extension tests, add params to pipeline (#1810)
- 4c8d2e9 doc: apply diffs from website/docs to website/versioned_docs (#1808)
- 7d8d6fd replicating the unit test data (#1806)
- fb47138 docs: DoubleMLEstimator document and sample notebook (#1730)
- 8ba77e4 build: Return values from TaskKeys (#1775)
- 94ed685 chore: disable Interpretability tests (#1803)
- 54e7ac6 feat: add setting for getting word level timing information from SpeechToText (#1801)
- d8d523c feat: annual Vowpal Wabbit improvements (#1579)
- 01e31dc test: fix issue with missing causal tests (#1799)
- 9d92349 test: remove interpretability exclusion (#1798)
- d308dc4 refactor: enhancement to aad token (#1797)
- 333dedb feat: upgrade MVAD to v1.1 (#1788)
- 54a7496 chore: fix typo in chron build def
- c653ed7 docs: Clean latex - Data Balance Analysis (#1796)
- 76e7b73 chore: linx fixes for README and features (#1794)
- fc3a7a6 chore: Re-enable e2e tests, add cron schedule to build definition (#1774)
- a95fad4 docs: Add docs for LightGBM execution mode (#1779)
- 44bcbf1 improve documentation - bug, typo, correctness (#1791)
- 77da4b4 chore: acrolinx fixes for reference, mlflow and getting_started in 0.10.2 (#1793)
- 8cc6a16 docs: mollify acrolinx (#1792)
- 974e36a docs: Added more up-to-date ONNX docs (#1781)
- dc57dea feat: add aad authentication support for cognitive services (#1778)
- dd1563f build: bump json5 from 2.2.1 to 2.2.3 in /website (#1785)
- 3d8c84d chore: fix style (#1790)
- 53788bd fix: small tweaks to clean_acr (#1787)
- 851efdc build: bump actions/upload-artifact from 3.1.1 to 3.1.2 (#1786)
- 9978c3b build: bump ossf/scorecard-action from 2.1.1 to 2.1.2 (#1777)
- 421e3fe Fix: fix annamespace import for Experimental (#1780)
- 8dc4a58 build: bump ossf/scorecard-action from 2.1.0 to 2.1.1 (#1773)
- 6143514 build: Remove unnecessary SbtPlugin settings (#1771)
- de21ada chore: disable synapse-internal tests
- d0a9f20 feat: Causal DoubleMLEstimator (#8) (#1715)
- 7ab63a1 fix: Update synapse-extension test environment, enable cleanup of old arti… (#1769)
- be02bf7 build: bump ossf/scorecard-action from 2.0.6 to 2.1.0 (#1770)
- 0bf4772 build: bump actions/upload-artifact from 3.1.0 to 3.1.1 (#1766)
- bb6e37b Update codeql.yml (#1765)
- 8a30fd4 Create codacy.yml
- d9810b1 Create codeql.yml
- 630a442 Create scorecards.yml
- 7785cb5 feat: add new AnalyzeText API (#1760)
- cf14041 chore: delete old models before tests rather than after (#1759)
- 9e32a99 fix: fix failing SpeakerEmotionInferenceSuite
- 4a25954 fix: delete too many anomaly models (#1758)
- adad80d chore: remove cntk and downloader tests from build
- 046689d chore: fix codecov.yaml
- 3a3be32 feat: Add LightGBM streaming execution mode (#1580)
- b205cc4 fix: fix modelVersion param in TextAnalytics (#1756)
- b797d6c fix: make logging infrastructure consistent and add logging checks (#1755)
- 557470b fix: fix website sidebars and vulnerabilities in packages (#1753)
- 9c98609 docs: update deepvision docs on website (#1752)
- 629da63 refactor: move different cognitive services into sub packages (#1746)
- 37f2e90 chore: fix clean acr (#1751)
- c6cc0a8 fix: Add docs for passThroughArgs (#1749)
- b6ef511 docs: Pinning binder to latest released version
- 2a89e13 feat: Delete CNTKand related utils (#1743)
- 98add7a chore: fix conda env creation (#1748)
- 558f5d8 chore: bump spark to 3.2.3 (#1744)
- 70843d5 chore: bump docusaurus (#1740)
- 2d06b94 build: bump loader-utils from 2.0.3 to 2.0.4 in /website (#1719)
- aa69541 docs: removing beta tag from R
This list of changes was auto generated.