Despite massive improvements in machine learning frameworks, research, and hardware, preparing training datasets remains a largely manual process. Data scientists have to either label large numbers of files by hand or outsource the task to contract workers. This bottleneck is becoming more apparent now that open source tools have made deep learning more accessible than ever. The Snorkel project, started at Stanford in 2016, aims to solve this problem by programmatically labeling, building, and managing training data with weak supervision.
In this project, we will walk through the process of using Snorkel to build a training set for classifying text messages as spam or not spam. The goal is not only to demonstrate the basic components and concepts of Snorkel, but also to dive into the actual process of iteratively developing a real application with it.
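To make the idea of programmatic labeling concrete before we touch Snorkel, here is a minimal sketch in plain Python. It does not use Snorkel's API: the heuristics, the function names (`lf_contains_free`, `lf_contains_url`, `lf_short_message`, `weak_label`), and the example messages are all hypothetical. A simple majority vote stands in for the learned label model that Snorkel uses to combine noisy votes.

```python
from collections import Counter

# Label values used in this sketch (assumed convention, not taken from the text).
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each "labeling function" is a cheap heuristic that votes on one message
# or abstains when it has no opinion. All three are illustrative guesses.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_url(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_url, lf_short_message]

def weak_label(text):
    """Combine the heuristic votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with equal support
    return ranked[0][0]

messages = [
    "Claim your FREE prize now at http://example.com",
    "See you at lunch",
    "The quarterly report is attached for your review today",
]
labels = [weak_label(m) for m in messages]
print(labels)
```

Each heuristic is individually noisy and covers only part of the data, which is why some messages end up with no label at all. Weak supervision accepts that trade-off in exchange for labeling many examples without manual effort.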