Stars automates data science stacks.
Stars' end game is composing scalable data science stacks with predictability and reproducibility.
What follows is an overview introducing Stars' capabilities. Each section also lists, by way of example, specific components and their roles in achieving the system's overall goals.
Stars is used to deliver best in class data science solutions. It has been used in multiple federally funded research projects for several years. Our team does semantic web analytics using graph databases like Blazegraph in the bioinformatics domain. We've used Spark to parallelize generation of Gensim word2vec models over a corpus of medical literature. We share computing notebooks via Zeppelin with collaborators at federal agencies and other partner institutions. To do this work, we use:
- Analytics: Mapreduce, streaming, graph, machine learning, and SQL at scale.
- Notebooks: Connect the best in notebook computing to enable scalability and collaboration.
- Visualization: Service for generating visualizations suitable for embedding in data science notebooks.
System | Version | Role | Description |
---|---|---|---|
Zeppelin | 0.7.x | Notebooks | Collaborative notebook computing interface. |
Ligthning | 1.0.1 | Visualization | Lightweight visualization server. |
Livy | 0.4.0 | Notebooks | Allows Jupyter Lab to connect to Spark. |
Blazegraph | 2.1.4 | Analytics | RDF database and SPARQL endpoint. |
Spark | 1.2.2 | Analytics | Mapreduce, Graph, ML, Streaming engine. |
Underlying applications is a highly scalable container orchestration platform with support for long running tasks, scheduled jobs, and support of dynamic service discovery.
- Discovery: Services are discovered and routed automatically via DNS.
- Services: Long running services are managed with Marathon.
- Scheduler: Scheduled tasks are managed with Chronos.
- Orchestration: Delegate data center management to a scalable orchestrator like Mesos
System | Version | Role | Description |
---|---|---|---|
Mesos-DNS | 0.6.0 | Discovery | DNS name resolution for launched contaiers. |
Chronos | 3.0.0 | Scheduler | Scheduled hierarchical tasks for Mesos. |
Marathon | 1.4.7 | Services | Long running service manager for Mesos. |
Mesos | 1.3.x | Orchestration | Container orchestration and data center OS. |
Zookeeper | 3.4.6 | Configuration | Distributed configuration management. |
For all of this to be useful, there need to be programming tools. We have found a small set adequate for the needs of the communities we work with but will soon be adding more. Scala and R are top of the list.
- Languages/Compilers: Suites of commonly used programming tools.
System | Version | Role | Description |
---|---|---|---|
Java JDK | 1.8.0 | Languages | Required by many modern big data stacks. |
Maven | 3.3.9 | Languages | Build and artifact management for Java. |
Python | 3.6 | Languages | Among the most common data science languages. |
DevOps is the merger of software development and operations. We will make infrastructure code. We now manage the entire architecture of Stars as repeatable rules encoded in software.
- DevOps: Automate core system architecture components with Ansible
- Containers: Automate application level data science stacks with Docker and Ansible.
System | Version | Role | Description |
---|---|---|---|
Docker | 1.12.6 | Containers | Compose, share custom machine environments. |
Ansible | 2.2.0.0 | Automation | Automate infrastructure architecture. |
Stars is going to the cloud. More on this soon.
Management of services within the container orchestrator uses Marathon. The interface makes it easy to control resource allocation to applications including CPUs, memory, and number of instances. It also takes care of restarting failed instances, supports Docker containers, and several sophisticated service deployment scenarios to support micro-services and continuous deployment.
The Mesos interface shows individual tasks started by frameworks. It also lets users navigate to each tasks' sandbox, or output area. That might not sound interesting but it's vital for troubleshooting. In the configuration below, there are three Mesos frameworks active: Marathon, Chronos, and Zeppelin. The majority of the listed tasks are long running services managed from Marathon. The tasks with numeric labels only are Spark jobs started by a Zeppelin notebook interpreter.
Get the code...
pip install ansible=2.2.0.0
git clone [email protected]:stevencox/stars.git
cd stars/cluster/system
At this point you'll want to provision some master nodes and some workers.
Three masters is a good number for an HA system. Zookeeper uses a protocol that requires a concensus so there are numbers of machines that just don't make a lot of sense. One is fine for testing. Three is good for production.
The number of workers is entirely up to your usage scenario. It's a good idea to:
- Make the /var partition 50GB or greater. Docker uses this as its storage.
- Launch docker containers with the "--rm" option to remove containers.
Stars has been tested on CentOS 7.
- Vault: Create an Ansible vault to hold secrets:
- Follow instructions here to create a vault
- Be sure to name the vault password file ~/.vault_password.txt
- Variables: Set variables appropriately for each machine role in cluster/system/group_vars
- Staging: Add staging machines to cluster/system/staging
- Production: Add production machines to cluster/system/production
└── system
├── ci.yml
├── conf
│ ├── chronos
│ └── marathon
├── group_vars
│ ├── all
│ ├── ci
│ ├── masters
│ └── workers
├── host_vars
│ └── hostname1
├── masters.yml
├── production
├── README.md
├── roles
│ ├── blazegraph
│ ├── common
│ ├── docker
│ ├── java-se
│ ├── jeffhung.java-se
│ ├── jeffhung.localrepo
│ ├── jenkins
│ ├── lightning
│ ├── livy
│ ├── localrepo
│ ├── marathon
│ ├── maven
│ ├── mesos
│ ├── mesos-dns
│ ├── nginx
│ ├── python
│ ├── spark
│ ├── stars
│ ├── zeppelin
│ └── zookeeper
├── site.yml
├── staging
├── webservers.yml
├── workers.retry
└── workers.yml
You'll have a toolkit like this:
├── bin
│ ├── backup
│ ├── compose
│ ├── nuke-docker
│ ├── restart
│ ├── run
│ └── stars
- compose: Clone Ansible roles to create a playbook. This will be configurable in the future.
bin/compose
- stars: Deploy the entire system or individual components.
- In general:
cd system ../bin/stars <host-role> <environment>
- Deploy system segments:
../bin/stars workers staging ../bin/stars masters staging
- Deploy the entire system to an environment:
../bin/stars site production
- In general:
- backup: Backup apps and tasks from Marathon and Chronos. Will create conf directory in $PWD.
bin/backup
- restart: Restart services.