Python Big Data Scientific Computing Kit

This Ansible script deploys a server with a collection of Python Big Data and Scientific Computing tools and libraries, preconfigured to run on a local Spark cluster.

Included packages:

Installation

  1. Set up a server or VM with CentOS 7

  2. Ensure the FQDN is configured correctly. Spark requires the host system's hostname to be resolvable; the quickest fix is to make the hostname resolve to 127.0.0.1 by adding an entry to /etc/hosts:

    127.0.0.1 localhost.localdomain localhost pydatalab.server.local pydatalab
    
  3. Create an Ansible hosts inventory (assuming the server hostname is pydatalab.server.local):

    [master]
    pydatalab.server.local
    
  4. Run the Ansible playbook:

    ansible-playbook -i hosts playbook.yml
    
  5. Jupyter should now be reachable at pydatalab.server.local:8888 (a quick sanity check is sketched just after this list)

  6. The default login for the pydatalab user is pydatalab:pydatalab
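
Once Jupyter is reachable (step 5), a quick way to confirm the Spark setup is to run a small PySpark job from a notebook cell. The snippet below is a minimal sketch only: it assumes the deployed kernel can import pyspark and that a Spark 2.x-style SparkSession is available in local mode, which may not exactly match what the playbook configures.

    # Minimal sanity check for the local Spark setup (run from a Jupyter notebook cell).
    # Assumes pyspark is importable from this kernel; the local[*] master URL is an assumption.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                   # run Spark on the local machine, all available cores
        .appName("pydatalab-sanity-check")
        .getOrCreate()
    )

    # Sum the numbers 0..99; a working setup prints 4950.
    total = spark.sparkContext.parallelize(range(100)).sum()
    print(total)

    spark.stop()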

Integration with Hortonworks Hadoop

This Ansible script detects whether it is being installed on a Hortonworks Data Platform (HDP) host and, if so, creates a Jupyter kernel with the appropriate environment variables set and configured to use Spark on HDP.
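
As a rough illustration only (not taken from the playbook itself), the HDP-aware kernel can be inspected from a notebook by checking which Spark-related environment variables it exposes. The variable names below (SPARK_HOME, HADOOP_CONF_DIR, PYSPARK_SUBMIT_ARGS) are common for Spark-on-YARN kernels and are assumptions, not a confirmed list of what this script sets.

    # Rough check of the kernel environment on an HDP host (run from a notebook cell).
    # The variable names are typical for Spark-on-YARN kernels; treat them as assumptions.
    import os

    for name in ("SPARK_HOME", "HADOOP_CONF_DIR", "PYSPARK_SUBMIT_ARGS"):
        value = os.environ.get(name)
        print("{}: {}".format(name, value if value else "NOT SET"))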

Supported platforms

This script has been tested on:

  • CentOS 7.1
  • Red Hat Enterprise Linux 7.1