Lab - Apache Spark and RADAnalytics on Openshift

Lets take the Apache Spark big data cluster projects for a test drive on openShift - as documented in the http://radanalytics.io/ website.

The example Ophicleide Word2vec application is a cloud native application that provides users with an easy path to train and query Word2vec models from their web browser - https://ophicleide.github.io/

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words - https://en.wikipedia.org/wiki/Word2vec

The application loads text from the given URL's, creates Word2vec models from the text, and allows you to query for synonyms for any given word.

Step 1: Create an Apache Spark Cluster using Oshinko Console

When you login to openshift, you should be able to see the Spark Cluster button on the left hand side. It is a console extension that was deployed for this hackathon.

Create a project as developer user:

oc new-project oshinko --description="Oshinko Apache Spark Cluster" --display-name="Oshinko Apache Spark Cluster"

Select the Spark Cluster button, and select Deploy to create an apache spark cluster. Create as many Apache Spark workers as you want (one is enough to run the demo)

You will see deployments automatically start once the images have been pulled for the spark master and spark workers containers.

Expose the user interface for the spark master and browse to it:

oc expose svc spark-ui

You should see these running containers with more worker replicas depending on the number you selected above:

Step 3: Create the Ophicleide demo application

The demo application connects to the Apache Spark master, so take note of its URI (we can use the kubernetes service name)

-- master URL
spark://spark.oshinko.svc.cluster.local:7077

Create an application project as developer user

oc new-project ophicleide --description="Ophicleide Word2vec" --display-name="Ophicleide Word2vec"

Create image streams and application template in this namesapce

oc create -f https://ophicleide.github.io/documentation/ophicleide-setup-list.yaml

The results of the trained models are stored in a mongo-db, create one now including a ophicleide database (this uses a persistent volume)

oc new-app --template=mongodb-persistent -p DATABASE_SERVICE_NAME=mongodb -p MONGODB_USER=user -p MONGODB_PASSWORD=password -p MONGODB_DATABASE=ophicleide -p MONGODB_ADMIN_PASSWORD=password -p VOLUME_CAPACITY=1Gi -l name=mongodb

Create the application using the template, passing in the SPARK master URI, MONGO db connection URL and a WEB_ROUTE_HOSTNAME as appropriate:

oc new-app --template=ophicleide -p SPARK="spark://spark.oshinko.svc.cluster.local:7077" -p MONGO="mongodb://user:password@mongodb/ophicleide" -p WEB_ROUTE_HOSTNAME="ophicleide.apps.ose3sandbox.com"

Start the builds for the model training containers and web user interface:

oc start-build ophicleide-training
oc start-build ophicleide-web

Once the builds complete successfully, you should see the following deployments and running pods completed:

Step 4: Train some Word2vec models and query

See the Ophicleide Word2vec site for a descriptions of training models https://ophicleide.github.io/. If you select the http://ophicleide.192.168.137.3.xip.io/#!/models and click add model:

here are some URL's you can use to add in some online book text based examples (also many other books on guttenberg.org):

--  The Complete Works of Jane Austen, by Jane Austen
https://www.gutenberg.org/cache/epub/31100/pg31100.txt

-- Dracula, by Bram Stoker
http://www.gutenberg.org/cache/epub/345/pg345.txt

-- Mr. Spaceship, by Philip K. Dick
http://www.gutenberg.org/cache/epub/32522/pg32522.txt

The Status should change from running to ready once the spark job has completed. You can see the training logs in the ophicleide-training container that is part of the ophicleide pod as well as the running spark job in the apache master web-ui. Once complete, your models are now ready to query for

Create some queries using the Create Query button. Use the folowing words to see their embedded relationships and weightings. You can only query for words that were found during the training:

--  The Complete Works of Jane Austen, by Jane Austen
query: darcy
query: days

-- Dracula, by Bram Stoker
query: blood
query: castle
query: days

-- Mr. Spaceship, by Philip K. Dick
query: speed
query: proxima
query: days

You should see the results displayed:

If you want to login to the mongo-db to see which words were found for each model:

oc rsh oc rsh mongodb-1-h85jz
mongo -u admin -p password --authenticationDatabase admin
> use ophicleide
> show collections
models
queries

db.models.find(	{ "name" : "jane austin" } );
db.models.find(	{ "name" : "dracula" } );
db.models.find(	{ "name" : "mr spaceship" } );

##Summary##

Congratulations ! You have successfully:

deployed Oshinko console and Apache spark cluster onto OpenShift
demonstrated a big data application that trains models for performing Word2vec queries against online books

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xtra2_Spark-radanalytics.md

xtra2_Spark-radanalytics.md

Files

xtra2_Spark-radanalytics.md

Latest commit

History

xtra2_Spark-radanalytics.md

File metadata and controls