What is DLAB?

DLab is an essential toolset for analytics. It is a self-service web console used to create and manage exploratory environments. It allows teams to spin up analytical environments with best-of-breed open-source tools with a single click of the mouse. Once established, the environment can be managed by the analytical team itself, leveraging a simple and easy-to-use web interface.

See more at dlab.opensource.epam.com.


CONTENTS


Login

Create project

Setting up analytical environment and managing computational power

        Create notebook server

                Manage libraries

                Create image

        Stop Notebook server

        Terminate Notebook server

        Deploy Computational resource

        Stop Standalone Apache Spark cluster

        Terminate Computational resource

        Scheduler

        Collaboration space

                Manage Git credentials

                Git UI tool (ungit)

Administration

          Manage roles

          Project management

          Environment management

                Multiple Cloud endpoints

                Manage DLab quotas

DLab billing report

Web UI filters


Login

As soon as DLab is deployed by the infrastructure provisioning team and you have received the DLab URL, your username and password, open the DLab login page, fill in your credentials and hit "Login".

DLab Web Application authenticates users against:

  • OpenLdap;

  • Cloud Identity and Access Management service user validation;

  • KeyCloak integration for seamless SSO experience *;

    • NOTE: in case DLab has been installed and configured to use SSO, please click on "Login with SSO" and use your corporate credentials
| Login error messages | Reason |
| --- | --- |
| Username or password is invalid | The username provided doesn't match any LDAP user, OR there is a typo in the password field |
| Please contact AWS administrator to create corresponding IAM User | The username provided exists in LDAP, BUT doesn't match any of the IAM users in AWS |
| Please contact AWS administrator to activate your Access Key | The username provided exists in LDAP, BUT the IAM user doesn't have a single Access Key* created, OR the IAM user's Access Key is Inactive |

* Please refer to the official documentation from Amazon to figure out how to manage Access Keys for your AWS Account: http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html

To stop working with DLab, click on the "Log Out" link at the top right corner of DLab.

After login, the user sees a warning if the quota has been exceeded or is close to its limit.

Exceeded quota

Close to limit


Create project

When you log into DLab Web interface, the first thing you need to do is to create a new project.

To do this, click on the "Upload" button on the "Projects" page, select your personal public key (or click on the "Generate" button), select an endpoint and a group, enable or disable the 'Use shared image' option and hit the "Create" button. Do not forget to save your private key.

Upload or generate user key

Please note that you need a key pair (public and private key) to work with DLab. To figure out how to create a public and private key, please click on "Where can I get public key?" on the "Projects" page. The DLab built-in wiki page guides Windows, macOS and Linux users on how to generate SSH key pairs quickly.
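
If you prefer the command line, one common way to generate such a key pair is shown below (a sketch; the key file path and comment are arbitrary examples):

```
# Generate a 2048-bit RSA key pair; adjust the output path and comment to your needs
ssh-keygen -t rsa -b 2048 -f ~/.ssh/dlab_key -C "you@example.com"
# Upload the public part (~/.ssh/dlab_key.pub) on the "Projects" page and keep the private key safe
```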

Creation of the project starts after hitting the "Create" button. This process is a one-time operation for each Data Scientist and it might take up to 10 minutes for DLab to set up the initial infrastructure for you. During this process the project is in the "Creating" status.

With 'Use shared image' enabled, an image of a particular notebook type is created when the first notebook of that type is created in DLab. This image is available to all DLab users and is used for provisioning further notebooks of the same type within DLab. With 'Use shared image' disabled, an image of a particular notebook type is also created when the first notebook of that type is created, but this AMI is only available to users within the same project.

As soon as the project is created, the Data Scientist can create a notebook server on the "List of Resources" page. The message "To start working, please create new environment" appears on the "List of Resources" page:

Main page


Setting up analytical environment and managing computational power

Create notebook server

To create a new analytical environment, click on the "Create new" button on the "List of Resources" page.

The "Create analytical tool" popup shows up. The Data Scientist can choose the preferred project, endpoint and analytical tool. The architecture supports adding new analytical toolsets, so you can expect new templates to show up in upcoming releases. Currently, by means of DLab, Data Scientists can select any of the following templates:

  • Jupyter
  • Apache Zeppelin
  • RStudio
  • RStudio with TensorFlow
  • Jupyter with TensorFlow
  • Deep Learning (Jupyter + MXNet, Caffe, Caffe2, TensorFlow, CNTK, Theano, Torch and Keras)
  • JupyterLab
  • Superset (implemented on GCP)

Create notebook

After specifying the desired template, you should fill in the "Name" and "Instance shape".

Keep in mind that the "Name" field is just for visual differentiation between analytical tools on the "List of resources" dashboard.

The "Instance shape" dropdown contains a configurable list of shapes, which should be chosen depending on the type of analytical work to be performed. The following groups of instance shapes show up with the default setup configuration:

Select shape

These groups have T-shirt based shapes (configurable) that allow a Data Scientist to either save money* by leveraging less powerful shapes (for working with relatively small datasets), or boost the performance of analytics by selecting a more powerful instance shape.

* Please refer to the official documentation from Amazon to understand which instance shapes are preferable for your particular DLab setup. Also, you can use the AWS calculator to roughly estimate the cost of your environment.

* Please refer to the official documentation from GCP to understand which instance shapes are preferable for your particular DLab setup. Also, you can use the GCP calculator to roughly estimate the cost of your environment.

* Please refer to the official documentation from Microsoft Azure to understand which virtual machine shapes are preferable for your particular DLab setup. Also, you can use the Microsoft Azure calculator to roughly estimate the cost of your environment.

You can override the default configuration of local Spark. The configuration object is referenced as a JSON file. To tune the Spark configuration, tick the "Spark configurations" checkbox and insert the JSON into the text box.
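
A minimal sketch of such a JSON is shown below; it assumes the same classification-style format used for cluster configurations, and the property values are illustrative only:

```
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g",
      "spark.driver.memory": "2g"
    }
  }
]
```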

After you select the template, fill in the "Name" and specify the desired instance shape, click on the "Create" button for your analytical toolset to be created. The corresponding record shows up in your dashboard:

Dashboard

As soon as the notebook server is created, its status changes to "Running":

Running notebook

When you click on the name of your analytical tool in the dashboard, the analytical tool popup shows up:

Notebook info

In the header you see the version of the analytical tool, its status and shape.

In the body of the dialog:

  • Up time
  • Analytical tool URL
  • Git UI tool (ungit)
  • Shared bucket for all users
  • Project bucket for project members

To access the analytical tool Web UI you use direct URLs (your access is established via a reverse proxy, so you don't need to have an Edge node tunnel up and running).

Manage libraries

On every analytical tool instance you can install additional libraries by clicking on the gear icon in the "Actions" column for the needed Notebook and hitting "Manage libraries":

Notebook manage_libraries

After clicking you see the window with 3 fields:

  • Field for selecting an active resource to install libraries
  • Field for selecting the group of packages (apt/yum, Python 2, Python 3, R, Java, Others)
  • Field for searching available packages with an autocomplete function (except for Java; a Java library should be entered in the following format: "groupID:artifactID:versionID", see the example below)
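
For example, the Gson library from Maven Central would be entered as:

```
com.google.code.gson:gson:2.8.6
```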

Install libraries dialog

After choosing the resource, you need to wait for a while until the list of all available libraries is received.

Libraries list loading

Note: whether apt or yum packages are offered depends on your DLab OS family.

Note: in the "Others" group you can find other Python (2/3) packages which have no version classifiers.

Resource select_lib

After selecting a library, you can see it in the middle of the window and can delete it from this list before installation.

Resource selected_lib

After clicking on the "Install" button you see the installation process with the appropriate status.

Resources libs_status

Note: if a package can't be installed you see "Failed" in the status column and a button to retry the installation.

Create image

Out of each analytical tool instance you can create an AMI image (the notebook should be in the "Running" status), including all libraries which have been installed on it. You can use that AMI to speed up provisioning of further analytical tools if you want to re-use an existing configuration. To create an AMI, click on the gear icon in the "Actions" menu for the needed Notebook and hit "Create AMI":

Notebook create_ami

On the "Create AMI" popup you should fill in:

  • text box for an AMI name (mandatory)
  • text box for an AMI description (optional)

Create AMI

After clicking on the "Create" button the Notebook status changes to "Creating image". Once the image is created, the Notebook status changes back to "Running".

To create a new analytical environment from a custom image, click on the "Create new" button on the "List of Resources" page.

The "Create analytical tool" popup shows up. Choose the project, endpoint and template of the Notebook for which the custom image has been created:

Create notebook from AMI

Before clicking the "Create" button you should choose the image from "Select AMI" and fill in the "Name" and "Instance shape".

NOTE: This functionality is implemented for AWS and Azure.


Stop Notebook server

Once you have stopped working with an analytical tool and need to release Cloud resources to save costs, you might want to stop the notebook. You are able to start the notebook later and proceed with your analytical work.

To stop the Notebook, click on the gear icon in the "Actions" column for the needed Notebook and hit "Stop":

Notebook stopping

Hit "OK" in the confirmation popup.

NOTE: a connected Data Engine Service (if any) becomes "Terminated", while a connected Data Engine (Standalone Apache Spark cluster) becomes "Stopped".

Notebook stop confirm

After you confirm your intent to stop the notebook - the status changes to "Stopping" and later becomes "Stopped".


Terminate Notebook server

Once you have finished working with an analytical tool and don't need the cloud resources anymore, to save costs we recommend terminating the notebook. You are not able to start a notebook which has been terminated. Instead, you have to create a new Notebook if you need to proceed with your analytical activities.

NOTE: Make sure you back up your data (if it exists on the Notebook) and playbooks before termination.

To terminate the Notebook, click on the gear icon in the "Actions" column for the needed Notebook and hit "Terminate":

NOTE: If any Computational resources have been linked to your notebook server – they are automatically terminated if you terminate the notebook.

Confirm the termination of the notebook, after which the notebook status changes to "Terminating":

Notebook terminating

Once the corresponding instances become terminated in the Cloud console, the status finally changes to "Terminated":

Notebook terminated


Deploy Computational resource

After deploying the Notebook node, you can deploy a Computational resource, which is automatically linked with your Notebook server. A Computational resource is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, in the cloud to process and analyze vast amounts of data. Adding a Computational resource is not mandatory and is only needed if computational resources are required for job execution.

On the "Create Computational Resource" popup you have to choose the Computational resource version (configurable) and specify an alias for it. To set up a cluster that meets your needs, you have to define:

  • Total number of instances (min 2 and max 14, configurable);
  • Master and Slave instance shapes (the list is configurable and supports all cloud instance shapes available in your cloud region);

Also, if you want to save some costs for your Computational resource, you can create it based on spot instances (this functionality is for the AWS cloud) or preemptible instances (this functionality is for GCP), which are often available at a discounted price:

  • Select the Spot Instance checkbox;
  • Specify the preferred bid for your spot instance in % (between 20 and 90, configurable).

NOTE: When the current Spot price rises above your bid price, the Spot instance is reclaimed by the cloud so that it can be given to another customer. Please make sure to back up your data on a periodic basis.

This picture shows the menu for creating a Computational resource on AWS:

Create Computational resource on AWS

You can override the default configurations for applications by supplying a configuration object when you create a cluster (this functionality is only available for an Amazon EMR cluster). The configuration object is referenced as a JSON file. To tune the Computational resource configuration, tick the "Cluster configurations" checkbox and insert the JSON into the text box:
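
A hedged example of such a configuration object, using EMR's classification format (the properties shown are purely illustrative, not recommended values):

```
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
```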

Create Custom Computational resource on AWS

This picture shows the menu for creating a Computational resource on GCP:

Create Computational resource on GCP

To create a Data Engine Service (Dataproc) with preemptible instances, tick 'preemptible node count'. You can add from 1 to 11 preemptible instances.

This picture shows the menu for creating a Computational resource on Azure:

Create Computational resource on Azure

When you click on the "Create" button, Computational resource creation kicks off. You see the corresponding record in the DLab Web UI in the "Creating" status:

Creating Computational resource

Once Computational resources are provisioned, their status changes to "Running".

After clicking on the Computational resource name in the DLab dashboard you see the Computational resource details popup:

Computational resource info

You can also go to the Computational resource master UI via the "Spark job tracker URL", "EMR job tracker URL" or "Dataproc job tracker URL" link.

Once the Computational resource is up and running, you are able to leverage the cluster's computational power to run your analytical jobs.

To do that, open any of the analytical tools and select the proper kernel/interpreter:

Jupyter – go to Kernel and choose the preferred interpreter, either a local one or a Computational resource one. Currently we have added support for Python 2/3, Spark, Scala and R in Jupyter.

Jupyter

Zeppelin – go to the Interpreter Binding menu and switch between local and Computational resource interpreters there. Once the needed interpreter is selected, click on "Save".

Zeppelin

Insert the following "magics" before blocks of your code to start executing your analytical jobs (see the example after the list):

  • interpreter_name.%spark – for Scala and Spark;
  • interpreter_name.%pyspark – for Python2;
  • interpreter_name.%pyspark3 – for Python3;
  • interpreter_name.%sparkr – for R;
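
For example, a Zeppelin paragraph that runs Python code on the Computational resource, following the magic format above, could look like the sketch below; "my_cluster" is a hypothetical interpreter name taken from your interpreter binding list:

```
my_cluster.%pyspark
# 'sc' is the SparkContext that Zeppelin provides for the pyspark interpreter
print(sc.version)
```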

RStudio – open R.environ and comment out /opt/spark/ to switch to the Computational resource, and vice versa to switch back to the local kernel:
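
A minimal sketch of that toggle is shown below; the SPARK_HOME variable name is an assumption, only the /opt/spark/ path comes from this guide:

```
# To run jobs on the Computational resource, comment out the local Spark path (assumed to be set via SPARK_HOME):
# SPARK_HOME="/opt/spark/"

# To switch back to the local kernel, leave it uncommented:
SPARK_HOME="/opt/spark/"
```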

RStudio


Stop Standalone Apache Spark cluster

Once you have stopped working with a Standalone Apache Spark cluster (Data Engine) and need to release cloud resources to save costs, you might want to stop the Standalone Apache Spark cluster. You are able to start it again after a while and proceed with your analytics.

To stop the Standalone Apache Spark cluster, click on the stop button next to the Standalone Apache Spark cluster alias.

Hit "YES" in the confirmation popup.

Spark stop confirm

After you confirm your intent to stop the Standalone Apache Spark cluster, the status changes to "Stopping" and soon becomes "Stopped".


Terminate Computational resource

To release computational resources, click on the cross button next to the Computational resource alias. Confirm decommissioning of the Computational resource by hitting "Yes":

Computational resource terminate confirm

In a while the Computational resource becomes "Terminated". The corresponding cloud instances are also removed in the cloud.


Scheduler

The Scheduler component allows you to automatically schedule Start and Stop triggers for a Notebook/Computational resource, while for a Data Engine or Data Engine Service it can only trigger the Stop or Terminate action correspondingly. There are 2 types of scheduler:

  • Scheduler by time;
  • Scheduler by inactivity.

Scheduler by time is for Notebook/Data Engine Start/Stop and for Data Engine/Data Engine Service termination. Scheduler by inactivity is for Notebook/Data Engine stopping.

To create a scheduler for a Notebook, click on the gear icon in the "Actions" column for the needed Notebook and hit "Scheduler":

Notebook scheduler action

A popup with the following fields shows up:

  • start/finish dates - the date range when the scheduler is active;
  • start/end time - the time when the notebook should be running;
  • timezone - your time zone;
  • repeat on - the days when the scheduler should be active;
  • possibility to synchronize the notebook scheduler with computational schedulers;
  • possibility not to stop the notebook in case of a running job on the Standalone Apache Spark cluster.

Notebook scheduler

If you want to stop the Notebook when it exceeds its idle time, you should enable "Scheduler by inactivity", fill in your inactivity period (in minutes) and click on the "Save" button. The Notebook is stopped upon exceeding the idle time value.

Scheduler by inactivity

A scheduler can also be configured for a Standalone Apache Spark cluster. To configure a scheduler for a Standalone Apache Spark cluster, click on the scheduler icon:

Computational scheduler create

There is a possibility to inherit the scheduler start settings from the notebook, if such a scheduler is present:

Computational scheduler

The Notebook/Standalone Apache Spark cluster is started/stopped automatically according to the scheduler settings. Please also note that if a notebook is configured to be stopped, all running data engines associated with it are stopped (for a Standalone Apache Spark cluster) or terminated (for a Data Engine Service) together with the notebook.

After login, the user is notified that the corresponding resources are about to be stopped/terminated in some time.

Scheduler reminder


Collaboration space

Manage Git credentials

To work with Git (pull, push) via the UI tool (ungit) you can add multiple credentials in the DLab UI, which are then set on all running instances with analytical tools.

When you click on the "Git credentials" button, the following popup shows up:

Git_creds_window

In this window you need to add:

  • Your Git server hostname, without http or https, for example: gitlab.com, github.com, bitbucket.com, or your internal Git server.
  • Your Username and Email - used to display the author of a commit in Git.
  • Your Login and Password - for authorization on the Git server.

Once all fields are filled in and you click on the "Assign" button, you see the list of all your Git credentials.

After clicking on the "Apply changes" button, your credentials are sent to all running instances with analytical tools. It takes a few seconds for the changes to be applied.

Git_creds_window1

On this tab you can also edit your credentials (click on the pen icon) or delete them (click on the bin icon).

Git UI tool (ungit)

On every analytical tool instance you can see the Git UI tool (ungit):

Git_ui_link

Before you start working with Git repositories, you need to change the working directory at the top of the window to /home/dlab-user/ (or /opt/zeppelin/notebook for the Zeppelin analytical tool) and press Enter.

Note: Zeppelin already uses Git for local versioning of files; you can add an upstream for all notebooks.

After changing the working directory you can create a repository or, even better, clone an existing one:

Git_ui_ungit

After creating the repository you can see all commits and branches:

Git_ui_ungit_work

At the top of the window, in the red field, the UI shows changed or new files to commit. You can uncheck files or add some of them to .gitignore.

Note: Git always checks your credentials. If this is your first commit after adding/changing credentials and nothing happens after clicking on the "Commit" button, just click on the "Commit" button again.

On the right pane of the window you can also see buttons to fetch the latest changes of the repository, add upstreams and switch between branches.

To see all modified files, click on the "Circle" button in the center:

Git_ui_ungit_changes

After a commit you see your local version and the remote repository. To push your changes, click on your current branch and press the "Push" button.

Git_ui_ungit_push

By clicking on the "Circle" button you can also uncommit or revert changes.


Administration

Manage roles

The administrator can choose which instance shape(s), notebook(s) and computational resources certain group(s) or user(s) are allowed to create. The administrator can also assign an administrator per project, who is able to manage roles within that particular project. To do it, click on the "Add group" button. The "Add group" popup shows up:

Manage roles

Roles consist of:

  • Administration - allows executing administrative operations for the whole DLab or administrative operations per project only;
  • Billing - allows viewing billing either for the user's own resources only or for all users;
  • Compute - the list of Compute types which are allowed for creation;
  • Compute shapes - the list of Compute shapes which are allowed for creation;
  • Notebook - the list of Notebook templates which are allowed for creation;
  • Notebook shapes - the list of Notebook shapes which are allowed for creation.

Roles

To add a group, enter the group name, choose the actions which should be allowed for the group, optionally add discrete user(s), and then click the "Create" button. After adding the group it appears on the "Manage roles" popup.

The administrator can remove a group or user. To do that, click on the bin icon for a certain group or on the delete icon for a particular user. After that hit "Yes" in the confirmation popup.

Delete group

Project management

After project creation (this step is described in Create project) the administrator is able to manage the project by clicking on the gear icon in the "Actions" column for the needed project.

Project view

The following menu shows up:

Project menu

The administrator can edit an already existing project:

  • Add or remove group;
  • Add new endpoint;
  • Switch off/on 'Use shared image' option.

To edit the project, hit "Edit project" and choose the option which you want to add, remove or change. To apply the changes, click on the "Update" button.

To stop the Edge node, hit "Stop edge node" and then confirm with "OK" in the confirmation popup. All related instances change their status from "Running" to "Stopping" and soon become "Stopped". You are able to start the Edge node again after a while and proceed with your work. Do not forget to start your notebook again if you want to continue with your analytics, because starting the Edge node does not start the related instances.

To terminate the Edge node, hit "Terminate edge node" and then confirm with "OK" in the confirmation popup. All related instances change their status to "Terminating" and soon become "Terminated".

Environment management

The DLab Environment Management page is an administration page allowing the administrator to see the list of all users' environments and to stop/terminate all of them.

To access the Environment management page, navigate to it via the main menu:

Environment management

To stop or terminate the Notebook, click on the gear icon in the "Actions" column for the needed Notebook and hit the "Stop" or "Terminate" action:

Manage environment actions

NOTE: a connected Data Engine Service is terminated and a related Data Engine is stopped during Notebook stopping. During Notebook termination the related Computational resources are automatically terminated.

To stop or release a specific cluster, click the appropriate button next to the cluster alias.

Manage resource action

Confirm stopping/decommissioning of the Computational resource by hitting "Yes":

Manage environment action confirm

NOTE: Terminate action is available only for notebooks and computational resources, not for Edge Nodes.

Multiple Cloud Endpoints

The administrator can connect to any of the Cloud endpoints: AWS, GCP, Azure. To do that, the administrator should click on the "Endpoints" button. The "Connect endpoint" popup shows up:

Connect endpoint

Once all fields are filled in and you click on the "Connect" button, you are able to see the list of all your added endpoints on the "Endpoint list" tab:

Endpoint list

The administrator can deactivate the whole analytical environment via the bin icon. All related instances then change their statuses to "Terminating" and soon become "Terminated".

Manage DLab quotas

The administrator can set quotas per project and for the whole DLab. To do it, click on the "Manage DLab quotas" button. The "Manage DLab quotas" popup shows up. The administrator can see all active projects:

Manage environment

After filling in the fields and clicking on the "Apply" button, the new quotas are applied to the project and DLab. If the project or DLab quotas are exceeded, a warning shows up during login.

Exceeded quota

In such a case the user cannot create a new instance, and an already "Running" instance changes its status to "Stopping" (except for a Data Engine Service, whose status changes to "Terminating") and soon becomes "Stopped" or "Terminated" accordingly.


DLab Billing report

On this page you can see all billing information, including all costs associated with the service base name of the SSN.

Billing page

In the header you can see 2 fields:

  • Service base name of your environment
  • Date period of available billing report

In the center of the header you can choose the report period in the datepicker:

Billing datepicker

You can save the billing report in CSV format by hitting the "Export" button.

You can also filter data by environment name, user, project, resource type, instance size and product. On top of that, you can sort data by user, project or service charges.

In the footer of the billing report, you can see the "Total" cost for all environments.


Web UI filters

You can leverage the functionality of the built-in UI filter to quickly manage which analytical tools and computational resources you want to see in your dashboard.

To do this, simply click on the filter icon in the dashboard header and filter your list by any of:

  • environment name (input field);
  • status (multiple choice);
  • shape (multiple choice);
  • computational resources (multiple choice);

Main page filter

Once your list is filtered by any of the columns, the filter icon changes for the filtered columns only.

There is also a quick and easy way to filter out all inactive instances (Failed and Terminated) by clicking on the "Show active" button in the ribbon. To switch back to the list of all resources, click on "Show all".