
Introduction to Data Science Workshop Series

Teaching materials for KAUST Visualization Core Lab (KVL) Introduction to Data Science Workshop Series.

Course Curricula

Identifying Core Competencies for Data Science

According to a recent O’Reilly Data Science Survey, most data scientists use multiple programming languages daily, and the three most commonly used are SQL, Python, and Bash. The ability to share and reproduce data science workflows is critical, whether those workflows provide decision support in industrial applications or generate novel insights from scientific data. Core tools for facilitating reproducible data science workflows are version control tools such as Git, virtual environment tools such as Conda, and container technologies such as Docker.

Building Data Science Capacity at KAUST

KVL has organized a series of Introduction to Data Science workshops to build capacity in the core data science tools and enable future data science applications at KAUST.

  • Introduction to Shell for (Data) Scientists
  • Introduction to Conda for (Data) Scientists
  • Introduction to Python for Data Science
  • Introduction to Version Control using Git for (Data) Scientists
  • Introduction to SQL for Data Science

The core workshop material largely follows curricula developed by Software Carpentry and Data Carpentry, two global nonprofit organizations that teach foundational coding and data science skills to researchers worldwide. The curriculum will be offered in its entirety every Fall and Spring semester in order to give KAUST students, post-docs, staff, and researchers an opportunity to develop their skills in these core data science tools.

KAUST Core Labs will offer a Certificate of Completion to those learners who complete the core Introduction to Data Science curriculum.

Helping to Advance the State-of-the-Art in Data Science at KAUST

In addition to building capacity in core data science tools, KVL and KAUST Supercomputing Core Laboratory (KSL) are planning to offer additional advanced training courses in tools used in state-of-the-art data science applications with a particular focus on enabling data science with GPUs.

Using Conda

Creating the Conda environment

After adding any necessary dependencies to the environment.yml file, you can create the environment in a sub-directory of your project directory by running the following command.

$ conda env create --prefix ./env --file environment.yml
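
For reference, a minimal environment.yml might look like the following. The channels and packages listed here are purely illustrative; use whatever dependencies your project actually requires.

name: null

channels:
  - conda-forge
  - defaults

dependencies:
  - python=3.8     # pin the interpreter version (illustrative)
  - jupyterlab     # interactive development environment
  - pandas         # tabular data analysis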

Once the new environment has been created, you can activate it with the following command.

$ conda activate ./env
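
When you are finished working in the environment, you can deactivate it by running the following command.

$ conda deactivate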

Note that the env directory is not under version control as it can always be re-created from the environment.yml file as necessary.
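
For example, assuming your project is a Git repository, adding the following line to the project's .gitignore file will keep the environment out of version control.

env/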

Building JupyterLab extensions (optional)

If you wish to use any JupyterLab extensions included in the environment.yml file, then you need to activate the environment and rebuild the JupyterLab application by sourcing the postBuild script with the following commands.

$ conda activate ./env # optional if the environment is already active
(/path/to/env) $ . postBuild
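
A postBuild script for this purpose might contain commands like the following; the extension name here is hypothetical and should be replaced with whichever extensions your environment.yml actually includes.

# postBuild: install lab extensions without rebuilding, then rebuild JupyterLab once at the end
jupyter labextension install @jupyter-widgets/jupyterlab-manager --no-build  # hypothetical extension
jupyter lab build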

Updating the Conda environment

If you add or remove dependencies in the environment.yml file after the environment has already been created, then you can update the environment with the following command (the --prune option removes any dependencies that are no longer listed in the file).

$ conda env update --prefix ./env --file environment.yml --prune

Listing the full contents of the Conda environment

You can list the full contents of the Conda environment by running the following command.

$ conda list --prefix ./env

Using Docker

In order to build Docker images for your project and run containers, you will need to install Docker and Docker Compose.

Detailed instructions for using Docker to build an image and launch containers can be found in docker/README.md.
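
As a rough sketch only (the image tag and build context below are placeholders; docker/README.md has the project's actual instructions), building an image and launching a container might look like the following.

$ docker build --tag my-project:latest ./docker   # hypothetical tag and build context
$ docker run --rm --interactive --tty my-project:latest   # start an interactive container that is removed on exit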