GitHub - elaine84/ds-edu-tour: A guided tour of Jupyter in UC Berkeley's data science education program

Jupyter in UC Berkeley's Data Science Education Program

This document provides a guided tour of information distributed across many web pages and GitHub repositories involved in running a live suite of data science courses at the University of California, Berkeley.
These valuable resources -- designed for and by UC Berkeley students, instructors, and other teaching and support staff -- are publicly available.
This content should be of broad interest to diverse folks thinking about data science education, using Jupyter notebooks in the classroom, and/or deploying and scaling JupyterHub.

Keywords: data science, UC Berkeley, undergraduate education, Jupyter notebooks, JupyterHub deployment

Who is this for?

Designed for anyone who wants to learn about UC Berkeley's data science program but doesn't have an account on data8.berkeley.edu
This snapshot is for folks at JupyterDays Boston 2016 (March 17-18, 2016 in Cambridge, MA) -- hopefully it provides plenty of material to explore, fork, and hack on

What will you find here?

An overview of UC Berkeley's new data science education program
Pointers to current course materials distributed as Jupyter notebooks
An overview of the live JupyterHub-based infrastructure

Some dependencies

All course content and software is viewable online
You'll need a GitHub account if you want to clone or fork this content
Course content is distributed as Jupyter notebooks that have several Python dependencies
- Python 3 and Jupyter
- The datascience Python package
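
A quick way to check whether a local environment has these dependencies is sketched below. It assumes the packages are importable under the module names `jupyter_core` and `datascience`; the helper names are illustrative and not part of the course materials.

```python
import importlib.util
import sys

# Module names assumed to correspond to the dependencies listed above.
REQUIRED_MODULES = ["jupyter_core", "datascience"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """The notebooks target Python 3, so only the major version is checked."""
    return version_info[0] >= 3
```

Calling `missing_modules(REQUIRED_MODULES)` returns an empty list when everything is installed; any names it returns are typically installable with `pip`.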

DATA 8: Foundations of Data Science

Course overview

Teaches computational and inferential (statistical) thinking through interaction with real data
Pilot run in Fall 2015 with ~80 students
Current Spring 2016 enrollment at ~470 students
Three 50-minute lectures plus one 2-hour computer lab session per week

Broader context: The Data Science Education Program

Program website: databears.berkeley.edu
DATA 8 is new, fast-moving, and growing, and the intention is to keep growing (up to 3000 students/semester)
Complemented by a suite of connector courses teaching diverse subjects through the lens of data science
Meant as a foundation for advanced courses to be seeded across the university
See the report on Data Sciences @ Berkeley: The Undergraduate Experience

Course design requirements

Must be accessible to all incoming first-years at UC Berkeley
Assume no computer science background and only high school algebra
Get students immediately interacting with data programmatically
Can't require students to figure out a local installation -- too high a barrier
Provide a platform (technical and intellectual) that students can build on throughout their college careers

Implementation highlights

Jupyter notebooks + JupyterHub support a solution satisfying all design requirements
Why Jupyter notebooks?
- Provide a natural environment for introducing data science skills to students
- Let students develop an explicit computational narrative with data
- Interactive substrate for the online course textbook and computer lab assignments
Why JupyterHub?
- Multi-user server for Jupyter notebooks can support many users (students, instructors, teaching staff)
- Enables browser-based interface to computation in the cloud -- students only need a browser to start programming, interacting with data, and creating a visible record of their analytical steps

Course website: data8.org

Syllabus + links to lecture videos
- An overview of data science
- Using Python to manipulate information in table data structures
- Interpreting and exploring data through visualizations
- Sampling: Understanding the behavior of random selection
- Making predictions from data
- Inference: Reasoning about populations by computing over samples
- Models: Making assumptions and exploring their consequences
data8.org is primarily a student-facing website and its links to computer lab assignments will not work for anyone who doesn't have a course account
We'll show you everything that goes into making these links work for students and how to find the underlying source materials hosted on GitHub across various repositories of the data-8 organization

Online textbook: www.inferentialthinking.com

Alternatively: view the textbook on GitBook

Most sections of the online textbook begin with a big blue Interact button (example section)

When a student clicks the Interact button, they are redirected to a Jupyter notebook containing an interactive version of the textbook content!

What's going on?

First, we'll explain where the source material is
Second, we'll explain the Interact button

The textbook is hosted in a GitHub repo

git clone https://github.com/data-8/textbook.git

Most of the underlying source material for the textbook is written in Jupyter notebooks (example notebook)
GitBook allows us to write and organize chapters using Markdown
Conveniently, the Markdown syntax allows arbitrary HTML inline
We can convert a notebook into an HTML snippet using nbconvert (example HTML)
Then we include that HTML snippet in the Markdown file (example Markdown)

Sampling
========

{% include "../notebooks-html/Sampling.html" %}
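
The two steps above (notebook to HTML snippet, then a Markdown file that includes it) could be scripted roughly as follows. This is a sketch, not the textbook repository's actual build script: the helper names are hypothetical, the directory layout is inferred from the example paths shown here, and the `basic` nbconvert template is an assumption about how to get an HTML snippet without a full page wrapper.

```python
from pathlib import PurePosixPath


def nbconvert_command(notebook, html_dir="notebooks-html"):
    """Build the nbconvert command line that renders `notebook` to HTML."""
    return [
        "jupyter", "nbconvert",
        "--to", "html",
        "--template", "basic",      # assumed: body-only HTML, no page chrome
        "--output-dir", html_dir,
        notebook,
    ]


def chapter_markdown(title, notebook, html_dir="../notebooks-html"):
    """Return a Markdown chapter that includes the rendered notebook HTML."""
    stem = PurePosixPath(notebook).stem     # "Sampling" for "Sampling.ipynb"
    underline = "=" * len(title)            # setext-style heading underline
    return f'{title}\n{underline}\n\n{{% include "{html_dir}/{stem}.html" %}}\n'
```

The command list can be passed to `subprocess.run(cmd, check=True)` on a machine with Jupyter installed, and `chapter_markdown("Sampling", "notebooks/Sampling.ipynb")` reproduces the Markdown example shown above.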

The Interact button distributes content to students

An Interact button in the textbook (example section) is a link like this:

http://data8.berkeley.edu/hub/interact?repo=textbook&path=notebooks/top_movies.csv&path=notebooks/Sampling.ipynb

DS8-Interact is a side server for the DATA 8 JupyterHub deployment that copies remote notebooks into user accounts

git clone https://github.com/data-8/DS8-Interact.git
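
The structure of an Interact link can be illustrated with Python's standard library. The sketch below only assembles and decodes the query string seen in the example link; the function names are hypothetical, and how DS8-Interact itself handles the repeated `path` parameter is not covered here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

INTERACT_ENDPOINT = "http://data8.berkeley.edu/hub/interact"


def build_interact_url(repo, paths, endpoint=INTERACT_ENDPOINT):
    """Assemble an Interact link with one `path` parameter per file."""
    params = [("repo", repo)] + [("path", path) for path in paths]
    # safe="/" keeps slashes in file paths readable instead of encoding them
    return endpoint + "?" + urlencode(params, safe="/")


def parse_interact_url(url):
    """Return (repo, [paths]) extracted from an Interact link."""
    query = parse_qs(urlsplit(url).query)
    return query["repo"][0], query["path"]
```

For instance, `build_interact_url("textbook", ["notebooks/top_movies.csv", "notebooks/Sampling.ipynb"])` yields the example link shown above.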

Computer lab assignments are Jupyter notebooks

data8assets GitHub repo

git clone https://github.com/data-8/data8assets.git

JupyterHub deployment

git clone https://github.com/data-8/jupyterhub-deploy.git

We were about to level out a new basement to give students computers with Jupyter installed...
...until we discovered that Jessica Hamrick had deployed JupyterHub to the cloud. We thought we could do it too.
We based our deployment on Jess' jupyterhub-compmodels-deploy.
See Jessica's blog post at Rackspace on Deploying JupyterHub for Education and also her README at jupyterhub-compmodels-deploy for the design details.

Technical specs

For our Fall 2015 pilot of ~80 students, we deployed JupyterHub on bare-metal machines from UC Berkeley's CS department.
We gave each student 2GB of RAM. We expected about 60% of the class to be online at any given time, so we provisioned two machines with 64 cores and 26GB RAM each.
For the Spring 2016 class of ~480 students, we used a donation from Microsoft Azure and deployed there, using 36 machines with 8 cores and 14GB RAM each.
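
The sizing reasoning can be written out as simple arithmetic. The sketch below only restates figures quoted in this section; the function names are hypothetical, and it makes no claim about how memory was actually partitioned among students on the real machines.

```python
def peak_memory_gb(enrolled, percent_online, gb_per_student):
    """Estimate peak concurrent students and the RAM they would need."""
    # Ceiling division, so a fractional student still counts as one user.
    concurrent = -(-enrolled * percent_online // 100)
    return concurrent, concurrent * gb_per_student


def cluster_totals(machines, cores_each, ram_gb_each):
    """Return the aggregate (cores, RAM in GB) across identical machines."""
    return machines * cores_each, machines * ram_gb_each
```

With the pilot figures, `peak_memory_gb(80, 60, 2)` gives 48 concurrent students needing 96GB; `cluster_totals(36, 8, 14)` gives 288 cores and 504GB for the Spring 2016 Azure deployment.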

Connector courses

A suite of connector courses is taught in departments across campus, introducing diverse subjects through the lens of data science
Spring 2016 has 11 connector courses: in ethics, cognitive science, geospatial data, probability & statistics, ecology, history, matrices & graphs, computational structures, health & human behavior, smart cities, literature
Nearly all use Jupyter notebooks and the DATA 8 JupyterHub deployment
Many connector instructors are new to Python and GitHub

Technical challenges and possible future directions

Scaling up to more students

Theoretically, scaling up to more students means we can just add more nodes to the JupyterHub deployment to get more computing power. However...
...we're now discovering bugs that only surface at scale.
We were haunted for weeks by a race condition in JupyterHub that resulted in large numbers of 503 errors.
We've had to make a forum thread for these issues. Students still run into them every day.
Now we have a team of students adding tooling to make deployment more stable. This includes a development deployment, logging and monitoring, and load testing.
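
As an illustration of what a basic load test might look like, the sketch below fires concurrent requests at a URL and tallies the status codes, which makes a spike of 503 responses easy to spot. It is not the team's actual tooling; the function names and default parameters are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # 4xx/5xx responses, including 503
    except OSError:
        return None              # connection refused, timeout, DNS failure


def load_test(url, total_requests=100, workers=20, fetch=fetch_status):
    """Issue `total_requests` GETs using `workers` threads; tally the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(fetch, [url] * total_requests))
```

The `fetch` parameter lets the tally logic be exercised with a stub before pointing it at a real hub. A test like this should only be run against a deployment you operate, such as a development deployment.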

Scaling out to more classes

Scaling out is another ordeal. Our goal is to have one JupyterHub deployment that can serve all the classes at UC Berkeley. Adding JupyterHub for a class should be as easy as creating a class page.
At the moment, JupyterHub consolidates all users into one system. We need to split the users into multiple classes.
Problems that need solving:
- Classes have different resources. Some might have AWS credits, others might have Azure credits, etc.
- Students should be able to access different hubs for different classes.
- Instructors need a way to distribute content to students.
- Ideally, instructors can also grade assignments easily.
Proposals to solve these problems:
- A JupyterHub Hub, which lists and manages deployments of JupyterHub
- A Dropbox-like interface to GitHub to help instructors with content management
  - See the design doc for an experiment called jupyter-synchronized-folders
  - The design doc is structured as (but is not) a Jupyter Enhancement Proposal
  - We'd love to hear comments via this pull request
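
To make the first proposal more concrete, here is a minimal sketch of the kind of registry a "JupyterHub Hub" might keep: one entry per class deployment, with a lookup of the hubs a given student should see. Everything in it (class names, fields, URLs, provider labels) is hypothetical and is not taken from any existing design.

```python
from dataclasses import dataclass, field


@dataclass
class HubDeployment:
    course: str     # course identifier, e.g. "data8"
    url: str        # where that course's hub is served
    provider: str   # where its resources come from, e.g. "azure" or "aws"


@dataclass
class HubRegistry:
    deployments: dict = field(default_factory=dict)

    def register(self, deployment):
        """Add or replace the deployment for a course."""
        self.deployments[deployment.course] = deployment

    def hubs_for(self, enrolled_courses):
        """Return the deployments for the courses a student is enrolled in."""
        return [
            self.deployments[course]
            for course in enrolled_courses
            if course in self.deployments
        ]
```

A student enrolled in two registered courses would get back two entries, each pointing at a separate hub that may run on a different cloud provider.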
