A guided tour of Jupyter in UC Berkeley's data science education program

elaine84/ds-edu-tour

Jupyter in UC Berkeley's Data Science Education Program

  • This document provides a guided tour of information distributed across many web pages and GitHub repositories involved in running a live suite of data science courses at the University of California, Berkeley.
  • These valuable resources -- designed for and by UC Berkeley students, instructors, and other teaching and support staff -- are publicly available.
  • This content should be of broad interest to diverse folks thinking about data science education, using Jupyter notebooks in the classroom, and/or deploying and scaling JupyterHub.

Keywords: data science, UC Berkeley, undergraduate education, Jupyter notebooks, JupyterHub deployment

Who is this for?

  • Designed for anyone who wants to learn about UC Berkeley's data science program but doesn't have an account on data8.berkeley.edu
  • This snapshot is for folks at JupyterDays Boston 2016 (March 17-18, 2016 in Cambridge, MA) and hopefully provides plenty of material to explore, fork, and hack on

What will you find here?

  • An overview of UC Berkeley's new data science education program
  • Pointers to current course materials distributed as Jupyter notebooks
  • An overview of the live JupyterHub-based infrastructure

Some dependencies

  • All course content and software is viewable online
  • You'll need a GitHub account if you want to clone or fork this content
  • Course content is distributed as Jupyter notebooks that have several Python dependencies

DATA 8: Foundations of Data Science

Course overview

  • Teaches computational and inferential (statistical) thinking through interaction with real data
  • Pilot run in Fall 2015 with ~80 students
  • Current Spring 2016 enrollment at ~470 students
  • Three 50-minute lectures + one 2-hour computer lab session per week

Broader context: The Data Science Education Program

  • Program website: databears.berkeley.edu
  • DATA 8 is new, fast-moving, and growing, with the intention to keep growing (up to 3000 students/semester)
  • Complemented by a suite of connector courses teaching diverse subjects through the lens of data science
  • Meant as a foundation for advanced courses to be seeded across the university
  • See the report on Data Sciences @ Berkeley: The Undergraduate Experience

Course design requirements

  • Must be accessible to all incoming first-years at UC Berkeley
  • Assume no computer science background and only high school algebra
  • Get students immediately interacting with data programmatically
  • Can't require students to figure out a local installation -- too huge a barrier
  • Provide a platform (technical and intellectual) that students can build on throughout their college careers

Implementation highlights

  • Jupyter notebooks + JupyterHub support a solution satisfying all design requirements

  • Why Jupyter notebooks?

    • Provide a natural environment for introducing data science skills to students
    • Let students develop an explicit computational narrative with data
    • Interactive substrate for the online course textbook and computer lab assignments
  • Why JupyterHub?

    • Multi-user server for Jupyter notebooks can support many users (students, instructors, teaching staff)
    • Enables browser-based interface to computation in the cloud -- students only need a browser to start programming, interacting with data, and creating a visible record of their analytical steps

Course website: data8.org

[Image: data8-sp16]

  • Syllabus + links to lecture videos

    • An overview of data science
    • Using Python to manipulate information in table data structures
    • Interpreting and exploring data through visualizations
    • Sampling: Understanding the behavior of random selection
    • Making predictions from data
    • Inference: Reasoning about populations by computing over samples
    • Models: Making assumptions and exploring their consequences
  • data8.org is primarily a student-facing website and its links to computer lab assignments will not work for anyone who doesn't have a course account

  • We'll show you everything that goes into making these links work for students and how to find the underlying source materials hosted on GitHub across various repositories of the data-8 organization

textbook

Alternatively: view the textbook on GitBooks

Most sections of the online textbook begin with a big blue Interact button (example section)

[Image: textbook-interact]

When a student clicks the Interact button, they are redirected to a Jupyter notebook containing an interactive version of the textbook content!

[Image: textbook-interact-jupyterhub]

What's going on?

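  • First, we'll explain where the source material is
  • Second, we'll explain the Interact button

The textbook source lives in the data-8/textbook repository:

git clone https://github.com/data-8/textbook.git

Each section of the textbook is a short file that includes the HTML rendering of a Jupyter notebook, for example:

Sampling
========

{% include "../notebooks-html/Sampling.html" %}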
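As a rough illustration of the section file shown above, the sketch below generates the same title, underline, and include line for a given notebook. The helper name `section_stub` is hypothetical, and the textbook repository may well produce these files differently or by hand.

```python
def section_stub(title: str, notebook_name: str) -> str:
    """Return the text of a section file that includes a notebook's HTML.

    Hypothetical helper: it only mirrors the layout of the Sampling example.
    """
    underline = "=" * len(title)  # underline matches the title length
    include = '{% include "../notebooks-html/' + notebook_name + '.html" %}'
    return f"{title}\n{underline}\n\n{include}\n"


print(section_stub("Sampling", "Sampling"))
```

Calling it with another title and notebook name would produce the matching stub for that section.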

The Interact button distributes content to students

An Interact button in the textbook (example section) is a link like this:

http://data8.berkeley.edu/hub/interact?repo=textbook&path=notebooks/top_movies.csv&path=notebooks/Sampling.ipynb
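To make the structure of that link concrete, here is a minimal sketch that takes it apart and rebuilds it with Python's standard library. The function names are hypothetical, and this only illustrates the query string (one `repo` parameter and one or more `path` parameters); it says nothing about how DS8-Interact itself handles the request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

INTERACT_URL = (
    "http://data8.berkeley.edu/hub/interact"
    "?repo=textbook&path=notebooks/top_movies.csv&path=notebooks/Sampling.ipynb"
)


def parse_interact_link(url):
    """Return (repo, [paths]) from an Interact link's query string."""
    query = parse_qs(urlsplit(url).query)
    return query["repo"][0], query["path"]


def build_interact_link(repo, paths, base="http://data8.berkeley.edu/hub/interact"):
    """Build a link with one repo parameter and one path parameter per file."""
    pairs = [("repo", repo)] + [("path", p) for p in paths]
    # safe="/" keeps the slashes in file paths unescaped, as in the example link
    return base + "?" + urlencode(pairs, safe="/")


repo, paths = parse_interact_link(INTERACT_URL)
print(repo)   # textbook
print(paths)  # ['notebooks/top_movies.csv', 'notebooks/Sampling.ipynb']
```

Repeating the `path` parameter is what lets a single link request both the notebook and the data file it reads.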

DS8-Interact is a side server for the DATA 8 JupyterHub deployment that copies remote notebooks into user accounts

git clone https://github.com/data-8/DS8-Interact.git

Computer lab assignments are Jupyter notebooks

[Image: data8assets GitHub repo]

git clone https://github.com/data-8/data8assets.git

JupyterHub deployment

git clone https://github.com/data-8/jupyterhub-deploy.git
  • We were about to level out a new basement to give students computers with Jupyter installed...
  • ...until we discovered that Jessica Hamrick had deployed JupyterHub to the cloud. We thought we could do it too.
  • We based our deployment on Jess' jupyterhub-compmodels-deploy.
  • See Jessica's blog post at Rackspace on Deploying JupyterHub for Education and also her README at jupyterhub-compmodels-deploy for the design details.

Technical specs

  • For our Fall 2015 pilot of ~80 students, we deployed JupyterHub on bare-metal machines from UC Berkeley's CS department.
  • We gave each student 2GB of RAM. We expected about 60% of the class to be on at any point in time, so we provisioned two machines with 64 cores and 26GB RAM each.
  • For the Spring 2016 class of ~480 students, we used a donation from Microsoft Azure and deployed there, using 36 machines with 8 cores and 14GB RAM each.
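The sizing reasoning above (a per-student memory allowance multiplied by the expected number of concurrent students) can be sketched as simple arithmetic. This is a back-of-the-envelope estimate only: the function names and the 32 GB node size are hypothetical, and the estimate ignores memory for the hub, the operating system, and any headroom, so an actual deployment may be provisioned differently.

```python
import math


def peak_memory_gb(students, percent_online, gb_per_student):
    """Estimate peak memory demand from expected concurrent students."""
    concurrent = math.ceil(students * percent_online / 100)
    return concurrent * gb_per_student


def nodes_needed(total_gb, gb_per_node):
    """Number of equally sized nodes needed to cover total_gb of memory."""
    return math.ceil(total_gb / gb_per_node)


# Fall 2015 figures from the text: ~80 students, ~60% online, 2GB each
fall_demand = peak_memory_gb(students=80, percent_online=60, gb_per_student=2)
print(fall_demand)  # 96

# Hypothetical node size, for illustration only
print(nodes_needed(fall_demand, gb_per_node=32))  # 3
```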

Connector courses

  • A suite of connector courses is taught in departments across campus and introduces diverse subjects through the lens of data science
  • Spring 2016 has 11 connector courses: in ethics, cognitive science, geospatial data, probability & statistics, ecology, history, matrices & graphs, computational structures, health & human behavior, smart cities, literature
  • Nearly all use Jupyter notebooks and the DATA 8 JupyterHub deployment
  • Many connector instructors are new to Python and GitHub

Technical challenges and possible future directions

Scaling up to more students

  • Theoretically, scaling up to more students just means adding more nodes to the JupyterHub deployment to supply the computing power. However...

  • ...we're now discovering bugs that only surface at scale.

  • We were haunted for weeks by a race condition in JupyterHub that resulted in large numbers of 503 errors.

  • We've had to make a forum thread for these issues. Students still run into them every day.

    Error reporting forum thread

  • Now we have a team of students adding tooling to make deployment more stable. This includes a development deployment, logging and monitoring, and load testing.
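As one illustration of what a load test for this kind of deployment might look like, the sketch below runs many simulated users concurrently and tallies the status codes they receive. It is not the team's actual tooling: `run_load_test` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP request so the example runs without a server.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_load_test(fetch_status, n_users, max_workers=32):
    """Call fetch_status(user_id) for each simulated user and tally the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(fetch_status, range(n_users)))
    return Counter(statuses)


def fake_fetch(user_id):
    # Stand-in for a real request: every tenth simulated user gets a 503
    return 503 if user_id % 10 == 0 else 200


tally = run_load_test(fake_fetch, n_users=100)
print(tally[200], tally[503])  # 90 10
```

Swapping `fake_fetch` for a function that issues a real request against a development deployment would turn the tally into a rough measure of how often errors such as 503s appear under concurrent load.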

Scaling out to more classes

  • Scaling out is another ordeal. Our goal is to have one JupyterHub deployment that can serve all the classes at UC Berkeley. Adding JupyterHub for a class should be as easy as creating a class page.
  • At the moment, JupyterHub consolidates all users into one system. We need to split the users into multiple classes.
  • Problems that need solving:
    • Classes have different resources. Some might have AWS credits, others might have Azure credits, etc.
    • Students should be able to access different hubs for different classes.
    • Instructors need a way to distribute content to students.
    • Ideally, instructors can also grade assignments easily.
  • Proposals to solve these problems:

Other resources
