- This document provides a guided tour of information distributed across many web pages and GitHub repositories involved in running a live suite of data science courses at the University of California, Berkeley.
- These valuable resources -- designed for and by UC Berkeley students, instructors, and other teaching and support staff -- are publicly available.
- This content should be of broad interest to diverse folks thinking about data science education, using Jupyter notebooks in the classroom, and/or deploying and scaling JupyterHub.
Keywords: data science, UC Berkeley, undergraduate education, Jupyter notebooks, JupyterHub deployment
- Designed for anyone who wants to learn about UC Berkeley's data science program but doesn't have an account on data8.berkeley.edu
- This snapshot is for folks at JupyterDays Boston 2016 (March 17-18, 2016 in Cambridge, MA) and should provide plenty of material to explore, fork, and hack on
- An overview of UC Berkeley's new data science education program
- Pointers to current course materials distributed as Jupyter notebooks
- An overview of the live JupyterHub-based infrastructure
- All course content and software are viewable online
- You'll need a GitHub account if you want to clone or fork this content
- Course content is distributed as Jupyter notebooks that have several Python dependencies
- Python 3 and Jupyter
- The datascience Python package
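Before opening the notebooks locally, it can help to confirm that these dependencies are present. The following is a minimal sketch that uses only the standard library; the import names `notebook` and `datascience` are assumptions about how the Jupyter Notebook and datascience packages are exposed in your environment.

```python
import importlib.util
import sys

# Assumed import names for the dependencies listed above (adjust as needed).
REQUIRED_MODULES = ["notebook", "datascience"]


def python_ok(minimum=(3, 0)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python 3 available:", python_ok())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list of missing modules suggests the environment can at least import what the notebooks need; it does not check package versions.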
- Teaches computational and inferential (statistical) thinking through interaction with real data
- Pilot run in Fall 2015 with ~80 students
- Current Spring 2016 enrollment at ~470 students
- Three 50-minute lectures + one 2-hour computer lab session per week
- Program website: databears.berkeley.edu
- DATA 8 is new, fast-moving, and growing, and the intention is to keep growing (up to 3000 students/semester)
- Complemented by a suite of connector courses teaching diverse subjects through the lens of data science
- Meant as a foundation for advanced courses to be seeded across the university
- See the report on Data Sciences @ Berkeley: The Undergraduate Experience
- Must be accessible to all incoming first-years at UC Berkeley
- Assume no computer science background and only high school algebra
- Get students immediately interacting with data programmatically
- Can't require students to figure out a local installation -- too huge a barrier
- Provide a platform (technical and intellectual) that students can build on throughout their college careers
- Jupyter notebooks + JupyterHub support a solution satisfying all design requirements
- Why Jupyter notebooks?
  - Provide a natural environment for introducing data science skills to students
  - Let students develop an explicit computational narrative with data
  - Interactive substrate for the online course textbook and computer lab assignments
- Why JupyterHub?
  - Multi-user server for Jupyter notebooks can support many users (students, instructors, teaching staff)
  - Enables browser-based interface to computation in the cloud -- students only need a browser to start programming, interacting with data, and creating a visible record of their analytical steps
Course website: data8.org
- Syllabus + links to lecture videos
  - An overview of data science
  - Using Python to manipulate information in table data structures
  - Interpreting and exploring data through visualizations
  - Sampling: Understanding the behavior of random selection
  - Making predictions from data
  - Inference: Reasoning about populations by computing over samples
  - Models: Making assumptions and exploring their consequences
- data8.org is primarily a student-facing website and its links to computer lab assignments will not work for anyone who doesn't have a course account
- We'll show you everything that goes into making these links work for students and how to find the underlying source materials hosted on GitHub across various repositories of the data-8 organization
Online textbook: www.inferentialthinking.com
Alternatively: view the textbook on GitBook
Most sections of the online textbook begin with a big blue Interact button (example section)
When a student clicks the Interact button, they are redirected to a Jupyter notebook containing an interactive version of the textbook content!
- First, we'll explain where the source material is
- Second, we'll explain the Interact button
git clone https://github.com/data-8/textbook.git
- Most of the underlying source material for the textbook is written in Jupyter notebooks (example notebook)
- GitBook allows us to write and organize chapters using Markdown
- Conveniently, the Markdown syntax allows arbitrary HTML inline
- We can convert a notebook into an HTML snippet using nbconvert (example HTML)
- Then we include that HTML snippet in the Markdown file (example Markdown)
Sampling
========
{% include "../notebooks-html/Sampling.html" %}
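The chapter file above follows a regular pattern, so it could be generated from the notebook name. Here is a minimal sketch under that assumption; the helper `chapter_markdown` and its `html_dir` default are hypothetical and not part of the textbook repository. The HTML snippet itself would still come from nbconvert (for example, something like `jupyter nbconvert --to html --template basic`, whose exact flags depend on the nbconvert version).

```python
from pathlib import PurePosixPath


def chapter_markdown(title, notebook_path, html_dir="../notebooks-html"):
    """Build a GitBook chapter that includes the converted notebook HTML.

    Hypothetical helper: assumes the HTML snippet is named after the notebook.
    """
    stem = PurePosixPath(notebook_path).stem   # "Sampling" from "notebooks/Sampling.ipynb"
    underline = "=" * len(title)               # Setext-style heading underline
    include = '{% include "' + f"{html_dir}/{stem}.html" + '" %}'
    return f"{title}\n{underline}\n{include}\n"


page = chapter_markdown("Sampling", "notebooks/Sampling.ipynb")
print(page)
```

Called with the Sampling notebook, this reproduces the three-line Markdown file shown above.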
An Interact button in the textbook (example section) is a link like this:
http://data8.berkeley.edu/hub/interact?repo=textbook&path=notebooks/top_movies.csv&path=notebooks/Sampling.ipynb
DS8-Interact is a side server for the DATA 8 JupyterHub deployment that copies remote notebooks into user accounts
git clone https://github.com/data-8/DS8-Interact.git
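The Interact link above carries everything in its query string: one `repo` parameter and one `path` parameter per file. The sketch below only parses and rebuilds such a link with the standard library; how DS8-Interact handles these parameters on the server is not shown here, and the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

INTERACT_URL = (
    "http://data8.berkeley.edu/hub/interact"
    "?repo=textbook&path=notebooks/top_movies.csv&path=notebooks/Sampling.ipynb"
)


def parse_interact_link(url):
    """Split an Interact link into its endpoint, repo, and list of file paths."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # repeated keys are collected into lists
    return {
        "endpoint": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "repo": query["repo"][0],
        "paths": query["path"],
    }


def build_interact_link(endpoint, repo, paths):
    """Rebuild an Interact link; doseq=True emits one `path=` per file."""
    query = urlencode({"repo": repo, "path": paths}, doseq=True, safe="/")
    return f"{endpoint}?{query}"


link = parse_interact_link(INTERACT_URL)
print(link["repo"], link["paths"])
```

Repeating the `path` key is what lets a single link request both a data file and the notebook that reads it.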
git clone https://github.com/data-8/data8assets.git
git clone https://github.com/data-8/jupyterhub-deploy.git
- We were about to level out a new basement to give students computers with Jupyter installed...
- ...until we discovered that Jessica Hamrick had deployed JupyterHub to the cloud. We thought we could do it too.
- We based our deployment on Jess' jupyterhub-compmodels-deploy.
- See Jessica's blog post at Rackspace on Deploying JupyterHub for Education and also her README at jupyterhub-compmodels-deploy for the design details.
- For our Fall 2015 pilot of ~80 students, we deployed JupyterHub on bare-metal machines from UC Berkeley's CS department.
- We gave each student 2GB of RAM. We expected about 60% of the class to be on at any point in time, so we provisioned two machines with 64 cores and 26GB RAM each.
- For the Spring 2016 class of ~480 students, we used a donation from Microsoft Azure and deployed there, using 36 machines with 8 cores and 14GB RAM each.
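The sizing reasoning above is simple arithmetic, sketched here with the figures quoted in these bullets. The function names are hypothetical, and this is a rough estimate rather than the team's actual provisioning procedure.

```python
import math


def peak_ram_gb(students, percent_online, ram_per_student_gb):
    """Estimate RAM demand when `percent_online` percent of students are active."""
    concurrent = math.ceil(students * percent_online / 100)
    return concurrent * ram_per_student_gb


def cluster_totals(machines, cores_each, ram_each_gb):
    """Return (total cores, total RAM in GB) for identical machines."""
    return machines * cores_each, machines * ram_each_gb


# Fall 2015 pilot: ~80 students, ~60% online at once, 2GB each.
fall_demand_gb = peak_ram_gb(80, 60, 2)

# Spring 2016 on Azure: 36 machines with 8 cores and 14GB RAM each.
spring_cores, spring_ram_gb = cluster_totals(36, 8, 14)

print(fall_demand_gb, spring_cores, spring_ram_gb)
```

The same two functions can be reused to compare expected demand against capacity when planning for larger enrollments.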
- A suite of connector courses is taught in departments across campus and introduces diverse subjects through the lens of data science
- Spring 2016 has 11 connector courses: in ethics, cognitive science, geospatial data, probability & statistics, ecology, history, matrices & graphs, computational structures, health & human behavior, smart cities, literature
- Nearly all use Jupyter notebooks and the DATA 8 JupyterHub deployment
- Many connector instructors are new to Python and GitHub
- Theoretically, scaling up to more students just means adding more nodes to the JupyterHub deployment to get more computing power. However...
- ...we're now discovering bugs that only surface at scale.
- We were haunted for weeks by a race condition in JupyterHub that resulted in a large number of 503 errors.
- We've had to make a forum thread for these issues. Students still run into them every day.
- Now we have a team of students adding tooling to make deployment more stable. This includes a development deployment, logging and monitoring, and load testing.
- Scaling out is another ordeal. Our goal is to have one JupyterHub deployment that can serve all the classes at UC Berkeley. Adding JupyterHub for a class should be as easy as creating a class page.
- At the moment, JupyterHub consolidates all users into one system. We need to split the users into multiple classes.
- Problems that need solving:
  - Classrooms have different resources. Some might have AWS credits, others might have Azure credits, etc.
  - Students should be able to access different hubs for different classes.
  - Instructors need a way to distribute content to students.
  - Ideally, instructors can also grade assignments easily.
- Proposals to solve these problems:
  - A JupyterHub Hub, which lists and manages deployments of JupyterHub
  - A Dropbox-like interface to GitHub to help instructors with content management
  - See the design doc for an experiment called jupyter-synchronized-folders
  - The design doc is structured as (but is not) a Jupyter Enhancement Proposal
  - We'd love to hear comments via this pull request