- Feb 11: when counting pull requests per user, lines like
Revert "Merge pull request #859 from wvh/register_error_handler"
will count (even though this is technically undoing a pull request)
This project is longer than usual, with 32 questions, because it is
designed to help you learn (or review) Python concepts taught in 220.
The first 75% can be done as a group and focuses on review. The last
25% is individual and focuses on new 320 concepts: check_output, time, and git.
VIDEO showing how to get started (assumes you have completed lab 2). This was recorded in Fall 2021, but still applies to the project this semester.
Make sure to do lab 2 before starting this, as it must be done on your virtual machine.
- SSH to your virtual machine using the following:
ssh USERNAME@IP_ADDRESS
If you don't recall these from lab, you can find your username (https://console.cloud.google.com/compute/metadata/sshKeys) and External IP address (https://console.cloud.google.com/compute/instances) in Google's console.
- run pip3 install pandas==1.3.5 matplotlib==3.5.1
- run git clone https://github.com/cs320-wisc/s22.git
- go to http://YOUR_IP_ADDRESS:2020/tree in the browser (sign in, if prompted)
- enter the s22 > p1 directory
- Click the "New" button and select "Python 3", then start the project
- Go to "File" > "Rename", and name your notebook "p1"
- do a "Kernel" > "Restart & Run All" in your notebook
- "File" > "Save and Checkpoint"
- from an SSH session, navigate to the s22/p1 directory
- run python3 tester.py p1.ipynb and work on fixing any issues
Your notebook should have a comment like this:
# project: p1
# submitter: ????
# partner: none
# hours: ????
For submitter, use your NetID (part before @wisc.edu in your email). Estimate how many hours you spent on the project. This semester, "partner" should always be "none". Don't list people on your assigned team.
Submit as follows:
- "File" > "Download as" > "Notebook (.ipynb)"
- go to https://tyler.caraza-harter.com/cs320/s22/submission.html
- select P1
- "Choose File" to select the .ipynb file you had downloaded
- "Submit" (don't use "Ignore Errors" unless you're right before a deadline -- better to email us to get help resolving the issue)
For this portion of the project, you may collaborate with your group members in any way (even looking at working code). You may also seek help from 320 staff (mentors, TAs, instructor). You may not seek or receive help from other 320 students (outside your group) or anybody outside the course.
Take a look at the builtin Python functions to see if one can answer this: https://docs.python.org/3/library/functions.html
Some functions we use a lot in 220/320 are abs, dir, float, input, int, len, list, max, min, range, set, sorted, str, sum, type.
7 and 2 are ints, so the result of dividing these is an int (3, after rounding down 3.5) in most programming languages. Python's / operator produces the mathematically correct answer (3.5), which is a float even though both 7 and 2 are ints.
In other cases where you want to divide 7 by 2 and get an int, you would use 7 // 2.
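To see the difference between the two division operators side by side, here is a minimal sketch (the variable names are just for illustration):

```python
# True division always yields a float; floor division rounds down.
true_quotient = 7 / 2      # a float
floor_quotient = 7 // 2    # an int, rounded down
negative_floor = -7 // 2   # rounds toward negative infinity, not toward zero

print(true_quotient, floor_quotient, negative_floor)
```

Note that // rounds down rather than truncating, which only matters for negative numbers.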
Complete the code in accordance with the comment to calculate the answer.
x = 4
maximum = 10
minimum = 5
error = ???? # True if x is outside the minimum-to-maximum range
error
Notes:
- we don't need to specify the type of our variables as in some languages (e.g., Java) -- Python knows x is an int because we assigned 4, which is an int. Variable types are not fixed after creation as in some languages (e.g., Go) -- we could later run x = "howdy" if we wanted to
- in Python, a bool is True or False. We use the and, or, and not operators (in other programming languages, these operators are often expressed as &&, ||, and !).
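A small sketch of the boolean operators and of reassigning a variable to a different type. The variable names and values here are made up for illustration and are unrelated to the question above:

```python
# Hypothetical example values, chosen only to illustrate the operators.
temperature = 72
is_raining = False

nice_day = temperature > 60 and not is_raining
stay_home = is_raining or temperature < 0

# The same name can later refer to a value of a different type.
temperature = "warm"

print(nice_day, stay_home, type(temperature))
```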
Complete the following to answer:
word = "KAFKAESQUE"
suffix_match = ???? # .endswith(...) method not allowed for this question! (practice slicing)
suffix_match
Skim string methods here: https://docs.python.org/3/library/stdtypes.html#string-methods. Some important ones: find, isdigit, join, split, lower, upper, strip, replace.
Hints:
- to ignore case, it's often easy to use a method to make everything upper or lower case
- to get a single character from a string, you can use s[INDEX]. 0 is the first character, 1 is the second, and so on. Python supports negative indexing, meaning s[-1] is the last letter, s[-2] is the next to last, etc. You can also slice strings to get a substring by putting a colon between two indexes: s[inclusive_start:exclusive_end]. You can leave off one of the indexes to go to the start or end of the string. For example, word[:3] would evaluate to "KAF".
- in Java, you compare strings with s1.equals(s2), but in Python the correct equivalent is s1 == s2. The equivalent of Java's == is Python's rarely used is operator.
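The indexing and slicing rules above can be tried out on any string. This sketch uses a different, arbitrary word so it doesn't answer the question for you:

```python
s = "Notebook"

first = s[0]       # first character
last = s[-1]       # last character (negative index)
prefix = s[:4]     # start omitted: from the beginning up to (not including) index 4
middle = s[2:6]    # inclusive start, exclusive end

# Compare string contents with ==, lowering both sides to ignore case.
same_ignoring_case = s.lower() == "NOTEBOOK".lower()

print(first, last, prefix, middle, same_ignoring_case)
```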
Your function should generally take two ints and return their sum. For example, add(2, 3) should return 5. Users of the function should also be able to call it like add(x=2, y=3). If only one argument is passed, 1 should be added. For example, add(3) or add(x=3) would both return 4.
Python parameters may be filled with positional arguments, keyword arguments, or default arguments. If this is unfamiliar, read the following:
- https://docs.python.org/3/tutorial/controlflow.html#defining-functions
- https://docs.python.org/3/tutorial/controlflow.html#more-on-defining-functions
In Python, indents are very important. The code inside a function/if/loop is indented (Python doesn't use { and } to indicate this, as in Java and many other languages).
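As an illustration of the three ways a parameter can be filled, here is a sketch using a hypothetical scale function (not the function you are asked to write):

```python
def scale(value, factor=2):
    # factor falls back to its default of 2 when the caller does not supply it
    return value * factor

by_position = scale(5, 3)               # positional arguments
by_keyword = scale(value=5, factor=3)   # keyword arguments
with_default = scale(5)                 # factor uses its default

print(by_position, by_keyword, with_default)
```

The body of the function is indented under the def line; that indentation is what tells Python where the function ends.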
Call your function to answer.
Complete the following so that status says something meaningful about x.
x = 4
if ????:
    status = "negative"
elif ????:
    status = "positive"
else:
    status = "zero"
status
https://docs.python.org/3/tutorial/controlflow.html#if-statements
Paste the following:
nums = [3, 4, 1, 6]
for x in nums:
    print(x)
Python lists can be created like [item1, item2, ...] and indexed/sliced just like strings (strings and lists are both examples of Python sequences; by definition, you can index and slice any kind of sequence you encounter in Python). This list contains just ints, but you're free to have a mix of types in Python lists.
In general, you can plug in a variable name and sequence into a for loop to run a piece of code for every entry in the sequence:
for ???? in ????:
    # DO SOMETHING
More on for loops:
- https://docs.python.org/3/tutorial/controlflow.html#for-statements
- https://docs.python.org/3/tutorial/controlflow.html#break-and-continue-statements-and-else-clauses-on-loops
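If continue and break are unfamiliar, this sketch shows both in one loop. The first_words function and its rules are hypothetical, picked only to demonstrate the two statements:

```python
def first_words(words):
    kept = []
    for w in words:
        if w == "":
            continue   # skip this entry, but keep looping
        if w == "STOP":
            break      # leave the loop entirely; later entries are never visited
        kept.append(w)
    return kept

result = first_words(["a", "", "b", "STOP", "c"])
print(result)
```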
Write a function called smart_count that takes a list of numbers and returns their sum. It should also have the following features:
- ignore numbers greater than 10
- if there is a negative number, that number (and all that follow it, positive or negative) should be skipped
Use continue to implement feature 1 and break to implement feature 2.
The answer should be 6: 2+1+3. 11 and 15 are too large, so they are skipped. 8 and 2 are skipped because they are after a negative number (which is also skipped).
Copy/paste the following:
header = ["A", "B", "C"]
coord1 = {"x": 8, "y": 5}
coord2 = {"x": 9, "y": 2}
coord3 = {"x": 3, "y": 1}
rows = [
    [1, 6, coord1],
    [3, 4, coord2],
    [5, 2, coord3],
]
Note that rows is a list of lists. Each inner list contains two ints and one dict (dictionary). For complicated nested structures like this, it's often helpful to visualize the stack of frames and heap of objects in PythonTutor: https://pythontutor.com/live.html#mode=edit.
You could copy the above to visualize it, or use the following link for your convenience:
Both lists and dicts contain values. With lists, each value is associated with an index (integers starting from 0). With dicts, each value is associated with a key specified by the programmers. Keys are often strings, but they don't need to be.
Docs:
- https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
- https://docs.python.org/3/tutorial/datastructures.html#dictionaries
Q10: what is the value associated with the "x" key of the dict in the last position of the first list?
Hint: if the question were "what is the value associated with the 'y' key of the dict in the last position of the second list?", the solution would be: rows[1][-1]["y"]. You just need to tack on brackets containing indexes (for lists) or keys (for dicts) to delve deeper into a nested structure.
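The same bracket-chaining idea works on any nested structure. Here is a sketch on a made-up inventory list (hypothetical data, unrelated to rows):

```python
inventory = [
    ["apple", 3, {"aisle": 1, "shelf": 4}],
    ["bread", 1, {"aisle": 7, "shelf": 2}],
]

# last inner list -> last item in it (a dict) -> value for the "shelf" key
shelf_of_last = inventory[-1][-1]["shelf"]

# first inner list -> first item in it
name_of_first = inventory[0][0]

print(shelf_of_last, name_of_first)
```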
Complete the following so that the first change via v2 is NOT reflected in rows, but the second change via v2 IS reflected in rows:
import copy
v2 = ????
v2[0] = 8888 # first change
v2[1][1] = 9999 # second change
Relevant docs: https://docs.python.org/3/library/copy.html
To get a good intuition about the reference/shallow/deep copy, try stepping through the following slowly in PythonTutor:
import copy
v1 = [[1], [], [2, 3]]
v2 = v1
v2 = copy.copy(v1)
v2 = copy.deepcopy(v1)
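If you'd rather check the three cases with code than with a visualizer, the is operator tells you whether two names refer to the same object. A sketch, with variable names chosen for illustration:

```python
import copy

original = [[1], [], [2, 3]]
alias = original                 # same outer list, same inner lists
shallow = copy.copy(original)    # new outer list, same inner lists
deep = copy.deepcopy(original)   # new outer list, new inner lists

shallow.append("new")    # only the shallow copy's outer list grows
shallow[0].append(99)    # inner list is shared, so original sees this
deep[2].append(77)       # inner list is not shared, so original does not

print(original)
print(alias is original, shallow is original, shallow[0] is original[0], deep[0] is original[0])
```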
Q12: if we imagine the list of lists structure referenced by rows as a table, with column names in header, what is the sum of values in the "B" column?
Note: the "B" column corresponds to the values at index 1 of each list, but you are not allowed to hardcode 1 for this solution. Instead, use header.index(????) to look up the position of "B" within the header list.
Docs:
- https://docs.python.org/3/howto/sorting.html#sorting-basics
- https://docs.python.org/3/howto/sorting.html#key-functions
Hint: if we had to sort by the "A" column descending, we could do the following:
def get_column_a(row):
    print("lookup A column for a row")
    return row[header.index("A")]

rows.sort(key=get_column_a, reverse=True)
rows
Note that we aren't calling get_column_a ourselves (because there are no parentheses after it on the sort line). Instead, we're giving the sort method a reference to that function; this allows sort to call the function on each row, to figure out what part of the row objects matters for the sort.
When we only need a function for one purpose, we can use the lambda syntax instead of the def syntax to define the function on a single line, without even giving it a name. The following works the same as the earlier example (but without the print):
rows.sort(key=lambda row: row[header.index("A")], reverse=True)
rows
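Key functions work the same way with the builtin sorted function, which returns a new list instead of modifying the original in place. A sketch on made-up data:

```python
people = [("Ada", 36), ("Grace", 45), ("Linus", 28)]

# Sort by the second item of each tuple (ascending).
by_age = sorted(people, key=lambda person: person[1])

# Sort by the first item of each tuple, in reverse order.
by_name_desc = sorted(people, key=lambda person: person[0], reverse=True)

print(by_age)
print(by_name_desc)
print(people)  # unchanged, because sorted() does not modify its input
```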
Q14: say you're going on vacation to Europe with 400 US dollars; how many Euros can you get at the current exchange rate?
This site provides exchange rate information in JSON format: https://www.floatrates.com/json-feeds.html. JSON is a simple format that can represent nested dicts and lists in files and web resources.
Download a copy of usd.json to the directory where your project is. An easy way is to open a terminal, cd to the appropriate directory, then run wget SOME_URL_HERE to download the web resource.
Note: you can run shell commands in Jupyter, too, if you start the command with a ! (to indicate it is not Python code). If you do this, be sure to delete the cell after the download. Otherwise you'll create too much traffic on the floatrates.com site, re-downloading the same thing every time you re-run your notebook.
You can read a file like this:
f = open("usd.json")
data = f.read()
f.close()
Check the type of data and the first portion of it:
print(type(data))
print(data[:300] + "...")
Even though the file contains a string that could be interpreted as JSON, Python won't deserialize it to Python dicts/lists automatically. Instead of calling .read(), we need to use the load function in the json module:
https://docs.python.org/3/library/json.html#json.load
When reading documentation, start by focusing on parameters that can't take default arguments.
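To see the difference between .read() and json.load, here is a sketch that writes a tiny JSON file to a temporary directory and reads it back both ways. The file name and its contents are made up and do not mirror the structure of usd.json:

```python
import json
import os
import tempfile

# Write a small made-up JSON file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w") as f:
    f.write('{"widget": {"price": 2.5}, "tags": ["a", "b"]}')

with open(path) as f:
    raw = f.read()          # just a str

with open(path) as f:
    parsed = json.load(f)   # nested dicts/lists

print(type(raw), type(parsed))
print(parsed["widget"]["price"], parsed["tags"][0])
```

json.load takes the open file object (not the file name), which is the one parameter without a default.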
Normally, if you divide by 0, you'll get an exception. Write a function that does division; when there is such an exception, it should catch it and return the float nan (not a number).
How to catch exceptions: https://docs.python.org/3/tutorial/errors.html#handling-exceptions
To get nan, you can convert a string: float("nan")
Requirement: the function should only catch the exception that gets thrown for division by zero (not other exceptions). To find the name of this exception, you could try doing a simple division by zero in a cell and observe what gets thrown.
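The general pattern of catching one specific exception type looks like the sketch below. It uses a hypothetical parse_int function and ValueError, a different exception than the one you need for this question:

```python
import math

def parse_int(text):
    try:
        return int(text)
    except ValueError:
        # Only ValueError is handled here; any other exception still propagates.
        return None

good = parse_int("42")
bad = parse_int("forty-two")

# nan is a float that is not equal to anything, including itself,
# so use math.isnan to test for it.
nan = float("nan")
print(good, bad, math.isnan(nan), nan == nan)
```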
The US Census Bureau conducts the ACS (American Community Survey) yearly, asking a variety of questions. The following gives data on household computer use from the years 2013 to 2018:
We have downloaded the data for each year to a file in the home-computers directory.
Create a dictionary called years like this:
- key: a year (int), corresponding to a year of data in the directory. Don't hardcode the years -- use os.listdir and extract the year from each filename (right before the first .).
- value: a pandas DataFrame corresponding to the CSV for that year. Skip the first row from each CSV file: https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html. Use set_index (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) to make "Geographic Area Name" the index of the DataFrame. This will let you easily look up state stats by name (instead of by row number) later.
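For the filename part, here is a sketch of listing a directory and pulling an int out of each name. The directory and file names are made up (the real files may be named differently), and the sketch assumes the year is the last four characters before the first ".":

```python
import os
import tempfile

# Build a throwaway directory with hypothetical file names.
folder = tempfile.mkdtemp()
for name in ["survey2001.v1.csv", "survey2002.v1.csv"]:
    with open(os.path.join(folder, name), "w") as f:
        f.write("")

found = {}
for name in os.listdir(folder):
    stem = name.split(".")[0]    # text before the first "."
    year = int(stem[-4:])        # assumption: the year is the last 4 characters
    found[year] = os.path.join(folder, name)

years_found = sorted(found)
print(years_found)
```

os.listdir returns names in arbitrary order, so sort when the order matters.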
Answer with a sorted list.
The answer is in row 49 and column 1. The hardcoding way to answer (not allowed) would thus be this:
df = years[2018]
df.iat[49, 1] # iat works like df.iloc[49, 1], but is faster for one cell
Instead of hardcoding 49 and 1, you can use "Wisconsin" (row index name) and "Estimate!!Total" (column name). When using names instead of positions, you just need to use .at or .loc (instead of .iat).
The data is in the "Estimate!!Total!!No Computer" column.
If df is a DataFrame, df["some column name here"] will extract an individual column as a Pandas Series. A Pandas Series is like a list/dict hybrid. You can use .iat to look up values by integer position (like you would with a list). You can use .at to look up values by the Series' index, like you would with a dict. Note the confusing terminology here: a Series' index is like a dict's key, and the "i" in "iat" does NOT refer to "index".
If you have a Pandas Series s, you can do various aggregations on it, like .mean(), .sum(), .max(), etc.
Q20: what is the biggest per-state margin of error for "No Computer", as a fraction of the total estimate? (2018)
The margin of error is given in the "Margin of Error!!Total!!No Computer" column.
You can divide one Pandas Series by another on an elementwise basis like this: s3 = s2 / s1. You can then compute s3.max(). Or better, see if you can combine everything into a one-line computation.
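Here is a sketch of Series lookups and elementwise division on made-up numbers (the labels and values are hypothetical, not from the survey data). It assumes pandas is installed, as in the setup steps:

```python
import pandas as pd

# Made-up Series; the dict keys become the Series' index.
estimate = pd.Series({"north": 200, "south": 50, "east": 100})
margin = pd.Series({"north": 10, "south": 5, "east": 20})

by_label = estimate.at["south"]    # lookup by index label, like a dict
by_position = estimate.iat[0]      # lookup by integer position, like a list

ratio = margin / estimate          # elementwise; values are matched up by index label
largest = ratio.max()
largest_label = ratio.idxmax()     # the index label where the max occurs

print(by_label, by_position, largest, largest_label)
```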
Q21: for Wisconsin and adjacent states, what percent of households are estimated to be without a computer? (2018)
States: Illinois, Indiana, Iowa, Michigan, Minnesota, Wisconsin.
Answer with a dict, where the key is the state name, and the value is the percent.
If you have a Series s, you can use s.plot.bar() or s.plot.barh(). Be sure to set an axis label for the percent.
Example:
Answer with a plot like this:
Answer with a plot like this:
Columns:
- "Estimate!!Total!!Has one or more types of computing devices!!Smartphone"
- "Estimate!!Total!!Has one or more types of computing devices!!Tablet or other portable wireless computer"
You have to do the remainder of this project on your own. Do not discuss with anybody except 320 staff (mentors, TAs, instructor).
For this part, you'll do two things:
- analyze the history of this project: https://github.com/pallets/flask. We'll eventually learn how to use the flask module to build web applications -- for now we'll just analyze changes to the codebase over time.
- measure how long various Pandas operations take
We have a copy of the flask repo in flask.zip. Run unzip flask.zip. If unzip is not installed, follow the suggestion in the error message to install it. If that doesn't work because you don't have admin permissions, re-run the suggested command with sudo in front of it (that runs the command as the super/root/admin user).
If you pass cwd="????" to check_output, you can run the git log command inside the flask directory that was created when you ran the unzip command. "cwd" stands for "current working directory"; the command runs as if it had been started from that directory.
The check_output function in the subprocess module (https://docs.python.org/3.8/library/subprocess.html#subprocess.check_output) returns a byte sequence; consider converting it to a string ("utf-8" encoding) and splitting it by newline (\n) to get a list. This will be useful for answering the following questions.
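The bytes-to-lines conversion can be sketched without git at all. This example runs a small Python one-liner as the child process (an arbitrary stand-in for the real command) inside a temporary directory passed via cwd:

```python
import sys
import tempfile
from subprocess import check_output

workdir = tempfile.mkdtemp()

# Run a child process that prints two lines; cwd sets its working directory.
raw = check_output(
    [sys.executable, "-c", "import os; print('first'); print(os.getcwd())"],
    cwd=workdir,
)

text = raw.decode("utf-8")          # bytes -> str
lines = text.strip().split("\n")    # strip first, to avoid an empty last entry

print(type(raw), lines)
```

Without the strip, the trailing newline in the output would leave an empty string at the end of the list.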
Answer with a list. Earlier commits should be later in the list.
Use check_output to run a git checkout command to switch to that commit, before reading flask/README the way you would read any regular text file in Python (using open and .read).
When running git log, you'll see some entries like this:
commit 7b0c82dfdc867641dd6e1b200f735bffd66e4c12
Merge: c5ca1750 a841cfab
Author: David Lord <[email protected]>
Date: Wed Dec 22 17:10:24 2021 -0800
Merge pull request #4350 from olliemath/patch-1
Only use a custom JSONDecoder if needed
This means the code was approved by David Lord (who has permission to make changes), but the code change was written and proposed by olliemath.
Whenever a line from git log contains the text "Merge pull request" and "/", extract the username immediately before the "/". Count occurrences of usernames in a dictionary like the following:
{'Yourun-proger': 2,
'olliemath': 1,
'pallets': 204,
'jugmac00': 1,
'pgjones': 14,
'eprigorodov': 1,
...
}
Note: there will be some entries like the following that are actually undoing a pull request:
Revert "Merge pull request #859 from wvh/register_error_handler"
For simplicity, we'll count these just like the original pull requests.
This one will be difficult because the command will fail, triggering an exception. First, run this by itself to determine what exception is thrown in this circumstance:
check_output(["pip3", "instal"])
Search the page here to learn about the exception type, and import it: https://docs.python.org/3/library/subprocess.html
Then, use that information to catch exceptions of that type (fill in the missing exception type):
try:
check_output(["pip3", "instal"])
except ???? as e:
output = e.output
output
Oops, output is empty because programs often print errors to a different place than regular output. Read the documentation for the exception to find what should be used instead of e.output.
One last detail -- even though you use the correct code to get the error output, it will be None at first. You need to update the check_output call to be like this to capture error output:
check_output(["pip3", "instal"], stderr=PIPE)
We'll want to generate test data of various sizes. Use this function for that purpose:
import numpy as np
import pandas as pd

def rand_df(rows):
    return pd.DataFrame(np.random.randint(10, size=(rows, 4)),
                        columns=["A", "B", "C", "D"],
                        index=[f"r{i}" for i in range(1, rows+1)])
Answer with a plot as follows:
- x-axis is the number of rows in a DataFrame
- y-axis is how long it takes (in milliseconds) to loop over the DataFrame
- two lines: one for iterrows and one for itertuples
If you have a DataFrame generated from rand_df called df, you can take a measurement like this:
from time import time

t0 = time()
for row in df.iterrows():
    pass
t1 = time()
Your plot should look something like this (we're hiding the legend labels so it's a surprise for you which is faster).
Some noise is OK as long as you get the same general shape (we get a slightly different plot each time we measure ourselves).
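Since time() returns seconds, you'll need to convert to milliseconds for the y-axis. One way to organize repeated measurements is a small helper, sketched here with a hypothetical measure_ms function and a plain Python loop standing in for the DataFrame code:

```python
from time import time

def measure_ms(func):
    # Run func once and return the elapsed wall-clock time in milliseconds.
    t0 = time()
    func()
    t1 = time()
    return (t1 - t0) * 1000

def count_to(n):
    # Stand-in workload; replace with the loop you actually want to measure.
    total = 0
    for i in range(n):
        total += i
    return total

sizes = [1000, 10000, 100000]
timings = {n: measure_ms(lambda: count_to(n)) for n in sizes}
print(timings)
```

The exact numbers will differ from run to run and machine to machine.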
The easiest way to create a plot with two lines is to create a DataFrame with a column of measurements corresponding to each line. Here's a simple example to adapt:
times_df = pd.DataFrame(dtype=float)
times_df.at[1, "A"] = 50
times_df.at[2, "A"] = 60
times_df.at[1, "B"] = 35
times_df.at[2, "B"] = 34
times_df.plot.line()
Answer with a line plot, similar to the one for the previous question. Here is a code snippet to use for the measurement (adapt to measure .at as well):
total = 0
for idx in df.index:
    for col in df.columns:
        total += df.loc[idx, col]
Answer this one with a line plot similar to the last two. You should, however, have measurements going up to 20000 rows.
For the two code snippets to measure:
result = df["A"].apply(laugh).tolist()
AND
result = []
for val in df["A"]:
    result.append(laugh(val))
The laugh function is defined as follows:
def laugh(x):
    return "ha" * x