Project 1: Review, Git Analysis, Benchmarking

Corrections/Clarifications

  • Feb 11: when counting pull requests per user, lines like Revert "Merge pull request #859 from wvh/register_error_handler" will count (even though this is technically undoing a pull request)

Overview

This project is longer than usual, with 32 questions, because it is designed to help you learn (or review) Python concepts taught in 220. The first 75% can be done as a group and focuses on review. The last 25% is individual and focuses on new 320 concepts: check_output, time, and git.

VIDEO showing how to get started (assumes you have completed lab 2). This was recorded in Fall 2021, but still applies to the project this semester.

Setup

Make sure to do lab 2 before starting this, as it must be done on your virtual machine.

  1. SSH to your virtual machine using the following:
ssh USERNAME@IP_ADDRESS

If you don't recall these from lab, you can find your username (https://console.cloud.google.com/compute/metadata/sshKeys) and External IP address (https://console.cloud.google.com/compute/instances) in Google's console.

  2. run pip3 install pandas==1.3.5 matplotlib==3.5.1

  3. run git clone https://github.com/cs320-wisc/s22.git

  4. go to http://YOUR_IP_ADDRESS:2020/tree in the browser (sign in, if prompted)

  5. enter the s22 > p1 directory

  6. Click the "New" button and select "Python 3", then start the project

  7. Go to "File" > "Rename", and name your notebook "p1"

Testing

  1. do a "Kernel" > "Restart & Run All" in your notebook

  2. "File" > "Save and Checkpoint"

  3. from an SSH session, navigate to the s22/p1 directory

  4. run python3 tester.py p1.ipynb and work on fixing any issues

Submission

Your notebook should have a comment like this:

# project: p1
# submitter: ????
# partner: none
# hours: ????

For submitter, use your NetID (part before @wisc.edu in your email). Estimate how many hours you spent on the project. This semester, "partner" should always be "none". Don't list people on your assigned team.

Submit as follows:

  1. "File" > "Download as" > "Notebook (.ipynb)"

  2. go to https://tyler.caraza-harter.com/cs320/s22/submission.html

  3. select P1

  4. "Choose File" to select the .ipynb file you had downloaded

  5. "Submit" (don't use "Ignore Errors" unless you're right before a deadline -- better to email us to get help resolving the issue)

Group Part (75%)

For this portion of the project, you may collaborate with your group members in any way (even looking at working code). You may also seek help from 320 staff (mentors, TAs, instructor). You may not seek or receive help from other 320 students (outside your group) or anybody outside the course.

Review 220: Control Flow (Part 1)

Q1: what is the type of 7/2?

Take a look at the builtin Python functions to see if one can answer this: https://docs.python.org/3/library/functions.html

Some functions we use a lot in 220/320 are abs, dir, float, input, int, len, list, max, min, range, set, sorted, str, sum, type.

7 and 2 are ints, so the result of dividing these is an int (3, after rounding down 3.5) in most programming languages. Python produces the mathematically correct answer, even though it is not an int (like 7 and 2).

In other cases where you want to divide 7 by 2 and get an int, you would use 7 // 2.
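
As a small illustration of the two division operators (using different numbers than the question, chosen only for this sketch), the following compares / with //:

```python
# True division keeps the fractional part; floor division rounds down.
exact = 9 / 4       # 2.25
floored = 9 // 4    # 2
negative = -9 // 4  # -3 (rounds toward negative infinity, not toward zero)

print(exact, floored, negative)
```

Note that // rounds down rather than truncating, which only matters when the result is negative.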

Q2: what is error?

Complete the code in accordance with the comment to calculate the answer.

x = 4
maximum = 10
minimum = 5
error = ???? # True if x is outside the minimum-to-maximum range
error

Notes:

  1. we don't need to specify the type of our variables as in some languages (e.g., Java) -- Python knows x is an int because we assigned 4, which is an int. Variable types are not fixed after creation as in some languages (e.g., Go) -- we could later run x = "howdy" if we wanted to
  2. in Python, a bool is True or False. We use the and, or, and not operators (in other programming languages, these operators are often expressed as &&, ||, and !).
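
Here is a minimal sketch of both notes. The variable names and values (label, temperature, is_raining) are made up for illustration and are unrelated to the question above:

```python
# A variable's type follows whatever value it currently holds.
label = 4
label = "howdy"  # rebinding the same name to a str is allowed

temperature = 72    # hypothetical reading
is_raining = False  # hypothetical flag

# and / or / not combine bools (Java would write &&, ||, !)
nice_day = temperature > 60 and not is_raining
stay_home = is_raining or temperature < 20

print(type(label), nice_day, stay_home)
```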

Q3: ignoring case, does word end with the suffix "esque"?

Complete the following to answer:

word = "KAFKAESQUE"
suffix_match = ???? # .endswith(...) method not allowed for this question! (practice slicing)
suffix_match

Skim string methods here: https://docs.python.org/3/library/stdtypes.html#string-methods. Some important ones: find, isdigit, join, split, lower, upper, strip, replace.

Hints:

  1. to ignore case, it's often easy to use a method to make everything upper or lower case
  2. to get a single character from a string, you can use s[INDEX]. 0 is the first character, 1 is the second, and so on. Python supports negative indexing, meaning s[-1] is the last letter, s[-2] is the next to last, etc. You can also slice strings to get a substring by putting a colon between two indexes s[inclusive_start:exclusive_end]. You can leave off one of the indexes to go to the start or end of the string. For example, word[:3] would evaluate to "KAF".
  3. in Java, you compare strings with s1.equals(s2), but in Python the correct equivalent is s1 == s2. The equivalent of Java's == is Python's rarely used is operator.
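
The indexing and slicing rules in the hints can be tried on any string. This sketch uses a made-up example string rather than word:

```python
s = "Python"  # hypothetical example string

first = s[0]     # "P"
last = s[-1]     # "n"
middle = s[1:4]  # "yth" (index 4 is excluded)
head = s[:2]     # "Py"
tail = s[-3:]    # "hon"

# == compares contents; lower() normalizes case on both sides first
same_text = s.lower() == "PYTHON".lower()

print(first, last, middle, head, tail, same_text)
```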

Requirement: add function

Your function should generally take two ints and return their sum. For example, add(2, 3) should return 5. Users of the function should also be able to call it like add(x=2, y=3). If only one argument is passed, 1 should be added. For example, add(3) or add(x=3) would both return 4.

Python parameters may be filled with positional arguments, keyword arguments, or default arguments. If this is unfamiliar, read the following:

  1. https://docs.python.org/3/tutorial/controlflow.html#defining-functions
  2. https://docs.python.org/3/tutorial/controlflow.html#more-on-defining-functions
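
As a rough illustration of the three ways a parameter can be filled, consider a hypothetical scale function (not part of the project; it is invented here just to show the syntax):

```python
def scale(value, factor=2):
    """Multiply value by factor; factor defaults to 2 when omitted."""
    return value * factor

by_position = scale(5, 3)              # both parameters filled positionally
by_keyword = scale(value=5, factor=3)  # both parameters filled by name
with_default = scale(5)                # factor falls back to its default, 2

print(by_position, by_keyword, with_default)
```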

In Python, indents are very important. The code inside a function/if/loop is indented (Python doesn't use { and } to indicate this, as in Java and many other languages).

Q4: what is add(3, 4)?

Call your function to answer.

Q5: what is add(9)?

Q6: what is status?

Complete the following so that status says something meaningful about x.

x = 4
if ????:
    status = "negative"
elif ????:
    status = "positive"
else:
    status = "zero"
status

https://docs.python.org/3/tutorial/controlflow.html#if-statements

Requirement: nums list and smart_count function

Paste the following:

nums = [3, 4, 1, 6]
for x in nums:
    print(x)

Python lists can be created like [item1, item2, ...] and indexed/sliced just like strings (strings and lists are both examples of Python sequences; by definition, you can index and slice any kind of sequence you encounter in Python). This list contains just ints, but you're free to have a mix of types in Python lists.

In general, you can plug in a variable name and sequence into a for loop to run a piece of code for every entry in the sequence:

for ???? in ????:
    # DO SOMETHING

More on for loops:

Write a function called smart_count that takes a list of numbers and returns their sum. It should also have the following features:

  1. ignore numbers greater than 10
  2. if there is a negative number, that number (and all that follow it, positive or negative) should be skipped

Use continue to implement feature 1 and break to implement feature 2.
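
If continue and break are unfamiliar, the sketch below shows how each affects a loop. The first_evens function and its stop-at-zero rule are hypothetical and deliberately different from smart_count:

```python
def first_evens(values):
    """Collect even values, stopping at the first 0 (a hypothetical sentinel)."""
    evens = []
    for v in values:
        if v == 0:
            break      # leave the loop entirely; later values are never seen
        if v % 2 == 1:
            continue   # skip the rest of this iteration, move to the next value
        evens.append(v)
    return evens

print(first_evens([4, 7, 10, 3, 0, 8]))
```

Here 7 and 3 are skipped by continue, and 8 is never reached because break ends the loop at 0.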

Q7: what is smart_count(nums)?

Q8: what is smart_count([2, 1, 11, 3, 15, -1, 8, 2])?

The answer should be 6: 2+1+3. 11 and 15 are too large, so they are skipped. 8 and 2 are skipped because they are after a negative number (which is also skipped).

Review 220: State (Part 2)

Requirement: lists and dicts

Copy/paste the following:

header = ["A", "B", "C"]

coord1 = {"x": 8, "y": 5}
coord2 = {"x": 9, "y": 2}
coord3 = {"x": 3, "y": 1}

rows = [
    [1, 6, coord1],
    [3, 4, coord2],
    [5, 2, coord3],
]

Note that rows is a list of lists. Each inner list contains two ints and one dict (dictionary). For complicated nested structures like this, it's often helpful to visualize the stack of frames and heap of objects in PythonTutor: https://pythontutor.com/live.html#mode=edit.

You could copy the above to visualize it, or use the following link for your convenience:

https://pythontutor.com/visualize.html#code=header%20%3D%20%5B%22A%22,%20%22B%22,%20%22C%22%5D%0A%0Acoord1%20%3D%20%7B%22x%22%3A%208,%20%22y%22%3A%205%7D%0Acoord2%20%3D%20%7B%22x%22%3A%209,%20%22y%22%3A%202%7D%0Acoord3%20%3D%20%7B%22x%22%3A%203,%20%22y%22%3A%201%7D%0A%0Arows%20%3D%20%5B%0A%20%20%20%20%5B1,%206,%20coord1%5D,%0A%20%20%20%20%5B3,%204,%20coord2%5D,%0A%20%20%20%20%5B5,%202,%20coord3%5D,%0A%5D&cumulative=false&curInstr=7&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false

Both lists and dicts contain values. With lists, each value is associated with an index (integers starting from 0). With dicts, each value is associated with a key specified by the programmers. Keys are often strings, but they don't need to be.

Docs:

Q9: after inserting a "z" key in coord3 (with coord3["z"] = 3.14), what is rows?

Q10: what is the value associated with the "x" key of the dict in the last position of the first list?

Hint: if the question were "what is the value associated with the 'y' key of the dict in the last position of the second list?", the solution would be: rows[1][-1]["y"]. You just need to tack on brackets containing indexes (for lists) or keys (for dicts) to delve deeper into a nested structure.
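
The same bracket-chaining idea works on any nested structure. This sketch uses a made-up inventory list (not rows) to show indexes and keys being mixed:

```python
# hypothetical data, structured differently from rows
inventory = [
    ["apples", {"count": 3, "tags": ["red", "fresh"]}],
    ["pears", {"count": 7, "tags": ["green"]}],
]

pear_count = inventory[1][1]["count"]          # list index, list index, dict key
first_apple_tag = inventory[0][-1]["tags"][0]  # -1 picks the last item of the inner list

print(pear_count, first_apple_tag)
```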

Q11: what is rows after running the following?

Complete the following so that the first change via v2 is NOT reflected in rows, but the second change via v2 IS reflected in rows:

import copy
v2 = ????
v2[0] = 8888    # first change
v2[1][1] = 9999 # second change

Relevant docs: https://docs.python.org/3/library/copy.html

To get a good intuition about the reference/shallow/deep copy, try stepping through the following slowly in PythonTutor:

import copy
v1 = [[1], [], [2, 3]]
v2 = v1
v2 = copy.copy(v1)
v2 = copy.deepcopy(v1)
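
If PythonTutor is not handy, one way to inspect what each option shares with the original is the is operator, which checks whether two names refer to the same object. This is only a sketch; the names alias, shallow, and deep are chosen here for illustration:

```python
import copy

v1 = [[1], [], [2, 3]]

alias = v1                # a second name for the same outer list
shallow = copy.copy(v1)   # new outer list, same inner lists
deep = copy.deepcopy(v1)  # new outer list and new inner lists

print(alias is v1)                         # same outer object?
print(shallow is v1, shallow[0] is v1[0])  # outer shared? inner shared?
print(deep is v1, deep[0] is v1[0])
```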

Q12: if we imagine the list of lists structure referenced by rows as a table, with column names in header, what is the sum of values in the "B" column?

Note: the "B" column corresponds to the values at index 1 of each list, but you are not allowed to hardcode 1 for this solution. Instead, use header.index(????) to look up the position of "B" within the header list.

Q13: what is rows after we sort it in-place by the "B" column, ascending?

Docs:

Hint: if we had to sort by the "A" column descending, we could do the following:

def get_column_a(row):
    print("lookup A column for a row")
    return row[header.index("A")]

rows.sort(key=get_column_a, reverse=True)
rows

Note that we aren't calling get_column_a ourselves (because there are no parentheses after it on the sort line). Instead, we're giving the sort method a reference to that function; this allows sort to call the function on each row, to figure out what part of the row objects matters for the sort.

When we only need a function for one purpose, we can use the lambda syntax instead of the def syntax to define the function on a single line, without even giving it a name. The following works the same as the earlier example (but without the print):

rows.sort(key=lambda row: row[header.index("A")], reverse=True)
rows
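
To see key functions in isolation, here is a sketch on a made-up list of tuples. It also contrasts the sorted function (returns a new list) with the .sort method (changes the list in place):

```python
pairs = [("b", 3), ("a", 1), ("c", 2)]  # hypothetical data

def second_item(pair):
    return pair[1]

by_def = sorted(pairs, key=second_item)              # key given as a named function
by_lambda = sorted(pairs, key=lambda pair: pair[1])  # same key, written as a lambda

pairs.sort(key=lambda pair: pair[0], reverse=True)   # in place, by letter, descending

print(by_def, by_lambda, pairs)
```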

Q14: say you're going on vacation to Europe with 400 US dollars; how many Euros can you get at the current exchange rate?

This site provides exchange rate information in JSON format: https://www.floatrates.com/json-feeds.html. JSON is a simple format that can represent nested dicts and lists in files and web resources.

Download a copy of usd.json to the directory where your project is. An easy way is to open a terminal, cd to the appropriate directory, then run wget SOME_URL_HERE to download the web resource.

Note: you can run shell commands in Jupyter, too, if you start the command with a ! (to indicate it is not Python code). If you do this, be sure to delete the cell after the download. Otherwise you'll create too much traffic on the floatrates.com site, re-downloading the same thing every time you re-run your notebook.

You can read a file like this:

f = open("usd.json")
data = f.read()
f.close()

Check the type of data and the first portion of it:

print(type(data))
print(data[:300] + "...")

Even though the file contains a string that could be interpreted as JSON, Python won't deserialize it to Python dicts/lists automatically. Instead of calling .read(), we need to use the load function in the json module:

https://docs.python.org/3/library/json.html#json.load

When reading documentation, start by focusing on parameters that can't take default arguments.
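
The difference between .read() and json.load can be seen on a tiny file. The sketch below writes its own hypothetical example.json to a temporary directory, so it makes no assumptions about what usd.json contains:

```python
import json
import os
import tempfile

# Write a small hypothetical JSON file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w") as f:
    f.write('{"name": "demo", "values": [1, 2, 3], "nested": {"ok": true}}')

with open(path) as f:
    raw = f.read()         # .read() gives one big str

with open(path) as f:
    parsed = json.load(f)  # json.load gives Python dicts, lists, bools, etc.

print(type(raw), type(parsed))
print(parsed["values"][-1], parsed["nested"]["ok"])
```

Note that json.load takes the open file object, not the filename and not the string returned by .read().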

Requirement: divide function

Normally, if you divide by 0, you'll get an exception. Write a function that does division; when there is such an exception, it should catch it and return the float nan (not a number).

How to catch exceptions: https://docs.python.org/3/tutorial/errors.html#handling-exceptions

To get nan, you can convert a string: float("nan")

Requirement: the function should only catch the exception that gets thrown for division by zero (not other exceptions). To find the name of this exception, you could try doing a simple division by zero in a cell and observe what gets thrown.
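
As a sketch of catching one specific exception type, here is a hypothetical to_int function that handles ValueError (a different exception than the one divide needs) and lets everything else propagate:

```python
import math

def to_int(text):
    """Convert text to an int, or return nan if text is not a valid number."""
    try:
        return int(text)
    except ValueError:  # only this exception type is handled
        return float("nan")

good = to_int("42")
bad = to_int("forty-two")

print(good, bad, math.isnan(bad))
# nan is not equal to anything, including itself, so test for it with math.isnan
print(bad == bad)
```

A call like to_int(None) still raises TypeError, because the except clause names only ValueError.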

Q15: what is divide(3, 2)?

Q16: what is divide(-3, 0)?

Review 220: Data Science (Part 3)

The US Census Bureau conducts the ACS (American Community Survey) yearly, asking a variety of questions. The following gives data on household computer use from the years 2013 to 2018:

https://data.census.gov/cedsci/table?t=Computer%20and%20Internet%20Use&g=0100000US%240400000&tid=ACSDT1Y2015.B28001&hidePreview=true&tp=true&moe=true

We have downloaded the data for each year to a file in the home-computers directory.

Create a dictionary called years like this:

Q17: what are the keys in years?

Answer with a sorted list.

Q18: how many households did Wisconsin have in 2018?

The answer is in row 49 and column 1. The hardcoding way to answer (not allowed) would thus be this:

df = years[2018]
df.iat[49, 1] # iat works like df.iloc[49, 1], but is faster for one cell

Instead of hardcoding 49 and 1, you can use "Wisconsin" (row index name) and "Estimate!!Total" (column name). When using names instead of positions, you just need to use .at or .loc (instead of .iat).
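
The sketch below shows the positional and name-based lookups side by side on a small made-up DataFrame (not the census data; the row and column names are invented):

```python
import pandas as pd

# hypothetical table; not the census data
df = pd.DataFrame(
    {"height": [10, 20, 30], "width": [1, 2, 3]},
    index=["small", "medium", "large"],
)

by_position = df.iat[1, 0]           # row position 1, column position 0
by_name = df.at["medium", "height"]  # same cell, looked up by names
by_loc = df.loc["medium", "height"]  # .loc also accepts names

print(by_position, by_name, by_loc)
```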

Q19: how many total households in the US are estimated to not have any computer at home? (2018)

The data is in the "Estimate!!Total!!No Computer" column.

If df is a DataFrame, df["some column name here"] will extract an individual column as a Pandas Series. A Pandas Series is like a list/dict hybrid. You can use .iat to look up values by integer position (like you would with a list). You can use .at to look up values by the Series' index, like you would with a dict. Note the confusing terminology here: a Series' index is like a dict's key, and the "i" in "iat" does NOT refer to "index".

If you have a Pandas Series s, you can do various aggregations on it, like .mean(), .sum(), .max(), etc.
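
A quick sketch of the list-like and dict-like lookups, plus a few aggregations, on a Series with made-up values:

```python
import pandas as pd

s = pd.Series([5, 8, 2], index=["x", "y", "z"])  # hypothetical values

by_position = s.iat[0]  # integer position, like a list
by_label = s.at["y"]    # index label, like a dict key

print(by_position, by_label)
print(s.sum(), s.max(), s.mean())
```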

Q20: what is the biggest per-state margin of error for "No Computer", as a fraction of the total estimate? (2018)

The margin of error is given in the "Margin of Error!!Total!!No Computer" column.

You can divide one Pandas Series by another on an elementwise basis like this: s3 = s2 / s1. You can then compute s3.max(). Or better, see if you can combine everything into a one-line computation.
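
Elementwise division can be tried on two small made-up Series first. In this sketch, totals and parts are hypothetical names, and idxmax is used to report which index label has the largest ratio:

```python
import pandas as pd

# hypothetical Series sharing the same index labels
totals = pd.Series([100, 200, 50], index=["a", "b", "c"])
parts = pd.Series([10, 50, 5], index=["a", "b", "c"])

ratios = parts / totals  # values are paired up by index label, then divided
print(ratios)
print(ratios.max(), ratios.idxmax())
```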

Q21: for Wisconsin and adjacent states, what percent of households are estimated to be without a computer? (2018)

States: Illinois, Indiana, Iowa, Michigan, Minnesota, Wisconsin.

Answer with a dict, where the key is the state name, and the value is the percent.

Q22: same question, but answer with a bar plot.

If you have a Series s, you can use s.plot.bar() or s.plot.barh(). Be sure to set an axis label for the percent.

Example:

Q23: how has the number of WI households without computers changed over recent years?

Answer with a plot like this:

Q24: what is the relationship between households with smartphones and those with tablets? (2018)

Answer with a plot like this:

Columns:

  • "Estimate!!Total!!Has one or more types of computing devices!!Smartphone"
  • "Estimate!!Total!!Has one or more types of computing devices!!Tablet or other portable wireless computer"

Individual Part (25%)

You have to do the remainder of this project on your own. Do not discuss with anybody except 320 staff (mentors, TAs, instructor).

For this part, you'll do two things:

  1. analyze the history of this project: https://github.com/pallets/flask. We'll eventually learn how to use the flask module to build web applications -- for now we'll just analyze changes to the codebase over time.
  2. measure how long various Pandas operations take

We have a copy of the flask repo in flask.zip. Run unzip flask.zip. If unzip is not installed, follow the suggestion in the error message to install it. If that doesn't work because you don't have admin permissions, re-run the suggested command with sudo in front of it (that runs the command as the super/root/admin user).

Q25: what is the first line of output from git log when run in the flask repo directory?

If you pass cwd="????" to check_output, you can run the git log command inside the flask directory that was created when you unzipped flask.zip. "cwd" stands for "current working directory".

The check_output function in the subprocess module (https://docs.python.org/3.8/library/subprocess.html#subprocess.check_output) returns a byte sequence; consider converting it to a string ("utf-8" encoding) and splitting it by newline (\n) to get a list. This will be useful for answering the following questions.
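
To see the bytes-to-list conversion without involving git, the sketch below runs a tiny Python child process instead. The command and the use of os.getcwd() for cwd are stand-ins chosen for illustration:

```python
import os
import sys
from subprocess import check_output

# Run a tiny Python child process instead of git, so this works anywhere.
raw = check_output(
    [sys.executable, "-c", "print('first'); print('second')"],
    cwd=os.getcwd(),  # cwd chooses the directory the command runs in
)

text = str(raw, "utf-8")  # bytes -> str
lines = text.split("\n")  # str -> list of lines

print(type(raw), type(text))
print(lines)
```

Because the output ends with a newline, the last entry of lines is an empty string.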

Q26: What are the commit numbers of the 50 earliest commits?

Answer with a list. Earlier commits should be later in the list.

Q27: what did the README file contain after the 3rd commit?

Use check_output to run a git checkout command to switch to that commit, before reading flask/README the way you would read any regular text file in Python (using open and .read).

Q28: how many pull requests were merged from each GitHub user?

When running git log, you'll see some entries like this:

commit 7b0c82dfdc867641dd6e1b200f735bffd66e4c12
Merge: c5ca1750 a841cfab
Author: David Lord <[email protected]>
Date:   Wed Dec 22 17:10:24 2021 -0800

    Merge pull request #4350 from olliemath/patch-1
    
    Only use a custom JSONDecoder if needed

This means the code was approved by David Lord (who has permission to make changes), but the code change was written and proposed by olliemath.

Whenever a line from git log contains the text "Merge pull request" and "/", extract the username immediately before the "/". Count occurrences of usernames in a dictionary like the following:

{'Yourun-proger': 2,
 'olliemath': 1,
 'pallets': 204,
 'jugmac00': 1,
 'pgjones': 14,
 'eprigorodov': 1,
 ...
}

Note: there will be some entries like the following that are actually undoing a pull request:

Revert "Merge pull request #859 from wvh/register_error_handler"

For simplicity, we'll count these just like the original pull requests.

Q29: what is the output of pip3 instal? (yes, the misspelling was intentional)

This one will be difficult because the command will fail, triggering an exception. First, run this by itself to determine what exception is thrown in this circumstance:

check_output(["pip3", "instal"])

Search the page here to learn about the exception type, and import it: https://docs.python.org/3/library/subprocess.html

Then, use that information to catch exceptions of that type (fill in the missing exception type):

try:
    check_output(["pip3", "instal"])
except ???? as e:
    output = e.output
output

Oops, output is empty because programs often print errors to a different place than regular output. Read the documentation for the exception to find what should be used instead of e.output.

One last detail -- even though you use the correct code to get the error output, it will be None at first. You need to update the check_output call to be like this to capture error output:

from subprocess import check_output, PIPE
check_output(["pip3", "instal"], stderr=PIPE)

Q30: what is faster for looping over a DataFrame, iterrows or itertuples?

We'll want to generate test data of various sizes. Use this function for that purpose:

import numpy as np
import pandas as pd

def rand_df(rows):
    return pd.DataFrame(np.random.randint(10, size=(rows, 4)),
                        columns=["A", "B", "C", "D"],
                        index=[f"r{i}" for i in range(1, rows+1)])

Answer with a plot as follows:

  • x-axis is the number of rows in a DataFrame
  • y-axis is how long it takes to loop over the DataFrame, in milliseconds
  • two lines: one for iterrows and one for itertuples

If you have a DataFrame generated from rand_df called df, you can take a measurement like this:

from time import time

t0 = time()
for row in df.iterrows():
    pass
t1 = time()
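
Since time() reports seconds, the difference needs to be scaled to get milliseconds. Here is a sketch of that pattern on an unrelated workload; measure_ms is a hypothetical helper, and summing a range stands in for the DataFrame loop:

```python
from time import time

def measure_ms(func, *args):
    """Return how long one call to func(*args) takes, in milliseconds."""
    t0 = time()
    func(*args)
    t1 = time()
    return (t1 - t0) * 1000  # time() reports seconds

# hypothetical workload: summing ranges of increasing size
timings = {}
for n in [1000, 10000, 100000]:
    timings[n] = measure_ms(sum, range(n))

print(timings)
```

Single measurements like these are noisy, so the exact numbers will differ from run to run.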

Your plot should look something like this (we're hiding the legend labels so it's a surprise for you which is faster).

Some noise is OK as long as you get the same general shape (we get a slightly different plot each time we measure ourselves).

The easiest way to create a plot with two lines is to create a DataFrame with a column of measurements corresponding to each line. Here's a simple example to adapt:

times_df = pd.DataFrame(dtype=float)
times_df.at[1, "A"] = 50
times_df.at[2, "A"] = 60
times_df.at[1, "B"] = 35
times_df.at[2, "B"] = 34
times_df.plot.line()

Q31: what is faster, loc, or at?

Answer with a line plot, similar to the one for the previous question. Here is a code snippet to use for the measurement (adapt to measure .at as well):

total = 0
for idx in df.index:
    for col in df.columns:
        total += df.loc[idx, col]

Q32: what is faster, a loop or .apply?

Answer this one with a line plot similar to the last two. You should, however, have measurements going up to 20000 rows.

For the two code snippets to measure:

result = df["A"].apply(laugh).tolist()

AND

result = []
for val in df["A"]:
    result.append(laugh(val))

The laugh function is defined as follows:

def laugh(x):
    return "ha" * x