- SWBAT examine the ways that data collection can impact the story that data tells.
- SWBAT understand how sample size, sampling technique, and sample demographics can affect data interpretation.
Make a copy of this NYC Census Demographics by Council District Google Sheet.
- Based on the fourth worksheet (All Districts), which ethnic group saw the greatest population increase from 2000 to 2010? Which ethnic group saw the greatest population decrease from 2000 to 2010?
- Compare the population changes in Districts 3, 6, and 10. Which changed the most? Why is this question particularly hard to answer?
- Which sheet(s) would make the best case for gentrification of NYC? Which sheet(s) show gentrification isn't occurring in NYC?
- Where did this data come from? Do you trust that source?
- What if the data didn't indicate its source - would you still trust it?
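One reason the district comparison can be hard to answer is that "changed the most" can mean the largest change in number of people or the largest change in percentage. The sketch below shows the difference using made-up numbers. The district names and population counts are hypothetical and are not taken from the census sheet.

```python
# Hypothetical population counts for illustration only (NOT from the census sheet).
districts = {
    "District A": {"2000": 150_000, "2010": 156_000},
    "District B": {"2000": 20_000, "2010": 24_000},
}


def absolute_change(before, after):
    """Change measured in number of people."""
    return after - before


def percent_change(before, after):
    """Change measured relative to the starting population."""
    return (after - before) / before * 100


for name, counts in districts.items():
    before, after = counts["2000"], counts["2010"]
    print(
        f"{name}: {absolute_change(before, after):+,} people, "
        f"{percent_change(before, after):+.1f}%"
    )
```

With these made-up numbers, District A gains more people while District B grows by a larger percentage, so the answer depends on which measure you choose.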
- In teams of 3, pick a sector of NYC that you interact with on a regular basis. For example, you might pick transportation, education, food / grocery, utilities / rent / real estate, music, film, TV, or an online social media presence. Then, make a mind map of as many sources of data as you can think of which relate to that topic. For each source, branch off bubbles that list as many kinds of data in that source as you can think of.
e.g. If your topic is transportation, then the MTA is a source of data, including:
- train arrival data,
- bus arrival data,
- passenger swipe data,
- financial transaction data,
- etc.
If you're struggling to come up with data sources, think about your daily routine:
- Where does data come from online?
- Where does data come from during your commute?
- Where does data come from at school?
- Where does data come from on TV?
- Where does data come from at the store?
- etc.
Not all data can be seen by everyone. Many companies closely guard their data (and sometimes that data gets breached by hackers), but governments, some companies, and people online are starting to freely share their data; this is called open data. Opening up data increases its value because people can use it in new ways.
- For each of the kinds of data you listed in bubbles on your mind map, put a box around the data that you think might be available as open data, and put an X over the data you think is private data.
If you find that you've crossed off ALL or NONE of these data sources, you may need to think a bit more about other collections of data which are public or private and add them to your list. Do you have more boxes or X's on your mind map? Why do you think that is? Is that a good thing?
Let's compare the MTA in New York City and a transit authority in some other city.
In the other city, the transit authority restricts access to their train arrival time data and budgets $20,000 to build an app for users to see when trains will arrive at the station. That transit authority doesn't know how to build apps; they know how to drive trains. So instead of costing only $20,000, their app goes over budget by $15,000; it then underperforms with users, and the agency doesn't have any additional funds to spend to make it better.
In New York City, however, the transit authority recognizes that they're pretty good at driving trains, but not so great at making apps. The MTA instead spends $10,000 to build an API that shares train arrival data with developers. Developers then use this open data to compete for a $5,000 prize that goes to whoever can build the best app with it. The MTA receives more than 50 app submissions, and the public gets to choose which app is the best and should get the money. The developers of the winning app choose to build a business from their app, and they go on to develop apps for other cities as well. The MTA spent $15,000, didn't go over budget, and got loads of great apps just by opening up their data and making it available for developers to use.
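To make the comparison concrete, here is a minimal sketch that adds up the costs from the two scenarios above. The dollar amounts come from the scenarios; the variable names are only illustrative.

```python
# Dollar figures from the two scenarios described above.
other_city_budget = 20_000   # planned cost of the in-house app
other_city_overrun = 15_000  # amount the app went over budget
other_city_total = other_city_budget + other_city_overrun

mta_api_cost = 10_000        # cost of building the open data API
mta_prize = 5_000            # prize for the best app
mta_total = mta_api_cost + mta_prize

print(f"Other city spent ${other_city_total:,} for 1 app")
print(f"MTA spent ${mta_total:,} for more than 50 app submissions")
print(f"Difference: ${other_city_total - mta_total:,}")
```

In this scenario, the closed-data approach ends up costing $35,000 for a single underperforming app, while the open-data approach costs $15,000.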
- Check out some of the real apps made by developers using the MTA's open data: web.mta.info/apps/
- Explore the MTA's open data portal: web.mta.info/developers/
- Open data in practice (NYC.gov video)
- Open Government Timeline
- "What I learned in two years of moving government forms online" (Medium, Feb 22, 2018)
A critical part of working with data is sourcing data. Sometimes you have to go out and find data. Sometimes the data you want doesn't exist. And sometimes, the data exists, but you aren't allowed to see it. What's the best way to collect data? Well, it depends on the issue you're investigating.
- As a group, choose a civic issue you feel strongly about.
e.g. Access to Housing, Waste Water Treatment, Food Islands, Transit Spending, etc.
To begin, we need to have one or more questions we want to ask about that issue.
- Brainstorm 5 or more questions you have about your topic.
e.g. How far is every New Yorker from a grocery store? e.g. How much do New Yorkers spend on groceries every month? e.g. Where does produce in New York markets come from?
- As a group, choose the top one or two questions you want to investigate further.
- For each question, list the data you'd need to answer that question. (Don't worry if you think you wouldn't be allowed to get the data.) Make sure to include survey data for one of your questions.
e.g. To answer "How much do New Yorkers spend on groceries every month?", we'd need receipts or bills from a lot of people's grocery purchases. Or we could get financial records from several big New York grocery stores. Or maybe banks would give us people's spending records. Or we'd have to go survey people and ask them how much they spend every month, and we'd probably have to do that over a few months to calculate an average.
- For the survey data you'd want to collect, write down 5-10 questions you could ask people in order to better understand the problem you chose.
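Once survey responses come in, they need to be summarized. As a rough sketch of the grocery example, the code below averages each person's answers over a few months and then averages across people. The respondent labels and dollar amounts are invented for illustration; real survey data would look messier.

```python
from statistics import mean

# Hypothetical survey answers: monthly grocery spending in dollars,
# collected from each respondent over three months.
responses = {
    "respondent_1": [320, 295, 340],
    "respondent_2": [510, 480, 525],
    "respondent_3": [150, 175, 160],
}

# Average each person's spending over the months they reported.
per_person = {name: mean(months) for name, months in responses.items()}

# Average across people to estimate typical monthly spending.
overall = mean(per_person.values())

for name, average in per_person.items():
    print(f"{name}: ${average:.2f} per month")
print(f"Overall average: ${overall:.2f} per month")
```

Three respondents is far too small a sample to describe all New Yorkers. Who you ask, and how many people you ask, changes the average you get.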
Finding data can be the most liberating or the most frustrating part of working with data.
With so much data out there, how do you decide which data you can trust and use, and which data might be problematic and should be avoided?