Update Folder name and add code blocks RE Lesson 5; #179

Open
wants to merge 1 commit into
base: master
@@ -8,7 +8,7 @@ Web scraping is a technique to automatically access and extract large amounts of

We will be downloading turnstile data from this site:

```python
http://web.mta.info/developers/turnstile.html
```

@@ -35,7 +35,7 @@ Notice that on the top left of the console, there is an arrow symbol.

If you click on this arrow and then click on an area of the site itself, the code for that particular item will be highlighted in the console. I’ve clicked on the very first data file, Saturday, September 22, 2018, and the console has highlighted in blue the link to that particular file.

```python
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
```

@@ -49,7 +49,7 @@ Now that we’ve identified the location of the links, let’s get started on co

We start by importing the following libraries.

```python
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
```

@@ -58,7 +58,7 @@ from bs4 import BeautifulSoup

Next, we set the URL of the website and access the site with the requests library.

```python
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
```
@@ -67,13 +67,13 @@ If the access was successful, you should see the following output:

Next, we parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure. If you are interested in learning more about this library, check out the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

```python
soup = BeautifulSoup(response.text, "html.parser")
```

We use the `.findAll` method (a legacy alias of `.find_all` in BeautifulSoup 4) to locate all of the `<a>` tags.

```python
soup.findAll('a')
```

@@ -83,22 +83,56 @@ This code gives us every line of code that has an `<a>` tag. The information tha

Next, let’s extract the actual link that we want. Let’s test out the first link.

```python
one_a_tag = soup.findAll('a')[36]
link = one_a_tag['href']
```
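The hard-coded index `36` depends on the page layout at the time of writing, so it may shift if links are added or removed. As a hedged alternative, the sketch below selects links by their `href` pattern instead. The `sample_html` string is a made-up stand-in for the real page (only the first data link comes from the text above), and `find_all` is the current name of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page HTML; the real page has many more links.
sample_html = """
<a href="/developers/index.html">Developers</a>
<a name="top">Top</a>
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
<a href="data/nyct/turnstile/turnstile_180915.txt">Saturday, September 15, 2018</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# and the substring test keeps only links that look like turnstile data files.
data_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if "/turnstile_" in a["href"]
]

# The first matching link plays the role of soup.findAll('a')[36]['href'].
first_link = data_links[0]
```

Selecting by pattern trades a magic number for an assumption about the file naming scheme, which is visible in the `href` values themselves.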

This code saves ‘data/nyct/turnstile/turnstile_180922.txt’ to our variable `link`. The full URL to download the data is actually http://web.mta.info/developers/data/nyct/turnstile/turnstile_180922.txt, which I discovered by clicking on the first data file on the website as a test. We can use the `urllib.request` library to download this file to our computer. We provide `urllib.request.urlretrieve` with two parameters: the file URL and the filename to save it under. For my files, I named them “turnstile_180922.txt”, “turnstile_180901”, etc.

```python
download_url = 'http://web.mta.info/developers/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:])
```
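The string concatenation and slicing above can also be written with standard-library helpers. This is only a sketch of an equivalent approach, not the tutorial's own method: `urljoin` resolves the relative link against the page URL, and `posixpath.basename` takes the last path segment as the local filename.

```python
import posixpath
from urllib.parse import urljoin

# Page URL and relative link, both taken from the steps above.
url = 'http://web.mta.info/developers/turnstile.html'
link = 'data/nyct/turnstile/turnstile_180922.txt'

# Resolve the relative link against the page it was found on.
download_url = urljoin(url, link)

# Use the last path segment as the local filename.
filename = posixpath.basename(link)

# The download itself would then be:
# urllib.request.urlretrieve(download_url, filename)
```

Resolving against the page URL avoids hand-writing the `http://web.mta.info/developers/` prefix, and it still works if a link happens to be absolute.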

Last but not least, we pause the code for one second after each download so that we are not spamming the website with requests. This helps us avoid getting flagged as a spammer.

```python
time.sleep(1)
```

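Now that we understand how to download a file, the code below gathers the steps above into a single script for web scraping the NY MTA turnstile data. As written, it downloads the first data file; downloading the entire set of data files means repeating the last steps in a for loop over the remaining `<a>` tags.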

```python
# We start by importing the following libraries.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# set the url to the website
url = 'http://web.mta.info/developers/turnstile.html'

# access the site with our requests library.
response = requests.get(url)

# parse the html with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# extract the link to the first data file (the <a> tag at index 36)
one_a_tag = soup.findAll('a')[36]
# link now holds 'data/nyct/turnstile/turnstile_180922.txt'
link = one_a_tag['href']

# build the full download url from the relative link
download_url = 'http://web.mta.info/developers/' + link

# urllib.request.urlretrieve takes two parameters:
# 1. the file url
# 2. the local filename to save to
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_')+1:])

# pause for one second
time.sleep(1)
```
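The script above downloads a single file. One way to extend it to many files is to wrap the download steps in a for loop, sketched below under a few assumptions: `download_links` is a hypothetical helper name, `links` is a list of relative `href` strings already collected from the page, and the `retrieve` parameter exists only so the loop can be exercised without touching the network.

```python
import posixpath
import time
import urllib.request
from urllib.parse import urljoin

# Page URL from the tutorial; relative links are resolved against it.
PAGE_URL = 'http://web.mta.info/developers/turnstile.html'


def download_links(links, retrieve=urllib.request.urlretrieve, delay=1.0):
    """Download each relative link and return the local filenames used."""
    saved = []
    for link in links:
        download_url = urljoin(PAGE_URL, link)
        filename = posixpath.basename(link)
        # retrieve(url, filename) saves the remote file under filename
        retrieve(download_url, filename)
        saved.append(filename)
        # pause between requests so the site is not flooded
        time.sleep(delay)
    return saved
```

With the `soup` object from the script, the list could be built as `links = [a['href'] for a in soup.findAll('a')[36:]]`, possibly filtered to entries containing `'/turnstile_'` so that any trailing non-data links are skipped.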