python-mastery-101/19-WebScraping at main · yodkwtf/python-mastery-101

History

Name		Name	Last commit message	Last commit date
parent directory ..
accessing_data.py		accessing_data.py
blog_data.csv		blog_data.csv
bs_basics.py		bs_basics.py
readme.md		readme.md
scraping-program.py		scraping-program.py

readme.md

Web Scraping

Web Scraping means programmatically grabbing data from a web page

Three steps: Download, extract data by parsing the html received, use that data

Why Scrape?

If a website doesn't provide an APi to work with their data, we can use scraping to get some data off of their website

Is it ethical?

Some websites don't want you to do it, doesn't mean you can't do it.
Best practice: consult the robots.txt file to see what files are allowed to scraped
It's not illegal or something, more like they are requesting you not to do it
If you make too many req, try to time them out
If you keep on doing it your IP can be blocked

Beautiful Soup

Lets us navigate through HTML with python
Doesn't download HTML - we need requests module for that

from bs4 import BeautifulSoup

html_data = """
some html code returned from the requests module
"""

soup = BeautifulSoup(html_data, "html.parser")
# this will convert the html string into html

# Now we can navigate around the html
print(soup.body) # prints the entire body from html

We have many methods to select specific tags and selectors like select , find, find_all

The best way to know about these methods is to read the official docs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

19-WebScraping

19-WebScraping

readme.md

Web Scraping

Why Scrape?

Is it ethical?

Beautiful Soup

Files

19-WebScraping

Directory actions

More options

Directory actions

More options

Latest commit

History

19-WebScraping

Folders and files

parent directory

readme.md

Web Scraping

Why Scrape?

Is it ethical?

Beautiful Soup