added email extractor tutorial
x4nth055 committed Dec 3, 2019
1 parent 56f9836 commit e1c2cb8
Showing 4 changed files with 25 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -62,6 +62,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
- [How to Extract Weather Data from Google in Python](https://www.thepythoncode.com/article/extract-weather-data-python). ([code](web-scraping/weather-extractor))
- [How to Download All Images from a Web Page in Python](https://www.thepythoncode.com/article/download-web-page-images-python). ([code](web-scraping/download-images))
- [How to Extract All Website Links in Python](https://www.thepythoncode.com/article/extract-all-website-links-python). ([code](web-scraping/link-extractor))
- [How to Make an Email Extractor in Python](https://www.thepythoncode.com/article/extracting-email-addresses-from-web-pages-using-python). ([code](web-scraping/email-extractor))

- ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
- [How to Use Pickle for Object Serialization in Python](https://www.thepythoncode.com/article/object-serialization-saving-and-loading-objects-using-pickle-python). ([code](general/object-serialization))
7 changes: 7 additions & 0 deletions web-scraping/email-extractor/README.md
@@ -0,0 +1,7 @@
# [How to Make an Email Extractor in Python](https://www.thepythoncode.com/article/extracting-email-addresses-from-web-pages-using-python)
To run this:
- `pip3 install -r requirements.txt`
- To extract email addresses from `https://www.randomlists.com/email-addresses` and save them to `emails.txt`:
```
python email_harvester.py https://www.randomlists.com/email-addresses emails.txt
```
16 changes: 16 additions & 0 deletions web-scraping/email-extractor/email_harvester.py
@@ -0,0 +1,16 @@
import re
from requests_html import HTMLSession
import sys

url = sys.argv[1]
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# initiate an HTTP session
session = HTMLSession()
# get the HTTP Response
r = session.get(url)
# render the page, for JavaScript-driven websites
r.html.render()
with open(sys.argv[2], "a") as f:
    for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
        print(re_match.group().strip(), file=f)
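
The matching step in `email_harvester.py` can be tried offline with a minimal sketch like the one below. It is not part of this commit. `SIMPLE_EMAIL_REGEX`, `extract_emails`, and `sample_html` are hypothetical names introduced for illustration, and the pattern is a deliberately simplified stand-in for `EMAIL_REGEX`, not an equivalent of it. The sketch also assumes that matching case-insensitively and dropping duplicates is desirable, which the committed script does not do.

```python
import re

# Simplified pattern for illustration only (an assumption, not the commit's EMAIL_REGEX).
SIMPLE_EMAIL_REGEX = r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+"


def extract_emails(text, pattern=SIMPLE_EMAIL_REGEX):
    """Return unique, lowercased matches in order of first appearance."""
    seen = set()
    emails = []
    # re.IGNORECASE lets the lowercase-only character classes match uppercase letters too
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        email = match.group().lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


# Hypothetical page content standing in for r.html.raw_html.decode()
sample_html = """
<p>Contact: <a href="mailto:Alice@Example.com">Alice@Example.com</a></p>
<p>Support: bob.smith@mail.example.org</p>
"""
emails = extract_emails(sample_html)
print(emails)
```

Lowercasing the whole address is a simplification: domain names are case-insensitive, but a mail server may treat the local part as case-sensitive. Deduplicating is worth considering because the script opens the output file in append mode, so repeated runs add the same addresses again.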
1 change: 1 addition & 0 deletions web-scraping/email-extractor/requirements.txt
@@ -0,0 +1 @@
requests-html
