added email extractor tutorial

x4nth055 · Dec 3, 2019 · e1c2cb8
1 parent 56f9836
commit e1c2cb8
Showing 4 changed files with 25 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -62,6 +62,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
     - [How to Extract Weather Data from Google in Python](https://www.thepythoncode.com/article/extract-weather-data-python). ([code](web-scraping/weather-extractor))
     - [How to Download All Images from a Web Page in Python](https://www.thepythoncode.com/article/download-web-page-images-python). ([code](web-scraping/download-images))
     - [How to Extract All Website Links in Python](https://www.thepythoncode.com/article/extract-all-website-links-python). ([code](web-scraping/link-extractor))
+    - [How to Make an Email Extractor in Python](https://www.thepythoncode.com/article/extracting-email-addresses-from-web-pages-using-python). ([code](web-scraping/email-extractor))
 
 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
     - [How to Use Pickle for Object Serialization in Python](https://www.thepythoncode.com/article/object-serialization-saving-and-loading-objects-using-pickle-python). ([code](general/object-serialization))

diff --git a/web-scraping/email-extractor/README.md b/web-scraping/email-extractor/README.md
@@ -0,0 +1,7 @@
+# [How to Make an Email Extractor in Python](https://www.thepythoncode.com/article/extracting-email-addresses-from-web-pages-using-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- To extract email addresses from the page `https://www.randomlists.com/email-addresses` and save them to the file `emails.txt`:
+    ```
+    python email_harvester.py https://www.randomlists.com/email-addresses emails.txt
+    ```
diff --git a/web-scraping/email-extractor/email_harvester.py b/web-scraping/email-extractor/email_harvester.py
@@ -0,0 +1,16 @@
+import re
+from requests_html import HTMLSession
+import sys
+
+url = sys.argv[1]
+EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
+
+# initiate an HTTP session
+session = HTMLSession()
+# get the HTTP Response
+r = session.get(url)
+# render the page so that JavaScript-generated content is included
+r.html.render()
+with open(sys.argv[2], "a") as f:
+    for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
+        print(re_match.group().strip(), file=f)
diff --git a/web-scraping/email-extractor/requirements.txt b/web-scraping/email-extractor/requirements.txt
@@ -0,0 +1 @@
+requests-html
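
Note on the matching step in `email_harvester.py`: the script runs `re.finditer` over the rendered HTML and writes every match to the output file, so repeated addresses are written repeatedly. `EMAIL_REGEX` also lists only lowercase letters and is used without `re.IGNORECASE`, so addresses containing uppercase letters may be missed or truncated. The sketch below is one possible way to isolate that step into a testable function. It is not part of this commit: `SIMPLE_EMAIL_REGEX` and `extract_emails` are hypothetical names, and the pattern is a deliberately simplified stand-in for the long RFC 5322 style regex above.

```python
import re

# Simplified pattern, assumed here for illustration only; it is NOT the
# EMAIL_REGEX from the commit and accepts a narrower set of addresses.
SIMPLE_EMAIL_REGEX = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+",
    re.IGNORECASE,  # also match addresses written with uppercase letters
)


def extract_emails(text):
    """Return the unique email addresses found in text, lowercased,
    in order of first appearance."""
    seen = set()
    emails = []
    for match in SIMPLE_EMAIL_REGEX.finditer(text):
        email = match.group().lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


if __name__ == "__main__":
    sample = "Contact Alice@Example.com or bob@test.org, alice@example.com."
    for address in extract_emails(sample):
        print(address)
```

In the script, the same function could be applied to `r.html.raw_html.decode()` before writing to the output file. Lowercasing before deduplication is a design choice in this sketch, not something the commit does.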