Adding support for Reddit RSS reader
lalitpagaria committed Jan 30, 2021
1 parent a226ddc commit 00293fd
Showing 10 changed files with 354 additions and 3 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,36 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python 3.7
      uses: actions/setup-python@v2
      with:
        python-version: 3.7

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
35 changes: 35 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,35 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.6'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Publish to PyPI
      if: github.event_name != 'pull_request'
      env:
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*
7 changes: 7 additions & 0 deletions .gitignore
@@ -127,3 +127,10 @@ dmypy.json

# Pyre type checker
.pyre/
/.idea/.gitignore
/.idea/app_store_reviews_reader.iml
/.idea/misc.xml
/.idea/modules.xml
/.idea/inspectionProfiles/profiles_settings.xml
/.idea/inspectionProfiles/Project_Default.xml
/.idea/vcs.xml
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2021 Lalit Pagaria

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
93 changes: 91 additions & 2 deletions README.md
@@ -1,2 +1,91 @@
# reddit-rss-reader
Wrapper around Reddit RSS feed
<p align="center">
    <a href="https://github.com/lalitpagaria/reddit-rss-reader/blob/master/LICENSE">
        <img alt="License" src="https://img.shields.io/github/license/lalitpagaria/reddit-rss-reader?color=blue">
    </a>
    <a href="https://pypi.org/project/reddit-rss-reader">
        <img src="https://img.shields.io/pypi/pyversions/reddit-rss-reader" alt="PyPI - Python Version" />
    </a>
    <a href="https://pypi.org/project/reddit-rss-reader/">
        <img alt="Release" src="https://img.shields.io/pypi/v/reddit-rss-reader">
    </a>
    <a href="https://pepy.tech/project/reddit-rss-reader">
        <img src="https://pepy.tech/badge/reddit-rss-reader/month" alt="Downloads" />
    </a>
    <a href="https://github.com/lalitpagaria/reddit-rss-reader/commits/master">
        <img alt="Last commit" src="https://img.shields.io/github/last-commit/lalitpagaria/reddit-rss-reader">
    </a>
</p>

# Reddit RSS Reader
This is a wrapper around the publicly available Reddit RSS feeds. It can be used to fetch content from the front page, a subreddit, all comments of a subreddit, all comments of a particular post, comments by a given Reddit user, search pages, and more. For more details about the RSS feeds Reddit provides, refer to these links: [link1](https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/) and [link2](https://www.reddit.com/wiki/rss).

*Note: These feeds are rate limited, hence they can only be used for testing purposes. For serious scraping, register your bot at [apps](https://www.reddit.com/prefs/apps/) to get client credentials and use them with [PRAW](https://github.com/praw-dev/praw).*


## Installation

Install via PyPI:
```shell
pip install reddit-rss-reader
```
Install from master branch (if you want to try the latest features):
```shell
git clone https://github.com/lalitpagaria/reddit-rss-reader
cd reddit-rss-reader
pip install --editable .
```

## How to use
`RedditRSSReader` requires a feed URL; refer to this [link](https://www.reddit.com/wiki/rss) to generate one. For example, to fetch all comments on the subreddit `r/wallstreetbets` -
```shell
https://www.reddit.com/r/wallstreetbets/comments/.rss?sort=new
```
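Other feed variants follow the same URL pattern; a small sketch of composing them (the subreddit name is illustrative, and the patterns follow the Reddit RSS wiki linked above):

```python
# Illustrative feed URL patterns; see the Reddit RSS wiki for the full list.
base = "https://www.reddit.com"
subreddit = "wallstreetbets"  # any subreddit name

feed_urls = {
    "new posts": f"{base}/r/{subreddit}/new/.rss?sort=new",
    "all comments": f"{base}/r/{subreddit}/comments/.rss?sort=new",
}
print(feed_urls["all comments"])
```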

Now you can run the following [example](https://github.com/lalitpagaria/reddit-rss-reader/tree/master/example) -
```python
import pprint
from datetime import datetime, timedelta

import pytz

from reddit_rss_reader.reader import RedditRSSReader


reader = RedditRSSReader(
    url="https://www.reddit.com/r/wallstreetbets/comments/.rss?sort=new"
)

# To consider comments entered in the past 5 days only
since_time = datetime.utcnow().astimezone(pytz.utc) + timedelta(days=-5)

# fetch_content will fetch all contents if no parameters are passed.
# If `after` is passed then it will fetch contents after this date.
# If `since_id` is passed then it will fetch contents after this id.
reviews = reader.fetch_content(
    after=since_time
)

pp = pprint.PrettyPrinter(indent=4)
for review in reviews:
    pp.pprint(review.__dict__)
```
The reader returns a list of `RedditContent` objects, which have the following fields (`extracted_text` and `image_alt_text` are extracted from the Reddit content via `BeautifulSoup`) -
```python
@dataclass
class RedditContent:
    title: str
    link: str
    id: str
    content: str
    extracted_text: Optional[str]
    image_alt_text: Optional[str]
    updated: datetime
    author_uri: str
    author_name: str
    category: str
```
The output is encoded in UTF-8. If you are scraping non-English subreddits, set the environment to use UTF-8 -
```shell
export LANG=en_US.UTF-8
export PYTHONIOENCODING=utf-8
```
26 changes: 26 additions & 0 deletions example/example.py
@@ -0,0 +1,26 @@
import pprint
from datetime import datetime, timedelta

import pytz

from reddit_rss_reader.reader import RedditRSSReader


reader = RedditRSSReader(
    url="https://www.reddit.com/r/wallstreetbets/new/.rss?sort=new"
    # url="https://www.reddit.com/r/wallstreetbets/comments/l84ner/for_those_who_have_been_around_for_a_while_what/.rss?sort=new"
)

# To consider comments entered in the past 5 days only
since_time = datetime.utcnow().astimezone(pytz.utc) + timedelta(days=-5)

# fetch_content will fetch all contents if no parameters are passed.
# If `after` is passed then it will fetch contents after this date.
# If `since_id` is passed then it will fetch contents after this id.
reviews = reader.fetch_content(
    after=since_time
)

pp = pprint.PrettyPrinter(indent=4)
for review in reviews:
    pp.pprint(review.__dict__)
Empty file added reddit_rss_reader/__init__.py
Empty file.
82 changes: 82 additions & 0 deletions reddit_rss_reader/reader.py
@@ -0,0 +1,82 @@
import logging
from dataclasses import dataclass
from datetime import datetime
from time import mktime
from typing import List, Optional

import feedparser
import requests
from bs4 import BeautifulSoup
from feedparser import FeedParserDict

logger = logging.getLogger(__name__)


@dataclass
class RedditContent:
    title: str
    link: str
    id: str
    content: str
    extracted_text: Optional[str]
    image_alt_text: Optional[str]
    updated: datetime
    author_uri: str
    author_name: str
    category: str


@dataclass
class RedditRSSReader:
    # For a proper url
    # Refer https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/
    url: str

    def fetch_content(self, after: Optional[datetime] = None, since_id: Optional[int] = None) -> List[RedditContent]:
        contents: List[RedditContent] = []
        feed = self.fetch_feed(self.url)

        # Entries are sorted newest first, so stop at the first entry that is
        # older than `after` or that matches the already-seen `since_id`
        for entry in feed.entries:
            if after is not None and after.timetuple() > entry.updated_parsed:
                break

            if since_id is not None and since_id == entry.id:
                break

            try:
                soup = BeautifulSoup(entry.summary, "html.parser")

                # Collect alt text from any inline images (empty list if none)
                image_alt_texts = [x['alt'] for x in soup.find_all('img', alt=True)]

                contents.append(
                    RedditContent(
                        link=entry.link,
                        id=entry.id,
                        title=entry.title,
                        content=entry.summary,
                        extracted_text=soup.get_text(),
                        image_alt_text=". ".join(image_alt_texts),
                        updated=datetime.fromtimestamp(mktime(entry.updated_parsed)),
                        author_name=entry.author_detail.name,
                        author_uri=str(entry.author_detail.href),
                        category=entry.category
                    )
                )
            except Exception:
                logger.error(f'Error parsing entry={entry}', exc_info=True)

        return contents

    @staticmethod
    def fetch_feed(feed_url: str) -> FeedParserDict:
        # On macOS, parsing https feed urls directly does not work, hence this workaround
        # Refer https://github.com/uvacw/inca/issues/162
        if feed_url.startswith("https://"):
            feed_content = requests.get(feed_url)
            feed = feedparser.parse(feed_content.text)
        else:
            feed = feedparser.parse(feed_url)

        return feed
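The date filter in `fetch_content` works because `time.struct_time` values compare element-wise (year, month, day, ...) like tuples. A minimal stdlib sketch of that comparison and the timestamp conversion, using made-up dates:

```python
from datetime import datetime
from time import mktime

# `after.timetuple()` and feedparser's `entry.updated_parsed` are both
# time.struct_time values, which compare element-wise like tuples.
after = datetime(2021, 1, 25)
entry_updated = datetime(2021, 1, 28).timetuple()  # stand-in for entry.updated_parsed

# The entry is newer than `after`, so fetch_content keeps it (no break)
is_older = after.timetuple() > entry_updated
print(is_older)  # False

# This mirrors how `updated` is converted back to a datetime in the reader
print(datetime.fromtimestamp(mktime(entry_updated)))
```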
3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
feedparser
beautifulsoup4
requests
73 changes: 73 additions & 0 deletions setup.py
@@ -0,0 +1,73 @@
import pathlib
from io import open

from setuptools import find_packages, setup


def parse_requirements(filename):
    with open(filename) as file:
        lines = file.read().splitlines()

    return [
        line.strip()
        for line in lines
        if not (
            (not line) or
            (line.strip()[0] == "#") or
            (line.strip().startswith('--find-links')) or
            ("git+https" in line)
        )
    ]


def get_dependency_links(filename):
    with open(filename) as file:
        lines = file.read().splitlines()

    return [
        line.strip().split('=')[1]
        for line in lines
        if line.strip().startswith('--find-links')
    ]
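A quick sketch of what these two helpers keep and drop, mirroring their filtering logic on an in-memory sample (the file names and URLs here are hypothetical):

```python
# Sample requirements lines (hypothetical) filtered with the same logic
# as parse_requirements/get_dependency_links above.
sample_lines = [
    "feedparser",
    "# a comment",
    "",
    "--find-links=https://example.com/wheels",
    "git+https://github.com/example/pkg.git",
]

requirements = [
    line.strip()
    for line in sample_lines
    if line
    and line.strip()[0] != "#"
    and not line.strip().startswith("--find-links")
    and "git+https" not in line
]
dependency_links = [
    line.strip().split("=")[1]
    for line in sample_lines
    if line.strip().startswith("--find-links")
]

print(requirements)       # ['feedparser']
print(dependency_links)   # ['https://example.com/wheels']
```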


dependency_links = get_dependency_links('requirements.txt')
parsed_requirements = parse_requirements('requirements.txt')

# The directory containing this file
HERE = pathlib.Path(__file__).parent

# The text of the README file
README = (HERE / "README.md").read_text()


setup(
    name="reddit-rss-reader",
    version="1.0",
    author="Lalit Pagaria",
    author_email="[email protected]",
    description="A wrapper around Reddit RSS feed",
    long_description=README,
    long_description_content_type="text/markdown",
    license="Apache Version 2.0",
    url="https://github.com/lalitpagaria/reddit-rss-reader",
    packages=find_packages(exclude=["*.tests", "*.tests.*", "tests.*", "tests"]),
    dependency_links=dependency_links,
    install_requires=parsed_requirements,
    include_package_data=True,
    python_requires=">=3.6.0",
    tests_require=["pytest"],
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "Intended Audience :: Customer Service",
        "Intended Audience :: Information Technology",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
)
