Adding support for Reddit RSS reader
1 parent a226ddc, commit 00293fd
Showing 10 changed files with 354 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# This workflow will install Python dependencies and lint with a single version of Python (pytest is installed, but no test step runs yet) | ||
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions | ||
|
||
name: CI | ||
|
||
on: | ||
  push: | ||
    branches: [ master ] | ||
  pull_request: | ||
    branches: [ master ] | ||
|
||
jobs: | ||
  build: | ||
|
||
    runs-on: ubuntu-latest | ||
|
||
    steps: | ||
    - uses: actions/checkout@v2 | ||
|
||
    - name: Set up Python 3.7 | ||
      uses: actions/setup-python@v2 | ||
      with: | ||
        python-version: 3.7 | ||
|
||
    - name: Install dependencies | ||
      run: | | ||
        python -m pip install --upgrade pip | ||
        pip install flake8 pytest | ||
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi | ||
    - name: Lint with flake8 | ||
      run: | | ||
        # stop the build if there are Python syntax errors or undefined names | ||
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics | ||
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide | ||
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# This workflow will upload a Python Package using Twine when a release is created | ||
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries | ||
|
||
name: Upload Python Package | ||
|
||
on: | ||
  release: | ||
    types: [created] | ||
|
||
jobs: | ||
  deploy: | ||
|
||
    runs-on: ubuntu-latest | ||
|
||
    steps: | ||
    - uses: actions/checkout@v2 | ||
|
||
    - name: Set up Python | ||
      uses: actions/setup-python@v2 | ||
      with: | ||
        python-version: '3.6' | ||
|
||
    - name: Install dependencies | ||
      run: | | ||
        python -m pip install --upgrade pip | ||
        pip install setuptools wheel twine | ||
    - name: publish to PyPI | ||
      if: github.event_name != 'pull_request' | ||
      env: | ||
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} | ||
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} | ||
      run: | | ||
        python setup.py sdist bdist_wheel | ||
        twine upload dist/* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,91 @@ | ||
# reddit-rss-reader | ||
Wrapper around Reddit RSS feed | ||
<p align="center"> | ||
<a href="https://github.com/lalitpagaria/reddit-rss-reader/blob/master/LICENSE"> | ||
<img alt="License" src="https://img.shields.io/github/license/lalitpagaria/reddit-rss-reader?color=blue"> | ||
</a> | ||
<a href="https://pypi.org/project/reddit-rss-reader"> | ||
<img src="https://img.shields.io/pypi/pyversions/reddit-rss-reader" alt="PyPI - Python Version" /> | ||
</a> | ||
<a href="https://pypi.org/project/reddit-rss-reader/"> | ||
<img alt="Release" src="https://img.shields.io/pypi/v/reddit-rss-reader"> | ||
</a> | ||
<a href="https://pepy.tech/project/reddit-rss-reader"> | ||
<img src="https://pepy.tech/badge/reddit-rss-reader/month" alt="Downloads" /> | ||
</a> | ||
<a href="https://github.com/lalitpagaria/reddit-rss-reader/commits/master"> | ||
<img alt="Last commit" src="https://img.shields.io/github/last-commit/lalitpagaria/reddit-rss-reader"> | ||
</a> | ||
</p> | ||
|
||
# Reddit RSS Reader | ||
This is a wrapper around publicly/privately available Reddit RSS feeds. It can be used to fetch content from the front page, a subreddit, all comments of a subreddit, all comments of a certain post, comments of a certain Reddit user, search pages, and more. For details about the types of RSS feed Reddit provides, refer to these links: [link1](https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/) and [link2](https://www.reddit.com/wiki/rss). | ||
|
||
*Note: These feeds are rate limited, so they are suitable only for testing. For serious scraping, register your bot at [apps](https://www.reddit.com/prefs/apps/) to get client details and use them with [PRAW](https://github.com/praw-dev/praw).* | ||
|
||
|
||
## Installation | ||
|
||
Install via PyPI: | ||
```shell | ||
pip install reddit-rss-reader | ||
``` | ||
Install from master branch (if you want to try the latest features): | ||
```shell | ||
git clone https://github.com/lalitpagaria/reddit-rss-reader | ||
cd reddit-rss-reader | ||
pip install --editable . | ||
``` | ||
|
||
## How to use | ||
`RedditRSSReader` requires a feed URL; refer to this [link](https://www.reddit.com/wiki/rss) to build one. For example, to fetch all comments on the subreddit `r/wallstreetbets` - | ||
```shell | ||
https://www.reddit.com/r/wallstreetbets/comments/.rss?sort=new | ||
``` | ||
|
||
Now you can run the following [example](https://github.com/lalitpagaria/reddit-rss-reader/tree/master/example) - | ||
```python | ||
import pprint | ||
from datetime import datetime, timedelta | ||
|
||
import pytz | ||
|
||
from reddit_rss_reader.reader import RedditRSSReader | ||
|
||
|
||
reader = RedditRSSReader( | ||
    url="https://www.reddit.com/r/wallstreetbets/comments/.rss?sort=new" | ||
) | ||
|
||
# Only consider comments posted in the past 5 days (timezone-aware UTC cutoff). | ||
# datetime.utcnow() is naive, so astimezone() would treat it as local time. | ||
since_time = datetime.now(pytz.utc) - timedelta(days=5) | ||
|
||
# fetch_content fetches everything in the feed if no parameters are passed. | ||
# If `after` is passed, it fetches content newer than this date. | ||
# If `since_id` is passed, it fetches content newer than the entry with this id. | ||
reviews = reader.fetch_content( | ||
    after=since_time | ||
) | ||
|
||
pp = pprint.PrettyPrinter(indent=4) | ||
for review in reviews: | ||
    pp.pprint(review.__dict__) | ||
``` | ||
The reader returns a list of `RedditContent` objects with the following fields (`extracted_text` and `image_alt_text` are extracted from the Reddit content via `BeautifulSoup`) - | ||
```python | ||
@dataclass | ||
class RedditContent: | ||
    title: str | ||
    link: str | ||
    id: str | ||
    content: str | ||
    extracted_text: Optional[str] | ||
    image_alt_text: Optional[str] | ||
    updated: datetime | ||
    author_uri: str | ||
    author_name: str | ||
    category: str | ||
``` | ||
The output is UTF-8 encoded. If you are scraping non-English subreddits, set the environment to use UTF-8 - | ||
```shell | ||
export LANG=en_US.UTF-8 | ||
export PYTHONIOENCODING=utf-8 | ||
``` |
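The `after` and `since_id` parameters described in the README act as cutoffs: `fetch_content` walks the feed in order and stops at the first entry that is older than `after` or whose id equals `since_id`. That only behaves as described when the feed is sorted newest first (as with `?sort=new`). The following is a minimal, self-contained sketch of that cutoff logic; `Entry` and `take_new_entries` are hypothetical stand-ins for illustration and are not part of the package.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for a parsed feed entry (not part of the package)."""
    id: str
    updated: datetime


def take_new_entries(
    entries: List[Entry],
    after: Optional[datetime] = None,
    since_id: Optional[str] = None,
) -> List[Entry]:
    """Collect entries until the first one that is too old or already seen.

    Assumes `entries` is sorted newest first.
    """
    fresh: List[Entry] = []
    for entry in entries:
        if after is not None and entry.updated < after:
            break  # everything from here on is older than the cutoff
        if since_id is not None and entry.id == since_id:
            break  # this entry was already processed in an earlier run
        fresh.append(entry)
    return fresh


def _utc(day: int) -> datetime:
    """Build a timezone-aware UTC datetime in February 2021 (sample data only)."""
    return datetime(2021, 2, day, tzinfo=timezone.utc)


# Made-up feed, newest first.
feed = [Entry("t1_c", _utc(3)), Entry("t1_b", _utc(2)), Entry("t1_a", _utc(1))]
by_date = [e.id for e in take_new_entries(feed, after=_utc(2))]
by_id = [e.id for e in take_new_entries(feed, since_id="t1_b")]
```

With an unsorted feed the early `break` would silently drop newer entries that appear after an old one, so a caller polling with `since_id` should keep the `sort=new` query parameter.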
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
import pprint | ||
from datetime import datetime, timedelta | ||
|
||
import pytz | ||
|
||
from reddit_rss_reader.reader import RedditRSSReader | ||
|
||
|
||
reader = RedditRSSReader( | ||
    url="https://www.reddit.com/r/wallstreetbets/new/.rss?sort=new" | ||
    # url="https://www.reddit.com/r/wallstreetbets/comments/l84ner/for_those_who_have_been_around_for_a_while_what/.rss?sort=new" | ||
) | ||
|
||
# Only consider content posted in the past 5 days (timezone-aware UTC cutoff). | ||
# datetime.utcnow() is naive, so astimezone() would treat it as local time. | ||
since_time = datetime.now(pytz.utc) - timedelta(days=5) | ||
|
||
# fetch_content fetches everything in the feed if no parameters are passed. | ||
# If `after` is passed, it fetches content newer than this date. | ||
# If `since_id` is passed, it fetches content newer than the entry with this id. | ||
reviews = reader.fetch_content( | ||
    after=since_time | ||
) | ||
|
||
pp = pprint.PrettyPrinter(indent=4) | ||
for review in reviews: | ||
    pp.pprint(review.__dict__) |
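The example switches between feeds by editing the URL string by hand. A small helper could build the two subreddit URL shapes that appear in this commit. This is only a sketch: `subreddit_feed_url` is a hypothetical name, and only the `new` and `comments` listings are taken from the commit; any other listing value is an untested assumption.

```python
from urllib.parse import urlencode


def subreddit_feed_url(subreddit: str, listing: str = "new", sort: str = "new") -> str:
    """Build a subreddit RSS URL in the shape used by the example.

    `listing` is the path segment after the subreddit name. Only "new" and
    "comments" appear in this commit; other values are assumptions.
    """
    query = urlencode({"sort": sort})
    return f"https://www.reddit.com/r/{subreddit}/{listing}/.rss?{query}"


posts_url = subreddit_feed_url("wallstreetbets")
comments_url = subreddit_feed_url("wallstreetbets", listing="comments")
```

The resulting string can be passed as the `url` argument of `RedditRSSReader`.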
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
import logging | ||
from dataclasses import dataclass | ||
from datetime import datetime | ||
from time import mktime | ||
from typing import List, Optional | ||
|
||
import feedparser | ||
import requests | ||
from bs4 import BeautifulSoup | ||
from feedparser import FeedParserDict | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
@dataclass | ||
class RedditContent: | ||
    title: str | ||
    link: str | ||
    id: str | ||
    content: str | ||
    extracted_text: Optional[str] | ||
    image_alt_text: Optional[str] | ||
    updated: datetime | ||
    author_uri: str | ||
    author_name: str | ||
    category: str | ||
|
||
|
||
@dataclass | ||
class RedditRSSReader: | ||
    # For proper url | ||
    # Refer https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/ | ||
    url: str | ||
|
||
    def fetch_content(self, after: Optional[datetime] = None, since_id: Optional[str] = None) -> List[RedditContent]: | ||
        contents: List[RedditContent] = [] | ||
        feed = self.fetch_feed(self.url) | ||
|
||
        for entry in feed.entries: | ||
            # feedparser exposes `updated_parsed` as a UTC struct_time, so compare in UTC | ||
            if after is not None and after.utctimetuple() > entry.updated_parsed: | ||
                break | ||
|
||
            # Entry ids are strings, so `since_id` must be a string as well | ||
            if since_id is not None and since_id == entry.id: | ||
                break | ||
|
||
            try: | ||
                # Name the parser explicitly; bs4 otherwise picks one and emits a warning | ||
                soup = BeautifulSoup(entry.summary, "html.parser") | ||
|
||
                image_alt_texts = [x['alt'] for x in soup.find_all('img', alt=True)] | ||
|
||
                contents.append( | ||
                    RedditContent( | ||
                        link=entry.link, | ||
                        id=entry.id, | ||
                        title=entry.title, | ||
                        content=entry.summary, | ||
                        extracted_text=soup.get_text(), | ||
                        image_alt_text=". ".join(image_alt_texts), | ||
                        updated=datetime.fromtimestamp(mktime(entry.updated_parsed)), | ||
                        author_name=entry.author_detail.name, | ||
                        author_uri=str(entry.author_detail.href), | ||
                        category=entry.category | ||
                    ) | ||
                ) | ||
            except Exception: | ||
                # logger.exception keeps the traceback alongside the message | ||
                logger.exception(f'Error parsing Entry={entry}') | ||
|
||
        return contents | ||
|
||
    @staticmethod | ||
    def fetch_feed(feed_url: str) -> FeedParserDict: | ||
        # On macOS fetching https URLs through feedparser does not work, hence this workaround | ||
        # Refer https://github.com/uvacw/inca/issues/162 | ||
        is_https = feed_url.startswith("https://") | ||
        if is_https: | ||
            feed_content = requests.get(feed_url) | ||
            feed = feedparser.parse(feed_content.text) | ||
        else: | ||
            feed = feedparser.parse(feed_url) | ||
|
||
        return feed |
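`fetch_content` builds `updated` with `time.mktime`, which interprets a `struct_time` as local time, while feedparser documents its parsed dates as UTC. On a machine whose timezone is not UTC this yields a naive datetime that cannot be compared with a timezone-aware `after` value. One possible alternative, sketched here with the standard library only, is to go through `calendar.timegm`; `struct_time_to_utc` is a hypothetical helper and is not part of this commit.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_time_to_utc(parsed: time.struct_time) -> datetime:
    """Convert a UTC `struct_time` into a timezone-aware UTC datetime.

    `calendar.timegm` is the inverse of `time.gmtime`, so the fields are read
    as UTC regardless of the machine's local timezone.
    """
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


# Sample value standing in for `entry.updated_parsed` (made-up timestamp).
sample = time.strptime("2021-02-01 12:30:00", "%Y-%m-%d %H:%M:%S")
updated = struct_time_to_utc(sample)

# Aware datetimes compare directly, with no struct_time comparison needed.
cutoff = datetime(2021, 2, 1, 12, 0, tzinfo=timezone.utc)
is_newer = updated >= cutoff
```

Storing an aware datetime in `RedditContent.updated` would change what callers receive, so this is a design choice to weigh rather than a drop-in fix.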
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
feedparser | ||
beautifulsoup4 | ||
requests |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
import pathlib | ||
from io import open | ||
|
||
from setuptools import find_packages, setup | ||
|
||
|
||
def parse_requirements(filename): | ||
    with open(filename) as file: | ||
        lines = file.read().splitlines() | ||
|
||
    return [ | ||
        line.strip() | ||
        for line in lines | ||
        if not ( | ||
            # strip first so whitespace-only lines do not reach the [0] index below | ||
            (not line.strip()) or | ||
            (line.strip()[0] == "#") or | ||
            (line.strip().startswith('--find-links')) or | ||
            ("git+https" in line) | ||
        ) | ||
    ] | ||
|
||
|
||
def get_dependency_links(filename): | ||
    with open(filename) as file: | ||
        lines = file.read().splitlines() | ||
|
||
    return [ | ||
        # split only on the first '=' so URLs containing '=' stay intact | ||
        line.strip().split('=', 1)[1] | ||
        for line in lines | ||
        if line.strip().startswith('--find-links') | ||
    ] | ||
|
||
|
||
dependency_links = get_dependency_links('requirements.txt') | ||
parsed_requirements = parse_requirements('requirements.txt') | ||
|
||
# The directory containing this file | ||
HERE = pathlib.Path(__file__).parent | ||
|
||
# The text of the README file | ||
README = (HERE / "README.md").read_text() | ||
|
||
|
||
setup( | ||
    name="reddit-rss-reader", | ||
    version="1.0", | ||
    author="Lalit Pagaria", | ||
    author_email="[email protected]", | ||
    description="A Wrapper around Reddit RSS feed", | ||
    long_description=README, | ||
    long_description_content_type="text/markdown", | ||
    license="Apache Version 2.0", | ||
    url="https://github.com/lalitpagaria/reddit-rss-reader", | ||
    packages=find_packages(exclude=["*.tests", "*.tests.*", "tests.*", "tests"]), | ||
    dependency_links=dependency_links, | ||
    install_requires=parsed_requirements, | ||
    include_package_data=True, | ||
    python_requires=">=3.6.0", | ||
    tests_require=["pytest"], | ||
    classifiers=[ | ||
        "Development Status :: 5 - Production/Stable", | ||
        "Intended Audience :: Developers", | ||
        "Intended Audience :: Customer Service", | ||
        "Intended Audience :: Information Technology", | ||
        "License :: OSI Approved :: Apache Software License", | ||
        "Operating System :: OS Independent", | ||
        "Programming Language :: Python :: 3.6", | ||
        "Programming Language :: Python :: 3.7", | ||
        "Programming Language :: Python :: 3.8", | ||
        "Programming Language :: Python :: 3.9", | ||
        "Topic :: Software Development :: Libraries :: Python Modules", | ||
    ], | ||
) |
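`parse_requirements` keeps plain requirement specifiers and drops blank lines, comments, `--find-links` options, and `git+https` entries. The same filter can be sketched on in-memory lines, which makes it easy to check against edge cases such as whitespace-only lines. `filter_requirements` is a hypothetical name, and the sample lines (including the `example.invalid` URL) are made up for illustration.

```python
from typing import Iterable, List


def filter_requirements(lines: Iterable[str]) -> List[str]:
    """Keep plain requirement specifiers; drop blanks, comments and link options."""
    kept: List[str] = []
    for raw in lines:
        line = raw.strip()
        # Strip before testing so "   " counts as blank.
        if not line or line.startswith("#"):
            continue
        if line.startswith("--find-links") or "git+https" in line:
            continue
        kept.append(line)
    return kept


# Made-up requirements content covering each filtered case.
sample = [
    "feedparser",
    "   ",
    "# pinned for tests",
    "--find-links=https://example.invalid/wheels",
    "requests ",
]
requirements = filter_requirements(sample)
```

Working on an iterable of lines rather than a filename keeps the filter testable without touching the file system; `setup.py` could pass it `file.read().splitlines()`.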