Adding support for Reddit RSS reader
lalitpagaria committed Jan 30, 2021
1 parent a226ddc commit 00293fd
Showing 10 changed files with 354 additions and 3 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,36 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python 3.7
      uses: actions/setup-python@v2
      with:
        python-version: 3.7

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
35 changes: 35 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,35 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.6'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Publish to PyPI
      if: github.event_name != 'pull_request'
      env:
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*
7 changes: 7 additions & 0 deletions .gitignore
@@ -127,3 +127,10 @@ dmypy.json

# Pyre type checker
.pyre/
/.idea/.gitignore
/.idea/app_store_reviews_reader.iml
/.idea/misc.xml
/.idea/modules.xml
/.idea/inspectionProfiles/profiles_settings.xml
/.idea/inspectionProfiles/Project_Default.xml
/.idea/vcs.xml
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2021 Lalit Pagaria

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
93 changes: 91 additions & 2 deletions README.md
@@ -1,2 +1,91 @@
# reddit-rss-reader
Wrapper around Reddit RSS feed
<p align="center">
    <a href="https://github.com/lalitpagaria/reddit-rss-reader/blob/master/LICENSE">
        <img alt="License" src="https://img.shields.io/github/license/lalitpagaria/reddit-rss-reader?color=blue">
    </a>
    <a href="https://pypi.org/project/reddit-rss-reader">
        <img src="https://img.shields.io/pypi/pyversions/reddit-rss-reader" alt="PyPI - Python Version" />
    </a>
    <a href="https://pypi.org/project/reddit-rss-reader/">
        <img alt="Release" src="https://img.shields.io/pypi/v/reddit-rss-reader">
    </a>
    <a href="https://pepy.tech/project/reddit-rss-reader">
        <img src="https://pepy.tech/badge/reddit-rss-reader/month" alt="Downloads" />
    </a>
    <a href="https://github.com/lalitpagaria/reddit-rss-reader/commits/master">
        <img alt="Last commit" src="https://img.shields.io/github/last-commit/lalitpagaria/reddit-rss-reader">
    </a>
</p>

# Reddit RSS Reader
This is a wrapper around the publicly available Reddit RSS feeds. It can be used to fetch content from the front page, a subreddit, all comments of a subreddit, all comments of a particular post, comments by a given Reddit user, search pages, and more. For more details about the RSS feeds Reddit provides, refer to these links: [link1](https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/) and [link2](https://www.reddit.com/wiki/rss).

*Note: These feeds are rate limited, hence they can only be used for testing purposes. For serious scraping, register your bot at [apps](https://www.reddit.com/prefs/apps/) to get client credentials and use them with [PRAW](https://github.com/praw-dev/praw).*


## Installation

Install via PyPI:
```shell
pip install reddit-rss-reader
```
Install from master branch (if you want to try the latest features):
```shell
git clone https://github.com/lalitpagaria/reddit-rss-reader
cd reddit-rss-reader
pip install --editable .
```

## How to use
`RedditRSSReader` requires a feed URL; refer to this [link](https://www.reddit.com/wiki/rss) to generate one. For example, to fetch all comments on the subreddit `r/wallstreetbets` -
```shell
https://www.reddit.com/r/wallstreetbets/comments/.rss?sort=new
```
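Other feed variants follow the same URL pattern; a small sketch of composing them (the subreddit name is illustrative, and the patterns follow the Reddit RSS wiki linked above):

```python
# Illustrative feed URL patterns; see the Reddit RSS wiki for the full list.
base = "https://www.reddit.com"
subreddit = "wallstreetbets"  # any subreddit name

feed_urls = {
    "new posts": f"{base}/r/{subreddit}/new/.rss?sort=new",
    "all comments": f"{base}/r/{subreddit}/comments/.rss?sort=new",
}
print(feed_urls["all comments"])
```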

Now you can run the following [example](https://github.com/lalitpagaria/reddit-rss-reader/tree/master/example) -
```python
import pprint
from datetime import datetime, timedelta

import pytz

from reddit_rss_reader.reader import RedditRSSReader


reader = RedditRSSReader(
    url="https://www.reddit.com/r/wallstreetbets/comments/.rss?sort=new"
)

# To consider comments entered in the past 5 days only
since_time = datetime.utcnow().astimezone(pytz.utc) + timedelta(days=-5)

# fetch_content will fetch all contents if no parameters are passed.
# If `after` is passed then it will fetch contents after this date.
# If `since_id` is passed then it will fetch contents after this id.
reviews = reader.fetch_content(
    after=since_time
)

pp = pprint.PrettyPrinter(indent=4)
for review in reviews:
    pp.pprint(review.__dict__)
```
The reader returns a list of `RedditContent` objects, which have the following fields (`extracted_text` and `image_alt_text` are extracted from the Reddit content via `BeautifulSoup`) -
```python
@dataclass
class RedditContent:
    title: str
    link: str
    id: str
    content: str
    extracted_text: Optional[str]
    image_alt_text: Optional[str]
    updated: datetime
    author_uri: str
    author_name: str
    category: str
```
The output is encoded in UTF-8. If you are scraping non-English subreddits, set the environment to use UTF-8 -
```shell
export LANG=en_US.UTF-8
export PYTHONIOENCODING=utf-8
```
26 changes: 26 additions & 0 deletions example/example.py
@@ -0,0 +1,26 @@
import pprint
from datetime import datetime, timedelta

import pytz

from reddit_rss_reader.reader import RedditRSSReader


reader = RedditRSSReader(
    url="https://www.reddit.com/r/wallstreetbets/new/.rss?sort=new"
    # url="https://www.reddit.com/r/wallstreetbets/comments/l84ner/for_those_who_have_been_around_for_a_while_what/.rss?sort=new"
)

# To consider comments entered in the past 5 days only
since_time = datetime.utcnow().astimezone(pytz.utc) + timedelta(days=-5)

# fetch_content will fetch all contents if no parameters are passed.
# If `after` is passed then it will fetch contents after this date.
# If `since_id` is passed then it will fetch contents after this id.
reviews = reader.fetch_content(
    after=since_time
)

pp = pprint.PrettyPrinter(indent=4)
for review in reviews:
    pp.pprint(review.__dict__)
Empty file added reddit_rss_reader/__init__.py
Empty file.
82 changes: 82 additions & 0 deletions reddit_rss_reader/reader.py
@@ -0,0 +1,82 @@
import logging
from dataclasses import dataclass
from datetime import datetime
from time import mktime
from typing import List, Optional

import feedparser
import requests
from bs4 import BeautifulSoup
from feedparser import FeedParserDict

logger = logging.getLogger(__name__)


@dataclass
class RedditContent:
    title: str
    link: str
    id: str
    content: str
    extracted_text: Optional[str]
    image_alt_text: Optional[str]
    updated: datetime
    author_uri: str
    author_name: str
    category: str


@dataclass
class RedditRSSReader:
    # For a proper url
    # Refer https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/
    url: str

    def fetch_content(self, after: Optional[datetime] = None, since_id: Optional[int] = None) -> List[RedditContent]:
        contents: List[RedditContent] = []
        feed = self.fetch_feed(self.url)

        # Entries are sorted newest first, so stop at the first entry that is
        # older than `after` or that matches the already-seen `since_id`
        for entry in feed.entries:
            if after is not None and after.timetuple() > entry.updated_parsed:
                break

            if since_id is not None and since_id == entry.id:
                break

            try:
                soup = BeautifulSoup(entry.summary, "html.parser")

                # Collect alt text from any inline images (empty list if none)
                image_alt_texts = [x['alt'] for x in soup.find_all('img', alt=True)]

                contents.append(
                    RedditContent(
                        link=entry.link,
                        id=entry.id,
                        title=entry.title,
                        content=entry.summary,
                        extracted_text=soup.get_text(),
                        image_alt_text=". ".join(image_alt_texts),
                        updated=datetime.fromtimestamp(mktime(entry.updated_parsed)),
                        author_name=entry.author_detail.name,
                        author_uri=str(entry.author_detail.href),
                        category=entry.category
                    )
                )
            except Exception:
                logger.error(f'Error parsing entry={entry}', exc_info=True)

        return contents

    @staticmethod
    def fetch_feed(feed_url: str) -> FeedParserDict:
        # On macOS, parsing https feed urls directly does not work, hence this workaround
        # Refer https://github.com/uvacw/inca/issues/162
        if feed_url.startswith("https://"):
            feed_content = requests.get(feed_url)
            feed = feedparser.parse(feed_content.text)
        else:
            feed = feedparser.parse(feed_url)

        return feed
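The date filter in `fetch_content` works because `time.struct_time` values compare element-wise (year, month, day, ...) like tuples. A minimal stdlib sketch of that comparison and the timestamp conversion, using made-up dates:

```python
from datetime import datetime
from time import mktime

# `after.timetuple()` and feedparser's `entry.updated_parsed` are both
# time.struct_time values, which compare element-wise like tuples.
after = datetime(2021, 1, 25)
entry_updated = datetime(2021, 1, 28).timetuple()  # stand-in for entry.updated_parsed

# The entry is newer than `after`, so fetch_content keeps it (no break)
is_older = after.timetuple() > entry_updated
print(is_older)  # False

# This mirrors how `updated` is converted back to a datetime in the reader
print(datetime.fromtimestamp(mktime(entry_updated)))
```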
3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
feedparser
beautifulsoup4
requests
73 changes: 73 additions & 0 deletions setup.py
@@ -0,0 +1,73 @@
import pathlib
from io import open

from setuptools import find_packages, setup


def parse_requirements(filename):
    with open(filename) as file:
        lines = file.read().splitlines()

    return [
        line.strip()
        for line in lines
        if not (
            (not line) or
            (line.strip()[0] == "#") or
            (line.strip().startswith('--find-links')) or
            ("git+https" in line)
        )
    ]


def get_dependency_links(filename):
    with open(filename) as file:
        lines = file.read().splitlines()

    return [
        line.strip().split('=')[1]
        for line in lines
        if line.strip().startswith('--find-links')
    ]
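A quick sketch of what these two helpers keep and drop, mirroring their filtering logic on an in-memory sample (the file names and URLs here are hypothetical):

```python
# Sample requirements lines (hypothetical) filtered with the same logic
# as parse_requirements/get_dependency_links above.
sample_lines = [
    "feedparser",
    "# a comment",
    "",
    "--find-links=https://example.com/wheels",
    "git+https://github.com/example/pkg.git",
]

requirements = [
    line.strip()
    for line in sample_lines
    if line
    and line.strip()[0] != "#"
    and not line.strip().startswith("--find-links")
    and "git+https" not in line
]
dependency_links = [
    line.strip().split("=")[1]
    for line in sample_lines
    if line.strip().startswith("--find-links")
]

print(requirements)       # ['feedparser']
print(dependency_links)   # ['https://example.com/wheels']
```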


dependency_links = get_dependency_links('requirements.txt')
parsed_requirements = parse_requirements('requirements.txt')

# The directory containing this file
HERE = pathlib.Path(__file__).parent

# The text of the README file
README = (HERE / "README.md").read_text()


setup(
    name="reddit-rss-reader",
    version="1.0",
    author="Lalit Pagaria",
    author_email="[email protected]",
    description="A wrapper around Reddit RSS feed",
    long_description=README,
    long_description_content_type="text/markdown",
    license="Apache Version 2.0",
    url="https://github.com/lalitpagaria/reddit-rss-reader",
    packages=find_packages(exclude=["*.tests", "*.tests.*", "tests.*", "tests"]),
    dependency_links=dependency_links,
    install_requires=parsed_requirements,
    include_package_data=True,
    python_requires=">=3.6.0",
    tests_require=["pytest"],
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "Intended Audience :: Customer Service",
        "Intended Audience :: Information Technology",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
)
