Error 403 when scraping Substack from an Actions runner #56

Open
chriscarrollsmith opened this issue Jan 27, 2024 · 2 comments

@chriscarrollsmith

Scraping Substack with extractus works on a home PC but fails from a GitHub Actions runner. For reasons I don't fully understand, Substack began returning Error 403: Forbidden at 7 PM EST on January 15, 2023. Here is a workflow that reproduces the problem:

name: Fetch RSS Feed

on:
  push:
    branches:
      - main

jobs:
  fetch-rss:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Fetch RSS Feed
      uses: Promptly-Technologies-LLC/rss-fetch-action@v2
      with:
        feed_url: https://knowledgeworkersguide.substack.com/feed
        file_path: ./feed.json
        parser_options: "{\"useISODateFormat\": false, \"getExtraEntryFields\": \"(feedEntry) => { return { 'content:encoded': feedEntry['content:encoded'] || '' }; }\"}"
        fetch_options: "{}"
        remove_published: true
    
    - name: Commit and push changes to repository
      uses: stefanzweifel/git-auto-commit-action@v4
      with:
        commit_message: 'Update RSS feed'
        file_pattern: '*.json'

I have tried adding custom headers, but without success.
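
For reference, passing custom headers through `fetch_options` might look like the step below. This is only a sketch: it assumes the action parses `fetch_options` as JSON and forwards it to extractus as its fetch options, which accept a `headers` object. The header values are illustrative, not known-good. Only the `fetch_options` line differs from the workflow above, and this kind of change did not resolve the 403.

```yaml
    - name: Fetch RSS Feed
      uses: Promptly-Technologies-LLC/rss-fetch-action@v2
      with:
        feed_url: https://knowledgeworkersguide.substack.com/feed
        file_path: ./feed.json
        parser_options: "{\"useISODateFormat\": false, \"getExtraEntryFields\": \"(feedEntry) => { return { 'content:encoded': feedEntry['content:encoded'] || '' }; }\"}"
        # Assumption: this JSON is passed through to the underlying fetch call.
        # The User-Agent and Accept values are illustrative placeholders.
        fetch_options: "{\"headers\": {\"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64)\", \"Accept\": \"application/rss+xml, application/xml\"}}"
        remove_published: true
```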

@chriscarrollsmith

chriscarrollsmith commented Jan 27, 2024

Note that this is not an issue with extractus. Version 1 of the rss-fetch-action, which used isomorphic-fetch, also fails:

      - name: Fetch RSS Feed
        uses: Promptly-Technologies-LLC/rss-fetch-action@v1
        with:
          feed_url: https://babafaqirchand.substack.com/feed
          file_path: ./src/components/ui/RssFeed.json
          remove_last_build_date: true

I have also tried a Windows runner rather than an Ubuntu runner, but still got the same Error 403.

@chriscarrollsmith

Honestly, it seems like Substack may have specifically blocked GitHub Actions runners for some reason. I'm not sure why they would do this (maybe intellectual-property concerns about Substack content appearing on GitHub, or abusive high-frequency requests?) or how they would go about it (some kind of CORS or IP-address blocking?), but it's my current best guess.
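
One way to test this guess is to request the feed from the runner with plain `curl`, once with the default user agent and once with a browser-like one, and compare the status codes. The step below is a sketch; the step name and the user-agent string are illustrative. If both requests return 403 on the runner while the same commands succeed from a home PC, the block is more likely keyed on the runner's IP range than on request headers. CORS is enforced by browsers, so on its own it would not explain a 403 on a server-side request.

```yaml
    - name: Probe Substack feed status
      run: |
        url="https://knowledgeworkersguide.substack.com/feed"
        # Request with curl's default user agent; print only the HTTP status code.
        curl -s -o /dev/null -w "default UA: %{http_code}\n" "$url"
        # Same request with a browser-like user agent (illustrative value).
        curl -s -o /dev/null -w "browser UA: %{http_code}\n" -A "Mozilla/5.0 (X11; Linux x86_64)" "$url"
```

Because `curl` is run without `--fail`, a 403 response does not fail the step, so both status codes appear in the job log.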
