make web scraper accessible #213

Open
matyaskopp opened this issue Dec 19, 2024 · 0 comments
matyaskopp commented Dec 19, 2024

Reimplement a simplified scraping tool and make it available within this repository.
Package it as ParCzech::Scraper, because it will be needed in multiple scripts.

features

  • downloading (ParCzech::Scraper::Down)
    • allow using delays between HTTP requests
    • allow saving the downloaded file (use the same path as the URL)
    • save metadata in a TSV file if available (see "change folder structure", #212)
    • allow using cached data
    • change relative URLs to absolute
  • parsing and processing (ParCzech::Scraper::Parse)
    • preprocess function
    • provide some built-in preprocess functions that can be used (for text polishing, such as character replacements)
    • allow skipping parsing entirely and saving the raw data (e.g. for audio files)
    • traversing data (allow adding context node)
      • string and node results
      • scalar or array context
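
A rough sketch of the downloading and preprocessing behaviour listed above, written in Python only for illustration; it is not the ParCzech::Scraper API. All names here (`url_to_cache_path`, `download`, `absolutize_links`, `polish_text`, the `cache` root directory, the `index.html` default, and the injected `fetch` callable) are assumptions, and the regex-based link rewriting stands in for a proper HTML parser.

```python
import os
import posixpath
import re
import time
from urllib.parse import urljoin, urlsplit


def url_to_cache_path(url, root="cache"):
    """Map a URL to a local path that mirrors the URL (hypothetical layout)."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumed default name for directory-like URLs
    if parts.query:
        # Keep the query in the file name so different queries do not collide.
        path += "_" + re.sub(r"[^A-Za-z0-9._-]", "_", parts.query)
    return posixpath.join(root, parts.netloc, path.lstrip("/"))


def download(url, fetch, root="cache", delay=0.0, use_cache=True):
    """Return the body of `url` as bytes, reusing a saved copy when allowed.

    `fetch` is any callable taking a URL and returning bytes; it is injected
    so that the HTTP client stays an open choice.
    """
    path = url_to_cache_path(url, root)
    if use_cache and os.path.exists(path):
        with open(path, "rb") as fh:
            return fh.read()
    time.sleep(delay)  # pause before every real HTTP request
    data = fetch(url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(data)
    return data


def absolutize_links(html, base_url):
    """Rewrite quoted href/src values to absolute URLs (regex approximation)."""
    def fix(match):
        attr, quote, value = match.group(1), match.group(2), match.group(3)
        return f"{attr}={quote}{urljoin(base_url, value)}{quote}"
    return re.sub(r"\b(href|src)=([\"'])(.*?)\2", fix, html)


def polish_text(text, replacements=None):
    """Built-in style preprocess step: replace characters, tidy whitespace."""
    if replacements is None:
        replacements = {"\u00a0": " "}  # non-breaking space to plain space
    for old, new in replacements.items():
        text = text.replace(old, new)
    return re.sub(r"\s+", " ", text).strip()
```

Passing `use_cache=False` forces a fresh request and overwrites the saved copy; since `download` returns raw bytes, the same call would also cover the "save raw" case for audio files.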
matyaskopp added this to the Code refactorization milestone Dec 19, 2024
matyaskopp self-assigned this Dec 19, 2024