
Figure out a full-featured scraping framework #11

Open

lassik opened this issue Aug 21, 2020 · 9 comments

lassik commented Aug 21, 2020

Currently we use a hand-written scraper for these implementations:

  • chez
  • guile
  • kawa
  • larceny
  • loko
  • mit
  • racket
  • scheme48
  • slib
  • tinyscheme
  • ypsilon

It would be nice to extend the scraper generator in listings.scm so it can handle all of the above.

The scraper generator currently writes Unix shell scripts that use curl, tar and grep, but that's just an implementation detail. If Scheme had a more fully developed archive framework, we could just as well skip the shell scripts and run the scrapers directly in Scheme.

The main thing is to discover a specification language for the scrapers.
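As an illustration of what the grep stage of one of those generated shell scripts boils down to, here is a self-contained sketch. The sample input text, the `SRFI [0-9]+` pattern, and the exact pipeline shape are assumptions for illustration, not the actual output of listings.scm; a real script would feed the pipeline from `curl ... | tar` rather than `printf`.

```shell
# Illustrative sketch of a generated scraper's extraction stage.
# Sample text stands in for a downloaded, unpacked manual.
printf 'Supports SRFI 1 and SRFI 13.\nAlso SRFI 1.\n' \
  | grep -oE 'SRFI [0-9]+' \
  | sort -u
# Prints each distinct SRFI mention once: "SRFI 1", "SRFI 13".
```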

lassik commented Aug 21, 2020

The current definitions look like this:

(define schemes

lassik commented Aug 21, 2020

IMHO it's a good idea to scrape the listings from manuals whenever possible. Manuals usually describe the most "official" features of the implementation - they tend to put emphasis on polished features, not experimental ones. So if a SRFI is mentioned in the manual, support is probably reasonably complete.

jpellegrini (Contributor) commented:

> it's a good idea to scrape the listings from manuals whenever possible

Some of the manuals are generated programmatically (STklos is one example). Maybe (I don't know) this could be used, somehow.

lassik commented Aug 21, 2020

Yes - in that case it doesn't matter whether we scrape the source or result. Whichever is easier to do. For STklos we currently scrape doc/skb/srfi.stk with one regexp.
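A one-regexp scrape of a source file like that can be sketched as follows. The sample file contents and the `(srfi N` pattern are made up for illustration; the actual regexp run against doc/skb/srfi.stk may differ.

```shell
# Hypothetical one-regexp scrape over a Scheme source file.
# The sample file stands in for doc/skb/srfi.stk.
cat > sample-srfi.stk <<'EOF'
(srfi 0 "feature-based conditional expansion")
(srfi 1 "list library")
EOF
# Match lines declaring a SRFI, then keep only the number.
grep -oE '^\(srfi [0-9]+' sample-srfi.stk | grep -oE '[0-9]+'
rm sample-srfi.stk
# Prints the supported SRFI numbers, one per line: 0, 1.
```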

erkin commented Nov 5, 2020

I added GitLab support to the scraper generator to integrate Loko and Kawa. I also added support for Snow Fort and chez-srfi, which doesn't depend on Unix shell scripts but depends on external Racket packages. Maybe a compromise can be made.

  • For Racket, we can scrape the racket/srfi repo.
  • Guile and MIT/GNU Scheme are both hosted on Savannah with cgit, so maybe we can add a third Git host to the scraper generator.
  • TinyScheme, Ypsilon, Chez, Scheme48, SLIB and Larceny don't seem like they're going to add new SRFIs any time soon, so we can even remove the existing handwritten scrapers and keep static data.

lassik commented Nov 5, 2020

Fantastic. Thank you very much for continuing to work on this!

erkin commented Nov 6, 2020

No problem! (I had to neglect it for a while though. ^^;)

I added Racket to the scraper generator (although it's not yet complete, see racket/srfi#10). Now only CHICKEN, Guile and MIT/GNU Scheme remain.

erkin commented Nov 8, 2020

CHICKEN down with d25c7b7! Thanks to the work of @diamond-lizard.

lassik commented Nov 11, 2020

> I added Racket to the scraper generator (although it's not yet complete, see racket/srfi#10). Now only CHICKEN, Guile and MIT/GNU Scheme remain.

Excellent work!
