Kiwix for wiki? #17
Comments
Dude, thank you very much. I'm having a somewhat busy week, but I will check this material on the weekend at the latest. One problem with wiki exports is that we also need the images, and I think 260 MB is too small to hold all of them? Anyway, I will give this a peek this week for sure.
Yo, I'm testing a Windows app called "Cyotek WebCopy" and it's been great. Could you maybe take a look at this, @Lisias? Edit: it seems that it also backs up images, and it ignores any website that isn't "https://wiki.kerbalspaceprogram.com/", but it looks so barebones.
Thank you so much for looking into this. Saving the wiki this way actually does seem to include all the images (at least the ones I checked) embedded in the articles, and some have a higher-resolution version. Given that all of English Wikipedia (including lower-resolution images) in this format is "only" ~100 GB, I think it probably is saving them all. Also, here is a link to an archive.org copy of the same file. BTW, even though the wiki on zimit (how I made this file) says the files it makes are incompatible with the desktop version of Kiwix, it seems to work fine for me. This is the page about making .zim files other ways, but I haven't really messed with any of them because I don't really know what I'm doing, and the forum seems kind of fragile right now, but I might try something later. One last thing: while this does make viewing and searching relatively easy, I have no idea how easy or hard it is to convert that back into a website (it does look like you can convert zim files to HTML?). Edit: If anyone knows anything about how the wiki/forum is licensed, I could try to request that the openzim people upload a zim, which would then (I think) get uploaded to the Kiwix library, making it easier for people to find the files. (Asking because I could only find the Take-Two ToS, which I'm somewhat confused about.) BTW, if anyone wants to use this now, Kiwix (a compatible file reader) is available on Android, iOS, Mac, Windows, Linux, and as a browser extension.
My tool is scraping the images too - everything is being archived. Worst case scenario, we convert the The Wiki is CC BY-SA 3.0, by the way: https://wiki.kerbalspaceprogram.com/wiki/Category:Wikipedia_copyright And, finally, there's this link: https://wiki.kerbalspaceprogram.com/wiki/Special:Export We can use it to properly export the wiki into a format that can be imported back later - I didn't know about this one. We can use my WARC files to list all possible URLs (since I have already scraped them) and then try to automate the exports using the link above. Or we can use this dude's solution! : https://pastebin.com/P96x8a7F . Cool!
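A rough sketch of the "list the URLs, then automate the exports" idea, in Python. It only builds the export URLs and does not download anything. It assumes articles live under `/wiki/<Title>` (as in the links above) and that this wiki accepts the standard MediaWiki subpage form `Special:Export/<Title>`; the function name `export_urls` is made up for illustration, and reading the URL list out of the WARC files is left out.

```python
from urllib.parse import quote, unquote, urlsplit

WIKI_HOST = "wiki.kerbalspaceprogram.com"
ARTICLE_PREFIX = "/wiki/"  # assumption: article URLs look like /wiki/<Title>


def export_urls(scraped_urls):
    """Map scraped article URLs to Special:Export URLs (hypothetical helper).

    Keeps first-seen order, drops duplicates, foreign hosts and Special: pages.
    """
    seen = set()
    result = []
    for url in scraped_urls:
        parts = urlsplit(url)
        if parts.netloc != WIKI_HOST or not parts.path.startswith(ARTICLE_PREFIX):
            continue
        title = unquote(parts.path[len(ARTICLE_PREFIX):])
        # Empty titles and special pages are not exportable articles.
        if not title or title.startswith("Special:"):
            continue
        if title in seen:
            continue
        seen.add(title)
        # Assumption: Special:Export/<Title> returns the page as XML.
        encoded = quote(title, safe="/:")
        result.append(f"https://{WIKI_HOST}{ARTICLE_PREFIX}Special:Export/{encoded}")
    return result
```

Each resulting URL could then be fetched slowly, one at a time, so the export run stays gentle on the server.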
By the way, I'm a UNIX guy - I tend to prefer tools that run on all platforms, so it's pretty unlikely that I would adopt a Windows-only tool. I can check it and toy with it to gather intelligence, but since my dedicated appliance for scraping is a Raspberry Pi 5, the tool I use needs to run on Linux. :)
It works, but I prefer to use SiteSucker - mainly because it works on macOS and I've been using it for years. On Windows, there's HTTrack, a very, very powerful tool. They are pretty old, to tell the truth, but - heck! - they still work for me and do the job. HOWEVER... The spider I wrote is pretty effective too and more versatile, and it works perfectly fine for scraping the Forum, so it would be stupid not to use it on the wiki. The thing just works - fire it up and forget. It's smart enough to scrape a page again only after a month (or any other period I define), saving bandwidth, and if the page has changed, I archive it together with the older version, so we have a historical footprint of the site. In the end, since the Wiki's license is highly permissive, it's a matter of using the tool that best suits you. And we can always "export" the
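For anyone curious what "scrape again after a month, and keep the old version if it changed" could look like, here is a minimal Python sketch. It is not the actual spider's code: the function names, the 30-day default and the in-memory `history` list are all assumptions for illustration.

```python
import hashlib
from datetime import datetime, timedelta


def needs_refetch(last_fetched, now, max_age=timedelta(days=30)):
    """True if the page was never fetched or its copy is at least max_age old."""
    return last_fetched is None or now - last_fetched >= max_age


def record_version(history, body, fetched_at):
    """Append (timestamp, sha256, body) only when the content has changed.

    `history` is a plain list standing in for whatever storage is really used.
    Returns True when a new version was stored.
    """
    digest = hashlib.sha256(body).hexdigest()
    if history and history[-1][1] == digest:
        return False  # unchanged since the last visit, nothing new to archive
    history.append((fetched_at, digest, body))
    return True
```

Comparing hashes instead of full bodies keeps the "did it change?" check cheap, and the older entries in `history` are the historical footprint.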
Yep. And I also found this: https://github.com/openzim/warc2zim So converting "my work" into a
It really depends on the site itself. More complex or convoluted sites like the Forum will probably be harder to export into a website, since lots and lots of the content is generated by JavaScript. On this, exporting into files will require converting URLs to use the filesystem (i.e., from For example, some CSS relies on fonts from I haven't studied exactly how
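To make the "convert URLs to use the filesystem" part concrete, here is a small Python sketch of one possible mapping. The layout (`mirror/<host>/<path>`, `index.html` for directory-like URLs, `@query` suffixes) is purely an assumption and not what any of the tools mentioned in this thread actually do.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def url_to_local_path(url, root="mirror"):
    """Map a URL to a relative file path under `root` (hypothetical layout)."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Directory-like URLs need a file name to live under.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty and dot segments so the result cannot escape `root`.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    local = PurePosixPath(root, parts.netloc, *segments)
    if parts.query:
        # Keep query-string variants of the same path as separate files.
        safe_query = parts.query.replace("/", "_").replace("&", "_").replace("=", "-")
        local = local.with_name(f"{local.name}@{safe_query}")
    return str(local)
```

Links inside the saved HTML and CSS would still have to be rewritten to point at these paths, which is where third-party resources such as externally hosted fonts get awkward.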
Sorry, I've been kinda busy, but I'll try to look at this properly over the weekend. I also did go ahead and request that a zim of the wiki get added to the library.
This does seem like a better idea.
Not sure if this is the right place, but it might be a good idea to get a .zim file for at least the wiki, which can be used with kiwix for easier offline viewing. (wiki is only ~250 MB). Here is a link to a .zim of the wiki (google drive, don't have any better ideas on where to upload that), or you can make one using https://zimit.kiwix.org/#/. Not sure if that would work for the forum itself though, given the 4GB/2hr limit.