
Kiwix for wiki? #17

Open
44477744477 opened this issue Jan 22, 2025 · 6 comments

44477744477 commented Jan 22, 2025

Not sure if this is the right place, but it might be a good idea to get a .zim file for at least the wiki, which can be used with Kiwix for easier offline viewing (the wiki is only ~250 MB). Here is a link to a .zim of the wiki (Google Drive; I don't have any better ideas on where to upload it), or you can make one using https://zimit.kiwix.org/#/. Not sure if that would work for the forum itself, though, given the 4 GB / 2 hr limit.

Lisias (Contributor) commented Jan 24, 2025

Dude, thank you very much.

I'm having a somewhat busy week, but I will check this material on the weekend at worst.

One problem with wiki exports is that we also need the images, and I think that 260 MB is too little to hold all of them?

Anyway, I will give this a peek this week for sure.

Lisias self-assigned this Jan 24, 2025
towermom9 commented Jan 24, 2025

yo, I'm testing a Windows app called "cyotek webcopy" and it's been great. Could you maybe take a look at this, @Lisias?

edit: it seems that it also backs up images, and it ignores any website that isn't "https://wiki.kerbalspaceprogram.com/", but it looks pretty barebones

44477744477 (Author) commented Jan 25, 2025

Thank you so much for looking into this. Saving the wiki this way actually does seem to include all the images (at least the ones I checked) embedded in the articles, and some have a higher-resolution version. Given that all of English Wikipedia (including lower-resolution images) in this format is "only" ~100 GB, I think it probably is saving them all.

Also, here is a link to an archive.org copy of the same file. BTW, even though the wiki on zimit (how I made this file) says the files it makes are incompatible with the desktop version of Kiwix, it seems to work fine for me.

This is the page about making .zim files in other ways, but I haven't really messed with any of them because I don't really know what I'm doing, and the forum seems kind of fragile right now, but I might try something later.

One last thing: while this does make viewing/searching stuff relatively easy, I have no idea how easy/hard it is to convert that back into a website (it does look like you can convert zim files to HTML?)

Edit: If anyone knows anything about how the wiki/forum is licensed, I could try to request that the openzim people upload a zim, which would then (I think) get uploaded to the Kiwix library, making it easier for people to find the files. (Asking because I could only find the Take-Two ToS, which I'm somewhat confused about.)

BTW, if anyone wants to use this now, Kiwix (a compatible file reader) is available on Android, iOS, macOS, Windows, Linux, and as a browser extension.

Lisias (Contributor) commented Jan 25, 2025

My tool is scraping the images too - everything is being archived.

Worst-case scenario, we convert the WARC files into ZIM ones - I researched a bit and found libraries to create ZIM files. Since I know WARC files very, very well at this point, converting the formats will not be a problem for me - assuming that someone else hasn't already done it. :)

The Wiki is CC BY-SA 3.0, by the way: https://wiki.kerbalspaceprogram.com/wiki/Category:Wikipedia_copyright

And, finally, there's this link: https://wiki.kerbalspaceprogram.com/wiki/Special:Export

We can use it to properly export the wiki into a format that can be imported back later - I didn't know about this one.

We can use my WARC files to list all possible URLs (since I have scraped them already) and then try to automate the exports using the link above.
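
Something like this sketch could do the listing part, assuming the warcio package (the archive file name below is just a placeholder, not one of my actual files):

from warcio.archiveiterator import ArchiveIterator

# Collect every wiki page URL captured in a WARC archive.
# "ksp-wiki.warc.gz" is a placeholder name for illustration only.
wiki_prefix = "https://wiki.kerbalspaceprogram.com/wiki/"
urls = set()

with open("ksp-wiki.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only archived HTTP responses are interesting here.
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if uri and uri.startswith(wiki_prefix):
            urls.add(uri)

# One URL per line, ready to be turned into page titles for Special:Export.
with open("ksp-wiki-urls.txt", "w") as out:
    out.write("\n".join(sorted(urls)) + "\n")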

Or we can use this dude's solution: https://pastebin.com/P96x8a7F . Cool!

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# Archives the list of KSP Wiki page titles by walking Special:AllPages
def main():
    web_url = "https://wiki.kerbalspaceprogram.com/wiki/Special:AllPages"

    # Update this to your browser's webdriver. I'm using ChromiumEdge because
    # the Snap Chromium webdriver (I'm on Ubuntu) seems to be broken
    driver = webdriver.ChromiumEdge()

    get_pages_recursively(driver, web_url)
    driver.quit()

# Recursive function for loading the next page, iterating through the page
# entries, and then loading the next page again
def get_pages_recursively(driver, web_url):
    driver.get(web_url)

    page_list_element = driver.find_element(By.CLASS_NAME, "mw-allpages-chunk")

    pages_list = []

    for list_item_element in page_list_element.find_elements(By.TAG_NAME, "li"):
        pages_list.append(list_item_element.find_element(By.TAG_NAME, "a").get_attribute("title") + "\n")

    # Append this chunk of page titles to the output text file
    with open("./ksp-wiki-database.txt", "a") as f:
        f.writelines(pages_list)

    # Base case: stop recursing when there is no "Next page" link left
    try:
        next_page = driver.find_element(By.XPATH, "//a[contains(@title, 'Special:AllPages') and contains(text(), 'Next page')]").get_attribute("href")
        get_pages_recursively(driver, next_page)
    except NoSuchElementException:
        return

if __name__ == "__main__":
    main()
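
Feeding the collected titles into Special:Export could then look roughly like the sketch below - assuming the requests package, and assuming this MediaWiki accepts the usual "pages" / "curonly" form fields of Special:Export (I haven't verified that against the live wiki):

import requests

EXPORT_URL = "https://wiki.kerbalspaceprogram.com/wiki/Special:Export"
BATCH_SIZE = 50  # arbitrary batch size, just to keep each request small

# Page titles gathered by the script above, one per line.
with open("./ksp-wiki-database.txt") as f:
    titles = [line.strip() for line in f if line.strip()]

for i in range(0, len(titles), BATCH_SIZE):
    batch = titles[i:i + BATCH_SIZE]
    # Special:Export takes a newline-separated list of titles;
    # "curonly" asks for only the current revision of each page.
    response = requests.post(
        EXPORT_URL,
        data={"pages": "\n".join(batch), "curonly": "1"},
        timeout=60,
    )
    response.raise_for_status()
    with open(f"ksp-wiki-export-{i // BATCH_SIZE:04d}.xml", "wb") as out:
        out.write(response.content)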

By the way, I'm a UNIX guy - I'm prone to prefer tools that run on all platforms, so it's pretty unlikely that I would adopt a Windows-only tool. I can check it and toy with it to gather intelligence, but since my dedicated appliance for scraping is a Raspberry Pi 5, the tool I end up using needs to run on Linux. :)

Lisias (Contributor) commented Jan 29, 2025

@towermom9

yo, I'm testing a Windows app called "cyotek webcopy" and it's been great. Could you maybe take a look at this, @Lisias?

It works, but I prefer to use SiteSucker - mainly because it runs on macOS and I have been using it for years. On Windows, there's HTTrack, a very, very powerful tool. They are pretty old, to tell the truth, but - heck! - they still work for me and do the job.

HOWEVER...

The spider I wrote is pretty effective too, and more versatile - and, well, it works perfectly fine for scraping the Forum, so it would be stupid not to use it on the wiki. The thing just works: fire it up and forget it. It's smart enough to only scrape a page again after a month (or any other period I define), saving bandwidth, and if the page has changed, I archive it together with the older version, so we have a historical footprint of the site.

In the end, since the Wiki's license is highly permissive, it's a matter of using the tool that best suits you.

And we can always "export" the WARC contents into filesystem-hosted files using warcat or a similar tool.
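
As a rough sketch of that kind of export (using the warcio package instead of warcat, and with placeholder file names of my own):

import os
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

# Dump each archived response body to a plain file under export/.
# "ksp-wiki.warc.gz" and "export/" are placeholder names for illustration.
with open("ksp-wiki.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if not uri:
            continue
        # Map the URL path onto a relative path on disk.
        path = urlparse(uri).path.lstrip("/")
        if not path or path.endswith("/"):
            path += "index.html"
        target = os.path.join("export", path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as out:
            out.write(record.content_stream().read())

The links inside those files would still point at https:// URLs, though - which is exactly the rewriting problem I describe below.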

@44477744477

This is the page about making .zim files in other ways, but I haven't really messed with any of them because I don't really know what I'm doing, and the forum seems kind of fragile right now, but I might try something later.

Yep. And I also found this: https://github.com/openzim/warc2zim

So converting "my work" into a ZIM file is possible, making things way easier - I do things "my way" and then export it as a ZIM file.

One last thing: while this does make viewing/searching stuff relatively easy, I have no idea how easy/hard it is to convert that back into a website (it does look like you can convert zim files to HTML?)

It really depends on the site itself. More complex/convoluted sites like the Forum will probably be harder to export into a website, since lots and lots of the content is JavaScript-generated. For that, WARC files appear to be a safer bet, because they store the request and the response with all the metadata, allowing the website to be reproduced perfectly - at least for the browser you were emulating while scraping.

Exporting into files will demand converting URLs to use the filesystem (i.e., from https:// to file://), which is not only (at least technically) a copyright infringement, but will also tamper with the files themselves, potentially causing some bugs.

For example, some CSS relies on fonts from fonts.googleapis.com. Every single CSS (or JavaScript) file that accesses fonts.googleapis.com would need to be rewritten to take them from your hard disk instead. When using WARC this is not needed, because these archives simulate the HTTP request and response, completely fooling the browser into believing it's fetching data from the Internet instead of a local file.

I haven't studied exactly how ZIM files work, but I suspect it must be something similar, just without the need for an HTTP server (local or remote) to serve the content to your browser.
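
For what it's worth, reading content back out of a ZIM looks straightforward with the python-libzim bindings - the sketch below assumes that package and a hypothetical entry path, since the path layout depends on the tool that produced the ZIM (I haven't tested this against the file linked above):

from libzim.reader import Archive

# "ksp-wiki.zim" is a placeholder name for the archive created earlier.
zim = Archive("ksp-wiki.zim")

# Hypothetical entry path; zimit-style ZIMs tend to keep the original URL paths.
path = "wiki/Kerbin"

if zim.has_entry_by_path(path):
    entry = zim.get_entry_by_path(path)
    html = bytes(entry.get_item().content).decode("utf-8")
    # From here on it is ordinary HTML that can be written back to disk.
    with open("Kerbin.html", "w", encoding="utf-8") as out:
        out.write(html)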

44477744477 (Author) commented Jan 30, 2025

Sorry, I've been kinda busy, but I'll try to look at this properly over the weekend.

I also went ahead and requested that a zim of the wiki get added to the library.

So converting "my work" into a ZIM file is possible, making things way easier - I do things "my way" and then export it as a ZIM file.

This does seem like a better idea.
