Since I am on semi-hibernation, I find myself working on all kinds of weird projects since I have a ton of extra time on my hands. Today I worked on a completely ridiculous project to save a site that has been archived on the Wayback Machine and then put that site into my self-hosted Kiwix instance.
I am a self-proclaimed prepper and I find myself perusing r/preppers from time to time. About a month ago someone posted a link to a website about prepping and surviving Hurricane Katrina. The site no longer exists except for on Archive.org. It has some good information that I wanted to save locally, just in case the archive disappears.
Before you jump in, know that:
- I’m a weirdo.
- I like to be a prepper.
- I am an unapologetic hardcore data hoarder.
So, over the course of about 3 hours I figured out how to get this site from Archive.org and convert it to a .zim file that is then served from my local Kiwix server.
Getting the site from the Wayback Machine
You can pull the pages from the Wayback Machine using wget or the tool I ended up using, wayback-machine-downloader.
Using wget we can do:
wget --recursive --no-clobber --page-requisites --convert-links --domains web.archive.org --no-parent https://web.archive.org/web/20080913053159if_/http://www.theplacewithnoname.com/blogs/klessons/
Once it completes, this will create a folder in your current working dir with all the files. It worked for me when I ran it after using the next tool.
wayback-machine-downloader
This is a neat tool that does the same thing, with support for rewriting internal links, multithreading, and downloading external assets (if you need them). I ran it in Docker and followed the directions in the repo to build the image.
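For reference, building the image is the usual clone-and-build step. A rough sketch, assuming you have already cloned the wayback-machine-downloader repo and are in its directory (the image tag just needs to match the run command below):
# build the image that the docker run command below expects
docker build -t wayback-machine-downloader .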
Next, I set out to download the site. Run the container with:
docker run -it -v $HOME/docker_config/wayback-downloader/websites:/app/websites wayback-machine-downloader
It will prompt you through the options. The first option sets the URL of the page/site to download. It assumes the data is on archive.org, so we only need the second half of the URL. Here’s the full URL for the site I was downloading:
https://web.archive.org/web/20080913053159if_/http://www.theplacewithnoname.com/blogs/klessons/
But I don’t need https://web.archive.org/web/20080913053159if_/, just http://www.theplacewithnoname.com/blogs/klessons/.
I didn’t need to set a date range, but I did want to rewrite the links to be relative. That way, when I click a link on the archived site it navigates locally instead of opening the full archive.org URL in the browser.
Even then, it didn’t write the URLs correctly and I had to manually edit the index.html file. It set all the paths to go up a directory, but I needed them to stay in the same dir. For example, all the HTML pages were at ./p/00xx.html, but index.html linked to them as ../p/00xx.html. I didn’t need the links to go back a dir, just stay where they were. It was an easy find/replace fix (see the sketch below).
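For what it’s worth, that find/replace can be a one-liner. A sketch, assuming GNU sed and that the only offending links are the ../p/ ones:
# rewrite links like ../p/00xx.html to ./p/00xx.html, editing index.html in place
sed -i 's|\.\./p/|./p/|g' index.html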
Once I got through the prompts, it downloaded the site and was sitting on my machine. I could open it in Firefox/Chrome (after the index.html edits) and navigate it.
Kiwix
I didn’t want the site sitting on my desktop. Instead, I wanted it on my homelab, served by Kiwix. Kiwix is known for being able to serve downloaded copies of Wikipedia and a couple dozen other sites in its library. However, it will serve any .zim file, and you can use openzim/zimit to archive any website you want. I’ve done it for a small collection of sites I want to keep in my data hoard.
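If you haven’t set Kiwix up before, serving a directory of .zim files is basically one command. A minimal sketch with the kiwix-serve binary, assuming your .zim files live in ~/zims (the path and port are placeholders; adjust for your setup):
# serve every .zim in ~/zims on port 8080
kiwix-serve --port 8080 ~/zims/*.zim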
Now that I have a local copy of the site I pulled from Archive.org, I want to convert it to a .zim file and then drop it into the directory Kiwix serves.
To do this, I first needed to serve the local files. I used a simple Python web server. Navigate to the directory with the index.html file and run:
python3 -m http.server 8000
Now you can open a browser and go to http://localhost:8000 and browse the site.
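One note before the next step: because zimit runs in a container (on my homelab in this case), the seed URL has to be an address that container can reach, i.e. the LAN IP of the machine running the Python server rather than localhost. A quick way to find it, assuming a Linux host:
# print this machine's addresses; use the LAN one as IP_ADDR below
hostname -I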
I am running zimit on my homelab, so now I’m going to use it to create a .zim file from the simple web server hosting the site I downloaded. To do so, run:
docker run -v $HOME/docker_config/zimit:/output --shm-size=2g ghcr.io/openzim/zimit zimit --seeds http://IP_ADDR:8000/index.html --name listening_to_katrina
After about 5 minutes it completed and I had a file named listening_to_katrina.zim. All I had to do was move the file to the directory being served by Kiwix, restart Kiwix, and it was ready.
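Spelled out, that last step is something like this. The Kiwix data directory and the container name kiwix are assumptions for this sketch; substitute whatever your setup actually uses:
# move the new archive next to the other .zim files, then restart Kiwix
mv $HOME/docker_config/zimit/listening_to_katrina.zim $HOME/docker_config/kiwix/
docker restart kiwix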
Do what you want with this information 😅.
Thank you for reading! If you would like to comment on this post you can start a conversation on the Fediverse. Message me on Mastodon at @cinimodev@masto.ctms.me. Or, you may email me at blog.discourse904@8alias.com. This is an intentionally masked email address that will be forwarded to the correct inbox.