Resurrecting www.alexstjohn.com
What?
www.alexstjohn.com was the personal blog/website of Alex St. John, from his bio on the site: "Best known for his early work on gaming and creating the DirectX media platform at Microsoft in the early 1990’s, Alex St. John later founded WildTangent Inc."
He is a somewhat controversial, opinionated figure, latterly known for holding abhorrent views about millennials, minorities and women in tech, which even prompted a rebuttal from his own daughter.
For unknown reasons at some point after June 2017 he deleted his blog, and also had it removed from the Internet Archive.
Why?
Despite everything, his website was a treasure trove of information, stories and files relating to the early history of DirectX, which is of great interest to graphics programmers and retro computing enthusiasts.
Prompted by a question on vogons.org asking if anyone had a copy of the site following its deletion from the internet, I did some cursory searches and found that it hadn't been completely purged from the Internet Archive after all. Google had indexed two Web Archive (WARC) files in a collection of contributions to the site from 2016 and 2017, both around 8GB in size.
Having found them I downloaded a copy of each, and set to investigating whether they contained the content that was being sought in the original question.
How?
A Web Archive is not like a zip archive of a site; rather, it is a record of the crawl containing the HTTP requests, responses and content for every URL visited. The question became how to view the contents of the archive.
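To give an idea of the structure, here's a minimal Python sketch (using the warcio library, shown purely for illustration - it isn't one of the tools mentioned below) that lists every captured response in an archive:

```python
from warcio.archiveiterator import ArchiveIterator

# Print the URL and content type of every response record in the archive.
# The filename here is illustrative.
with open('alexstjohn.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            content_type = record.http_headers.get_header('Content-Type')
            print(url, content_type)
```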
I initially came across WebRecorder, and its replacement ReplayWeb.page. I tried WebRecorder first; it could read the .warc file, and I was able to confirm that it did contain the requested content. However it was read-only, slow and didn't have any apparent ability to save the page.

I couldn't get ReplayWeb.page to work - it would load the .warc file, but then I couldn't find the content I knew was in it. Not willing to spend too much time on it, I moved on to searching for ways to extract content from .warc files instead.
This led to trying several different projects, but ultimately the most promising was Warcat. This would export the content of the web archive but fell over around 81,000 requests in.
I traced this down to the code expecting the request header in the archive to have Windows-standard line terminators (\r\n), whereas the failing record had bare newlines (\n) for some reason. Despite never having done any Python, I hacked in a check and that seemed to solve the problem and allow all files to be extracted.
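The kind of tolerance check involved, as a sketch (not the actual Warcat patch):

```python
def split_header_lines(raw_header):
    # Header lines normally end with \r\n, but a handful of records in this
    # archive used bare \n, so accept either form when splitting.
    if b'\r\n' in raw_header:
        return raw_header.split(b'\r\n')
    return raw_header.split(b'\n')
```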
This produced a huge mass of folders and files (~112k in total), many of which had no file extensions. At the root were folders for all the hosts in the .warc - all 2056 of them! From the WebRecorder experiment I knew where to find the required content and located it very quickly. However the page was called _index_da39a3 (a 6 character hash, as it turns out), and when the page was renamed to index.html and opened it displayed the text content but with no styling.
Investigating the page source revealed that all URLs were absolute, e.g. https://www.alexstjohn.com/path/to/content, rather than relative, e.g. path/to/content.
Bespoke solution
The obvious solution was to create a tool to traverse all the content and rewrite any URLs to be relative. This would allow the site to be hosted anywhere and/or browsed offline.
The first problem, as previously mentioned, was that many of the extracted files had no file extension, which would complicate serving them. Also, many files were named in an amended form to account for invalid characters in filenames and to differentiate query string parameters. For example a call to:
https://fonts.googleapis.com/css?family=Droid%20Sans%3A400,700&ver=4.8-RC2-40878
was written to the file system as
<extract folder>\fonts.googleapis.com\css_family=Droid%20Sans%3A400,700&ver=4.8-RC2-40878
Therefore it would first be necessary to determine the extension of every file before any processing could begin. Likewise many PDF files and archives for download were saved with encoded characters in the filename due to how they were linked from the html.
My plan was to do this in two phases in the tool - the first would identify and rename all files to have the correct extension, the second would process all the files likely to contain URLs and rewrite them.
Phase one - identifying the file content
The first phase proved trivial enough to implement. After enumerating all files it would load each one and attempt to identify it. This started with trying to match the first few bytes against known file type signatures and returning the associated extension.
If that failed it would move on to examining the first 64 bytes of the file for text common to an extension - <html for html files, <svg and so on. If it still couldn't determine the type it would switch to running regular expressions on the whole content. As html could contain inline css and javascript they were run in that order. Whilst the regular expressions were quite crude they proved effective enough.
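In outline the identification logic was along these lines (a simplified sketch rather than the actual code; identify_by_regex stands in for the crude content regexes):

```python
MAGIC_SIGNATURES = {               # first-bytes signatures -> extension
    b'\x89PNG\r\n\x1a\n': 'png',
    b'\xff\xd8\xff': 'jpg',
    b'GIF8': 'gif',
    b'%PDF': 'pdf',
    b'PK\x03\x04': 'zip',
}

TEXT_HINTS = {                     # looked for in the first 64 bytes
    b'<!doctype html': 'html',
    b'<html': 'html',
    b'<svg': 'svg',
    b'<?xml': 'xml',
}

def identify(path):
    with open(path, 'rb') as f:
        content = f.read()
    for signature, ext in MAGIC_SIGNATURES.items():
        if content.startswith(signature):
            return ext
    head = content[:64].lower()
    for hint, ext in TEXT_HINTS.items():
        if hint in head:
            return ext
    # Last resort: run (crude) regular expressions over the whole content,
    # in html -> css -> js order since html can embed the other two.
    return identify_by_regex(content)   # hypothetical helper, not shown
```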
However further down the line I did come across some mis-identified files; for example a javascript file containing DOM manipulation using <html or other tags would get flagged as html. To bypass this issue I inserted a check, before the text content check, that scans the file name for a known extension followed by _, as this was how the querystring ? was represented in the file system due to being an invalid filename character. If found, the extension from within the filename was used.
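A sketch of that check (the list of extensions here is illustrative):

```python
KNOWN_EXTENSIONS = ('css', 'js', 'json', 'png', 'jpg', 'gif', 'svg', 'woff')

def extension_from_filename(name):
    # The '?' that begins a query string was written to disk as '_', so a
    # request like /css?family=... ends up as "css_family=...". If a known
    # extension appears immediately before that underscore, trust it.
    for ext in KNOWN_EXTENSIONS:
        if ext + '_' in name.lower():
            return ext
    return None
```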
As mentioned earlier the index files were named with a 6 character alphanumeric hash, apparently to ensure uniqueness. I took the opportunity to recognise the pattern and rename them to index.html.
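Recognising the pattern is a one-liner (again just a sketch):

```python
import re

# Names like "_index_da39a3": "index" plus a 6 character hash for uniqueness.
INDEX_PATTERN = re.compile(r'^_?index_[0-9a-z]{6}$', re.IGNORECASE)

def canonical_name(name):
    return 'index.html' if INDEX_PATTERN.match(name) else name
```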
Throughout this scanning and renaming process I maintained a dictionary mapping each original filename to its new name as files were renamed in the file system.
Phase two - rewriting the URLs
The next step was to take all the files that might contain absolute URLs and rewrite them. I limited this to html, css and js files.
However, identifying absolute URLs would prove trickier than expected. I was also aware that I didn't want to blanket-update all URLs: only those that referenced files that were part of the .warc, and only where they were referenced in tags or scripts - any URL given in the page text should retain the original full URL.
The basic algorithm I came up with (a rough code sketch follows the list):
1. for every enumerated file
2. set characterIndex to zero
3. run the URL regular expression
4. for every match found
5. try mapping the URL to the file system
6. get the filename from the filesystem, or the renames dictionary
7. if found convert the filename to relative from the current file's location
8. grab the original text content from the characterIndex to the match's starting character into a string builder
9. convert the relative file path to a URL
10. write the relative path into the string builder
11. update characterIndex to the end of the match in the original file
12. if a URL was replaced
13. write the text content from the characterIndex to the end of the file into the string builder
14. write the string builder content to the file
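A simplified Python sketch of that loop (not the actual tool; map_url_to_file is a hypothetical helper that resolves an absolute URL to a path in the extracted tree, consulting the renames dictionary, or returns None):

```python
import os

def rewrite_file(path, url_pattern, map_url_to_file, renames):
    with open(path, encoding='utf-8', errors='replace') as f:
        content = f.read()

    pieces = []            # stands in for the string builder
    character_index = 0    # step 2
    replaced = False

    for match in url_pattern.finditer(content):                      # steps 3-4
        target = map_url_to_file(match.group(0), renames)            # steps 5-6
        if target is None:
            continue                      # not in the archive: leave the URL alone
        relative = os.path.relpath(target, os.path.dirname(path))    # step 7
        pieces.append(content[character_index:match.start()])        # step 8
        pieces.append(relative.replace(os.sep, '/'))                 # steps 9-10
        character_index = match.end()                                # step 11
        replaced = True

    if replaced:                                                      # step 12
        pieces.append(content[character_index:])                      # step 13
        with open(path, 'w', encoding='utf-8') as f:                  # step 14
            f.write(''.join(pieces))
```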
My initial attempt at the regular expression proved surprisingly successful at getting the context right, but it didn't handle everything, and it very obviously had an issue with links to downloadable files containing brackets in the filename, due to the handling for when the URL was unquoted - in css with url() or in function calls in javascript.
Some minor tweaks, albeit ones that significantly increased the complexity and runtime of the regex with look-behinds, resolved those issues.
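To give a flavour of the approach (a cut-down illustration, not the actual expression), the quoted and unquoted contexts can be distinguished with look-behinds:

```python
import re

# Cut-down illustration, not the regex from the tool. URLs quoted in html
# attributes or css/js strings may legitimately contain brackets in the
# filename, so they end at the matching quote; an unquoted css url(...)
# value has to stop at the closing parenthesis instead.
ABSOLUTE_URL = re.compile(
    r'(?:(?<=")https?://[^"\s]+'       # inside double quotes
    r"|(?<=')https?://[^'\s]+"         # inside single quotes
    r'|(?<=url\()https?://[^)\s]+)'    # unquoted url( ... ) value
)
```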
It also became obvious that not all captured sites in the .warc actually worked, particularly embedded youtube videos. I also noted that the extracted content was around 9.5GB, of which at least 2GB was cached video files of various dimensions from foxnews.com and theguardian.com among others, apparently unrelated to the site. I made the decision to delete these hosts before processing - presumably the content would still be available online, or via the Internet Archive if required, and if not found in the file system the URLs would be left untouched.
At this point it was largely possible to navigate around the site locally, with styling and images loading, but there were still some glaring issues. The dropdowns on every page for categories and archives both navigated to URLs that hadn't been captured.
My solution for the categories was to replace the html for the select with relative links to the actual category folders that did exist rather than the numbered folders it was captured using. Although in hindsight I could also have implemented this with some rewrites in the web server configuration.
The archive dropdown was more awkward, but was resolved by extending step 6 in the algorithm to look in the file system folder for an index.html file and, if not found, generate a gen_index.html file which just contains a basic list of links to the index.html files found in subfolders of the folder pointed at.
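Generating those pages is simple enough; a sketch:

```python
import os

def ensure_index(folder):
    # If the folder has no captured index.html, write a gen_index.html that
    # just links to the index pages of its immediate subfolders.
    if os.path.exists(os.path.join(folder, 'index.html')):
        return 'index.html'

    links = []
    for name in sorted(os.listdir(folder)):
        if os.path.exists(os.path.join(folder, name, 'index.html')):
            links.append('<li><a href="%s/index.html">%s</a></li>' % (name, name))

    with open(os.path.join(folder, 'gen_index.html'), 'w', encoding='utf-8') as f:
        f.write('<html><body><ul>\n%s\n</ul></body></html>' % '\n'.join(links))
    return 'gen_index.html'
```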
The final major issue involved javascript apparently not running. This turned out to be because the original site was using Cloudflare's rocketscript, which had changed the type attribute of most script tags to "text/rocketscript" and the src attribute to "data-rocketsrc".
The solution was to add a step before step 12 of the basic algorithm to scan for the rocketscript values and replace them with text/javascript and src respectively.
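The replacement itself is trivial (sketch):

```python
def undo_rocketscript(html):
    # Reverse Cloudflare's rewriting: restore the script type and src attributes.
    return (html
            .replace('text/rocketscript', 'text/javascript')
            .replace('data-rocketsrc', 'src'))
```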
Performance
I did most of the development on my Skylake laptop (2 cores/4 threads) and the initial scan of files took some 40 minutes to complete. Plugging the laptop in enabled higher clock speeds, with an associated decrease in processing time, but it was still slow. A fairly obvious optimization was to process the files in parallel, since they were all being processed in isolation from each other, which helped enormously.
Phase two was also largely isolated, apart from where the code tries to generate the gen_index.html files and write them. Every few runs the application would crash as two threads happened to try generating the same index and writing it simultaneously. The simple answer was to place a lock around the write.
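The shape of it, as a sketch (Python for consistency with the other snippets here, not the tool's actual code):

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

index_lock = Lock()    # serialise gen_index.html writes between workers

def process_all(files, rewrite_file):
    # Every file is processed independently, so the work fans out across workers.
    with ThreadPoolExecutor() as pool:
        list(pool.map(rewrite_file, files))

def write_generated_index(path, html):
    # Two workers can hit the same missing index at the same time; the lock
    # stops them writing the same file simultaneously and crashing.
    with index_lock:
        with open(path, 'w', encoding='utf-8') as f:
            f.write(html)
```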
And that was largely it. I did find a handful of URLs being made relative out of context, but that was easily resolved manually, along with making some URLs in certain pages' text actual links rather than static text (and correctly left unprocessed). I'm sure there are some linked sites that are missing relative URLs, but I'll address them manually as I come across them.