While attempting to upgrade to a new Ubuntu distribution late Sunday night, I managed to slag nickm.com. I don’t mean that I insulted my server; rather, I irrevocably converted it into a molten heap, or at least the software equivalent.
The bright side of such failures (perhaps the light is provided by the glowing and otherwise useless material that used to be serving my website) is that one learns how good one’s been at backing up. In my case, I actually had recent copies of almost all of my data stashed away: not only important files, but also the MySQL database. That means that after about 12 hours of reinstalling and once more setting up my server, most of it was up and running.
The one thing I didn’t have, due to a permissions/backup quirk, was the image directory for this blog. I had a very old copy of the directory, but had stopped storing local copies of blog images a while back, trusting in my backups that didn’t work. Obviously, I’m to blame; of more general interest than my culpability is what I tried in an attempt to find the missing images, and what did and didn’t work:
FAIL: The Internet Archive Wayback Machine. As best I can tell, the Wayback Machine’s acquisition apparatus has been switched off since mid-2008. Nothing of Post Position is available at the Internet Archive – no trace of it. There’s no record of Grand Text Auto since mid-2008, either. I found only the tiniest hint of activity in recent months. The major, bustling site Reddit has exactly one image available for all of 2010. The IA Wayback Machine was better than nothing, but was never searchable; now it seems to be over.
SEMI-FAIL: Google. Google’s cache had/has a very small number of my images – only the ones I have recently posted. Perhaps Google also cached my images from three months ago and earlier, but has since removed them? The cache is nowhere near a snapshot of the Web, in any case. Google also keeps smaller copies of my images within its image search. All images there, even the smallest, are degraded by being reduced in size and converted to JPEG. This at least allows people in my situation to see what they’ve lost, though. Of course, Google’s cache is not meant as a serious archival tool, so recovering at least a few recent files from there was nice.
FAIL: Bing. Yes, I checked the Bing cache (used by Yahoo) also. It doesn’t seem to cache images at all.
WIN: The Electronic Literature Organization and Archive-It.org. After I had more or less given up, and after I had started to recreate images to fill the gaps in blog posts, I remembered that Archive-It, thanks to the work of the Electronic Literature Organization, has archived not only works of electronic literature but also contextual information, such as e-lit authors’ websites. Their archives are searchable, too. Archive-It and the ELO did keep copies of material from nickm.com, and succeeded in preserving the images that I’d lost, outdoing the Internet Archive as well as Google and Microsoft. (Again, I did get some copies of recent images from Google, and neither that cache nor Bing’s is intended as an archive.) Scott Rettberg did a great deal of work on the ELO Archive-It project, I know, which was undertaken by the ELO when Joe Tabbi was president. Matt Kirschenbaum worked to connect the ELO with Archive-It, and Patricia Tomaszek did much of the implementation work. A particular thanks to those ELO folks along with the others who worked on this project.
Online archives don’t exist as backup services, of course, but it’s not absurd to see if they can help individuals and organizations in times of crisis – in addition to performing their main function of serving scholars and helping preserve our cultural memory. Given the intricacies of backing up, data storage and formats, and technological change to new systems and platforms, this is sure to be an important secondary function for the digital archive.
Please let me know if you find anything missing or broken here at nickm.com.
6 Replies to “Lessons from the Breakdown Lane”
The Internet Archive always seems to lag by a few years. They don’t actually scrape the web; they get data dumps from Alexa’s web crawl. There is some sort of delay; I don’t know if it’s a delay to integrate, a delay by Alexa to preserve some sort of competitive advantage, or something else.
I’ve noticed that delay with the Internet Archive as well. I am looking for something on my domain I deleted accidentally and it hasn’t shown up yet. I don’t remember it being so delayed in previous times…
I found this article from sometime on or after November 10, 2010: “Internet Archive Increases Delay in Posting Pages to Collection.”
As they did six months ago, the Internet Archive is saying that their move to a new interface, with more recent archived files available, will happen soon: “the rollout has continued to experience delays …”
Via Wayne Marshall: Could liability be hobbling the Internet Archive?
I thought, actually, that the Internet Archive had express legal permission via the Library of Congress to archive the Web. Am I misremembering?