Thursday, 10 October 2013

Archiving, persistence and robots.txt

Web archiving is a hazardous business. Content gets created, content gets deleted, content gets changed every minute of every day. There's simply so much content that you can't hope to archive it all.

Also, a lot of web archiving assumes that the pages are static, even if they've been generated from a script - purely on-the-fly pages have no chance of being archived.

However, you can usually assume that if something was a static web page and was there long enough, it will be on the Wayback Machine in some form.

Not necessarily, it turns out. I recently wanted to look at some content I'd written years ago. I didn't have the original source, but I did have the URL, and I did remember searching successfully for the same content on the Wayback Machine some years ago. (I even had a screenshot as proof that my memory wasn't playing tricks.)

So, you would think it would be easy. Nope. Access is denied because the Wayback Machine honours the site's current robots.txt file, not the one in force at the time of the snapshot, meaning that if your favourite site changes its robots.txt between then and now to deny access, you are locked out.
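If you want to see this for yourself, here's a minimal sketch in Python. It checks whether a site's current robots.txt would block the Internet Archive's crawler (traditionally identified as "ia_archiver"), then asks the Wayback Machine's public availability API whether it still reports a snapshot. The example.com page is purely hypothetical - substitute the URL you're actually after.

```python
import json
import urllib.parse
import urllib.request
import urllib.robotparser

# Hypothetical page - replace with the URL you are trying to recover.
PAGE = "http://example.com/old-article.html"

# 1. Would the site's *current* robots.txt block the archive's crawler?
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()
print("Blocked for ia_archiver now:", not rp.can_fetch("ia_archiver", PAGE))

# 2. Does the Wayback Machine still report a snapshot for the page?
api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(PAGE, safe="")
with urllib.request.urlopen(api) as resp:
    data = json.load(resp)
closest = data.get("archived_snapshots", {}).get("closest")
print("Snapshot reported:", closest or "none (possibly excluded)")
```

If the first check says the crawler is blocked and the second comes back empty even though you know the page was once captured, you're seeing exactly the retrospective exclusion described above.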

Now, there are a lot of reasons why they've enacted the policy they have, but it effectively locks away content that was once public, and that doesn't seem quite right ...

Written with StackEdit.
