Our disappearing web

Garret McMahon is right. He looked at the just-put-up “old Google” from 2001 (it’s lots of fun to run searches and see what Google looked like back then; that index was done just a few days after I started blogging) and noticed that lots of things that were on the Web back then are gone.

My blog, for instance, is gone for the first year and a half.

Funny, just the other night I met Internet Archive’s Brewster Kahle who was flying home on the plane I was on. We talked about this issue and he said it is troubling but that they are trying to catch a lot more now. He invited me over to meet the team, which we’ll do soon. I also visited the Library of Congress a couple of weeks ago and talked with some of their top archivists. They told me story after story of human knowledge and historical documents from our lifetimes that were destroyed. Heck, the Library of Congress itself has been destroyed by fire twice. I visited Thomas Jefferson’s library which was sold to the Library to get it started again after a fire wiped out its collections. Then, later, another fire wiped out a good chunk of his collection again.

I seriously doubt these words will survive 100 years. What about you?

71 thoughts on “Our disappearing web”

  1. Well, it is sad, indeed, but it’s also logical. Every single day, almost as much information is written on the web as in the first 1000 years of our era, and the amount of information written every day will increase. The total amount of information to store therefore grows exponentially, and there is probably no value in keeping it all forever!

    Our responsibility, though, is to make sure that “Information Darwinism” works well, in the sense that we keep everything that is valuable at a given time. I hope your blog will still be here in 100 years if it’s still valuable then, but it won’t be kept until the year 10000, since the information will probably not be as valuable by then!

    Also, it still poses the problem of storage. So far, we are creating technologies that can store more, but for less time… we need to work on this!

  3. Robert, Do you think a line should be drawn somewhere insofar as what should be saved and what shouldn’t? Should all information, including, say, every single blog comment on the web, be saved, or do you think we should focus more on saving what is important? I think we should only archive that which seems like it would have value to someone in the future. A blog post about our cats, for instance, probably does not need to be saved.

  5. It’s more than likely that people in 2108 won’t be able to access the text of this post. No one saves everything, and at some point someone makes an editorial decision regarding what to keep and what to junk, and considering the sheer amount of data that’s being generated, those editorial decisions are going to be made more frequently.

    It’s odd to see what is being preserved, and what isn’t. Google’s Usenet archives have preserved some of the postings that I made in the 1980s during my undergraduate years. But last night, when I searched one of my old Blogger blogs in the process of working on a post, Blogger’s built-in search facilities couldn’t find the relevant stuff that I had written in late 2003 or early 2004.

    That having been said, it’s still a lot easier to find out about things that happened in 2001 than it is to find out about things that happened in, say, 1981 – much less 1681. I guess partial data is better than none at all.

  7. Humm..sounds like the 80-20 rule. 80% of it’s crap anyway. Perhaps a new industry to go through all this data in the future to decide what to throw away. Seems also true for the…data in our heads! : P

  9. I make an effort to keep my stuff on the web as much as possible, but under my control. I use my own domain, back up my data, and keep things up and running. I’ve been through a few CMSes, several servers, etc. But it’s still there. The very first blog post is still there. And I intend to keep it that way.

    I host my own photos too. So I can keep them online. What happens to Flickr if Yahoo doesn’t survive? While you may have backups, what about the URLs?

    Certain people (won’t mention names) think 80% of startups could die in this economic downturn… what about the data they host? What about the users and the networks created?

    I’m convinced at this point that hosting your own data is important, and I live by that mantra.

    That’s largely why I’m more interested in things like distributed social networks, OpenID, and blogs than in Facebook, FriendFeed, and Flickr. I like my data, and I value my time and my relationships. I don’t want my relationships to be owned by a third party.

    Am I insane for feeling this is important? At times I wonder, but seeing your post express some concern gives me hope that I’m not alone.

    I own and share my data. Owning it lets me share forever. I think that’s better than letting a company own it and share it on my behalf.

  11. “I seriously doubt these words will survive 100 years.”

    They shouldn’t. Seriously, I don’t mean that as a dig, but given the ease with which people generate content, it’s a great thing that it’s not all chiseled onto stone. The problem we have nowadays is not saving content, but deciding what should be saved.

  13. I wish everything could be saved. Say 100 years from now with all this data saved we would be able to build some incredible social studies with advanced search algorithms. Say like… Yearly word clouds for the entire internet. That would be incredible from a historical view.

  15. I *love* Archive.org but I sure wish they hadn’t deleted the entire history of http://mounthermon.org which I’ve managed since 1997. Apparently if someone asks them not to archive a site and to delete the history, they do it… and they claim there is no bringing back that which was deleted. Wish I could have had some input, and wonder who asked them to delete my site from their history?

  17. What’s sad is that even knowledge and creativity worth preserving may still be lost.

    We will never be able to read the entirety of Aeschylus’ “Prometheus Unbound” apart from a few fragments that have survived, quoted by other authors; or read any but a few of the books of Livy’s history of Rome.

    I think that’s one reason the information on the web is more likely to be lost than many other forms of information. The oldest books we have survived because they were copied out, over and over and over; people thought they were worth preserving, so we can still read words written by Julius Caesar, or Homer’s Iliad, etc., stuff that’s genuinely thousands of years old.

    On the web we hardly ever copy, we just link. There’s not enough redundancy in our data. Makes it fragile and easy to lose.

    Then there are format problems, both file formats and storage media problems, and the sheer space required to keep everything. It’s very likely that a good proportion of the stuff on the web will be lost. Of course, a lot of it isn’t worth preserving.

  19. The Google archived indexes offer a unique opportunity to discover who were the first to use NOW POPULAR terms – like Web 2.0 or Web services or SEO etc.

    It will be intriguing to see what ultimately happened to those early innovators – by comparing their status now.

  21. We record so much.
    Increasingly always connected.
    The Internet is part of the brain.
    It hurts to forget.
    Real men don’t let the world just mirror.
    Will time wash away all things as it has?
    Real men do more…
    Can I save it all?
    Should I save it all?
    Will I remember?

    Our legacy is not secured but through the interaction. Is the Internet assured? Is the world’s finest interaction able to sustain us for the future?

  23. Referring to mike n’s post, the difficulty is in determining what will be important in 2108, and what won’t be. Perhaps information about our cats WILL be important in the future; I have no way of knowing. Bear in mind that the majority of the contemporaries of Johann Sebastian Bach probably wouldn’t have bothered to save his old-fashioned work. That’s the danger in editorial decisions.

  25. This reminds me a bit of Winston and the memory hole. Are we in Orwellian times? It’s something to think about, but we have enough to worry about these days.

  27. Interesting how far we’ll go to try to archive EVERYTHING. I think it’s important from an information perspective, because as we archive more, future generations become smarter, but where do we draw the line? I think as computers become faster and the techniques for storage become more productive, etc., archiving won’t be as ‘difficult’. As a result, I feel it may take years until we start archiving everything, so this post possibly might not make it to the year 3,000, but others may. Robert, interesting to think about this nonetheless.

  29. I don’t know if the words will last a century or not, but what struck me about the Google from 2001 was how much better the web was back then. Searches found good stuff, not just hucksters trying to sell us cheap crap, there were personal web sites with people I want to communicate with, not lots of spamming “young ladies” from foreign parts trying to friend us on Facebook, there were papers and publications and informed speculation, not just a bunch of teaser abstracts trying to get us to pay for access for the full PDF.

    So I guess that what’s important and worth remembering is getting hidden in the noise, and if 99% of the web doesn’t make it to the next decade, let alone century, how are we going to know if anything of value got lost?

  31. My concern is all about how our grandkids and great grandkids will remember us, if at all. And what about their kids and grandkids? We’re the first generation with this unique opportunity to preserve our thoughts, pictures and video for generations who follow us hundreds of years from now.

    Robert says, “I own and share my data. Owning it lets me share forever. I think that’s better than letting a company own it and share it on my behalf.” Beg to differ, Robert, but you are not going to be here forever. Who’s going to manage it when you’re gone? Your kids? Doubtful.

    Are we all willing to let everything we share and think just fade away? Seems like there’s a better way and I, for one, am committed to working on this. http://remembergranny.com/?p=361

  33. First I want to applaud you for going to the LoC and talking to Mr. Kahle. Digital preservation is a serious issue in archives today and one that archivists around the world are tackling. Unfortunately, with proprietary formats being so prevalent, there is little guarantee that the content will stick around very long if you just let it sit. I actually encourage ppl with digital photos on their HD to print out the ones they want their grandchildren to see. There is, however, exciting open-format work being done in this area to preserve content, and archivists are coming up with creative solutions to capture the linking/concept mapping that is happening on the web today.
    There has been an explosion in content over the past 100 years, and the internet has contributed greatly to the crush of information. I don’t foresee all of the content from so many blogs being captured for posterity. I laud what Mr. Kahle is doing, but I find that he is (1) lacking a good access and retrieval system and (2) really doing a disservice by capturing so much, because what is important becomes irrelevant. Yes, I am actually arguing against equity when preserving history. But ah, there is the point: what is important? Surely not everything? If so, then who decides? In archives, archivists have begun annotating historical documents to explain why they are keeping what they do. They recognize that there is little objectivity anymore. And unfortunately, historically, the powerful usually decide what to keep and what is remembered. This is frightening (and trust me, it’s done in totalitarian regimes every day!), but it is something archivists fight against every day! This is part of what makes the small, little-known job of the archivist one of the most important in our information-saturated world today.

  35. It is a reminder how quickly data is lost when you consider that many people could not tell you the names of their great-grandparents let alone what they did or who they were. We’ll be forgotten just as quickly.

    As for the words: if they have impact then they will be copied, not as the result of an editorial decision but in celebration of what they say. Quality information will be copied, quoted, and preserved because of the inspiration it brings to others.

    However, most information will unfortunately be lost, and it is not for any editor to dictate what will be interesting to a future set of readers. I’ve read Homer and Livy, along with many other celebrated works. I’d actually be more fascinated to read journals written by a commoner from back then (not that many could write, but for the sake of argument). I’d be interested to know what occurred in their day-to-day lives. Perhaps future generations will be fascinated to know what it was like to have cats as pets when our world looks like Terminus and no room for pet animals remains.

  37. You need to print everything a few times and save it at various locations around the world. It might just survive then. But what happens if one of the smaller video hosting guys goes bust… or a photo sharing site… yeah I have the stuff somewhere… but will I go to the trouble all over again? If one of these services were to go down, then I hope a big player would step in to offer support in moving assets.

  39. Since your words are stored in so many different ways all around the world on 1000s of different machines and even digital media, I would say that you have a better chance of your words surviving longer than 100 years.

    Mostly due to the proliferation of social media and cloud storage databases.

    I have the backup CDs to my complete files from 99, my first computer.

    How enduring this format may be and the actual digital media it is recorded on is another story.

    As far as library fires go, the greatest loss was the burning of the Egyptian Royal library at Alexandria. Probably the greatest loss of what the ancients really knew and the history of ancient Egypt. Like how the pyramids were built. Not to mention Greek history stored there when they dominated the region.

    I wonder if any of our current knowledge will be looked back upon as having been worthy enough to mourn the loss?

  41. Reminds me of EH Carr’s observation in What is History?

    He was writing a thesis on the Peloponnesian War, and had on his desk a stack of volumes containing pretty much everything that had been written on the subject. This made him wonder how, of everything that had ever been observed and known about the war, these few volumes came to be “the facts” and “the history” of the war.

    In many respects, we have acquired outsized expectations for what should last and in what amount of detail.

  43. The Hotmail ad comment is simply mean. I suppose techies should know better–in fact, true techies should send their letters from gmail–but the culprit here is Hotmail.

  45. With as much new content being added to the web each and every day, I don’t see how stuff cannot disappear.. Think of it this way, if it did not.. how much more redundant would searches be than they already are?

  47. I think that all web creation tools/packages and all web sites should have a button available to their creators that, once pressed, sends the web page to a site like the Internet Archive, which strips off the security bits and pieces and just saves the important parts for future people to access. It is sad that, while information storage devices grow exponentially in size, we lose these old web sites due to hardware/software crashes.
