Why you cannot find crap with search engines anymore

Perhaps you have noticed that less and less of the useful content on the world wide web seems to be showing up in the search engines, and when it does, many of the links are dead. This is a bad thing. Well, in case you were wondering why, it is not because the web is growing too fast. Google and their competitors are well-financed corporations, fully capable of scaling up their technology to handle more content.

No, this is entirely the fault of a group of short-sighted people who do not realize what they are doing. Control over the web rests in the hands of website developers, both professional and amateur, and they are doing a lot of things that take content that search engines used to be able to find and make it invisible to the search engines. They also seem to be fond of breaking your bookmarks on you.

It is not that these people are trying to make the Internet a less useful place, mind you, just that they do not fully grasp the importance of the role they play, nor the magnitude of the consequences that result from their actions.

URLs get less respect than Dangerfield, rest his soul.

The core problem is that people these days, and by that I mean for the last several years, have lost all respect for WWW URLs, not that they had enough to begin with. By this, I mean no one seems to realize that once you post content at a URL, you should think twice about moving it.

Let us take a little walk down memory lane, back to when the WWW was just forming. What were URLs invented for? The answer is in an introductory section of RFC 1630, one of the founding documents of the WWW, which reads in part:

   Many protocols and systems for document search and retrieval are
   currently in use, and many more protocols or refinements of existing
   protocols are to be expected in a field whose expansion is explosive.

   These systems are aiming to achieve global search and readership of
   documents across differing computing platforms, and despite a
   plethora of protocols and data formats.

This means that URLs were intended to be used not only for "document retrieval" but as part of an overall solution for those who wanted to implement a "global search" on collections of documents, be this through a proper search engine or something as basic as a "favorite links" page.

Naturally, changing the URL of an information resource sometimes cannot be avoided. A large number of the information resources on the WWW are living in rented accommodations, for example, so situations where accounts expire are bound to occur. If this were the only reason why URLs were to change, though, search engines would have very little problem keeping up with what is living where.

But most URLs change because some "visionary" web developer comes tromping through a website like a giant lizard, changing URLs left and right. This happens even with very large corporate sites. I cannot tell you how many times I personally have seen people like this not only rearrange content in a way that fails to preserve legacy URLs, but also destroy content entirely because the content did not fit their grandiose "makeover" plans. Form over substance seems to be their motto.

Such people need to be taught to keep their meat-hooks off the content of the WWW, to give the Internet a braid, not a crew-cut. Or they should go take up interior decorating instead.
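
For the developer who must rearrange a site anyway, here is a minimal sketch of one way to keep legacy URLs alive: a lookup table that answers requests for old paths with a permanent redirect. The paths, the LEGACY_REDIRECTS table, and the resolve function are all hypothetical names made up for illustration; a real site would more likely do this in its web server configuration.

```python
from typing import Optional, Set, Tuple

# Hypothetical table of old paths and where their content lives now.
# Every time content moves, its old path gets an entry here.
LEGACY_REDIRECTS = {
    "/products/widget.html": "/catalog/widgets/classic",
    "/faq.html": "/support/faq",
}


def resolve(path: str, live_paths: Set[str]) -> Tuple[int, Optional[str]]:
    """Return an (HTTP status, redirect target) pair for a requested path."""
    if path in live_paths:
        return 200, None       # content is served from this URL
    new_path = LEGACY_REDIRECTS.get(path)
    if new_path is not None:
        return 301, new_path   # moved permanently: bookmarks and spiders follow
    return 404, None           # never existed, or was destroyed outright
```

The point of the 301 status is that it tells both browsers and search engine spiders that the move is permanent, so an old bookmark or index entry still leads somewhere.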

Stylesheets, a small salve for a big dynamic rash

Most web developers consider "stylesheets" to be a "neato" enhancement for the world wide web. In actuality, however frivolous they may seem (all they do is let people make their webpages look more like magazine pages), stylesheets have a fortunate side-effect.

Before stylesheets, of which there are now several different flavors, the WWW was quickly headed for disaster. It still is, but stylesheets slowed the process down. What was happening is that the abovementioned web developers, who are of course incredibly gifted and just ejaculating with talent, were feeling very confined by the good-old HTML standard that built the WWW. They just could not express their throbbing sense of creativity with nothing more than font sizes and graphics images.

Eventually in their lust to show us all how great they were, they started abusing CGI, the standard that has always been useful for online surveys and administrative tasks, but nowadays is more often used to send our personal information to entire corporations full of complete strangers for no good reason. They figured they could get around the chafing they felt by making web content "dynamic."

This would allow them to ascend to a whole new level of fickle. Before, when they wanted to give their websites another pelvic tattoo, they had to go and alter all the pages on the website. With the birth of PHP, they now could change their minds as often as a teenage girl does when choosing a blouse. The way it worked is that instead of going directly to the content, the users would go to a CGI script which would then go find the content and primp and tease it with all the fanfare of prom night before showing it to the user.

In one sense, this would be a good thing, because it would keep the web developers far away from the content. But it backfired. You see, web developers do not necessarily know how to program very well, and many of them found it too difficult to keep the URLs where they used to be, if they even thought enough to care, and more to the point, they missed a simple trick that would allow you to bookmark a dynamic page. So for a while, and still in many places, we had websites where none of the pages really had their own URL.
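
The trick is not spelled out above, so take the following as one common approach rather than the definitive one: put everything the script needs to rebuild the page into the URL's query string, so that the address in the browser is the page. This is a minimal sketch; the view.cgi address and the "article" and "style" parameter names are assumptions invented for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def page_url(base: str, article: str, style: str = "plain") -> str:
    """Build a URL that carries the full page state in its query string."""
    query = urlencode({"article": article, "style": style})
    return f"{base}?{query}"


def parse_page_url(url: str) -> dict:
    """Recover the page state from a bookmarked URL."""
    params = parse_qs(urlsplit(url).query)
    return {
        "article": params["article"][0],
        # Fall back to the default look if the bookmark predates the option.
        "style": params.get("style", ["plain"])[0],
    }


bookmark = page_url("http://example.org/view.cgi", "rfc1630-notes")
state = parse_page_url(bookmark)
```

Because the state travels in the URL instead of in a form post or a session, the same address brings back the same content from any computer, and a spider can index it like any static page.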

Fortunately, stylesheets not only provided an irresistibly sexy and chic distraction to our intrepid savants, they provided the features sought by those less clueful among the PHP crowd, reducing the use of dynamic content for purely cosmetic reasons to a lingering fad.

The Cookie Monster

Cookies should receive a brief dishonorable mention. While the modern-day web unfortunately still seems to be addicted to them, at least they are used less and less in a way that actually conceals content. For a while, though, the enemies of the URL were up to their elbows in the cookie jar, abusing them to the extent that they started baking their own homemade cookies that acted like a URL. These weren't very filling for your bookmark file, of course, and had the side effect that a website could behave very differently depending on which computer you used to access it and if/how you had accessed it in the past.

The forum and the almighty wiki

You probably feel pretty good about yourself when you can answer the questions of an essentially anonymous stranger in a tech support forum. And the great thing about web forums and wikis is that anyone else with the same question can find your answer, so it is the gift that keeps on giving, right? Well, think again. A large number of online discussion, blog, and knowledge-gathering applications don't get indexed by the global search engines.

Sometimes this is due to problems with the software itself, which is not written in such a way as to give each comment or thread a unique, permanent URL. Some of these packages even pollute the search engines by giving out URLs for lists of most recent topics -- by the time these URLs appear in the search engines, the subject matter the seeker is after is hopelessly lost in the site archives. Having site-wide search engines is a good thing, but it doesn't make up for the way these supposedly public forums conceal their payload from the world at large. Besides, do not be deluded: if there are not already several other related forums that are just like yours that are not covered by your site-wide search, there soon will be.

Other times, the software is perfectly capable of providing unique URLs, but the person or company running the site has elected not to turn this option on, has no idea the option exists, or is just lazy. Quite often it is the case that the person in charge does not see the value in doing so. Even the silliest discussion list can have a few gems of information, and besides, why should the WWW be just for "serious" information? A lot of people google for fun, too. Forum moderators and the like need to consider this.
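
To make the "unique, permanent URL" idea concrete, here is a minimal sketch of how forum software might build its links. It assumes each thread and comment has a numeric ID that never changes once assigned; the function names and the URL layout are hypothetical and not taken from any particular forum package.

```python
def thread_permalink(base: str, thread_id: int) -> str:
    """A thread's URL depends only on its ID, never on its position in a list."""
    return f"{base}/thread/{thread_id}"


def comment_permalink(base: str, thread_id: int, comment_id: int) -> str:
    """Each comment gets its own address underneath its thread."""
    return f"{thread_permalink(base, thread_id)}/comment/{comment_id}"


# The same IDs always yield the same URL, however old the thread gets.
link = comment_permalink("http://forum.example.org", 1042, 7)
```

Contrast this with a "most recent topics" address, where the URL stays the same but the content behind it changes every day.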

The dreaded "subscription"

With due acknowledgement to those who run semi-confidential services or have legitimate concerns about intellectual property, sites that require online registration for read-only access are not helping matters either. A great number of these sites have no real reason to do so -- they just figured they would milk the users for some amusing, insightful, or (for the less scrupulous) resellable demographics information. In the meantime, they are locking their content away from search engines and depriving the greater part of the Internet community.

The situation with Google News highlights this best. In order to get a decent news search engine to work, Google has to go around subscribing their spider to a whole lot of outlets. They should not have had to invest this level of time and effort to do so. The online papers should either put a news article online for free access without a login, or they should not. This "free but registration required" scheme just makes all our posteriors hurt in the end.

Want to help?

Both users and, of course, web developers can help put a stop to this. Making the search engines more useful by improving their content and reliability will increase their usage. As their usage grows, "being indexed" will again become something the rest of the Internet cannot afford to ignore.

Even as a user, you often have many choices where you go to engage in the online exchange of ideas. Among those groups, forums, wikis, etc., choose one that A) doesn't require a login for read-only access, B) keeps old articles around, and C) shows up (with good links) when you search for something inside with a major search engine.

Web developers can help by, of course, not committing all the abovementioned sins. It would also be a help if even the amateurs took the stability of their web provider seriously, if possible registering their own domain so they can move providers without nuking URLs.

Copyright (c) October 2004 Brian S. Julin