A return to prehistory, and an HTML header detective story

I just resurrected one of my old articles from when I worked for the Dark Lord Murdoch. Most of the stuff I wrote back in the Before Time of the dot-com boom I also published on dansdata.com. I think there was some kind of News Limited copyright contract that expressly forbade that; I dealt with that problem by never signing it. But some articles were News Interactive only, and thus disappeared into the ether.

The sporge article was copied with permission here and there, but it vanished quickly from the AustralianIT site because, wait for it, they didn't archive old material.

At all.

It was as if Rupert Murdoch could only afford a two megabyte Geocities page, or something. Stuff just got thrown on the floor after a month or three.

Oh, and the article URLs were encrusted with gunk by the Content Management System. There'd be lots of gloriously typeable stuff like http://www.australianit.com.au/common/storyPage/0,3811,1588489%5E504,00.html.

Those attuned to the pre-blog Force will recognise these URLs as being the unvarnished output of the Vignette Content Manager. Many sites used Vignette back then, because they did not listen to Phil. But those with a bit of civic pride put something between Vignette and the world that turned the URLs into something non-horrible.

We didn't.
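The "something" didn't have to be fancy, either. I never saw the innards of anyone else's setup, so this is a made-up, simplified sketch rather than anybody's real configuration, but a rewrite layer sitting in front of the CMS - a few Apache mod_rewrite rules, say - was the general idea:

    # Hypothetical, simplified example only. Publish a readable URL to the
    # world, and quietly map it to the Vignette comma-monster internally.
    RewriteEngine On
    RewriteRule ^/story/([0-9]+)$ /common/storyPage/0,3811,$1,00.html [L]

The reader types, links to and bookmarks something like /story/1588489; the CMS never knows the difference.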

And, when I was there, our site had no search function. Not even for the miserable quantity of content that was on it at any given time.

Oh, there was a search that we could use. And the people who sold Vignette to News presumably showed them a feature list with a great big tick in the "searchable" box, because what kind of cockamamie CMS makes a site that can't be searched?

But, as it turned out, the only way to actually search the content of a Vignette site at the time was to do something like signing each searching visitor in as a user of the actual Vignette server itself. This, for no readily identifiable reason, rapidly paralysed the very expensive servers humming away on the lower floors of the building (they were there because Rupert - or more likely Lachlan, who was in charge of the whole Australian online endeavour, but ain't no more - could not abide the concept of off-site hosting).

At the time, I got the impression that Vignette's advice to people who were wondering why their $250,000 server couldn't handle the traffic of their medium-popularity site, which Apache could handle happily on a single-processor 400MHz P-II box, was "buy a $500,000 server".

Yeah, yeah, I hear you say. So far, so unremarkable. Messed-up stuff happening during the dot-com era in places where the big bosses made eleventy-three figures and people with piercings played pool and Dreamcast when they were meant to be working. There were only about a million of those stories, right?

The punchline, though, is that seven years later, now that the Web is more than twice as old as it was then, the Australian IT site is still exactly the same.

Check out this recent piece.

(Or don't, if you're reading this long enough after I posted it, 'cos it won't be there any more!)

Ugly URL? Check.

(Apparently the latest edition of Vignette, version 7, finally kills the commaed monster URLs. Either News still isn't running that version, or they're running it in backwards-compatible mode.)

No search box? Check.

And don't expect to be able to find anything by using that new-fangled Google or anything, either, because every page has a big bold <META NAME="ROBOTS" CONTENT="NOARCHIVE"> in its header, which tells search engines not to keep a cached copy of anything. And there's a matching robots.txt telling their spiders to bugger right off entirely, of course.
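The robots.txt end of it takes even less effort. Something of this general shape (not News' actual file - go and look at that yourself if you like) is all a "bugger off" robots.txt needs:

    # Illustrative only - not News' actual robots.txt.
    # "Disallow: /" shoos compliant crawlers away from the whole site;
    # narrower paths can fence off particular directories instead.
    User-agent: *
    Disallow: /

Google and friends honour it, and anything behind it is, as far as their indexes are concerned, pretty much invisible.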

All of the other news.com.au pages have the robot-repellent, too. Though they add a trailing slash, 'cos they're cool.

News.com.au itself is, to be fair, not quite as bad as my old sub-site (which was australianit.com.au at the time, but is now a news.com.au sub-domain).

News.com.au has a search box, and an archive that goes back a whole thirty days - a fact I determined by searching for the name of Australia's prime minister, then chugging through to the oldest article, which when I found it was 29 days, 23 hours and 58 minutes old. (See if it's still up when you read this - the searcher has, of course, now lost it.)

But the searcher only searches news.com.au pages, not the australianit.news.com.au subdomain. Search for computery stuff, find nothing.

News Corporation, of course, owns a buttload of other newspapers and TV stations and such.

The UK Sun's site betrays in its headers that it's still running Vignette version five (with, of course, the commaed URLs to match), and it's got the NOARCHIVE tag too, plus an entertainingly officious robots.txt. It's got its own search box, though, and that seems to give access to miles of old articles.

Rupert owns the UK Times and News of the World, too. The Times is the same deal - NOARCHIVE again, but with an, um, archive of articles. NotW breaks ranks, though, with less painful URLs and search engine archiving allowed, but no search box of its own.

(I wonder if someone sweated blood to get that to happen? There's a bloke called Dave whose e-mail address is at the top of the source of every NotW page. Perhaps he'll tell you.)

And then there's the New York Post, another News property. It appears to be entirely untouched by the pernicious spread of Vignettery, and behaves as a newspaper site should; 1337 h4XX0rZ may enjoy probing the short list of directories that robots.txt wants Google to leave alone.

Back here in the land of jumbucks and billabongs, though, News Interactive seems determined to keep itself as irrelevant as possible, by preventing people from finding articles current and old, even if they're willing to pay for the privilege.

This seems, on the face of it, to be a strange decision for Murdoch to make. Whatever else he may be, he's not stupid, and someone who's supposed to be so gosh-darned enthusiastic about bending the world's opinions to his will would, you'd think, be more keen on letting people read what his minions write.

If you excise yourself from the search engines and don't even keep a pay-for-access archive of old stories, then as far as the Web's concerned, you pretty much don't exist.

Then, however, it hit me.

All of the Web-excised newspaper sites may or may not say things that Rupert wants the world to believe. Rupert's flagship educator-of-the-proletariat, though, is not any of these mere papers; it is the entirely unbiased Fox News.

The Fox News site has Vignette URLs (and that Version 5 tag, too).

But it also has a big old archive, and no ROBOTS taggery at all, beyond this. Google indexes all of its pages just fine, just as it does with the similarly conservative New York Post.

So, in the end, all this is not WTF-worthy at all. It's part of the plan.

So take heart, everyone who still works for one of the News properties whose Web sites are hidden from the world.

Even if you have a hard time looking at yourself in the mirror these days, Rupert still does not believe you are trustworthy. Even The Sun isn't on-message enough any more.

I bet you didn't feel like a bunch of dissidents, did you?

One Response to “A return to prehistory, and an HTML header detective story”

  1. rsynnott Says:

    Note, though, that the UK Sun's robots.txt only bans spiders from articles from 2002-2005. Whether this is generosity or mere absentmindedness I don't know.


