Must Consume Everything

Monday, July 14 2003

I’m probably echoing what’s already been said, but it occurs to me that things in the blogosphere aren’t as they could be.

There are so many weblogs now, more than I know. You could tell me there are ten thousand and I’d believe you just as easily as if you told me there are three million. I only read about ten, at the most, and it’s because I don’t want to make the effort to find new ones. There are so many, I don’t even know where to start.

Because I know there is good, unknown content out there, I’ve installed Newsgator and other “feed” based tools to help me discover it. But I rarely discover anything, except that a feed doesn’t work, or else I’m back to the problem of finding new feeds.

Something needs to be done. Anil mentioned something at SXSW that I thought he was saying with a bit of sarcasm at first, but I realize now that he was probably being more serious than I assumed (and foreshadowing services like TypePad at the same time).

He said something to the effect that we should be able to search 20,000 weblogs at a time. At first it sounded crazy and impractical. Why would I want that much data when perhaps only a small percentage of it would interest me?

That’s exactly the point I think he was trying to make. If I have thousands of blog entries at my fingertips I can do some pretty amazing things. I can draw relationships between the entries, I can categorize them, I can search against them, I can find new content very easily. By having thousands of weblogs available I would at least be able to find the small percentage of them I’m interested in.

So basically what I want is a relational database snapshot of the entire blogosphere. I want a data warehouse that I can not only run statistical analysis on, but also use to make intelligent relationships between blog entries. I want to be able to see trends appear and spot memes as they’re happening. I want a stock ticker that tells me the common theme for the day.
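To make the “stock ticker” idea a little more concrete, here is a rough sketch of what I have in mind, using Python and SQLite purely for illustration. The table names, the column names, and the crude “words of five or more letters” tokenizer are all my own assumptions, not anything that exists today. It ranks terms by how many different weblogs used them on a given day.

```python
import re
import sqlite3


def build_warehouse():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entries (
            id     INTEGER PRIMARY KEY,
            blog   TEXT NOT NULL,
            posted TEXT NOT NULL,   -- ISO date, e.g. '2003-07-14'
            body   TEXT NOT NULL
        );
        CREATE TABLE terms (
            entry_id INTEGER NOT NULL REFERENCES entries(id),
            term     TEXT NOT NULL
        );
    """)
    return conn


def add_entry(conn, blog, posted, body):
    """Store one entry and index its terms."""
    cur = conn.execute(
        "INSERT INTO entries (blog, posted, body) VALUES (?, ?, ?)",
        (blog, posted, body),
    )
    # Crude tokenizer (an assumption): lowercase words of five or more
    # letters, counted once per entry.
    words = set(re.findall(r"[a-z]{5,}", body.lower()))
    conn.executemany(
        "INSERT INTO terms (entry_id, term) VALUES (?, ?)",
        [(cur.lastrowid, word) for word in words],
    )


def themes_for_day(conn, day, limit=5):
    """Rank terms by how many distinct weblogs used them on `day`."""
    return conn.execute(
        """
        SELECT t.term, COUNT(DISTINCT e.blog) AS blogs
        FROM terms t JOIN entries e ON e.id = t.entry_id
        WHERE e.posted = ?
        GROUP BY t.term
        ORDER BY blogs DESC, t.term ASC
        LIMIT ?
        """,
        (day, limit),
    ).fetchall()
```

Counting distinct weblogs rather than raw mentions is a deliberate choice here, so that one chatty author can’t manufacture a meme alone. A real version would need stop words and something smarter than single-word terms.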

I once set about making this tool. Getting the data wasn’t difficult. I built some functions to grab XML-based feeds and store them in a database. But the trials of importing all the different flavors of properly formed XML, let alone poorly formed XML, were great. There are far too many formats, and versions of formats, and unknown formats.
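For a sense of what that import step involves, here is a minimal sketch in Python. It is not the code I wrote back then. It assumes that RSS-style feeds wrap entries in an item element, that Atom-style feeds use entry, and that ignoring XML namespaces is enough to treat them alike. The function names and the entries table are made up for the example. Poorly formed XML still fails outright, which is exactly the problem.

```python
import sqlite3
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix so RSS, RDF and Atom tags compare equal."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the stripped text of the first direct child called `name`, or ''."""
    for child in elem:
        if local(child.tag) == name:
            return (child.text or "").strip()
    return ""


def entry_link(elem):
    """RSS keeps the URL as element text; Atom keeps it in an href attribute."""
    for child in elem:
        if local(child.tag) == "link":
            return child.get("href") or (child.text or "").strip()
    return ""


def parse_feed(xml_text):
    """Yield (title, link) pairs; raises ET.ParseError on poorly formed XML."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local(elem.tag) in ("item", "entry"):
            yield child_text(elem, "title"), entry_link(elem)


def store(conn, feed_url, xml_text):
    """Insert every entry from one feed; return the number of rows stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (feed TEXT, title TEXT, link TEXT)"
    )
    rows = [(feed_url, title, link) for title, link in parse_feed(xml_text)]
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    return len(rows)
```

Calling store(sqlite3.connect(":memory:"), feed_url, xml_text) with the text of a downloaded feed would file its entries away. Even this much glosses over dates, encodings, and HTML embedded in descriptions, each of which differs from format to format.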

So I gave up. Now I sit and wait for someone else to make it.