Navigating the Noosphere
While I was doing some research on the web over the last couple of weeks, I began to get a real sense of how annoying doing research on the web can be. Two of my topics were MEMS and biochips - fairly obscure topics, but not obscure enough to exhaust the possibilities for web searching after a couple of days. As the project progressed, I would continually come across new sites and new avenues of research that had previously been obscured from view. I've never really spent entire days trying to systematically research a topic before, so this was a rather new perception of the Internet for me. Although I'm an Internet junkie, I think that, like most Internet users, I usually stick to my usual haunts, with the occasional foray into new and unexplored realms. The bottom line: the web is chock full of information, but we are currently stuck with a very crude set of tools to handle this torrent.
First, it's pretty clear by now that Yahoo-style manual indexing of the web isn't going to be viable for much longer. The pages keep coming, the branches of the index grow more complicated, and it's just not feasible to keep throwing bodies at the task. On the other hand, we have AltaVista, HotBot and the other text-centric indexing systems, which seem to bring back more inaccurate retrievals as the months go by. It's not an encouraging trend for the keyword search engines.
One interesting newcomer is the search engine Google. As you probably know, it measures the significance of a page by the number of other pages that link to it for a given topic. It works great for getting rid of the spam porn sites, and works surprisingly well for many searches. But one drawback to its indexing system is that it tends to reinforce the "status quo": well-established sites are already linked to, but it takes some time for a new and obscure site to develop the "grass-roots" support that would increase its perceived significance. And that can be a real problem, because often the best information is tucked away on some out-of-the-way site, unknown to the rest of the web. So while Google can rank the significance of any given document through an automated process that would scale up nicely, it still misses the mark.
Another site tackles the significance problem by depending on the good judgment of a randomly selected, rotating group of users. Slashdot.org is one of the better Internet discussion forums for geeks - it's very pro-Linux but spills out into other computer-related topics. Users can send in articles for posting, which are filtered by the moderators, and then a long discussion thread will ensue. As the thread progresses, a random, rotating selection of semi-regular users is awarded a limited number of points with which to promote or demote a response.
In theory (and seemingly in practice), the good posts get promoted, while the flames and mindless comments get demoted. Readers of the site can set their sensitivity to the ranking of posts - allowing only the best posts to surface, or viewing every single post, or some mixture in between. It allows a filtering of significance without outright censorship, and seems to work pretty well. But Slashdot has a self-selected group of pretty responsible users, and the signal-to-noise ratio is pretty high even without the promotion/demotion system. While it works well in this limited implementation, I don't think it could be scaled up to supplant a search engine.
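The promote/demote scheme with a reader-set threshold can be sketched in a few lines of Python. This is only an illustration of the idea as described above, not Slashdot's actual code: the `Comment` record, the starting score of 1 and the score bounds are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed bounds and starting score, chosen only for illustration.
MIN_SCORE = -1
MAX_SCORE = 5


@dataclass
class Comment:
    author: str
    text: str
    score: int = 1  # assumed default score for a new post


def moderate(comment: Comment, delta: int) -> None:
    """Spend one moderation point: +1 promotes, -1 demotes (clamped to the bounds)."""
    comment.score = max(MIN_SCORE, min(MAX_SCORE, comment.score + delta))


def visible(comments: list[Comment], threshold: int) -> list[Comment]:
    """Return the posts a reader sees at their chosen threshold.

    Nothing is deleted: a reader with a low threshold still sees everything.
    """
    return [c for c in comments if c.score >= threshold]
```

A reader who sets the threshold to the minimum score sees every post, while a reader who sets it high sees only what the moderators have promoted, which is the "filtering without censorship" property described above.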
The problem as I see it is to somehow track the attention of users, and to do it in an automated fashion. One solution could be to integrate an Internet browser and a search engine into an information gathering and dissemination tool. Netscape has already taken the first baby step towards this with its "What's Related" button, which tracks the link-hopping of Netscape users, distills it, and spits out a short list of sites that are related to the site you are currently browsing. I envision doing this in a much more systematic and relevant fashion and feeding the results back to a search engine. The browser would keep track of your progress through the Internet and, most importantly, keep track of the attention you give to the sites you link-hop to. The easiest way to do this would be to measure the time spent on a link that is followed. If you follow a link and then hit the back button after 2 seconds, then obviously the site was not too relevant. But if you spend time on the site (presumably reading the text) and dig deeper into it, the browser can measure that as well. The browser could check whether the mouse is moving to figure out if you are actually paying attention or have just left the page up during an interruption.
The next step would be for the browser to feed this information back to a central search engine site, where it could be combined with Google- or AltaVista-type information. There, the history of pages viewed in the browser, and the links followed, could be combined with the pages gathered and indexed by web-search 'bots. One function would be to "tell" the search engine about obscure sites for further follow-up - every site gets viewed by somebody's browser eventually. The other function would be to provide a "roadmap of significance", measured by the aggregated activity of millions of users.
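A minimal sketch of the browser-side bookkeeping might look like the Python below. Everything here is hypothetical: the `Visit` record, the idea of logging "active" seconds (time during which the mouse was moving), and the function names are invented for illustration. Only the 2-second bounce cutoff comes from the example in the text.

```python
from collections import defaultdict
from dataclasses import dataclass

# Cutoff taken from the "back button after 2 seconds" example in the text.
BOUNCE_SECONDS = 2.0


@dataclass
class Visit:
    url: str
    seconds_open: float    # how long the page stayed on screen
    seconds_active: float  # assumed measure: time with mouse movement


def attention(visit: Visit) -> float:
    """Score one visit: bounces count for nothing, idle time is ignored."""
    if visit.seconds_open <= BOUNCE_SECONDS:
        return 0.0
    # Active time can never exceed the time the page was actually open.
    return min(visit.seconds_active, visit.seconds_open)


def aggregate_attention(visits: list[Visit]) -> dict[str, float]:
    """Sum attention per URL, as a central site might do across many users."""
    totals: dict[str, float] = defaultdict(float)
    for visit in visits:
        totals[visit.url] += attention(visit)
    return dict(totals)
```

A page left open for fifteen minutes with only thirty seconds of mouse activity scores 30, not 900, which is the "interruption" case the paragraph above worries about. Tracking how deep a user digs into a site is left out of this sketch.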
If Google can weight significance by measuring the weblinks "upstream" of any given site, it should be possible to evaluate the time and attention data gathered "downstream" of any given site as well. While in the long run expert systems and neural nets may be able to take up some of the slack, in the medium run, tapping into the aggregated individual interest of Internet users could be a way to fill the gap. I wouldn't go so far as to actually forecast the emergence of this type of significance ranking system, but it does seem like a rather sensible approach. The websurfing habits of hundreds of millions of Internet users are a goldmine of information that is just waiting to be put to good use...
November Note: one search engine - Direct Hit - has adopted a crude implementation of this idea.
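To make the "upstream plus downstream" idea concrete, here is a rough Python sketch that blends a link-based score with an attention-based one. It uses a plain inbound-link count as a stand-in for the link measure described earlier, which is a simplification and not Google's actual algorithm. The function names, the normalization and the `link_weight` knob are all assumptions for illustration.

```python
def inbound_counts(links: dict[str, list[str]]) -> dict[str, int]:
    """Count distinct pages linking to each page ("upstream" support)."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] = counts.get(target, 0) + 1
    return counts


def normalize(values: dict[str, float]) -> dict[str, float]:
    """Scale scores to the 0..1 range so the two signals are comparable."""
    top = max(values.values(), default=0)
    if top == 0:
        return {key: 0.0 for key in values}
    return {key: value / top for key, value in values.items()}


def combined_rank(links, attention, link_weight=0.5):
    """Blend upstream links with downstream attention; highest score first."""
    up = normalize(inbound_counts(links))
    down = normalize(attention)
    scores = {
        page: link_weight * up.get(page, 0.0)
        + (1 - link_weight) * down.get(page, 0.0)
        for page in set(up) | set(down)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With `link_weight=1.0` this reduces to pure link counting and the well-linked sites win. Lowering the weight lets a little-linked page that readers actually spend time on climb the list, which is the point of adding the downstream signal.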
One site I came across seems to be a small step in the right direction towards creating some kind of usable graphical interface for web data. NewsMaps takes collections of news articles (up to a few thousand) and plots their relationship to one another as an aerial view. The data gets clustered by similarity into peaks and valleys, and the picture that emerges is one of islands of data within a blue sea. Sadly, the program is deathly slow, and not implemented particularly well. But it is rather fascinating to be able to get a visual image of the "landscape" of news and an immediate representation of the most abundant articles and stories. What I would like to see is this technology sped up considerably, and the sources of data widened. A site like Yahoo would benefit greatly - no longer would you be faced with a wall of links; instead you would get peaks and valleys which represent the depth of data, as well as the degree of similarity between the content. When the bandwidth gets fast enough, what I would really like is to be able to use one of those 3-D pointing devices (used for games at the moment) as the navigational tool. You would see the landscape of data "below", and then you could rapidly use the pointer to zoom into a field of interest and have each subcategory come into view seamlessly as you dive deeper into the data. Besides being pretty neat looking, it would have the added benefit of keeping the content you are reading within a context of related content. It would also allow you to change "elevation" in order to move from the abstract to the particular. This graphical interface for Internet data could also be applied to search engines, portals, or even (a big maybe) the Internet as a whole.
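How NewsMaps actually builds its landscape isn't something this piece knows, but the basic "cluster by similarity, then treat cluster size as height" idea can be sketched in Python. The word-count vectors, cosine similarity, greedy clustering and the 0.3 threshold below are all assumptions picked to keep the example short.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Turn an article into a bag of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0 = unrelated)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(articles: list[str], threshold: float = 0.3) -> list[dict]:
    """Greedily add each article to the most similar cluster, or start a new one."""
    clusters: list[dict] = []
    for text in articles:
        vec = word_counts(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(text)
            best["centroid"].update(vec)  # centroid is the summed word counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return clusters


def peak_heights(clusters: list[dict]) -> list[int]:
    """The "elevation" of each island: how many articles it holds, tallest first."""
    return sorted((len(c["members"]) for c in clusters), reverse=True)
```

Placing the islands on a 2-D map so that similar clusters sit near each other is the harder part and is not attempted here.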
In viewing the NewsMaps site I couldn't help thinking of Graham Molitor's curve of an emerging issue. The NewsMaps site only takes a frozen snapshot of an area of interest at a specific time. But if you took ALL the articles over a period of time (oh, let's say a few years), you could map all the articles to the graphic overlay and create an animated sequence of daily data. You could see the ebb and flow of issues over time, from blip to mountain, and then back again. I have to wonder whether any patterns or cycles would show up in the data, or whether it would be possible to forecast the region of the data landscape from which a new issue would emerge. Well, pending the venture capitalists investing in Justman Inc. and providing me with a fleet of crack Indian programmers... that's all for this week.
Written by Mark Justman
Copyright 1999
Posted 10/23/99
http://go.to/futureplex