Article - Issue 6, November 2000

Searching the web, or Mr Livingstone I presume

Michael Holyoke


You probably use the internet, mostly for work and mostly to send email or to consult well-known resources, but some of the time for entertainment, or just out of curiosity. If you're like me, you've probably thought that with well over a billion web pages published, whatever you've wanted to find must be out there somewhere. And like me, you've probably tried a search engine, or lots of them, been frustrated by the results, and wondered why you can't just jump directly to the site or page that you just know must be out there. Though I hope that, like me, you've found something interesting along the way.

So you may have wondered how search engines work, or try to, and how the companies that make them or offer them make money, if they do. And is there any real reason why the tiny clutch of Stanford graduates who started them all up are so phenomenally rich?

How they work

Any search site needs three components, which it may own and create itself, get from partners, or both: a way of gathering information about web pages, a way of indexing them, and a way of retrieving the results of queries and displaying them to users, ranked in a particular order. All search sites have these components, but they differ slightly or greatly in how they go about each one.

Gathering it

Web search facilities get their information in one of two fundamentally different ways, and the difference defines a tension that runs throughout the web. On the one hand, there are teams of web editors and surfers compiling directories and lists of websites by hand; on the other, there are ‘spiders’, ‘crawlers’, or ‘bots’ (from ‘robots’) that use a computer program to read and index web pages.

Human indexing

The most famous, and oldest, of all the web search facilities is Yahoo. Yahoo is a classic directory-based search facility. For years, web site owners have prayed and begged for their site to be accepted by one of the hordes of web editors at Yahoo, because Yahoo is still the most popular of all the search sites on the web. They gather their sites from user recommendation, I suspect usually from a very anxious webmaster. But the key to the attraction of directory-based sites is not just the human gatekeeper, it’s the value they add in the categorization and taxonomies of information they develop (and evolve on the basis of user input) for their directories. If you search in Yahoo, you will have noticed how it always returns the category of the matches, as well as individual sites and pages; Yahoo’s real value lies in its investment in making sense of the web in a structured way. On the other hand, the sheer size of the web, now estimated at well over 1 billion web pages and probably closer to 2 billion, means that even an army couldn’t possibly keep pace. But should they try? We’ll return to this question later.

Automated indexing

Web crawlers or spiders don’t involve human intervention at all, and so have the upper hand in the ‘land-grab’ of indexing. With a web crawler, a computer program scans web pages for text content, following links to reach as far into the web as it can. It’s a fun experiment (if you’re a closet geek like me) to start at a particular site and, without waiting too long to see what you get, just keep hitting links. I’m sure that one of my Royal Academy of Engineering readers will be able to point me to research that offers some way of estimating the total number of linked sites reachable from a given starting point, but I suspect that amongst the many different webs that start from different points there is both a great deal of overlap and a great deal of disjunction between the sets, since sites don’t always link to each other reciprocally.
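To make the crawling idea concrete, here is a minimal sketch in Python of the kind of program involved: it fetches a page, extracts the links, and follows them breadth-first up to a fixed limit. It is purely illustrative (real crawlers respect robots.txt, spread their requests politely, and store what they read for indexing); the page limit and the structure are my own assumptions, not any search engine’s actual code.

# A minimal, illustrative crawler sketch (not any search engine's actual code).
# It fetches a page, pulls out its links, and follows them breadth-first up to
# a fixed limit. Assumes well-behaved pages and ignores robots.txt for brevity.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from collections import deque

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    seen, queue = set(), deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                         # skip pages we cannot fetch
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link)) # resolve relative links
    return seen                              # the set of pages reached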

Which brings me to the question of scale and coverage. One of the key selling points of web crawlers is the vast number of pages they claim to visit. This battle to be biggest has been running feverishly, with occasional quiet periods, since the mid-to-late nineties, and has just recently flared up again with Google’s claim to have indexed over a billion pages, 560 million of them scanned for full indexing. So there is no doubt that web crawlers can visit more of the web than teams of editors. Other crawler sites, such as Webtop, now claim 500 million, and AltaVista and FAST weigh in at around 300 million plus each. But what no one seems to know is the degree of overlap between these enormous indexes, or whether there are categories (if that is possible in so huge a field) in which one or another index excels.

And there are key areas of the web that crawlers cannot reach: pages that can be (and for major sites frequently are) barred from spidering, to prevent direct access to specific pages when the site only wants subscribing members to come in through the front door; and sites whose content is served dynamically from databases. The latter, of course, include some of the most valuable research sources on the web, particularly for science and engineering.

Storing it

Web crawlers index as they read; directories, in effect, are indexed as their editors compile and categorize the sites they add to the list. One of the key selling points for directory-based sites is the intelligence of their taxonomies, whereas there are only subtle differences between the indexing strategies of the different crawlers. But between search indexes compiled by people and those compiled by machine there is an interesting difference. It is hard to deceive even a credulous editor about the true focus and depth of your site, but a robot is easier to fool. There is a constant battle between webmasters desperate to boost the ranking of their site for key terms and the bot programmers trying to detect and frustrate these ruses. A few years ago it was not uncommon for sites desperate for traffic, in order to boost advertising revenue, to conceal in the ‘meta’ tags of the home page a vast repetition of sexual references, as these were then far and away the most commonly searched terms. Viewers cannot see them, but the robot often can. There is in fact an entire industry built around improving your ranking in search engine results, and one company that specializes in these techniques (Website Results, based in Los Angeles) was just sold for $95 million.
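As an illustration of how a naive robot can be fooled, and of one simple countermeasure, here is a small Python sketch: the first weighting function counts every repetition of a term in a page’s ‘keywords’ meta tag, while the second caps how much repetition can contribute. The example page, the cap of three, and both functions are invented for illustration; real engines use far more elaborate spam detection.

# Illustrative sketch only: how a naive indexer could be fooled by keyword
# stuffing in a page's meta tags, and one simple countermeasure (capping how
# much repetition can contribute to a term's weight). Thresholds are invented.
import re
from collections import Counter

def meta_keywords(html):
    # Pull the content of a <meta name="keywords" ...> tag, if present.
    m = re.search(r'<meta\s+name=["\']keywords["\']\s+content=["\']([^"\']*)', html, re.I)
    return [w.strip().lower() for w in m.group(1).split(",")] if m else []

def naive_weight(html, term):
    return meta_keywords(html).count(term)   # easily inflated by stuffing

def capped_weight(html, term, cap=3):
    counts = Counter(meta_keywords(html))
    return min(counts[term], cap)            # repetition past the cap is ignored

page = '<meta name="keywords" content="dog, dog, dog, dog, dog, collars">'
print(naive_weight(page, "dog"), capped_weight(page, "dog"))   # prints: 5 3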

Serving it up

This takes us to ranking, that is, how search engines order the results of your queries according to what they judge to be their relevance. For most search engines this is pretty basic, with some tweaking, and it depends on frequency, location and, for multiple search terms, proximity. What that boils down to is that if you are searching for ‘dog’, the search engine will count the number of times ‘dog’ occurs in a page, and note where it occurs. If it occurs once, the ranking is low; if it occurs 2000 times, then obviously the ranking is higher. If it occurs in the title, or in the first sentence, or in any number of other selected positions, then the ranking of the page against ‘dog’ goes up again. For multiple search terms, such as ‘black dog’, the proximity of the terms can (and should) make a difference to the ranking. Most search engines will also cut out ‘stop words’ (‘a’, ‘an’, ‘the’ etc.) and most allow for ‘stemming’ (i.e. automatically including roots for stems and stems for roots of words, e.g. including ‘stem’ in the search if you put in ‘stemming’, and effectively vice versa). By the time you get to searching for ‘Churchill’s black dog’, depending on how sophisticated the counting, position, stemming, and proximity algorithms are, you could either end up with a link to a site about Churchill Falls, Canada, that happens to mention that the hydroelectric power station is blacked out and periodically dogged by environmental protests, or you could be taken to the site which gives you the exact quote you were looking for about Winston’s famous references to his periodic bouts of depression. Different search engines give different weightings to these occurrences, and this tinkering can make a difference.
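To show how these ingredients combine, here is a toy relevance scorer in Python. It strips stop words, applies crude stemming, counts occurrences, boosts matches in the title or early in the body, and rewards proximity between the first two query terms. All the weights and thresholds are invented for illustration; real engines tune them far more carefully.

# A toy relevance scorer, sketching the ingredients described above: term
# frequency, a boost for title/early positions, stop-word removal, crude
# stemming, and a proximity bonus for multi-word queries. All weights invented.
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}

def stem(word):
    # Extremely crude stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokens(text):
    return [stem(w) for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def score(query, title, body):
    q = tokens(query)
    body_toks, title_toks = tokens(body), tokens(title)
    s = 0.0
    positions = {}
    for term in q:
        hits = [i for i, t in enumerate(body_toks) if t == term]
        s += len(hits)                      # frequency
        if hits and hits[0] < 20:
            s += 2                          # early-occurrence boost
        if term in title_toks:
            s += 5                          # title boost
        positions[term] = hits
    # proximity: reward the first two query terms appearing close together
    if len(q) >= 2 and positions[q[0]] and positions[q[1]]:
        gap = min(abs(i - j) for i in positions[q[0]] for j in positions[q[1]])
        s += max(0, 5 - gap)
    return s

print(score("black dog", "Churchill's black dog",
            "the black dog was his name for depression"))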

Improving accuracy of ranking

This tinkering is all to the good, and programmers have spent a lot of time agonizing over it and testing it to improve the accuracy of the relevance ranking. But some search engines have taken this a step further and tried to build contextual information into the search. The most straightforward way to do this is to add a thesaurus capability to the search, so that searching for one word automatically includes a search for its synonyms. The more advanced engines have built their algorithms explicitly around statistical linguistic analysis. The two most famous of these have both come from Cambridge (UK). Muscat (now part of SmartLogik, formerly Dialog, and the engine behind Webtop) is a patented search strategy built around linguistic inference, and attempts, with excellent results, to refine the relevance ranking algorithms based on statistical analysis of language. Key to this strategy is how well it deals with proper nouns and longer search strings. Autonomy works on a similar model, but is not in the business of running a web search site.
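The thesaurus idea is simple enough to sketch: each query term is widened to a set of alternatives before the index is consulted, so a document matches if it contains either the term or one of its synonyms. The tiny synonym table below is invented for the example; a real system would use a large thesaurus or statistical models of word usage, and the Muscat and Autonomy approaches are considerably more sophisticated than this.

# A minimal sketch of thesaurus-style query expansion. The synonym table is
# invented for illustration; real systems use large thesauri or statistics.
SYNONYMS = {
    "car": {"automobile", "motorcar"},
    "film": {"movie", "picture"},
}

def expand(query):
    # Return one set of acceptable alternatives per query word.
    return [{term} | SYNONYMS.get(term, set()) for term in query.lower().split()]

def matches(doc_text, query):
    # A document matches if, for every query word, it contains it or a synonym.
    words = set(doc_text.lower().split())
    return all(words & alternatives for alternatives in expand(query))

print(matches("classic automobile reviews", "car reviews"))   # True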

Other engines have looked to implied or explicit user input to refine the way they return their results. Direct Hit tracks the sites users select from those returned for each search to fine-tune the relevance ranking. So if a thousand people hit the second entry in a results list from a search on ‘trout fishing’, this may move it up to first in the rankings. Theoretically, then, if enough people use Direct Hit just prior to a conference on Richard Brautigan, author of the surrealist sixties novel Trout Fishing in America, this could disappoint some serious anglers; but it is a clever way of reintroducing user evaluation into the automatic processes of indexing and retrieving. This popularity tracking is now also part of Ask Jeeves, which bought Direct Hit.

Ask Jeeves works by compiling an index of questions asked by Ask Jeeves users and the websites which provide the answers. These sites are selected by Ask Jeeves editors, but Ask Jeeves also monitors users’ interaction with the selected sites to try to evaluate the usefulness of the editorial selections. Ask Jeeves uses natural language parsing (a way of converting ‘normal’ language to something a machine can interpret) and searches not an index of sites but an index of answered questions, which means that theoretically it improves simply by being used.

Another recent development that has had a huge impact on searching success is Google. Google’s strategy is to index not only the text, but all the linking from page to page. The rationale is that, as they put it, ‘a link from a to b is a vote by a for b’. Thus the more links to your site, the better your site must be, and therefore it gets pushed up the rankings. But more cleverly still, they count not only the links to your site, but the number of links to the sites that link to yours. So if someone is looking for the best petcare site, Google will deliver a site with relevant content that has in effect been voted best in its category by webmasters across the web.
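Here is a deliberately simplified sketch of link counting in that spirit: a site’s score is the number of incoming links, with each ‘vote’ weighted by how many links point at the voter. This is a one-pass illustration of the votes-for-the-voters idea, not Google’s actual algorithm, which computes its rankings iteratively over the whole web; the little link graph is invented.

# A simplified, one-pass sketch of 'a link from a to b is a vote by a for b'.
# Each linking page contributes one vote, plus extra weight for the links that
# point at the linking page itself. Not Google's actual (iterative) algorithm.
links = {                        # page -> pages it links to (invented graph)
    "portal": ["petcare", "news"],
    "vet":    ["petcare"],
    "blog":   ["vet", "petcare"],
}

def inbound(graph):
    counts = {}
    for src, targets in graph.items():
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
    return counts

def link_score(page, graph):
    votes = inbound(graph)
    score = 0
    for src, targets in graph.items():
        if page in targets:
            # each linking page counts for 1, plus the links pointing at it
            score += 1 + votes.get(src, 0)
    return score

print(link_score("petcare", links))   # 4: three direct links, plus one extra
                                      # because "vet" itself has an incoming link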

Hybrids

The web being the tangled beast it is, few of the major search sites use just one strategy. Yahoo, for instance, searches its own directory first and, if it fails to find sufficient matches, turns to a partner and searches their ‘spidered’ index. Formerly they used Inktomi, which had no search site of its own and operated entirely by licensing its index; now Yahoo have moved to Google. Likewise AltaVista uses Ask Jeeves, which in turn uses web crawlers if it can’t find the answer in its own index. Excite has its own search strategy based on automatic directory building, but in turn owns Magellan and Webcrawler. On top of all of these sits a new breed of ‘MetaSearch’ engines, which pass your search to a collection of search engines and then collate the results; a colourfully named example is ‘Dogpile’. Recent reports have shown good performance, though the jury is still out on their benefits.
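Collation of this kind can be sketched in a few lines: each engine contributes a ranked list of results, every URL earns points according to its position in each list, and the totals decide the merged order. The result lists below are invented, and a real metasearch engine would also have to cope with timeouts, duplicates, and very different result formats.

# A small sketch of metasearch-style collation: merge several ranked lists by
# giving each URL points for its position in each list (a Borda-style count).
def collate(result_lists, depth=10):
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results[:depth]):
            scores[url] = scores.get(url, 0) + (depth - rank)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["site1.com", "site2.com", "site3.com"]   # invented result lists
engine_b = ["site2.com", "site4.com", "site1.com"]
print(collate([engine_a, engine_b]))   # site2.com and site1.com rise to the top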

Money

How they make money

Until recently, internet search engine companies have had two ways to make money: licensing their technologies to companies who want to run some sort of searchable service (web searching or their own datasets), or running a website on what is usually known in the internet world as ‘the media model’.

Licensing

This is a traditional, service-based way for search engine companies to make money. Companies like Verity (whose search engine and extensive hierarchical indexing grew out of work for the CIA), Inktomi (who licensed sets of indexes to Yahoo), Excalibur, Autonomy and SmartLogik all sell software on a license basis to business customers, as part or the entirety of their business. Ask Jeeves now also licenses its suite of technologies. They all compete on the extensiveness of their indexes, the quality of their categorization, or the smartness of their retrieval engine. A recent entrant to this market is WordMap, which has invested in a conceptual taxonomy of knowledge designed to allow users to refine their searches. They have an interesting business model, in that their ambition is to be a sort of application that runs within other searching applications, and they will license their technology to support other searching strategies.

Giving it away

‘The media model’ means (like a commercial TV station or a free newspaper) that you give your service away for free and make what you can out of advertising around the service. Ads come in the form of ‘banner ads’ (the ones across the top of the website) or ‘sidebars’ (yes, the ones along the side, usually the right side of the site), which are interactive to the degree that they are all hyperlinks to the advertisers’ websites. The media model was thought to be the way to unlimited riches on the web (‘All those eyeballs! All that billboard space!’), but it’s taken a bit of a hammering over the last year. It transpired that even though a website could assure an advertiser that, unlike TV, someone had pretty certainly seen (if not actually paid attention to!) the advertisement, the advertisers weren’t satisfied. Some advertisers started insisting that they wouldn’t pay except when a viewer actually clicked on the advert. Not only does this diminish ad revenue to the website, but it means that you aren’t being paid unless someone leaves your site. Moreover, advertisers were used to the huge audiences offered by conventional media, and it has proved difficult for agencies to sell advertising unless a site has a lot of traffic. But at least for a time, it seemed that the really big sites (and the top search sites are the leaders in this league) could command enough for their billboard space to make it profitable. Yahoo in particular announced in 1999 that it had finally become ‘cash positive’. But then came the internet crash, people began to realize that most of the advertising was coming from internet start-ups, funded in turn by people who may turn out to have had more money than prescience, and Yahoo’s stock took a beating.

So giving away a service doesn’t look so tenable anymore. There is now a great deal of interest in other ways of making money from search engines and search sites. In one more of the many neologisms from the web, this is known as ‘monetizing’ the search.

Transactions, or crumbs from a slice of the pie

Another way for someone who has a lot of traffic to make money is to look at what might be sold to visitors while they happen to be searching, especially as (now that the internet is not, commercially, entirely about pornography) they might actually be searching for something to buy. Then all you need is a link to a site that sells something, a bit of coding to track users, and hey presto, you can take a cut for delivering them to the site if they do finally buy something there. This is known as ‘transaction revenue’. On Yahoo, there is a prominent link to Amazon. If your site is as big as Yahoo or AOL, you might be able to charge for the very presence of a direct link (i.e. a button all your own), in addition to a cut of each transaction on the ‘affiliate’ site. In fact, if you have any decent level of traffic, ‘affiliates’ will probably be jamming your commercial director’s inbox offering a cut of their business, such as it is.

Paying for prominence

One of the reasons that e-commerce sites are so eager to be featured as a direct link is that there is always a multitude of competitors and, given the vagaries of the ranking software, even if you have the best offer you won’t necessarily come out at the top of the list. So why not just buy your way to the top? Horror! Compromise the integrity of the index and the impartiality of the ranking? Unthinkable? Not any more. Yahoo now has a fast-track submission system which, even though it has no direct bearing on your rank, or even on whether or not you get added to the directory, jumps you to the top of the queue for processing. Another search service, GoTo, has been far more explicit. In GoTo, webmasters bid for prominence in results. If Helicon wished to be featured top of the list for ‘encyclopedia’, we’d put in a bid. If somebody else bid higher, they’d come higher in the lists of results on searches for ‘encyclopedia’. The rationale here is to weed out the people who just know how to make their site come top of the list (as discussed above), and, I suppose, a presumption that the sites with the most cash and the most focussed search criteria will probably also have the best content. Another way that search sites might make money is to charge not for coming top of the list, but simply for being featured in a special spot on the results page, disassociated from the list but related to it. This amounts to a form of contextual advertising, which is what all sites with a media model are trying to achieve as a way of driving up the perceived value of web-vertisements.

Paying for presence

Another way of trying to get the sites that are served up to pay up is to own advertising opportunities around those sites. One way of doing this is to show the site that a user selects from a results list within a frame of the search site itself. Ask Jeeves already does this, though it leads to multiple layers of advertising: the advertising presented on the search site and the adverts on the ‘destination’ site. This in turn can mean precious little space left for the content you’ve actually searched for.

Paying for the service itself

So far, the ways I’ve outlined for the search service providers to make money involve what they can charge businesses, whether by charging for software, selling prominence, or sharing ad revenue. But none of these costs a cent to the person who uses the search engines. There are sites which charge. More commercial services, such as specialist business directories or academic search sites, have happily charged a subscription fee, which users who recognize the value of the service have been happy to pay. But what about the ‘freedom’ of the net? Can you make general users pay against the background expectation of so much free service? Well, as search services go bust or get bought out, I think we can expect to see new ‘business models’ come into play. In particular, as new charging mechanisms come online that allow for ‘micropayments’ directly related to usage, I think that search sites could begin to differentiate at least between full and ‘lite’ indexes and services. Two new methods of payment are coming on stream at the moment. One is ‘automatic’ switching to premium rate numbers (a switch that under law must be made completely obvious to users and made with their consent). This allows users to be billed on their standard telephone bills, and the services which use the premium rate lines to be paid per minute at a given rate. In between users and search sites lie the middlemen who operate the premium rate numbers and act as billing agents. This is another model of making money on the internet invented by the pornography industry, though interestingly, in the UK there is at least one such provider, Acquist, who refuses to operate the service for porn sites.

Another way of taking tiny bits of cash is for users to have a ‘wallet’ which they charge up and spend on content. Currently this is aimed at discrete chunks of pre-wrapped content, though there seems no reason, technically at least, not to extend it to served pages. Again the leading player in this field is a UK company, Magex, a start-up with investment from NatWest, among others.

So what’s it all about?

Search strategies and inventions abound, and I’ve only outlined a few. A great deal of work is being done on developing agents which learn about what users find useful and what users want. The former helps to deliver better search results, the latter to deliver more targeted advertising and services. Companies spring up with ideas and new business models, and in spite of the internet crash they are still getting funding. Because the one thing about the net is that it’s a mess; it sprawls and mutates constantly, and there are people out there with something useful to contribute who are, incredibly, often willing to do so for free. But you can’t find what you want on the web by wishing for it. So in principle the people who can organize this rich mess for you should be in a position to make a lot of money. At the moment everything is about land-grab, and users and telcos are the beneficiaries. But I suspect that there really is no such thing as a free lunch, not for long. One way or another, I suspect that soon, if you want to find your Mr Livingstone in the jungle, or find him faster than your competitor, you will have to pay your guide.


Michael Holyoke
Director of Online Development, Helicon

Michael Holyoke is Online Development Director of Helicon Publishing. Helicon, a wholly-owned subsidiary of W H Smith, publish the Hutchinson Encyclopedia, Almanac, Chronology and other reference works, which are licensed in whole or part to online service providers, including NTL, BBC online, AOL-UK, Compuserve, and Research Machines among many others in the UK and US. In the constant effort to structure data that best satisfies search criteria, Helicon has a keen interest in the way search engines work.
