09 June 2010
Written by Dhruv Tanwar
Google has woken up the world of online search by serving up an ingredient that millions of people rely on to kick-start their workdays – Caffeine.
Caffeine is Google's latest web indexing system. It provides 50 percent fresher results for web searches than Google's previous index and holds the largest collection of web content the company has offered. Speed and relevance are what Google is aiming at with Caffeine. The company says content will reach searchers much faster, especially recency-based content such as news stories, blogs and forum posts, which will now be available sooner after publication than ever before.
Google software engineer Carrie Grimes explained the mechanics of Caffeine to the average user. “When you search Google, you're not searching the live web. Instead you're searching Google's index of the web which, like the list in the back of a book, helps you pinpoint exactly the information you need.” Grimes graduated from Harvard with an A.B. in Anthropology/Archaeology in 1998, and an interest in quantitative methods for dealing with disparate data. She graduated from Stanford in 2003 with a PhD in Statistics after working with David Donoho on Nonlinear Dimensionality Reduction problems, and has been at Google since mid-2003. Dr. Grimes currently leads a research and technical team in Search Infrastructure at Google and works actively to figure out what criteria make a search engine index "good," "fast," and "comprehensive."
“Content on the web is blossoming. It's growing not just in size and numbers but with the advent of video, images, news and real-time updates, the average web page is richer and more complex. In addition, people's expectations for search are higher than they used to be. Searchers want to find the latest relevant content and publishers expect to be found the instant they publish,” Grimes said in Google's statement introducing Caffeine to the world.
Google's old index had several layers, some of which were refreshed faster than others, while the main layer was updated every couple of weeks. To refresh a layer of the old index, Google had to analyze the entire web, which meant a significant delay between when a page was found and when it became available in search. Caffeine, on the other hand, lets Google analyze the web in small portions and update its search index continuously and globally. As the search engine finds new pages or new information on existing pages, it adds them straight to the index, which means fresher information is available faster than ever before “no matter when or where it was published.”
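The difference between rebuilding a whole layer and updating page by page can be illustrated with a toy inverted index. This is only a minimal sketch of the general idea, not Google's implementation: the `IncrementalIndex` class, the `rebuild` function and the example URLs are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated one page at a time (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs containing it
        self.page_terms = {}              # URL -> terms currently indexed for it

    def update(self, url, text):
        """Index a new page, or re-index a changed one, touching only that page."""
        new_terms = set(text.lower().split())
        old_terms = self.page_terms.get(url, set())
        # Remove the page from terms it no longer contains.
        for term in old_terms - new_terms:
            self.postings[term].discard(url)
            if not self.postings[term]:
                del self.postings[term]
        # Add the page under terms that are new for it.
        for term in new_terms - old_terms:
            self.postings[term].add(url)
        self.page_terms[url] = new_terms

    def search(self, term):
        """Return the URLs that currently contain the term, sorted."""
        return sorted(self.postings.get(term.lower(), set()))


def rebuild(pages):
    """Batch approach: reprocess every page to produce a fresh index."""
    index = IncrementalIndex()
    for url, text in pages.items():
        index.update(url, text)
    return index


index = IncrementalIndex()
index.update("a.example/news", "caffeine launches today")
print(index.search("caffeine"))
```

In the batch approach, a changed page becomes searchable only after `rebuild` has reprocessed everything. With `update`, the page is searchable as soon as that single call returns, which is the behaviour the article attributes to Caffeine.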
“Caffeine lets us index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If this were a pile of paper it would grow three miles taller every second. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. You would need 625,000 of the largest iPods to store that much information; if these were stacked end-to-end they would go for more than 40 miles,” Dr. Grimes explained. More improvements to Caffeine are said to be in the works.
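The storage comparison in the quote can be checked with simple arithmetic. The short sketch below uses only the figures quoted above; the variable names are illustrative, and the per-device capacity and length it derives are implied by those figures rather than stated by Google.

```python
# Figures taken from the quote above.
index_size_gb = 100_000_000   # "nearly 100 million gigabytes"
ipod_count = 625_000          # "625,000 of the largest iPods"
stack_miles = 40              # "more than 40 miles"

METRES_PER_MILE = 1609.344

# Capacity each iPod would need for the numbers to line up.
gb_per_ipod = index_size_gb / ipod_count
print(gb_per_ipod)  # 160.0

# Minimum length of each iPod implied by a 40-mile stack.
metres_per_ipod = stack_miles * METRES_PER_MILE / ipod_count
print(round(metres_per_ipod * 100, 1), "cm")
```

The quoted figures are therefore consistent with a 160 GB device a little over 10 cm long.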