03 February 2010
Written by Dhruv Tanwar
Search giant Google, which moved to supporting the Unicode character set early on, has said that the standard has now exceeded all other encodings of text on the web. In the past two years, Unicode has pulled ahead of the other character sets, including pure ASCII and Latin-1, which, according to Google's indexing of the web, now hold less than 20% each.
Based on data from its indexing of web pages, Google says Unicode is now used for just under half of all web pages. Mark Davis, Google's senior international software architect, said in a blog post that Unicode has posted dramatic growth over all other encoding standards in the 18 months since Google first published a graph showing Unicode's growing dominance on the web. Back then, Unicode had only just become the most used standard. In a May 2008 blog post, Davis had announced Google's support for Unicode 5.1 less than a month after it was released, so that “people speaking languages such as Malayalam can now search for words containing the new characters in Unicode 5.1.” In that post he had said that web pages used various character encodings, such as ASCII, Latin-1, Windows-1252, or Unicode, with most encodings being capable of representing only a few languages. As he points out, Unicode can handle anything from Chinese to French to Arabic to Zulu.
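Davis's point about coverage can be illustrated with a minimal Python sketch. The sample words and the `can_encode` helper are assumptions chosen for illustration and do not come from Google's post. The sketch simply checks which encodings can represent a word from each language.

```python
# Illustrative sample words (assumptions, not taken from the article).
samples = {
    "French": "français",
    "Arabic": "عربي",
    "Chinese": "中文",
    "Malayalam": "മലയാളം",
}

ENCODINGS = ("ascii", "latin-1", "cp1252", "utf-8")


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Map each language to the encodings that can (or cannot) represent its sample.
coverage = {
    lang: {enc: can_encode(word, enc) for enc in ENCODINGS}
    for lang, word in samples.items()
}

for lang, result in coverage.items():
    print(lang, result)
```

In this sketch only UTF-8, a Unicode encoding, succeeds for all four samples. Latin-1 and Windows-1252 handle the French word but none of the others, and ASCII handles none of them.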
Google has long used Unicode as the internal format “for all the text we search: any other encoding is first converted to Unicode for processing.” Davis now says Unicode has exceeded all other encodings of text on the web. Showing a recent graph from Google's internal data, based on its indexing of web pages, Davis said “the trends are pretty clear.” Unicode is growing both in usage and in character coverage. Google recently upgraded to the latest version of Unicode, version 5.2 (via ICU and CLDR), which adds over 6,600 new characters. Some of them, such as Egyptian Hieroglyphs, are of mostly academic interest, but many others serve living languages.
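The conversion step Davis describes, in which text in any other encoding is first converted to Unicode, might look roughly like the following Python sketch. The `to_unicode` helper and the sample byte strings are hypothetical. Google's actual pipeline is not described in the post, and real pages often need encoding detection rather than a trusted label.

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode raw page bytes into a Unicode string (hypothetical helper).

    Bytes that are invalid in the declared encoding become U+FFFD
    instead of raising an error.
    """
    return raw.decode(declared_encoding, errors="replace")


# The same word stored as different byte sequences, depending on the encoding.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = "café".encode("utf-8")
# Curly quotes exist in Windows-1252 but not in Latin-1.
cp1252_bytes = "“quote”".encode("cp1252")

texts = [
    to_unicode(latin1_bytes, "latin-1"),
    to_unicode(utf8_bytes, "utf-8"),
    to_unicode(cp1252_bytes, "cp1252"),
]

for text in texts:
    print(text)
```

Once decoded, the Latin-1 and UTF-8 versions of the word compare equal even though their bytes differ, which is what lets later processing ignore the original encoding.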