Review published in: Journal of the Chinese Language Teachers Association (2004) 39.2:123-131.
Written by: Marjorie K.M. Chan, The Ohio State University
(This is a Unicode (UTF-8) encoded page.)
Google Search Engine: UTF-8 and Searches in Chinese. URL: <google.com>.
In light of 1) the importance of retrieving information online in the 21st century and 2) Google's phenomenal rise to first place amongst the online search-and-retrieval processes, this review discusses some interesting aspects of the search functions supported by Google, with particular attention paid to searches in Chinese. A proposal is offered here to make some small, customized changes to the parameter settings in Google to improve viewing of the search results. The setting changes proposed here aim to make better use of Google's ability to handle encoding in UTF-8, a capability which, at this time, does not seem to have been fully exploited. Before presenting the proposal, some preliminaries are needed. But first, a brief review of history...
For the younger generation today, it is difficult to fathom a world without the World Wide Web. And for all of us, the web (no longer requiring upper-case) is indispensable for information retrieval. Equally indispensable are search engines, those online applications that enable us to navigate the web in search of information.
The web, which is a subcomponent of the Internet, made its debut in 1991, and has played a critical role in the worldwide, online dissemination of information. That dissemination was facilitated first by the release of web browsers -- Mosaic in 1993, Netscape in 1994, and Internet Explorer in 1995 [1] -- and thereafter by the release of search engines -- Lycos, Webcrawler and Yahoo! in 1994, and AltaVista, Excite, Infoseek, and Looksmart in 1995. Search engines have come and gone, with some having more staying power than others. [2]
The top search engine in the United States is Google (google.com), founded in September 1998 by Sergey Brin and Lawrence Page. Since 2002, Google has outpaced its top competitor, Yahoo (yahoo.com). [3] Google is also the world's leading search engine. [4] Today, the critical role of search engines is fully appreciated. As former Lycos CEO Bob Davis noted, "Search is the ultimate killer online app; the Internet without search is like a cruise missile without a guidance system." (Levy 2004). At the same time, as Levy (2004) observes, "If you're not indexed by Google, you pretty much don't exist." And that is the bottom line.
Over the years, numerous features have been added to Google. These include the searching of various file types besides web pages (.html files), viewing of these other formats as HTML files, adding Unicode-based multilingual support, and so forth, all features that increase searching power and justify Google's current status as the most popular search engine worldwide.
With respect to searching in Chinese, Google very recently has even added Pinyin-search capability to its Simplified Chinese search site. [5] For those who maintain web sites, Google also provides the codes for adding a stand-alone, fully-functional, Google-powered search box within their web pages to search web-wide and/or site-internally. [6] Just such a search box has now been added to the Chinese Language Teachers Association home page <clta.osu.edu>. [7]
The main Google page, at <google.com> (i.e., <www.google.com>), is minimalistic, containing only its search engine, inviting web surfers to enter a search; that is, to enter some keywords to begin a basic search of the web. Such a Google search finds pages on the web that contain the matching words. Certain selections have been preset for this search (the Google "default" parameter settings). Specifically, Google defaults to searching the entire web, in any language, for any format (that is, any file format that Google supports; currently, these are .htm, .ps, .ppt, .xls, .doc, and .rtf). Often, such basic keyword searches suffice, so that many web surfers probably have not tried to further fine-tune their searches by conducting what Google terms an "Advanced Search." This choice is presented as a hot link on the main search page, and leads to a new "bookmarkable" Google page having several pull-down menu options including selection of file format to seek, as well as language in which to seek, such as "Chinese (Traditional)" and "Chinese (Simplified)."
The Google search engine's "Advanced Search" setup, with its choice of two "varieties" of Chinese, namely, "Traditional" and "Simplified," accommodates web pages in which the Chinese characters on the page are encoded either in Big5 for traditional Chinese characters or in GB(K) for simplified Chinese characters. A search for 'modern Chinese language' by entering "近代漢語" and selecting "Chinese (Traditional)," for example, will yield very different search results from one in which "近代汉语" is entered and "Chinese (Simplified)" is selected. The Traditional-Chinese search in Google returns a page showing that the choice selected is "Search Chinese (Traditional) pages"; that is, the search is conducted on the subset of pages on the web in which the character-set encoding is "big5." These web pages, which are HTML-formatted files, contain the following meta-tag near the top of the file:
<meta http-equiv="content-type" content="text/html; charset=big5">
Predictably, the resultant top hits come from websites in Taiwan, where Big5 encoding is used. Note further that the Google pages showing the search results are themselves generated with "big5" as their character set. [8]
As one might expect, conducting a Simplified-Chinese search in Google returns a page showing that the choice selected is "Search Chinese (Simplified) pages," with the top hits this time coming from websites in the People's Republic of China, and with the pages presented in simplified Chinese characters from within the specified "gb2312" character set. Corresponding to the above scenario, the Google pages presenting the search results are also meta-tagged with "gb2312" as their character set. While the situation has become more complex over the years, the above is the basic scenario. [9]
The options that Google provides in its "Advanced Search" accommodate only the two dominant character sets used in the Chinese communities in Asia. They do not, however, accommodate the searching of web pages that are encoded in UTF-8 and contain what might be called simply, "Chinese characters," irrespective of 1) whether these be traditional Chinese characters or simplified Chinese characters or even a mix thereof (such as 近代漢語 and 近代汉语 on the same web page) or 2) whether such so-called "Chinese characters" come from a Chinese, Japanese, Korean, or Vietnamese web page.
At this point in time, if someone is using an English-language Google on a computer running an English-language Windows operating system that is pre-Windows XP (e.g., Windows 98 or Windows 2000) and, say, using Netscape 4.7x or MS Internet Explorer 5.x, s/he cannot conduct a basic search by entering some Chinese characters and expect the default setting of "any language" to work in retrieving, from all over the web, those pages that contain the "search for THIS" characters as entered. Why is this so?
It is so because the character set used in those English-language Google web pages is "ISO-8859-1" (also called ISO Latin-1), which is an extension of the ASCII (American Standard Code for Information Interchange) character set, and as such is used for the alphabetic languages of Western Europe (such as English, French, Spanish, German, Finnish, and Swedish). This is a single-byte character set, and it is the default setting both for Google's main page and for its Advanced Search page. You may be asking, "Well, so what?"
Well, this is the what: if Google's "any language" default is selected, Google cannot then retrieve the double-byte Big5 and GB-encoded Chinese characters entered into a Google search. In short, Google is "single-byte" oriented, whereas "Chinese characters" are not single-, but rather are double-byte entities. Is there a fix for this?
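The single-byte versus double-byte contrast described above can be checked at the byte level. What follows is only a minimal Python sketch, not anything Google itself runs; it assumes Python's standard codecs named `iso-8859-1`, `big5`, and `gb2312`, and the helper name `byte_length` is purely illustrative.

```python
# Illustrative sketch: compare how many bytes one character needs in a
# single-byte character set versus the double-byte Chinese character sets.

def byte_length(char: str, encoding: str) -> int:
    """Return the number of bytes `char` occupies in `encoding`."""
    return len(char.encode(encoding))

# A Latin letter fits in one byte of ISO-8859-1.
latin_width = byte_length("e", "iso-8859-1")

# A traditional character in Big5 and a simplified character in GB 2312
# each take two bytes.
big5_width = byte_length("漢", "big5")
gb_width = byte_length("汉", "gb2312")

# ISO-8859-1 has no code for a Chinese character at all.
try:
    "漢".encode("iso-8859-1")
    latin1_can_encode_chinese = True
except UnicodeEncodeError:
    latin1_can_encode_chinese = False

print(latin_width, big5_width, gb_width, latin1_can_encode_chinese)
```

A page or form that is limited to ISO-8859-1 therefore has no direct way to carry the two-byte Big5 or GB codes typed into it.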
Google's solution for conducting a Chinese search in English Windows using English-language Google is to require the individual to go to the Advanced Search page and select from the "language" pull-down menu either "Chinese (Traditional)" or "Chinese (Simplified)".
But what can one do if s/he wishes to conduct a Google search using UTF-8 -- an 8-bit encoding of Unicode -- both as input and as output? [10] The character sets for Chinese have been developing in synchrony with the Unicode standard, with expansions of the GB code, for example, covering both simplified and traditional Chinese characters that are now included within the most recent Unicode standard. For scholars and educators working in communities outside Asia, such as those in North America, a viable third option for searching web pages containing Chinese characters would be to use the still-developing "international standard" called Unicode, with UTF-8 as the character set selected in the meta-tags in such pages, because doing that would accommodate not only simplified and traditional Chinese characters, but also characters found both in the ISO-8859-1 code and in the codes of other scripts (Greek, Russian, Persian, etc.).
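As a rough illustration of why UTF-8 is attractive as a "third option," the Python sketch below checks which encodings can hold a string that mixes English with traditional and simplified characters. It assumes Python's strict `big5` and `gb2312` codecs (the original GB 2312, not the later GBK or GB 18030 extensions); the helper `can_encode` and the sample strings are illustrative only.

```python
# Illustrative sketch: UTF-8 can hold a mix of scripts that Big5 or the
# original GB 2312 cannot hold on their own, at the cost of a variable width.

def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "modern Chinese: 近代漢語 / 近代汉语"

# Which encodings can represent the whole mixed string?
support = {enc: can_encode(mixed, enc) for enc in ("utf-8", "big5", "gb2312")}

# Bytes per character in UTF-8 for Latin, Greek, Cyrillic, and Chinese samples.
utf8_widths = {ch: len(ch.encode("utf-8")) for ch in ("a", "λ", "Я", "漢", "汉")}

print(support)
print(utf8_widths)
```

Only the UTF-8 entry in `support` comes out true, since the simplified characters are missing from Big5 and the traditional ones from GB 2312.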
Google's current setup for English Windows can and does provide the selection of UTF-8 as the character set for the output pages of a search, but only if one is using the latest operating system, such as Windows XP, which has the most advanced Unicode and multilingual support. For example, using WinXP with either MS Internet Explorer 6.0 or Netscape 7.1 to go to Google's main page to conduct a basic search, one finds that the character set in that Google page is "UTF-8", and not "ISO-8859-1," as is the case for those going to the same URL <www.google.com> but entering the site from an earlier Windows operating system!
Google currently does not provide the selection of UTF-8 as the character set for the output pages of a search for those with earlier operating systems. [11] Nonetheless, if a Google search box is placed in one's own web page, might there be possibilities for changing the parameter settings and for customizing the Google-powered search engine residing in one's own website, so that the search box 1) resides on a page containing UTF-8 as the character set, and 2) has the output pages (those that present any search results) also set up for UTF-8?
The answer is a resounding yes! This is precisely what I did, with the testing done using Windows 2000 as the operating system and MS Internet Explorer 5.0 as the web browser. The results are very promising.
Our most important adjustment concerns the character set for the encoding input and for the encoding output. The code given for the online form in (1) below is the "Google Free web search," a simple but fully-functional Google search box provided at Google's site for copying-and-pasting to one's own HTML file. [12] Note that it is very basic and does not specify any parameter setting for character set, thereby defaulting to ISO-8859-1 unless one's web page contains "UTF-8" as the character set. [13] At (1) below, note that the search box as shown is for searching the web at large; at (2) below, see the additional code to be inserted into the code given in (1).
(1) Code for Google search box -- Example from Google:
<FORM method=GET action="http://www.google.com/search">
<TABLE bgcolor="#FFFFFF"><tr><td>
<A HREF="http://www.google.com/">
<IMG SRC="http://www.google.com/logos/Logo_40wht.gif"
border="0" ALT="Google" align="absmiddle"></A>
<INPUT TYPE=text name=q size=25 maxlength=255 value="">
<INPUT type=submit name=btnG VALUE="Google Search">
</td></tr></TABLE>
</FORM>
The resultant Google search box as it would appear on a web page: [Image: Google search box]
The code in (1) does not specify encoding; hence, insert into that form the two lines given in (2) below, which specify UTF-8 as the character set for the two query parameters, <ie> and <oe>.
(2) Lines to add to the code in the Google search box at (1), above.
<INPUT type=hidden name=ie value="UTF-8">
<INPUT type=hidden name=oe value="UTF-8">
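To make the effect of the two hidden fields concrete, the Python sketch below builds the kind of GET request URL that a browser would be expected to send when the form is submitted from a UTF-8 page. It is a sketch under assumptions: the function name `build_search_url` is hypothetical, and the exact request a given browser produces may differ in detail.

```python
# Illustrative sketch: how the form fields q, ie, and oe end up in the
# query string of a GET request, with the keyword percent-encoded as UTF-8.
from urllib.parse import parse_qs, urlencode, urlsplit

def build_search_url(query: str) -> str:
    """Build a search URL carrying the keyword plus ie/oe set to UTF-8."""
    fields = [("q", query), ("ie", "UTF-8"), ("oe", "UTF-8")]
    return "http://www.google.com/search?" + urlencode(fields, encoding="utf-8")

url = build_search_url("近代漢語")

# Decoding the query string again recovers the original keyword.
recovered = parse_qs(urlsplit(url).query, encoding="utf-8")

print(url)
print(recovered["q"][0])
```

Each Chinese character in the keyword appears in the URL as three percent-encoded bytes, which is the UTF-8 form that the <ie> parameter declares.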
An example of the coding for a still more elaborate search box is given in (3), from Blogistan's website. [14] The search box is for searching in the subdirectory, 0001111, within the site, salon.com. Here, replace "ISO-8859-1" with "UTF-8" and add the second line from (2) to the code. The resulting search box on a web page would be simpler than that in (1), since there is no Google icon inserted as part of the search box, and the wording to conduct the search here is a spunkier "Just Goog It!" rather than Google's standard wording, "Google Search." Also, as visually displayed here, the width of the search box for (3) would be narrower, as the size specification is "10," and not "25," as in (1).
(3) Code for Google search box -- Example from Blogistan.
<form action="http://www.google.com/search" method="get">
<input type="text" name="q" size="10">
<input type="submit" value="Just Goog It!">
<input type="hidden" name="hl" value="en">
<input type="hidden" name="ie" value="ISO-8859-1">
<input type="hidden" name="as_qdr" value="all">
<input type="hidden" name="q" value="site:salon.com">
<input type="hidden" name="q" value="inurl:0001111">
</form>
The Google search box as it would appear on a web page: [Image: Google search box]
The above provides two samples of code for designing a Google-powered search box to search either the web or one's own website. The search boxes can be made still more functional, for example by adding radio buttons, with the necessary code, that let the user choose between searching the web and searching just the specified website. Be sure also to include the following meta-tag in the web page containing the search box:
<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
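One way to confirm that a page actually declares the intended character set is to read the meta-tag back out of the HTML. The Python sketch below does this with the standard library's `html.parser`; the class `CharsetFinder` and the function `declared_charset` are hypothetical names introduced here for illustration, and the sketch only looks at the `http-equiv` form of the tag shown above.

```python
# Illustrative sketch: report the charset that a page declares in its
# <meta http-equiv="content-type" ...> tag, if any.
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    """Remember the charset from the first content-type meta-tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=utf-8".
        for part in attr.get("content", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.lower() == "charset":
                self.charset = value.strip().lower()

def declared_charset(html: str):
    """Return the declared charset in lowercase, or None if not declared."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset

page = '<html><head><meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8"></head></html>'
print(declared_charset(page))
```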
With this new Google search box set up for UTF-8, irrespective of whether you have the latest operating system or an earlier one, you can test it by entering English, traditional Chinese characters, and simplified Chinese characters in the keyword search box -- such as "reading 閱讀 阅读" -- and see whether the resulting search pages have correctly retrieved the searched items and whether they display correctly. Keep in mind that with the query parameters, <oe> and <ie>, set to "UTF-8", a basic search for Unicode-encoded web pages does not distinguish CJK characters on a Chinese page from those on a Japanese page; for example, a search for 日本 (Japan, Japanese) retrieves not only Chinese web pages with "big5" or "gb2312" character sets, but also Japanese web pages with "Shift-JIS" (or some other Japanese encoding) as the character set, as well as English pages encoded in UTF-8, at least for that keyword search.
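The 日本 example can be illustrated at the level of code points. The Python sketch below says nothing about how Google indexes pages; it merely shows, using Python's standard `big5`, `gb2312`, `shift_jis`, and `utf-8` codecs, that the same two characters are stored as different bytes under each page encoding yet decode to the same Unicode characters.

```python
# Illustrative sketch: the same keyword stored under four page encodings
# has four different byte sequences but identical Unicode code points.

keyword = "日本"
page_encodings = ("big5", "gb2312", "shift_jis", "utf-8")

# Bytes as they would be stored in a page using each character set.
encoded = {enc: keyword.encode(enc) for enc in page_encodings}

# Decoding each byte sequence with its own character set recovers the keyword.
decoded = {enc: raw.decode(enc) for enc, raw in encoded.items()}

code_points = [f"U+{ord(ch):04X}" for ch in keyword]

for enc, raw in encoded.items():
    print(enc, raw.hex())
print(code_points)
```

Once text from any of these pages has been mapped to Unicode, a match on the code points alone cannot tell which language or character set the page came from, which is consistent with the behavior described above.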
In the above example, to restrict the search for 日本 (Japan, Japanese) to, say, just those web pages with Simplified Chinese characters only, one could always opt for an alternative searching strategy, namely, that of using 1) the "Advanced Search" page in English Google, 2) the Simplified Chinese Google page, or 3) one's own customized "Advanced Search" page. The third option uses Google's page as a template, keeping in mind to specify "UTF-8" as the character set on that new page. [15]
To conclude, the minor adjustments proposed here for the Google search box in one's own web site (or in a web page residing in one's computer) offer a valuable option for web surfers: that of searching for Traditional Chinese, Simplified Chinese, or a mix thereof in a single search. UTF-8 enables you to have these options, and readers -- owners of the latest operating system and owners of earlier operating systems -- are all welcome to go to the CLTA home page to test out the Google-powered search box.
Notes
[1] Browser Timelines. <http://www.blooberry.com/indexdot/history/browsers6.htm>.
[2] A useful site is Search Engine Time Line. <http://www.itglossary.net/search-engine-time-line.html>. For an extensive list of general search engines, see Phil Bradley's Web Search Engines at <http://www.philb.com/webse.htm>.
[3] Relevant statistics are given in a press release dated 23 March 2004, "Google's Search Referral Market Share Reaches an All-Time High, According to WebSideStory -- Top Search Engine Opens Up Widest Lead Yet," which is online at <http://www.websidestory.com/pressroom/pressreleases.html?id=219>.
[4] For some recent journal articles on Google and the co-founders, see, for example, Tayler (2004) and Levy (2004).
[5] URL: <http://www.google.com/intl/zh-CN/>. A brief description along with a test drive is presented at <http://chinalinks.osu.edu/c-links1.htm#google>.
[6] "Want to drive traffic? Try the Google free way." <http://www.google.com/searchcode.html>.
[7] As CLTA webmaster I added a Google-powered search engine to the CLTA home page on 4 April 2004, after some experimentation on pages at my own personal website. A new search engine was placed in the CLTA website after loss of access to the original search engine as a result of the move to a new, updated server. In moving to the new server, CLTA was also granted its own, independent domain (viz., clta.osu.edu).
[8] For a succinct discussion of encodings and character sets -- especially as they pertain to Chinese language computing -- see T. Chan's (2003) article.
[9] The situation has become more complex for several reasons, one of which is that the GB code has been extended immensely since it was first established as GB 2312 in 1981, with only 6,762 characters that do not contain traditional characters for which a simplified version had been adopted. More recent encodings, however, have incorporated traditional Chinese characters, so that GB 18030 (introduced in 2000), for example, contains over 70,000 characters, incorporating Han Chinese characters from version 3.0 of the Unicode standard (at <http://unicode.org>), thus containing all the characters in Big5 (and its successor, Big5 Plus) and many more. See T. Chan (2004) and sources cited therein for further details. A second reason for greater complexity is that Google behaves somewhat differently depending on the technology accessing its search engine. (The observation emerged from a fruitful discussion with Joshua Gilliland.) More on this coming up.
[10] UTF-8 is a variable-width encoding in which a Unicode character occupies one to four bytes. For example, ASCII characters use one byte, most other European alphabetic characters use two bytes, and most CJK characters use three bytes.
[11] I have not tested all possible combinations of operating system and web browser to provide more detailed information on this.
[12] URL: <http://www.google.com/searchcode.html>.
[13] Note that HTML (HyperText Markup Language), the publishing language of the World Wide Web, has, until 1998, supported ISO-8859-1 as the default character set, but with the implementation of HTML 4.0 (24 April 1998), the World Wide Web Consortium's recognition of the need for internationalization can be seen in their replacement of ISO-8859-1 with Unicode, and the resulting support of UTF-8 as the character set for HTML files.
[14] URL: <http://radiofreeblogistan.com/stories/2002/09/30/finetuningCustomGoogleSear.html>.
[15] URL: <http://www.google.com/intl/zh-CN>.
References
The URLs cited in this article were last accessed in the first two weeks of April 2004.
Blogistan Corporation. "Fine-tuning custom Google search." <http://radiofreeblogistan.com/stories/2002/09/30/finetuningCustomGoogleSear.html>.
Browser Timelines. <http://www.blooberry.com/indexdot/history/browsers6.htm>.
Chan, Thomas A. 2003. "Character sets and characters: The basis of Chinese language computing." Journal of the Chinese Language Teachers Association 38.2: 87-108.
Chinese Language Teachers Association. <http://CLTA.osu.edu>.
Google. <http://www.google.com>. "Want to drive traffic? Try the Google free way." <http://www.google.com/searchcode.html>. Google Simplified Chinese. <http://www.google.com/intl/zh-CN>. Google Traditional Chinese. <http://www.google.com/intl/zh-TW>.
Levy, Steven. 2004. "All eyes on Google." Newsweek (Technology and Science section, March 29, 2004). <http://msnbc.msn.com/id/4570868/>.
Marjorie Chan's ChinaLinks. <http://chinalinks.osu.edu>.
Search Engine Time Line. <http://www.itglossary.net/search-engine-time-line.html>.
Tayler, Chris. 2004. "The Google Guys." Time (Special Issue: The Time 100), Volume 163, No. 17 (April 26, 2004): 70. <http://www.time.com/time/subscriber/2004/time100/builders/100brin.html>.
Unicode Consortium. <http://unicode.org>.
Web Search Engines. <http://www.philb.com/webse.htm>.
WebSideStory. "Google's search referral market share reaches an all-time high, according to WebSideStory -- Top search engine opens up widest lead yet." <http://www.websidestory.com/pressroom/pressreleases.html?id=219>.
World Wide Web Consortium (W3C). <http://www.w3.org>.
Marjorie K.M. Chan
The Ohio State University
Created: 9 May 2004 by Marjorie Chan.
URL: http://people.cohums.ohio-state.edu/chan9/pubn/google_rev.htm