The instructions on this webpage are for using R.J.C. Watt's Concordance program to make concordances from e-texts encoded in double-byte, East Asian encoding systems, with examples given from Big5- and GB-encoded Chinese e-texts. The instructions are also applicable, however, to e-texts encoded for Japanese and Korean. Concordance 3.0, running under English Windows 2000, was used for preparing this set of instructions. The current version of the program, Concordance 3.3, works well under all current English Windows operating systems (e.g., XP, Vista, 7). A KWIC (keyword-in-context) display of a non-spaced, Big5-encoded text file is shown below. The source text file is the result of an online search of the classifier tiao in the Sinica Corpus, with 1,724 hits. The concordance of tiao made with Concordance is shown with the context sorted to the right of the keyword. The line number represents the order of occurrence in the source text. The original HTML-tagging around the keyword in the Sinica Corpus search result was replaced by the pound (#) sign in preparing the source file. The Chinese characters are displayed using a user-specified Big5 font.1
Concordance, a commercial concordancing program developed by R.J.C. Watt, is a powerful program that can do full concordances. A "full concordance" means that *every* single (orthographic) word (i.e., a string of text separated by a space) is searched and compiled into a concordance. Other concordancing programs typically only allow searching for one or more words and phrases, but not the entire inventory of words in the corpus. Moreover, the concordance file can be saved for further sorting, extracting, and analysis. Needless to say, the program can also handle selective concordances, namely, searching for words, phrases, and regular expressions (substrings of words, etc.), as shown in the above example. Please note that "selective concordance" is used in the current version of the program, replacing the earlier term, "fast concordance", which appears in a couple of the figures on this webpage (viz., Figures 19 and 21, as the figures were created in 2001, hence, using an earlier version of Concordance; however, see Figure 21', which displays the current version).
Chinese e-texts (and East Asian texts in general) are typically prepared with no spacing between Chinese characters. As a result, for a full concordance, text with spacing between words or between Chinese characters is needed. To perform this task, one needs a segmenter. For simple segmenting, the Universal Code Converter utility program in NJStar's NJStar Communicator, for example, can add spacing between Chinese characters (or remove it), as well as add line breaks (via the "Wrap Text" option). (See my online Instructions for Adding Spaces Between Characters in Chinese E-Texts using NJStar Communicator.) Both Erik Petersen's online Chinese Annotation Tool and Wenlin Institute's Wenlin software for learning Chinese can segment Chinese non-spaced text into (polysyllabic) words. (See my online Instructions for Adding Spaces Between Words (詞) in Chinese E-Texts using Wenlin 3.x (version 3.0 or above).) With Wenlin 3.x and above, one can also break paragraphs into lines (under "Make transformed copy" in the Edit menu). Similarly, programs such as Microsoft's MS Word (MS Word 97 and above) can add manual line breaks to texts, as well as add spaces to non-spaced double-byte East Asian texts (via the search-and-replace function).2 For HTML-tagged e-texts, the tagging can be removed with web-authoring software or via Concordance's text-filtering capability (via "Filter a file", in the "File Conversion Tools" under the Tools menu).
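For readers who prefer a script to the tools named above, the simplest kind of segmenting (one space between every character, with HTML tagging removed) can be sketched in Python. This is only a minimal sketch and is not part of Concordance, NJStar, or Wenlin: the function name `space_characters` and the tag-stripping pattern are assumptions made for illustration, and the pattern copes only with simple, well-formed tags.

```python
import re

def space_characters(text: str) -> str:
    """Strip simple HTML tags and put one space between characters.

    Hypothetical helper for illustration; existing line breaks are kept.
    """
    no_tags = re.sub(r"<[^>]+>", "", text)  # crude removal of <...> tags
    spaced_lines = []
    for line in no_tags.splitlines():
        # Drop any existing spacing, then rejoin with single spaces.
        chars = [ch for ch in line if not ch.isspace()]
        spaced_lines.append(" ".join(chars))
    return "\n".join(spaced_lines)

if __name__ == "__main__":
    print(space_characters("<b>然后</b>他走了。"))
```

To use such a sketch on a GB- or Big5-encoded file, the file would be read and written with the matching codec (for example, `encoding="gb2312"` or `encoding="big5"` in Python's `open`). Note that this spaces every character, Roman letters and punctuation included, which may or may not be wanted.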
The following is an illustration of a GB-encoded e-text from Shuku Net, with manual line breaks added, spacing added between each character, and all HTML-tagging removed.
To make either a full concordance or a selective concordance of a GB- or Big5-encoded Chinese e-text, three preliminary steps are needed:
- Setting the font and character set.
- Setting the program to treat upper and lower case separately.
- Selecting the context style.
Step 1. Setting the Font and Character Set. For the first step, setting the font and character set for display of the concordance involves several sub-steps. First, click the "Text" pull-down menu, and select "Language & Font Control," as shown below.
In the Language and Font Control window, click "Change Font and/or Character Set":
In the Font window that is opened, select your Big5 or GB font, along with font style, font size, and script (or character set). In the screenshot below, the following settings were selected: SimSun font, Regular style, font size 12 point, and CHINESE_GB2312 as the script, or character set. Please keep in mind that it is Windows that has fonts, not Concordance. Hence, what shows up in the font selection are the fonts that you have installed in Windows. If the font you want, such as SimSun, displayed here, is not shown in your copy of Concordance, it is because you have not installed it in your computer. (Concerning online resources for Chinese fonts, see footnote 1.)
With the Language and Font Control window re-opened, click "Apply Font and Character Set throughout program," click "Clear" to clear the Character Set Details and Scratchpad, and then close that window.
Step 2. Setting the Program to Treat Upper and Lower Case Separately. Having set the font and character set, you are now ready for the second major step, namely, setting "Text" to treat upper and lower case separately. Click the "Text" pull-down menu, select "Special," and then select "Treat upper and lower case separately." This is shown below. Note that with the selection of "Treat upper and lower case separately," no setting needs to be made in Concordance's Alphabet dialog in the Text menu.
It is also important to remember to perform -- or at least to check -- this step each time, since the program may revert to the default setting. The default exists because the program was developed for use with English e-texts, where an upper-case 'A', for example, is treated "the same" as a lower-case 'a'. There is nothing comparable to this in Chinese orthography; without performing (or double-checking) this crucial step, the result could be two different Chinese characters being treated as "the same" character, which is not what you would want for Chinese concordanced results. (Interestingly, Concordance 3.3 running under English Windows 7 (Professional edition) does remember the special selection from a previous session. (05.14.2011))
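Why case matters for double-byte text can be illustrated outside the program. The Python sketch below assumes, purely for illustration, that case-folding is applied byte by byte as if the text were in a Western single-byte code page (Latin-1 here); how Concordance folds case internally is not documented on this page, and the function name `fold_case_as_latin1` is hypothetical. Under that assumption, the two GB-encoded bytes of 然 are altered by lower-casing, so they no longer spell the same character.

```python
def fold_case_as_latin1(raw: bytes) -> bytes:
    """Lower-case bytes as if they were Latin-1 text (illustrative assumption)."""
    return raw.decode("latin-1").lower().encode("latin-1")

original = "然".encode("gb2312")        # two bytes: C8 BB
folded = fold_case_as_latin1(original)  # C8 ('È') is lower-cased to E8 ('è')

if __name__ == "__main__":
    print(original.hex(), "->", folded.hex())
```

Seen this way, treating upper and lower case as "the same" would pair up byte values that belong to different Chinese characters, which is the situation Step 2 is meant to prevent.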
Step 3. Selecting the Context Style. Having set upper and lower case to be treated separately by the program, the third and final step is to select from "Context Styles" in the Text menu:
In the Context Styles window, Concordance offers two choices: "Actual line" and "Selected length":
"Actual line" is self-evident; this is the actual, physical line in which the keyword -- the "headword" -- to be searched occurs. As stated in the Help menu, this is "the physical line from the source text which contains a particular word," making this context style especially appropriate for such e-texts as lists and lines of verse, where the text in each line forms a separate, logical or syntactic unit. "Selected length" allows for such choices as the number of words before and after the keyword in the KWIC (keyword-in-context) display. This is particularly important when the lines of e-text are fairly long and not every word can be displayed in the KWIC display window, resulting in a series of dots ("....") displayed in lieu of the entire context to the left or right of the headword, for example. Within the options in the "Context Styles" window, there is also a choice of delimiting length by the number of words or by sense units, the latter marked orthographically in the source e-text by punctuation marks (comma, period, exclamation mark, etc.). (Note: While concordancing speed depends in part on such factors as the size of the source e-text, computer speed, amount of RAM, etc., the size of a concordance is determined in part by the size of the source e-text and in part by how much context precedes and follows the headword: shorter lines or selected lengths result in smaller concordances than longer ones. See "Context Styles" in the Help menu for further explanation.)
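The idea behind "Selected length" by number of words can be sketched in a few lines of Python. This is a simplified illustration and not the program's own algorithm: the function `kwic` and its parameters `before` and `after` are names invented here, the sample lines are made up, and the sketch keeps each context within its own line rather than reaching into neighbouring lines.

```python
def kwic(lines, keyword, before=3, after=3):
    """Return (line_number, left, keyword, right) for each hit.

    `lines` are strings whose words are separated by spaces;
    `before` and `after` are the numbers of context words to keep.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        words = line.split()
        for i, word in enumerate(words):
            if word == keyword:
                left = words[max(0, i - before):i]
                right = words[i + 1:i + 1 + after]
                hits.append((number, " ".join(left), word, " ".join(right)))
    return hits

# Made-up, character-spaced sample lines.
sample = ["我 们 走 吧", "你 先 吃 吧 我 等 你"]

if __name__ == "__main__":
    for hit in kwic(sample, "吧", before=2, after=2):
        print(hit)
```

A smaller `before`/`after` yields shorter context strings and hence a smaller concordance, which mirrors the size trade-off described in the note above.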
Now, you are ready to make a concordance. To create a full concordance of your spaced e-text, go to the menu bar and select "File" > "Make Full Concordance" > "From Files...":
Alternatively, click the leftmost button immediately below the "File" menu:
Upon making the selection, a menu window opens up. Click "Add files" and choose the file you had created. The file is then added to the list of files that you may include as your corpus. That is, one or more files can be selected. Highlighting the file would enable you to preview the contents in the right-hand window, if "Preview file" is checked. When you are ready, click "Make Full Concordance." In the illustration below, one file has been selected for making a full concordance:
[Note: When you perform this step, the right-hand window may not display the text as (double-byte) Chinese characters. Keep in mind that the program was developed for texts containing single-byte Roman letters. Hence, ignore the display in that right-hand window when you perform this stage of preparations for creating a full concordance using Chinese e-texts. In Figure 8, the Chinese characters display correctly precisely because an external Chinese decoder was used, and that was mainly to enable display of the Chinese characters for the screenshot. The important thing is that the display in that right-hand window plays no role in the results of the concordance itself.]
After you click "Make Full Concordance", the result is as illustrated below, with all the "Headwords" displayed in the Headwords window on the left, and a KWIC (Keyword-in-Context) display of the highlighted headword -- the sentence-final particle ba, with 29 hits in this corpus -- in the Context window on the right. There will be some headwords that are not Chinese characters. These include punctuation marks and other superfluous material. To delete an unwanted headword, highlight it and press the DEL button on your keyboard. To delete multiple lines, go to the Edit menu and choose "Multi-select." Lines in the Context window can likewise be deleted by highlighting them and pressing the DEL button. Do be careful in deleting headwords and context lines, since there is no "undo" function (although Concordance does create backup copies). Concordance automatically creates a concordance file when a selective or full concordance is made. The concordance filename is the name of the source text plus the file extension, ".concordance." Save the concordance to a filename of your choice to avoid losing the concordance when another concordance is made using the same source file. Remember to save periodically to update the file as headwords and context lines are deleted.
In the above display of the concordance, the Headword window has been string-sorted "alphabetically." Note that the source file was GB-encoded; as a result of the sequence of glyphs in the GB (and GBK) encoding system, the order of the headwords in Figure 9 above follows the Hanyu Pinyin romanization system. Sorting of the headwords is via "Headwords" > "Sort by" > "Alphabetic ascending (string)":
The Context window in Figures 9 and 10 shows a particular sorting order, namely, sorting based on the "word" to the left of the headword, ba. Sorting could also be to the right of the headword, or based on "Order of occurrence ascending" or "Order of occurrence descending." Sorting by "Word before headword" was performed as follows: "Context" > "Sorted by" > "Word before headword."
The result of the sort is shown below, with the View window also displayed, where the precise line in the source text is located, corresponding to the highlighted context line in the Context window. The keyword ba is highlighted in the View window. With respect to sorting in the Context window, other sorting orders can be performed, such as sorting by frequency in ascending or descending order, order of occurrence, and so forth.
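The two kinds of sorting discussed above can be imitated in Python to see why a plain string sort of GB-encoded text comes out in Pinyin order. This is a sketch under stated assumptions, not the program's own code: the names `gb_key` and `word_before` and the sample data are invented for illustration, and the Pinyin ordering holds for the commonly used (level-1) characters of GB2312, which that standard arranges by Pinyin.

```python
# Headwords (ni, a, hao in Pinyin), listed here in Unicode code-point order.
headwords = ["你", "啊", "好"]

def gb_key(word: str) -> bytes:
    """Sort key: the word's GB2312 bytes, i.e. a plain string sort of the file's bytes."""
    return word.encode("gb2312")

headwords_by_gb = sorted(headwords, key=gb_key)  # a, hao, ni
headwords_by_unicode = sorted(headwords)         # code-point order, for contrast

# Context lines as (order_of_occurrence, left_context, headword, right_context).
contexts = [
    (1, "我 们 走", "吧", ""),
    (2, "你 先 吃", "吧", "我 等 你"),
    (3, "好", "吧", "就 这 样"),
]

def word_before(context) -> bytes:
    """Sort key: GB2312 bytes of the word immediately left of the headword."""
    left_words = context[1].split()
    last = left_words[-1] if left_words else ""
    return last.encode("gb2312")

contexts_by_word_before = sorted(contexts, key=word_before)  # chi, hao, zou

if __name__ == "__main__":
    print(headwords_by_gb, headwords_by_unicode)
    for line in contexts_by_word_before:
        print(line)
```

The same sample sorted by Unicode code point gives a different order, which is one reason the encoding of the source file affects what "alphabetic" sorting means for Chinese headwords.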
In the Headwords window, the Context window, and the View window, Concordance allows the user to over-ride the font selection used in the overall displays, or to select a different font size or font color. To select a different Headword font (or font size/color), proceed as displayed in the following screenshot.
From there select the appropriate language font. The MS Song font for GB2312 encoding is selected. After all selections are made, click "Apply" to make the selections effective immediately.
Correspondingly, for the Context font, proceed as shown, and from there select the desired language font.
The small View window, which appears when a line in the Context window is double-clicked, shows the full context of the line from the source file. The headword is highlighted. To access the font selection in the View window, click "Options" in the View window's menubar, and then "Font" and select the desired font, font size, font color, etc. In the figure below, red was selected for the font color, and font size is increased from the default size of point 9 to point 12. (Note that the fonts that the View window can select for display are limited to fixed-width fonts, whereas the Headwords window and the Context windows permit both proportional-space fonts and fixed-width fonts.) In the figure below, the lines in the Context window are sorted by "Word after headword," and the keywords in the Headword window are sorted by "Frequency descending." From there, explore other searching functions using a full concordance, such as extracting from the full concordance a subset of the concordance via "Headwords" > "Select Words" > "Select words matching" using wildcards (* and ?).
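The effect of selecting headwords with the wildcards * and ? can be tried out with Python's standard `fnmatch` module. This is an illustrative sketch only: `select_words` and the headword list are invented here, and the sketch matches whole decoded characters, whereas the program may treat a double-byte character as two symbols, so a single ? in Concordance may not correspond to one Chinese character.

```python
from fnmatch import fnmatchcase

def select_words(headwords, pattern):
    """Return the headwords matching a wildcard pattern (* = any run, ? = one character)."""
    return [word for word in headwords if fnmatchcase(word, pattern)]

# Made-up headword list for illustration.
headwords = ["然后", "然而", "虽然", "后来", "自然界"]

if __name__ == "__main__":
    print(select_words(headwords, "然*"))   # words beginning with 然
    print(select_words(headwords, "?然"))   # two-character words ending in 然
    print(select_words(headwords, "*然*"))  # words containing 然 anywhere
```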
Besides sorting and searching functions, there is also another feature of concordancing programs that is found in the Concordance program, namely, generating collocations to study the contexts before and after the keyword searched. As shown in the screenshot below, to generate the collocations of the highlighted headword, click "Headwords" in the menubar, and then select "Collocations."
When "Collocations" is selected, a Collocations window opens, as shown below, with collocates to the right of the Headword displayed across the top half of the window, and collocates to the left of the Headword displayed across the bottom half of the window. The collocates immediately to the right of the Headword are in the top-left column, followed by the collocates second to the right of the Headword, and so forth. Within each column, the collocates are sorted by frequency of occurrence. Hence, at a glance, some patterns can be revealed that might not be obvious simply by sorting in the Context window.
Observe, also, that the Collocations window allows for other orientations in the displaying of the Collocations window. Changing the orientation by clicking the "Orientation" button results in the following display, with all left collocates aligned to the left of the Headword, and all right collocates aligned to the right of the Headword. (Think of the Headword as invisibly located where the thick red vertical bar is placed.) All columns are arranged in the same sequence as the words would appear linearly in the text in relation to the Headword. Besides options to change the orientation of the display of the collocates, the program also has an "Export" function; that is, the collocations generated can be exported to a text file for studying later.
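The kind of table that the Collocations window presents, with collocates counted separately for each position to the left and right of the headword, can be approximated in Python as follows. This is a minimal sketch, not the program's implementation: the function `collocates`, its `span` parameter, and the sample lines are assumptions made for illustration, and positions are counted within a single line only.

```python
from collections import Counter

def collocates(lines, headword, span=2):
    """Count collocates by position; keys are offsets such as -1 (left) or 2 (right)."""
    table = {offset: Counter() for offset in range(-span, span + 1) if offset != 0}
    for line in lines:
        words = line.split()
        for i, word in enumerate(words):
            if word != headword:
                continue
            for offset in table:
                j = i + offset
                if 0 <= j < len(words):
                    table[offset][words[j]] += 1
    return table

# Made-up, character-spaced sample lines.
sample = ["走 吧 我 们", "吃 吧 我 等", "走 吧 你 先"]

if __name__ == "__main__":
    table = collocates(sample, "吧", span=1)
    for offset in sorted(table):
        # most_common() lists collocates by descending frequency, as in each column.
        print(offset, table[offset].most_common())
```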
We turn now to making selective concordances using spaced e-texts, where one or more words are chosen to be searched. As with making full concordances, remember to take the three preliminary steps before making your concordance: (1) set font and character set (Figures 3a-d), (2) set "Text" to treat upper and lower case separately (Figure 4), and (3) select the context style (Figures 5a-b). In making a selective concordance using spaced e-texts, input your search word(s) using the operating system's input locale (in Windows 2000 or higher), or an external encoder.3 Choose, from among the options in the "Word Selection Method" at the bottom-left portion of the program, the Pick List option. Click "Edit" to edit the Pick List. Enter the words to be searched in the corpus as shown in the screenshot below. When the selections are made, close the Pick List window, save the changes that you have made to the Pick List, and then click "Make selective concordance" (or "Make fast concordance" in older versions of the program).
The result of using a Pick List to enter the words to be searched is shown below, where the items in the Pick List are displayed as items in the Headword window. Frequency, or number of tokens (or hits), is shown for each Headword, along with percentage figures for each headword (based on the number of tokens in this corpus of 770 tokens). Note also that in the figure, the four words in the Headword window have been sorted by "Frequency descending," while the lines in the Context window have been sorted by "Word before headword."
We turn next to making selective concordances (formerly, "fast concordances") using non-spaced e-texts, that is, text that does not contain any spacing between words, as is conventionally the case in written Chinese materials. For non-spaced Chinese e-texts, the search is for a word, consisting of one or more Chinese characters, or a subcomponent of a word, such as a stem, affix, etc. To search for a word or subcomponent of a word in a non-spaced e-text, conduct the search using a "regular expression", where the search is for a string of text, regardless of whether that string of text is a "word" or a subcomponent of a word. In conducting such searches, keep in mind that the software program does not need to know what is a "word"; it only needs to match up a string of symbols embedded in an e-text. A single Chinese character -- that is, a double-byte character in GB(K) and Big5 encoding -- is formed by a string of two symbols, each a single byte in length. With that in mind, conduct the search as follows: From the options in the Word Selection Method, choose "Regex" (regular expression); then click "edit," and enter the string to search. In the example below, the search is for a word that consists of a combination of two Chinese characters, 然 and 后. Note that each of these Chinese characters is in fact composed of two symbols -- È» in the case of 然 and ºó in the case of 后 -- that is, 然后 is composed of a string of four symbols -- È»ºó -- as they would be displayed if the string of symbols were not decoded. In other words, in conducting a "Regex" (regular expression) search, what appears as two Chinese characters is, in fact, simply a string of four symbols that is being searched for -- and matched -- in the e-text.
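The four-symbol view of 然后 described above can be checked directly in Python, which also makes it easy to imitate a byte-level search of a non-spaced text. This is an illustrative sketch and not how Concordance itself is implemented: the function `count_hits` and the sample sentence are invented here, and Latin-1 is used only as a convenient way of showing undecoded bytes as single-byte symbols.

```python
import re

word = "然后"
raw = word.encode("gb2312")        # four bytes: C8 BB BA F3
undecoded = raw.decode("latin-1")  # how those bytes look if shown as single-byte symbols

def count_hits(data: bytes, target: str, encoding: str = "gb2312") -> int:
    """Count byte-level matches of `target` in encoded, non-spaced text."""
    pattern = re.compile(re.escape(target.encode(encoding)))
    return len(pattern.findall(data))

# Made-up non-spaced sample, encoded as it would be stored in a GB e-text.
text = "他先吃饭然后走了然后呢".encode("gb2312")

if __name__ == "__main__":
    print(raw.hex(), undecoded)
    print(count_hits(text, word))
```

One caution that follows from matching symbols rather than characters: a byte-level pattern can, in principle, match across the boundary between two neighbouring characters, so unexpected hits are worth checking against the View window.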
After selecting the text by Regular Expression, click OK, and click the button, "Make Fast Concordance", as shown in the older version of Concordance in Figure 21 below, or "Make Selective Concordance", in Concordance 3.3, as shown in an updated screenshot in Figure 21' (prepared on 05.14.2011).
The result of the search of a non-spaced e-text is as follows.
The making of full concordances (as depicted in Figure 9) and selective (formerly, "fast") concordances (as exemplified in Figures 20 and 22) has been demonstrated in this set of instructions. Note that when you click to make a selective or a full concordance, the resulting concordance is automatically saved into the drive and directory of your computer that contains the source e-text. If you make a subsequent concordance using that same source e-text, the program will over-write the previous automatically-generated concordance file. The concordance file that is automatically saved by the program is essentially a text file with ".Concordance" as the file extension. In automatically saving the concordance, the program uses as the filename for the concordance a name that is identical to that of the source e-text plus the original file extension of that source e-text. Hence, if a source e-text is named "sample.txt", the concordance file that is automatically generated by the program (via selective or full concordance) has the name "sample.Txt.Concordance". If you wish to keep a concordance that you have made and do not want it to be automatically overwritten, it is therefore recommended that you re-save the concordance under a new name (via selecting "Save Concordance As..." from the File menu). (More on this topic of "Saving a concordance" is provided in the program's Help menu.) Concordances can be re-opened later to study, re-sort, etc. The ability to save and re-open a concordance for studying and for re-sorting and other manipulations is one of the very nice features of this program.
In re-opening a concordance, it is important to know that the concordance file contains information on where the source e-text, the source file, resides. As a result, if you have moved the source file to another drive or to a different subdirectory in your computer, you will need to edit the concordance file to point it to the source file's new location. Since the concordance file is simply a text file, any text editor can be used to modify the path to the source file. That path is located near the beginning of the concordance file, as shown in the highlighted line of the screenshot below. In the screenshot, the concordance file is opened using the Multiple Document Editor, which is bundled with the program (located under "Tools" in the Menu). The fifth row of the concordance file, containing the source file's filename and its location, is highlighted. Modify the path to the source file as needed to enable the Concordance program to open the concordance file and to load its source file. This step of editing the concordance file is currently a manual one, but as small, portable, storage devices become increasingly popular as alternatives to mass storage on the hard disk, perhaps a later version of the program will make this editing process an interactive one, with a dialogue box, when the program cannot find the concordance's source file.
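For those who move source files often, the manual edit described above could be scripted. The following Python sketch rests on several assumptions that should be checked against an actual concordance file before use: it assumes only that the old path appears literally in the file (as in the highlighted fifth row), it knows nothing else about the file's layout, and the names `repoint_source`, `repoint_file`, and the `.bak` backup suffix are invented here. It works on raw bytes so that any double-byte text in the file passes through unchanged.

```python
from pathlib import Path

def repoint_source(raw: bytes, old_path: str, new_path: str,
                   path_encoding: str = "ascii") -> bytes:
    """Replace the first occurrence of old_path with new_path in the file's bytes."""
    old = old_path.encode(path_encoding)
    new = new_path.encode(path_encoding)
    if old not in raw:
        raise ValueError("old path not found in concordance file")
    return raw.replace(old, new, 1)

def repoint_file(concordance_file: str, old_path: str, new_path: str) -> None:
    """Rewrite a concordance file in place, keeping a backup copy alongside it."""
    target = Path(concordance_file)
    raw = target.read_bytes()
    # Hypothetical backup name: the original filename plus ".bak".
    target.with_name(target.name + ".bak").write_bytes(raw)
    target.write_bytes(repoint_source(raw, old_path, new_path))
```

A call might look like `repoint_file("sample.Txt.Concordance", r"C:\old\sample.txt", r"D:\new\sample.txt")`, where both paths are placeholders. If the paths themselves contain Chinese characters, `path_encoding` would need to match the encoding used in the file.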
The above provides some basic instructions to get started with using Concordance (version 3.0 and higher) for concordancing using East Asian double-byte, spaced and non-spaced e-texts. The program itself has very well-documented help menus for using the program. Needless to say, these help menus should be read to make the best use of the program for concordancing. In addition, a manual and other information are also available online at R.J.C. Watt's website, Concordance: Resources and Extras.
Concordance was developed initially for English, and then extended to other Western languages that use Latin scripts. Hence, many of the program's functions that were designed for those scripts may not necessarily be applicable to texts in East Asian scripts. Consequently, any capability that the program has for handling scripts beyond the Latin scripts -- such as the double-byte East Asian scripts -- is a "bonus" that the developer cannot make promises about. That caveat aside, from late December 2000 to early December 2001, the program was modified step-by-step -- via email correspondence between the software developer who did not know Chinese and yours truly who lacked knowledge at the programming end -- with the aim of enabling the program to better accommodate concordancing of East Asian scripts. The program continues to evolve and to be tested with new versions of the Windows operating system, and, in time, with Unicode-encoded corpora (e.g., UTF8-encoded e-texts). At this point in the development of the program, besides the illustrations given here for Chinese, the following screenshots provide examples of concordancing of e-texts in Japanese (normally without word-spacing, as in Chinese) and in Korean4 (with word-spacing).
Figure 24. Japanese. Selective concordance using Regex for search on mainichi 'daily.'
Figure 25. Korean. Full concordance of a word-spaced e-text and sorting by "word-endings (string)."
In the Korean example above, the headwords have been sorted by "word endings (string)." The highlighted word is the noun, cali-ka 'position/place,' with -ka the nominative marker.
Sorting by "word endings (string)" is also possible in Chinese and Japanese if the source text has word-spacing, as in the case of the transcription of e-texts with spacing between (polysyllabic) words, as shown below, with full concordance of an e-text from the Taiwanese-Putonghua corpus at University of Pennsylvania's Linguistic Data Consortium. In this example, the highlighted word in the headword (and, hence, also the keyword in the Context (KWIC display) window) is a resultative compound, kaowan 'finish testing', consisting of the verb, kao 'test, examine,' and the resultative component, wan 'finish.'
Figure 26. Chinese. Full concordance of a word-spaced e-text and sorting by "word-endings (string)."
The instructions provided on this webpage should suffice to assist the user to begin exploring the use of Rob Watt's Concordance 3.2 for concordancing of East Asian e-corpora.5
1For online resources on Chinese fonts, see, for example, MC's ChinaLinks: Fonts for DOS/Windows/Mac. (The Arial Unicode MS Font, for example, contains all characters included in Unicode 2.1, thus covering Chinese, Japanese, and Korean, along with English and many other scripts of the world.) For archived e-texts online, see MC's ChinaLinks: Searchable and Archived Classical Chinese Texts. Other e-texts include online newspapers, magazines, and so forth; see MC's ChinaLinks: Chinese NetNews and E-Magazine Sites. Additional resources are available online via links from my graduate seminar, Chinese 889: Databases and Corpora for Chinese Linguistic Research (Winter Quarter 2001).
Also see: Chan, Marjorie K.M. 2002. "Concordancers and concordances: Tools for Chinese language teaching and research" (PDF, 1.03 MB, revised version with color illustrations), Journal of the Chinese Language Teachers Association 37.2 (2002):1-58.
2The following set of instructions for adding a space between Chinese characters using MS Word for Windows (MS Word 97 or above) is based on instructions provided by Thomas A. Chan (11 January 2001):
- Enter some hanzi text into MS Word.
- Then do Edit → Replace.
(In this window, one sees items "Find what" and "Replace with", as well as a "More" button.)
- Click on the "More" button to show more options, among which is the "Special" button.
(The menu under the "Special" button displays a rich array of options for which to perform matches that are more fine-grained than "*" or "?"-style wildcards.)
- For "Find what", click on the "Special" button and select "Any Character".
- For "Replace with", select "Find What Text" (which will duplicate — i.e., preserve — what was matched), and then type a space after it.
- Click on "Replace/Replace All", and you will have spaces between each hanzi!
3With the steps taken above, no external Chinese encoder/decoder is needed for displaying the Chinese characters in the Preview window or any other window. An input method or decoder is needed, however, when entering Chinese text for searching specific words or phrases in the corpus. Windows 2000, for example, has built-in multilingual input support. For other Windows operating systems -- including Windows XP -- an external encoder/decoder that allows one to select GB or Big5 as the encoding system may be needed. For information on Chinese encoder/decoders, see, for example, MC's ChinaLinks: Chinese Language Software.
4Thanks to my advisees at the time, Thomas A. Chan and Ok Joo Lee, for their assistance with the Korean example.
5For anyone interested in exploring the use of Concordance to create a web concordance of a Chinese (CJK) e-text, this is also possible. Some massaging is needed, however, since the program was intended to create web concordances for the English language, an example of which can be seen from R.J.C. Watt's web concordance of P.B. Shelley: Selected Poems (full concordance of Hymn to Intellectual Beauty, Ozymandias, etc.). Keep in mind also that web browsers behave differently, as do different versions of the same browser. The results of a test web concordance created on 1 December 2001 are displayed in this screenshot. This small test web concordance, based on a selective concordance of four sentence-final particles, makes use of the same character-spaced, GB-encoded e-text as that shown in Figure 2 above. This small sample web concordance is online at Web Concordance Test. Note: When you save your concordanced results as an HTML file or a Web Concordance, be sure to go to "Tools", select "Preferences", and turn off "Convert to HTML entities during output". (Performing this step will prevent getting garbled European special characters. -- Tip from Rob Watt, 10.26.2004.)
This webpage was initially prepared for graduate students in my Chinese linguistics courses at The Ohio State University for corpus linguistic research on the Chinese language. Build 159 (10 November 2001) of the Concordance program was used at that time to prepare the screenshots. Updates were then made to incorporate changes in the pre-release version of Concordance 3.0 (viz., Concordance 2.9.9 (Build 177, 30 November 2001)). Since then, Concordance 3.0 and Concordance 3.2 were successfully tested running under English Windows XP (Professional edition), while Concordance 3.3 was successfully tested under English Windows 7 (Professional edition).
Created by Marjorie K.M. Chan on 7 November 2001, with last major update on 5 December 2001.
Last update: 14 May 2011.
Copyright © 201x Marjorie K.M. Chan. All rights reserved.