Adding spaces between Chinese characters is a fairly automatic process; adding spaces between words (詞) in Chinese, however, is far more complex. The Wenlin program has an extensive dictionary of modern Chinese words and phrases, and it can segment a non-spaced e-text into (mono- or polysyllabic) words. Instructions are given below.
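To illustrate why the two tasks differ, here is a minimal Python sketch. It is not Wenlin's algorithm, which is not described here; it swaps in a simple greedy "forward maximum matching" lookup as a stand-in. The function names and the tiny `TOY_LEXICON` are hypothetical and exist only for this illustration.

```python
def space_characters(text):
    """Insert a space between every pair of characters (the automatic case)."""
    return " ".join(text)


def forward_max_match(text, lexicon, max_len=4):
    """Greedy dictionary segmentation: at each position, take the longest
    lexicon entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real dictionary has many thousands of entries.
TOY_LEXICON = {"我們", "學習", "中文"}

sample = "我們學習中文"
print(space_characters(sample))                          # character spacing
print(" ".join(forward_max_match(sample, TOY_LEXICON)))  # word spacing
```

Character spacing needs no linguistic knowledge at all, whereas word spacing depends entirely on what the dictionary contains. A greedy lookup like this one can also pick the wrong split when two dictionary words overlap, and cases of that kind are where manual decisions are needed.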
- Open a non-spaced text file in Wenlin (e.g., version 3.1 or 3.2), and then from the menu bar, select Edit and then Make Transformed Copy, as shown in the screenshot below. The e-text is a Big5-encoded file that is Lesson One from Bai et al.'s Across the Straits.
- Next, select Segment Hanzi, as shown below.
- The transformed, segmented file contains some "ambiguities", that is, cases where the program cannot decide where to segment and a manual decision is needed. In this particular file, there are 29 ambiguities that need fixing.
- After "acknowledging," the first case that needs to be fixed is highlighted. In the screenshot below, the entire choice is highlighted.
- After selecting the choice for word segmentation, go to the next case that needs fixing by selecting Search and then Find Fix from the menu bar, as shown below. Alternatively, select Find Again (Ctrl+G). Repeat this process with Find Fix or Find Again until all ambiguities have been fixed.
- The end result of fixing all ambiguities is displayed in the screenshot below, with the file saved as "L01_ci-segmented_Wenlin.B5".
- The saved file can, if you wish, be further "massaged" to remove the vertical bars, add a space before and after the punctuation marks, etc. To do so, you first need to enable editing, as shown below, by selecting File and then Enable editing.
- To remove the vertical bars, for example, first go to Search and then Find on the menu bar, as shown in the screenshot below.
- Next, for Find:, enter "|", and for Replace with:, enter nothing, and click OK.
- Next, select Search from the menu bar, and then select Replace & find again (Ctrl+K).
(Note: If you make a mistake, you can undo the last step by selecting Edit from the menu bar, and then Undo replace (Ctrl+Z). On the other hand, if you are satisfied with the replacement, you can select Search and then Replace all.)
- When all replacements are done, the end result is as shown in the final screenshot below, with spacing between words (詞) and punctuation marks isolated. If further "massaging" is desired, repeat the find-and-replace steps above (Steps 7 through 10, from enabling editing through Replace & find again). This Big5-encoded e-text file is now ready for full concordancing using R.J.C. Watt's Concordance software program.1
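For readers who would rather script the "massaging" than repeat Find and Replace by hand, the same clean-up can be sketched in Python. This is only an illustration under stated assumptions: the function names, the set of punctuation marks in `PUNCTUATION`, and the output file name are hypothetical, and the sketch assumes the segmented file already has spaces between words.

```python
import re

# Assumed set of punctuation marks to isolate; adjust it to the text at hand.
PUNCTUATION = "，。、；：？！「」『』（）"


def massage(text, punctuation=PUNCTUATION):
    """Remove vertical bars and put a space on each side of punctuation."""
    text = text.replace("|", "")  # same effect as Find "|", Replace with nothing
    marks = "[" + re.escape(punctuation) + "]"
    text = re.sub(marks, lambda m: " " + m.group(0) + " ", text)
    # Collapse runs of spaces and trim each line.
    lines = [re.sub(r" {2,}", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)


def massage_file(src, dst, encoding="big5"):
    """Read a Big5-encoded file, clean it up, and write the result to dst."""
    with open(src, encoding=encoding) as f:
        text = f.read()
    with open(dst, "w", encoding=encoding) as f:
        f.write(massage(text))


print(massage("我們 學習 | 中文。"))
```

To process the file saved earlier, one could call `massage_file("L01_ci-segmented_Wenlin.B5", "L01_massaged.B5")`, where the second file name is made up for this example. Writing to a new file leaves the Wenlin output untouched in case the result needs to be redone.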
1 See my online Instructions for Concordancing East Asian E-Texts using Concordance.
To use the Wenlin software program itself for concordancing -- without KWIC (keyword-in-context) display -- see my online Instructions for Concordancing Chinese E-Texts using Wenlin.
Created by Marjorie K.M. Chan on 9 April 2003 for a concordancing workshop at Kenyon College.
Last update: 25 March 2005.
Copyright © 200x Marjorie K.M. Chan. All rights reserved.