CHINESE COMPUTING AND CONCORDANCING
Prof. Marjorie K.M. Chan
Dept. of E. Asian Langs. & Lits.
Ohio State University
HTAC 2003 Spring Colloquium
Chinese Computing and Concordancing
2 May 2003. 11:35 a.m. - 12:30 p.m.
316 Denney Hall
Ohio State University
Chan, Marjorie K.M. 2002.
"Concordancers and concordances:
Tools for Chinese language teaching and research."
Journal of the Chinese Language Teachers Association 37.2 (2002):1-58.
Also see: Chinese 889: Seminar in Chinese Linguistics
(Winter Quarter 2001: Databases and Corpora for Chinese Linguistic Research)
- Chinese Computing and Concordancing - A PowerPoint Presentation
(Suggestion: Save the file to the desktop (as a .pps file) and click open from there.)
- CHINESE E-TEXTS: PRELIMINARY PREPARATIONS
- Sources of e-texts
- Encoding of e-text - Big-5, GB(K), UTF-8, UTF-16
- HTML-tagged, POS-tagged, or plain e-texts with no tagging (POS: Part of Speech)
- E-texts - spaced/non-spaced
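The encodings listed above store the same characters as different byte sequences, so an e-text in one encoding can be re-encoded for a tool that only accepts another. The following is a minimal Python sketch, assuming the source encoding is already known. The function name `convert_encoding` and the sample title are illustrative and do not come from any of the tools discussed here.

```python
def convert_encoding(data: bytes, source_encoding: str,
                     target_encoding: str = "utf-8") -> bytes:
    """Decode bytes from a known source encoding and re-encode them."""
    # Decoding raises UnicodeDecodeError if the bytes are not valid
    # in the stated source encoding.
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


# Illustrative sample: the same title in traditional and simplified characters.
big5_bytes = "紅樓夢".encode("big5")
gbk_bytes = "红楼梦".encode("gbk")

utf8_from_big5 = convert_encoding(big5_bytes, "big5")
utf8_from_gbk = convert_encoding(gbk_bytes, "gbk")

print(utf8_from_big5.decode("utf-8"))
print(utf8_from_gbk.decode("utf-8"))
```

Note that the sketch does not guess the encoding. Bytes written in one Chinese encoding will often decode without error under another and yield wrong characters, so the source encoding should be confirmed before converting.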
Segmenters and segmenting utilities:
- Character segmentation:
- NJStar Communicator: Instructions for Adding Spaces Between Characters in Chinese E-Texts
Note: NJStar's Universal Code Converter can also break long lines into shorter ones: check (tick) "wrap text at __" (the default is "30"), with or without also adding a space.
- Microsoft Word 97/2000/XP: Instructions for Adding Spaces Between Characters in Chinese E-Texts
(thanks to Thomas Chan, 11 January 2001):
- Enter some hanzi text into MS Word.
- Then do Edit -> Replace.
(In this window, one sees items "Find what" and "Replace with", as well as a "More" button.)
- Click on the "More" button to show more options, among which is the "Special" button.
(The menu under the "Special" button displays a rich array of options for performing matches that are more fine-grained than "*" or "?"-style wildcards.)
- For "Find what", click on the "Special" button and select "Any Character".
- For "Replace with", select "Find What Text" (which will duplicate--i.e., preserve--what was matched), and then type a space after it.
- Click on "Replace" or "Replace All", and a space will be inserted after each hanzi!
- Be sure to save the file as an "encoded" text file.
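The find-and-replace steps above can also be approximated in a few lines of Python. This is a hedged sketch rather than a reproduction of what Word or NJStar does internally: the function name `space_characters` is hypothetical, and the sketch places one space between characters (not after the last one on a line) and drops any whitespace already present.

```python
def space_characters(text: str) -> str:
    """Put a single space between every pair of characters on each line."""
    spaced_lines = []
    for line in text.splitlines():
        # Drop existing whitespace so already-spaced text is not double-spaced.
        chars = [ch for ch in line if not ch.isspace()]
        spaced_lines.append(" ".join(chars))
    return "\n".join(spaced_lines)


print(space_characters("在桌子上\n在山上"))
```

Because every character is treated alike, any Latin-script words or numbers in the e-text would also be split letter by letter, so the sketch suits texts that consist of hanzi and punctuation.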
- Word segmentation:
- Chinese Annotation Tool
This online tool for character-segmentation and other tasks was developed by Erik Peterson and is currently housed at Zhang Zheng-sheng's website.
Note: Because Wenlin has a much bigger dictionary, adding spaces between words is better accomplished using Wenlin.
- Wenlin: Instructions for Adding Spaces Between Words (詞) in Chinese E-Texts
Note: Wenlin 3.1 has added a utility to break up a line. This is accomplished by going to "Edit" and selecting "Make new transformed copy", and then selecting "Break paragraphs into lines". This is best done after a non-spaced file has been segmented with spacing between characters or words.
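To illustrate why dictionary size matters for word segmentation, here is a sketch of one common dictionary-based method, forward maximum matching. It is an assumption chosen for illustration only: nothing here describes how Wenlin or the Chinese Annotation Tool actually segments text, and the names `segment_words` and `TOY_LEXICON` and the tiny word list are hypothetical.

```python
from typing import List, Set


def segment_words(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: take the longest lexicon match."""
    words = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            # Skip whitespace that is already in the text.
            i += 1
            continue
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon; a real one would hold many thousands of entries.
TOY_LEXICON = {"我們", "喜歡", "中文", "電腦", "語言學"}

print(" ".join(segment_words("我們喜歡中文電腦", TOY_LEXICON)))
```

Any word missing from the lexicon falls apart into single characters, which is the practical reason a larger dictionary tends to give better word spacing.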
- WENLIN 3.1 (and above)
- Pros: Can handle all encodings and spaced/non-spaced e-texts; a reliable tool for testing and verifying the search results of concordancers developed for English-language e-texts; Cross-platform-compatible (Windows/Mac).
- Cons: Non-KWIC (Keyword-in-Context) display; cannot do full concordances or searches for multiple words and strings.
- Instructions for Concordancing Chinese E-Texts using Wenlin (string search with no KWIC display)
- Demo 1 - Wenlin 3.1: Introduction to Wenlin 3.1. Conduct searches using non-spaced e-texts, and study distribution and contexts. Create a word-spaced e-text.
- MONOCONC PRO 2.2
- Pros: concordances of single or multiple word searches via Regular Expression, using segmented or unsegmented e-text; KWIC (Keyword-in-Context) display; fairly sophisticated sorting functions.
- Cons: Can handle Big5/GB(K)-encoding, but not UTF; Windows program that requires English Windows 98 to concordance Chinese e-texts.
- MonoConc Pro 2.2: Instructions for Concordancing Chinese E-Texts
- Demo 2 - NJStar Comm: E-text preparation - insert line breaks at a specified column in a non-spaced e-text that has no hard line breaks (e.g., Wenlin's copy of HLM).
- Demo 3 - NJStar Comm: E-text preparation - add spacing and column breaks to a text (e.g., Wenlin's copy of HLM).
- Demo 4 - MonoConc Pro 2.2: Search via Regular Expression (Regex), including wildcards (e.g., 在.*上), using non-spaced e-text. Sort the results to study distribution patterns, etc. Save concordance results as text file.
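The kind of search shown in Demo 4 can be sketched in Python with the standard `re` module. This is not MonoConc Pro's implementation. The function names `kwic` and `sort_by_right_context`, the context width, and the three sample sentences are all made up for illustration.

```python
import re
from typing import List, Tuple

Row = Tuple[int, str, str, str]  # (line number, left context, match, right context)


def kwic(text: str, pattern: str, width: int = 6) -> List[Row]:
    """Return keyword-in-context rows for every regex match, line by line."""
    regex = re.compile(pattern)
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in regex.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            rows.append((line_no, left, m.group(), right))
    return rows


def sort_by_right_context(rows: List[Row]) -> List[Row]:
    """Sort rows by the text following the match (Unicode code-point order)."""
    return sorted(rows, key=lambda row: row[3])


# Invented sample sentences, non-spaced, one per line.
SAMPLE = "書在桌子上。\n他坐在椅子上看書。\n貓在屋頂上睡覺。"

for line_no, left, match, right in sort_by_right_context(kwic(SAMPLE, "在.*上")):
    print(f"{line_no:>3}  {left:>6} [{match}] {right}")
```

Two caveats apply. The pattern 在.*上 is greedy, so on a line with several 上 it runs to the last one; 在.*?上 stops at the first. Sorting here follows code-point order, which is neither pinyin nor stroke order.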
- CONCORDANCE 3.0 (and above)
- Pros: fast concordances (single or multiple word searches) and full concordances (NB: requires segmented e-text); KWIC (Keyword-in-Context) display; sophisticated sorting functions.
- Cons: Can handle Big5/GB(K)-encoding, but not UTF; Windows program that requires English Windows 2000 or higher to concordance Chinese e-texts.
- Instructions for Concordancing East Asian E-Texts using Concordance
(Note: The computers in 316 Denney Hall currently run under Windows 98, whereas Concordance 3.0 (and above) requires Windows 2000 or higher for Chinese concordancing results.)
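The reason a full concordance needs a segmented e-text is that every word must be delimited before it can be indexed. A minimal Python sketch of the idea follows, assuming words are separated by spaces. It is not how Concordance itself works, and the names `full_concordance` and `by_frequency` and the sample text are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Location = Tuple[int, int]  # (line number, word position within the line)


def full_concordance(segmented_text: str) -> Dict[str, List[Location]]:
    """Index every space-delimited word by where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(segmented_text.splitlines(), start=1):
        for position, word in enumerate(line.split(), start=1):
            index[word].append((line_no, position))
    return dict(index)


def by_frequency(index: Dict[str, List[Location]]):
    """List (word, locations) pairs, most frequent first, ties by code point."""
    return sorted(index.items(), key=lambda item: (-len(item[1]), item[0]))


# Invented word-spaced sample text.
SEGMENTED = "我們 喜歡 中文\n我們 學習 中文 電腦"

for word, locations in by_frequency(full_concordance(SEGMENTED)):
    print(word, len(locations), locations)
```

On a non-spaced text, `line.split()` would return each whole line as a single "word", which shows why segmentation has to come first.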
Created: 1 May 2003 by Marjorie Chan.