HTAC 2003 Spring Colloquium - Humanities Computing in the 21st Century


Prof. Marjorie K.M. Chan
Dept. of E. Asian Langs. & Lits.
Ohio State University
Columbus, Ohio
Email:   <chan.9@osu.edu>
Poster Session:    
HTAC 2003 Spring Colloquium
Research Tools
Chinese Computing and Concordancing
2 May 2003.   11:35 a.m. - 12:30 p.m.
316 Denney Hall
Ohio State University
Columbus, Ohio
Chan, Marjorie K.M. 2002.
  "Concordancers and concordances:
  Tools for Chinese language teaching and research."
  Journal of the Chinese Language Teachers Association 37.2 (2002):1-58.
Also see: Chinese 889: Seminar in Chinese Linguistics
  (Winter Quarter 2001: Databases and Corpora for Chinese Linguistic Research,
  including References.)




    1. Chinese Computing and Concordancing - A PowerPoint Presentation
      (Suggestion: Save the file to the desktop as a .pps file and open it from there.)


    1. Sources of e-texts
    2. Encoding of e-text - Big-5, GB(K), UTF-8, UTF-16
    3. HTML-tagged, POS-tagged, or plain e-texts with no tagging (POS: Part of Speech)
    4. E-texts - spaced/non-spaced
      Segmenters and segmenting utilities:
      1. Character segmentation:
        1. NJStar Communicator: Instructions for Adding Spaces Between Characters in Chinese E-Texts
          Note: NJStar's Universal Code Converter can also break long lines into shorter ones (check/tick "wrap text at __", with "30" as the default), with or without also adding a space.
        2. Microsoft Word 97/2000/XP: Instructions for Adding Spaces Between Characters in Chinese E-Texts
          (thanks to Thomas Chan, 11 January 2001):
          1. Enter some hanzi text into MS Word.
          2. Then do Edit -> Replace.
            (In this window, one sees items "Find what" and "Replace with", as well as a "More" button.)
          3. Click on the "More" button to show more options, among which is the "Special" button.
            (The menu under the "Special" button offers a rich array of options for performing matches that are more fine-grained than "*" or "?"-style wildcards.)
          4. For "Find what", click on the "Special" button and select "Any Character".
          5. For "Replace with", select "Find What Text" (which will duplicate--i.e., preserve--what was matched), and then type a space after it.
          6. Click on "Replace" or "Replace All", and a space will be added after each hanzi.
          7. Be sure to save the file as an "encoded" text file.
      2. Word segmentation:
        1. Chinese Annotation Tool
          This online tool for character-segmentation and other tasks was developed by Erik Peterson and is currently housed at Zhang Zheng-sheng's website.
          Note: Because Wenlin has a much bigger dictionary, adding spaces between words is better accomplished using Wenlin.
        2. Wenlin: Instructions for Adding Spaces Between Words (詞) in Chinese E-Texts
          Note: Wenlin 3.1 has added a utility to break up a line. This is accomplished by going to "Edit", selecting "Make new transformed copy", and then selecting "Break paragraphs into lines". This is best done after a non-spaced file has been segmented with spacing between characters or words.
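
The e-text preparation steps above (changing the encoding, adding a space after each character, and breaking long lines) can also be sketched in a few lines of Python. This is only an illustrative sketch, not how NJStar, MS Word, or Wenlin work internally. The function names are hypothetical, the hanzi test covers only the basic CJK Unified Ideographs block (an assumption), and, unlike the MS Word "Any Character" replacement, spaces are added after hanzi only.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode raw e-text bytes, e.g. from 'big5' or 'gbk' to 'utf-8'."""
    return data.decode(source).encode(target)


def is_hanzi(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block.

    Assumption: extension blocks and compatibility ideographs are ignored.
    """
    return "\u4e00" <= ch <= "\u9fff"


def add_char_spaces(text: str) -> str:
    """Insert a space after every hanzi, leaving other characters untouched."""
    return "".join(ch + " " if is_hanzi(ch) else ch for ch in text)


def wrap_lines(text: str, width: int = 30) -> str:
    """Break a text without hard line breaks into lines of at most `width` characters."""
    return "\n".join(text[i:i + width] for i in range(0, len(text), width))


if __name__ == "__main__":
    big5_bytes = "紅樓夢".encode("big5")           # stand-in for a Big-5 e-text
    utf8_bytes = transcode(big5_bytes, "big5", "utf-8")
    spaced = add_char_spaces(utf8_bytes.decode("utf-8"))
    print(spaced)
    print(wrap_lines("在桌上" * 20, width=30))
```

The default width of 30 mirrors the NJStar default mentioned above. Wrapping is best applied after spacing, so that the line length reflects the text that will actually be concordanced.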

  3. WENLIN 3.1 (and above)

    1. Pros:   Can handle all encodings and spaced/non-spaced e-texts; a reliable tool for testing and verifying the search results of concordancers developed for English-language e-texts; Cross-platform-compatible (Windows/Mac).
    2. Cons:   Non-KWIC (Keyword-in-Context) display; cannot do full concordances or searches for multiple words and strings.

    3. Instructions for Concordancing Chinese E-Texts using Wenlin   (string search with no KWIC display)
    4. Demo 1 - Wenlin 3.1: Introduction to Wenlin 3.1. Conduct searches using non-spaced e-texts, and study distribution and contexts. Create a word-spaced e-text.

  4. MONOCONC PRO 2.2


    1. Pros:   concordances of single or multiple word searches via Regular Expression, using segmented or unsegmented e-text; KWIC (Keyword-in-Context) display; fairly sophisticated sorting functions.
    2. Cons:   Can handle Big5/GB(K)-encoding, but not UTF; Windows program that requires English Windows 98 to concordance Chinese e-texts.

    3. MonoConc Pro 2.2: Instructions for Concordancing Chinese E-Texts

    4. Demos:  
      1. Demo 2 - NJStar Comm: E-text preparation - insert line breaks at a specific column in a non-spaced e-text that has no hard line breaks (e.g., Wenlin's copy of HLM).
      2. Demo 3 - NJStar Comm: E-text preparation - add spacing and a column break to the text (e.g., Wenlin's copy of HLM).

      3. Demo 4 - MonoConc Pro 2.2: Search via Regular Expression (Regex), including wildcards (e.g., 在.*上), using non-spaced e-text. Sort the results to study distribution patterns, etc. Save concordance results as text file.
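
To make the idea of a regex search with KWIC display concrete, here is a minimal Python sketch of the kind of search shown in Demo 4. It is not MonoConc Pro's implementation, and its regex dialect may differ from Python's `re` module. The function name `kwic`, the context width, and the sample sentence are all invented for illustration.

```python
import re


def kwic(text: str, pattern: str, width: int = 5) -> list:
    """Return (left context, match, right context) for each regex match."""
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(), right))
    return rows


if __name__ == "__main__":
    sample = "書在桌子上，貓在椅子上。"   # invented sample, not from any e-text
    # Non-greedy ".*?" stops at the nearest 上; greedy ".*" runs to the last one.
    rows = kwic(sample, "在.*?上", width=2)
    # Sort by right context, one of the sorts a concordancer typically offers.
    for left, hit, right in sorted(rows, key=lambda r: r[2]):
        print(f"{left:>4} [{hit}] {right}")
```

With Python's `re`, the pattern 在.*上 is greedy and matches from the first 在 to the last 上 on a line, so 在.*?上 is used here to get one hit per phrase. Whether a given concordancer needs the same adjustment is something to verify in that tool.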

  5. CONCORDANCE 3.0 (and above)

    1. Pros:   fast concordances (single or multiple word searches) and full concordances (NB: requires segmented e-text); KWIC (Keyword-in-Context) display; sophisticated sorting functions.
    2. Cons:   Can handle Big5/GB(K)-encoding, but not UTF; Windows program that requires English Windows 2000 or higher to concordance Chinese e-texts.

    3. Instructions for Concordancing East Asian E-Texts using Concordance
      (Note: The computers in 316 Denney Hall currently run under Windows 98, whereas Concordance 3.0 (and above) requires Windows 2000 or higher for Chinese concordancing results.)
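
The reason a full concordance requires a segmented e-text can be seen in a small Python sketch: once words are separated by spaces, every word can be indexed by its location. This is an illustrative sketch only, not the Concordance program's method. The function name and the sample text are hypothetical.

```python
from collections import defaultdict


def full_concordance(segmented: str) -> dict:
    """Map every word in a space-segmented text to its (line, position) locations."""
    index = defaultdict(list)
    for line_no, line in enumerate(segmented.splitlines(), start=1):
        for pos, word in enumerate(line.split(), start=1):
            index[word].append((line_no, pos))
    return dict(index)


if __name__ == "__main__":
    text = "我 喜歡 中文\n他 也 喜歡 中文"   # invented word-segmented sample
    index = full_concordance(text)
    # List words by frequency, most frequent first.
    for word, hits in sorted(index.items(), key=lambda kv: -len(kv[1])):
        print(word, len(hits), hits)
```

On a non-spaced text, `line.split()` would return each whole line as a single "word", which is why segmentation has to come first.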


Created: 1 May 2003 by Marjorie Chan.

URL:     http://people.cohums.ohio-state.edu/chan9/conc/htac_2003.htm