Chinese 889. Seminar in Chinese Linguistics

Databases and Corpora
for Chinese Lx Research


    TEXTBOOKS   Available from SBX (1806 N. High Street. 291-9528) unless indicated otherwise.

    1. Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. London and New York: Longman. [For background reading. Required]
    2. Biber, Douglas, Susan Conrad and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge, UK: Cambridge U. Press. [For background reading. Optional]
    3. Computer software -- Chinese OCR's, Wenlin and concordancing programs, corpus analysis tools, annotation tools, etc. -- will be explored during the quarter.
    4. Corpora - databases, e-texts, e-dictionaries, audio files, tape-recordings and written, non-digitized materials, etc.
      This course will make full use of online e-text archives and other internet resources in addition to off-line resources and corpora. Spoken and written corpora for class use will include: Mary Erbaugh's recordings and transcripts of the Mandarin Chinese set of Pear stories*, San Duanmu's (1998) Taiwanese Putonghua Corpus: Speech and Transcripts and the CALLHOME Mandarin transcripts, produced by the Linguistic Data Consortium, and Jianhua Bai et al.'s (1998) Across the Straits, published by Cheng & Tsui Company. (* Our thanks to Prof. Mary Erbaugh for her generosity in permitting us to use these materials. And our thanks to Prof. Jianhua Bai for prepublication recordings and files to supplement commercially-available materials.)
    5. References and additional readings will be placed on Reserve at Main Library.
      Check OSU Libraries' Course Reserves (by Prof/TA or Course) for an online list of books placed on Reserve for Chinese 889. (Note: Reserved materials for a given course are listed online for the current quarter only.)

    This is a Chinese corpus linguistics course, combining linguistic analysis with exploring the use of databases and electronic spoken and written corpora for conducting corpus-based linguistic research on the Chinese language.

    "Corpus" ("corpora" in the plural form) is used broadly here to refer to linguistic data that range from small, specialized data sets to the multi-million-word corpora that are associated with the more heavily statistical and computational end of corpus linguistic research. The broad usage here to encompass small data sets is in part due to enabling students to compile such data sets during the quarter for linguistic analysis. It is also in part due to mega-corpora not being readily available for Chinese at this time, with perhaps one notable exception, namely, the (Big5-encoded) 5 million word Sinica Corpus that is searchable online, thanks to Academia Sinica.* This scenario with respect to Chinese is in sharp contrast to the numerous and ever-expanding inventory of (multi-)million-word corpora available to researchers in academia for English that aim to be representative of the various regional varieties of the English language at specific time periods. This course, therefore, uses the more general definition of "corpus" given in the Oxford English Dictionary (OED) Online, namely: "The body of written or spoken material upon which a linguistic analysis is based."

    (* Another online, searchable corpus for Chinese is LIVAC: Synchronous Corpus, from City University of Hong Kong, which contains texts from representative Chinese newspapers and electronic media from six localities in Asia: Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. (English/Big5) -- added 05/20/01)

    The course aims to provide a forum for students to create and use electronically-stored linguistic data in analyzing various problems and issues in Chinese linguistics. Students will gain hands-on experience using software for preparing databases and corpora, and for tagging corpora, as well as using analysis tools, concordancing programs, and other software for conducting a corpus-based approach to investigating the Chinese language, its nature, structure, and use.
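
    As a rough illustration of what a concordancing program produces, the following Python sketch builds keyword-in-context (KWIC) lines for a short Chinese string. It is not drawn from any of the programs used in the course; the function name `kwic`, the `width` parameter, and the sample sentence are all hypothetical. Matching is done on raw character strings, so no word segmentation is assumed.

```python
def kwic(text, keyword, width=5):
    """Return a list of (left, keyword, right) tuples, one per occurrence.

    `width` is the number of characters of context kept on each side.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, keyword, right))
        # Resume one character later so overlapping matches are also found.
        start = text.find(keyword, start + 1)
    return hits


# Hypothetical sample sentence, for illustration only.
sample = "我喜欢吃梨。他也喜欢吃苹果。"

for left, key, right in kwic(sample, "喜欢", width=2):
    print(f"{left} [{key}] {right}")
```

    Because the search is purely string-based, the same sketch works for multi-character words and for single characters, but it cannot tell a word from an accidental character sequence spanning two words.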

    The course is data-oriented and conducted through mini-lectures, class discussions (in class and via the class mailing list), and student presentations of course assignments and projects.

    Assignments will combine a corpus-based approach to the analysis (quantitative and qualitative) of a linguistic problem with using and/or compiling a corpus and investigating it using relevant software for PC's and Mac's. Part of the assignments will consist of finding and collecting resources, creating databases, and compiling corpora (downloading, scanning, digitizing, recording, etc.). There will also be opportunities to evaluate the strengths and limitations of concordancing programs, corpus analysis tools, annotation tools, and other computer software that can be harnessed for analyzing spoken data and written transcripts, computerized lexicons, and e-texts that are encoded in GB, Big5, and Unicode.
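
    Since the e-texts used in the course may be encoded in GB, Big5, or Unicode, a first practical step is often to bring them into a single encoding. A minimal Python sketch follows, assuming the source encoding is already known; the function name `convert_encoding` and the sample strings are hypothetical, and `gb2312` and `big5` are the names of Python's standard codecs for these encodings. Note that this changes only the byte encoding: it does not convert simplified characters to traditional ones or vice versa.

```python
def convert_encoding(raw, source, target="utf-8"):
    """Re-encode the bytes of an e-text from `source` to `target`."""
    # Decode to a Unicode string first, then encode in the target encoding.
    return raw.decode(source).encode(target)


# Hypothetical samples: "corpus linguistics" in simplified and traditional characters.
gb_bytes = "语料库语言学".encode("gb2312")
big5_bytes = "語料庫語言學".encode("big5")

print(convert_encoding(gb_bytes, "gb2312").decode("utf-8"))
print(convert_encoding(big5_bytes, "big5").decode("utf-8"))
```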

    The term paper project is a corpus-based (broadly-construed) analysis of a linguistic topic of the student's choice that makes use of both the resources created and compiled during the course, and the tools that have been tested and used during the quarter. The project may be an extension of some linguistic inquiry chosen in one or more of the homework assignments.

    (Note: As the instructor has not taught this course before, or one similar to it, the content and approach to this course will very much be exploratory. The course will also evolve and change based on students' interests and needs.)


    Students are expected to:
    1. Attend class regularly and participate actively in class discussions and other class activities, including presenting and reporting on homework assignments.
      . A mailing list for the class will also be used for dissemination of information and student-initiated discussions concerning topics brought up in class.

    2. Submit five short homework assignments (about 2 double-spaced pages, plus references and accompanying sound files and other types of digital files as needed).
      . Students who do not have their own web account may submit their assignments on diskette (or 100MB zip disk), or via email as attachment, for the instructor to upload for class-viewing.

    3. For those enrolled for 5 credits: A class presentation and a written version of a short term paper project (about 5 double-spaced pages, plus references, appendices and/or data files as needed).
      . Obtain topic approval from the instructor no later than Week 7.


    This class meets every Wednesday afternoon during the quarter.
    The third hour or portion thereof is devoted to Q&A's on software and other computer-related issues.

    (Note: Class is held in 354 Central Classroom from Week 1 through Week 6,
    and switches to 330 Central Classroom beginning Week 7, Feb. 14.)

    WEEK 1 Introduction
    Jan. 3
  • Introduction and Orientation.
  • Course syllabus, web accounts, mailing list, etc.
  • Corpus linguistics, databases, concordancers, corpus analysis tools, annotation tools, Chinese OCR, etc.

    WEEK 2 Databases, Corpus Linguistics, and Corpus-Based Approach to Linguistic Research
    Jan. 10
  • Background reading: Kennedy, Ch. 1-2; Biber et al., Ch. 1
  • McEnery, Tony and Andrew Wilson (online course on corpus lx.)
  • See also other online tutorials and introduction to corpus linguistics under "References"

  • Guest: Prof. Jianhua Bai, Kenyon College, Ohio.

    WEEK 3 Databases - Word Lists, E-Dictionaries, and Lexical Analyses
    Jan. 17
  • Spreadsheet programs for sorting, and database management programs for sorting, manipulating, searching, and querying databases (including e-dictionaries containing databases in readily accessible format), and some lexical analysis tools
  • Investigations of orthographic, lexicographical, phonological and/or morphological issues (or issues at other linguistic levels).

  • Due: Assignment 1.

    WEEK 4 Concordances and Corpus-Based Studies of the Lexicon
    Jan. 24
  • Background reading: Kennedy, Ch. 4.2.2; Biber et al., Ch. 2
  • Electronic Texts: Guidelines for Evaluation
  • Concordancers and concordances; written corpora and transcripts of spoken corpora, corpus-based study of the lexicon
  • Ball (1997) (concordances and corpora)
  • Rodriguez (concordancers)
  • Lamy and Mortensen (2000) (concordancers)

  • Due: Assignment 2.

    Happy Chinese New Year!

    WEEK 5 Collocations, Synonyms, and Grammatical Analyses
    Jan. 31
  • Background Reading: Kennedy, Ch. 3.13 and 3.5; Biber et al., Ch. 2.6 and 4

  • Due: Assignment 3.

    WEEK 6 Annotations and Spoken Corpora with Transcripts
    Feb. 07
  • Linguistic Data Consortium: Linguistic Annotation
  • CHILDES System: CHAT and CA Transcription
  • Ball (1997) (text encoding and text annotation)
  • McEnery, Tony and Andrew Wilson (section 2 on encoding and annotation)

  • Due: Assignment 4.

    WEEK 7 Spoken Corpora for Pragmatics and Discourse-Level Research
    Feb. 14
  • Background Reading: Kennedy, Ch. 3.4
  • Exploring spoken corpora (with written transcripts) for pragmatics and spoken discourse, including segmental and prosodic phenomena.
  • Transcriber and WaveSurfer [see Note]
  • C889: Language and Gender (Au97)
  • C889: Intonation and Sentence-Final Particles (Au99)
  • Chinese Language and Gender On-line Bibliography

  • Due: Assignment 5.

    Happy Valentine's Day!

    WEEK 8 Spoken and/or Written Corpora - Annotations, Tagging, and Bilingual Texts
    Feb. 21
  • Background Reading: Kennedy, Ch. 2.6 and 4
  • Background Reading: Biber et al., Part IV (Ch. 1 and 2)
  • LDC: Little Grove - The 100 Sentence Corpus
  • Annotation tools - exploration/evaluation of tools for annotating and tagging spoken and/or written corpora for linguistic investigations
  • King and Woolls (parallel texts and concordancing)
  • Monolingual and bilingual corpora ('parallel texts') for different levels of linguistic investigation (lexical, semantic, syntactic, discoursal, etc.).

    WEEK 9 Implications and Applications of Corpus-Based Analysis
    Feb. 28
  • Background readings: Kennedy, Ch. 5; Biber et al., Part III (Ch. 9)
  • Other topics to be announced.

    WEEK 10 Special Activities for Last Week of Class
    March 7
  • Students' presentation of their term paper project (those enrolled for 5 credits)

    WEEK 11 Examination Week
    Mar. 12 Term paper project: Due by Monday, 12 March, 5:00 p.m.

    (Note: Request for extension must be made by the end of Week 10.)

    * An asterisk indicates a book placed on Reserve in Main Library. Library call numbers are included for those sources whose call numbers I have at hand. More references and readings will be added during the quarter.

    1. Armstrong, Susan, et al. 1999. Natural Language Processing Using Very Large Corpora. Dordrecht and Boston: Kluwer Academic. [P98 .N355]
      (Articles include: "Example-Based Sense Tagging of Running Chinese Text" by Xiang Tong, Chang-ning Huang, and Cheng-ming Guo, and "Statistical Augmentation of a Chinese Machine-Readable Dictionary" by Pascale Fung and Dekai Wu.)

    2. Bai, Jianhua, Juyu Sung, and Hesheng Zhang. 1998. Across the Straits: 22 Miniscripts for Developing Advanced Listening Skills in Chinese. Student Book. Boston: Cheng & Tsui Company. (There is both a Traditional Character Edition and a Simplified Character Edition, containing glossaries and listening activities, including pre- and post-listening activities. Audiotapes are available to accompany the textbook. For teachers, there is also a Transcript of the 22 lessons in traditional and simplified characters.) [spoken and written corpora: commercial and pre-publication tape-recordings; commercial, hardcopy transcripts and digitized, prepublication transcripts]

    3. Baker, Mona, Gill Francis, and Elena Tognini-Bonelli (eds). 1993. Text and Technology: In Honour of John Sinclair. Philadelphia: John Benjamins Pub. Co. [P302 .T356]

    4. Ball, Catherine. [Online tutorial, March 1997] Tutorial: Concordances and Corpora. (Georgetown U.)

    5. Barlow, Michael. [Online course syllabus] Ling 330/530 Corpus Linguistics. (Rice U.)

    6. Barlow, Michael, and Suzanne Kemmer (eds.). 2000. Usage-Based Models of Language. Stanford, CA: CSLI Publications, Center for the Study of Language and Information. [P128.M6 U8]

    7. * Biber, Douglas, Susan Conrad and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge, UK: Cambridge U. Press.

    8. Bird, Steven and Jonathan Harrington (eds.). 2001. Speech Communication special issue on Speech Annotation and Corpus Tools, Vol 33, No 1-2 (January 2001). (Abstracts and preprint copies (in PDF format) are available online.)

    9. Botley, Simon Philip, Anthony Mark McEnery, and Andrew Wilson (eds). 2000. Multilingual Corpora in Teaching and Research. Amsterdam and Atlanta, GA: Rodopi. [P98 .M84]
      (Deals with parallel texts, multilingual corpora, and multilingual parallel concordancers; articles include: "Parallel alignment in English and Chinese" by Tony McEnery, Scott Piao, and Xin Xu.)

    10. Brew, Chris and Marc Moens. 2000. Data-Intensive Linguistics. A book-length study by Chris Brew and Marc Moens, HCRC Language Technology Group, The University of Edinburgh, with three main aims: "familiarity with tools and techniques for handling text corpora, knowledge of the characteristics of some of the available corpora, and a secure grasp of the fundamentals of statistical natural language processing." (Since launching that website in March, 2000, Chris Brew has joined the computational linguistics faculty in the Department of Linguistics at The Ohio State University.)

    11. Burnard, Lou, and Tony McEnery (eds.). 2000. Rethinking Language Pedagogy From a Corpus Perspective: Papers from the Third International Conference on Teaching and Language Corpora. (International Conference on Teaching and Language Corpora (3rd: 1998 : Keble College).) New York: P. Lang. [P53.28 .I585 1998] (Added 04/14/03)

    12. Chan, Marjorie K.M. 2002. "Concordancers and concordances: Tools for Chinese language teaching and research." Journal of the Chinese Language Teachers Association 37.2 (2002):1-58.
      (Note: A pre-final version of the paper was inadvertently published instead of the final, revised version. The revised version, with color illustrations, can be downloaded here in PDF format: Chan_JCLTA-2002.pdf (1.03 MB). -- Added 08/07/02)

    13. Chen, Jiang. 2000. Parallel Text Mining for Cross-Language Information Retrieval Using a Statistical Translation Model. Thesis, Maître en Informatique, Université de Montréal.

    14. Christensen, Matthew Bruce. 1990. Variation in Spoken and Written Mandarin Narrative Discourse. Ph.D. dissertation, The Ohio State University. [spoken and written corpora - tape-recordings, transcripts of spoken and written narratives in the appendices]

    15. Christensen, Matthew B. 2000. "Anaphoric reference in spoken and written Chinese narrative discourse." Journal of Chinese Linguistics 28.2:303-336.

    16. Cole, Ronald A. (Editor in Chief). [Online version] Survey of the State of the Art in Human Language Technology. Thirteen chapters, including Chapter 12, on "Language Resources", edited by Ron Cole, containing sections on Written Language Corpora, Spoken Language Corpora (including reference to PRC's Chinese National Speech Corpus), and Lexicons. See also Sections by Content and Author. Cambridge U. Press. 1996.

    17. Duanmu, San. 1998. Taiwanese Putonghua Speech and Transcripts. Produced by the Linguistic Data Consortium. [LDC catalog no. LDC98S72]

    18. * Edwards, Jane A. and Martin D. Lampert. 1993. Talking Data: Transcription and Coding in Discourse Research. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
      (There are three parts: Part I. Transcription. Part II. Coding. Part III. Resources (Chapter 10. Survey of Electronic Corpora and Related Resources for Language Researchers, pp.263-306).)

    19. Emerson, Tom. 2001. "Segmentation of Chinese Text". Multilingual Computing and Technology #38 Volume 12 Issue 2.
      (Various approaches to the problems of separating the components of a sentence; part of Multilingual Computing: Feature Articles. Link added 10/13/01.)

    20. Erbaugh, Mary S. 1990. "Mandarin oral narratives compared with English: The pear/guava stories." Journal of the Chinese Language Teachers Association 25.2:21-42. [Main: PL1001A4C5] [spoken corpora and transcripts: tape-recordings and digitized transcripts]

    21. Fung, Pascale. 1995. "Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus." In: Proceedings of the Third Workshop on Very Large Corpora.

    22. Garside, Roger, Geoffrey Leech, Tony McEnery (eds.). 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London & New York: Longman. (Added 12/12/01)

    23. International Journal of Corpus Linguistics. (Includes tables of content and abstracts.)

    24. * Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. London and New York: Longman.

    25. Kettemann, Bernhard, and Georg Marko. (eds.). 2002. Teaching and Learning by Doing Corpus Analysis: Proceedings of the Fourth International Conference on Teaching and Language Corpora, Graz 19-24 July, 2000. (International Conference on Teaching and Language Corpora (4th : 2000 : Graz, Austria).) Amsterdam; New York, NY: Rodopi. [P53.28 .I59 2000 v.2cdrom, P53.28 .I59 2000 v.1] (Added 04/14/03)

    26. King, Philip, and David Woolls. Creating and Using a Multilingual Parallel Concordancer.
      (This is a report on the Multilingual Parallel Concordancing project, supported by the European Union.)

    27. Lamy, Marie-Noelle, and Hans Jørgen Klarskov Mortensen. 2000. Using Concordance Programs in the Modern Foreign Languages Classroom.
      (This is a comprehensive, 50-page length, module prepared by the Information and Communications Technology for Language Teachers (ICT4LT) to introduce language teachers to the use of concordances and concordance programs in the modern foreign languages classroom. This is ICT4LT Module 2.4 in the series. See the full Table of Contents.) (Thanks to Rob Watt.)

    28. Lawler, John M. and Helen Aristar Dry (eds.) Glossary: From Using Computers in Linguistics: A Practical Guide. Routledge. 1998.

    29. Lee, Ok Joo. 2000. The Pragmatics and Intonation of Ma-Particle Questions in Mandarin. M.A. thesis, Ohio State University. [spoken corpora with prepared sentences for utterance-elicitation in constructed contexts]

    30. MacWhinney, Brian. 2000. The CHILDES Project: Tools for Analyzing Talk. Volume I: Transcription Format and Programs. Volume II: The Database. Third Edition. NJ: Lawrence Erlbaum Associates, Inc., Publishers.
      (Includes a CD ROM containing the tools for Windows and Macs, PDF files, and database. These are also available at the CHILDES System website, including the latest version of the CLAN program for linguistic analysis.)

    31. * McEnery, Tony and Andrew Wilson. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.
      (See their online course below, based on their textbook. The second edition comes out in March 2001.)

    32. McEnery, Tony and Andrew Wilson. [Online course] Corpus Linguistics.
      (There are four sections: Section 1. Early Corpus Linguistics and the Chomskyan Revolution. Section 2: What is a Corpus and What is in it? Section 3. Quantitative Data. Section 4: The Use of Corpora in Language Studies.) See also Corpus Linguistics (October 2000), their online tutorial that is ICT4LT Module 3.4 of the series prepared by the Information and Communications Technology for Language Teachers (ICT4LT).

    33. McEnery, Tony, Scott Piao, and Xu Xin. 2000. "Parallel alignment in English and Chinese." In: Multilingual Corpora in Teaching and Research, edited by Simon Philip Botley, Anthony Mark McEnery, and Andrew Wilson. Amsterdam & Atlanta, GA: Rodopi. Pp. 177-191. (Added 12/12/01)

    34. Mickel, Stanley L. 1999. Dictionary for Readers of Modern Chinese Prose: Your Guide to the 250 Key Grammatical Markers in Chinese. New Haven: Far Eastern Publications, Yale U.

    35. Oakes, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press. [P98 .O35 1998]

    36. Packard, Jerome L. 2000. The Morphology of Chinese: A Linguistic and Cognitive Approach. Cambridge, UK: Cambridge University Press.

    37. Peng, Shu-hui, Marjorie K.M. Chan, Chiu-yu Tseng, Tsan Huang, Ok Joo Lee, and Mary E. Beckman. [Volume under review] "A Pan-Mandarin ToBI." In: Prosodic Typology and Transcription: A Unified Approach (Collection of papers from the Fourteenth International Congress of Phonetic Sciences 1999 satellite workshop on "Intonation: Models and ToBI Labelling," San Francisco, CA. 1-7 August 1999.)
      (ToBI stands for "Tone and Break Indices". A preliminary Pan-Mandarin ToBI webpage was posted as part of my Autumn 1999 seminar on "Intonation and Sentence-Final Particles.")

    38. Reich, Sabine. [Online course outline, June 1998] Introduction to Corpus Linguistics: Course Outline. (Universität zu Köln, Germany)

    39. * Renouf, Antoinette (ed.) 1998. Explorations in Corpus Linguistics. Amsterdam and Atlanta, GA: Rodopi.
      (There are three sections: Section 1 on corpus creation, Section 2 on corpus analysis, and Section 3 on 'corpus linguistic results' (re taggers and other software).)

    40. Rodriguez, Maria Rosario Caballero. Using a Concordancer in Literary Studies.
      (The paper introduces concordancers, their use, and their potential for developing new research techniques in literary studies. It examines some free and commercial Windows-based concordancing programs (viz., MonoConc, WinConcord, and ConcApp).)

    41. Rundell, Michael. 1996. The Corpus of the Future, and the Future of the Corpus.
      (A talk presented at Exeter that is part of a special conference on "New Trends in Reference Science.")

    42. Stevens, Vance. 1995. "Concordancing with Language Learners: Why? When? What?" CAELL Journal 6.2:2-10.

    43. * Svartvik, Jan. 1992. Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm, 4-8 August 1991. Berlin and New York: Mouton de Gruyter.
      (There are four main sections: 1. Historical conspectus. 2. Theoretical issues. 3. Corpus design and development. 4. Explorations and applications of corpora. Contributors to the volume: Jan Svartvik, W. Nelson Francis, Charles J. Fillmore, M.A.K. Halliday, Wallace Chafe, Geoffrey Leech, Douglas Biber, Graeme Kennedy, Geoffrey Sampson, and many more.)

    44. Taylor, Macey. Value of Concordancing as a Teaching Aid. (A short outline of basic features and sample activities.)

    45. Thomas, Jenny, and Mick Short (eds). 1996. Using Corpora for Language Research: Studies in Honour of Geoffrey Leech. London, England, and New York: Longman [P128.C68 U85]

    46. Wang, Lixun. 2001. "Exploring parallel concordancing in English and Chinese." Language Learning & Technology 5.3 (September 2001):174-184.

    47. Wang, Ren-hua, Deyu Xia, Jinfu Ni, and Bicheng Liu. 1996. "USTC95 --- A Putonghua Corpus." (PDF). In: Volume 3 of ICSLP 96: Fourth International Conference on Spoken Language Processing.
      (The article introduces a large Putonghua corpus, consisting of four major sub-corpora corresponding to isolated syllables, multi-syllable words, sentences, and telephone speech.) (Added 2/26/02)

    48. Wichmann, Anne, Tony McEnery, and Steven Fligelstone. 1997. Teaching and Language Corpora. London and New York: Longman. [P53.28 .T4]

    49. Wong, Peggy W.Y., Marjorie K.M. Chan, and Mary E. Beckman. [Volume under review] "ToBI framework annotation conventions for Cantonese." In Prosodic Typology and Transcription: A Unified Approach (Collection of papers from the Fourteenth International Congress of Phonetic Sciences 1999 satellite workshop on "Intonation: Models and ToBI Labelling," San Francisco, CA. 1-7 August 1999.)

    50. Wu, Dekai. 1994. "Aligning a parallel English-Chinese corpus statistically with lexical criteria." In: Proceedings of the 32nd Annual Conference of the Association for Computational Linguistics, 80--87, Las Cruces, New Mexico.

    51. Yip, Po-ching. 2000. The Chinese Lexicon: A Comprehensive Survey. London and New York: Routledge.

    52. Zhou, Qiang and Shiwen Yu. 1997. "Annotating the Contemporary Chinese Corpus." International Journal of Corpus Linguistics 2.2. (Abstract) (The article (which I have not yet seen) reports on a large-scale, fifty-million-word Chinese National Corpus, an annotated corpus for Chinese grammatical research and for developing a Chinese Corpus Multilevel Processing (CCMP) system.)

      1. Linguistic Data Consortium: Linguistic Annotation. Links and info on tools and formats for creating and managing linguistic annotations; part of U. of Pennsylvania's Linguistic Data Consortium website.
        Note: The Linguistic Data Consortium (LDC) is "an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes." (OSU has been an LDC member during various years -- see LDC Online for online resources for site members, and check their LDC Catalog for their corpora. Under construction is their useful information on Creating Data Resources, in addition to their LDC Related Sites webpage of links.) See also TalkBank, an interdisciplinary research project hosted by Carnegie Mellon University (site of the CHILDES System) and the University of Pennsylvania (site of the Linguistic Data Consortium). TalkBank aims "to foster fundamental research in the study of human and animal communication" and will do so by providing "standards and tools for creating, searching, and publishing primary materials via networked computers." See also the Open Language Archives Community, an international project to construct the infrastructure for language archives to be linked by community-specific metadata and centralized union catalogs.

        Note: A freely-downloadable program from LDC is Transcriber, a tool developed by Claude Barras for segmenting, labelling, and transcribing speech in Unix, which was then compiled for Windows NT at LDC. The program runs under Windows 9x/NT/2000, Macs, and Unix systems, and "user-chosen encoding is used for transcriptions input/output; most current encodings (ISO-8859-*, EUC-JP...) and Unicode (UTF-8) can be used." Unicode enables inputting of multilingual scripts in the same text (and hence also supports encoding of Chinese, Japanese, and Korean). For Chinese, GB and Big5 are supported. Note that Transcriber 1.4.2 may require adjustment of screen resolution for proper display. (For example, 800x600 resolution does not display the entire Transcriber window on a PC.) As stated in the article, "Transcriber: a Free Tool for Segmenting, Labeling and Transcribing Speech," by Claude Barras, Edouard Geoffrois, Zhibiao Wu and Mark Liberman (1998), the hope is "that such a portable, widely available and flexible tool will benefit the whole community and make it easier to develop and share corpora." (See other Transcriber-related publications at LDC's website for Transcriber.) New features in the next release of Transcriber (viz., version 1.4.3) include the ability to launch WaveSurfer on a selected portion of the signal. WaveSurfer, developed by Kåre Sjölander and Jonas Beskow, is a multi-platform freeware program for Win9x/NT/2000/Linux/Mac/Unix that can handle many file extensions, long files (unlimited file size), and mono, stereo, and 4-channel recordings, and that can display the time axis, waveforms, spectrograms, pitch tracks, etc.

      2. Corpus Linguistics. *Comprehensive* webpage of links and information, maintained by Michael Barlow. Text and corpora divided by languages (including Stanford U's Modern Chinese Corpus (restricted access)), links to info on free and commercially-available text analysis software and taggers, including a downloadable demo version of MonoConc Pro 2.0, his concordancing program for Windows 3.1 and 9x/2000, and his ParaConc, a bilingual/multilingual concordancer for (32-bit) Windows (95 and up) and Macs (the commercial version will be available later in the year). Not linked is his Mac freeware, MonoConc (click here to download the .hqx file directly -- sample screenshots). There is a link to various tools that can be downloaded from Mike Scott's Web, including a demo version of his WordSmith Tools 3.0, which contains his Concord program, a lexical analysis tool. MonoConc Pro 2.0 (see MonoConc: Using for Chinese, Arabic, etc.), ParaConc 1.0, and WordSmith Tools 3.0 can be used for concordancing of Chinese e-texts in Windows 9x (and Windows 2000 but the display is better under Windows 9x). The website also has MicroConcord, a freely-downloadable concordancer for DOS developed by Mike Scott and Tim Johns.

        Not included among the links in the Corpus Linguistics webpage are the following concordancers for Windows:

            (1) ConcApp Concordancing Programs. ConcApp Concordance Browser and Editor for Windows 95/98/NT (and also works in Windows 2000) for English, Chinese, and Japanese concordancing, and ConcApp Concordance Browser for Windows 3.1/95, are free programs developed by Chris Greaves, downloadable from Hong Kong Polytechnic University's Virtual Library Centre (VLC).

            (2) Concordance, a commercial program developed by Robert J.C. Watt (University of Dundee) for Windows 95/98/Me/NT/2000/XP, which can generate web concordances, in addition to making both full concordances and fast concordances (i.e., KeyWords in Context (KWIC) concordances) from multilingual texts. A fully-functional trial version is downloadable from his site. (See also Vance Stevens's Review of Concordance 2.0 (June 2001), part of The CALICO Review at the Computer Assisted Language Instruction Consortium (CALICO). Added 06/28/01)
            While Concordance 2.0 was primarily for (single-byte) Latin scripts, Concordance 3.0, available on 19 January 2002, can handle East Asian double-byte e-texts in Windows 2000/XP. See, for example, my Instructions for Concordancing East Asian E-Texts using Concordance. (Added 12/05/01)
            See also my PowerPoint presentation (.pps file, 2 MB, added 01/19/02), "Concordancers, Concordances, and Chinese Language Teaching," based on my presentation delivered at the 2001 Annual Meeting of the Chinese Language Teachers Association, 15-18 November 2001.
            Also see my "Concordancers and concordances: Tools for Chinese language teaching and research", (PDF, revised version, 1.03 MB), Journal of the Chinese Language Teachers Association 37.2 (2002):1-58. (This revised version of the paper has color illustrations. -- Added 08/07/02.)

            (3) Simple Concordance Program (SCP). SCP 4.0.5 (last updated on 21 December 2001) is a free, 32-bit concordancing program developed by Alan Reed for Windows 95/98/NT4/ME/2000/XP. See his webpage, A Quick Tour of SCP 4.05 for the Impatient. In addition to various searching capabilities, concordances can be saved in HTML format for viewing on the web. (Added 1/24/02)

            (4) KWIC Concordance for Windows, a freeware program for Windows 95/98/NT4.0/2000/XP that makes word frequency lists, concordances and collocation tables, and offers the capability of handling markup schemes, such as COCOA, SGML, the Helsinki corpus, the Penn-Helsinki Parsed Corpus of Middle English (Phase 1) (Phase 2), etc. (Added 08/08/02)

            (5) KWiCFinder, a "professional web search agent optimized for multilingual searches, which builds on AltaVista's support for complex Boolean searches and enlists XML technology to provide a complete range of report options in five different languages to display the Key Words of your search in Context." KWiCFinder Report Formats include XML format, Keywords-in-Context format, paragraph format, and table format; requires Windows 95/98/Me and Internet Explorer 5.0 or greater. Pre-release beta version (19 December 2001) of KWiCFinder is downloadable from MiniAPPolis.com Downloads. (Added 1/19/02)

            (6) WebKWiC (Web Key Word in Context), "complements Google.com's popular search engine to simplify and accelerate the task of online research, runs in Windows with Internet Explorer 5.0 or greater. Cross-platform and cross-browser versions may be implemented when Netscape and Macintosh-compatible browsers provide appropriate functionality." Downloadable from MiniAPPolis.com Downloads. (See KWiCFinder & WebKWiC Compared.) (Added 1/19/02)

            (7) Concordancer for Windows (was WinConcord), version 3.0 is downloadable freeware for Windows 95, developed by Zdenek Martinek (University of West Bohemia, Pilsen, Czech Republic), in close collaboration with Les Siegrist (Technische Universität Darmstadt, Germany). (Added 10/18/01, URL and info updated 09/03/02)

            (8) WordExpert: Translators Concordancer is Myteam's commercial Windows concordancer developed for translators, teachers, and students. (Added 1/24/02)

        Also see Concordancing and Concordancing Software.

      3. SIL: Software for Doing Field Linguistics - Text Analysis. Summer Institute of Linguistics' links to concordancers and other text analysis tools for corpus-based research using DOS/Windows/Mac/Unix, including SIL's Conc, a freeware program for concordancing that is downloadable from SIL's website. For a general tutorial on using Conc, see Ed Beach's Getting Started with Conc: Making Concordances, Indices, and Word Lists. For instructions specifically for using Conc to make concordances of Chinese e-texts, see Fabrizio Pregadio's webpage, Making Electronic Concordances of Chinese Texts (URL with instructions thanks to Olli Salmi, 1/22/02).

        Other text analysis software for non-DOS/non-Windows platforms includes:

            (1) CLOC Collocation Package. Alan Reed's free software for Unix systems (Sun Solaris) that can make concordances of words and phrases, and can produce word lists and collocations. (Added 1/24/02)

      4. Internet Information Resources for Corpus Studies. Website maintained by Haruo Nishinoh; includes Text Analyzing Software and Corpus Collections.

      5. Linguist List: Texts & Corpora. The Linguist List's annotated links to online texts and corpora. (Added 08/14/02)

      6. Resources: Corpus Linguistics. Many links, maintained by Federico Zanettin.

      7. Links for Corpus Linguistics and Natural Language Processing. Useful links maintained by Serge Sharoff.

      8. Standards -- Electronic Text Center. University of Virginia's Electronic Text Center provides an introduction to different markup languages: SGML, XML, HTML, and TEI, including guidelines for Electronic Text Encoding and Interchange. The site also has information on "Special Characters and Language Codes" (viz., ISO Special Characters and ISO 639 Language Codes).

      9. Useful Links for Linguists and Teachers/Learners of English. Website maintained by Yasumasa Someya, with annotated links to corpus linguistics tools, corpora, and other corpus resources. (Eng/ShiftJIS)

      10. BNC: English Language Corpora and Corpus Resources. Site lists centers and projects from which language corpora (chiefly English language) are readily available, and includes links to resources on corpus linguistics; maintained at the British National Corpus (BNC) website.

        See also the freely accessible, POS-tagged American English word corpus that corresponds to the British National Corpus (BNC), namely, the TIME MAGAZINE, 1923-2007 (100+ million words) at Brigham Young University. Users can obtain the frequency of particular words, phrases, and substrings (prefixes, suffixes, roots) in each decade from the 1920s to the 2000s; they can limit the results by frequency in any set of years or decades; and they can study changes in syntax, collocations, semantic shifts, etc.

      11. The CHILDES System. The Child Language Data Exchange System (CHILDES) site has tools and transcriptions for analysis of child language data for numerous languages, including Cantonese and Mandarin varieties of Chinese.

      12. Corpus Research Group: Part-of-Speech Tagging. Automatic tagging of short English text sent via email. Also available is QTAG, a freely downloadable, language-independent part-of-speech tagger that is written in Java and can be installed on your own machine. You can make use of several resource files to tag texts in different languages, and you can also train the tagger if you have a pre-tagged text in a different language for which no resource file is available yet. (Part of the Software Available from the Corpus Research Group.)

      13. Center for Electronic Texts in the Humanities. Resources and info from Rutgers U. on e-texts, including a webpage on Electronic Texts: Guidelines for Evaluation (CETH Workshop Series 1996: Electronic Resources for the Humanities). (A bit dated but informative overall.)

      14. A Corpus Worker's Toolkit (ACWT). This is Hongyin Tao's website containing a collection of NoteTab clips, Perl scripts and other utilities for Chinese and English text processing.

      15. Consortium for Lexical Research (CLR). Their Consortium for Lexical Research: Catalog lists the downloadable resources (via FTP) in the CLR archives from the early 1990's.

      16. CONCORD, A Concordance Creation Tool. This downloadable tool for DOS and Mac operating systems was developed by Christian Wittern in the context of the Zen KnowledgeBase project. The program automatically generates a concordance from a Chinese text file. The source file can be encoded either in JIS or in Big5.

      17. Tim Johns Data-driven Learning Page (a.k.a. "Classroom Concordancing"), part of Tim Johns Home Page. Useful info and links. Included there is info on the Windows 3.x/95 software, Multiconcord: the Lingua Multilingual Parallel Concordancer for Windows, as part of the Lingua project at University of Birmingham. See also Tim Johns' student's website, Wang Lixun's Home Page, a new website with links to his ECLEPT (English and Chinese Learning Employing Parallel Texts), his online article, "Parallel Concordancing in English and Chinese and its Pedagogic Application," his online English-Chinese Parallel Web Concordancer, and some downloadable English and Chinese parallel texts. Among the parallel text corpora available from his English-Chinese Parallel Corpus website is Jane Austen's Pride and Prejudice (in English and Chinese). (The parallel texts have been available for several years at Ocrat's website for Jane Austen's 1813 novel, Pride and Prejudice, in original English with Chinese (GB) translation side-by-side, and paragraph for paragraph, together with links and online help for learners of Chinese.)

      18. Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus). The Sinica Corpus is developed and maintained by the Institute of Information Science and the CKIP group at Academia Sinica, Taipei, Taiwan.

      19. BabelConc 汉英平行语料库. Peking University's Babel Chinese-English Parallel Corpus.

      20. CCL 语料库检索系统 (网络版) (北京大学汉语语言学研究中心). Online searchable database of modern and historical e-texts at Peking University's Center for Chinese Linguistics. (Thanks to Yang Jia for the original link, and to Chen Litong for the updated link.)

      21. Beijing Language and Culture University Corpus: 语料库智能检索系统 (online search engine to search People's Daily, fiction, etc.)

      22. Leiden Weibo Corpus (LWC). This online, searchable corpus of Sina Weibo (新浪微博 <www.weibo.com>) messages, collected in January 2012, is POS-tagged and includes a searchable field for gender as well as region of origin of the bloggers.

      23. Wenzhou Spoken Corpus 温州口语语言资料库. This online, searchable corpus of transcribed spoken data of Wenzhou Chinese, released in January 2006, was developed by Jingxia Lin and John Newman, Department of Linguistics, University of Alberta.

      24. Xiamen University Corpora (online search engine to access the 语料库登陆页面).

      25. A Query to Chinese Corpora. This is University of Leeds' online search engine for making concordances from three large corpora: Chinese Internet, LDC Gigaword, and Xinhua (thanks to HoJung Choi).

      26. English-Chinese Parallel Concordancer. Lancaster University's online search engine for making concordances from their English-Chinese corpus, The Babel English-Chinese Parallel Corpus (thanks to HoJung Choi).

        Other corpora at Lancaster University: Lancaster Corpus of Mandarin Chinese (LCMC) (online search engine) -- The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) -- The PD Corpus Web Concordancer (online search engine that searches The People's Daily Corpus (month of January 1998), a one-million-word corpus for Mandarin Chinese, released by the Institute of Computational Linguistics at Peking University) -- PH Corpus Web Concordancer (online search engine that searches the PH Corpus, compiled by Guo Jin, consisting of about 2.4 million words of Mandarin Chinese from newswire text published by Xinhua News Agency in 1990-1991).

      27. Corpus4U. Numerous links (log-in required) to online resources for corpus linguistics in general and for Chinese corpus linguistics in particular.

      28. Classics in English and Chinese. This site has a number of parallel texts from English classics to Chinese, and from Chinese to English (with GB-encoding for Chinese). There are also some links to parallel texts at other websites. (Added 05/20/01, thanks to Fu-Dong Chiou.)

      29. Parallel Texts Viewer and Concordancer. Hong Kong Polytechnic University's Virtual Library Centre (VLC) has an online parallel text concordancer for English and Chinese bilingual corpora.

      30. Online Delivery of Prototype: A Corpus of Multilingual Parallel Texts (Chinese, English, French). Hong Kong Polytechnic University's Department of English. Together with a downloadable Parallel Text Reader, the website states that, "focusing on literary and informative texts, the total collection of the corpus consists of 40 files in each of the three languages (61,000 characters in Chinese, 38,000 words in English, 35,000 words in French)." The aim is "to provide a useful web-based learning tool for language students to read the parallel texts aligned and hyper-linked at the sentence level." Selections include online parallel texts of The Little Prince, magazine articles, etc., in French-English, English-Chinese, and Chinese-French.

      31. Corpus: Chinese Linguistics, USC. Part of USC's Chinese Linguistics website, with links to Chinese Language Corpus of texts, USC; Chinese Language Corpus of texts, Academia Sinica; Chinese Language Corpus of sentences and words, Academia Sinica; and Chinese Linguistic and Literary Knowledge Net, National Science Council Digital Library and Museum. (Big5)

      32. The Holy Bible (Multiple Versions). English-Chinese parallel texts online, Chinese Union Version, King James Version, and Bible in Basic English. (Eng/Big5)

      33. Chinese Morphological Analyzer (CMA). Quote from the website: "Basis Technology's Chinese Morphological Analyzer (CMA) is an accurate, portable, high-performance engine that incorporates comprehensive Chinese dictionaries for segmenting Chinese text. This robust software library is portable and capable of running on platforms ranging from low-spec 386 PCs to large-scale multi-CPU web servers processing hundreds of documents per minute." Online demo available for "Simplified Chinese" and "Traditional Chinese" texts.

      34. Chinese Annotation Tool. Erik Petersen's annotation tool has three main capabilities: (1) it can segment words in GB-encoded Chinese texts (on-line or off-line), adding spaces between words; (2) it can add dictionary entries (using dictionary definitions drawn from Paul Denisowski's CEDICT: Chinese-English Dictionary); and (3) it can convert the segmented text to Pinyin romanization. The Chinese Annotation Tool was originally located at Erik Petersen's On-Line Chinese Tools website, mirrored at Zhang Zheng-sheng's website at San Diego State University.

      35. The Complete Guide to Chinese Language Computing. Lots of useful resources.

      36. Resources on Chinese at LDC. Resources at U. of Pennsylvania's Linguistic Data Consortium for Chinese computing, including word lists, a frequency dictionary, Zhibiao Wu's Chinese Segmenter, a 288-line Perl script, etc.

      37. Wenlin Institute's Wenlin 3.x software for learning Chinese. Can segment Chinese non-spaced text into (polysyllabic) words, making use of the expanded version of John DeFrancis' ABC Dictionary (with over 10,000 characters in some 200,000 entries) in the program. Wenlin 3.x also supports the Unicode 3.1 standard, which had 94,140 encoded characters. (Reference is to Wenlin 3.0 through 3.22.)

      38. The Chinese Treebank Project. The site houses Little Grove: The 100 Sentence Corpus, a mini-corpus compiled and tagged by Mary Ellen Okurowski, Ron Dolan, and John Kovarik in 1998. Includes guidelines for segmentation and tagging, and illustration using the mini-corpus. For the full, 4k-sentence corpus together with other information, see the Linguistic Data Consortium's website on the Chinese Penn Treebank Project and the Chinese Penn Treebank Project Corpus: Final Release (December 2000). This is a GB-encoded and syntactically-bracketed corpus of 325 articles from Xinhua newswire between 1994 and 1998, and contains about 100K words, 4,185 sentences, and 325 data files.

      39. MC's Word Lists and Online Glossaries / Dictionaries (now part of <ChinaLinks.osu.edu>). My links to word lists and downloadable -- as well as online, searchable -- dictionaries for Chinese (and Japanese).

      40. Marjorie Chan's ChinaLinks (Homepage with Table of Contents to 4 satellite pages and their contents)
        1. ChinaLinks1: General Resources for Chinese Studies
          Links to search engines (including the DEALL Search Engine), publishers, Asian studies associations and journals (with indices), netnews, e-magazines, e-texts and e-text archives, etc.
        2. ChinaLinks2: Chinese Language Software & AV Programs
          Links to downloadable CJK fonts and encoders/decoders, IPA and Pinyin fonts, Unicode fonts, etc., and online radio and TV programs.
        3. ChinaLinks3: Chinese Language and Linguistics
          Links to Chinese dialectology and online searchable dialect databases; Chinese linguistics associations and journals (with tables of contents/indices); information and links on Unicode; online, searchable Chinese corpora from Academia Sinica and from Chinese University of Hong Kong, etc. Included is a link to A Hong Kong Cantonese Child Language Corpus (CANCORP), Thomas Lee's eight-subject, longitudinal project, with description of the corpus, depositories, etc. There are two downloadable versions of the corpus: a Chinese(-only) version and a CHAT version (Chinese on one tier and romanization on another) downloadable at the CHILDES System website. Chinese display requires MS Chinese Win95/98 with Hong Kong government's Hong Kong Supplementary Character Set (HKSCS) support for viewing the Cantonese vernacular characters used in HK that are not in GB or Big5.
        4. ChinaLinks4: General Linguistics and Internet Resources
          Links to linguistics associations and journals (with tables of contents and indices, etc.), online dictionaries and references, transcription and annotation tools for spoken and written corpora, (commercial and freely-downloadable) software for speech analysis, web-authoring guides and tools, other software, etc.

      41. EALC 222: Seminar in Corpus Linguistics. Hongyin Tao's Winter 2002 corpus linguistics course. (Added 03/22/02)
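
For readers who want to see what the concordancers listed above produce, the following is a minimal sketch of a key-word-in-context (KWIC) concordance in Python. It is not the method used by any of the programs named in this list. It assumes the e-text has already been decoded into a Unicode string and is unspaced Chinese, so context is counted in characters rather than words. The function name `kwic`, the `width` parameter, and the sample sentence are illustrative assumptions.

```python
import re

# Full-width (ideographic) space, used so that short left contexts
# still line up with longer ones in a monospaced CJK font.
PAD = "\u3000"


def kwic(text, keyword, width=10):
    """Return one KWIC line per non-overlapping occurrence of keyword.

    Assumption: text is unspaced Chinese, so all whitespace (including
    line breaks in the e-text) is removed before searching.
    """
    flat = re.sub(r"\s+", "", text)
    lines = []
    for match in re.finditer(re.escape(keyword), flat):
        # Up to `width` characters on each side of the keyword.
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        lines.append(left.rjust(width, PAD) + " [" + keyword + "] " + right)
    return lines


if __name__ == "__main__":
    sample = "我喜欢语言学。语言学很有意思。"
    for line in kwic(sample, "语言学", width=3):
        print(line)
```

Counting context in characters sidesteps word segmentation, which is a separate step for Chinese; a segmenter such as those listed above would be needed before producing word-based frequency lists or collocations.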



      1. Unicode Home Page (for info on the current version of the Unicode standard, code charts, etc.)

      2. Library Resources:
        1. OSU Libraries (Homepage):
          Catalogs (OSU Libraries' online catalog (OSCAR) and other library catalogs).
          See also some tips in the WWW resources section of my Chinese 680 online course page.
          Chinese Collection (Eng./Big5)
          Linguistics and Language Behavior Abstracts (LLBA) [WWW (OSU Columbus Only)]
        2. OhioLINK: MLA International Bibliography
          Part of OhioLINK's online Research Databases.

      3. General Linguistics Resources:
        1. Linguistics and Language Indexes, Abstracts, Bibs, and TOCs (links to resources compiled by U. of Houston Libraries)
        2. Linguist List: 2000 Tables of Contents (TOC) (for some linguistics journals, and links to back issues as well)

      4. Online Indices of Some Chinese Linguistics Journals:
        1. Journal of Chinese Linguistics: Index of Articles (1973- ).
        2. Journal of the Chinese Language Teachers Association: Authors and Topics Indices (1966- ).
        3. Yuyan Yanjiu: Table of Contents Index (Eng./GB) (1981- )
          Part of Wenze Hu and Hongyin Tao's Chinese Linguistics Page.
        4. Zhongguo Yuwen: Table of Contents Index (GB for now, Eng. under construction) (1995- ).

      5. Miscellaneous Chinese Linguistics Resources:
        1. MC: On-line Dissertation Abstracts (my links to online resources)
        2. MC: Chinese 694: Group Studies -- Topic: Chinese Computing (my Spring Quarter 2000 course).

