Professor Marjorie K.M. Chan
Dept. of E. Asian Lang. & Lit.
The Ohio State University
Columbus, OH 43210 U.S.A.
COURSE & CREDITS:
Chinese 889. Seminar in Chinese Linguistics:
Databases and Corpora for Chinese Linguistic Research.
Call Number: 04336-7 G 3-5 credit hours*
(*Repeatable to a maximum of 20 credit hours.)
TIME & PLACE:
W 3:30-6:18 p.m.
354 Central Classroom Building [switching to 330 CC beginning Feb. 14]
with computers, multimedia, and internet connection
OFFICE HOURS:
R 12:00 - 2:00 p.m., or by appointment
Office: 366 Cunz Hall
Tel: 292-3619 (292-5816 for messages, 292-3225 for faxes)
E-mail: chan.9 @osu.edu (close the gap)
C889 COURSE PAGE:
MC's Home Page:
MC's ChinaLinks:
Chinese 889 Pages:
TEXTBOOKS
Available from SBX (1806 N. High Street. 291-9528) unless indicated otherwise.
This course will make full use of online e-text archives and other internet resources in addition
to off-line resources and corpora.
Spoken and written corpora for class use will include: Mary Erbaugh's recordings and transcripts of the Mandarin Chinese set of Pear stories*;
San Duanmu's (1998) Taiwanese Putonghua Corpus: Speech and Transcripts and the CALLHOME Mandarin transcripts,
produced by the Linguistic Data Consortium; and Jianhua Bai et al.'s (1998) Across the Straits, published by
Cheng & Tsui Company.
(* Our thanks to Prof. Mary Erbaugh for her generosity in permitting us to use these materials. And our thanks
to Prof. Jianhua Bai for prepublication recordings and files to supplement commercially-available materials.)
Check OSU Libraries' Course Reserves (by Prof/TA or Course) for an
online list of books placed on Reserve for Chinese 889. (Note: Reserved materials for a given course are listed
online for the current quarter only.)
COURSE DESCRIPTION
This is a Chinese corpus linguistics course that combines linguistic analysis with exploring the use of databases and electronic spoken and written corpora for corpus-based linguistic research on the Chinese language. "Corpus" (plural "corpora") is used broadly here to refer to linguistic data ranging from small, specialized data sets to the multi-million-word corpora associated with the more heavily statistical and computational end of corpus linguistic research. The term is extended to small data sets partly so that students can compile such data sets during the quarter for linguistic analysis, and partly because mega-corpora are not readily available for Chinese at this time, with perhaps one notable exception: the (Big5-encoded) 5-million-word Sinica Corpus, which is searchable online thanks to Academia Sinica.* The situation for Chinese contrasts sharply with the numerous and ever-expanding inventory of (multi-)million-word corpora available to academic researchers for English, which aim to be representative of the various regional varieties of English at specific time periods. This course therefore uses the more general definition of "corpus" given in the Oxford English Dictionary (OED) Online: "The body of written or spoken material upon which a linguistic analysis is based."
(* Another online, searchable corpus for Chinese is LIVAC: Synchronous Corpus, from City University of Hong Kong, which contains texts from representative Chinese newspapers and electronic media from six localities in Asia: Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. (English/Big5) -- added 05/20/01)
COURSE OBJECTIVES
The course aims to provide a forum for students to create and use electronically-stored linguistic data in analyzing various problems and issues in Chinese linguistics. Students will gain hands-on experience using software for preparing databases and corpora and for tagging corpora, as well as using analysis tools, concordancing programs, and other software for conducting a corpus-based approach to investigating the Chinese language, its nature, structure, and use.
COURSE CONTENT
The course is data-oriented and conducted through mini-lectures, class discussions (in class and via the class mailing list), and student presentations of course assignments and projects. Assignments will combine a corpus-based approach to the analysis (quantitative and qualitative) of a linguistic problem with using and/or compiling a corpus and investigating it using relevant software for PCs and Macs. Part of the assignments will consist of finding and collecting resources, creating databases, and compiling corpora (downloading, scanning, digitizing, recording, etc.). There will also be opportunities to evaluate the strengths and limitations of concordancing programs, corpus analysis tools, annotation tools, and other computer software that can be harnessed for analyzing spoken data and written transcripts, computerized lexicons, and e-texts that are encoded in GB, Big5, and Unicode.
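As a rough illustration of what working with e-texts in GB, Big5, and Unicode involves, the Python sketch below converts bytes from a legacy encoding into UTF-8. The function name `to_utf8` and the sample strings are assumptions made for this example, not part of the course materials; `gb2312` and `big5` are Python's standard codec names.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy Chinese encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


# Hypothetical sample text: the same word in simplified and traditional characters.
simplified = "语言学"
traditional = "語言學"

# Simulate e-texts stored in the two legacy encodings.
gb_bytes = simplified.encode("gb2312")
big5_bytes = traditional.encode("big5")

# Convert both to UTF-8 so they can be processed by the same tools.
gb_as_utf8 = to_utf8(gb_bytes, "gb2312")
big5_as_utf8 = to_utf8(big5_bytes, "big5")

print(gb_as_utf8.decode("utf-8"))
print(big5_as_utf8.decode("utf-8"))
```

In practice the source encoding has to be known (or guessed) for each file; decoding Big5 bytes as GB, or vice versa, typically yields garbled text or a `UnicodeDecodeError`.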
The term paper project is a corpus-based (broadly-construed) analysis of a linguistic topic of the student's choice that makes use of both the resources created and compiled during the course, and the tools that have been tested and used during the quarter. The project may be an extension of some linguistic inquiry chosen in one or more of the homework assignments.
(Note: As the instructor has not taught this course before, or one similar to it, the content and approach to this course will very much be exploratory. The course will also evolve and change based on students' interests and needs.)
STUDENT RESPONSIBILITIES
Students are expected to:
- Attend class regularly and participate actively in class discussions and other class activities, including presenting and reporting on homework assignments.
. A mailing list for the class will also be used for dissemination of information and student-initiated discussions concerning topics brought up in class.
- Submit five short homework assignments (about 2 double-spaced pages, plus references and accompanying sound files and other types of digital files as needed).
. Students who do not have their own web account may submit their assignments on diskette (or 100MB zip disk), or via email as attachment, for the instructor to upload for class-viewing.
- For those enrolled for 5 credits: A class presentation and a written version of a short term paper project (about 5 double-spaced pages, plus references, appendices and/or data files as needed).
. Obtain topic approval from the instructor no later than Week 7.
| 3 Credits | | 5 Credits | |
|---|---|---|---|
| Attendance and class participation | 50% | Attendance and class participation | 35% |
| Take-home assignments (5) | 50% | Take-home assignments (5) | 35% |
| | | Term Paper Project | 30% |
| Total | 100% | Total | 100% |
(Note: Class is held in 354 Central Classroom from Week 1 through Week 6,
and switches to 330 Central Classroom beginning Week 7, Feb. 14.)
| Week | Date | Topic | Notes |
|---|---|---|---|
| 2 | Jan. 10 | Databases, Corpus Linguistics, and Corpus-Based Approach to Linguistic Research | |
| 4 | Jan. 24 | Concordances and Corpus-Based Studies of the Lexicon | Happy Chinese New Year! |
| 5 | Jan. 31 | Collocations, Synonyms, and Grammatical Analyses | |
| 6 | Feb. 07 | Annotations and Spoken Corpora with Transcripts | |
| 7 | Feb. 14 | Spoken Corpora for Pragmatics and Discourse-Level Research | Happy Valentine's Day! |
| 8 | Feb. 21 | Spoken and/or Written Corpora - Annotations, Tagging, and Bilingual Texts | |
| 9 | Feb. 28 | Implications and Applications of Corpus-Based Analysis | |
| 10 | March 7 | Special Activities for Last Week of Class | |
| 11 | Mar. 12 | Examination Week | Term paper project: Due by Monday, 12 March, 5:00 p.m. (Note: Request for extension must be made by the end of Week 10.) |
* An asterisk indicates a book placed on Reserve in Main Library. Library call numbers are included for sources whose call numbers I happen to have handy. More references and readings will be added during the quarter.
Note: A freely-downloadable program from LDC is Transcriber,
a tool developed by Claude Barras for segmenting, labelling, and transcribing speech in Unix, which was
then compiled for Windows NT at LDC. The program runs under Windows 9x/NT/2000, Macs, and Unix systems,
and "user-chosen encoding is used for transcriptions input/output;
most current encodings (ISO-8859-*, EUC-JP...) and Unicode
(UTF-8) can be used." Unicode enables inputting of multilingual scripts in the same text
(and hence also supports encoding of Chinese, Japanese, and Korean). For Chinese, GB and Big5 are supported.
Note that Transcriber 1.4.2 may require adjustment of screen resolution for proper display. (For example,
800x600 resolution does not display the entire Transcriber window on a PC.)
As stated in the article,
"Transcriber: a Free Tool for Segmenting, Labeling and Transcribing Speech,"
by Claude Barras, Edouard Geoffrois, Zhibiao Wu and Mark Liberman (1998), the hope is "that such a portable,
widely available and flexible tool will benefit the whole community and make it easier to develop and share corpora."
(See other Transcriber-related publications at LDC's website for Transcriber.)
New features in the next release of Transcriber (viz., version 1.4.3) include the ability to launch
WaveSurfer on a selected portion of the signal.
WaveSurfer, developed by Kare Sjölander and Jonas Beskow, is a multi-platform freeware program for
Win9x/NT/2000/Linux/Mac/Unix that handles many file extensions, long files (unlimited file size), and mono,
stereo, and 4-channel audio, and can display a time axis, waveforms, spectrograms, pitch tracks, etc.
Not included among the links in the Corpus Linguistics webpage are the following concordancers for Windows:
(1) ConcApp Concordancing Programs.
ConcApp Concordance Browser and Editor for Windows 95/98/NT
(and also works in Windows 2000) for English, Chinese, and Japanese concordancing,
and ConcApp Concordance Browser for Windows 3.1/95,
are free programs developed by Chris Greaves, downloadable
from Hong Kong Polytechnic University's Virtual Library Centre (VLC).
(2) Concordance, a commercial
program developed by Robert J.C. Watt (University of Dundee) for Windows 95/98/Me/NT/2000/XP, which can generate web concordances, in addition to
making both full concordances and fast concordances (i.e., KeyWords in Context (KWIC) concordances) from
multilingual texts.
A fully-functional trial version is downloadable from his site.
(See also Steven Vance's Review
of Concordance 2.0 (June 2001),
part of The CALICO Review at the Computer Assisted Language Instruction Consortium (CALICO).
Added 06/28/01)
(3) Simple Concordance Program (SCP).
SCP 4.0.5 (last updated on 21 December 2001) is a free, 32-bit concordancing program developed by Alan Reed for Windows 95/98/NT4/ME/2000/XP.
See his webpage, A Quick Tour
of SCP 4.05 for the Impatient. In addition to various searching capabilities, concordances can be
saved in HTML format for viewing on the web. (Added 1/24/02)
(4) KWIC Concordance for
Windows, a freeware program for Windows 95/98/NT4.0/2000/XP that makes
word frequency lists, concordances, and collocation tables, and
can handle markup schemes such as COCOA, SGML, the Helsinki corpus, and the Penn-Helsinki
Parsed Corpus of Middle English (Phase 1) (Phase 2), etc. (Added 08/08/02)
(5) KWiCFinder,
a "professional web search agent optimized for multilingual searches, which builds on AltaVista's support
for complex Boolean searches and enlists XML technology to provide a complete range of report options
in five different languages to display the Key Words of your search in Context."
KWiCFinder Report Formats
include XML format, Keywords-in-Context format, paragraph format, and table format;
requires Windows 95/98/Me and Internet Explorer 5.0 or greater.
Pre-release beta version (19 December 2001) of KWiCFinder is downloadable from
MiniAPPolis.com Downloads. (Added 1/19/02)
(6) WebKWiC
(Web Key Word in Context),
"complements Google.com's popular search engine to simplify and
accelerate the task of online research,
runs in Windows with Internet Explorer 5.0 or greater. Cross-platform and cross-browser versions may
be implemented when Netscape and Macintosh-compatible browsers provide appropriate functionality."
Downloadable from
MiniAPPolis.com Downloads.
(See KWiCFinder & WebKWiC Compared.)
(Added 1/19/02)
(7) Concordancer for Windows
(was WinConcord), version 3.0 is downloadable freeware for Windows 95, developed by Zdenek Martinek (University of West Bohemia, Pilsen, Czech Republic), in close collaboration
with Les Siegrist (Technische Universität Darmstadt, Germany). (Added 10/18/01, URL and info updated 09/03/02)
(8) WordExpert: Translators Concordancer,
Myteam's commercial Windows concordancer developed for translators, teachers, and students. (Added 1/24/02)
Also see Concordancing and
Concordancing Software.
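The keyword-in-context (KWIC) concordances that the programs listed above produce can be sketched in a few lines of Python. This is a minimal, character-based illustration and not the method of any listed program: the function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for this example. Matching on characters rather than words sidesteps word segmentation, which unsegmented Chinese text would otherwise require.

```python
from typing import List, Tuple


def kwic(text: str, keyword: str, width: int = 5) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each occurrence of keyword.

    Context is measured in characters, so the function works on
    unsegmented Chinese text as well as on space-delimited text.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, keyword, right))
        start = text.find(keyword, start + 1)  # continue after this match
    return hits


# Hypothetical sample sentence used only for illustration.
sample = "我喜欢语言学，他也喜欢语言学。"
for left, key, right in kwic(sample, "喜欢", width=3):
    print(f"{left} [{key}] {right}")
```

A fuller concordancer would also sort the hits by left or right context and align the keyword column for display, which is what distinguishes the tools above from a simple text search.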
Other text analysis software for non-DOS/non-Windows platforms includes:
(1) CLOC Collocation Package.
Alan Reed's free software for Unixes (Sun-Solaris) that can make concordances of words and phrases,
and can produce word lists and collocations. (Added 1/24/02)
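To make "word lists and collocations" concrete, here is a small Python sketch. It is not CLOC's algorithm: the function names `word_list` and `collocates`, the `window` parameter, and the sample tokens are hypothetical, and the text is assumed to be already segmented into words (here, separated by spaces).

```python
from collections import Counter
from typing import List


def word_list(tokens: List[str]) -> Counter:
    """Count how often each word occurs (a simple frequency word list)."""
    return Counter(tokens)


def collocates(tokens: List[str], node: str, window: int = 2) -> Counter:
    """Count words occurring within `window` tokens on either side of `node`."""
    counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical pre-segmented sample text (words separated by spaces).
tokens = "我 喜欢 语言学 他 也 喜欢 音乐".split()

print(word_list(tokens).most_common(3))
print(collocates(tokens, "喜欢", window=1))
```

Raw co-occurrence counts like these are only a starting point; collocation studies usually go on to compare them against overall word frequencies using an association measure.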
See also the freely-accessible, POS-tagged, American English word corpus that corresponds to the
British National Corpus (BNC), namely, the
TIME MAGAZINE, 1923-2007 (100+ million words) at
Brigham Young University. Users can obtain the frequency of particular words, phrases, substrings (prefixes,
suffixes, roots) in each decade from the 1920s-2000s; they can limit
the results by frequency in any set of years or decades; users can also study changes in syntax, collocations,
semantic shifts, etc.
Other corpora at Lancaster University:
- Lancaster Corpus of Mandarin Chinese (LCMC) (online search engine)
- The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)
- The PD Corpus Web Concordancer (online search engine that searches The People's Daily Corpus (month of January 1998), a one-million-word corpus for Mandarin Chinese, released by the Institute of Computational Linguistics at Peking University)
- PH Corpus Web Concordancer (online search engine that searches the PH Corpus, compiled by Guo Jin, consisting of about 2.4 million words of Mandarin Chinese from newswire text published by Xinhua News Agency in 1990-1991)
LINKS AND WWW RESOURCES
Note: The Linguistic Data Consortium (LDC) is "an open consortium of universities, companies and government research laboratories. It creates,
collects and distributes speech and text databases, lexicons, and other resources for research and
development purposes." (OSU has been an LDC member during various years; see LDC Online
for online resources for site members, and
check their LDC Catalog for their corpora.
Under construction is their useful information on Creating Data Resources,
in addition to their links on the LDC Related Sites webpage.)
See also the TalkBank,
an interdisciplinary research project hosted by Carnegie Mellon University
(site of the CHILDES System) and the University of Pennsylvania
(site of the Linguistic Data Consortium).
The goal of TalkBank is "to foster fundamental research in the study of human and animal communication" and
will do so by providing "standards and tools for creating, searching, and publishing primary materials via
networked computers." See also the
Open Language Archives Community,
an international project to construct the infrastructure for language archives to be linked by
community-specific metadata and centralized union catalogs.
While Concordance 2.0 was primarily for (single-byte) Latin scripts,
Concordance 3.0, available on 19 January 2002, can handle East Asian double-byte e-texts in Windows 2000/XP.
See, for example, my
Instructions
for Concordancing East Asian E-Texts using Concordance. (Added 12/05/01)
See also my PowerPoint presentation (.pps file, 2 MB, added 01/19/02),
"Concordancers, Concordances, and
Chinese Language Teaching," based on my presentation delivered at the 2001 Annual Meeting of the Chinese
Language Teachers Association, 15-18 November 2001.
Also see my "Concordancers and concordances: Tools for Chinese language teaching and research",
(PDF, revised version, 1.03 MB), Journal of the Chinese Language Teachers Association 37.2 (2002):1-58.
(This revised version of the paper has color illustrations. -- Added 08/07/02.)
Links to search engines (including the DEALL Search Engine), publishers, Asian studies associations and journals (with indices),
netnews, e-magazines, e-texts and e-text archives, etc.
Links to downloadable CJK fonts and encoders/decoders, IPA and Pinyin fonts, Unicode fonts, etc., and
online radio and TV programs.
Links to Chinese dialectology and online searchable dialect databases; Chinese linguistics associations and journals
(with tables of content/indices); information and links on Unicode; online, searchable Chinese corpora from
Academia Sinica and from Chinese University of Hong Kong, etc.
Included is a link to A Hong Kong Cantonese Child Language Corpus (CANCORP),
Thomas Lee's eight-subject, longitudinal project, with description of the corpus, depositories, etc.
There are two downloadable versions of the corpus: a Chinese(-only) version and a CHAT version (Chinese on one tier and romanization on another)
downloadable at the CHILDES System website.
Chinese display requires MS Chinese Win95/98 with Hong Kong government's Hong Kong Supplementary Character Set
(HKSCS) support for viewing the Cantonese vernacular characters used in HK that are not in GB or Big5.
Links to linguistics associations and journals (with tables of contents and indices, etc.), online dictionaries and references,
transcription and annotation tools for spoken and written corpora, (commercial and freely-downloadable) software for speech analysis,
web-authoring guides and tools, other software, etc.
. See also some tips in the WWW resources section of my Chinese 680 online course page.
Part of OhioLINK's online Research Databases.
Part of Wenze Hu and Hongyin Tao's Chinese Linguistics Page.

To cite this page: Marjorie Chan's Chinese 889. Seminar in Chinese Linguistics: Databases and Corpora for Chinese Linguistic Research (Winter 2001) <http://people.cohums.ohio-state.edu/chan9/c889-w01.htm> [Accessed <DATE>]
The logo for this webpage, "Palomar 13's Last Stand," was cropped from a photo in NASA's photo archive. "Palomar 13's Last Stand" was NASA's Astronomy Picture of the Day: 2000 November 30.
There were 5,519 hits between 11 July 1997 and 24 January 2004, of which
2,035 hits were between 11 July 1997 and 8 December 2000.
(There were 1,182 hits in the two-year period between
11 July 1997 and 10 September 1999 when this webpage was the syllabus for my seminar on
Language and Gender (Autumn 1997), and
849 hits when this webpage was the syllabus for my seminar on
Intonation and Sentence-Final Particles (Autumn 1999).
See my Chinese 889 Homepage for the cumulative listing of my Chinese 889 seminars.)
Created: 11 July 1997 for the Autumn 1997 offering of Chinese 889.
Most recent major revision: 8 December 2000 for the Winter 2001 offering of this seminar. Some links have since been updated from time to time.
Last Update: 24 April 2012.
Copyright © 1997-201x Marjorie K.M. Chan. All rights reserved on course syllabus and online materials developed for the course.
URL: http://people.cohums.ohio-state.edu/chan9/c889-w01.htm