Description: The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. The ANC may eventually include a range of genres comparable to that of the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.
• The ANC in its current size of 22 million words is available from the Linguistic Data Consortium. A 15 million word subset of the corpus, called the Open American National Corpus (OANC), is freely available with no restrictions on its use from the ANC Website.
• The corpus and its annotations are provided according to the specifications of ISO/TC 37 SC4's Linguistic Annotation Framework. Using a freely provided transduction tool, the corpus and user-chosen annotations can be rendered in multiple formats, including an XML format conformant to the XML Corpus Encoding Standard (XCES) (usable with the British National Corpus's XAIRA search engine), a UIMA-compliant format, and formats suitable for input to a wide variety of concordance software.
• The frequency information includes counts for any token that has been assigned a part of speech tag by the part of speech tagger. Therefore, tokens such as the possessive 's are counted as "words". The frequency counts were generated by reading the standoff annotation files for the Penn part of speech tags to obtain the lemma, the part of speech, and the start and end offsets of each word in the text. Each occurrence of a word was then extracted from the content using those offsets and stored in the triple {type, lemma, part of speech}. Identical triples were then tallied to obtain the frequency counts.
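The counting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the ANC project's own code: the `Annotation` record, its field names, the exclusive end offset, and the in-memory sample text are all assumptions made for the example. The real standoff annotation files would first have to be parsed into such records.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Annotation(NamedTuple):
    # Hypothetical record for one standoff annotation (field names are assumptions).
    start: int  # start offset of the token in the text
    end: int    # end offset, assumed exclusive here
    lemma: str
    pos: str    # Penn part of speech tag


def count_triples(text: str, annotations: Iterable[Annotation]) -> Counter:
    """Tally (type, lemma, part of speech) triples."""
    counts = Counter()
    for ann in annotations:
        # The "type" is the occurrence exactly as it appears in the text.
        word = text[ann.start:ann.end]
        counts[(word, ann.lemma, ann.pos)] += 1
    return counts


# Invented sample data, for illustration only.
text = "The dog's dogs bark. The dogs bark."
annotations = [
    Annotation(0, 3, "the", "DT"),
    Annotation(4, 7, "dog", "NN"),
    Annotation(7, 9, "'s", "POS"),   # the possessive 's counts as a "word"
    Annotation(10, 14, "dog", "NNS"),
    Annotation(15, 19, "bark", "VBP"),
    Annotation(19, 20, ".", "."),
    Annotation(21, 24, "the", "DT"),
    Annotation(25, 29, "dog", "NNS"),
    Annotation(30, 34, "bark", "VBP"),
    Annotation(34, 35, ".", "."),
]

counts = count_triples(text, annotations)
for triple, n in counts.most_common(3):
    print(triple, n)
```

Because the key is the whole triple, the same surface form with different tags (for example "bark" as a noun and as a verb) would be counted separately.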
Known Problems
• The accuracy of the frequency counts depends on the accuracy of the tokenization. We note the following issues:
• mdash - Several documents use a pair of hyphens (--) to represent the mdash. When there is no whitespace on either side of the pair, the tokenizer mistakenly classifies the entire string as a hyphenated word. To account for this, when a token of the form word1--word2 was encountered, two triples were created: {word1, word1, UNC} and {word2, word2, UNC} (where UNC is the part of speech tag for unclassified).
• Numbers tagged as nouns are included in the frequency counts (for example, "727" as in "Boeing 727").
• Sequences of characters that are not "words" are counted. For example, many scientific papers include strings of gene sequences of the form (a|c|g|t)*. Similarly, spoken and informal written texts (blogs etc.) contain strings representing vocal sounds, for example: aaaaahhh, aaarrrgghhhhh, etc.
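The workaround for the double-hyphen issue listed above might look something like the sketch below. The function name, the regular expression, and the choice to split only at the first double hyphen are assumptions for illustration. Only the resulting triples {word1, word1, UNC} and {word2, word2, UNC} come from the description itself.

```python
import re

# Matches word1--word2 with no whitespace around the double hyphen.
# Assumption: split at the first "--" only; a single "-" stays inside a word.
DOUBLE_HYPHEN = re.compile(r"^(\S+?)--(\S+)$")

UNC = "UNC"  # part of speech tag for "unclassified"


def split_double_hyphen(token, lemma, pos):
    """Return one triple, or two UNC triples for a word1--word2 token."""
    match = DOUBLE_HYPHEN.match(token)
    if match is None:
        return [(token, lemma, pos)]
    word1, word2 = match.groups()
    # The original lemma and tag are discarded: the word itself is used as
    # the lemma and the tag becomes UNC.
    return [(word1, word1, UNC), (word2, word2, UNC)]


print(split_double_hyphen("said--but", "said--but", "NN"))
print(split_double_hyphen("well-known", "well-known", "JJ"))
```

Genuine hyphenated words such as "well-known" pass through unchanged, since the pattern requires two consecutive hyphens.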
• Copyright © 2002-2010 American National Corpus Project. All rights reserved.
