Corpora

A language corpus (plural: corpora) is a collection of spoken or written works.  Corpora can usually be searched, and they can be used for many purposes, such as determining how frequently a word is used in a language or finding typical collocation patterns.  (A collocation is a group of words that are often used together in a language.)

Five very useful and free online corpus sites are these:

As an example of how to get started using an online corpus, we will look at the BYU-BNC web site and how to use it to determine which prepositions typically follow the verb "rely."

When you first enter the site, you will see that the left-hand panel looks like this:



To change this to examine the prepositions that follow the verb "rely" in frequency order, make the following changes:  

  • In the "WORD(S)" field, enter "rely.[v*]" without the quotation marks.
  • In the "COLLOCATES" field, enter "[ii*]" without the quotation marks.
  • Change the dropdown boxes to read "0" and "1" respectively.
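To make clearer what this search is asking for, here is a minimal offline sketch of the same idea: find every verb use of "rely" and count the prepositions in a window of 0 words to the left and 1 word to the right. The function name `collocates` and the tiny hand-tagged sample are hypothetical, made up for illustration; this is not how the corpus site itself is implemented.

```python
from collections import Counter

def collocates(tagged, node, node_prefix, coll_prefix, left=0, right=1):
    """Count words near `node` whose tag starts with `coll_prefix`.

    `tagged` is a list of (word, tag) pairs; `left` and `right` play the
    role of the two dropdown boxes (words before / after the node word).
    """
    counts = Counter()
    for i, (word, tag) in enumerate(tagged):
        # Keep only occurrences of the node word with the wanted tag class.
        if word.lower() != node or not tag.startswith(node_prefix):
            continue
        lo = max(0, i - left)
        hi = min(len(tagged), i + right + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # skip the node word itself
            w, t = tagged[j]
            if t.startswith(coll_prefix):
                counts[w.lower()] += 1
    return counts

# Hypothetical hand-tagged sample, using tags from the list in this post.
sample = [
    ("We", "PPIS2"), ("rely", "VV0"), ("on", "II"), ("data", "NN"),
    ("and", "CC"), ("they", "PPHS2"), ("rely", "VV0"), ("upon", "II"),
    ("us", "PPIO2"), ("You", "PPY"), ("can", "VM"), ("rely", "VVI"),
    ("on", "II"), ("it", "PPH1"),
]

result = collocates(sample, "rely", "V", "II")
print(result.most_common())  # [('on', 2), ('upon', 1)]
```

Setting `left=0` and `right=1` corresponds to the "0" and "1" dropdown values: only the word immediately after "rely" is considered.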



Once the changes above are made, click the "SEARCH" button, and the results below will be displayed.  (Note that the results may be slightly different when you run this, as new data may have been entered since the time I wrote this entry.)



This shows that the most common preposition to follow the verb "rely" is "on," with a frequency of 1807 out of a total of 2072 entries.  The second most common preposition to follow "rely" is "upon" with 229 uses out of a total of 2072.
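The raw counts are easier to compare as percentages. The short sketch below does that arithmetic with the figures quoted above; the variable names are my own.

```python
# Counts quoted above for prepositions following the verb "rely".
total = 2072
counts = {"on": 1807, "upon": 229}

# Share of each preposition, as a percentage rounded to one decimal place.
shares = {prep: round(100 * n / total, 1) for prep, n in counts.items()}
remaining = total - sum(counts.values())

print(shares)     # {'on': 87.2, 'upon': 11.1}
print(remaining)  # 36
```

In other words, "on" and "upon" together cover all but 36 of the 2072 entries.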

Perhaps the most difficult thing about using the BYU corpus interface is finding which abbreviations are used for each part of speech (POS).  The list is currently hosted at http://ucrel.lancs.ac.uk/claws7tags.html.  I have also pasted the current set below.

APPGE   possessive pronoun, pre-nominal (e.g. my, your, our)
AT      article (e.g. the, no)
AT1     singular article (e.g. a, an, every)
BCL     before-clause marker (e.g. in order (that), in order (to))
CC      coordinating conjunction (e.g. and, or)
CCB     adversative coordinating conjunction (but)
CS      subordinating conjunction (e.g. if, because, unless, so, for)
CSA     as (as conjunction)
CSN     than (as conjunction)
CST     that (as conjunction)
CSW     whether (as conjunction)
DA      after-determiner or post-determiner capable of pronominal function (e.g. such, former, same)
DA1     singular after-determiner (e.g. little, much)
DA2     plural after-determiner (e.g. few, several, many)
DAR     comparative after-determiner (e.g. more, less, fewer)
DAT     superlative after-determiner (e.g. most, least, fewest)
DB      before-determiner or pre-determiner capable of pronominal function (all, half)
DB2     plural before-determiner (both)
DD      determiner (capable of pronominal function) (e.g. any, some)
DD1     singular determiner (e.g. this, that, another)
DD2     plural determiner (these, those)
DDQ     wh-determiner (which, what)
DDQGE   wh-determiner, genitive (whose)
DDQV    wh-ever determiner (whichever, whatever)
EX      existential there
FO      formula
FU      unclassified word
FW      foreign word
GE      Germanic genitive marker (' or 's)
IF      for (as preposition)
II      general preposition
IO      of (as preposition)
IW      with, without (as prepositions)
JJ      general adjective
JJR     general comparative adjective (e.g. older, better, stronger)
JJT     general superlative adjective (e.g. oldest, best, strongest)
JK      catenative adjective (able in be able to, willing in be willing to)
MC      cardinal number, neutral for number (two, three...)
MC1     singular cardinal number (one)
MC2     plural cardinal number (e.g. sixes, sevens)
MCGE    genitive cardinal number, neutral for number (two's, 100's)
MCMC    hyphenated number (40-50, 1770-1827)
MD      ordinal number (e.g. first, second, next, last)
MF      fraction, neutral for number (e.g. quarters, two-thirds)
ND1     singular noun of direction (e.g. north, southeast)
NN      common noun, neutral for number (e.g. sheep, cod, headquarters)
NN1     singular common noun (e.g. book, girl)
NN2     plural common noun (e.g. books, girls)
NNA     following noun of title (e.g. M.A.)
NNB     preceding noun of title (e.g. Mr., Prof.)
NNL1    singular locative noun (e.g. Island, Street)
NNL2    plural locative noun (e.g. Islands, Streets)
NNO     numeral noun, neutral for number (e.g. dozen, hundred)
NNO2    numeral noun, plural (e.g. hundreds, thousands)
NNT1    temporal noun, singular (e.g. day, week, year)
NNT2    temporal noun, plural (e.g. days, weeks, years)
NNU     unit of measurement, neutral for number (e.g. in, cc)
NNU1    singular unit of measurement (e.g. inch, centimetre)
NNU2    plural unit of measurement (e.g. ins., feet)
NP      proper noun, neutral for number (e.g. IBM, Andes)
NP1     singular proper noun (e.g. London, Jane, Frederick)
NP2     plural proper noun (e.g. Browns, Reagans, Koreas)
NPD1    singular weekday noun (e.g. Sunday)
NPD2    plural weekday noun (e.g. Sundays)
NPM1    singular month noun (e.g. October)
NPM2    plural month noun (e.g. Octobers)
PN      indefinite pronoun, neutral for number (none)
PN1     indefinite pronoun, singular (e.g. anyone, everything, nobody, one)
PNQO    objective wh-pronoun (whom)
PNQS    subjective wh-pronoun (who)
PNQV    wh-ever pronoun (whoever)
PNX1    reflexive indefinite pronoun (oneself)
PPGE    nominal possessive personal pronoun (e.g. mine, yours)
PPH1    3rd person sing. neuter personal pronoun (it)
PPHO1   3rd person sing. objective personal pronoun (him, her)
PPHO2   3rd person plural objective personal pronoun (them)
PPHS1   3rd person sing. subjective personal pronoun (he, she)
PPHS2   3rd person plural subjective personal pronoun (they)
PPIO1   1st person sing. objective personal pronoun (me)
PPIO2   1st person plural objective personal pronoun (us)
PPIS1   1st person sing. subjective personal pronoun (I)
PPIS2   1st person plural subjective personal pronoun (we)
PPX1    singular reflexive personal pronoun (e.g. yourself, itself)
PPX2    plural reflexive personal pronoun (e.g. yourselves, themselves)
PPY     2nd person personal pronoun (you)
RA      adverb, after nominal head (e.g. else, galore)
REX     adverb introducing appositional constructions (namely, e.g.)
RG      degree adverb (very, so, too)
RGQ     wh- degree adverb (how)
RGQV    wh-ever degree adverb (however)
RGR     comparative degree adverb (more, less)
RGT     superlative degree adverb (most, least)
RL      locative adverb (e.g. alongside, forward)
RP      prep. adverb, particle (e.g. about, in)
RPK     prep. adv., catenative (about in be about to)
RR      general adverb
RRQ     wh- general adverb (where, when, why, how)
RRQV    wh-ever general adverb (wherever, whenever)
RRR     comparative general adverb (e.g. better, longer)
RRT     superlative general adverb (e.g. best, longest)
RT      quasi-nominal adverb of time (e.g. now, tomorrow)
TO      infinitive marker (to)
UH      interjection (e.g. oh, yes, um)
VB0     be, base form (finite i.e. imperative, subjunctive)
VBDR    were
VBDZ    was
VBG     being
VBI     be, infinitive (To be or not... It will be...)
VBM     am
VBN     been
VBR     are
VBZ     is
VD0     do, base form (finite)
VDD     did
VDG     doing
VDI     do, infinitive (I may do... To do...)
VDN     done
VDZ     does
VH0     have, base form (finite)
VHD     had (past tense)
VHG     having
VHI     have, infinitive
VHN     had (past participle)
VHZ     has
VM      modal auxiliary (can, will, would, etc.)
VMK     modal catenative (ought, used)
VV0     base form of lexical verb (e.g. give, work)
VVD     past tense of lexical verb (e.g. gave, worked)
VVG     -ing participle of lexical verb (e.g. giving, working)
VVGK    -ing participle catenative (going in be going to)
VVI     infinitive (e.g. to give... It will work...)
VVN     past participle of lexical verb (e.g. given, worked)
VVNK    past participle catenative (e.g. bound in be bound to)
VVZ     -s form of lexical verb (e.g. gives, works)
XX      not, n't
ZZ1     singular letter of the alphabet (e.g. A, b)
ZZ2     plural letter of the alphabet (e.g. A's, b's)
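
When choosing a pattern such as "[v*]" or "[ii*]", it can help to check which tags it would cover. The sketch below assumes the bracketed pattern behaves like a shell-style wildcard matched case-insensitively against the tag names; that is my assumption about the interface, not something documented here. The `TAGS` list is only a small subset of the tags above, and `matching_tags` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# A small subset of the tags listed above, for illustration only.
TAGS = ["II", "IF", "IO", "IW", "VV0", "VVD", "VVG", "VVI", "VVN", "VVZ",
        "VM", "NN1"]

def matching_tags(pattern, tags=TAGS):
    """Return the tags matched by a wildcard pattern, ignoring case."""
    return [t for t in tags if fnmatchcase(t.lower(), pattern.lower())]

print(matching_tags("ii*"))  # ['II']
print(matching_tags("i*"))   # ['II', 'IF', 'IO', 'IW']
print(matching_tags("v*"))   # ['VV0', 'VVD', 'VVG', 'VVI', 'VVN', 'VVZ', 'VM']
```

Under that assumption, "[ii*]" covers only general prepositions, while "for", "of", "with" and "without" carry their own tags (IF, IO, IW) and would need a broader pattern such as "[i*]".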


NOTE: "DITTO TAGS"

Any of the tags listed above may in theory be modified by the addition of a pair of digits, e.g. DD21, DD22. This signifies that the tag occurs as part of a sequence of similar tags, representing a sequence of words which for grammatical purposes is treated as a single unit. For example, the expression "in terms of" is treated as a single preposition, receiving the tags:
                in_II31 terms_II32 of_II33
The first of the two digits indicates the number of words/tags in the sequence, and the second digit indicates the position of each word within that sequence.
Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG, which looks for a range of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is taken into account, and also that, depending on the context, ditto tags may or may not be required for a particular word sequence:

at_RR21 length_RR22  a_DD21/RR21 lot_DD22/RR22  in_CS21/II that_CS22/DD1   
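
As a rough illustration of how the two ditto digits can be read, the sketch below splits word_TAG tokens, strips the digit pair, and regroups multi-word units. It relies on the observation that no plain tag in the list above ends in two digits; `parse_token` and `group_units` are hypothetical helpers, and the sketch deliberately does not handle the ambiguous slash-separated idiomlist entries such as a_DD21/RR21.

```python
import re

# A ditto tag is a base tag followed by two digits:
# (number of words in the unit)(position of this word in the unit).
DITTO = re.compile(r"^(.+?)(\d)(\d)$")

def parse_token(token):
    """Split 'word_TAG' into (word, base_tag, unit_length, position)."""
    word, _, tag = token.rpartition("_")
    m = DITTO.match(tag)
    if m:
        return word, m.group(1), int(m.group(2)), int(m.group(3))
    return word, tag, 1, 1  # ordinary tag: a one-word unit

def group_units(tagged_text):
    """Merge ditto-tagged words into single (phrase, base_tag) units."""
    units, current = [], []
    for token in tagged_text.split():
        word, tag, length, pos = parse_token(token)
        current.append(word)
        if pos == length:  # last word of the unit
            units.append((" ".join(current), tag))
            current = []
    return units

units = group_units("in_II31 terms_II32 of_II33 cost_NN1")
print(units)  # [('in terms of', 'II'), ('cost', 'NN1')]
```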
