Introduction to Corpora and Concordancing and Data Driven Learning (DDL)

Query commands in Collins Cobuild Corpus Concordancer

Simple Queries

Type in a single word. This node word will be shown in the middle of the screen in a maximum of forty contexts.

Ultimately

Nature

Absolve

Fellow

mob
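To make the idea of a node word concrete, here is a minimal sketch of a keyword-in-context (KWIC) display in Python. It is not how the Cobuild concordancer itself works; the function name kwic, the sample text and the context width are illustrative assumptions. Only the limit of forty lines comes from the description above.

```python
import re

def kwic(text, node, width=30, max_lines=40):
    """Return up to max_lines concordance lines with the node word centred."""
    lines = []
    # Whole-word, case-insensitive match for the node word.
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
        if len(lines) == max_lines:
            break
    return lines

# Invented sample text, for illustration only.
sample = ("The mob gathered outside the hall. Nobody expected the mob to "
          "disperse so quickly, but the mob had other plans.")
for line in kwic(sample, "mob", width=20):
    print(line)
```

Each printed line has the node word in the same column, which is what makes patterns to its left and right easy to spot.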

Adjacent words

must be joined with + and without spaces.

Building+block

Have+long+been

a+has+been

spoon+knife+fork

Non adjacent words

add the maximum number of intervening words

Dog+4bark

Take+2for
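What a query such as Dog+4bark asks for can be sketched in a few lines of Python. This is only an illustration of the matching rule (the second word must follow the first with no more than the stated number of words in between); the function name find_with_gap and the sample sentence are assumptions, not part of the Cobuild software.

```python
def find_with_gap(tokens, first, second, max_gap):
    """Yield (i, j) index pairs where `second` follows `first`
    with at most `max_gap` intervening tokens."""
    lowered = [t.lower() for t in tokens]
    first, second = first.lower(), second.lower()
    for i, tok in enumerate(lowered):
        if tok != first:
            continue
        # There are j - i - 1 words in between, so j may go up to i + max_gap + 1.
        for j in range(i + 1, min(i + max_gap + 2, len(lowered))):
            if lowered[j] == second:
                yield i, j

# Invented sample sentence, for illustration only.
tokens = "the dog next door began to bark at night".split()
print(list(find_with_gap(tokens, "dog", "bark", 4)))  # four words in between: a hit
print(list(find_with_gap(tokens, "dog", "bark", 2)))  # gap too small: no hit
```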

 

Inflected Forms

show the lemma using @

Fall@

Have@+fall@+in+love+with
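The effect of @ can be imitated with a small lookup table. The sketch below is an assumption-laden illustration: the LEMMAS table is hand-made and covers only two verbs, and the function names are hypothetical. A real concordancer would draw on a full lexicon.

```python
# Hypothetical, hand-made lemma table covering just two verbs.
LEMMAS = {
    "fall": {"fall", "falls", "falling", "fell", "fallen"},
    "have": {"have", "has", "having", "had"},
}

def matches_term(token, term):
    """Match one query term; a trailing @ means 'any form of this lemma'."""
    token, term = token.lower(), term.lower()
    if term.endswith("@"):
        base = term[:-1]
        # Unknown lemmas fall back to matching the base form only.
        return token in LEMMAS.get(base, {base})
    return token == term

def find_phrase(tokens, query):
    """Return the start indexes where the +-joined query matches adjacent tokens."""
    terms = query.split("+")
    hits = []
    for i in range(len(tokens) - len(terms) + 1):
        if all(matches_term(tokens[i + k], t) for k, t in enumerate(terms)):
            hits.append(i)
    return hits

# Invented sample sentence, for illustration only.
tokens = "she had fallen in love with the city".split()
print(find_phrase(tokens, "Have@+fall@+in+love+with"))
```

Here had matches Have@ and fallen matches fall@, so the phrase is found even though neither base form appears in the sentence.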

Trailing wildcard

show words that start with this, using *

Adjust*

Ander*

Im*

Un*

Word sets

display all these words in the centre using |

Although|though

Which|that

Home|house
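If you know regular expressions, the trailing wildcard and the word set map onto them quite directly. The following Python sketch shows one possible translation; the function name term_to_regex and the sample sentence are assumptions made for illustration, and the Cobuild software may treat edge cases differently.

```python
import re

def term_to_regex(term):
    """Translate one query term containing * or | into a regular expression."""
    alternatives = []
    for alt in term.split("|"):
        if alt.endswith("*"):
            # Trailing wildcard: the stem followed by any further word characters.
            alternatives.append(re.escape(alt[:-1]) + r"\w*")
        else:
            alternatives.append(re.escape(alt))
    # \b keeps matches to whole words; the search ignores case.
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

# Invented sample sentence, for illustration only.
text = "Although we adjusted it, the adjustment failed, though not badly."
print(term_to_regex("Adjust*").findall(text))
print(term_to_regex("Although|though").findall(text))
```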

Part-of-speech tags

search for a word as a particular POS,  using /CAPITAL LETTERS

House/VERB

Holiday/VERB

 

Tag     Meaning

NOUN    stands for any noun tag

VERB    stands for any verb tag

NN      common noun

NNP     noun plural

JJ      adjective

RB      adverb

VB      base-form verb

VBN     past participle verb

VBG     -ing form verb

VBD     past tense verb

Register

A choice of three registers gives you some flexibility: American or British English, spoken or written.

Combining

these can all be combined to form a great range of possible searches.

find passives: be@+VBN

discontinuous phrasal verbs: let@+2in+on

Or this:

·        is|has+VBN

·        would|should|could+5if

·        if+5would|should|could

         

 

Lemma = the full set of inflected forms of a word, e.g. fall, falls, falling, fell, fallen.

·        Sample version of Collins Cobuild Corpus Concordancer at http://titania.cobuild.collins.co.uk/form.html . This sample version of the concordancer gives a maximum of 40 lines per search.

·        What can we say vs what do we say? (Hypothesis, evidence, conclusion. Cognitive processing. Discovery learning.)

A basic premise of data driven learning

If we accept grammar as FACTS, PATTERNS and CHOICES, finding multiple examples of them can provide meaningful teaching material. By supplying many immediate contexts, these examples help the learner develop a fuller view of the language item being studied.

 

Note: Because the Cobuild sampler returns concordances of only 40 lines, the results are short enough to analyse by hand, with no need for further machine intervention. This is not without its advantages. Here are some things you might like to research using the concordancer:

How words behave (or misbehave)

fast as an adjective and adverb (and noun and verb)

record as a noun and verb

base as a noun, verb, adjective

more+adj vs more+adv

Passive

e.g. observing the difference between get and be passive (using |, or doing separate searches)

Phrasal verbs

·        let+2in+on

·        wake+up+to – sometimes literal, sometimes figurative.

·        to give up something     give@+up

·        to give something up     give@+2up

 

Can you find these combinations in the concordancer?

 

        for     in      up

Fall    ____    ____    ____

Give    ____    ____    ____

Take    ____    ____    ____

Get     ____    ____    ____

Can these phrasal verbs be discontinuous? Never, sometimes, always?

Can we say he fell down? Yes, but DO we?

Delexical verbs

Search for the verb followed by a (word+a), e.g. take+a. When you find a delexical verb group, write it into the table. The noun will often be modified, e.g. to have/take a long hot bath. Often there is a single verb with the same or a similar meaning, e.g. to give a radiant smile vs to smile radiantly, or to take a photo vs to photograph. But …

 

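Give    ________________________________________

Take    ________________________________________

Get     ________________________________________

Have    ________________________________________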

For more on delexical verbs: http://www.netlanguages.com/demo/samples/level7/unit5/04_1.htm, for example.

American vs English

·        Different from and different than.

·        Dived or dove? Also, incidentally, Dove/VERB vs dove/NOUN

·        Momentarily, smart, fancy, football

·        For more on American and English English, try http://www.americansc.org.uk/berube.htm and http://members.tripod.com/~Duermueller/ESL2.html, for example.

Usage

·        Can cannot be written as separate words (can not)? Yes, but is it? Should there be an apostrophe in the 1970s?

·        How differently are the words fact and facts used?

Spoken and Written English

·        Are moreover and whereas used in speech, or do they belong to the written language?

·        would have thought – is this chunk used in written English?

·        question tags

Gender

·        he+fall@+in+love+with  – what do men fall in love with?

·        she+fall@+in+love+with  – what do women fall in love with?

·        the boss

·        he looks vs she looks

·        my … husband or boyfriend etc what adjectives turn up here?

·        my … wife or girlfriend etc

Collocations

Which adverbs describe smiling, running, studying?

What’s the difference between faulty and broken and defective?

eye vs eyes

degree vs degrees

 

The collocation span in the Cobuild Sampler is automatically set at four words to the left and four to the right of the node word. The statistics that appear with the lists of collocates are:

·        Raw freq often picks out the obvious collocates ("post office" "side effect") but you have no way of distinguishing these objectively from frequent non-collocations (like "the effect" "an effect" "effect is" "effect it" etc).

·        MI (Mutual Information) will highlight the technical terms, oddities, weirdos, totally fixed phrases, etc ("post mortem" "Laurens van der Post" "post-menopausal" "prepaid post"/"post prepaid" "post-grad").

·        T-score will get you significant collocates which have occurred frequently ("post office" "Washington Post" "post-war", "by post" "the post").

 

Note: If a collocate appears in the top of both MI and t-score lists it is clearly a humdinger of a collocate, rock-solid, typical, frequent, strongly associated with its node word, recurrent, reliable, etc etc etc.

(This information comes from the end of a more detailed description of the statistics which you can read by clicking on the column headings on the collocations page in Cobuild on-line).
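To see why the two statistics pull in different directions, it can help to work through some numbers. The Python sketch below uses one common textbook formulation of each score, which may differ in detail from what Cobuild actually computes, and the counts are invented purely for illustration. All function and variable names are assumptions.

```python
import math

def expected(f_node, f_collocate, corpus_size):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_collocate / corpus_size

def mi_score(observed, f_node, f_collocate, corpus_size):
    """Mutual information: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected(f_node, f_collocate, corpus_size))

def t_score(observed, f_node, f_collocate, corpus_size):
    """T-score: (observed - expected) divided by the square root of observed."""
    exp = expected(f_node, f_collocate, corpus_size)
    return (observed - exp) / math.sqrt(observed)

# Invented counts for illustration only (not taken from the Cobuild corpus).
N = 1_000_000    # corpus size in words
F_NODE = 1_000   # frequency of the node word
rare = dict(observed=8, f_collocate=10)          # rare word, nearly always with the node
common = dict(observed=200, f_collocate=50_000)  # very frequent word

for name, c in (("rare", rare), ("common", common)):
    mi = mi_score(c["observed"], F_NODE, c["f_collocate"], N)
    t = t_score(c["observed"], F_NODE, c["f_collocate"], N)
    print(f"{name:7} MI={mi:5.2f}  t={t:5.2f}")
```

With these made-up counts the rare partner gets the higher MI and the common word the higher t-score, which mirrors the contrast described in the list above.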

Using the World Wide Web as your corpus.

The Cobuild sampler, unlike non-sample software, does not let us see the whole sentence, let alone the larger context of its concordance lines. When studying discourse markers such as regarding, as for and furthermore, a larger context is generally desirable. You can search for such items in a normal web search and get millions of hits, which you can reduce by including some topic words, such as environment tidal energy, so that you find your discourse markers within a genre. And instead of opening the article by clicking on the blue underlined title, click on cache and the search words will be highlighted (or highlit? – ask a corpus).

Some Resources

1.      The home page of Tim Johns, of Data Driven Learning fame: http://web.bham.ac.uk/johnstf/.

2.      Mike Scott http://www.liv.ac.uk/~ms2928/homepage.html co-authored Microconcord (a DOS concordancer with a 2-million-word corpus, still good and very fast) with Tim Johns, and then wrote the more sophisticated Wordsmith Tools (for Windows), which doesn't come with any corpus.

3.      Tom Cobb http://132.208.224.131/ The Compleat Lexical Tutor – various applications of corpus work especially for vocabulary teaching. An article by him about using  concordance software to provide learners with a rich language learning experience can be found at http://pages.infinit.net/jaguar3/lounge/concord/default.htm

4.      Spaceless http://www.spaceless.com/concord/ This concordancer takes the text of a web page and creates a list of sentences that contain the search term.

5.      The VLC concordancer. http://vlc.polyu.edu.hk/scripts/concordance/WWWConcapp.htm

6.      Corpora in the Teaching of Languages and Linguistics by Tony McEnery and Andrew Wilson. This site contains the authors’ summary of their book: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus4/4fra1.htm

7.      An article by Vance Stevens, "Concordancing with Language Learners: Why? When? What?": http://www.ruf.rice.edu/~barlow/stevens.html

Last updated: 22/03/2004

Antonia Domínguez Miguela