Linguistic Data Consortium, University of Pennsylvania
The Spoken BNC: samples of "language in the wild"
The spoken part of the British National Corpus consists of about 1,800
hours (about 10 million words) of unscripted speech. It has two parts
of roughly equal size:
a demographic part, of informal talk recorded by a socially-stratified sample of respondents, selected by age group, social class and geographic region;
a context-governed part, recorded in more formal situations such as meetings, debates, lectures, seminars, religious services, radio programmes etc.
For the demographic part, random location sampling procedures were used to
recruit 124 people aged over 15 from across the United Kingdom, with
approximately equal numbers of men and women, from each of five age
groups and four social classes. Each recruit used a portable tape
recorder to record their own speech and the speech of people they
conversed with over a period of up to a week. Recordings of people
under 16 were contributed to the BNC as part of the University of
Bergen COLT (Corpus of London Teenage Language) project, using the same
recording methodology.
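To get a rough sense of how thinly 124 recruits are spread over the sampling categories described above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the group labels are hypothetical placeholders (the text gives the number of age groups and social classes, not their names), and the fully crossed grid of sex by age group by social class is an assumption made for the calculation, not a documented detail of the BNC design.

```python
from itertools import product

# Figures stated in the text above.
RECRUITS = 124
N_AGE_GROUPS = 5
N_SOCIAL_CLASSES = 4
SEXES = ["female", "male"]

# Hypothetical placeholder labels: the text gives only the counts.
age_groups = [f"age_group_{i}" for i in range(1, N_AGE_GROUPS + 1)]
social_classes = [f"class_{i}" for i in range(1, N_SOCIAL_CLASSES + 1)]

# Assumption: every combination of sex, age group and social class
# forms one sampling cell (a fully crossed grid).
cells = list(product(SEXES, age_groups, social_classes))
mean_per_cell = RECRUITS / len(cells)

print(f"{len(cells)} cells, about {mean_per_cell:.1f} recruits per cell")
```

Under that assumption there are 40 cells with roughly three recruits each, which is why each recruit's week of recordings, including the speech of everyone they talked to, matters so much to the size of the demographic part.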
The demographic part is a vast treasure-house of "language in the wild",
and is about
as close as it is possible to get (without covert recording) to "real
speech". Here are a few samples: