Audio BNC: the audio edition of the Spoken British National Corpus
John Coleman, Ladan Baghai-Ravary, John Pybus, and Sergio Grau (2012) Audio BNC: the audio edition of the Spoken British National Corpus. Phonetics Laboratory, University of Oxford. http://www.phon.ox.ac.uk/AudioBNC
About the corpus (to skip this description, go to the section "Accessing the recordings" below)
This site presents most (but not yet all) of the audio recordings from the spoken part of the British National Corpus, digitized from the analogue audio cassette tapes deposited at the British Library Sound Archive, together with associated transcription and annotation files created in a sequence of projects, especially Mining a Year of Speech and Word joins in real-life speech. Oxford University is responsible for curating and publishing the corpus, and the British Library is responsible for archiving and curating the audio recordings from the BNC and ensuring public access.
The British Library Sound Archive, in collaboration with Oxford University Phonetics Laboratory, digitized all of the extant tapes in its possession in 2009-10. Under the terms of the original recording permissions agreement with the contributors, "all tapes and conversation details will be completely anonymous, and will be used for scientific study and publication by writers of dictionaries and educational material and language researchers"; it has therefore been necessary for us to locate and mute all of the portions of the audio corresponding to the anonymization <gap> tags in the TEI-XML editions of the Spoken BNC. Over 18,710 <gap> tags in the TEI-XML transcriptions have been individually checked to ensure that the anonymization has been carried out correctly. In due course, it is planned to provide long-term access via search and browsing tools to stable URIs. In the meantime, we offer this initial release, partly as a test-bed for researchers and developers, and partly to avoid further delay. (NB: we have discovered that the extant sound recordings contain only about 7.5 million words, not the 10 million words originally transcribed. There is a substantial number of XML transcription files for which we may no longer have the original audiotapes. Or perhaps we do: we also have quite a few recordings that we haven't yet related to any transcription. Also, the audio recordings from the Bergen Corpus of London Teenage Language, a part of the BNC, are not included here, but are available from the University of Bergen.)
In order to locate anonymization gaps, as well as to index the recordings with all transcribed vowels, consonants, and words, we aligned the text transcriptions to the audio using a forced aligner based on HTK, using a combination of our own acoustic models for British English and American English models from P2FA, the Penn Phonetics Lab Forced Aligner. The alignment procedure yields a best-fitting phonemic transcription of the audio, together with detailed timing information: the start and end time of every vowel, consonant, word, utterance and recording. These data are encoded as Praat TextGrid files, which we also provide in this release. A short paper on the Mining a Year of Speech project, under which we began this work, can be downloaded from here.
A classification of the participant occupations by areas of work and NRS Social Grades A, B, C1, C2, D and E, created by Katie Henley, is available from here.
Previous releases of BNC spoken audio material
The BNC spoken audio recordings have been (and still are) available for study by language researchers visiting the British Library Sound Archive in person; however, until our recent digitization project, neither the online catalogue nor the TEI-XML editions of the transcriptions were sufficiently informative for researchers to be able to easily find tapes or portions of interest. By issuing our forced alignment index files, we aim to make the researchers' task substantially easier. A subset of the recordings in the BNC has previously been published in mp3 format on CD-ROMs as COLT: the Bergen Corpus of London Teenage Language. A smaller sample on audio cassette was distributed by Longman during the BNC collection project (Cassette Sleeve images).
Accessing the recordings
If you wish to access the recordings and associated files, please read the copyright terms below and register using the form at the bottom of this page. Registered users are welcome to link to or directly access the sound files and associated annotation and transcription files.
The audio files are 16-bit, 1-channel (monophonic) .wav files, with sampling rate 16,000 samples per second. Their rather long filenames encode a combination of the British Library's catalogue code, BNC tape number and the 3-character "BNC codes".
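The stated audio format can be checked on any downloaded file with Python's standard `wave` module. The following is a minimal sketch, not part of the release; the function names `describe_wav` and `matches_release_format` are hypothetical helpers introduced here for illustration.

```python
import wave


def describe_wav(source):
    """Read the header of a .wav file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate": rate,
            "duration_s": wav.getnframes() / rate,
        }


def matches_release_format(info):
    """True if the file is 1-channel, 16-bit, 16,000 samples per second."""
    return (
        info["channels"],
        info["bits_per_sample"],
        info["sample_rate"],
    ) == (1, 16, 16000)
```

For example, `matches_release_format(describe_wav("some-downloaded-file.wav"))` should be true for a file in the format described above (the file name here is a placeholder).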
Suppose you wish to find the .wav file containing the dialect word "gronnies", which occurs only once in the BNC. From the published BNC, you can find that it occurs in transcription file KBW.xml. (You can also download html versions of these transcription files from here.) Inspection of that transcription shows that the word "gronnies" is in <div> number 022505 (<div n="022505">), which is the 5th <div> in tape number 0225. The XML transcriptions do not record whether it is on the A side or the B side of the tape, but from this information it can be inferred that the required recording is either
http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-AAZZP0.wav (A-side) or
http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-ABZZP0.wav (B-side). The syntax of these URIs is as follows. There are three slightly different filename formats, for different ranges of tape numbers:
| Audio server URL | BL catalogue code | Tape number | Infix | Side | Suffix |
|---|---|---|---|---|---|
| http://bnc.phon.ox.ac.uk/data/ | 021A-C0897X | 0004 to 0087, and 0091-0905 | XX-A | A or B | ZZP0.wav |
| http://bnc.phon.ox.ac.uk/data/ | 021A-C0897X | For some tapes from 00882 to 00993 | X-0 | 1 or 2 | 00P0.wav |
| http://bnc.phon.ox.ac.uk/data/ | 021A-C0897X | For some tapes from 097700 to 125500 | XX-0 | 1 or 2 | 00P0.wav |
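The "gronnies" lookup above can be sketched in a few lines of Python. This is an illustration only: it covers just the first filename format (four-digit tape numbers, sides A and B), and it assumes that a `<div n="...">` value is always four tape digits followed by a two-digit division index, which is inferred from the single example "022505". The function names are hypothetical.

```python
from __future__ import annotations

import re

BASE_URL = "http://bnc.phon.ox.ac.uk/data/"
CATALOGUE_CODE = "021A-C0897X"


def tape_from_div(div_n: str) -> tuple[str, int]:
    """Split a <div n="..."> value such as "022505" into (tape, div index).

    Assumption: four tape digits followed by a two-digit division index.
    """
    if not re.fullmatch(r"\d{6}", div_n):
        raise ValueError(f"expected six digits, got {div_n!r}")
    return div_n[:4], int(div_n[4:])


def candidate_urls(tape: str) -> dict[str, str]:
    """Return the side A and side B URLs for a four-digit tape number.

    Only the first filename format (XX-A + side + ZZP0.wav) is handled.
    """
    if not re.fullmatch(r"\d{4}", tape):
        raise ValueError(f"expected a four-digit tape number, got {tape!r}")
    return {
        side: f"{BASE_URL}{CATALOGUE_CODE}{tape}XX-A{side}ZZP0.wav"
        for side in "AB"
    }
```

Since the transcription does not say which side a division is on, `candidate_urls(tape_from_div("022505")[0])` returns both candidates, and the caller still has to check each one.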
You may also obtain some information about tape numbers and their contents from the British Library Sound and Moving Image Catalogue, http://cadensa.bl.uk. (Search for "British National Corpus" and look at items bearing the code C897.)
You can also (optionally) append either a start time and end time, or a start time and duration, to a complete file URI in order to select a specific audio clip. For example, the following are two ways of referring to the "gronnies" audio clip:
http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-ABZZP0.wav?t=2443.4825,2443.8925
http://bnc.phon.ox.ac.uk/data/021A-C0897X0225XX-ABZZP0.wav?t=2443.4825&d=0.41
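The two clip forms above can be generated from a start and end time as follows. This is a sketch under two assumptions: that the times are in seconds (which the 0.41 duration of a single word suggests), and that the server accepts exactly the `t=start,end` and `t=start&d=duration` forms shown. The function names are hypothetical. Times are passed as strings and subtracted with `Decimal`, so that the duration comes out as 0.41 rather than a binary floating-point approximation.

```python
from decimal import Decimal


def clip_url_by_end(file_url: str, start: str, end: str) -> str:
    """Build a clip URL of the form ?t=start,end."""
    return f"{file_url}?t={start},{end}"


def clip_url_by_duration(file_url: str, start: str, end: str) -> str:
    """Build a clip URL of the form ?t=start&d=duration."""
    duration = Decimal(end) - Decimal(start)
    if duration <= 0:
        raise ValueError("end must be later than start")
    # normalize() drops trailing zeros; the "f" format avoids exponent notation.
    return f"{file_url}?t={start}&d={format(duration.normalize(), 'f')}"
```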
HTML versions of the transcriptions, in ordinary spelling, are available from here. Full lists of all .wav, .html and Praat TextGrid annotation files are available from http://bnc.phon.ox.ac.uk/filelist-wav.txt, http://bnc.phon.ox.ac.uk/filelist-html.txt and http://bnc.phon.ox.ac.uk/filelist-textgrid.txt, respectively. A zipfile of all the Praat TextGrids is available from the UK Data Service ReShare repository page http://reshare.ukdataservice.ac.uk/851496/
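One use of the file lists is to see which files actually exist for a given tape, for instance to find out which sides are available. The sketch below assumes that each list contains one file name or path per line, which is not confirmed here; `fetch_listing` and `files_for_tape` are hypothetical helper names.

```python
from urllib.request import urlopen

WAV_LIST_URL = "http://bnc.phon.ox.ac.uk/filelist-wav.txt"


def fetch_listing(url=WAV_LIST_URL):
    """Download a file list and return its lines (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace").splitlines()


def files_for_tape(listing_lines, tape):
    """Return the listed entries whose name contains the given tape number.

    In all three filename formats the tape number follows "C0897X" and is
    itself followed by "X", so that pattern is used as the marker.
    """
    marker = f"C0897X{tape}X"
    return [line.strip() for line in listing_lines if marker in line]
```

A call such as `files_for_tape(fetch_listing(), "0225")` would then list whatever entries exist for tape 0225.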
A table of the phone symbols used in the TextGrids is available from http://www.phon.ox.ac.uk/files/docs/BNC_transcription_alphabet.html.
The TextGrid files may be used together with the .wav audio files in the freely-available Praat speech processing package to view or to find any desired words, vowels or consonants in each audio file. For users who are unfamiliar with Praat, a short explanation of how to do this is given here.
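Labels can also be located without Praat by scanning a TextGrid as text. The following minimal sketch assumes the TextGrids are in Praat's long text format (with `xmin`, `xmax` and `text` lines for each interval) and matches labels case-insensitively, since the casing of the word labels is not stated here. `SAMPLE` is a made-up excerpt for illustration, not a copy of a released file; its tier name and the `sp` label are assumptions.

```python
import re

# One interval in Praat's long text format: xmin, xmax, then the quoted label.
# A double quote inside a label is written as two double quotes.
INTERVAL = re.compile(
    r'xmin\s*=\s*([\d.]+)\s+xmax\s*=\s*([\d.]+)\s+text\s*=\s*"((?:[^"]|"")*)"'
)

# Hypothetical excerpt, for illustration only.
SAMPLE = '''
    item [1]:
        class = "IntervalTier"
        name = "word"
        xmin = 0
        xmax = 2443.8925
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 2443.4825
            text = "sp"
        intervals [2]:
            xmin = 2443.4825
            xmax = 2443.8925
            text = "GRONNIES"
'''


def find_label(textgrid_text, label):
    """Return (start, end) times of every interval whose label matches."""
    wanted = label.casefold()
    hits = []
    for match in INTERVAL.finditer(textgrid_text):
        text = match.group(3).replace('""', '"').strip()
        if text.casefold() == wanted:
            hits.append((float(match.group(1)), float(match.group(2))))
    return hits
```

The returned times can be used directly as the start and end of an audio clip. Because the search runs over every interval tier, a label that occurs on more than one tier will be reported once per tier.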
User commentary
Saul Albert wrote this blogpost.
Copyright and access terms
BNC spoken audio recordings were created or collected from other sources by Longman Dictionaries for the British National Corpus Consortium. Their usage is governed by the terms of the original recording permissions agreement with the contributors, which requires that they can only be "used for scientific study and publication by writers of dictionaries and educational material and language researchers". Furthermore, by downloading any of the audio recordings, you agree to the terms in sections 2, 6, 7 and 9 of the BNC User Licence (available here), the audio recordings being understood to be among the "spoken texts" included in the "BNC Texts". The supporting annotation and transcription files are Copyright © 2011 The University of Oxford, and are made publicly available under a Creative Commons Attribution License (details here); if you use these files, you must cite the Audio BNC corpus as follows:
John Coleman, Ladan Baghai-Ravary, John Pybus, and Sergio Grau (2012) Audio BNC: the audio edition of the Spoken British National Corpus. Phonetics Laboratory, University of Oxford. http://www.phon.ox.ac.uk/AudioBNC
Though we do not charge a licence fee for access to or use of the audio recordings, users are required to register at the time of their first accessing the sound files, via the following form. Registered users are welcome to link to or directly access the sound files and associated annotation and transcription files.
If you have registered for access to the BNC Audio Sampler on a previous occasion, please register your access to the full Audio BNC here as well. And please keep us informed about what you've been using it for, or if you discover anything interesting in it (or anything wrong! - there are certainly many errors).
Please note that after registering you will remain on this page, from which it is possible to access all the files: the section "Accessing the recordings" above explains the syntax of all the file names, and provides links to the various kinds of data files. To access a file, copy and paste the URL of any of the BNC audio or associated data files into the address bar of your browser, or use a command-line tool such as wget. There is no user interface, login page, or search tool.