The British Academic Spoken English (BASE) corpus

About BASE

The British Academic Spoken English (BASE) corpus was developed at the Universities of Warwick and Reading, under the directorship of Hilary Nesi, with Paul Thompson. Natalie Snodgrass and Sarah Creer were employed as research assistants and Tim Kelly was video director for the project. Lou Burnard (Oxford University) and Adam Kilgarriff (Lexicography MasterClass Ltd) acted as consultants.

The BASE corpus consists of 160 lectures and 39 seminars recorded in a variety of university departments. It contains 1,644,942 tokens in total (lectures and seminars). Holdings are distributed across four broad disciplinary groups, each represented by 40 lectures and 10 seminars. These groups are:

The lectures and seminars have been transcribed and tagged using a system devised in accordance with the TEI Guidelines.


The early stages of corpus development were assisted by funding from the Universities of Warwick and Reading, BALEAP, EURALEX, and The British Academy (2000–2001, Grant reference: SG 30284). Major funding was provided by the Arts and Humanities Research Council as part of their Resource Enhancement Scheme (2001–2005, Award Number: RE/AN6806/APN13545).

Freely available resources

How to cite the corpus

The British Academic Spoken English (BASE) corpus is freely available to researchers who agree to the following conditions:

  1. Corpus holdings should not be reproduced in full for a wider audience/readership (i.e. for publication or for teaching purposes), although researchers are free to quote short passages of text up to 100 running words, with a total of 200 running words from any given assignment.
  2. No part of the corpus holdings should be reproduced in teaching materials intended for publication (in print or via the internet).
  3. The corpus developers should be informed of all presentations and publications arising from analysis of the corpus.

Researchers must acknowledge their use of the BASE corpus using the following form of words: The recordings and transcriptions used in this study come from the British Academic Spoken English (BASE) corpus. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.

Files should be referred to by their letter and number codes, indicating disciplinary grouping (e.g. ah = arts and humanities), type of speech event (e.g. lct = lecture) and file number.
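
For researchers handling many files, the code convention above can be split programmatically. The following is a minimal sketch only, and it rests on an assumption not stated on this page: that a code is written as a two-letter group abbreviation, a three-letter event abbreviation and a run of digits, joined without separators (for example the hypothetical code "ahlct001"). The names parse_file_code and describe are illustrative, and only the two abbreviations given above (ah, lct) are expanded; any other abbreviation is passed through unchanged.

```python
import re
from typing import NamedTuple

# Only the abbreviations named in the text are expanded here.
GROUPS = {"ah": "arts and humanities"}
EVENTS = {"lct": "lecture"}


class FileCode(NamedTuple):
    group: str   # disciplinary grouping abbreviation, e.g. "ah"
    event: str   # speech event abbreviation, e.g. "lct"
    number: int  # file number


# Assumed layout (not confirmed by the source): 2 letters + 3 letters + digits.
_CODE = re.compile(r"^([a-z]{2})([a-z]{3})(\d+)$")


def parse_file_code(code: str) -> FileCode:
    """Split a file code into group, event type and file number."""
    match = _CODE.match(code.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised file code: {code!r}")
    group, event, number = match.groups()
    return FileCode(group, event, int(number))


def describe(code: str) -> str:
    """Expand known abbreviations; unknown ones are left as written."""
    parsed = parse_file_code(code)
    group = GROUPS.get(parsed.group, parsed.group)
    event = EVENTS.get(parsed.event, parsed.event)
    return f"{group}, {event}, file {parsed.number}"
```

Under these assumptions, describe("ahlct001") would give "arts and humanities, lecture, file 1". If the actual codes use a different layout, only the regular expression needs adjusting.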


The corpus facilitates, amongst other things, investigation of:

We anticipate that cross-cultural comparisons will be made of BASE and MICASE data, and also of data from other cultures and languages. BASE will remain a record of British spoken academic discourse at the turn of this century, and may also be compared with corpora compiled in the future to investigate diachronic change in academic language use.

Related projects

The BASE corpus functions as a companion to the Michigan Corpus of Spoken Academic English (MICASE), although unlike MICASE it does not include speech events other than lectures and seminars, and the majority of the events are recorded as video rather than audio files.

It also functions as a companion to the British Academic Written English (BAWE) corpus.