MARSEC: The Machine Readable Spoken English Corpus

Background

The MARSEC corpus of spoken standard southern British English is a development of the Lancaster/IBM Spoken English Corpus (SEC). See http://www.ling.lancs.ac.uk/staff/gerry/SEC.htm for related publications and http://www.hd.uib.no/icame/lanspeks.html for details on ordering the orthographic, prosodic and part-of-speech annotated transcriptions of the SEC on CD-ROM from ICAME.
The SEC edition of the corpus comprises annotated orthographic transcriptions of the spoken material but does not include the acoustic material. The MARSEC edition adds the acoustic recordings on a second CD-ROM (see below for ordering details) and includes word-level time alignment (downloadable below) between the transcripts and the acoustic signal.
The CD-ROM contains digital sample files of the original corpus recordings. Each recording has been divided into files of no more than one minute in duration. The sample files are raw, headerless, mono, 16-bit PCM (Intel byte order, i.e. little-endian) sampled at 16000 samples per second. This format can easily be imported into many audio applications.
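For applications that cannot import raw data directly, the sample files can be wrapped in a WAV header. The following is a minimal Python sketch, assuming only the format stated above (mono, 16 bit, little-endian, 16000 samples per second). The function name and file names are hypothetical and are not part of the MARSEC distribution.

```python
import wave

SAMPLE_RATE = 16000   # samples per second, as stated for the MARSEC files
SAMPLE_WIDTH = 2      # bytes per sample (16 bit)
CHANNELS = 1          # mono


def raw_to_wav(raw_path, wav_path):
    """Wrap a headerless little-endian 16-bit PCM file in a WAV header.

    Returns the duration of the audio in seconds.
    """
    with open(raw_path, "rb") as raw:
        pcm = raw.read()
    # Drop a trailing odd byte, if any, so the data is a whole number of samples.
    pcm = pcm[: len(pcm) - (len(pcm) % SAMPLE_WIDTH)]
    # WAV also stores 16-bit PCM little-endian, so no byte swapping is needed.
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return len(pcm) // SAMPLE_WIDTH / SAMPLE_RATE
```

Calling raw_to_wav("sample.raw", "sample.wav") on one of the sample files (the names here are placeholders) should give a duration of no more than about 60 seconds.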

The diagram above shows a short sample segment from the corpus displayed along with its fundamental frequency, RMS energy and two label files, one segmental and the other showing prosodically annotated orthography. Note that each label marks the start of a symbol.

Getting MARSEC

The MARSEC CD-ROM is available for £200 + VAT from the School of Linguistics at Reading University. To place an order, please email S.C.Arnfield@Reading.ac.uk.
The prosodically annotated word-level alignment files can be downloaded from this web site. These are text files in the Xwaves label file format, which is easily converted to other label file formats.
NB: the file-naming conventions used on the CD-ROM have changed since production. Download the lookup table.
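As an illustration of such a conversion, here is a minimal Python sketch. It assumes the usual Xwaves label layout: a header that ends with a line containing only "#", followed by one label per line in the form "time colour text". It also assumes, as noted for the diagram, that each label marks the start of a symbol. The function names are hypothetical, and the layout should be checked against the downloaded files before relying on this.

```python
def read_xwaves_labels(text):
    """Return (time_in_seconds, label) pairs from the text of a label file."""
    lines = text.splitlines()
    # Skip the header, assumed to end at the first line that is just "#".
    for i, line in enumerate(lines):
        if line.strip() == "#":
            lines = lines[i + 1:]
            break
    labels = []
    for line in lines:
        fields = line.split(None, 2)  # time, colour, label text
        if len(fields) < 2:
            continue  # blank or malformed line
        label = fields[2].strip() if len(fields) == 3 else ""
        labels.append((float(fields[0]), label))
    return labels


def to_intervals(labels):
    """Turn start-time labels into (start, end, label) intervals.

    Each label runs until the next label's time. The final label has
    no known end time and is therefore left out.
    """
    return [
        (start, next_start, label)
        for (start, label), (next_start, _) in zip(labels, labels[1:])
    ]
```

The resulting (start, end, label) tuples can then be written out in whatever layout another tool expects, for example one tab-separated interval per line.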

Download word-level time-aligned prosodic annotations (~2Mb).

Your Name (and title):

Your Email address:

Your Organisation:

Please enter below a brief description of how you intend to use MARSEC:

Prosodic Annotations (2Mb),

Thank you for your time in providing these details.


 

Note: when you click on the download data button you will be sent a tar file. Save the file as marsec-data.tar. The archive can be unpacked on a PC using a program such as WinZip, and on a UNIX machine using the tar command (tar -xf marsec-data.tar), which extracts the data files into the current directory.
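Where neither WinZip nor tar is to hand, the archive can also be unpacked with Python's standard tarfile module. This is a minimal sketch, not part of the MARSEC distribution. The function name is hypothetical, and the archive name simply follows the suggestion above.

```python
import tarfile


def unpack(archive="marsec-data.tar", dest="."):
    """Extract every member of the archive into dest and return their names."""
    with tarfile.open(archive, "r") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data"
            # refuses unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

Calling unpack() with no arguments mirrors tar -xf marsec-data.tar: the files are extracted into the current directory.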

Using MARSEC with Waves+

Those of you using the Waves+ package from Entropic may find this header file useful.
Download this file and save it somewhere. You should then set the DEF_HEADER environment variable by, for example,
    setenv DEF_HEADER $PATH/marsec-default-waves-header
to point to this file (replace $PATH with the directory where you saved it; it is a placeholder here, not the shell's search path). You can now load MARSEC sample files directly off the CD-ROM into xwaves.
MARSEC files do not have ESPS headers, so xwaves uses its default header. Normally this is a file supplied by Entropic. Here we tell xwaves to use our file, which holds the correct format information. Note that if you edit this file it will lose its byte-swapped nature. This file may not work for DEC users.

Simon Arnfield (S.C.Arnfield@reading.ac.uk)