MARSEC: The Machine Readable Spoken English Corpus
Background
The SEC edition of the corpus comprises annotated orthographic
transcriptions of the spoken material, but it does not include the
acoustic material. The MARSEC edition of the corpus
adds the acoustic recordings on a second CD-ROM (see below for ordering
details) and includes word-level time alignment (downloadable below) between
the transcripts and the acoustic signal.
The CD-ROM contains digital sample files of the original corpus recordings.
Each recording has been divided into samples of no more than one minute
in duration. The sample files are raw, headerless, mono, 16-bit PCM samples
(Intel byte order, i.e. little-endian) sampled at 16000 samples per second.
This format can easily be imported into many audio applications.
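For applications that cannot read headerless files, the format described above is enough to wrap a sample file in a WAV header. The following is a minimal sketch using only the Python standard library; the function name and the file names in the usage note are hypothetical, and the parameters are taken from the description above rather than from the CD-ROM itself.

```python
import array
import sys
import wave
from pathlib import Path

SAMPLE_RATE = 16000   # samples per second, as stated above
SAMPLE_WIDTH = 2      # bytes per sample (16 bit)
CHANNELS = 1          # mono


def raw_to_wav(raw_path, wav_path):
    """Wrap a headerless little-endian 16-bit mono PCM file in a WAV header.

    Returns the duration of the audio in seconds.
    """
    pcm = Path(raw_path).read_bytes()
    if len(pcm) % SAMPLE_WIDTH:
        raise ValueError("file length is not a whole number of 16-bit samples")

    # The wave module expects frames in the host's native byte order,
    # so swap the little-endian data when running on a big-endian machine.
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()

    with wave.open(str(wav_path), "wb") as out:
        out.setnchannels(CHANNELS)
        out.setsampwidth(SAMPLE_WIDTH)
        out.setframerate(SAMPLE_RATE)
        out.writeframes(samples.tobytes())

    return len(pcm) / (SAMPLE_WIDTH * SAMPLE_RATE)
```

A call such as raw_to_wav("sample.raw", "sample.wav") (hypothetical file names) writes the WAV file and returns its duration. At this format, a one-minute sample occupies at most 60 x 16000 x 2 = 1,920,000 bytes.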
The diagram above shows a short sample segment from the corpus displayed
along with its fundamental frequency, RMS energy and two label files, one
segmental and the other showing prosodically annotated orthography. Note
that labels mark the start of a symbol.
Getting MARSEC
The MARSEC CD-ROM is available for £200 + VAT from the School of
Linguistics at Reading University. To place an order, please email
S.C.Arnfield@Reading.ac.uk
Downloadable from this web site are the prosodically
annotated word-level alignment files. These are text files formatted in
the Xwaves label file format, but they are easily converted to other label
file formats.
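As an illustration of such a conversion, here is a minimal Python sketch. It assumes the common xlabel layout, which this page does not spell out: a header terminated by a line containing only "#", followed by one label per line in the form "time colour text". The function names are hypothetical. Because labels mark the start of a symbol, each label is taken to last until the next label begins.

```python
def parse_xlabel(text):
    """Return a list of (time_in_seconds, label) pairs.

    Assumes a header ending at a line containing only "#", then lines of
    the form "<time> <colour> <label text>".
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "#":
            lines = lines[i + 1:]   # drop the header and the "#" line
            break

    labels = []
    for line in lines:
        parts = line.split(None, 2)  # time, colour, rest of line
        if len(parts) < 2:
            continue                 # blank or malformed line
        try:
            time = float(parts[0])
        except ValueError:
            continue                 # not a label line
        label = parts[2].strip() if len(parts) == 3 else ""
        labels.append((time, label))
    return labels


def to_intervals(labels):
    """Turn start-time labels into (start, end, label) triples.

    The last label has no following label, so its end time is None.
    """
    intervals = []
    for i, (start, label) in enumerate(labels):
        end = labels[i + 1][0] if i + 1 < len(labels) else None
        intervals.append((start, end, label))
    return intervals
```

The resulting triples can then be written out in whatever layout the target tool expects, for example as tab-separated start, end and label columns.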
NB: the file-naming conventions used on the CD-ROM have changed since
production. Download the lookup table.
Download word-level time-aligned prosodic annotations (~2Mb).
Note: when you click on the download data button
you will be sent a tar file. Save the file as marsec-data.tar. This file
can be unpacked on a PC using a program such as WinZip, and on a UNIX machine
by using the tar command (tar -xf marsec-data.tar) to unarchive the data
files into the current directory.
Using MARSEC with Waves+
Those of you using the Waves+ package from Entropic
may find this header file useful.
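Download this file and save it somewhere. You should
then set the DEF_HEADER environment variable to point to this file by,
for example (in csh or tcsh),
setenv DEF_HEADER $PATH/marsec-default-waves-header
Here $PATH is a placeholder for the directory in which you saved the file,
not the shell's command search path; put in the correct path.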
You can now load MARSEC sample files directly off the CD-ROM into xwaves.
MARSEC files do not have ESPS headers, so xwaves
uses its default header. Normally this is a file supplied by Entropic.
Here we tell xwaves to use our file, which holds the correct format information.
Note that if you edit this file it will lose its byte-swapped nature. This
file may not work for DEC users.