Erik Fudge and Linda Shockey
Department of Linguistic Science, The University of Reading, Whiteknights, Reading,
Berkshire, England, RG6 6AA
Every description of a language includes statements of vowel and consonant
inventories. Databases to deal with such matters (e.g. UPSID) have been in existence
for some time.
Some language descriptions also include statements of what sequences and other
combinations may or may not occur in words of the language: phonotactic statements.
Not all language descriptions actually include phonotactic statements: many
treatments restrict themselves to stating inventories of vowels and consonants,
and possibly tones and/or accents. Even so, such statements can in fact be made
for all languages.
1. Syllable-structure
The content of phonotactic statements varies greatly in detail from language
to language: simple v. complex branching, small v. large inventories.
The first author has been collecting relevant data, and formulating such statements
where possible, for a number of years (we now have information for 200+ languages).
In spite of the variation between languages, it soon became clear that a common
general phonotactic framework can be set up (Fudge, 1969, 1987) and stored in
computer files.
The aim of our database, then, is to set up such a framework for phonotactic
statements for as many languages as possible. The most important units for establishing
this framework are syllables.
Terminology for parts of syllable structure has been worked out: Onset, Rhyme,
Peak (or Nucleus), Coda. Phrase-structure relationships analogous to those of
syntax are recognised between and within these parts; in fact, phrase-structure
rules can be written to generate the occurring combinations.
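As an illustration of this phrase-structure view, the following minimal Python sketch expands a small rule set into the syllable shapes it generates. The rule set (an optional single-consonant Onset and Coda) and the names RULES and expand are hypothetical toy choices, not rules taken from any language file in the database.

```python
from itertools import product

# Toy phrase-structure rules (hypothetical). Each category lists its
# alternative expansions; an empty tuple means the constituent is absent.
RULES = {
    "Syllable": [("Onset", "Rhyme")],
    "Onset": [(), ("C",)],
    "Rhyme": [("Peak", "Coda")],
    "Peak": [("V",)],
    "Coda": [(), ("C",)],
}

def expand(category):
    """Return every terminal sequence that `category` can generate."""
    if category not in RULES:          # terminal place such as C or V
        return [(category,)]
    results = []
    for alternative in RULES[category]:
        # Expand each daughter, then pick one expansion per daughter.
        daughter_expansions = [expand(d) for d in alternative]
        for combo in product(*daughter_expansions):
            results.append(tuple(sym for part in combo for sym in part))
    return results

shapes = sorted("".join(s) for s in expand("Syllable"))
print(shapes)  # ['CV', 'CVC', 'V', 'VC']
```

Adding a further alternative to Onset or Coda (for example a two-place cluster) enlarges the set of generated shapes without changing the expansion procedure.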
Each structural place may be occupied by one element selected from the inventory
of available sounds. Rather than a single inventory of sounds, or even an inventory
divided into vowels and consonants, syllable structure may require different
inventories to be available at different places in the structure. Typically,
relations of overlapping or inclusion will hold between these different inventories
(e.g. m, n, w, l, r, j occur in both Initial and Post-initial position for English;
for justification of this duplication, see Fudge, 1969: 274, 280).
2. Above the Syllable: the Word
Syllable-structure is, of course, only part of the story: there are larger units,
particularly the word (Fudge, 1969: 258f), which are needed for a full statement
of the constraints. As yet this has not been incorporated in the database. Some
examples of word-based phonotactic statements would be:
(a) Some Coda phenomena are restricted to word-final position
(b) Stressed syllables may permit bigger vowel inventories than unstressed syllables
(c) Corresponding parts of successive syllables may exhibit constraints for
or against co-occurrence (e.g. vowel harmony).
(d) Some Codas may need to refer to the Onset of the following syllable.
3. Problems
Three main problem areas are worth special attention: (a) form of source descriptions;
(b) systematic or accidental gap?; (c) status of loanwords. Each will be discussed
further in a subsection.
3.1 Form of source descriptions
In spite of our claim that phonotactic statements can be made for all languages,
there is no guarantee that any descriptive article(s) on which the phonotactic
description is to be based will be cast in anything like the same form as the
ideal phonotactic statements. Where a description contains no statement at all
of limitations on what sequences of consonants may occur, word-by-word inspection
of data cited becomes necessary to establish such statements. In order to maintain
consistency of descriptive format, it has therefore been necessary to devise
a standard method for proceeding from source material to standard format.
3.2 Systematic or accidental gap?
It is sometimes also necessary to reach a decision on whether absence of some
combination represents a systematic gap or merely an accidental gap. For example,
English words can begin with /st/ or /tw/: is the absence of /stw-/ systematic
or accidental? For all other consonants X and Y, if /sX/ and /XY-/ are both
possible, then /sXY-/ is also possible - this suggests the absence of /stw-/
is accidental (perhaps due to the low frequency of /tw-/).
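The reasoning behind the /stw-/ example can be sketched in Python. The onset lists below are an illustrative subset of English onsets, not a complete inventory, and the function name predicted_sxy is hypothetical; each consonant is assumed to be written with a single character.

```python
# Illustrative subset of English onsets (not a complete inventory).
TWO_CONSONANT = {"sp", "st", "sk", "pl", "pr", "tr", "tw", "kr", "kw"}
THREE_CONSONANT = {"spl", "spr", "str", "skr", "skw"}

def predicted_sxy(two_consonant):
    """Build every /sXY-/ for which both /sX-/ and /XY-/ are attested."""
    predicted = set()
    for sx in two_consonant:
        if not sx.startswith("s"):
            continue
        x = sx[1]
        for xy in two_consonant:
            if xy[0] == x:
                predicted.add("s" + xy)
    return predicted

# Predicted three-consonant onsets that are not attested.
gaps = sorted(predicted_sxy(TWO_CONSONANT) - THREE_CONSONANT)
print(gaps)  # ['stw']
```

With this subset, /stw-/ is the only predicted onset that fails to occur, which is the pattern that points to an accidental rather than a systematic gap.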
3.3 Loanwords
The status of 'loanwords' is always difficult. Some loans are immediately 'assimilated'
to the patterns of the 'borrowing' language, e.g. any word beginning with /st/
borrowed into Spanish will split this unpermitted cluster by prefixing /e/,
in effect putting the /s/ into a separate syllable of its own. Others, however,
cause new phonotactic patterns to arise, e.g. many languages of the Philippines
permit only single-consonant Onsets, but have imported loanwords from Spanish
and English with clusters /pl-/, /tr-/ etc.: at what stage of the loaning process
can we conclude that the language has developed branching Onsets?
So much for the theoretical background and the aims of this database. What of
the methods by which these aims have been achieved?
4. Preparation
Two aspects of this are discussed: (a) the format of language files, and (b)
the database software used.
4.1 Language Files
Information about permissible syllable structures in over 200 languages was
gathered from linguistic literature as described above. These languages were
from a wide variety of language families and from all over the globe.
Information was initially stored on filing cards as formulae of the type normally
used by linguists. The phonemic inventory of each language was noted as well.
These cards were then used as the basis for language files which were entered
on a computer. Only 191 of these files were created, as complete information
was not available in all cases. Entering new languages is an ongoing and open-ended
task.
4.2 Database Software
Commercially-available database software is designed to create tables and to
perform arithmetic operations on the material stored in the rows and columns.
Queries about whether something is present in the database are answered by a
simple lookup procedure. Clearly this form of data representation is not amenable
to the storage of formulae. In using the latter it is necessary to ascertain
whether a sequence or structure which is being sought can be generated by the
syllable grammar, i.e. by expanding the formula.
For example, to know whether the syllable [qi] is possible in a language or
language family, one has to find out whether these phonemes are present in the
inventory and whether each of them is possible in combination with the other
and in that order, as well as making sure that the CV structure is permitted.
([q] might, for example, not be permitted syllable-initially or before front
vowels, even if it does exist in the phonemic inventory).
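The checks involved in the [qi] example can be laid out as a short Python sketch. The language description below is invented purely for illustration, and the field names and the function check_syllable are hypothetical; the sketch is not the matching procedure used by the database software.

```python
# Hypothetical toy language description (all values invented for illustration).
LANGUAGE = {
    "inventory": {"q", "k", "t", "i", "a", "u"},
    "vowels": {"i", "a", "u"},
    "structures": {"CV", "CVC"},
    "onset": {"q", "k", "t"},          # consonants allowed syllable-initially
    "banned_sequences": {("q", "i")},  # e.g. no [q] before a front vowel
}

def check_syllable(segments, lang):
    """Return (ok, reason) for a syllable given as a list of segments."""
    for seg in segments:
        if seg not in lang["inventory"]:
            return False, f"{seg} is not in the inventory"
    shape = "".join("V" if s in lang["vowels"] else "C" for s in segments)
    if shape not in lang["structures"]:
        return False, f"{shape} structure is not permitted"
    if shape.startswith("C") and segments[0] not in lang["onset"]:
        return False, f"{segments[0]} is not permitted syllable-initially"
    for pair in zip(segments, segments[1:]):
        if pair in lang["banned_sequences"]:
            return False, f"{pair[0]} may not precede {pair[1]}"
    return True, "permitted"

print(check_syllable(["q", "i"], LANGUAGE))  # (False, 'q may not precede i')
print(check_syllable(["q", "a"], LANGUAGE))  # (True, 'permitted')
```

In this toy language both [q] and [i] are in the inventory and CV is a permitted structure, yet [qi] is still rejected because of the co-occurrence restriction.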
It would, of course, have been possible to expand each grammar ourselves and
put the resulting list in the database. This expansion could be done automatically
by computer, and, as storing and searching are becoming daily easier, it would
provide a viable but brute-force solution. Generating all possible syllables
would also be a practical way to avoid writing rules for co-occurrence restrictions.
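The brute-force alternative can be sketched as follows. The place inventories and structures are a hypothetical toy grammar, and expand_grammar is an illustrative name; the point is only that once every syllable has been listed, a query reduces to a lookup.

```python
from itertools import product

# Hypothetical toy grammar: each structure is a sequence of place inventories.
ONSET = ["p", "t", "k"]
PEAK = ["i", "a"]
CODA = ["n"]
STRUCTURES = {"V": [PEAK], "CV": [ONSET, PEAK], "CVC": [ONSET, PEAK, CODA]}

def expand_grammar(structures):
    """List every syllable the grammar generates by exhaustive expansion."""
    syllables = set()
    for places in structures.values():
        for choice in product(*places):   # one symbol per structural place
            syllables.add("".join(choice))
    return syllables

ALL_SYLLABLES = expand_grammar(STRUCTURES)
print(len(ALL_SYLLABLES))      # 14
print("ka" in ALL_SYLLABLES)   # True
print("qi" in ALL_SYLLABLES)   # False
```

The size of the list grows with the product of the inventory sizes at each place, which is why the approach is workable but wasteful for languages with large inventories or complex clusters.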
In the end, we decided to use software which worked with syllable grammars,
even though it may be more difficult. We wished to take advantage of linguistic
knowledge in the system, using what is known about natural classes and universal
phonological constraints. The program which was used is one which was loaned
to us for research purposes by Bird, Ellison, and Klein of the University of
Edinburgh Centre for Cognitive Science. This is based on finite-state automata
as described in Bird and Ellison (1994), and matches input against a grammar
written in regular expressions. If there is a nonempty intersection between
the structure specified in the query and the grammar, the program reports a
success, otherwise a failure.
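The idea of testing for a nonempty intersection can be illustrated with a small product-automaton search in Python. This is not the Edinburgh program: it is a sketch under the assumption that both grammar and query are simplified to a list of slots, each slot being a set of symbols that may be marked optional. The name intersects and the toy classes C, V and H are hypothetical.

```python
from collections import deque

def intersects(grammar, query):
    """True if some string is accepted by both slot patterns.

    Each pattern is a list of (symbols, optional) slots, i.e. a simple
    finite-state automaton whose states are positions in the list.
    """
    start = (0, 0)
    seen = {start}
    queue = deque([start])
    while queue:
        i, j = queue.popleft()
        if i == len(grammar) and j == len(query):
            return True                   # both automata are in a final state
        moves = []
        if i < len(grammar) and grammar[i][1]:
            moves.append((i + 1, j))      # skip an optional grammar slot
        if j < len(query) and query[j][1]:
            moves.append((i, j + 1))      # skip an optional query slot
        if i < len(grammar) and j < len(query) and grammar[i][0] & query[j][0]:
            moves.append((i + 1, j + 1))  # both consume a shared symbol
        for state in moves:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False

# Toy classes and a (C) V (C) grammar, all hypothetical.
C, V, H = frozenset("ptks"), frozenset("iau"), frozenset("iu")
GRAMMAR = [(C, True), (V, False), (C, True)]

print(intersects(GRAMMAR, [(C, False), (V, False), (C, False)]))  # True
print(intersects(GRAMMAR, [(C, False), (C, False), (V, False)]))  # False
print(intersects(GRAMMAR, [(frozenset("t"), False), (H, False)]))  # True
```

The search visits pairs of positions, one in each pattern, so its cost is bounded by the product of the two pattern lengths rather than by the number of syllables the grammar generates.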
5. Data Entry
At the outset of the project, we were faced with having to represent the symbols
of the International Phonetic Alphabet using an ordinary typewriter keyboard.
No commercially-available font could be found which would allow both for storage
and editing of data: we could create files containing non-ASCII symbols within
specified programs, but couldn't search for these symbols using either another
program (such as an editor) or an operating system. In order that our results
be maximally usable by other scientists, we chose to use the three-number codes
suggested by the International Phonetic Association.
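A minimal Python sketch of such a coding scheme is given below. Codes 132, 110 and 103 are the ones this paper cites for /s/, /g/ and /t/; codes 301 and 308 (for [i] and [u]) are added here as further examples from the IPA number chart. The table is deliberately tiny, and the function names encode and decode are hypothetical.

```python
# Partial code table; 103, 110 and 132 are cited in this paper, 301 and 308
# are additional examples from the IPA number chart.
IPA_NUMBER_TO_SYMBOL = {
    "103": "t",
    "110": "g",
    "132": "s",
    "301": "i",
    "308": "u",
}
SYMBOL_TO_IPA_NUMBER = {sym: num for num, sym in IPA_NUMBER_TO_SYMBOL.items()}

def encode(symbols):
    """Write a sequence of phonetic symbols as space-separated ASCII codes."""
    return " ".join(SYMBOL_TO_IPA_NUMBER[s] for s in symbols)

def decode(codes):
    """Recover the phonetic symbols from a space-separated code string."""
    return [IPA_NUMBER_TO_SYMBOL[c] for c in codes.split()]

print(encode(["s", "t", "i"]))   # 132 103 301
print(decode("110 308"))         # ['g', 'u']
```

Because the stored form contains only digits and spaces, it can be searched with any editor or operating-system tool, which is the practical motivation given above.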
6. Setting up a Database
Before you make a query, you need several kinds of information entered into
files as well as the software for matching patterns with syllable grammars:
(a) A list of all the symbols you wish to use for any and all languages, and
the classes they fall into, for example, vowels, high vowels, front vowels.
This list must include modified symbols: nasalised, breathy, velarised, and
all other options count as separate units (i.e. [i] will not match with nasalised
[i] unless a query is very carefully worded). This makes an enormous list and
accounts to some degree for the slowness with which the program runs.
(b) A phonemic inventory for each language.
(Both (a) and (b) are stored in the same file, which contains all the classes
you are ever planning to work with. We will refer to this as the Class File).
(c) A syllable grammar for each language, written as a regular expression. Here
is the grammar for English:
"{ ((132) C (Engel)) V ((Engel1) Engc1 ) (Engf (Engf)) & [English]*
& $1}"
where (132) is /s/, C is any English consonant, and Engel, Engel1, etc. are
subclasses of phonemes which are listed in the Class File.
Each grammar is kept in a separate file, which is named after the language.
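The role of the Class File described in (a) and (b) can be pictured with a short Python sketch. The class contents are hypothetical, and "301~" is a made-up code standing for nasalised [i]; the actual file format used by the software is not reproduced here. The sketch shows why a plain symbol does not match its modified counterpart unless the query names a class that contains both.

```python
# Hypothetical Class File contents. Symbols are IPA number codes; "301~" is a
# made-up code for nasalised [i], which counts as a separate unit.
CLASS_FILE = {
    "C": {"103", "110", "132"},
    "V": {"301", "308", "301~"},
    "H": {"301", "308", "301~"},        # high vowels
    "English": {"103", "110", "132", "301", "308"},
}

def resolve(token, class_file):
    """Turn a query token into the set of symbols it can stand for."""
    if token in class_file:
        return set(class_file[token])   # class name: all of its members
    return {token}                      # otherwise a single literal symbol

def resolve_query(query, class_file):
    """Resolve every whitespace-separated token of a query string."""
    return [resolve(tok, class_file) for tok in query.split()]

slots = resolve_query("110 H", CLASS_FILE)
print(slots[0])                    # {'110'}
print("301~" in slots[1])          # True: class H includes nasalised [i]
print(resolve("301", CLASS_FILE) & resolve("301~", CLASS_FILE))  # set()
```

Listing every modified symbol as its own unit keeps matching simple, at the cost of the very large symbol list noted in (a).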
7. Using the Database
The naming convention for the language grammar files allows us to recall them
using any of several fields. For example, a file named
"French.Romance.IndoEuropean.all" would be accessed in a query involving any
of its subparts or in a query about all languages in the database.
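A minimal Python sketch of how such dotted names support retrieval by field is given below. Only the French file name comes from the text; the other names are hypothetical examples built on the same pattern, and select_files is an illustrative function, not part of the software used.

```python
# Grammar file names following the dotted naming convention. Only the first
# name is taken from the text; the rest are hypothetical.
GRAMMAR_FILES = [
    "French.Romance.IndoEuropean.all",
    "Spanish.Romance.IndoEuropean.all",
    "English.Germanic.IndoEuropean.all",
    "Tagalog.Austronesian.all",
]

def select_files(field, filenames):
    """Return the files whose dotted name contains `field` as a whole part."""
    return [name for name in filenames if field in name.split(".")]

print(select_files("Romance", GRAMMAR_FILES))
# ['French.Romance.IndoEuropean.all', 'Spanish.Romance.IndoEuropean.all']
print(len(select_files("IndoEuropean", GRAMMAR_FILES)))  # 3
print(len(select_files("all", GRAMMAR_FILES)))           # 4
```

Matching on whole dot-separated parts, rather than on substrings, avoids accidental hits when one family or language name happens to be contained in another.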
It is possible to enquire about structures, specific sounds, sounds with a given
feature, or any combination of the above. For example:
(a) IE "CVC" looks for CVC structures in all of the Indo-European
languages
(b) all "110 H" looks for /g/ + high vowel sequences in the whole
database
(c) Romance "C 103 H" looks for a consonant followed by a /t/ +
high vowel in the Romance languages.
8. Results
28 or 15% of languages allow syllable-initial 3-consonant clusters.
86 or 45% of languages allow initial 2-consonant clusters.

Configuration of initial clusters (S = stop, F = fricative, N = nasal, G = glide):

shape   no. of langs    %
SS      18              9
SF      26             14
SN      20             10
SG      67             35
FN      29             15
FF      45             29
FS      34             18
FG      75             39

7 or 4% of languages allow final 3-consonant clusters.
17 or 9% of languages allow final 2-consonant clusters.

Configuration of final clusters:

shape   no. of langs    %
SS       9              5
NS      11              6
NF      10              5
FF      13              7
GSF      8              4
NSF      8              4
FS       9              5
GC      10              5

131 or 69% of languages have an obligatory syllable-initial consonant.
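The headline percentages can be checked with a few lines of Python, on the assumption (not stated explicitly in the text) that they are calculated over the 191 language files mentioned in section 4.1 and rounded to the nearest whole number. The dictionary labels are shorthand introduced here.

```python
TOTAL_LANGUAGES = 191  # number of language files reported in section 4.1

# Counts quoted in the results; the labels are shorthand introduced here.
COUNTS = {
    "initial 3-consonant clusters": 28,
    "initial 2-consonant clusters": 86,
    "final 3-consonant clusters": 7,
    "final 2-consonant clusters": 17,
    "obligatory initial consonant": 131,
}

def percentage(count, total=TOTAL_LANGUAGES):
    """Share of the language files, rounded to a whole percent."""
    return round(100 * count / total)

for label, count in COUNTS.items():
    print(f"{label}: {count} = {percentage(count)}%")
```

Under that assumption the five headline figures come out as 15%, 45%, 4%, 9% and 69%, matching the values quoted.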