Erik Fudge and Linda Shockey
Department of Linguistic Science, The University of Reading, Whiteknights, Reading, Berkshire, England, RG6 6AA

Every description of a language includes statements of vowel and consonant inventories. Databases to deal with such matters (e.g. UPSID) have been in existence for some time.

Some language descriptions also include phonotactic statements: statements of what sequences and other combinations may or may not occur in words of the language. Many treatments, however, restrict themselves to stating inventories of vowels and consonants, and possibly tones and/or accents. Even so, phonotactic statements can in fact be made for all languages.

1. Syllable-structure

The content of phonotactic statements varies greatly in detail from language to language: simple v. complex branching, small v. large inventories.

The first author has been collecting relevant data, and formulating such statements where possible, for a number of years (we now have information for 200+ languages). In spite of the variation between languages, it soon became clear that a common general phonotactic framework can be set up (Fudge, 1969, 1987) and stored in computer files.

The aim of our database, then, is to set up such a framework for phonotactic statements for as many languages as possible. The most important units for establishing this framework are syllables.

Terminology for parts of syllable structure has been worked out: Onset, Rhyme, Peak (or Nucleus), Coda. Phrase-structure relationships analogous to those of syntax are recognised between and within these parts; in fact, phrase-structure rules can be written to generate the occurring combinations.

Each structural place may be occupied by one element selected from the inventory of available sounds. Rather than a single inventory of sounds, or even an inventory divided into vowels and consonants, syllable structure may require different inventories to be available at different places in the structure. Typically, relations of overlapping or inclusion will hold between these different inventories (e.g. m, n, w, l, r, j occur in both Initial and Post-initial for English; for justification of this duplication, see Fudge, 1969: 274, 280).
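The idea of phrase-structure rules combined with place-specific inventories can be sketched in a few lines of Python. This is only an illustration: the toy inventories, the rule table and the names INVENTORIES, RULES and expand are assumptions made for the example, not part of the database itself.

```python
from itertools import product

# Hypothetical place-specific inventories for a toy language.
# "l" and "r" appear in both Initial and PostInitial, illustrating
# the overlap between inventories discussed above.
INVENTORIES = {
    "Initial": ["p", "t", "l", "r"],
    "PostInitial": ["l", "r", ""],   # "" means the place is left empty
    "Peak": ["a", "i"],
    "Coda": ["n", ""],
}

# Phrase-structure rules: each unit expands into a sequence of parts.
RULES = {
    "Syllable": ["Onset", "Rhyme"],
    "Onset": ["Initial", "PostInitial"],
    "Rhyme": ["Peak", "Coda"],
}

def expand(unit):
    """Return every string that the given unit can generate."""
    if unit in INVENTORIES:
        return list(INVENTORIES[unit])
    parts = [expand(child) for child in RULES[unit]]
    return ["".join(combo) for combo in product(*parts)]

syllables = expand("Syllable")
```

Note that such a toy grammar overgenerates (it produces "lla", for instance); ruling out combinations of this kind is the job of co-occurrence restrictions.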

2. Above the Syllable: the Word

Syllable-structure is, of course, only part of the story: there are larger units, particularly the word (Fudge, 1969: 258f), which are needed for a full statement of the constraints. As yet this has not been incorporated in the database. Some examples of word-based phonotactic statements would be:
(a) Some Coda phenomena are restricted to word-final position.
(b) Stressed syllables may permit bigger vowel inventories than unstressed syllables.
(c) Corresponding parts of successive syllables may exhibit constraints for or against co-occurrence (e.g. vowel harmony).
(d) Some Codas may need to refer to the Onset of the following syllable.

3. Problems

Three main problem areas are worth special attention: (a) form of source descriptions; (b) systematic or accidental gap?; (c) status of loanwords. Each will be discussed further in a subsection.

3.1 Form of source descriptions

In spite of our claim that phonotactic statements can be made for all languages, there is no guarantee that any descriptive article(s) on which the phonotactic description is to be based will be cast in anything like the same form as the ideal phonotactic statements. Where a description contains no statement at all of limitations on what sequences of consonants may occur, word-by-word inspection of the data cited becomes necessary to establish such statements. In order to maintain consistency of descriptive format, it has therefore been necessary to devise a standard method for proceeding from source material to a standard format.

3.2 Systematic or accidental gap?

It is sometimes also necessary to reach a decision on whether the absence of some combination represents a systematic gap or merely an accidental gap. For example, English words can begin with /st-/ or /tw-/: is the absence of /stw-/ systematic or accidental? For all other consonants X and Y, if /sX-/ and /XY-/ are both possible, then /sXY-/ is also possible. This suggests that the absence of /stw-/ is accidental (perhaps due to the low frequency of /tw-/).
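The reasoning about /stw-/ can be made concrete with a small sketch. The cluster lists below are a simplified, illustrative subset of English onsets written with one letter per phoneme, and the function name is hypothetical; the point is only to show how predicted but unattested clusters fall out of the generalisation.

```python
# Illustrative subset of English onsets, one letter per phoneme (assumption).
S_PLUS_X = {"sp", "st", "sk"}
X_PLUS_Y = {"pl", "pr", "tr", "tw", "kl", "kr", "kw"}
ATTESTED_THREE = {"spl", "spr", "str", "skl", "skr", "skw"}

def predicted_three_clusters(s_clusters, xy_clusters):
    """If /sX-/ and /XY-/ both occur, predict that /sXY-/ occurs too."""
    return {"s" + xy for xy in xy_clusters if "s" + xy[0] in s_clusters}

predicted = predicted_three_clusters(S_PLUS_X, X_PLUS_Y)

# Clusters that the generalisation predicts but the data do not attest.
gaps = predicted - ATTESTED_THREE
```

With this toy data the only gap is /stw-/, which is what makes an accidental-gap analysis plausible.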

3.3 Loanwords

The status of 'loanwords' is always difficult. Some loans are immediately 'assimilated' to the patterns of the 'borrowing' language, e.g. any word beginning with /st/ borrowed into Spanish will split this unpermitted cluster by prefixing /e/, in effect putting the /s/ into a separate syllable of its own. Others, however, cause new phonotactic patterns to arise, e.g. many languages of the Philippines permit only single-consonant Onsets, but have imported loanwords from Spanish and English with clusters /pl-/, /tr-/ etc.: at what stage of the borrowing process can we conclude that the language has developed branching Onsets?
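The Spanish type of assimilation can be stated as a one-line repair rule. The sketch below is an illustration only: it assumes one letter per phoneme, a toy consonant set, and that the repair applies to any /s/ + consonant onset rather than to /st/ alone; the function name is hypothetical.

```python
# Toy consonant set, one letter per phoneme (assumption).
CONSONANTS = set("ptkbdgfsmnlr")

def assimilate_initial_s_cluster(word):
    """Prefix /e/ when a word begins with /s/ + consonant."""
    if len(word) >= 2 and word[0] == "s" and word[1] in CONSONANTS:
        return "e" + word
    return word

adapted = assimilate_initial_s_cluster("stop")
unchanged = assimilate_initial_s_cluster("sopa")
```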

So much for the theoretical background and the aims of this database. What of the methods by which these aims have been achieved?

4. Preparation

Two aspects of this are discussed: (a) the format of language files, and (b) the database software used.

4.1 Language Files

Information about permissible syllable structures in over 200 languages was gathered from linguistic literature as described above. These languages were from a wide variety of language families and from all over the globe.

Information was initially stored on filing cards as formulae of the type normally used by linguists. The phonemic inventory of each language was noted as well. These cards were then used as the basis for language files which were entered on a computer. Only 191 of these files were created, as complete information was not available in all cases. Entering new languages is an ongoing and open-ended task.

4.2 Database Software

Commercially-available database software is designed to create tables and to perform arithmetic operations on the material stored in the rows and columns. Queries about whether something is present in the database are answered by a simple lookup procedure. Clearly this form of data representation is not amenable to the storage of formulae. In using the latter it is necessary to ascertain whether a sequence or structure which is being sought can be generated by the syllable grammar, i.e. by expanding the formula.

For example, to know whether the syllable [qi] is possible in a language or language family, one has to find out whether these phonemes are present in the inventory and whether each of them is possible in combination with the other and in that order, as well as making sure that the CV structure is permitted. ([q] might, for example, not be permitted syllable-initially or before front vowels, even if it does exist in the phonemic inventory.)
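The three checks involved in the [qi] example (inventory, permitted structure, co-occurrence) can be sketched as follows. The language description here is entirely hypothetical, as are the names LANGUAGE, shape and is_possible; it is meant only to show why a simple table lookup is not enough.

```python
# A hypothetical language description (all values are assumptions).
LANGUAGE = {
    "consonants": {"q", "k", "t"},
    "vowels": {"i", "a", "u"},
    "structures": {"CV", "V"},
    # Hypothetical restriction: [q] may not precede the front vowel [i].
    "banned_sequences": {("q", "i")},
}

def shape(syllable, lang):
    """Reduce a list of phonemes to a C/V skeleton such as 'CV'."""
    return "".join("C" if p in lang["consonants"] else "V" for p in syllable)

def is_possible(syllable, lang):
    """Check inventory, syllable structure and co-occurrence in turn."""
    inventory = lang["consonants"] | lang["vowels"]
    if any(p not in inventory for p in syllable):
        return False
    if shape(syllable, lang) not in lang["structures"]:
        return False
    pairs = zip(syllable, syllable[1:])
    return not any(pair in lang["banned_sequences"] for pair in pairs)

qi_ok = is_possible(["q", "i"], LANGUAGE)
qa_ok = is_possible(["q", "a"], LANGUAGE)
```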

It would, of course, have been possible to expand each grammar ourselves and put the resulting list in the database. This list could be generated automatically by computer, and, as storing and searching are becoming daily easier, it would provide a viable but brute-force solution. Generating all possible syllables would also be a practical way to avoid writing rules for co-occurrence restrictions.
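The brute-force alternative can be illustrated by expanding a (C)V(C) formula into an explicit list, after which a query is a plain membership test. The inventories and the slot representation below are assumptions chosen for the example.

```python
from itertools import product

# Toy inventories (assumptions for illustration).
CONSONANTS = ["p", "t", "k"]
VOWELS = ["a", "i"]

# A formula is a list of slots; each slot is (inventory, is_optional).
CVC_FORMULA = [(CONSONANTS, True), (VOWELS, False), (CONSONANTS, True)]

def expand_formula(formula):
    """Enumerate every syllable the formula generates."""
    # An optional slot may also be filled by the empty string.
    choices = [inv + [""] if optional else inv for inv, optional in formula]
    return {"".join(combo) for combo in product(*choices)}

ALL_SYLLABLES = expand_formula(CVC_FORMULA)

# After expansion, a query reduces to a simple lookup.
pat_is_possible = "pat" in ALL_SYLLABLES
```

Even this tiny grammar yields 32 syllables; with realistic inventories and modified symbols the list grows multiplicatively with each slot.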

In the end, we decided to use software which worked with syllable grammars, even though it may be more difficult. We wished to take advantage of linguistic knowledge in the system, using what is known about natural classes and universal phonological constraints. The program which was used is one which was loaned to us for research purposes by Bird, Ellison, and Klein of the University of Edinburgh Centre for Cognitive Science. It is based on finite-state automata as described in Bird and Ellison (1994), and matches input against a grammar written in regular expressions. If there is a nonempty intersection between the structure specified in the query and the grammar, the program reports a success, otherwise a failure.
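The success criterion (a nonempty intersection between query and grammar) can be sketched with two small deterministic automata and a search over pairs of states. This is not the Edinburgh program, whose internals are not described here; the dictionary representation and all names are assumptions for illustration.

```python
from collections import deque

def intersection_nonempty(a, b):
    """Return True if some string is accepted by both automata."""
    start = (a["start"], b["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        qa, qb = queue.popleft()
        if qa in a["accept"] and qb in b["accept"]:
            return True
        # Follow only transitions that both automata can take together.
        for (state, symbol), target_a in a["delta"].items():
            if state != qa:
                continue
            target_b = b["delta"].get((qb, symbol))
            if target_b is None:
                continue
            pair = (target_a, target_b)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return False

# Grammar (C)V(C) over the class symbols "C" and "V".
GRAMMAR = {
    "start": 0,
    "accept": {2, 3},
    "delta": {(0, "C"): 1, (0, "V"): 2, (1, "V"): 2, (2, "C"): 3},
}

# Two queries, each accepting exactly one string of class symbols.
QUERY_CVC = {"start": 0, "accept": {3},
             "delta": {(0, "C"): 1, (1, "V"): 2, (2, "C"): 3}}
QUERY_CCV = {"start": 0, "accept": {3},
             "delta": {(0, "C"): 1, (1, "C"): 2, (2, "V"): 3}}

cvc_found = intersection_nonempty(GRAMMAR, QUERY_CVC)
ccv_found = intersection_nonempty(GRAMMAR, QUERY_CCV)
```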

5. Data Entry

At the outset of the project, we were faced with having to represent the symbols of the International Phonetic Alphabet using an ordinary typewriter keyboard. No commercially-available font could be found which would allow both for storage and editing of data: we could create files containing non-ASCII symbols within specified programs, but couldn't search for these symbols using either another program (such as an editor) or an operating system. In order that our results be maximally usable by other scientists, we chose to use the three-number codes suggested by the International Phonetic Association (Esling and Gaylord, 1993).
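A minimal sketch of how numeric codes keep the data searchable in plain ASCII is given below. Only the three codes mentioned in this paper (103 for /t/, 110 for /g/, 132 for /s/) are included; the table name, the decode function and the space-separated storage format are assumptions.

```python
# Only the codes cited in this paper; a real table would be much larger.
IPA_NUMBER_TO_SYMBOL = {"103": "t", "110": "g", "132": "s"}

def decode(coded):
    """Turn a space-separated string of codes into symbols.

    Unknown codes are kept visible as <code> rather than guessed at.
    """
    return [IPA_NUMBER_TO_SYMBOL.get(code, "<" + code + ">")
            for code in coded.split()]

onset = decode("132 103")

# Because the stored form is plain ASCII, any editor or tool can search it.
contains_g = "110" in "132 110".split()
```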

6. Setting up a Data Base

Before you make a query, you need several kinds of information entered into files as well as the software for matching patterns with syllable grammars:

(a) A list of all the symbols you wish to use for any and all languages, and the classes they fall into, for example, vowels, high vowels, front vowels. This list must include modified symbols: nasalised, breathy, velarised, and all other options count as separate units (i.e. [i] will not match with nasalised [i] unless a query is very carefully worded). This makes an enormous list and accounts to some degree for the slowness with which the program runs.

(b) A phonemic inventory for each language.

(Both (a) and (b) are stored in the same file, which contains all the classes you are ever planning to work with. We will refer to this as the Class File).

(c) A syllable grammar for each language, written as a regular expression. Here is the grammar for English:

"{ ((132) C (Engel)) V ((Engel1) Engc1 ) (Engf (Engf)) & [English]* & $1}"

where (132) is /s/, C is any English consonant, and Engel, Engel1, etc. are subclasses of phonemes which are listed in the Class File.

Each grammar is kept in a separate file, which is named after the language.
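The relation between a Class File and a syllable formula can be sketched by compiling a formula into an ordinary Python regular expression over code strings. Several things here are assumptions: that parentheses mark optional material, that vowel codes 301 and 308 stand for two high vowels, the class memberships, and the function names. The "&" operators of the actual grammar notation are not modelled.

```python
import re

# A hypothetical, heavily reduced Class File (memberships are assumptions).
CLASS_FILE = {
    "C": ["103", "110", "132"],   # t, g, s
    "V": ["301", "308"],          # assumed codes for two vowels
    "H": ["301", "308"],          # assumed: both are high vowels
}

def compile_formula(formula, classes):
    """Compile a formula such as '(132) C V (C)' into a regex.

    Parentheses are taken to mark optional material; a token that is
    not a class name is treated as a literal code.
    """
    parts = []
    for token in re.findall(r"\(|\)|\w+", formula):
        if token == "(":
            parts.append("(?:")
        elif token == ")":
            parts.append(")?")
        else:
            codes = classes.get(token, [token])
            # Each unit is a code followed by one separating space.
            parts.append("(?:" + "|".join(codes) + ") ")
    return re.compile("".join(parts))

def matches(pattern, codes):
    """Test a list of codes against a compiled formula."""
    return pattern.fullmatch("".join(code + " " for code in codes)) is not None

TOY_GRAMMAR = compile_formula("(132) C V (C)", CLASS_FILE)
stv_ok = matches(TOY_GRAMMAR, ["132", "103", "301"])
v_only_ok = matches(TOY_GRAMMAR, ["301"])
```

Because every modified symbol counts as a separate unit, each class list in a realistic Class File would be far longer than in this sketch.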

7. Using the Database

The naming convention for the language grammar files allows us to recall them using any of several fields. "French.Romance.IndoEuropean.all" would be accessed in a query involving any of its subparts or in a query about all languages in the database.

It is possible to enquire about structures, specific sounds, sounds with a given feature, or any combination of the above. For example:
(a) IE "CVC" looks for CVC structures in all of the Indo-European languages.
(b) all "110 H" looks for /g/ + high vowel sequences in the whole database.
(c) Romance "C 103 H" looks for a consonant followed by a /t/ + high vowel in the Romance languages.
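How a field in a query picks out grammar files can be sketched as below. Only the French file name comes from this paper; the other file names, the alias table mapping "IE" to "IndoEuropean", and the function name are assumptions for illustration.

```python
# Only the first name is taken from the text; the rest are hypothetical.
GRAMMAR_FILES = [
    "French.Romance.IndoEuropean.all",
    "Spanish.Romance.IndoEuropean.all",
    "Hindi.IndoAryan.IndoEuropean.all",
    "Tagalog.Austronesian.all",
]

# Assumed abbreviation table for query fields.
ALIASES = {"IE": "IndoEuropean"}

def select_files(field, files):
    """Return the grammar files whose dotted name contains the field."""
    wanted = ALIASES.get(field, field)
    return [name for name in files if wanted in name.split(".")]

romance_files = select_files("Romance", GRAMMAR_FILES)
ie_files = select_files("IE", GRAMMAR_FILES)
all_files = select_files("all", GRAMMAR_FILES)
```

The pattern part of the query (e.g. "CVC") would then be matched against the grammar in each selected file.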

8. Results

28 or 15% of languages allow syllable-initial three-consonant clusters.

86 or 45% of languages allow initial 2-consonant clusters.

Configuration of initial clusters:

	(S = stop, F = fricative, N = nasal, G = glide)

		shape		no. of langs	%

		SS		18		 9
		SF		26		14
		SN		20		10
		SG		67		35
		FN		29		15
		FF		45		29
		FS		34		18
		FG		75		39

7 or 4% of languages allow final 3-consonant clusters.
17 or 9% of languages allow final 2-consonant clusters.

Configuration of final clusters:

		shape		no. of langs	%

		SS		 9		 5
		NS		11		 6
		NF		10		 5
		FF		13		 7
		GSF		 8		 4
		NSF		 8		 4
		FS		 9		 5
		GC		10		 5

131 or 69% have an obligatory syllable-initial consonant.

None has an obligatory null onset.

15 or 8% have an obligatory syllable-final consonant.

23 or 12% have an obligatory null coda.
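The percentages quoted above appear to be whole-number roundings of each count over the 191 language files. That base is an assumption drawn from the number of files reported earlier, and the helper below is only a sketch for checking individual figures.

```python
# Assumed base: the 191 language files that were created.
TOTAL_LANGUAGES = 191

def percent(count, total=TOTAL_LANGUAGES):
    """Express a language count as a whole-number percentage."""
    return round(100 * count / total)

initial_three = percent(28)       # initial three-consonant clusters
initial_two = percent(86)         # initial two-consonant clusters
obligatory_onset = percent(131)   # obligatory syllable-initial consonant
```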

9. A Minor Problem

We have not yet adequately confronted the details of phonotaxis, though the broad generalisations are in place. In English, for example, our current grammar gives "sdring" and "ssming" the stamp of approval, though it clearly should not. It will be possible to include constraints in the grammars to filter out false hits. We are somewhat hampered in less well-known languages by lack of information about which sequences do not occur.
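One way such a filtering constraint might look is sketched below. It encodes the familiar generalisation that an English three-consonant onset consists of /s/ + voiceless stop + approximant, checking only the middle consonant; the one-letter-per-phoneme representation and the function name are assumptions, and this is not the constraint format of the actual grammars.

```python
VOICELESS_STOPS = {"p", "t", "k"}

def passes_onset_filter(onset):
    """Reject /s/-initial three-consonant onsets whose middle
    consonant is not a voiceless stop."""
    if len(onset) == 3 and onset[0] == "s":
        return onset[1] in VOICELESS_STOPS
    return True

# Onsets of "sdring", "ssming" and "string" respectively.
sdr_ok = passes_onset_filter("sdr")
ssm_ok = passes_onset_filter("ssm")
str_ok = passes_onset_filter("str")
```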

10. Conclusion

We are now in a position to investigate the type of syllable universal suggested by Greenberg (1965) and to re-evaluate the work of Hooper (1976) on the sonority hierarchy of syllable structure as well as to ask a variety of new questions. We hope to improve our ability to specify co-occurrence restrictions within the syllable, though this is necessary for only a fairly small subset of the languages included. The problems we confront at the moment seem more related to implementation than to content.


Bird, Steven, and Mark Ellison. 1994. One-Level Phonology: Autosegmental Representations and Rules as Finite-State Automata. Computational Linguistics 20,1: 55-90.

Esling, John, and Harry Gaylord. 1993. Computer Codes for Phonetic Symbols. Journal of the International Phonetic Association 23:83-97.

Fudge, E.C. 1969. Syllables. Journal of Linguistics 5:253-286.

Fudge, E.C. 1987. Branching Structures within the Syllable. Journal of Linguistics 23:359-377.

Greenberg, Joseph. 1965. Some Generalisations Concerning Initial and Final Consonant Sequences. Linguistics 18:5-34.

Hooper, Joan Bybee. 1976. An Introduction to Natural Generative Phonology. Academic Press.