Creating syllable units

Syllables group phones into larger units within words. PolyglotDB enforces a strict hierarchy in which word boundaries always align with syllable boundaries (i.e., syllables cannot stretch across words).

At the moment, only one algorithm is supported (maximal onset), because its simplicity makes it language-agnostic.
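As a rough illustration of the maximal onset idea, and not PolyglotDB's actual implementation, the sketch below assigns each consonant cluster between two syllabic segments to the following syllable for as long as the cluster forms a permitted onset. The function name syllabify, the phone labels, and the set of legal onsets are all hypothetical and chosen only for this example.

```python
def syllabify(phones, syllabics, legal_onsets):
    """Split one word's phones into syllables using a maximal onset rule.

    phones: list of phone labels for a single word
    syllabics: set of labels that can be syllable nuclei
    legal_onsets: set of tuples of labels permitted as onsets (assumed input)
    """
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    if not nuclei:
        # No nucleus found: keep the word as a single unit in this sketch
        return [list(phones)]

    starts = [0]  # index where each syllable begins
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:nxt]
        # Try the longest candidate onset first, then shorter ones;
        # k == len(cluster) is the empty onset, which is always allowed
        for k in range(len(cluster) + 1):
            if k == len(cluster) or tuple(cluster[k:]) in legal_onsets:
                starts.append(prev + 1 + k)
                break

    ends = starts[1:] + [len(phones)]
    return [list(phones[s:e]) for s, e in zip(starts, ends)]


# Hypothetical inventory for the example
syllabics = {"aa", "ae", "ih"}
legal_onsets = {("s", "t", "r"), ("t", "r"), ("s",), ("t",), ("r",), ("b",), ("k",), ("l",)}

# "abstract": the cluster b-s-t-r is split as b | s-t-r
abstract = syllabify(["ae", "b", "s", "t", "r", "ae", "k", "t"], syllabics, legal_onsets)
```

Because every syllable is built from the phones of a single word, the resulting syllables can never cross a word boundary, which matches the hierarchy described above.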

To encode syllables, there are two steps:

  1. Encoding syllabic segments

  2. Encoding syllables

Encoding syllabic segments

Syllabic segments are encoded via a specialized function:

from polyglotdb import CorpusContext
syllabic_segments = ['aa', 'ae', 'ih']
with CorpusContext(config) as c:
    c.encode_syllabic_segments(syllabic_segments)

After this code runs, all phones labeled aa, ae, or ih belong to the subset syllabic. This subset can then be queried, and it is also what allows syllables to be encoded.

Encoding syllables

with CorpusContext(config) as c:
    c.encode_syllables('maxonset')

The function encode_syllables accepts a call_back keyword argument, a function such as print that is used to report progress to the console.

Following encoding, syllables are available to be queried and used like any other linguistic unit. For example, to get a list of all syllables at the beginnings of words:

with CorpusContext(config) as c:
    q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
    results = q.all()

Encoding syllable properties from syllabics

Corpora often store syllable-level information on the vowels. For instance, if the transcription contains stress levels, they are typically specified as numbers 0-2 on the vowels (e.g., as in Arpabet). Tone is similarly encoded in some transcription systems. This section details functions that strip this information from the vowel and place it on the syllable unit instead.


Removing the stress/tone information from the vowel makes queries easier, as getting all AA tokens no longer requires specifying that the label is in the set of AA1, AA2, AA0. This functionality can be disabled by specifying clean_phone_label=False in the two functions that follow.

Encoding stress

with CorpusContext(config) as c:
    c.encode_stress_to_syllables()

By default, stress is taken to be the number in the vowel label (e.g., AA1 would have a stress of 1). A different pattern for stress information can be specified through the optional regex keyword argument.
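To make the effect of the pattern and of label cleaning concrete, here is a minimal standalone sketch using Python's re module. It is not PolyglotDB's implementation: the helper name split_stress is hypothetical, and its regex and clean_phone_label parameters only mirror the keyword arguments described in this section.

```python
import re


def split_stress(label, regex=r"[0-9]", clean_phone_label=True):
    """Return (phone_label, stress) for a single phone label.

    Hypothetical helper: stress is the first match of regex in the label,
    or None when the label carries no stress information.
    """
    match = re.search(regex, label)
    if match is None:
        return label, None
    stress = match.group(0)
    if clean_phone_label:
        # Strip the stress marking so that AA0, AA1 and AA2 all become AA
        label = re.sub(regex, "", label)
    return label, stress


cleaned = split_stress("AA1")
kept = split_stress("AA1", clean_phone_label=False)
consonant = split_stress("T")
```

With cleaning enabled, a query for all AA tokens only needs the single label AA, while the stress value remains available separately for the syllable.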

Encoding tone

with CorpusContext(config) as c:
    c.encode_tone_to_syllables()

As with stress, a different pattern can be specified with the regex keyword argument.