Creating syllable units

Syllables are groupings of phones into larger units, within words. PolyglotDB enforces a strict hierarchy, with the boundaries of words aligning with syllable boundaries (i.e., syllables cannot stretch across words).

The default syllabification algorithm is maximal onset, chosen because its simplicity makes it language agnostic.

To encode syllables, there are two steps:

1. Encoding syllabic segments
2. Encoding syllables

Encoding syllabic segments

Syllabic segments are encoded via a specialized function:

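from polyglotdb import CorpusContext

# config is assumed to have been defined earlier in the session
syllabic_segments = ['aa', 'ae', 'ih']
with CorpusContext(config) as c:
    c.encode_syllabic_segments(syllabic_segments)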

After this code runs, all phones labeled aa, ae, or ih belong to the subset syllabic. This subset can then be queried, and it is also what allows syllables to be encoded.

Encoding syllables

with CorpusContext(config) as c:
    c.encode_syllables()

Note

The function encode_syllables accepts a call_back keyword argument: a function, such as print, that is used to report progress to the console.

Two algorithms are available for encoding syllables: maximal onset (the default) and probabilistic. You can restrict the allowed onsets by passing a set of phone tuples to the custom_onsets keyword argument. For example, to allow only [B, D, G] as onsets, you would call:

with CorpusContext(config) as c:
    c.encode_syllables(custom_onsets={('B',), ('D',), ('G',)})

The maximal onset algorithm automatically marks any word-initial non-syllabic cluster as a syllable onset. This means you do not need to manually include onsets that typically occur only at the beginnings of words and that might otherwise cause incorrect syllable boundary placement, for example ('S', 'T') or ('S', 'P') in English.

Following encoding, syllables are available to be queried and used like any other linguistic unit. For example, to get a list of all syllables that occur at the beginnings of words:

with CorpusContext(config) as c:
    q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
    print(q.all())
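
To make the maximal onset behavior described above more concrete, here is a toy pure-Python sketch. It is not PolyglotDB's implementation: the function name syllabify, its parameters, and the choice to push consonants that cannot form an allowed onset into the preceding coda are all assumptions made for illustration. It only aims to show how a set of allowed onsets and the word-initial rule interact.

```python
def syllabify(phones, syllabics, allowed_onsets=None):
    """Split one word's phones into syllables with a toy maximal-onset rule.

    phones: list of phone labels for a single word.
    syllabics: set of labels treated as syllable nuclei.
    allowed_onsets: set of label tuples permitted as word-internal onsets;
        None means any consonant cluster may serve as an onset.
    (All names here are hypothetical, not part of PolyglotDB.)
    """
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    if not nuclei:
        return [list(phones)]  # no nucleus: keep the word as a single unit

    # The first syllable starts at the word start, so any word-initial
    # cluster becomes its onset regardless of allowed_onsets.
    starts = [0]
    for prev, cur in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:cur]
        split = len(cluster)
        # Try the longest candidate onset first, then shorter ones.
        for k in range(len(cluster) + 1):
            onset = tuple(cluster[k:])
            if not onset or allowed_onsets is None or onset in allowed_onsets:
                split = k
                break
        # Consonants before the split stay in the previous syllable's coda
        # (an assumption of this sketch).
        starts.append(prev + 1 + split)

    bounds = starts + [len(phones)]
    return [phones[a:b] for a, b in zip(bounds, bounds[1:])]


syllabics = {"AA", "AE", "IH"}
onsets = {("B",), ("D",), ("G",)}
word = ["S", "T", "AA", "B", "D", "AE", "G", "IH"]
print(syllabify(word, syllabics, onsets))
# [['S', 'T', 'AA', 'B'], ['D', 'AE'], ['G', 'IH']]
```

In this example the word-initial cluster S T is kept as an onset even though ('S', 'T') is not in the allowed set, while the word-internal cluster B D is split because only ('D',) is an allowed onset.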

Encoding syllable properties from syllabics

Corpora often store information about syllables on the vowels. For instance, if the transcription contains stress levels, they are typically specified as the numbers 0-2 on the vowels (as in ARPABET). Tone is encoded in a similar way in some transcription systems. This section details functions that strip this information from the vowel and place it on the syllable unit instead.

Note

Removing the stress/tone information from the vowel makes queries easier, as getting all AA tokens no longer requires specifying that the label is in the set of AA1, AA2, AA0. This functionality can be disabled by specifying clean_phone_label=False in the two functions that follow.

Encoding stress

with CorpusContext(config) as c:
    c.encode_stress_to_syllables()

Note

By default, stress is taken to be the numbers in the vowel label (e.g., AA1 would have a stress of 1). A different pattern for the stress information can be specified through the optional regex keyword argument.
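
As a rough illustration of what such a pattern does, the sketch below uses Python's re module to separate a stress marker from a vowel label. The helper split_stress, the pattern [0-9]+$, and the label a_T3 are hypothetical and chosen only for this example; the exact default pattern used by PolyglotDB is not stated here.

```python
import re


def split_stress(label, pattern=r"[0-9]+$"):
    """Return (clean_label, marker) for a vowel label such as 'AA1'.

    pattern is an assumed regex matching the stress marker; labels
    without a match are returned unchanged with a marker of None.
    """
    match = re.search(pattern, label)
    if match is None:
        return label, None
    # Remove the matched marker so only the bare phone label remains.
    clean = label[:match.start()] + label[match.end():]
    return clean, match.group(0)


print(split_stress("AA1"))                          # ('AA', '1')
print(split_stress("B"))                            # ('B', None)
print(split_stress("a_T3", pattern=r"_T[0-9]$"))    # ('a', '_T3')
```

Once the marker is moved off the label, every token of the vowel shares the same bare label, which is what makes a query for all AA tokens simpler.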

Encoding tone

with CorpusContext(config) as c:
    c.encode_tone_to_syllables()

Note

As with stress, a different pattern can be specified with the regex keyword argument.