.. _enrichment_syllables:

***********************
Creating syllable units
***********************

Syllables are groupings of phones into larger units, within words. PolyglotDB enforces a strict hierarchy, with the boundaries
of words aligning with syllable boundaries (i.e., syllables cannot stretch across words).

At the moment, only one algorithm is supported (`maximal onset`) because its simplicity lends it to be language agnostic.

To encode syllables, there are two steps:

1. :ref:`encoding_syllabics`
2. :ref:`encoding_syllables`


.. _encoding_syllabics:

Encoding syllabic segments
==========================

Syllabic segments are called via a specialized function:


.. code-block:: python

   syllabic_segments = ['aa', 'ae','ih']
   with CorpusContext(config) as c:
        c.encode_syllabic_segments(syllabic_segments)


Following this code, all phones with labels of `aa, ae, ih` will belong to the subset `syllabic`.  This subset can be
then queried in the future, in addition to allowing syllables to be encoded.

.. _encoding_syllables:

Encoding syllables
==================

.. code-block:: python

   with CorpusContext(config) as c:
        c.encode_syllables()

.. note::

   The function `encode_syllables` can be given a keyword argument for `call_back`, which is a function like `print` that
   allows for progress to be output to the console.

Two algorithms are available for encoding syllables: `maximal onset` (default) and `probabilistic`.
You can restrict the allowed onsets by passing a set of phone tuples to the `custom_onsets` keyword argument.For example, to allow only
`[B, D, G]` as onsets, you would call:

.. code-block:: python

   with CorpusContext(config) as c:
        c.encode_syllables(custom_onsets={('B',), ('D',), ('G',)})

The maximal onset algorithm automatically marks any word-initial non-syllabic cluster as a syllable onset.
This means you do not need to manually include onsets that typically occur only at the beginnings of words
and may otherwise cause incorrect syllable boundary placement — for example, ('S', 'T') or ('S', 'P') in English.
Following encoding, syllables are available to queried and used as any other linguistic unit. For example, to get a list of
all the instances of syllables at the beginnings of words:


.. code-block:: python

   with CorpusContext(config) as c:
        q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
        print(q.all())

.. _stress_tone:

Encoding syllable properties from syllabics
===========================================

Often in corpora there is information about syllables contained on the vowels.  For instance, if the transcription contains
stress levels, they will be specified as numbers 0-2 on the vowels (i.e. as in Arpabet).  Tone is likewise similarly encoded
in some transcription systems.  This section details functions that strip this information from the vowel and place it on
the syllable unit instead.

.. note::

   Removing the stress/tone information from the vowel makes queries easier, as getting all `AA` tokens no longer requires
   specifying that the label is in the set of `AA1, AA2, AA0`.  This functionality can be disabled by specifying `clean_phone_label=False`
   in the two functions that follow.

.. _stress_enrichment:

Encoding stress
---------------

.. code-block:: python

   with CorpusContext(config) as c:

        c.encode_stress_to_syllables()

.. note::

   By default, stress is taken to be numbers in the vowel label (i.e., `AA1` would have a stress of `1`).  A different
   pattern to use for stress information can be specified through the optional `regex` keyword argument.


.. _tone_enrichment:

Encoding tone
-------------

.. code-block:: python

   with CorpusContext(config) as c:

        c.encode_tone_to_syllables()

.. note::

   As for stress, a different regex can be specified with the `regex` keyword argument.