Tutorial 2: Adding extra information¶
The main objective of this tutorial is to enrich an already imported corpus (see Tutorial 1: First steps) with additional information not present in the original audio and transcripts. This additional information will then be used for creating linguistically interesting queries in the next tutorial (Tutorial 3: Getting information out). This tutorial is available as a Jupyter notebook as well.
Different kinds of enrichment, corresponding to different subsections of this section, can be performed in any order. For example, speaker enrichment is independent of syllable encoding, so you can perform either one before the other and the resulting database will be the same. Within a subsection, however (e.g., Encoding syllables), the ordering of steps matters: syllabic segments must be specified before syllables can be encoded, because the syllable encoding algorithm builds up syllables around syllabic phones.
As in the other tutorials, import statements and the location of the corpus root must be set for the code in this tutorial to be runnable:
```python
import os
from polyglotdb import CorpusContext

## CHANGE THIS PATH to location of pg_tutorial corpus on your system
corpus_root = '/mnt/e/Data/pg_tutorial'
```
To create syllables requires two steps. The first is to specify the subset of phones in the corpus that are syllabic segments and function as syllabic nuclei. In general these will be vowels, but can also include syllabic consonants. Subsets in PolyglotDB are completely arbitrary sets of labels that speed up querying and allow for simpler references; see Subset enrichment for more details.
```python
syllabics = ["ER0", "IH2", "EH1", "AE0", "UH1", "AY2", "AW2", "UW1", "OY2",
             "OY1", "AO0", "AH2", "ER1", "AW1", "OW0", "IY1", "IY2", "UW0",
             "AA1", "EY0", "AE1", "AA0", "OW1", "AW0", "AO1", "AO2", "IH0",
             "ER2", "UW2", "IY0", "AE2", "AH0", "AH1", "UH2", "EH2", "UH0",
             "EY1", "AY0", "AY1", "EH0", "EY2", "AA2", "OW2", "IH1"]

with CorpusContext('pg_tutorial') as c:
    c.encode_type_subset('phone', syllabics, 'syllabic')
```
The database now contains the information that each phone type above (“ER0”, etc.) is a member of a subset called “syllabic”. Thus, every token of one of these phone types is also marked as syllabic.
Once the syllabic segments have been marked as such in the phone inventory, the next step is to actually create the syllable annotations as follows:
```python
with CorpusContext('pg_tutorial') as c:
    c.encode_syllables(syllabic_label='syllabic')
```
The `encode_syllables` function uses a maximum-onset algorithm: the set of legal onsets is taken to be all word-initial sequences of phones that are not syllabic (in this case), and onsets between syllabic phones are then maximized. As an example, a word like *astringent* has the phone sequence `AH0 S T R IH1 N JH EH0 N T`. In any reasonably-sized corpus of English, the list of possible onsets will include `S T R` and `JH`, but not `N JH`, so the sequence will be syllabified as `AH0 . S T R IH1 N . JH EH0 N T`.
See Creating syllable units for more details on syllable enrichment.
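For intuition, the maximal-onset procedure described above can be sketched in plain Python. This is an illustrative toy, not PolyglotDB's internal implementation; the `syllabify` function and the `legal_onsets` set are invented here for demonstration:

```python
def syllabify(phones, syllabics, legal_onsets):
    """Toy maximal-onset syllabifier: consonants between two nuclei go to
    the following syllable's onset when that onset is attested word-initially."""
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    syllables = []
    start = 0
    for this_n, next_n in zip(nuclei, nuclei[1:]):
        cluster = phones[this_n + 1:next_n]  # consonants between two nuclei
        # Give the following syllable the longest attested onset
        # (split == len(cluster) means all consonants stay in the coda).
        for split in range(len(cluster) + 1):
            if split == len(cluster) or tuple(cluster[split:]) in legal_onsets:
                break
        boundary = this_n + 1 + split
        syllables.append(phones[start:boundary])
        start = boundary
    syllables.append(phones[start:])
    return syllables


phones = ['AH0', 'S', 'T', 'R', 'IH1', 'N', 'JH', 'EH0', 'N', 'T']
onsets = {('S', 'T', 'R'), ('T', 'R'), ('JH',), ('S',), ('T',), ('R',)}
print(syllabify(phones, {'AH0', 'IH1', 'EH0'}, onsets))
# → [['AH0'], ['S', 'T', 'R', 'IH1', 'N'], ['JH', 'EH0', 'N', 'T']]
```

Because `('N', 'JH')` is not a legal onset, the `N` stays in the coda of the second syllable, matching the `AH0 . S T R IH1 N . JH EH0 N T` parse above.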
As with syllables, encoding utterances consists of two steps. The first is marking the “words” that are actually non-speech elements within the transcript. When a corpus is first imported, every annotation is treated as speech, so we must first encode `<SIL>` as a pause element rather than an actual speech sound:
```python
pause_labels = ['<SIL>']

with CorpusContext('pg_tutorial') as c:
    c.encode_pauses(pause_labels)
```
(Note that in the tutorial corpus `<SIL>` happens to be the only possible non-speech “word”, but in other corpora there will probably be others, so you’d use a different list of pause labels.)
Once pauses are encoded, the next step is to actually create the utterance annotations as follows:
```python
with CorpusContext('pg_tutorial') as c:
    c.encode_utterances(min_pause_length=0.15)
```
The `min_pause_length` argument specifies how long (in seconds) a non-speech element must be to act as an utterance boundary. In many cases, “pauses” that are short enough, such as those inserted by a forced-alignment error, are not good utterance boundaries (or just signal a smaller unit than an “utterance”).
See Creating utterance units for more details on encoding pauses and utterances.
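As a rough illustration of the boundary logic just described (a simplified sketch, not the library's actual code; `split_utterances` and its tuple-based word representation are invented here), pauses shorter than `min_pause_length` are ignored as boundaries while longer ones split the word stream:

```python
def split_utterances(words, pause_labels, min_pause_length=0.15):
    """Toy utterance segmenter.

    words: list of (label, start, end) tuples in time order.
    Pauses at least min_pause_length seconds long break the stream;
    shorter pauses are skipped and do not create a boundary.
    """
    utterances, current = [], []
    for label, start, end in words:
        if label in pause_labels:
            if end - start >= min_pause_length and current:
                utterances.append(current)
                current = []
        else:
            current.append(label)
    if current:
        utterances.append(current)
    return utterances


words = [('hello', 0.0, 0.4), ('<SIL>', 0.4, 0.5),   # 0.1 s pause: too short
         ('world', 0.5, 0.9), ('<SIL>', 0.9, 1.2),   # 0.3 s pause: boundary
         ('again', 1.2, 1.6)]
print(split_utterances(words, {'<SIL>'}))
# → [['hello', 'world'], ['again']]
```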
Included in the tutorial corpus is a CSV containing speaker information, namely their gender and their actual name rather than the numeric code used in LibriSpeech. This information can be imported into the corpus as follows:
```python
speaker_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'speaker_info.csv')

with CorpusContext('pg_tutorial') as c:
    c.enrich_speakers_from_csv(speaker_enrichment_path)
```
Note that the CSV file could have an arbitrary name and location, in general. The command above assumes the name and location for the tutorial corpus.
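For reference, a speaker-information CSV of this kind might look like the following (the column names and values here are hypothetical; check `speaker_info.csv` in your copy of the corpus for the actual headers, where the first column must match the speaker codes used in the corpus):

```csv
speaker,name,gender
1001,Alice,female
1002,Bob,male
```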
Once enrichment is complete, we can query and extract information about these speaker characteristics.
See Enriching speaker information for more details on enrichment from csvs.
Stress enrichment requires that the Encoding syllables step has been completed.
Once syllables have been encoded, there are a couple of ways to encode the stress level of each syllable (i.e., primary stress, secondary stress, or unstressed). The way used in this tutorial relies on a lexical enrichment file included in the tutorial corpus. This file has a field named `stress_pattern` that gives a stress pattern for each word's syllables. For example, *astringent*, stressed on the second of its three syllables, will have a stress pattern of `0-1-0`.
```python
lexicon_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'iscan_lexicon.csv')

with CorpusContext('pg_tutorial') as c:
    c.enrich_lexicon_from_csv(lexicon_enrichment_path)
    c.encode_stress_from_word_property('stress_pattern')
```
Following this enrichment step, words will have a type property of `stress_pattern`, and syllables will have a token property `stress`, both of which can be queried on and extracted.
See Encoding stress for more details on how to encode stress in various ways.
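Since ARPABET vowel labels end in a stress digit (0 = unstressed, 1 = primary, 2 = secondary), a word's stress pattern can be read off its syllabic phones directly. A minimal sketch (the `stress_pattern` function and its dash-joined output format are illustrative assumptions, not part of the PolyglotDB API):

```python
def stress_pattern(phones, syllabics):
    """Join the stress digits of the syllabic phones, one per syllable."""
    return '-'.join(p[-1] for p in phones if p in syllabics)


# astringent: AH0 S T R IH1 N JH EH0 N T -> one digit per syllable
print(stress_pattern(['AH0', 'S', 'T', 'R', 'IH1', 'N', 'JH', 'EH0', 'N', 'T'],
                     {'AH0', 'IH1', 'EH0'}))
# → 0-1-0
```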
One of the final enrichments in this tutorial is to encode speech rate onto utterance annotations. The speech rate measure used here is syllables per second.
```python
with CorpusContext('pg_tutorial') as c:
    c.encode_rate('utterance', 'syllable', 'speech_rate')
```
Next we will encode the number of syllables per word:
```python
with CorpusContext('pg_tutorial') as c:
    c.encode_count('word', 'syllable', 'num_syllables')
```
Once the enrichments complete, a token property of `speech_rate` will be available for query and export on utterance annotations, as well as one for `num_syllables` on word tokens.
See Hierarchical enrichment for more details on encoding properties based on the rate/count/position of lower annotations (i.e., phones or syllables) within higher annotations (i.e., syllables, words, or utterances).
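Conceptually, a hierarchical rate is just the number of contained lower annotations divided by the duration of the higher annotation, and a count is the raw number contained. A toy sketch of that idea (the helper functions and the interval representation are invented here; PolyglotDB computes these inside the database):

```python
def contained(lower, start, end):
    """Lower annotations whose midpoint falls inside [start, end)."""
    return [(s, e) for s, e in lower if start <= (s + e) / 2 < end]

def count(lower, start, end):
    """E.g. number of syllables in a word."""
    return len(contained(lower, start, end))

def rate(lower, start, end):
    """Contained annotations per second, e.g. syllables per second."""
    return count(lower, start, end) / (end - start)


# Five syllable intervals inside a 2-second utterance:
syllables = [(0.0, 0.3), (0.3, 0.7), (0.7, 1.1), (1.1, 1.6), (1.6, 2.0)]
print(count(syllables, 0.0, 2.0))  # → 5
print(rate(syllables, 0.0, 2.0))   # → 2.5 syllables per second
```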
You can see a full version of the script which carries out all steps shown in code above.
See Tutorial 3: Getting information out for the next tutorial covering how to create and export interesting queries using the information enriched above. See Enrichment for a full list and example usage of the various enrichments possible in PolyglotDB.