Tutorial 4: Vowel formant analysis

The main objective of this tutorial is to perform vowel formant analysis on the enriched polyglot corpus we used in the previous three tutorials.

As in the other tutorials, import statements and the corpus name (as it is stored in pgdb) must be set for the code in this tutorial to be runnable. The example given below continues to make use of the “tutorial-subset” corpus we have been using in tutorials 1-3.

import re

from polyglotdb import CorpusContext

# Settings for the full corpus (commented out):
# corpus_root = './data/LibriSpeech-aligned/'
# corpus_name = 'tutorial'
# export_path = './results/tutorial_4_formants.csv'
corpus_root = './data/LibriSpeech-aligned-subset/'
corpus_name = 'tutorial-subset'
export_path = './results/tutorial_4_subset_formants.csv'

Vowel phoneme enrichment

Currently, the tutorial-subset corpus contains an entry for each phoneme. We can query all phonemes using the following commands:

with CorpusContext(corpus_name) as c:
  q = c.query_lexicon(c.lexicon_phone)
  q = q.order_by(c.lexicon_phone.label)
  q = q.columns(c.lexicon_phone.label.column_name('phone'))
  phone_results = q.all()
  phone_set = [x.values[0] for x in phone_results]

We can then isolate only the vowel phonemes using regular expressions:

non_speech_set = ['<SIL>', 'sil', 'spn']
vowel_regex = '^[AEOUI].[0-9]'
# keep every phone label that matches the vowel pattern and is not a non-speech marker
vowel_set = [p for p in phone_set
             if re.search(vowel_regex, p) is not None and p not in non_speech_set]
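
To see what the pattern keeps, here is a minimal sketch run on a small hand-written phone list. The list is hypothetical and assumes ARPABET-style labels with stress digits, as the pattern itself suggests; the real phone_set comes from the lexicon query above.

```python
import re

# Hypothetical phone labels for illustration only; the real list comes from the corpus.
sample_phone_set = ['AA1', 'AH0', 'B', 'ER0', 'HH', 'IY1', 'NG', 'OW2',
                    'S', 'UW1', '<SIL>', 'sil', 'spn']
non_speech_set = ['<SIL>', 'sil', 'spn']

# An uppercase vowel letter, then any one character, then a digit.
vowel_regex = '^[AEOUI].[0-9]'

sample_vowel_set = [p for p in sample_phone_set
                    if re.search(vowel_regex, p) is not None
                    and p not in non_speech_set]
print(sample_vowel_set)
```

None of the consonant labels in this list begin with one of the listed vowel letters, so they are dropped. The non-speech markers fail the pattern as well, which makes the explicit non_speech_set check an extra safeguard here.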

The corpus can then be enriched with a 'vowel' subset that marks these phones, so that later steps can refer to the vowels by name:

with CorpusContext(corpus_name) as c:
  c.encode_type_subset('phone', vowel_set, 'vowel')

Using Praat to measure vowel formants

Now that the vowels are isolated and easy to query, polyglotdb can perform formant analysis on them. The executable used for the analysis is configurable; a common option is Praat:

# NOTE: The location of your Praat executable depends on your operating system and installation.
# Double-check where Praat is installed on your system and change the praat_path variable as required.
praat_path = "/usr/bin/praat"
with CorpusContext(corpus_name) as c:
  c.config.praat_path = praat_path
  c.analyze_formant_points(vowel_label='vowel', call_back=print)
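
If you are unsure where Praat is installed, one option is to search your PATH before setting praat_path. The sketch below uses only the standard library; the helper name resolve_praat_path and the candidate executable names are assumptions for illustration, not part of polyglotdb, and Praat may be installed outside your PATH.

```python
import shutil

def resolve_praat_path(candidates=("praat", "Praat")):
    """Return the first candidate found on PATH, or None if none is found.

    Hypothetical helper: the candidate names are guesses and may not match
    how Praat is installed on your system.
    """
    for name in candidates:
        found = shutil.which(name)  # full path to the executable, or None
        if found:
            return found
    return None

praat_path = resolve_praat_path()
if praat_path is None:
    print("Praat was not found on PATH; set praat_path manually.")
else:
    print("Using Praat at", praat_path)
```

When the helper returns None, fall back to setting praat_path by hand as in the example above.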

This step enriches the corpus with formant measurements (F1, F2, F3) associated with each vowel phoneme.

Exporting a CSV file

We can now query the results using a similar set of commands as in the previous tutorials:

with CorpusContext(corpus_name) as c:
  q = c.query_graph(c.phone).filter(c.phone.subset == 'vowel')
  q = q.columns(c.phone.speaker.name.column_name('speaker'),  # speaker enrichment performed in tutorial 2
                c.phone.F1.column_name('F1'),  # the columns enriched by praat
                c.phone.F2.column_name('F2'),
                c.phone.F3.column_name('F3'))
  results = q.all()
  q.to_csv(export_path)  # write the query results to the CSV file

The generated CSV file is then ready to open in R or other programs for data analysis. You can see a full version of the script, as well as its expected output when run on the ‘LibriSpeech-subset’ corpus.
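
As a quick check of the exported file before moving to R, the sketch below computes the mean F1 per speaker using only the Python standard library. It assumes the 'speaker' and 'F1' column names used in the query above. The rows are made-up placeholder values, not real measurements, and they are read from an in-memory string so that the example runs on its own.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up placeholder rows for illustration, NOT real measurements.
sample_csv = """speaker,F1
spk_a,500.0
spk_a,700.0
spk_b,300.0
spk_b,400.0
spk_b,500.0
"""

def mean_f1_by_speaker(handle):
    """Group F1 values by speaker and return the mean for each speaker."""
    values = defaultdict(list)
    for row in csv.DictReader(handle):
        if row['F1']:  # skip empty cells
            values[row['speaker']].append(float(row['F1']))
    return {speaker: mean(f1s) for speaker, f1s in values.items()}

summary = mean_f1_by_speaker(io.StringIO(sample_csv))
print(summary)
```

To run the same summary on the real export, pass open(export_path, newline='') in place of the io.StringIO object.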

Next steps

At this point, the exported data is ready for formant analysis in R. We have provided an R Markdown (Rmd) script showcasing a possible approach, along with the results of running it in a follow-up analysis HTML file. These results were obtained using the full LibriSpeech-aligned dataset, which contains many more speakers than the subset we have been using in the tutorials so far.

See Tutorial 5: Pitch extraction for another practical example of linguistic analysis that can be performed on enriched corpora using Python and R. You can also see the `related ISCAN tutorial`_ for R code on visualizing and analyzing the exported results.