Tutorial 3: Getting information out

The main objective of this tutorial is to export a CSV file using a query on an imported (Tutorial 1: First steps) and enriched (Tutorial 2: Adding extra information) corpus. This tutorial is available as a Jupyter notebook as well.

As in the other tutorials, the import statement and a system-specific path (here, the location of the exported CSV file) must be set for the code in this tutorial to be runnable:

from polyglotdb import CorpusContext


## CHANGE FOR YOUR SYSTEM
export_path = '/path/to/export/pg_tutorial.csv'

Creating an initial query

The first step in generating a CSV file is to create a query that selects just the linguistic objects (“annotations”) of a particular type (e.g. words, syllables) that are of interest to our study.

For this example, we will query for all syllables that are:

stressed (defined here as having a stress value equal to '1'),

at the beginning of the word, and

in words that are at the end of utterances.

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)

    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.
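To make the three conditions concrete, here is a minimal plain-Python sketch of the same selection logic applied to toy data. It does not use PolyglotDB at all: the dictionaries, labels, and timestamps are hypothetical stand-ins for syllable, word, and utterance annotations, chosen only to show which syllables would pass each filter.

```python
# Toy stand-ins for annotations (hypothetical values, not PolyglotDB objects).
utterance = {"begin": 0.0, "end": 1.2}
very = {"label": "very", "begin": 0.0, "end": 0.5, "utterance": utterance}
happy = {"label": "happy", "begin": 0.5, "end": 1.2, "utterance": utterance}

syllables = [
    {"label": "ve", "stress": "1", "begin": 0.0, "word": very},
    {"label": "ry", "stress": "0", "begin": 0.3, "word": very},
    {"label": "ha", "stress": "1", "begin": 0.5, "word": happy},
    {"label": "ppy", "stress": "0", "begin": 0.8, "word": happy},
]


def matches(syl):
    """Mirror the three filters: stressed, word-initial, utterance-final word."""
    stressed = syl["stress"] == "1"
    word_initial = syl["begin"] == syl["word"]["begin"]
    utterance_final = syl["word"]["end"] == syl["word"]["utterance"]["end"]
    return stressed and word_initial and utterance_final


selected = [syl["label"] for syl in syllables if matches(syl)]
print(selected)
```

In this toy utterance, "ve" is stressed and word-initial but its word is not utterance-final, so only "ha" satisfies all three conditions.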

Next, we want to specify the particular information to extract for each syllable found.

# duplicated from above
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)

    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )

With the above, we extract information of interest about the syllable, the word it is in, the utterance it is in, the speaker and the sound file (discourse in PolyglotDB’s API).

To test out the query, we can limit the results (for readability) and print them:

# duplicated from above
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)

    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )

    q = q.limit(10)
    results = q.all()
    print(results)

This prints the first ten rows that would be exported to the CSV file.

Exporting a CSV file

Once the query is constructed with filters and columns, exporting to a CSV is a simple method call on the query object. For completeness, the full code for the query and export is given below.

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)
    q = q.filter(c.syllable.stress == '1')

    q = q.filter(c.syllable.begin == c.syllable.word.begin)

    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end)

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )
    q.to_csv(export_path)

The generated CSV file can then be opened in R or other programs for data analysis.
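As a quick sanity check before moving to another tool, the exported file can also be read back in Python. The following is a minimal sketch using only the standard library. It assumes the CSV is comma-delimited with a header row that uses the column names given in the query (`speaker`, `syllable_duration`), and that every row has a numeric duration. The helper name `mean_duration_by_speaker` and the sample rows are hypothetical, made up for illustration only, so the block runs without a corpus.

```python
import csv
import os
import tempfile
from collections import defaultdict


def mean_duration_by_speaker(csv_path):
    """Return {speaker: mean syllable_duration} for the rows in csv_path."""
    durations = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            durations[row["speaker"]].append(float(row["syllable_duration"]))
    return {spk: sum(vals) / len(vals) for spk, vals in durations.items()}


# Placeholder rows (invented values, not real corpus output) written to a
# temporary file so the sketch is self-contained.
sample_rows = [
    {"syllable": "ha", "syllable_duration": "0.2", "speaker": "s01"},
    {"syllable": "ve", "syllable_duration": "0.3", "speaker": "s01"},
    {"syllable": "to", "syllable_duration": "0.1", "speaker": "s02"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["syllable", "syllable_duration", "speaker"]
        )
        writer.writeheader()
        writer.writerows(sample_rows)
    means = mean_duration_by_speaker(sample_path)

print(means)
```

To apply this to the real export, call `mean_duration_by_speaker(export_path)` instead of using the temporary sample file.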

Next steps

See the related ISCAN tutorial for R code on visualizing and analyzing the exported results.