Welcome to PolyglotDB’s documentation!

Contents:

Introduction

General Background

PolyglotDB is a Python package that focuses on representing linguistic data in scalable, high-performance databases (called “Polyglot” databases here) to apply acoustic analysis and other algorithms to large speech corpora.

In general there are two ways to leverage PolyglotDB for analyzing a dataset:

  1. The first way, more appropriate for technically skilled users, is through a Python API: writing Python scripts that import functions and classes from PolyglotDB. (For this route, see Getting started for setting up PolyglotDB, followed by Tutorial for walk-through examples.) This way also makes more sense for users in an individual lab, where it can be assumed that all users have the same level of access to datasets (without any ethical issues).
  2. The second way, more appropriate for a user group dispersed across multiple sites and where some users are less comfortable with Python scripting, is by setting up an ISCAN (Integrated Speech Corpus ANalysis) server—see the ISCAN documentation for more details. ISCAN servers allow users to view information and perform most functions of PolyglotDB through a web browser. In addition, ISCAN servers include features for the use case of multiple datasets with differential access: by user/corpus permissions level, and functionality for managing multiple Polyglot databases.

This documentation site covers the ways PolyglotDB can be used, but it is geared towards a technically skilled user and thus focuses more on the “by script” use case (#1).

The general workflow for working with PolyglotDB is as follows (a minimal code sketch appears after the list):

  • Import
    • Parse and load initial data from corpus files into a Polyglot database
      • This step can take a while, from a couple of minutes to hours depending on corpus size.
      • Intended to be done once per corpus
    • See Importing the tutorial corpus for an example
    • See Importing corpora for more details on the import process
  • Enrichment
    • Add further information through analysis algorithms or from CSV files
      • Can take a while, from a couple of minutes to hours depending on enrichment and corpus size
      • Intended to be done once per corpus
    • See Tutorial 2: Adding extra information for an example
    • See Enrichment for more details on the enrichment process
  • Query
    • Find specific linguistic units
      • Should be quick, from a couple of minutes to ~10 minutes depending on corpus size
      • Intended to be done many times per corpus, for different queries
    • See Tutorial 3: Getting information out for an example
    • See Querying corpora for more details on the query process
  • Export
    • Generate a CSV file for data analysis with specific information extracted from the previous query
      • Should be quick, and intended to be done many times per corpus (like Query)
    • See Exporting a CSV file for an example
    • See Exporting query results for more details on the export process
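
To make the workflow concrete, here is a minimal end-to-end sketch, assuming an MFA-aligned corpus at a hypothetical path (the corpus name and all paths are placeholders):

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

corpus_root = '/path/to/corpus'  # hypothetical location of an MFA-aligned corpus

# Import: parse and load the corpus files (once per corpus)
parser = pgio.inspect_mfa(corpus_root)
with CorpusContext('my_corpus') as c:
    c.load(parser, corpus_root)

# Enrichment: add further information, e.g. pauses and utterances (once per corpus)
with CorpusContext('my_corpus') as c:
    c.encode_pauses(['<SIL>'])
    c.encode_utterances(min_pause_length=0.15)

# Query and export: find units of interest and write a CSV (many times per corpus)
with CorpusContext('my_corpus') as c:
    q = c.query_graph(c.word).filter(c.word.begin == c.word.utterance.begin)
    q = q.columns(c.word.label.column_name('word'),
                  c.word.duration.column_name('duration'))
    q.to_csv('/path/to/output.csv')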

The thinking behind this workflow is explained in more detail in the ISCAN conference paper.

Note

There are also many PolyglotDB scripts written for the SPADE project that can be used as examples. These scripts are available in the SPADE GitHub repo.

High level overview

PolyglotDB represents language (speech and text corpora) using the annotation graph formalism put forth in Bird and Liberman (2001). Annotations are represented in a directed acyclic graph, where nodes are points in time in an audio file or points in a text file. Directed edges are labelled with annotations, and multiple levels of annotations can be included or excluded as desired. Bird and Liberman also put forth a relational formalism for annotation graphs, and later work implements that formalism in SQL. The LaBB-CAT and EMU-SDMS systems similarly use the annotation graph formalism.

Recently, NoSQL databases have been rising in popularity, and one type of these is the graph database. In this type of database, nodes and relationships are primitives rather than relational tables. Graph databases map onto annotation graphs in a much cleaner fashion than relational databases. The graph database used in PolyglotDB is Neo4j.

PolyglotDB also uses a NoSQL time-series database called InfluxDB. Acoustic measurements like F0 and formants are stored there, with a value at every time step (10 ms). Each measurement is also associated with a speaker and a phone from the graph database.

Multiple versions of imported sound files are generated at various sampling rates (1200 Hz, 11000 Hz, and 22050 Hz) to help speed up relevant algorithms. For example, pitch algorithms don’t need a highly sampled signal and higher sample rates will slow down the processing of files.

The idea of using multiple languages or technologies suited to individual problems is known as “polyglot persistence,” particularly in the realm of merging SQL and NoSQL databases.

More detailed information on specific implementation details is available in the Developer documentation, as well as in the InterSpeech proceedings paper.

Development history

PolyglotDB was originally conceptualized for use in Phonological CorpusTools, developed at the University of British Columbia. However, primary development shifted to the umbrella of Montreal Corpus Tools, developed by members of the Montreal Language Modelling Lab at McGill University (now part of MCQLL Lab).

A graphical program named Speech Corpus Tools was originally developed to allow users to interact with Polyglot without writing scripts. However, in the context of the Speech Across Dialects of English (SPADE) project, a more flexible solution was needed to accommodate use cases involving multiple users, with physical separation between users and data, and differing levels of permission across datasets. ISCAN has been developed within the SPADE project with these requirements in mind.

Contributors

Citation

A citeable paper for PolyglotDB is:

McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, and Morgan Sonderegger (2017). Polyglot and Speech Corpus Tools: a system for representing, integrating, and querying speech corpora. In Proceedings of Interspeech 2017. [PDF]

Or you can cite it via:

McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, Arlie Coles, Sarah Mihuc, Michael Goodale, and Morgan Sonderegger (2019). PolyglotDB [Computer program]. Version 0.1.0, retrieved 26 March 2019 from https://github.com/MontrealCorpusTools/PolyglotDB.

Getting started

PolyglotDB is the Python API for interacting with Polyglot databases and is installed through pip. There are other dependencies that must be installed prior to using a Polyglot database, depending on the user’s platform.

Note

Another way to use Polyglot functionality is through setting up an ISCAN server. An Integrated Speech Corpus Analysis (ISCAN) server can be set up on a lab’s central server, or you can run it on your local computer as well (though many of PolyglotDB’s algorithms benefit from having more processors and memory available). Please see the ISCAN documentation for more information on setting it up (http://iscan.readthedocs.io/en/latest/getting_started.html). The main feature benefits of ISCAN are multiple Polyglot databases (separating out different corpora and allowing any of them to be started or shutdown), graphical interfaces for inspecting data, and a user authentication system with different levels of permission for remote access through a web application.

Installation

To install via pip:

pip install polyglotdb

To install from source (primarily for development):

  1. Clone or download the Git repository (https://github.com/MontrealCorpusTools/PolyglotDB).
  2. Navigate to the directory via command line and install the dependencies via pip install -r requirements.txt
  3. Install PolyglotDB via python setup.py install, which will install the pgdb utility that can be run anywhere and manages a local database.

Note

The use of sudo is not recommended for installation. Ideally your Python installation should be managed by either Anaconda or Homebrew (for Macs).

Set up local database

Installing the PolyglotDB package also installs a utility script (pgdb) that is then callable from the command line anywhere on the system. The pgdb command allows for the administration of a single Polyglot database (install/start/stop/uninstall). Using pgdb requires that several prerequisites be installed first, and the remainder of this section will detail how to install these on various platforms. Please be aware that using the pgdb utility to set up a database is not recommended for larger groups or those needing remote access. See the ISCAN server for a more fully featured solution.

Mac
  1. Ensure Java 11 is installed inside Anaconda distribution (conda install -c anaconda openjdk) if using Anaconda, or via Homebrew otherwise (brew cask install java)
  2. Check Java version is 11 via java --version
  3. Once PolyglotDB is installed, run pgdb install /path/to/where/you/want/data/to/be/stored, or pgdb install to save data in the default directory.

Warning

Do not use sudo with this command on Macs, as it will lead to permissions issues later on.

Once you have installed PolyglotDB, to start it run pgdb start. Likewise, you can close PolyglotDB by running pgdb stop.

To uninstall, run pgdb uninstall

Windows
  1. Ensure Java 11 is installed (https://www.java.com/) and on the path (java --version works in the command prompt)
  2. Check Java version is 11 via java --version
  3. Start an Administrator command prompt (right click on cmd.exe and select “Run as administrator”), as Neo4j will be installed as a Windows service.
  4. Run pgdb install /path/to/where/you/want/data/to/be/stored, or pgdb install to save data in the default directory.

To start or stop the database, you likewise have to use an administrator command prompt before entering the commands pgdb start or pgdb stop.

To uninstall, run pgdb uninstall (also requires an administrator command prompt).

Linux

Ensure Java 11 is installed. On Ubuntu:

sudo apt-get update
sudo apt-get install openjdk-11-jdk-headless

Once installed, double check that java --version returns Java 11. Then run pgdb install /path/to/where/you/want/data/to/be/stored, or pgdb install to save data in the default directory.

Once you have installed PolyglotDB, to start it run pgdb start. Likewise, you can close PolyglotDB by running pgdb stop.

To uninstall, navigate to the PolyglotDB directory and type pgdb uninstall

Tutorial

Under development!

Contents:

Tutorial 1: First steps

The main objective of this tutorial is to import a downloaded corpus consisting of sound files and TextGrids into a Polyglot database so that they can be queried. This tutorial is available as a Jupyter notebook as well.

Downloading the tutorial corpus

The tutorial corpus used here is a version of the LibriSpeech test-clean subset, forced aligned with the Montreal Forced Aligner (tutorial corpus download link). Extract the files to somewhere on your local machine.

Importing the tutorial corpus

To import the tutorial corpus, the following lines of code are necessary:

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

corpus_root = '/mnt/e/Data/pg_tutorial'

parser = pgio.inspect_mfa(corpus_root)
parser.call_back = print

with CorpusContext('pg_tutorial') as c:
   c.load(parser, corpus_root)

Important

If during the running of the import code, a neo4j.exceptions.ServiceUnavailable error is raised, then double check that the pgdb database is running. Once polyglotdb is installed, simply call pgdb start, assuming pgdb install has already been called. See Set up local database for more information.

The import statements at the top get the necessary classes and functions for importing, namely the CorpusContext class and the polyglot IO module. CorpusContext objects are how all interactions with the database are handled. The CorpusContext is created as a context manager in Python (the with ... as ... pattern), so that clean up and closing of connections are automatically handled both on successful completion of the code as well as if errors are encountered.

The IO module handles all import and export functionality in polyglotdb. The principle functions that a user will encounter are the inspect_X functions that generate parsers for corpus formats. In the above code, the MFA parser is used because the tutorial corpus was aligned using the MFA. See Importing corpora for more information on the inspect functions and parser objects they generate for various formats.

Resetting the corpus

If at any point there’s some error or interruption in import or other stages of the tutorial, the corpus can be reset to a fresh state via the following code:

from polyglotdb import CorpusContext

with CorpusContext('pg_tutorial') as c:
   c.reset()

Warning

Be careful when running this code, as it will delete any and all information in the corpus. For smaller corpora such as the one presented here, the time to set up is not huge, but for larger corpora reimporting and re-enriching can take several hours.

Testing some simple queries

To ensure that data import completed successfully, we can print the list of speakers, discourses, and phone types in the corpus, via:

from polyglotdb import CorpusContext

with CorpusContext('pg_tutorial') as c:
    print('Speakers:', c.speakers)
    print('Discourses:', c.discourses)

    q = c.query_lexicon(c.lexicon_phone)
    q = q.order_by(c.lexicon_phone.label)
    q = q.columns(c.lexicon_phone.label.column_name('phone'))
    results = q.all()
    print(results)

A more interesting summary query is perhaps looking at the count and average duration of different phone types across the corpus, via:

from polyglotdb.query.base.func import Count, Average

with CorpusContext('pg_tutorial') as c:
   q = c.query_graph(c.phone).group_by(c.phone.label.column_name('phone'))
   results = q.aggregate(Count().column_name('count'), Average(c.phone.duration).column_name('average_duration'))
   for r in results:
      print('The phone {} had {} occurrences and an average duration of {}.'.format(r['phone'], r['count'], r['average_duration']))

Next steps

You can see a full version of the script.

See Tutorial 2: Adding extra information for the next tutorial covering how to enrich the corpus and create more interesting queries.

Tutorial 2: Adding extra information

The main objective of this tutorial is to enrich an already imported corpus (see Tutorial 1: First steps) with additional information not present in the original audio and transcripts. This additional information will then be used for creating linguistically interesting queries in the next tutorial (Tutorial 3: Getting information out). This tutorial is available as a Jupyter notebook as well.

Note

Different kinds of enrichment, corresponding to different subsections of this section, can be performed in any order. For example, speaker enrichment is independent of syllable encoding, so you can perform either one before the other and the resulting database will be the same. Within a section, however (e.g., Encoding syllables), the ordering of steps matters. For example, syllabic segments must be specified before syllables can be encoded, because the syllable encoding algorithm builds up syllables around syllabic phones.

As in the other tutorials, import statements and the location of the corpus root must be set for the code in this tutorial to be runnable:

import os
from polyglotdb import CorpusContext

## CHANGE THIS PATH to location of pg_tutorial corpus on your system
corpus_root = '/mnt/e/Data/pg_tutorial'

Encoding syllables

Creating syllables requires two steps. The first is to specify the subset of phones in the corpus that are syllabic segments and function as syllable nuclei. In general these will be vowels, but they can also include syllabic consonants. Subsets in PolyglotDB are completely arbitrary sets of labels that speed up querying and allow for simpler references; see Subset enrichment for more details.

syllabics = ["ER0", "IH2", "EH1", "AE0", "UH1", "AY2", "AW2", "UW1", "OY2", "OY1", "AO0", "AH2", "ER1", "AW1",
          "OW0", "IY1", "IY2", "UW0", "AA1", "EY0", "AE1", "AA0", "OW1", "AW0", "AO1", "AO2", "IH0", "ER2",
          "UW2", "IY0", "AE2", "AH0", "AH1", "UH2", "EH2", "UH0", "EY1", "AY0", "AY1", "EH0", "EY2", "AA2",
          "OW2", "IH1"]

with CorpusContext('pg_tutorial') as c:
    c.encode_type_subset('phone', syllabics, 'syllabic')

The database now contains the information that each phone type above (“ER0”, etc.) is a member of a subset called “syllabic”. Thus, each phone token belonging to one of these phone types is also syllabic.
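
For example, a minimal sketch of a query using this new subset:

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'syllabic')  # phone tokens whose type is in the subset
    print(q.limit(10).all())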

Once the syllabic segments have been marked as such in the phone inventory, the next step is to actually create the syllable annotations as follows:

with CorpusContext('pg_tutorial') as c:
    c.encode_syllables(syllabic_label='syllabic')

The encode_syllables function uses a maximum-onset algorithm: the set of permissible onsets is taken to be all existing word-initial sequences of phones (those not marked as syllabic), and onsets between syllabic phones are then maximized. As an example, something like astringent has the phone sequence AH0 S T R IH1 N JH EH0 N T. In any reasonably sized corpus of English, the list of possible onsets would include S T R and JH, but not N JH, so the sequence would be syllabified as AH0 . S T R IH1 N . JH EH0 N T.

Note

See Creating syllable units for more details on syllable enrichment.

Encoding utterances

As with syllables, encoding utterances consists of two steps. The first is marking the “words” that are actually non-speech elements within the transcript. When a corpus is first imported, every annotation is treated as speech, so we must first encode labels like <SIL> as pause elements rather than actual speech sounds:

pause_labels = ['<SIL>']

with CorpusContext('pg_tutorial') as c:
    c.encode_pauses(pause_labels)

(Note that in the tutorial corpus <SIL> happens to be the only possible non-speech “word”, but in other corpora there will probably be others, so you’d use a different pause_labels list.)

Once pauses are encoded, the next step is to actually create the utterance annotations as follows:

with CorpusContext('pg_tutorial') as c:
    c.encode_utterances(min_pause_length=0.15)

The min_pause_length argument specifies how long (in seconds) a non-speech element has to be to act as an utterance boundary. In many cases, “pauses” that are short enough, such as those inserted by a forced-alignment error, are not good utterance boundaries (or signal a smaller unit than an “utterance”).

Note

See Creating utterance units for more details on encoding pauses and utterances.

Speaker enrichment

Included in the tutorial corpus is a CSV containing speaker information, namely their gender and their actual name rather than the numeric code used in LibriSpeech. This information can be imported into the corpus as follows:

speaker_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'speaker_info.csv')

with CorpusContext('pg_tutorial') as c:
    c.enrich_speakers_from_csv(speaker_enrichment_path)

Note that the CSV file could have an arbitrary name and location, in general. The command above assumes the name and location for the tutorial corpus.

Once enrichment is complete, we can then query information and extract information about these characteristics of speakers.
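
For instance, a minimal sketch of a query restricted by a speaker property (assuming the tutorial CSV provides a gender column):

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.speaker.gender == 'female')  # speaker property added by the CSV
    q = q.columns(c.word.label.column_name('word'),
                  c.word.speaker.name.column_name('speaker'))
    print(q.limit(5).all())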

Note

See Enriching speaker information for more details on enrichment from CSV files.

Stress enrichment

Important

Stress enrichment requires that the Encoding syllables step has been completed.

Once syllables have been encoded, there are a couple of ways to encode the stress level of each syllable (i.e., primary stress, secondary stress, or unstressed). The method used in this tutorial relies on a lexical enrichment file included in the tutorial corpus. This file has a field named stress_pattern that gives a pattern for the syllables based on stress. For example, astringent will have a stress pattern of 0-1-0.

lexicon_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'iscan_lexicon.csv')

with CorpusContext('pg_tutorial') as c:
    c.enrich_lexicon_from_csv(lexicon_enrichment_path)
    c.encode_stress_from_word_property('stress_pattern')

Following this enrichment step, words will have a type property of stress_pattern and syllables will have a token property of stress that can be queried on and extracted.
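
For instance, a minimal sketch of a query using these new properties:

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)
    q = q.filter(c.syllable.stress == '1')  # primary-stressed syllable tokens
    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.word.stress_pattern.column_name('stress_pattern'))
    print(q.limit(5).all())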

Note

See Encoding stress for more details on how to encode stress in various ways.

Additional enrichment

Important

Speech rate enrichment requires that both the Encoding syllables and Encoding utterances steps have been completed.

One of the final enrichments in this tutorial is to encode speech rate onto utterance annotations. The speech rate measure used here is syllables per second.

with CorpusContext('pg_tutorial') as c:
    c.encode_rate('utterance', 'syllable', 'speech_rate')

Next we will encode the number of syllables per word:

with CorpusContext('pg_tutorial') as c:
    c.encode_count('word', 'syllable', 'num_syllables')

Once the enrichments are complete, a token property of speech_rate will be available for query and export on utterance annotations, as will one for num_syllables on word tokens.
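
As a quick check, a minimal sketch querying the new speech_rate property on utterances:

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.utterance)
    q = q.columns(c.utterance.speech_rate.column_name('speech_rate'))
    print(q.limit(5).all())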

Note

See Hierarchical enrichment for more details on encoding properties based on the rate/count/position of lower annotations (i.e., phones or syllables) within higher annotations (i.e., syllables, words, or utterances).

Next steps

You can see a full version of the script which carries out all steps shown in code above.

See Tutorial 3: Getting information out for the next tutorial covering how to create and export interesting queries using the information enriched above. See Enrichment for a full list and example usage of the various enrichments possible in PolyglotDB.

Tutorial 3: Getting information out

The main objective of this tutorial is to export a CSV file using a query on an imported (Tutorial 1: First steps) and enriched (Tutorial 2: Adding extra information) corpus. This tutorial is available as a Jupyter notebook as well.

As in the other tutorials, import statements and the location of the corpus root must be set for the code in this tutorial to be runnable:

from polyglotdb import CorpusContext


## CHANGE FOR YOUR SYSTEM
export_path = '/path/to/export/pg_tutorial.csv'

Creating an initial query

The first step for generating a CSV file is to create a query that selects just the linguistic objects (“annotations”) of a particular type (e.g., words, syllables) that are of interest to our study.

For this example, we will query for all syllables, which are:

  • stressed (defined here as having a stress value equal to '1'),
  • at the beginning of the word,
  • in words that are at the end of utterances.

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)

    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.

Next, we want to specify the particular information to extract for each syllable found.

# duplicated from above
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)

    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )

With the above, we extract information of interest about the syllable, the word it is in, the utterance it is in, the speaker and the sound file (discourse in PolyglotDB’s API).

To test out the query, we can limit the results (for readability) and print them:

# duplicated from above
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)

    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )

    q = q.limit(10)
    results = q.all()
    print(results)

This will show the first ten rows that would be exported to a CSV.

Exporting a CSV file

Once the query is constructed with filters and columns, exporting to a CSV is a simple method call on the query object. For completeness, the full code for the query and export is given below.

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)
    q = q.filter(c.syllable.stress == '1')

    q = q.filter(c.syllable.begin == c.syllable.word.begin)

    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end)

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )
    q.to_csv(export_path)

The CSV file generated will then be ready to open in other programs or in R for data analysis.

Next steps

See the related ISCAN tutorial for R code on visualizing and analyzing the exported results.

Interacting with a local Polyglot database

There are two ways to get a local Polyglot instance up and running on your machine. The first is the command line utility pgdb. The other option is to connect to a locally running ISCAN server instance.

pgdb utility

This utility provides a basic way to install/start/stop all of the required databases in a Polyglot database (see Set up local database for more details on setting up a Polyglot instance this way).

When using this setup, the following ports are used (and are relevant for later connecting to the corpus):

Port   Protocol   Database
7474   HTTP       Neo4j
7475   HTTPS      Neo4j
7687   Bolt       Neo4j
8086   HTTP       InfluxDB
8087   UDP        InfluxDB

If any of those ports are in use by other programs (they’re also the default ports for the respective database software), then the Polyglot instance will not be able to start.

Once pgdb start has executed, the local Neo4j instance can be seen at http://localhost:7474/.

Connecting from a script

When the Polyglot instance is running locally, scripts can connect to the relevant databases through the use of parameters passed to CorpusContext objects (or CorpusConfig objects):

from polyglotdb import CorpusContext, CorpusConfig

connection_params = {'host': 'localhost',
                     'graph_http_port': 7474,
                     'graph_bolt_port': 7687,
                     'acoustic_http_port': 8086}
config = CorpusConfig('corpus_name', **connection_params)
with CorpusContext(config) as c:
    pass # replace with some task, i.e., import, enrichment, or query

These port settings are used by default and so connecting to a vanilla install of the pgdb utility can be done more simply through the following:

from polyglotdb import CorpusContext

with CorpusContext('corpus_name') as c:
    pass # replace with some task, i.e., import, enrichment, or query

See the tutorial scripts for examples that use this style of connecting to a local pgdb instance.

Local ISCAN server

A locally running ISCAN server is a more fully functional system that can manage multiple Polyglot databases (creating, starting and stopping as necessary through a graphical web interface). While ISCAN servers are intended to be run on dedicated remote servers, there will often be times when scripts need to connect to a locally running server. For this, there is a utility function ensure_local_database_running:

from polyglotdb import CorpusContext, CorpusConfig
from polyglotdb.utils import ensure_local_database_running

with ensure_local_database_running('database', port=8080, token='auth_token_from_iscan') as connection_params:
    config = CorpusConfig('corpus_name', **connection_params)
    with CorpusContext(config) as c:
        pass # replace with some task, i.e., import, enrichment, or query

Important

Replace the database, auth_token_from_iscan, and corpus_name with relevant values. In the use case of one corpus per database, database and corpus_name can be the same name, as in the SPADE analysis repository.

As compared to the example above, the only difference is the context manager use of ensure_local_database_running. This function first tries to connect to an ISCAN server running on the local machine. If it successfully connects, it creates a new database named "database" if one does not already exist, starts it if it is not already running, and then returns the connection parameters as a dictionary that can be used for instantiating the CorpusConfig object. Once all the work inside the context of ensure_local_database_running has completed, the database is stopped.

The token keyword argument should be an authentication token for a user with appropriate permissions to access the ISCAN server. This token can be found by going to the admin page for tokens within ISCAN (by default, http://localhost:8080/admin/auth_token/) and choosing an appropriate one. However, please ensure that this token is not committed or made public in any way as that would lead to security issues. One way to use this in committed code is to have the token saved in a separate text document that git does not track, and load it via a function like:

import os

base_dir = '/path/to/script/directory'  # wherever the untracked auth_token file lives

def load_token():
    token_path = os.path.join(base_dir, 'auth_token')
    if not os.path.exists(token_path):
        return None
    with open(token_path, 'r') as f:
        token = f.read().strip()
    return token
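
A sketch of how such a function might then be used, following the connection example above:

from polyglotdb import CorpusContext, CorpusConfig
from polyglotdb.utils import ensure_local_database_running

with ensure_local_database_running('database', port=8080, token=load_token()) as connection_params:
    config = CorpusConfig('corpus_name', **connection_params)
    with CorpusContext(config) as c:
        pass # replace with some task, i.e., import, enrichment, or query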

Note

The ISCAN server keeps track of all existing databases and ensures that the ports do not overlap, so multiple databases can be run simultaneously. The ports are all in the 7400 and 8400 range, and should not (but may) conflict with other applications.

This utility is thus best for isolated work by a single user, where only they will be interacting with the particular database specified and the database only needs to be available during the running of the script.

You can see an example of connecting to local ISCAN server used in the scripts for the SPADE analysis repository, for instance the basic_queries.py script.

Importing corpora

Corpora can be imported from several input formats. The currently supported formats are MFA-aligned TextGrids, FAVE-aligned TextGrids, plain TextGrids, LaBB-CAT output, BAS Partitur files, the TIMIT corpus, and the Buckeye corpus.

Each format has an inspection function in the polyglotdb.io submodule that will check that the format of the specified directory or file matches the input format and return the appropriate parser.

These functions would be used as follows:

import polyglotdb.io as pgio

corpus_directory = '/path/to/directory'

parser = pgio.inspect_mfa(corpus_directory) # MFA output TextGrids

# OR

parser = pgio.inspect_fave(corpus_directory) # FAVE output TextGrids

# OR

parser = pgio.inspect_textgrid(corpus_directory)

# OR

parser = pgio.inspect_labbcat(corpus_directory)

# OR

parser = pgio.inspect_partitur(corpus_directory)

# OR

parser = pgio.inspect_timit(corpus_directory)

# OR

parser = pgio.inspect_buckeye(corpus_directory)

Note

For more technical detail on the inspect functions and the parser objects they return, see PolyglotDB I/O.

To import a corpus, the CorpusContext context manager has to be imported from polyglotdb:

from polyglotdb import CorpusContext

CorpusContext is the primary way through which corpora can be interacted with.

Before importing a corpus, you should ensure that a Neo4j server is running. Interacting with corpora requires submitting the connection details. The easiest way to do this is with a utility function ensure_local_database_running (see Interacting with a local Polyglot database for more information):

from polyglotdb.utils import ensure_local_database_running
from polyglotdb import CorpusConfig

with ensure_local_database_running('database_name') as connection_params:
   config = CorpusConfig('corpus_name', **connection_params)

The above config object contains all the configuration for the corpus.

To import a file into a corpus (in this case a TextGrid):

import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')

with ensure_local_database_running('database_name') as connection_params:
   config = CorpusConfig('my_corpus', **connection_params)
   with CorpusContext(config) as c:
       c.load(parser, '/path/to/textgrid.TextGrid')

In the above code, the io module is imported and provides access to all the importing and exporting functions. For every format, there is an inspect function that generates a parser for that file and any others formatted the same way. In the case of a TextGrid, the parser has annotation types corresponding to interval and point tiers. The inspect function tries to guess the relevant attributes of each tier.

Note

The discourse load function of CorpusContext objects takes a parser as the first argument. Parsers contain an attribute annotation_types, which the user can modify to change how a corpus is imported. For most standard formats, including TextGrids from aligners, no modification is necessary.
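
For instance, a minimal sketch of checking a parser before loading (assuming each annotation type exposes a name attribute, a hypothetical detail):

import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')
for at in parser.annotation_types:
    print(at.name)  # inspect what was detected for each tier before loading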

All interaction with the databases is via the CorpusContext context manager. Further details on import arguments can be found in the API documentation.

Once the above code is run, corpora can be queried and explored.

Enrichment

Following import, the corpus is often fairly bare, with just word and phone annotations. An important step in analyzing corpora is therefore enriching it with other information. Most of the methods here are automatic once a function is called.

Contents:

Subset enrichment

One of the most basic aspects of linguistic analysis is creating functional subsets of linguistic units. In phonology, for instance, this would be creating classes like syllabic or coronal. For words, this might be classes like content vs functional, or something more fine-grained like noun, adjective, verb, etc. At the core of these analyses is the idea that we treat some subset of linguistic units separately from others. In PolyglotDB, subsets are a fairly broad and general concept and can be applied to both linguistic types (i.e., phones or words in a lexicon) or to tokens (i.e., actual productions in a discourse).

For instance, if we wanted to create a subset of phone types that are syllabic, we can run the following code:

syllabics = ['aa', 'ih']

with CorpusContext('corpus') as c:
    c.encode_type_subset('phone', syllabics, 'syllabic')

This type subset can then be used as in Creating syllable units, or for the queries in Subsetting annotations.

Token subsets can also be created, see Enrichment via queries.

Creating syllable units

Syllables are groupings of phones into larger units, within words. PolyglotDB enforces a strict hierarchy, with the boundaries of words aligning with syllable boundaries (i.e., syllables cannot stretch across words).

At the moment, only one algorithm (maximal onset) is supported, because its simplicity makes it language-agnostic.

To encode syllables, there are two steps:

  1. Encoding syllabic segments
  2. Encoding syllables

Encoding syllabic segments

Syllabic segments are encoded via a specialized function:

syllabic_segments = ['aa', 'ae', 'ih']
with CorpusContext(config) as c:
    c.encode_syllabic_segments(syllabic_segments)

Following this code, all phones with labels of aa, ae, or ih will belong to the subset syllabic. This subset can then be queried in the future, in addition to allowing syllables to be encoded.

Encoding syllables

with CorpusContext(config) as c:
    c.encode_syllables()

Note

The function encode_syllables can be given a keyword argument for call_back, which is a function like print that allows for progress to be output to the console.

Following encoding, syllables are available to be queried and used like any other linguistic unit. For example, to get a list of all the instances of syllables at the beginnings of words:

with CorpusContext(config) as c:
    q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
    print(q.all())

Encoding syllable properties from syllabics

Often in corpora there is information about syllables contained on the vowels. For instance, if the transcription contains stress levels, they will be specified as numbers 0-2 on the vowels (e.g., as in ARPABET). Tone is similarly encoded in some transcription systems. This section details functions that strip this information from the vowel and place it on the syllable unit instead.

Note

Removing the stress/tone information from the vowel makes queries easier, as getting all AA tokens no longer requires specifying that the label is in the set of AA1, AA2, AA0. This functionality can be disabled by specifying clean_phone_label=False in the two functions that follow.

Encoding stress
with CorpusContext(config) as c:
    c.encode_stress_to_syllables()

Note

By default, stress is taken to be numbers in the vowel label (i.e., AA1 would have a stress of 1). A different pattern to use for stress information can be specified through the optional regex keyword argument.

Encoding tone
with CorpusContext(config) as c:
    c.encode_tone_to_syllables()

Note

As for stress, a different regex can be specified with the regex keyword argument.

Creating utterance units

Utterances are groups of words that are continuous in some sense. They can be thought of as similar to interpausal units or chunks in other work. The basic idea is that there are intervals in which there is no speech, a subset of which count as breaks in speech, depending on the length of these non-speech intervals.

To encode utterances, there are two steps:

  1. Encoding non-speech elements
  2. Encoding utterances

Encoding non-speech elements

Non-speech elements in PolyglotDB are termed pauses. Pauses are encoded as follows:

nonspeech_words = ['<SIL>', '<IVER>']
with CorpusContext(config) as c:
    c.encode_pauses(nonspeech_words)

The function encode_pauses takes a list of word labels that should not be considered speech in a discourse and marks them as such.

Note

Non-speech words can also be encoded through regular expressions, as in:

nonspeech_words = '^[<[{].*'
with CorpusContext(config) as c:
    c.encode_pauses(nonspeech_words)

Here the pattern matches any label that starts with <, [, or {.

Once pauses are encoded, aspects of pauses can be queried, as follows:

with CorpusContext(config) as c:
    q = c.query_graph(c.pause).filter(c.pause.discourse.name == 'one_discourse')
    print(q.all())

Additionally, word annotations can have previous and following pauses that can be found:

with CorpusContext(config) as c:
    q = c.query_graph(c.word).columns(c.word.label,
                                       c.word.following_pause_duration.column_name('pause_duration'))
    print(q.all())

Note

Once pauses are encoded, accessing an annotation’s previous or following word via c.word.previous will skip over any pauses. So for a string like I <SIL> go…, the previous word to the word go would be I rather than <SIL>.

Encoding utterances

Once pauses are encoded, utterances can be encoded by specifying the minimum length of non-speech elements that count as a break between stretches of speech.

with CorpusContext(config) as c:
    c.encode_utterances(min_pause_length=0.15)

Note

The function encode_utterances can be given a keyword argument for call_back, which is a function like print that allows for progress to be output to the console.

Following encoding, utterances are available to be queried and used like any other linguistic unit. For example, to get a list of all the instances of words at the beginnings of utterances:

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.begin == c.word.utterance.begin)
    print(q.all())

Hierarchical enrichment

Hierarchical enrichment is for encoding properties that reference multiple levels of annotations. For instance, something like speech rate of an utterance requires referencing both utterances as well as the rate per second of an annotation type below, usually syllables. Likewise, encoding number of syllables in a word or the position of a phone in a word again reference multiple levels of annotation.

Note

See Annotation Graphs for details on the implementation and representations of the annotation graph hierarchy that PolyglotDB uses.

Encode count

Count enrichment creates a property on the higher annotation that is a measure of the number of lower annotations of a type it contains. For instance, if we want to encode how many phones there are within each word, the following code is used:

with CorpusContext('corpus') as c:
    c.encode_count('word', 'phone', 'number_of_phones')

Following enrichment, all word tokens will have a property for number_of_phones that can be referenced in queries and exports.
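
For example, a minimal sketch of a query using this property:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.number_of_phones == 3)  # word tokens with exactly three phones
    print(q.all())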

Encode rate

Rate enrichment creates a property on a higher annotation that is a measure of lower annotations per second. It is calculated as the count of units contained by the higher annotation divided by the duration of the higher annotation.

with CorpusContext('corpus') as c:
    c.encode_rate('word', 'phone', 'phones_per_second')

Following enrichment, all word tokens will have a property for phones_per_second that can be referenced in queries and exports.

Encode position

Position enrichment creates a property on the lower annotation that is the position of the element in relation to other annotations within a higher annotation. It starts at 1 for the first element.

with CorpusContext('corpus') as c:
    c.encode_position('word', 'phone', 'position_in_word')

The encoded property is then queryable/exportable, as follows:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone).filter(c.phone.position_in_word == 1)
    print(q.all())

The above query will match all phones in the first position (i.e., identical results to a query using alignment, see Hierarchical queries for more details on those).

Enrichment via CSV files

PolyglotDB supports ways of adding arbitrary information to annotations or metadata about speakers and files by specifying a local CSV file to add information from. When constructing this CSV file, the first column should be the label used to identify which element should be enriched, and all subsequent columns are used as properties to add to the corpus.

ID_column,property_one,property_two
first_item,first_item_value_one,first_item_value_two
second_item,,second_item_value_two

Enriching using this file would look up elements based on the ID_column. The element matching first_item would get both property_one and property_two (with the respective values), while the one matching second_item would get only property_two (because its value for property_one is empty).

Enriching the lexicon

lexicon_csv_path = '/full/path/to/lexicon/data.csv'
with CorpusContext(config) as c:
    c.enrich_lexicon_from_csv(lexicon_csv_path)

Note

The function enrich_lexicon_from_csv accepts an optional keyword case_sensitive and defaults to False. Changing this will respect capitalization when looking up words.

Enriching the phonological inventory

The phone inventory can be enriched with arbitrary properties via:

inventory_csv_path = '/full/path/to/inventory/data.csv'
with CorpusContext(config) as c:
    c.enrich_inventory_from_csv(inventory_csv_path)

Enriching speaker information

Speaker information can be added via:

speaker_csv_path = '/full/path/to/speaker/data.csv'
with CorpusContext(config) as c:
    c.enrich_speakers_from_csv(speaker_csv_path)

Enriching discourse information

Metadata about the discourses or sound files can be added via:

discourse_csv_path = '/full/path/to/discourse/data.csv'
with CorpusContext(config) as c:
    c.enrich_discourses_from_csv(discourse_csv_path)

Enriching arbitrary tokens

Often it’s necessary or useful to encode a new property on tokens of an annotation without directly interfacing with the database. This could happen, for example, if you wanted to use a different language or tool than Python for a certain phonetic analysis. In this case, it is possible to enrich any type of token via CSV, using the enrich_tokens_from_csv function.

token_csv_path = '/full/path/to/discourse/data.csv'
with CorpusContext(config) as c:
    c.enrich_tokens_from_csv(token_csv_path,
            annotation_type="phone",
            id_column="phone_id",
            properties=["measurement_1", "measurement_2"])

The only requirement for the CSV is that there is a column containing the IDs of the tokens you wish to update. You can get these IDs (along with other properties) by querying the tokens beforehand and exporting a CSV; see Export for token CSVs. The only columns from the CSV that will be added as token properties are those included in the properties parameter. If this parameter is left as None, all columns of the CSV except the id_column will be included.
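
A hypothetical round trip might look like the following, assuming token IDs can be exported via an id attribute (all file and column names here are placeholders):

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.id.column_name('phone_id'),
                  c.phone.label.column_name('phone'))
    q.to_csv('/path/to/phone_tokens.csv')

# ... add measurement_1 and measurement_2 columns to the CSV externally ...

with CorpusContext(config) as c:
    c.enrich_tokens_from_csv('/path/to/phone_tokens.csv',
                             annotation_type='phone',
                             id_column='phone_id',
                             properties=['measurement_1', 'measurement_2'])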

Enrichment via queries

Queries have the functionality to set properties and create subsets of elements based on results.

For instance, if you wanted to make word initial phones more easily queryable, you could perform the following:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.begin == c.phone.word.begin)
    q.create_subset('word-initial')

Once that code completes, a subsequent query could be made of:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'word-initial')
    print(q.all())

Or instead of a subset, a property could be encoded as:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.begin == c.phone.word.begin)
    q.set_properties(position='word-initial')

And then this property can be exported as a column in a CSV:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.position.column_name('position'))
    q.to_csv(some_csv_path)

Lexicon queries can also be used in the same way to create subsets and encode properties that do not vary on a token by token basis.

For instance, a subset for high vowels can be created as follows:

with CorpusContext(config) as c:
    high_vowels = ['iy', 'ih','uw','uh']
    q = c.query_lexicon(c.lexicon_phone)
    q = q.filter(c.lexicon_phone.label.in_(high_vowels))
    q.create_subset('high_vowel')

Which can then be used to query phone annotations:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'high_vowel')
    print(q.all())

Acoustic measures

One of the most important steps in analyzing a corpus is taking acoustic measurements of the data in the corpus. The acoustics functions listed here allow you to analyze and save acoustics into the database, to be queried later. There are several automatic acoustics functions in PolyglotDB, and you can also encode other measures using the Praat script function.

Contents:

Encoding acoustic measures

PolyglotDB has some built-in functions to encode certain acoustic measures, and also supports encoding measures from a Praat script. All of these functions carry out the acoustic analysis and save the results into the database. It currently contains built-in functions to encode pitch, intensity, and formants.

In order to use any of them on your own computer, you must set your CorpusContext’s config.praat_path to point to Praat, and likewise config.reaper_path to Reaper if you want to use Reaper for pitch.

Encoding pitch

Pitch is encoded using the analyze_pitch function.

This is done as follows:

with CorpusContext(config) as c:
    c.analyze_pitch()

Note

The function analyze_pitch requires that utterances be encoded prior to being run. See Creating utterance units for further details on encoding utterances.

Following encoding, pitch tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.

Pitch analysis can be configured in two ways: the source program for the measurement, and the algorithm for fine-tuning the pitch range.

Sources

The keyword argument source can be set to either 'praat' or 'reaper', depending on which program you would like PolyglotDB to use to measure pitch. The default source is Praat.

with CorpusContext(config) as c:
    c.analyze_pitch()
    # OR
    c.analyze_pitch(source='reaper')

If the source is praat, the Praat executable must be discoverable on the system path (i.e., a call of praat in a terminal works). Likewise, if the source is reaper, the Reaper executable must be on the path or the full path to the Reaper executable must be specified.

Algorithms

Similar to the source attribute, the algorithm can be toggled between "base", "gendered", and "speaker_adapted".

with CorpusContext(config) as c:
    c.analyze_pitch()

    # OR

    c.analyze_pitch(algorithm='gendered')

    # OR

    c.analyze_pitch(algorithm='speaker_adapted')

The "base" algorithm uses a default minimum pitch of 50 Hz and a maximum pitch of 500 Hz, but these can be changed through the absolute_min_pitch and absolute_max_pitch parameters.

The "gendered" algorithm checks whether a Gender property is available for speakers. If a speaker has a property value that starts with f (i.e., female), utterances by that speakers will use a minimum pitch of 100 Hz and a maximum pitch of 500 Hz. If they have a property value of m (i.e., male), utterances by that speakers will use a minimum pitch of 50 Hz and a maximum pitch of 400 Hz.

The "speaker_adapted" algorithm does two passes of pitch estimation. The first is identical to "base" and uses a minimum pitch of 50 Hz and a maximum pitch of 500 Hz (or whatever the parameters have been set to). This first pass is used to estimate by-speaker means of F0. Speaker-specific pitch floors and ceilings are calculated by adding or subtracting the number of octaves that the adjusted_octaves parameter specifies. The default is 1, so the per-speaker pitch range will be one octave below and above the speaker’s mean pitch.

Encoding intensity

Intensity is encoded using analyze_intensity(), as follows:

with CorpusContext(config) as c:
    c.analyze_intensity()

Note

The function analyze_intensity requires that utterances be encoded prior to being run. See Creating utterance units for further details on encoding utterances.

Following encoding, intensity tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.

Encoding formants

There are several ways of encoding formants. The first encodes formant tracks over utterances, similar to encoding pitch or intensity tracks. There is also support for encoding formant tracks just over specified vowel segments. Finally, point measures of formants can be encoded. Both formant tracks and points can be calculated using either a simple one-pass algorithm or a multiple-pass refinement algorithm.

Basic formant tracks

Formant tracks over utterances are encoded using analyze_formant_tracks, as follows:

with CorpusContext(config) as c:
    c.analyze_formant_tracks()

Note

The function analyze_formant_tracks requires that utterances be encoded prior to being run. See Creating utterance units for further details on encoding utterances.

Following encoding, formant tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.

Formant tracks can also be encoded just for specific phones by specifying a subset of phones:

with CorpusContext(config) as c:
    c.analyze_formant_tracks(vowel_label='vowel')

Note

This usage requires that a vowel subset of phone types be already encoded in the database. See Enrichment via queries for more details on creating subsets.

These formant tracks do not undergo any specialised analysis to ensure that they are not spurious formants.

Basic formant point measurements

The analyze_formant_points function will generate measures for F1, F2, F3, B1, B2, and B3 at the time point 33% of the way through the vowel, for every vowel specified.

with CorpusContext(config) as c:
    c.analyze_formant_points(vowel_label='vowel')

Note

The function analyze_formant_points requires that a vowel subset of phone types be already encoded in the database. See Enrichment via queries for more details on creating subsets.

Refined formant points and tracks

The other function for generating both point and track measurements is analyze_formant_points_refinement. This function computes formant measurements for multiple values of n_formants, from 4 to 7. To pick the best measurement, the function initializes per-vowel means and standard deviations with the F1, F2, F3, B1, B2, B3 values generated by n_formants=5. Then, it performs multiple iterations that select the new best track as the one that minimizes the Mahalanobis distance to the relevant prototype. To choose whether tracks or points are saved in the database, set the output_tracks parameter to True for tracks and False otherwise. When operating over tracks, the algorithm still evaluates the best parameters using only the 33% point.

with CorpusContext(config) as c:
    c.analyze_formant_points_refinement(vowel_label='vowel')

Following encoding, phone types that were analyzed will have properties for F1, F2, F3, B1, B2, and B3 available for query and export. See Querying acoustic point measures for more details.

Encoding Voice Onset Time (VOT)

Currently there is only one method to encode Voice Onset Times (VOTs) into PolyglotDB. This makes use of the AutoVOT program, which automatically calculates VOTs based on various acoustic properties.

VOTs are encoded over a specific subset of phones using analyze_vot as follows:

with CorpusContext(config) as c:
    c.analyze_vot(classifier,
                  stop_label="stops",
                  vot_min=5,
                  vot_max=100,
                  window_min=-30,
                  window_max=30)

Note

The function analyze_vot requires that utterances and any subsets be encoded prior to being run. See Creating utterance units for further details on encoding utterances and Subset enrichment for subsets.

Parameters

The analyze_vot function has a variety of parameters that are important for running it properly. classifier is a string giving the path to an AutoVOT classifier directory. A default classifier is available in /tests/data/classifier/sotc_classifiers.

stop_label refers to the name of the subset of phones that you intend to calculate VOTs for.

vot_min and vot_max refer to the minimum and maximum duration of any VOT that is calculated. The AutoVOT repository (https://github.com/mlml/autovot) suggests sensible defaults for English voiced and voiceless stops.

window_min and window_max refer to the edges of a given phone’s duration. So, a window_min of -30 means that AutoVOT will look up to 30 milliseconds before the start of a phone for the burst, and a window_max of 30 means that it will look up to 30 milliseconds after the end of a phone.

Encoding other measures using a Praat script

Other acoustic measures can be encoded by passing a Praat script to analyze_script.

The requirements for the Praat script are:

  • exactly one input: the full path to the sound file containing (only) the phone. (Any other parameters can be set manually within your script, and an existing script may need some other modifications in order to work on this type of input)

  • print the resulting acoustic measurements (or other properties) to the Praat Info window in the following format:

    • The first line should be a space-separated list of column names. These are the names of the properties that will be saved into the database.
    • The second line should be a space-separated list containing one measurement for each property.
    • (It is okay if there is some blank space before/after these two lines.)

    An example of the Praat output:

    peak slope cog spread
    5540.7376 24.3507 6744.0670 1562.1936
    

    Output format if you are only taking one measure:

    cog
    6013.9
    

To run analyze_script, do the following:

  1. encode a phone class for the subset of phones you would like to analyze
  2. call analyze_script on that phone class, with the path to your script

For example, to run a script which takes measures for sibilants:

with CorpusContext(config) as c:
    c.encode_class(['S', 'Z', 'SH', 'ZH'], 'sibilant')
    c.analyze_script('sibilant', 'path/to/script/sibilant.praat')

Querying acoustic measures

Querying acoustic tracks

All of the built-in acoustic measures, including formants created using one of the functions in Encoding formants, are saved as tracks with measurements at 10 ms intervals in the database. To access these values, use corpus_context.phone.MEASUREMENT.track, replacing MEASUREMENT with the name of the measurement you want: pitch, formants, or intensity.

Example: querying for a formant track:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.begin, c.phone.end, c.phone.formants.track)
    results = q.all()
    q.to_csv('path/to/output.csv')

You can also find the min, max, and mean of the track for each phone, using corpus_context.phone.MEASUREMENT.min, etc.
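
For example, a sketch exporting per-phone pitch summary statistics (assuming pitch has already been encoded):

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.begin, c.phone.end,
                  c.phone.pitch.min, c.phone.pitch.max, c.phone.pitch.mean)
    results = q.all()
    q.to_csv('path/to/output.csv')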

Querying acoustic point measures

Acoustic measures that only have one measurement per phone are termed point measures and are accessed as regular properties of the annotation.

Anything encoded using analyze_script is not saved as a track, and is instead recorded once for each phone. These values are accessed using corpus_context.phone.MEASUREMENT, replacing MEASUREMENT with the name of the measurement you want.

Example: querying for cog (center of gravity)

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.begin, c.phone.end, c.phone.cog)
    results = q.all()
    q.to_csv('path/to/output.csv')

Querying Voice Onset Time

Querying voice onset time is done in the same way as acoustic point measures; however, the vot object itself has several measures associated with it.

So, you must also include what you would like from the vot measurement as shown below.

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.vot.begin, c.phone.vot.end, c.phone.vot.confidence)
    results = q.all()
    q.to_csv('path/to/output.csv')

Subannotation enrichment

Often there are details we would like to include on a linguistic annotation (word, syllable, phone, etc.) which are not a simple measure like a single value or a track of values over time. An example of this would be Voice Onset Time (VOT), where we have two distinct parts (voicing onset and burst) which cannot be reduced to a single value.

In PolyglotDB, we refer to these more complicated structures as sub-annotations as they provide details that cannot just be a single measure like formants or pitch. These sub-annotations are always attached to a regular linguistic annotation, but they have all of their own properties.

So, for example, a given phone token could have a vot subannotation attached to it, consisting of several related values: the onset, the burst, and the confidence (of the prediction) of the VOT in question. This allows semantically related measurements to be grouped and treated as a single object with multiple values, rather than as several distinct measurements that happen to be related.

For information on querying subannotations, see Subannotation queries.

Querying corpora

Queries are the primary function of PolyglotDB. The goal for the query API is to provide an extensible base for more complex linguistic query systems to be built.

Contents:

Querying annotations

The main way of finding specific annotations is through the query_graph method of CorpusContext objects.

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label == 'are')
    results = q.all()
    print(results)

The above code will find and print all instances of word annotations that are labeled with ‘are’. The method query_graph takes one argument, which is an attribute of the context manager corresponding to the name of the annotation type.

The primary function for queries is filter. This function takes one or more conditional expressions on attributes of annotations. In the above example, word annotations have an attribute label which corresponds to the orthography.

Conditional expressions can take on any normal Python conditional (==, !=, <, <=, >, >=). The Python operator in does not work; a special pattern has to be used:

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    results = q.all()
    print(results)

The in_ conditional function can take any iterable, including another query:

with CorpusContext(config) as c:
    sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
    results = q.all()
    print(results)

In this case, it will find all phone annotations that are in the words listed. Using the id attribute will use unique identifiers for the filter. In this particular instance, it does not matter, but it does in the following:

with CorpusContext(config) as c:
    sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    sub_q = sub_q.filter_right_aligned(c.word.line)
    q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
    results = q.all()
    print(results)

The above query will find all instances of the three words, but only where they are right-aligned with a line annotation.

Note

Queries are lazily evaluated. In the above example, sub_q is not evaluated until q.all() is called. This means that filters can be chained across multiple lines without a performance hit.

Following and previous annotations

Filters can reference the surrounding local context. For instance:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    results = q.all()
    print(results)

The above query will find all the ‘aa’ phones that are followed by an ‘r’ phone. Similarly, c.phone.previous would provide access to filtering on preceding phones.

Subsetting annotations

In linguistics, it’s often useful to specify subsets of symbols as particular classes. For instance, phonemes are grouped together by whether they are syllabic, their manner/place of articulation, and vowel height/backness/rounding, and words are grouped by their parts of speech.

Suppose a subset has been created as in Subset enrichment, so that the phones ‘aa’ and ‘ih’ have been marked as syllabic. Once this category is encoded in the database, it can be used in filters.

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'syllabic')
    results = q.all()
    print(results)

Note

The results returned by the above query will be identical to the similar query:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label.in_(['aa', 'ih']))
    results = q.all()
    print(results)

The primary benefit of using subsets is performance, due to the inner workings of Neo4j. See Neo4j implementation for more details.

Another way to specify subsets is on the phone annotations themselves, as follows:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone.filter_by_subset('syllabic'))
    results = q.all()
    print(results)

Both of these queries are identical and will return all instances of ‘aa’ and ‘ih’ phones. The benefit of filter_by_subset is generally for use in Hierarchical queries.

Note

Using subsets repeatedly in queries can make them overly verbose. The objects that queries use are normal Python objects and can therefore be assigned to variables for easier reuse.

  with CorpusContext(config) as c:
      syl = c.phone.filter_by_subset('syllabic')
      q = c.query_graph(syl)
      q = q.filter(syl.end == syl.word.end)
      results = q.all()
      print(results)

The above query would find all phones marked as 'syllabic' that occur at the ends of words.

Hierarchical queries

A key facet of language is that it is hierarchical. Words contain phones, and can be contained in larger utterances. There are several ways to query hierarchical information. If we want to find all aa phones in the word dogs, then we can perform the following query:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.word.label == 'dogs')
    results = q.all()
    print(results)

Starting from the word level, we might want to know what phones each word contains.

with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.label.column_name('phones'))
    results = q.all()
    print(results)

In the output of the above query, there would be a column labeled phones that contains a list of the labels of phones that belong to the word (['d', 'aa', 'g', 'z']). Any property of phones can be queried this way (i.e., begin, end, duration, etc).

Going down the hierarchy, we can also find all words that contain a certain phone.

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    q = q.filter(c.word.phone.label == 'aa')
    results = q.all()
    print(results)

In this example, it will find all instances of the three words that contain an aa phone.

Special keywords exist for these containment columns. The keyword rate will return the elements per second for the word (i.e., phones per second). The keyword count will return the number of elements.

with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)

These keywords can also leverage subsets, as above:

with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.filter_by_subset('syllabic').count.column_name('num_syllabic_phones'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)

Additionally, there is a special keyword, position, that can be used to query the position of a contained element within a containing one.

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.word.label == 'dogs')
    q = q.columns(c.word.phone.position.column_name('position_in_word'))
    results = q.all()
    print(results)

The above query should return 2 for the value of position_in_word, as the aa phone would be the second phone.

Subannotation queries

Annotations can have subannotations associated with them. Subannotations are not independent linguistic types, but have more information associated with them than just a single property. For instance, voice onset time (VOT) would be a subannotation of stops (as it has a begin time and an end time that are of interest). For more information on subannotations, see Subannotation enrichment. Querying such subannotations is performed as follows:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.vot.duration.column_name('vot'))
    results = q.all()
    print(results)

In some cases, it may be desirable to have more than one subannotation of the same type associated with a single annotation. For instance, voicing during the closure of a stop can take place at both the beginning and end of closure, with an unvoiced period in the middle. Using a similar query as above would get the durations of each of these (in the order of their begin time):

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.voicing_during_closure.duration.column_name('voicing'))
    results = q.all()
    print(results)

In some cases, we might like to know the total duration of such subannotations, rather than the individual durations. To query that information, we can use an aggregate:

from polyglotdb.query.base.func import Sum
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    results = q.aggregate(Sum(c.phone.voicing_during_closure.duration).column_name('total_voicing'))
    print(results)

Miscellaneous

Aggregates and groups

Aggregate functions are available in polyglotdb.query.base.func. The available aggregate functions are:

  • Average
  • Count
  • Max
  • Min
  • Stdev
  • Sum

In general, these functions take a numeric attribute as an argument. The only one that does not follow this pattern is Count, which takes no arguments.

from polyglotdb.query.base.func import Count
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Count())
    print(result)

Like the all function, aggregate triggers evaluation of the query. Instead of returning rows, it returns a single number; in this case, the number of rows matching the query.

from polyglotdb.query.base.func import Average
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Average(c.phone.duration))
    print(result)

The above aggregate function will return the average duration for all ‘aa’ phones followed by ‘r’ phones.

Aggregates are particularly useful with grouping. For instance:

from polyglotdb.query.base.func import Average
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r','l']))
    q = q.group_by(c.phone.following.label.column_name('following_label'))
    result = q.aggregate(Average(c.phone.duration), Count())
    print(result)

The above query will return the average duration and the count of ‘aa’ phones grouped by whether they’re followed by an ‘r’ or an ‘l’.

Note

In the above example, the group_by attribute is supplied with an alias for output. In the print statement and in the results, the column will be called ‘following_label’ instead of the default (more opaque) name.

Ordering

The order_by function is used to provide an ordering to the results of a query.

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r','l']))
    q = q.filter(c.phone.discourse == 'a_discourse')
    q = q.order_by(c.phone.begin)
    results = q.all()
    print(results)

The results for the above query will be ordered by the time point of the annotation. Ordering by time is most useful when looking at a single discourse (as including multiple discourses in a query would invalidate the ordering).

Note

In grouped aggregate queries, ordering is by default by the first group_by attribute. This can be changed by calling order_by before evaluating with aggregate.

Lexicon queries

Querying the lexicon is in many ways similar to querying annotations in graphs.

with CorpusContext(config) as c:
    q = c.query_lexicon(c.lexicon_phone).filter(c.lexicon_phone.label == 'aa')
    print(q.all())

The above query will just return one result (as there is only one phone type with a given label) as opposed to the multiple results returned when querying annotations.

Speaker queries

Querying speaker information is similar to querying other aspects, and functions very similarly to querying discourses. Queries are constructed through the function query_speakers:

with CorpusContext(config) as c:
    q = c.query_speakers()
    speakers = [x['name'] for x in q.all()]
    print(speakers)

The above code will print all of the speakers in the current corpus. Like other queries, speakers can be filtered by properties that are encoded for them and specific information can be extracted.

with CorpusContext(config) as c:
    q = c.query_speakers().filter(c.speaker.name == 'Speaker 1').columns(c.speaker.discourses.name.column_name('discourses'))
    speaker1_discourses = q.all()[0]['discourses']
    print(speaker1_discourses)

The above query will print out all the discourses that a speaker identified as "Speaker 1" spoke in.

Discourse queries

Discourses can also be queried, and function very similarly to speaker queries. Queries are constructed through the function query_discourses:

with CorpusContext(config) as c:
    q = c.query_discourses()
    discourses = [x['name'] for x in q.all()]
    print(discourses)

The above code will print all of the discourses in the current corpus. Like other queries, discourses can be filtered by properties that are encoded for them and specific information can be extracted.

with CorpusContext(config) as c:
    q = c.query_discourses().filter(c.discourse.name == 'File 1').columns(c.discourse.speakers.name.column_name('speakers'))
    file1_speakers = q.all()[0]['speakers']
    print(file1_speakers)

The above query will print out all the speakers that spoke in the discourse identified as "File 1".

Exporting query results

Exporting queries is simply a matter of calling the to_csv function of a query, rather than its all function.

csv_path = '/path/to/save/file.csv'
with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label == 'are')
    q = q.columns(c.word.label.column_name('word'), c.word.duration,
                  c.word.speaker.name.column_name('speaker'))
    q.to_csv(csv_path)

All queries, including those over annotations, speakers, discourses, etc., have this method available for creating CSV files from their results. The columns function allows users to list any attributes within the query (i.e., properties of the word, or any higher/previous/following annotation, or anything about the speaker, etc.). These attributes have column headers generated from the query by default, but the headers can be overridden through the use of the column_name function, as above.

Export for token CSVs

If you wish to add properties to a set of tokens by means of a CSV, this can be achieved by using the token import tool explained in Enriching arbitrary tokens. In order to do this you will need a CSV that contains the ID of each token that you wish to evaluate. The following code shows how to export all phones with their ID, begin, end, and sound file, which could be useful for a phonetic analysis in an external tool.

csv_path = '/path/to/save/file.csv'
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.label,
                  c.phone.id,
                  c.phone.begin,
                  c.phone.end,
                  c.phone.discourse.name)
    q = q.order_by(c.phone.begin)
    q.to_csv(csv_path)

Query Reference

Getting elements

c.phone
c.lexicon_phone
c.speaker
c.discourse

Attributes

In addition to any values that get added through enrichment, there are several built in attributes that allow access to different parts of the database.

Attribute type Code Notes
Label [1] c.phone.label  
Name [2] c.speaker.name  
Begin [3] c.phone.begin  
End [3] c.phone.end  
Duration [3] c.phone.duration  
Previous annotation [3] c.phone.previous  
Following annotation [3] c.phone.following  
Previous pause [3] c.phone.word.previous_pause Must be from a word annotation
Following pause [3] c.phone.word.following_pause Must be from a word annotation
Speaker [3] c.phone.speaker  
Discourse [3] c.phone.discourse  
Pitch attribute [3] c.phone.pitch  
Formants attribute [3] c.phone.formants  
Intensity attribute [3] c.phone.intensity  
Minimum value [4] c.phone.pitch.min  
Maximum value [4] c.phone.pitch.max  
Mean value [4] c.phone.pitch.mean  
Raw track [4] c.phone.pitch.track  
Sampled track [4] c.phone.pitch.sampled_track  
Interpolated track [4] c.phone.pitch.interpolated_track  
[1] Only available for graph annotations and lexicon annotations
[2] Only available for speakers/discourses
[3] Only available for graph annotations
[4] Only available for acoustic attributes

Filters

Filter type Code Notes
Equal c.phone.label == 'aa'  
Not equal c.phone.label != 'aa'  
Greater than c.phone.begin > 0  
Greater than or equal c.phone.begin >= 0  
Less than c.phone.end < 10  
Less than or equal c.phone.end <= 10  
In c.phone.label.in_(['aa','ae']) in_ can also take a query
Not in c.phone.label.not_in_(['aa']) not_in_ can also take a query
Is null c.phone.label == None  
Is not null c.phone.label != None  
Regular expression match c.phone.label.regex('a,')  
In subset c.phone.subset == 'syllabic'  
Not in subset c.phone.subset != 'syllabic'  
Precedes pause c.word.precedes_pause == True Only available for graph annotations
Does not precede pause c.word.precedes_pause == False Only available for graph annotations
Follows pause c.word.follows_pause == True Only available for graph annotations
Does not follow pause c.word.follows_pause == False Only available for graph annotations
Right aligned c.phone.end == c.phone.word.end Only available for graph annotations
Not right aligned c.phone.end != c.phone.word.end Only available for graph annotations
Left aligned c.phone.begin == c.phone.word.begin Only available for graph annotations
Not left aligned c.phone.begin != c.phone.word.begin Only available for graph annotations

Developer documentation

This section of the documentation is devoted to explaining implementation details of PolyglotDB. In large part this is currently a brain dump of Michael McAuliffe to hopefully allow for easier implementation of new features in the future.

The overarching structure of PolyglotDB is based around two database technologies: Neo4j and InfluxDB. Both of these database systems are devoted to modelling, storing, and querying specific types of data, namely, graph and time series data. Because speech data can be modelled in each of these ways (see Annotation Graphs for more details on representing annotations as graphs), using these databases leverages their performance and scalability for increasing PolyglotDB’s ability to deal with large speech corpora. Please see the InterSpeech proceedings paper for more information on the high level motivations of PolyglotDB.

Contents:

Neo4j implementation

This section details how PolyglotDB saves and structures data within Neo4j.

Note

This section assumes some familiarity with the Cypher query language and Neo4j, see the Cypher documentation for more details and reference.

Annotation Graphs

The basic formalism in PolyglotDB for modelling and storing transcripts is that of annotation graphs, originally proposed by Bird and Liberman (2001). In this formalism, transcripts are directed acyclic graphs. Nodes in the graph represent time points in the audio file, and edges between the nodes represent annotations (such as phones, words, utterances, etc). This style of graph is illustrated below.

Note

Annotation is a broad term in this conceptual framework and covers basically anything that can be defined by a label and begin/end time points. Single time point annotations (something like ToBI) are not strictly covered in this framework. Annotation as phoneticians typically think of it (i.e., extra information annotated by the researcher, like VOT or categorization by a listener) is modelled in PolyglotDB as subannotations (annotations of annotations) and is handled differently from the principal annotations, which are linguistic units.

[Figure: annotation graph representation of the word “polyglot” spoken over the course of one second]

The annotation graph framework is a conceptual way to model linear time signals and interval annotations, independent of a specific implementation. The annotation graph formalism has been implemented in other speech corpus management systems, in either SQL (LaBB-CAT) or custom file-based storage systems (EMU-SDMS). One of the principal goals in developing PolyglotDB was to be scalable to large datasets (potentially hundreds of hours) and still have good query performance. Initial implementations in SQL were not as fast as I would have liked, so Neo4j was selected as the storage backend. Neo4j is a NoSQL graph database where nodes and edges are fundamental elements in both the storage and the Cypher query language. Given its active development and use in enterprise systems, it is the best choice for meeting these scalability and performance considerations.

However, Neo4j prioritizes nodes far more than edges (see Neo4j’s introductory materials for more details). Their primary use case is something like IMDB: nodes for movies, shows, actors, directors, crew, etc., each with different labels associated with them, and edges representing relationships like “acted in” or “directed”. The nodes have the majority of the properties (names, dates of birth, gender, etc.), while relationships are sparse or empty. The annotation graph formalism is the opposite: nodes are relatively sparse (just a time point), and the edges contain the properties (label, token properties, speaker information, etc.). Neo4j uses indices to speed up queries, but these are focused on node properties rather than edge properties (or were at the beginning of development). As such, the storage model was modified from the annotation graph formalism into something more node-based, seen below.

[Figure: PolyglotDB implementation of the annotation graph formalism for a production of the word “polyglot”]

Rather than time points as nodes, the actual annotations are nodes, and relationships between them are either hierarchical (i.e., the phone P is contained by the syllable P.AA1, represented by solid lines in the figure above) or precedence (the phone P precedes the phone AA1, represented by dashed lines in the figure above). Each node has properties for begin and end time points, as well as any arbitrary encoded information (i.e., part of speech tags). Each node of a given annotation type (word, syllable, phone) is labelled as such in Neo4j, speeding up queries.

All interaction with corpus data is done through the CorpusContext class. When this class is instantiated, and then used as a context manager, it connects to both a Neo4j database and an InfluxDB database (described in more detail in InfluxDB implementation). When a corpus is imported (see Importing corpora), nodes and edges in the Neo4j database are created, along with appropriate labels on the nodes to organize and aid querying. By default, from a simple import of forced aligned TextGrids, the full list of node types in a fresh PolyglotDB Neo4j database is as follows:

:phone
:phone_type
:word
:word_type
:Speaker
:Discourse
:speech

In the previous figure (Fig. 2), for instance, all green nodes would have the Neo4j label :phone, all orange nodes would have the Neo4j label :syllable, and the purple node would have the Neo4j label :word. All nodes would also have the Neo4j label :speech. Each of the nodes in the figure would additionally have links to other aspects of the graph. The word node would link to a node with the Neo4j label :word_type, the syllable nodes would each link to a node with the Neo4j label :syllable_type, and the phone nodes would link to nodes with the Neo4j label :phone_type. These type nodes contain any type information that is not dependent on the particular production. Each node in the figure would also link to a :Speaker node for whoever produced the word, as well as a :Discourse node for whichever file it was recorded in.

Note

A special tag for the corpus name is added to every node in the corpus, in case multiple corpora are imported in the same database. For instance, if the CorpusContext is instantiated as CorpusContext('buckeye'), then any imported annotations would have a Neo4j label of :buckeye associated with them. If another CorpusContext was instantiated as CorpusContext('not_buckeye'), then any queries for annotations in the buckeye corpus would not be found, as it would be looking only at annotations tagged with the Neo4j label :not_buckeye.

The following node types can further be added to via enrichment (see Enrichment):

:pause
:utterance
:utterance_type (never really used)
:syllable
:syllable_type

In addition to node labels, Neo4j and Cypher use relationship labels on edges in queries. In the above example, all solid lines would have the label of :contained_by, as the lower annotation is contained by the higher one (see Corpus hierarchy representation below for details of the hierarchy implementation). All the dashed lines would have the Neo4j label of :precedes as the previous annotation precedes the following one. The following is a list of all the relationship types in the Neo4j database:

:is_a (relation between type and token nodes)
:precedes (precedence relation)
:precedes_pause (precedence relation for pauses when encoded)
:contained_by (hierarchical relation)
:spoken_by (relation between tokens and speakers)
:spoken_in (relation between tokens and discourses)
:speaks_in (relation between speakers and discourses)
:annotates (relation between annotations and subannotations)

Corpus hierarchy representation

Neo4j is a schemaless database: each node can have arbitrary information added to it without requiring that information on any other node. However, enforcing a bit of a schema on Neo4j is helpful for dealing with corpora, which are more structured than arbitrary graphs. For a user, it is useful to know when a typo has produced a property name that doesn’t exist on any of the annotations being queried. Additionally, knowing the type of the data stored (string, boolean, float, etc.) allows certain operations to be restricted (for instance, calculating a by-speaker z-score is only relevant for numeric properties). As such, a schema in the form of a Hierarchy is explicitly defined and used in PolyglotDB.

Each CorpusContext has a polyglotdb.structure.Hierarchy object which stores metadata about the corpus. Hierarchy objects are basically schemas for the Neo4j database, telling the user what information annotations of a given type should have (i.e., do word annotations have frequency as a type property? part_of_speech as a token property?). It also gives the strict hierarchy between levels of annotation. A freshly imported corpus with just words and phones will have a simple hierarchy in which phones are contained by words; enrichment can add more levels to the hierarchy for syllables and utterances. All aspects of the Hierarchy object are stored in the Neo4j database and synced with the CorpusContext object.
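
For illustration, a sketch of inspecting the hierarchy from a CorpusContext (this assumes the current Hierarchy API, with annotation_types and has_type_property; treat the exact names as an assumption):

with CorpusContext('corpus') as c:
    # which annotation levels exist, e.g. {'word', 'phone', 'syllable', ...}
    print(c.hierarchy.annotation_types)
    # does the word type carry a frequency property?
    print(c.hierarchy.has_type_property('word', 'frequency'))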

In the Neo4j graph, there is a Corpus root node, with all encoded annotation levels linked as they would be in an annotation graph for a given discourse (i.e., Utterance -> Word -> Syllable -> Phone in orange below). These nodes contain a list of properties that will be found on each node in the annotation graphs (i.e., label, begin, end), along with the type of data for each property (i.e., string, numeric, boolean, etc). There is also a property for subsets that is a list of all the token subsets of that annotation type. Each of these annotation nodes is linked to a type node (in blue below) that has a list of properties belonging to the type (i.e., in the figure below, word types have label, transcription, and frequency).

[Figure: corpus hierarchy graph, with annotation levels in orange, type nodes in blue, and subannotations in yellow]

In addition, if subannotations are encoded, they will be represented in the hierarchy graph as well (i.e., Burst, Closure, and Intonation in yellow above), along with all the properties they contain. Speaker and Discourse properties are encoded in the graph hierarchy object as well as any acoustics that have been encoded and are stored in the InfluxDB portion of the database (see Saving acoustics for details on encoding acoustic measures).

Query implementation

One of the primary functions of PolyglotDB is querying information in the Neo4j databases. The fundamental function of the polyglotdb.query module is to convert Python syntax and objects (referred to as PolyglotDB queries below) into Cypher strings that extract the correct elements from the Neo4j database. There is a fair bit of “magic” behind the scenes, as much of this conversion is done by hijacking built-in Python functionality. For instance, c.phone.label == 'AA1' does not actually return a boolean, but rather a Clause object. This Clause object has functionality for generating a Cypher string like node_phone.label = 'AA1', which would then be slotted into the appropriate place in the larger Cypher query. There is a larger Query object that has many subobjects, such as filters and columns to return, and uses these subobjects to fill a query template and generate the final Cypher query. This section attempts to break down the individual pieces that get added together to create the final Cypher query.
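
The operator-hijacking pattern can be illustrated with a self-contained toy sketch (these are not PolyglotDB's actual classes, which live in the attributes.py and elements.py files discussed below):

class ClauseElement:
    def __init__(self, attribute, operator, value):
        self.attribute = attribute
        self.operator = operator
        self.value = value

    def for_cypher(self):
        # render e.g. node_phone.label = 'AA1'
        return '{} {} {!r}'.format(self.attribute.for_cypher(), self.operator, self.value)

class Attribute:
    def __init__(self, node_alias, name):
        self.node_alias = node_alias
        self.name = name

    def for_cypher(self):
        return '{}.{}'.format(self.node_alias, self.name)

    def __eq__(self, other):
        # == builds a clause object rather than returning a boolean
        return ClauseElement(self, '=', other)

clause = Attribute('node_phone', 'label') == 'AA1'
print(clause.for_cypher())  # node_phone.label = 'AA1'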

There are four principal types of queries currently implemented in PolyglotDB, based on the information desired (annotation, lexicon, speaker, and discourse queries). Annotation queries are the most common, as they search over the produced annotation tokens in discourses. For instance, finding all stops in a particular environment and returning relevant information is going to be an annotation query, with each matching token having its own result. Lexicon queries are queries over annotation types rather than tokens. Speaker and discourse queries are those over their respective entities. Queries are constructed as Python objects (descended from polyglotdb.query.base.query.BaseQuery) and are generated from methods on a CorpusContext object, as below. Each of the four types of queries has its own submodule within the polyglotdb.query module.

Data type CorpusContext method Query class
Annotations polyglotdb.corpus.CorpusContext.query_graph() polyglotdb.query.annotations.query.GraphQuery
Lexicon polyglotdb.corpus.CorpusContext.query_lexicon() polyglotdb.query.lexicon.query.LexiconQuery
Speaker polyglotdb.corpus.CorpusContext.query_speakers() polyglotdb.query.speaker.query.SpeakerQuery
Discourse polyglotdb.corpus.CorpusContext.query_discourses() polyglotdb.query.discourse.query.DiscourseQuery

Each of the query submodules shares the same basic structure, which the following walk-through illustrates using a speaker query. In this example, we’ll extract the list of male speakers (with the assumption that speakers have been encoded for gender and that the corpus is appropriately named corpus). In Cypher, this query would be:

MATCH (node_Speaker:Speaker:corpus)
WHERE node_Speaker.gender = "male"
RETURN node_Speaker.name AS speaker_name

This query in polyglotdb would be:

with CorpusContext('corpus') as c:
    q = c.query_speakers() # Generate SpeakerQuery object
    q = q.filter(c.speaker.gender == 'male') # Filter to just the speakers that have `gender` set to "male"
    q = q.columns(c.speaker.name.column_name('speaker_name')) # Return just the speaker name (with the `speaker_name` alias)
    results = q.all()

The attributes.py file contains the definitions of classes corresponding to nodes and attributes in the Neo4j database. These classes have code for how they are represented in Cypher queries and how properties are extracted from results. As an example of a somewhat simple case, consider polyglotdb.query.speaker.attributes.SpeakerNode and polyglotdb.query.speaker.attributes.SpeakerAttribute. A SpeakerNode object will have an alias in the Cypher query of node_Speaker and an initial look-up definition for the query as follows:

(node_Speaker:Speaker:corpus)

The polyglotdb.query.speaker.attributes.SpeakerAttribute class is used for the gender and name attributes referenced in the query. These are created through calls like c.speaker.gender (the __getattr__ methods of both the CorpusContext class and the SpeakerNode class are overridden to allow for this kind of access). Speaker attributes use their node’s alias to construct how they are referenced in Cypher, i.e., for c.speaker.gender:

node_Speaker.gender

When the column_name function is called, an output alias is used when constructing RETURN statements in Cypher:

node_Speaker.name AS speaker_name

The crucial part of a query is, of course, the ability to filter. Filters are constructed using Python operators, such as == or !=, or functions replicating other operators, like .in_(). Operators on attributes return objects of classes defined in the elements.py file of a query submodule. For instance, polyglotdb.query.base.elements.EqualClauseElement is returned when == is used (as in the above query), and this object handles how to convert the operator into Cypher. In the above case of c.speaker.gender == 'male', it will generate the following Cypher code when requested:

node_Speaker.gender = "male"

The query.py file contains the definition of the Query class, descended from polyglotdb.query.base.query.BaseQuery. The filter and columns methods allow ClauseElements and Attributes to be added for the construction of the Cypher query. When all is called (or cypher, which does the actual creation of the Cypher string), the first step is to inspect the elements and attributes to see what nodes are necessary for the query. The definitions of each of these nodes are then concatenated into a list for the MATCH part of the Cypher query, giving the following for our example:

MATCH (node_Speaker:Speaker:corpus)

Next the filtering elements are constructed into a WHERE clause (separated by AND for more than one element), giving the following for our example:

WHERE node_Speaker.gender = "male"

And finally the RETURN statement is constructed from the list of columns specified (along with their specified column names):

RETURN node_Speaker.name AS speaker_name

If columns are not specified in the query, then a Python object containing all the information of the node is returned, according to classes in the models.py file of the submodule. For our speaker query, if the columns are omitted, then the returned results will have all speaker properties encoded in the corpus. In terms of implementation, the following query in polyglotdb

with CorpusContext('corpus') as c:
    q = c.query_speakers() # Generate SpeakerQuery object
    q = q.filter(c.speaker.gender == 'male') # Filter to just the speakers that have `gender` set to "male"
    results = q.all()
    print(results[0].name) # Get the name of the first result

will generate the following Cypher query:

MATCH (node_Speaker:Speaker:corpus)
WHERE node_Speaker.gender = "male"
RETURN node_Speaker

Annotation queries

Annotation queries are the most complicated kind due to all of the relationships linking nodes. Where Speaker, Discourse and Lexicon queries are really just lists of nodes with little linkages between nodes, Annotation queries leverage the relationships in the annotation graph quite a bit.

Basic query

Given a relatively basic query like the following:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.label == 'some_word')
    q = q.columns(c.word.label.column_name('word'), c.word.transcription.column_name('transcription'),
                  c.word.begin.column_name('begin'),
                  c.word.end.column_name('end'), c.word.duration.column_name('duration'))
    results = q.all()

Would give a Cypher query as follows:

MATCH (node_word:word:corpus)-[:is_a]->(node_word_type:word_type:corpus)
WHERE node_word_type.label = "some_word"
RETURN node_word_type.label AS word, node_word_type.transcription AS transcription,
       node_word.begin AS begin, node_word.end AS end,
       node_word.end - node_word.begin AS duration

The process of converting the Python code into the Cypher query is similar to the above speaker example, but each step has some complications. To begin with, rather than defining a single node, the annotation node definition contains two nodes: a word token node and a word type node, linked by the is_a relationship.

The use of type properties allows for a more efficient look up on the label property (for convenience and debugging, word tokens also have a label property). The Attribute objects will look up what properties are type vs token for constructing the Cypher statement.

Additionally, duration is a special property that is calculated from the token’s begin and end properties at query time. This way, if the time points are updated, the duration remains accurate. In terms of efficiency, subtraction at query time is not costly, and it saves the space of storing an additional property. Duration can still be used in filtering, i.e.:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.duration > 0.5)
    q = q.columns(c.word.label.column_name('word'),
                  c.word.begin.column_name('begin'),
                  c.word.end.column_name('end'))
    results = q.all()

which would give the Cypher query:

MATCH (node_word:word:corpus)-[:is_a]->(node_word_type:word_type:corpus)
WHERE node_word.end - node_word.begin > 0.5
RETURN node_word_type.label AS word, node_word.begin AS begin,
       node_word.end AS end

Precedence queries

Aspects of the previous annotation can be queried via precedence queries like the following:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label == 'AE')
    q = q.filter(c.phone.previous.label == 'K')
    results = q.all()

will result in the following Cypher query:

MATCH (node_phone:phone:corpus)-[:is_a]->(node_phone_type:phone_type:corpus),
(node_phone)<-[:precedes]-(prev_1_node_phone:phone:corpus)-[:is_a]->(prev_1_node_phone_type:phone_type:corpus)
WHERE node_phone_type.label = "AE"
AND prev_1_node_phone_type.label = "K"
RETURN node_phone, node_phone_type, prev_1_node_phone, prev_1_node_phone_type

Hierarchical queries

Hierarchical queries are those that reference some annotation higher or lower than the originally specified annotation. For instance to do a search on phones and also include information about the word as follows:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label == 'AE')
    q = q.filter(c.phone.word.label == 'cat')
    results = q.all()

This will result in a Cypher query as follows:

MATCH (node_phone:phone:corpus)-[:is_a]->(node_phone_type:phone_type:corpus),
(node_phone_word:word:corpus)-[:is_a]->(node_phone_word_type:word_type:corpus),
(node_phone)-[:contained_by]->(node_phone_word)
WHERE node_phone_type.label = "AE"
AND node_phone_word_type.label = "cat"
RETURN node_phone, node_phone_type, node_phone_word, node_phone_word_type

Spoken queries

Queries can include aspects of speaker and discourse as well. A query like the following:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.speaker.name == 'some_speaker')
    q = q.filter(c.phone.discourse.name == 'some_discourse')
    results = q.all()

Will result in the following Cypher query:

MATCH (node_phone:phone:corpus)-[:is_a]->(node_phone_type:phone_type:corpus),
(node_phone)-[:spoken_by]->(node_phone_Speaker:Speaker:corpus),
(node_phone)-[:spoken_in]->(node_phone_Discourse:Discourse:corpus)
WHERE node_phone_Speaker.name = "some_speaker"
AND node_phone_Discourse.name = "some_discourse"
RETURN node_phone, node_phone_type

Annotation query optimization

There are several aspects of query optimization that polyglotdb performs. The first is that the default objects returned are actually polyglotdb.query.annotations.query.SplitQuery objects rather than polyglotdb.query.annotations.query.GraphQuery objects. The behavior of these objects is to split a query by speaker or discourse, with a smaller GraphQuery per speaker/discourse. The results object that gets returned then iterates over the results objects returned by each of the GraphQuery objects.

In general, splitting queries by speaker/discourse (and sometimes both) is the main way that Cypher queries are kept performant in polyglotdb. Operations such as enriching syllables and utterances are quite complicated and can result in out-of-memory errors if the splits are too big (despite the optimizations recommended by Neo4j, such as using PERIODIC COMMIT to split the transactions).

Lexicon queries

Note

While the name of this type of query is lexicon, it’s really just queries over types, regardless of their linguistic level. Phone, syllable, and word types are all queried via this interface. Utterance types are not really used for anything other than consistency with the other annotations, as the space of possible utterances is basically infinite, but the spaces of possible phones, syllables, and words are more constrained, and type properties are more useful.

Lexicon queries are more efficient queries of annotation types than the annotation queries above. Assuming word types have been enriched with a frequency property, a polyglotdb query like:

with CorpusContext('corpus') as c:
    q = c.query_lexicon(c.word_lexicon) # Generate LexiconQuery object
    q = q.filter(c.word_lexicon.frequency > 100) # Subset of word types based on their frequency
    results = q.all()

Would result in a Cypher query like:

MATCH (node_word_type:word_type:corpus)
WHERE node_word_type.frequency > 100
RETURN node_word_type

Speaker/discourse queries

Speaker and discourse queries are relatively straightforward with only a few special annotation node types or attribute types. See Query implementation for an example using a SpeakerQuery.

The special speaker attribute is discourses which will return a list of the discourses that the speaker spoke in, and conversely, the speakers attribute of DiscourseNode objects will return a list of speakers who spoke in that discourse.

A polyglotdb query like the following:

with CorpusContext('corpus') as c:
    q = c.query_speakers() # Generate SpeakerQuery object
    q = q.filter(c.speaker.gender == 'male') # Filter to just the speakers that have `gender` set to "male"
    q = q.columns(c.speaker.discourses.name.column_name('discourses')) # Return the list of discourses the speaker spoke in (with the `discourses` alias)
    results = q.all()

will generate the following Cypher query:

MATCH (node_Speaker:Speaker:corpus)
WHERE node_Speaker.gender = "male"
WITH node_Speaker
MATCH (node_Speaker)-[speaks:speaks_in]->(node_Speaker_Discourse:Discourse:corpus)
WITH node_Speaker, collect(node_Speaker_Discourse) AS node_Speaker_Discourse
RETURN extract(n in node_Speaker_Discourse|n.name) AS discourses

Aggregation functions

In the simplest case, aggregation queries give a way to get an aggregate over the full query. For instance, given the following PolyglotDB query:

from polyglotdb.query.base.func import Average
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    result = q.aggregate(Average(c.phone.duration))

Will generate a resulting Cypher query like the following:

MATCH (node_phone:phone:corpus)-[:is_a]->(type_node_phone:phone_type:corpus)
WHERE node_phone.label = "aa"
RETURN avg(node_phone.end - node_phone.begin) AS average_duration

In this case, there will be one result returned: the average duration of all phones in the query. If, however, you wanted to get the average duration per phone type (i.e., for each of aa, iy, ih, and so on), then aggregation functions can be combined with group_by clauses:

from polyglotdb.query.base.func import Average
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone).filter(c.phone.label.in_(['aa', 'iy', 'ih']))
    results = q.group_by(c.phone.label.column_name('label')).aggregate(Average(c.phone.duration))

This will generate a Cypher query like the following:

MATCH (node_phone:phone:corpus)-[:is_a]->(type_node_phone:phone_type:corpus)
WHERE node_phone.label IN ["aa", "iy", "ih"]
RETURN node_phone.label AS label, avg(node_phone.end - node_phone.begin) AS average_duration

Note

See Aggregate functions for more details on the aggregation functions available.

InfluxDB implementation

This section details how PolyglotDB saves and structures data within InfluxDB. InfluxDB is a NoSQL time series database, with a SQL-like query language.

Note

This section assumes a bit of familiarity with the InfluxDB query language, which is largely based on SQL. See the InfluxDB documentation for more details and reference to other aspects of InfluxDB.

InfluxDB Schema

Each measurement encoded (i.e., pitch, intensity, formants) has a separate table in InfluxDB, similar to SQL. When querying, the query selects columns from a table (i.e., select * from "pitch"). Each row in InfluxDB minimally has a time field, as it is a time series database. In addition, each row will have queryable fields and tags, in InfluxDB parlance. Tags function like separate tables and are indexed, speeding up queries, while fields are simply values that are stored without being indexed. All InfluxDB tables have three tags (these create different indices for the database and speed up queries): speaker, discourse, and channel. The union of discourse (i.e., file name) and channel (usually 0, particularly for mono sound files) along with the time in seconds will always give a unique acoustic time point, and indexing by speaker is crucial for PolyglotDB’s algorithms.

Note

The time resolution for PolyglotDB is at the millisecond level. In general, I think having measurements every 10ms is a balanced time resolution for acoustic measures. Increasing the time resolution will also increase the processing time for PolyglotDB algorithms, as well as the database size. Time resolution is generally a property of the analyses done, so greater time resolution than 10 ms is possible, but not greater than 1 ms, as millisecond time resolution is hardcoded in the current code. Any time point will be rounded/truncated to the nearest millisecond.

In addition to these tags, there are several queryable fields that are always present alongside the measurement fields. First, the phone for the time point is saved to allow for efficient aggregation across phones. Second, the utterance_id for the time point is also saved. The utterance_id is used for general querying, where each utterance’s track for the requested acoustic property is queried once and then cached for any further results to use without needing to query InfluxDB again. For instance, a query on phone formant tracks might return 2000 phones. Without the utterance_id, there would be 2000 look-ups for formant tracks (each InfluxDB query taking about 0.15 seconds), but with utterance-based caching, the number of hits to the InfluxDB database is a fraction of that (though each query takes a little longer).
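
A toy sketch of the caching strategy described above (not PolyglotDB's actual code):

_utterance_track_cache = {}

def track_for(utterance_id, fetch_track):
    # fetch_track is whatever function hits InfluxDB; it is called at most
    # once per utterance, and later phones in the same utterance reuse
    # the cached track
    if utterance_id not in _utterance_track_cache:
        _utterance_track_cache[utterance_id] = fetch_track(utterance_id)
    return _utterance_track_cache[utterance_id]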

Note

For performance reasons internal to InfluxDB, phone and utterance_id are fields rather than tags, because the cross of them with speaker, discourse, and channel would lead to an extremely large cross of possible tag combinations. This mix of tags and fields has been found to be the most performant.

Finally, there are the actual measurements that are saved. Each acoustic track (i.e., pitch, formants, intensity) can have multiple measurements. For instance, a formants track can have F1, F2, F3, B1, B2, and B3, which are all stored together on the same time point and accessed at the same time. These measures are kept in the corpus hierarchy in Neo4j. Each measurement track (i.e. pitch) will be a node linked to the corpus (see the example in Corpus hierarchy representation). That node will have each property listed along with its data type (i.e., F0 is a float).

Optimizations for acoustic measures

PolyglotDB has default functions for generating pitch, intensity, and formant tracks (see Reference functions for specific examples and Saving acoustics for more details on how they are implemented). For implementing future built-in acoustic track analysis functions, one realm of optimization lies in the differently sampled files that PolyglotDB generates. On import, three files are generated per discourse, at 1,200Hz, 11,000Hz, and 16,000Hz. The intended purpose of these files is acoustic analysis of different kinds of segments/measurements. The file at 1,200Hz is ideal for pitch analysis (maximum pitch of 600Hz), and the file at 11,000Hz is ideal for formant analysis (maximum formant frequency of 5,500Hz). The file at 16,000Hz is intended for consonantal analysis (i.e., fricative spectral analysis) or any other algorithm requiring higher frequency information. The reason these three files are generated is that analysis functions generally include resampling to these frequencies as part of the analysis, so performing it ahead of time speeds up the analysis. Some programs also don’t include resampling (i.e., pitch estimation in REAPER), so using the appropriate file can lead to massive speed-ups.

Query implementation

Given a PolyglotDB query like the following:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.label == 'some_word')
    q = q.columns(c.word.label.column_name('word'), c.word.pitch.track)
    results = q.all()

Once the Cypher query completes and returns results for a matching word, that information is used to create an InfluxDB query. The inclusion of an acoustic column like the pitch track also ensures that necessary information like the utterance ID and begin and end time points of the word are returned. The above query would result in several queries like the following being run:

SELECT "time", "F0" from "pitch"
WHERE "discourse" = 'some_discourse'
AND "utterance_id" = 'some_utterance_id'
AND "speaker" = 'some_speaker'

The above query will get all pitch points for the utterance of the word in question, and create Python objects for the track (polyglotdb.acoustics.classes.Track) and each time point (polyglotdb.acoustics.classes.TimePoint). With the begin and end properties of the word, a slice of the track is added to the output row.

Aggregation

Unlike for aggregation of properties in the Neo4j database (see Aggregation functions), aggregation of acoustic properties occurs in Python rather than being implemented in a query to InfluxDB, for the same performance reasons above. By caching utterance tracks as needed, and then performing aggregation over necessary slices (i.e., words or phones), the overall query is much faster.

Low level implementation

Saving acoustics

The general pipeline for generating and saving acoustic measures is as follows:

  • Acoustic analysis using Conch’s analysis functions
  • Format output from Conch into InfluxDB format and fill in any needed information (phone labels)
  • Write points to InfluxDB
  • Update the Corpus hierarchy with information about acoustic properties

Acoustic analysis is first performed in Conch, a Python package for processing sound files into acoustic and auditory representations. To do so, segments are created in PolyglotDB through calls to polyglotdb.acoustics.segments.generate_segments() and related functions. The generated SegmentMapping object from Conch is an iterable of Segment objects. Each Segment minimally has a path to a sound file, the begin time stamp, the end time stamp, and the channel. With these four pieces of information, the waveform signal can be extracted and acoustic analysis can be performed. Segment objects can also have other properties associated with them, so that the SegmentMapping can be grouped into sensible bits of analysis (see SegmentMapping.grouped_mapping()). This is done in PolyglotDB to split analysis by speaker, for instance.

A SegmentMapping, or the mappings returned by grouped_mapping, can then be passed to analyze_segments, which in addition to a SegmentMapping takes a callable that accepts the minimal set of arguments above (file path, begin, end, and channel) and returns some sort of track or point measure from the signal segment. See below for a list of generator functions that return a callable to be used with analyze_segments. The analyze_segments function uses multiprocessing to apply the callable to each segment, allowing for speed-ups proportional to the number of available cores on the machine.
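
As a sketch, a generator function returning a callable with the argument set described above might look like the following (reusing segment_mapping from the sketch above; the measure computed is a trivial placeholder, and the top-level conch import is an assumption):

from conch import analyze_segments

def generate_duration_function():
    """Return a callable matching the (file path, begin, end, channel) interface."""
    def func(file_path, begin, end, channel):
        # a trivial point measure; a real function would read the signal
        # from file_path between begin and end on the given channel
        return {'duration': end - begin}
    return func

point_measures = analyze_segments(segment_mapping, generate_duration_function())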

Once the Conch analysis function completes, the tracks are saved via polyglotdb.corpus.AudioContext.save_acoustic_tracks(). In addition to the discourse, speaker, channel, and utterance_id, phone label information is also added to each time point’s measurements. These points are then saved using the write_points function of the InfluxDBClient, returned from the acoustic_client() function.

Querying acoustics

In general, the pipeline for querying is as follows:

  • Construct InfluxDB query string from function arguments
  • Pass this query string to an InfluxDBClient
  • Iterate over results and construct a polyglotdb.acoustics.classes.Track object

All audio functions, and hence all interaction with InfluxDB, are handled through the polyglotdb.corpus.AudioContext parent class of CorpusContext. Any constructed InfluxDB queries are executed through an InfluxDBClient, constructed in the polyglotdb.corpus.AudioContext.acoustic_client() function, which uses the InfluxDB connection parameters from the CorpusContext. As an example, see polyglotdb.corpus.AudioContext.get_utterance_acoustics. First, an InfluxDB client is constructed; then a query string is formatted from the relevant arguments passed to get_utterance_acoustics and from the relevant property names for the acoustic measure (e.g., F1, F2 and F3 for formants; see InfluxDB Schema for more details). This query string is then run via the query method of the InfluxDBClient. The results are iterated over, and a polyglotdb.acoustics.classes.Track object is constructed from the results and then returned.
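
Putting this together, the earlier pitch query could be run by hand as follows (the discourse, utterance, and speaker values are placeholders; get_points is part of the influxdb Python client):

from polyglotdb import CorpusContext

query = '''SELECT "time", "F0" FROM "pitch"
WHERE "discourse" = 'some_discourse'
AND "utterance_id" = 'some_utterance_id'
AND "speaker" = 'some_speaker' '''

with CorpusContext('corpus') as c:
    client = c.acoustic_client()
    result = client.query(query)
    for point in result.get_points('pitch'):
        print(point['time'], point['F0'])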

PolyglotDB I/O

In addition to documenting the IO module of PolyglotDB, this document should serve as a guide for implementing future importers for additional formats.

Import pipeline

Importing a corpus consists of several steps. First, a file must be inspected with the relevant inspect function (e.g., inspect_textgrid or inspect_buckeye). These functions generate Parsers for a given format that allow annotations across many tiers to be coalesced into linguistic types (words, segments, etc.).

As an example, suppose a TextGrid has an interval tier for word labels, an interval tier for phone labels, and tiers for annotating stop information (closure duration, bursts, VOT, etc.). In this case, our parser would want to associate the stop information annotations with the phones (or rather a subset of the phones), and not treat them as a separate linguistic type.

Following inspection, the file can be imported easily using a CorpusContext's load function. Under the hood, the Parser object creates standardized linguistic annotations from the annotations in the text file, which are then imported into the database.

The currently supported formats are described in the sections below.

Inspect

Inspect functions (e.g., inspect_textgrid) return a guess for how to parse the annotations present in a given file (or the files in a given directory). They return a parser of the respective type (e.g., TextgridParser) with an attribute for the detected annotation_tiers. For instance, the inspect function for TextGrids will return a parser with annotation types for each interval and point tier in the TextGrid.

Inspect TextGrids

Note

See TextGrid parser for full API of the TextGrid Parser

Consider the following TextGrid with interval tiers for words and phones:

[Figure: TextGrid with interval tiers for words and phones]

Running the inspect_textgrid function on this file will return two annotation types: from bottom to top, a phone annotation type and a word annotation type. Words and phones are two special linguistic types in PolyglotDB. Other linguistic types can be defined in a TextGrid (e.g., grouping words into utterances or phones into syllables, though functionality exists for computing both of those automatically), but word and phone tiers must be defined.

Note

Which tier corresponds to the special word and phone types is determined via heuristics. The first and most reliable is whether the tier name contains "word" or "phone". The second uses cutoffs for the mean and SD of word and phone durations in the Buckeye corpus to determine whether the intervals are more likely to be words or phones. For reference, the mean and SD used for words are 0.2465409 and 0.03175723, and those used for phones are 0.08327773 and 0.03175723.

From the above TextGrid, phones will have a label property (e.g., "Y"), a begin property (e.g., 0.15), and an end property (e.g., 0.25). Words will have a label property (e.g., "you"), a begin property (e.g., 0.15), and an end property (e.g., 0.25), as well as a computed transcription property made up of all of the included phones based on timings (e.g., "Y.UW1"). Any empty intervals will result in "words" with the label "<SIL>", which can then be marked as pauses later in corpus processing (see Encoding non-speech elements for more details).

Note

The computation of transcription uses the midpoints of phones: a phone is included in a word's transcription if its midpoint falls between the word's begin and end time points.
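
A minimal sketch of that check, using hypothetical (label, begin, end) tuples for phones:

def compute_transcription(word_begin, word_end, phones):
    """Build a transcription like 'Y.UW1' from phones whose midpoints fall in the word."""
    included = []
    for label, begin, end in phones:
        midpoint = (begin + end) / 2
        if word_begin <= midpoint <= word_end:
            included.append(label)
    return '.'.join(included)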

Inspect forced aligned TextGrids

Both the Montreal Forced Aligner and the FAVE-aligner generate TextGrids in two formats that PolyglotDB can parse. The first format is for files with a single speaker; these files will have two tiers, one for words (named words or word) and one for phones (named phones or phone). The second format is for files with multiple speakers, where each speaker has a pair of tiers for words (formatted as Speaker name - words) and phones (formatted as Speaker name - phones).

TextGrids generated by Web-MAUS have a single format, with a tier for words (named ORT), a tier for the canonical transcription (named KAN), and a tier for phones (named MAU). In parsing, only the tiers for words and phones are used, as the transcription is generated automatically.

Note

See MFA for full API of the MFA Parser, FAVE for full API of the FAVE Parser, and MAUS for the full API of the MAUS Parser.

Inspect LaBB-CAT formatted TextGrids

The LaBB-CAT system generates force-aligned TextGrids in a format that PolyglotDB can parse (though some editing may be required due to issues in exporting single speakers from LaBB-CAT). As with the other supported aligner output formats, PolyglotDB looks for word and phone tiers per speaker (or for just a single speaker, depending on export options). The parser uses transcript to find the word tiers (i.e., Speaker name - transcript) and segment to find the phone tiers (i.e., Speaker name - segment).

Note

See LaBB-CAT parser for full API of the LaBB-CAT Parser

Inspect Buckeye Corpus

The Buckeye Corpus is stored in an idiosyncratic format that has two text files per sound file (e.g., s0101a.wav): one detailing information about words (e.g., s0101a.words) and one detailing information about surface phones (e.g., s0101a.phones). The PolyglotDB parser extracts label, begin, and end for each phone. Words have type properties for their underlying transcription and token properties for their part of speech and begin/end.

Note

See Buckeye parser for full API of the Buckeye Parser

Inspect TIMIT Corpus

The TIMIT corpus is stored in an idiosyncratic format that has two text files per sound file (e.g., sa1.wav): one detailing information about words (e.g., sa1.WRD) and one detailing information about surface phones (e.g., sa1.PHN). The PolyglotDB parser extracts label, begin, and end for each phone and each word. Time stamps are converted from samples in the original text files to seconds for use in PolyglotDB.

Note

See TIMIT parser for full API of the TIMIT Parser

Modifying aspects of parsing

Additional properties for linguistic units can be imported as well through the use of extra interval tiers when using a TextGrid parser (see Inspect TextGrids), as in the following TextGrid:

[Figure: TextGrid with additional interval tiers for part of speech (POS) and transcription]

Here we have properties for each word's part of speech (POS tier) and transcription. The transcription tier will overwrite the automatic calculation of transcription based on contained segments. Each of these properties will be a type property by default (see Neo4j implementation for more details). If a property is meant to be a token-level property (i.e., the part of speech of a word varies depending on the token produced), this can be changed as follows:

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
parser.annotation_tiers[2].type_property = False # The index of the TextGrid tier for POS is 2

# ... code that uses the parser to import data

If the content of a tier should be ignored (e.g., if it contains information not related to any particular annotation), it can be manually marked as ignored as follows:

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
parser.annotation_tiers[0].ignored = True # Index of 0 if the first tier should be ignored

# ... code that uses the parser to import data

Parsers created through other inspect functions (e.g., for Buckeye) can be modified in similar ways, though the TextGrid parser is necessarily the most flexible.

Speaker parsers

There are two currently implemented schemes for parsing speaker names from a file path. The first is the Filename Speaker Parser, which takes a number of characters from the base file name (without the extension), starting either from the left or the right. For instance, for the path /path/to/buckeye/s0101a.words of a Buckeye file, taking 3 characters from the left would return the speaker s01.

The other speaker parser is the Directory Speaker Parser, which parses speakers from the directory that contains the specified path. For instance, the path /path/to/buckeye/s01/s0101a.words would return s01, because the containing folder of the file is named s01.
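
A sketch of constructing these parsers is below; the module path and constructor arguments are assumptions:

from polyglotdb.io.parsers.speaker import FilenameSpeakerParser, DirectorySpeakerParser

# three characters from the left of the base file name: s0101a.words -> s01
filename_parser = FilenameSpeakerParser(3)

# name of the containing folder: /path/to/buckeye/s01/s0101a.words -> s01
directory_parser = DirectorySpeakerParser()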

Load discourse

Loading of discourses is done via a CorpusContext’s load function:

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')

with CorpusContext(config) as c:
    c.load(parser, '/path/to/textgrid.TextGrid')

Alternatively, load_discourse can be used with the same arguments. The load function automatically determines whether the input path is a single file or a folder, and proceeds accordingly.

Load directory

As stated above, a CorpusContext's load function will import a directory of files as well as a single file, but load_directory can also be called explicitly:

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrids')

with CorpusContext(config) as c:
    c.load_directory(parser, '/path/to/textgrids')

Writing new parsers

New parsers can be created by extending either the base parser class or one of the more specialized parser classes. In general, three aspects need to be implemented. First, the _extensions property should be updated to reflect the file extensions that the parser will find and attempt to parse. This property should be an iterable, even if only one extension is to be used.

Second, the __init__ function should be implemented if anything above and beyond the base class's init function is required (e.g., special speaker parsing).

Finally, the parse_discourse function should be overridden to populate data on the annotation tiers from the source data files and ultimately create a DiscourseData object (an intermediate data representation for straightforward importing into Polyglot databases).

Creating new parsers for force-aligned TextGrids simply requires extending polyglotdb.io.parsers.aligner.AlignerParser and overwriting the word_label and phone_label class properties. The name property should also be set to something descriptive, and speaker_first should be set to False if speakers follow word/phone labels in the TextGrid tier names (i.e., words - Speaker name rather than Speaker name - words). See polyglotdb.io.parsers.mfa.MfaParser, polyglotdb.io.parsers.fave.FaveParser, polyglotdb.io.parsers.maus.MausParser, and polyglotdb.io.parsers.labbcat.LabbcatParser for examples.
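
For instance, a minimal new aligner parser might look like the following sketch (the tier names are placeholders):

from polyglotdb.io.parsers.aligner import AlignerParser

class MyAlignerParser(AlignerParser):
    name = 'My Aligner'
    word_label = 'words'
    phone_label = 'phones'
    # set to False if tiers are named like 'words - Speaker name'
    speaker_first = True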

Exporters

Under development.

API Reference

Contents:

Corpus API

Corpus classes
Base corpus
class polyglotdb.corpus.BaseContext(*args, **kwargs)[source]

Base CorpusContext class. Inherit from this and extend to create more functionality.

Parameters:
*args

If the first argument is not a CorpusConfig object, it is the name of the corpus

**kwargs

If a CorpusConfig object is not specified, all arguments and keyword arguments are passed to a CorpusConfig object

annotation_types

Get a list of all the annotation types in the corpus’s Hierarchy

Returns:
list

Annotation types

cache_hierarchy()[source]

Save corpus Hierarchy to the disk

cypher_safe_name

Escape the corpus name for use in Cypher queries

Returns:
str

Corpus name made safe for Cypher

discourses

Gets a list of discourses in the corpus

Returns:
list

Discourse names in the corpus

encode_type_subset(annotation_type, annotation_labels, subset_label)[source]

Encode a type subset from labels of annotations

Parameters:
annotation_type : str

Annotation type of labels

annotation_labels : list

a list of labels of annotations to subset together

subset_label : str

the label for the subset

execute_cypher(statement, **parameters)[source]

Executes a cypher query

Parameters:
statement : str

the cypher statement

parameters : kwargs

keyword arguments to execute a cypher statement

Returns:
BoltStatementResult

Result of Cypher query

exists()[source]

Check whether the corpus has a Hierarchy schema in the Neo4j database

Returns:
bool

True if the corpus Hierarchy has been saved to the database

hierarchy_path

Get the path to cached hierarchy information

Returns:
str

Path to the cached hierarchy data on disk

load_hierarchy()[source]

Load Hierarchy object from the cached version

lowest_annotation

Returns the annotation type that is the lowest in the Hierarchy.

Returns:
str

Lowest annotation type in the Hierarchy

phone_name

Gets the phone label

Returns:
str

phone name

phones

Get a list of all phone labels in the corpus.

Returns:
list

All phone labels in the corpus

query_discourses()[source]

Start a query over discourses in the corpus

Returns:
DiscourseQuery

DiscourseQuery object

query_graph(annotation_node)[source]

Start a query over the tokens of a specified annotation type (i.e. corpus.word)

Parameters:
annotation_node : polyglotdb.query.attributes.AnnotationNode

The type of annotation to look for in the corpus

Returns:
SplitQuery

SplitQuery object

query_lexicon(annotation_node)[source]

Start a query over types of a specified annotation type (i.e. corpus.lexicon_word)

Parameters:
annotation_node : polyglotdb.query.attributes.AnnotationNode

The type of annotation to look for in the corpus’s lexicon

Returns:
LexiconQuery

LexiconQuery object

query_speakers()[source]

Start a query over speakers in the corpus

Returns:
SpeakerQuery

SpeakerQuery object

remove_discourse(name)[source]

Remove the nodes and relationships associated with a single discourse in the corpus.

Parameters:
name : str

Name of the discourse to remove

reset(call_back=None, stop_check=None)[source]

Reset the Neo4j and InfluxDB databases for a corpus

Parameters:
call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early

reset_graph(call_back=None, stop_check=None)[source]

Remove all nodes and relationships in the corpus.

reset_type_subset(annotation_type, subset_label)[source]

Reset and remove a type subset

Parameters:
annotation_type : str

Annotation type of the subset

subset_label : str

the label for the subset

speakers

Gets a list of speakers in the corpus

Returns:
list

Speaker names in the corpus

word_name

Gets the word label

Returns:
str

word name

words

Get a list of all word labels in the corpus.

Returns:
list

All word labels in the corpus

Phonological functionality
class polyglotdb.corpus.PhonologicalContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with phones

encode_class(phones, label)[source]

encodes phone classes

Parameters:
phones : list

a list of phones

label : str

the label for the class

encode_features(feature_dict)[source]

Encodes a dictionary of features as type properties on the corresponding phones

Parameters:
feature_dict : dict

features to encode

enrich_features(feature_data, type_data=None)[source]

Sets the data type and feature data, initializes importers for feature data, adds features to hierarchy for a phone

Parameters:
feature_data : dict

the enrichment data

type_data : dict

By default None

enrich_inventory_from_csv(path)[source]

Enriches corpus from a csv file

Parameters:
path : str

the path to the csv file

remove_pattern(pattern='[0-2]')[source]

removes a stress or tone pattern from all phones

Parameters:
pattern : str

the regular expression for the pattern to remove Defaults to ‘[0-2]’

reset_class(label)[source]

Reset and remove a subset

Parameters:
label : str

Subset name to remove

reset_features(feature_names)[source]

resets features

Parameters:
feature_names : list

list of names of features to remove

reset_inventory_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

reset_to_old_label()[source]

Reset phones back to their old labels which include stress and tone

Syllabic functionality
class polyglotdb.corpus.SyllabicContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with syllables

encode_stress_from_word_property(word_property_name)[source]

Use a property on words formatted like “0-1-0” to encode stress on syllables.

The number of syllables and the position of syllables within a word will also be encoded as a result of this function.

Parameters:
word_property_name : str

Property name of words that contains the stress pattern

encode_stress_to_syllables(regex=None, clean_phone_label=True)[source]

Use numbers (0-9) in phone labels as stress property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.

Parameters:
regex : str

Regular expression character set for finding stress in the phone label

clean_phone_label : bool

Flag for removing regular expression from the phone labels

encode_syllabic_segments(phones)[source]

Encode a list of phones as ‘syllabic’

Parameters:
phones : list

A list of vowels and syllabic consonants

encode_syllables(algorithm='maxonset', syllabic_label='syllabic', call_back=None, stop_check=None)[source]

Encodes syllables to a corpus

Parameters:
algorithm : str, defaults to ‘maxonset’

determines which algorithm will be used to encode syllables

syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early

encode_tone_to_syllables(regex=None, clean_phone_label=True)[source]

Use numbers (0-9) in phone labels as tone property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.

Parameters:
regex : str

Regular expression character set for finding tone in the phone label

clean_phone_label : bool

Flag for removing regular expression from the phone labels

enrich_syllables(syllable_data, type_data=None)[source]

Sets the data type and syllable data, initializes importers for syllable data, and adds features to the hierarchy for syllables

Parameters:
syllable_data : dict

the enrichment data

type_data : dict

By default None

find_codas(syllabic_label='syllabic')[source]

Gets syllable codas across the corpus

Parameters:
syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

Returns:
data : dict

A dictionary with coda values as keys and frequency values as values

find_onsets(syllabic_label='syllabic')[source]

Gets syllable onsets across the corpus

Parameters:
syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

Returns:
data : dict

A dictionary with onset values as keys and frequency values as values

has_syllabics

Check whether there is a phone subset named syllabic

Returns:
bool

True if syllabic is found as a phone subset

has_syllables

Check whether the corpus has syllables encoded

Returns:
bool

True if the syllables are in the Hierarchy

reset_syllables(call_back=None, stop_check=None)[source]

Resets syllables, removes syllable annotation, removes onset, coda, and nucleus labels

Parameters:
call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early

Lexical functionality
class polyglotdb.corpus.LexicalContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with words

enrich_lexicon(lexicon_data, type_data=None, case_sensitive=False)[source]

Adds properties to the lexicon and adds them to the hierarchy

Parameters:
lexicon_data : dict

the data in the lexicon

type_data : dict

default to None

case_sensitive : bool

default to False

enrich_lexicon_from_csv(path, case_sensitive=False)[source]

Enriches lexicon from a CSV file

Parameters:
path : str

the path to the csv file

case_sensitive : boolean

Defaults to false

reset_lexicon_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

Pause functionality
class polyglotdb.corpus.PauseContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with non-speech elements

encode_pauses(pause_words, call_back=None, stop_check=None)[source]

Set words to be pauses, as opposed to speech.

Parameters:
pause_words : str, list, tuple, or set

Either a list of words that are pauses or a string containing a regular expression that specifies pause words

call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether process should be terminated early

has_pauses

Check whether corpus has encoded pauses

Returns:
bool

True if pause is in the subsets available for words

reset_pauses()[source]

Revert all words marked as pauses to regular words marked as speech

Utterance functionality
class polyglotdb.corpus.UtteranceContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with utterances

encode_speech_rate(subset_label, call_back=None, stop_check=None)[source]

Encodes speech rate

Parameters:
subset_label : str

the name of the subset to encode

encode_utterance_position(call_back=None, stop_check=None)[source]

Encodes position_in_utterance for a word

encode_utterances(min_pause_length=0.5, min_utterance_length=0, call_back=None, stop_check=None)[source]

Encode utterance annotations based on minimum pause length and minimum utterance length. See get_pauses for more information about the algorithm.

Once this function is run, utterances will be queryable like other annotation types.

Parameters:
min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
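
As an illustration (assuming pauses have already been encoded, here using the "<SIL>" label mentioned in the TextGrid parsing section; the 0.15 second threshold is just an example):

from polyglotdb import CorpusContext

with CorpusContext('corpus') as c:
    c.encode_pauses(['<SIL>'])
    c.encode_utterances(min_pause_length=0.15)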

enrich_utterances(utterance_data, type_data=None)[source]

Adds properties to utterances and adds them to the hierarchy

Parameters:
utterance_data : dict

the data to enrich with

type_data : dict

default to None

get_utterance_ids(discourse, min_pause_length=0.5, min_utterance_length=0)[source]

Algorithm to find utterance boundaries in a discourse.

Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.

Parameters:
discourse : str

String identifier for a discourse

min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance

get_utterances(discourse, min_pause_length=0.5, min_utterance_length=0)[source]

Algorithm to find utterance boundaries in a discourse.

Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.

Parameters:
discourse : str

String identifier for a discourse

min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance

reset_speech_rate()[source]

resets speech_rate

reset_utterance_position()[source]

resets position_in_utterance

reset_utterances()[source]

Remove all utterance annotations.

Audio functionality
class polyglotdb.corpus.AudioContext(*args, **kwargs)[source]

Class that contains methods for dealing with audio files for corpora

acoustic_client()[source]

Generate a client to connect to the InfluxDB for the corpus

Returns:
InfluxDBClient

Client through which to run queries and writes

analyze_formant_tracks(source='praat', stop_check=None, call_back=None, multiprocessing=True, vowel_label=None)[source]

Compute formant tracks and save them to the database

See polyglotdb.acoustics.formants.base.analyze_formant_tracks() for more details.

Parameters:
source : str

Program to compute formants

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

vowel_label : str, optional

Optional subset of phones to compute tracks over. If None, then tracks over utterances are computed.

analyze_intensity(source='praat', stop_check=None, call_back=None, multiprocessing=True)[source]

Compute intensity tracks and save them to the database

See polyglotdb.acoustics.intensity..analyze_intensity() for more details.

Parameters:
source : str

Program to compute intensity (only praat is supported)

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

analyze_pitch(source='praat', algorithm='base', absolute_min_pitch=50, absolute_max_pitch=500, adjusted_octaves=1, stop_check=None, call_back=None, multiprocessing=True)[source]

Analyze pitch tracks and save them to the database.

See polyglotdb.acoustics.pitch.base.analyze_pitch() for more details.

Parameters:
source : str

Program to use for analyzing pitch, either praat or reaper

algorithm : str

Algorithm to use, base, gendered, or speaker_adjusted

absolute_min_pitch : int

Absolute pitch floor

absolute_max_pitch : int

Absolute pitch ceiling

adjusted_octaves : int

How many octaves around the speaker’s mean pitch to set the speaker adjusted pitch floor and ceiling

stop_check : callable

Function to check whether processing should stop early

call_back : callable

Function to report progress

multiprocessing : bool

Flag whether to use multiprocessing or threading

analyze_script(phone_class=None, subset=None, annotation_type=None, script_path=None, duration_threshold=0.01, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]

Use a Praat script to analyze annotation types in the corpus. The Praat script must return properties per phone (i.e., point measures, not a track), and these properties will be saved to the Neo4j database.

See polyglotdb.acoustics.other..analyze_script() for more details.

Parameters:
phone_class : str

DEPRECATED, the name of an already encoded subset of phones on which the analysis will be run

subset : str, optional

the name of an already encoded subset of an annotation type, on which the analysis will be run

annotation_type : str

the type of annotation that the analysis will go over

script_path : str

Path to the Praat script

duration_threshold : float

Minimum duration that phones should be to be analyzed

arguments : list

Arguments to pass to the Praat script

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

Returns:
list

List of the names of newly added properties to the Neo4j database

analyze_track_script(acoustic_name, properties, script_path, duration_threshold=0.01, phone_class=None, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]

Use a Praat script to analyze phones in the corpus. The Praat script must return a track, and these tracks will be saved to the InfluxDB database.

See polyglotdb.acoustics.other..analyze_track_script() for more details.

Parameters:
acoustic_name : str

Name of the acoustic measure

properties : list

List of tuples of the form (property_name, Type)

script_path : str

Path to the Praat script

duration_threshold : float

Minimum duration that phones should be to be analyzed

phone_class : str

Name of the phone subset to analyze

arguments : list

Arguments to pass to the Praat script

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

analyze_utterance_pitch(utterance, source='praat', **kwargs)[source]

Analyze a single utterance’s pitch track.

See polyglotdb.acoustics.pitch.base.analyze_utterance_pitch() for more details.

Parameters:
utterance : str

Utterance ID from Neo4j

source : str

Program to use for analyzing pitch, either praat or reaper

kwargs

Additional settings to use in analyzing pitch

Returns:
Track

Pitch track

analyze_vot(classifier, stop_label='stops', stop_check=None, call_back=None, multiprocessing=False, overwrite_edited=False, vot_min=5, vot_max=100, window_min=-30, window_max=30)[source]

Compute VOTs for stops and save them to the database.

See polyglotdb.acoustics.vot.base.analyze_vot() for more details.

Parameters:
classifier : str

Path to an AutoVOT classifier model

stop_label : str

Label of subset to analyze

vot_min : int

Minimum VOT in ms

vot_max : int

Maximum VOT in ms

window_min : int

Window minimum in ms

window_max : int

Window maximum in ms

overwrite_edited : bool

If True, overwrite VOT measurements that have the "edited" property set to true

call_back : callable

call back function, optional

stop_check : callable

stop check function, optional

multiprocessing : bool

Flag to use multiprocessing, otherwise will use threading

discourse_audio_directory(discourse)[source]

Return the directory for the stored audio files for a discourse

discourse_has_acoustics(acoustic_name, discourse)[source]

Return whether a discourse has any specific acoustic values associated with it

Parameters:
acoustic_name : str

Name of the acoustic type

discourse : str

Name of the discourse

Returns:
bool

discourse_sound_file(discourse)[source]

Get details for the audio file paths for a specified discourse.

Parameters:
discourse : str

Name of the audio file in the corpus

Returns:
dict

Information for the audio file path

encode_acoustic_statistic(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]

Computes summary statistics (on a by-speaker or by-phone basis, or both) for a given acoustic measure and saves them as type properties.

Parameters:
acoustic_name : str

Name of the acoustic type

statistic : str

One of mean, median, stddev, sum, mode, count

by_speaker : bool, defaults to False

Flag for calculating summary statistic by speaker

by_phone : bool, defaults to True

Flag for calculating summary statistic by phone

execute_influxdb(query)[source]

Execute an InfluxDB query for the corpus

Parameters:
query : str

Query to run

Returns:
influxdb.resultset.ResultSet

Results of the query

genders()[source]

Gets all values of speaker property named gender in the Neo4j database

Returns:
list

List of gender values

generate_spectrogram(discourse, file_type='consonant', begin=None, end=None)[source]

Generate a spectrogram from an audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

begin : float

Timestamp in seconds

end : float

Timestamp in seconds

Returns:
numpy.array

Spectrogram information

float

Time step between each window

float

Frequency step between each frequency bin

get_acoustic_measure(acoustic_name, discourse, begin, end, channel=0, relative_time=False, **kwargs)[source]

Get an acoustic track for a given discourse and time range

Parameters:
acoustic_name : str

Name of acoustic track

discourse : str

Name of the discourse

begin : float

Beginning of time range

end : float

End of time range

channel : int, defaults to 0

Channel of the audio file

relative_time : bool, defaults to False

Flag for retrieving relative time instead of absolute time

kwargs : kwargs

Tags to filter on

Returns:
polyglotdb.acoustics.classes.Track

Track object

get_acoustic_statistic(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]

Computes summary statistics on a by-speaker or by-phone basis (or both) for a given acoustic measure.

Parameters:
acoustic_name : str

Name of the acoustic type

statistic : str

One of mean, median, stddev, sum, mode, count

by_speaker : bool, defaults to False

Flag for calculating summary statistic by speaker

by_phone : bool, defaults to True

Flag for calculating summary statistic by phone

Returns:
dict

Dictionary where keys are phone/speaker/phone-speaker pairs and values are the summary statistic of the acoustic measure

get_utterance_acoustics(acoustic_name, utterance_id, discourse, speaker)[source]

Get an acoustic track for a given utterance

Parameters:
acoustic_name : str

Name of acoustic track

utterance_id : str

ID of the utterance from the Neo4j database

discourse : str

Name of the discourse

speaker : str

Name of the speaker

Returns:
polyglotdb.acoustics.classes.Track

Track object

has_all_sound_files()[source]

Check whether all discourses have a sound file

Returns:
bool

True if a sound file exists for each discourse name in corpus, False otherwise

has_sound_files

Check whether any discourses have a sound file

Returns:
bool

True if there are any sound files at all, False otherwise

load_audio(discourse, file_type)[source]

Loads a given audio file at the specified sampling rate type (consonant, vowel or low_freq). Consonant files have a sampling rate of 16 kHz, vowel files a sampling rate of 11 kHz, and low frequency files a sampling rate of 1.2 kHz.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

Returns:
numpy.array

Audio signal

int

Sampling rate of the file

load_waveform(discourse, file_type='consonant', begin=None, end=None)[source]

Loads a segment of a larger audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

begin : float, optional

Timestamp in seconds

end : float, optional

Timestamp in seconds

Returns:
numpy.array

Audio signal

int

Sampling rate of the file

reassess_utterances(acoustic_name)[source]

Update utterance IDs in InfluxDB for more efficient querying if utterances have been re-encoded after acoustic measures were encoded

Parameters:
acoustic_name : str

Name of the measure for which to update utterance IDs

relativize_acoustic_measure(acoustic_name, by_speaker=True, by_phone=False)[source]

Relativize acoustic tracks by taking the z-score of the points (using by-speaker or by-phone means and standard deviations, or both) and save them as separate measures, e.g., F0_relativized from F0.

Parameters:
acoustic_name : str

Name of the acoustic measure

by_speaker : bool, defaults to True

Flag for relativizing by speaker

by_phone : bool, defaults to False

Flag for relativizing by phone

reset_acoustic_measure(acoustic_type)[source]

Reset a given acoustic measure

Parameters:
acoustic_type : str

Name of the acoustic measurement to reset

reset_acoustics()[source]

Reset all acoustic measures currently encoded

reset_formant_points()[source]

Reset formant point measures encoded in the corpus

reset_relativized_acoustic_measure(acoustic_name)[source]

Reset any relativized measures that have been encoded for a specified type of acoustics

Parameters:
acoustic_name : str

Name of the acoustic type

reset_vot()[source]

Reset all VOT measurements in the corpus

save_acoustic_track(acoustic_name, discourse, track, **kwargs)[source]

Save an acoustic track for a sound file

Parameters:
acoustic_name : str

Name of the acoustic type

discourse : str

Name of the discourse

track : Track

Track to save

kwargs: kwargs

Tags to save for acoustic measurements

save_acoustic_tracks(acoustic_name, tracks, speaker)[source]

Save multiple acoustic tracks for a collection of analyzed segments

Parameters:
acoustic_name : str

Name of the acoustic type

tracks : iterable

Iterable of Track objects to save

speaker : str

Name of the speaker of the tracks

update_utterance_pitch_track(utterance, new_track)[source]

Save a pitch track for the specified utterance.

See polyglotdb.acoustics.pitch.base.update_utterance_pitch_track() for more details.

Parameters:
utterance : str

Utterance ID from Neo4j

new_track : list or Track

Pitch track

Returns:
int

Time stamp of update

utterance_sound_file(utterance_id, file_type='consonant')[source]

Generate an audio file just for a single utterance in an audio file.

Parameters:
utterance_id : str

Utterance ID from Neo4j

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

Returns:
str

Path to the generated sound file

Summarization functionality
class polyglotdb.corpus.SummarizedContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with summary measures for linguistic items

average_speech_rate()[source]

Get the average speech rate for each speaker in a corpus

Returns:
result: list

the average speech rate by speaker

baseline_duration(annotation, speaker=None)[source]

Get the baseline duration of each word in the corpus. Baseline duration is determined by summing the average durations of the constituent phones of a word. If there is no underlying transcription available, the longest duration is considered the baseline.

Parameters:
annotation : str

Annotation type to compute baseline durations for

speaker : str

a speaker name, if desired (defaults to None)

Returns:
word_totals : dict

a dictionary of words and baseline durations

encode_baseline(annotation_type, property_name, by_speaker=False)[source]

Encode a baseline measure of a property, that is, the expected value of a higher annotation given the average property value of the phones that make it up. For instance, the expected duration of a word or syllable given its phonological content.

Parameters:
annotation_type : str

Name of annotation type to compute for

property_name : str

Property of phones to compute based off of (i.e., duration)

by_speaker : bool

Flag for whether to use by-speaker means

encode_measure(property_name, statistic, annotation_type, by_speaker=False)[source]

Compute and save an aggregate measure for annotation types

Available statistic names:

  • mean/average/avg
  • sd/stdev

Parameters:
property_name : str

Name of the property

statistic : str

Name of the statistic to use for aggregation

annotation_type : str

Name of the annotation type

by_speaker : bool

Flag for whether to compute aggregation by speaker

encode_relativized(annotation_type, property_name, by_speaker=False)[source]

Compute and save to the database a relativized measure (i.e., the property value z-scored using a mean and standard deviation computed from the corpus). The computation of means and standard deviations can be by-speaker.

Parameters:
annotation_type : str

Name of the annotation type

property_name : str

Name of the property to relativize

by_speaker : bool

Flag to use by-speaker means and standard deviations

get_measure(data_name, statistic, annotation_type, by_speaker=False, speaker=None)[source]

abstract function to get statistic for the data_name of an annotation_type

Parameters:
data_name : str

the aspect to summarize (duration, pitch, formants, etc)

statistic : str

how to summarize (mean, stdev, median, etc)

annotation_type : str

the annotation to summarize

by_speaker : boolean

whether to summarize by speaker or not

speaker : str

the specific speaker to encode baseline duration for (only for baseline duration)

make_dict(data, speaker=False, label=None)[source]

turn data results into a dictionary for encoding

Parameters:
data : list

a list returned by cypher

Returns:
finaldict : dict

a dictionary in the format for enrichment

Spoken functionality
class polyglotdb.corpus.SpokenContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with speaker and sound file metadata

enrich_discourses(discourse_data, type_data=None)[source]

Add properties about discourses to the corpus, allowing them to be queryable.

Parameters:
discourse_data : dict

the data about the discourse to add

type_data : dict

Specifies the type of the data to be added, defaults to None

enrich_discourses_from_csv(path)[source]

Enriches discourses from a csv file

Parameters:
path : str

the path to the csv file

enrich_speakers(speaker_data, type_data=None)[source]

Add properties about speakers to the corpus, allowing them to be queryable.

Parameters:
speaker_data : dict

the data about the speakers to add

type_data : dict

Specifies the type of the data to be added, defaults to None

enrich_speakers_from_csv(path)[source]

Enriches speakers from a csv file

Parameters:
path : str

the path to the csv file

get_channel_of_speaker(speaker, discourse)[source]

Get the channel that the speaker is in

Parameters:
speaker : str

Speaker to query

discourse : str

Discourse to query

Returns:
int

Channel of audio that speaker is in

get_discourses_of_speaker(speaker)[source]

Get a list of all discourses that a given speaker spoke in

Parameters:
speaker : str

Speaker to query over

Returns:
list

All discourses the speaker spoke in

get_speakers_in_discourse(discourse)[source]

Get a list of all speakers that spoke in a given discourse

Parameters:
discourse : str

Audio file to query over

Returns:
list

All speakers who spoke in the discourse

make_speaker_annotations_dict(data, speaker, property)[source]

Helper function to turn a dictionary of annotations and values into the format {speaker: {property: {data}}}

Parameters:
data : dict

annotations and values

property : str

the name of the property being encoded

speaker : str

the name of the speaker

reset_discourse_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

reset_speaker_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

Structured functionality
class polyglotdb.corpus.StructuredContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with metadata for the corpus

encode_count(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes the count of the lower type in the higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset

encode_hierarchy()[source]

Sync the current Hierarchy to the Neo4j database and to the disk

encode_position(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes position of lower type in higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset

encode_rate(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes the rate of the lower type in the higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset

generate_hierarchy()[source]

Get hierarchy schema information from the Neo4j database

Returns:
Hierarchy

the structure of the corpus

query_metadata(annotation)[source]

Start a query over metadata

Parameters:
annotation : Node
Returns:
MetaDataQuery

MetaDataQuery object

refresh_hierarchy()[source]

Save the Neo4j database schema to the disk

reset_hierarchy()[source]

Delete the Hierarchy schema in the Neo4j database

reset_property(annotation_type, name)[source]

Removes property from hierarchy

Parameters:
annotation_type : str

the annotation type from which the property is removed

name : str

the column name

Annotation functionality
class polyglotdb.corpus.AnnotatedContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with annotations on linguistic items (termed "subannotations" in PolyglotDB)

Omnibus class
class polyglotdb.corpus.CorpusContext(*args, **kwargs)[source]

Main corpus context, inherits from the more specialized contexts.

Parameters:
args : args

Either a CorpusConfig object or sequence of arguments to be passed to a CorpusConfig object

kwargs : kwargs

sequence of keyword arguments to be passed to a CorpusConfig object

Corpus structure class
class polyglotdb.structure.Hierarchy(data=None, corpus_name=None)[source]

Class containing information about how a corpus is structured.

Hierarchical data is stored in the form of a dictionary with keys for linguistic types, and values for the linguistic type that contains them. If no other type contains a given type, its value is None.

Subannotation data is stored in the form of a dictionary with keys for linguistic types, and values of sets of types of subannotations.

Parameters:
data : dict

Information about the hierarchy of linguistic types

corpus_name : str

Name of the corpus

acoustics

Get all currently encoded acoustic measurements in the corpus

Returns:
list

All encoded acoustic measures

add_acoustic_properties(corpus_context, acoustic_type, properties)[source]

Add acoustic properties to an encoded acoustic measure. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

acoustic_type : str

Acoustic measure to add properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_annotation_type(annotation_type, above=None, below=None)[source]

Adds an annotation type to the Hierarchy object along with default type and token properties for the new annotation type

Parameters:
annotation_type : str

Annotation type to add

above : str

Annotation type that is contained by the new annotation type, leave out if new annotation type is at the bottom of the hierarchy

below : str

Annotation type that contains the new annotation type, leave out if new annotation type is at the top of the hierarchy

add_discourse_properties(corpus_context, properties)[source]

Adds discourse properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_speaker_properties(corpus_context, properties)[source]

Adds speaker properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_subannotation_properties(corpus_context, subannotation_type, properties)[source]

Adds properties for a subannotation type to the Hierarchy object and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Name of the subannotation type

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_subannotation_type(corpus_context, annotation_type, subannotation_type, properties=None)[source]

Adds a subannotation type for a given annotation type to the Hierarchy object and syncs it to a Neo4j database. The optional list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add a subannotation to

subannotation_type : str

Name of the subannotation type

properties : iterable

Optional iterable of tuples of the form (property_name, Type)

add_token_properties(corpus_context, annotation_type, properties)[source]

Adds token properties for an annotation type and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add token properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_token_subsets(corpus_context, annotation_type, subsets)[source]

Adds token subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to add subsets for

subsets : iterable

List of subsets to add for the annotation tokens

add_type_properties(corpus_context, annotation_type, properties)[source]

Adds type properties for an annotation type and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add type properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_type_subsets(corpus_context, annotation_type, subsets)[source]

Adds type subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to add subsets for

subsets : iterable

List of subsets to add for the annotation type

annotation_types

Get a list of all the annotation types in the hierarchy

Returns:
list

All annotation types in the hierarchy

from_json(json)[source]

Set all properties from a dictionary deserialized from JSON

Parameters:
json : dict

Object information

get_depth(lower_type, higher_type)[source]

Get the distance between two annotation types in the hierarchy

Parameters:
lower_type : str

Name of the lower type

higher_type : str

Name of the higher type

Returns:
int

Distance between the two types

get_higher_types(annotation_type)[source]

Get all annotation types that are higher than the specified annotation type

Parameters:
annotation_type : str

Annotation type from which to get higher annotation types

Returns:
list

List of all annotation types that are higher than the specified annotation type

get_lower_types(annotation_type)[source]

Get all annotation types that are lower than the specified annotation type

Parameters:
annotation_type : str

Annotation type from which to get lower annotation types

Returns:
list

List of all annotation types that are lower than the specified annotation type

has_discourse_property(key)[source]

Check for whether discourses have a given property

Parameters:
key : str

Property to check for

Returns:
bool

True if discourses have the given property

has_speaker_property(key)[source]

Check for whether speakers have a given property

Parameters:
key : str

Property to check for

Returns:
bool

True if speakers have the given property

has_subannotation_property(subannotation_type, property_name)[source]

Check whether the Hierarchy has a property associated with a subannotation type

Parameters:
subannotation_type : str

Name of subannotation to check

property_name : str

Name of the property to check for

Returns:
bool

True if subannotation type has the given property name

has_subannotation_type(subannotation_type)[source]

Check whether the Hierarchy has a subannotation type

Parameters:
subannotation_type : str

Name of subannotation to check for

Returns:
bool

True if subannotation type is present

has_token_property(annotation_type, key)[source]

Check whether a given annotation type has a given token property.

Parameters:
annotation_type : str

Annotation type to check for the given token property

key : str

Property to check for

Returns:
bool

True if the annotation type has the given token property

has_token_subset(annotation_type, key)[source]

Check whether a given annotation type has a given token subset.

Parameters:
annotation_type : str

Annotation type to check for the given token subset

key : str

Subset to check for

Returns:
bool

True if the annotation type has the given token subset

has_type_property(annotation_type, key)[source]

Check whether a given annotation type has a given type property.

Parameters:
annotation_type : str

Annotation type to check for the given type property

key : str

Property to check for

Returns:
bool

True if the annotation type has the given type property

has_type_subset(annotation_type, key)[source]

Check whether a given annotation type has a given type subset.

Parameters:
annotation_type : str

Annotation type to check for the given type subset

key : str

Subset to check for

Returns:
bool

True if the annotation type has the given type subset

highest

Get the highest annotation type of the Hierarchy

Returns:
str

Highest annotation type

highest_to_lowest

Get a list of annotation types sorted from highest to lowest

Returns:
list

Annotation types from highest to lowest

items()[source]

Key/value pairs for the hierarchy.

Returns:
generator

Items of the hierarchy

keys()[source]

Keys (linguistic types) of the hierarchy.

Returns:
generator

Keys of the hierarchy

lowest

Get the lowest annotation type of the Hierarchy

Returns:
str

Lowest annotation type

lowest_to_highest

Get a list of annotation types sorted from lowest to highest

Returns:
list

Annotation types from lowest to highest

phone_name

Alias function for getting the lowest annotation type

Returns:
str

Name of the lowest annotation type

remove_acoustic_properties(corpus_context, acoustic_type, properties)[source]

Remove acoustic properties from an encoded acoustic measure.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

acoustic_type : str

Acoustic measure to remove properties for

properties : iterable

List of property names

remove_annotation_type(annotation_type)[source]

Removes an annotation type from the hierarchy

Parameters:
annotation_type : str

Annotation type to remove

remove_discourse_properties(corpus_context, properties)[source]

Removes discourse properties and syncs the change to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

List of property names to remove

remove_speaker_properties(corpus_context, properties)[source]

Removes speaker properties and syncs the change to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

List of property names to remove

remove_subannotation_properties(corpus_context, subannotation_type, properties)[source]

Removes properties for a subannotation type from the Hierarchy object and syncs the change to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Name of the subannotation type

properties : iterable

List of property names to remove

remove_subannotation_type(corpus_context, subannotation_type)[source]

Remove a subannotation type from the Hierarchy object and sync it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Subannotation type to remove

remove_token_properties(corpus_context, annotation_type, properties)[source]

Removes token properties for an annotation type and syncs the change to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove token properties for

properties : iterable

List of property names to remove

remove_token_subsets(corpus_context, annotation_type, subsets)[source]

Removes token subsets from the Hierarchy object for a corpus, and syncs the change to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove subsets for

subsets : iterable

List of subsets to remove for the annotation tokens

remove_type_properties(corpus_context, annotation_type, properties)[source]

Removes type properties for an annotation type and syncs the change to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove type properties for

properties : iterable

List of property names to remove

remove_type_subsets(corpus_context, annotation_type, subsets)[source]

Removes type subsets from the Hierarchy object for a corpus, and syncs the change to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove subsets for

subsets : iterable

List of subsets to remove for the annotation type

to_json()[source]

Convert the Hierarchy object to a dictionary for JSON serialization

Returns:
dict

All necessary information for the Hierarchy object
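
Together with from_json(), this supports round-tripping a hierarchy through JSON. A minimal sketch, assuming a Hierarchy can be constructed empty and that an already imported corpus named ‘my_corpus’ exists:

   import json

   from polyglotdb import CorpusContext
   from polyglotdb.structure import Hierarchy

   with CorpusContext('my_corpus') as c:
       # Serialize the corpus hierarchy to a JSON string
       serialized = json.dumps(c.hierarchy.to_json())

   # Restore the hierarchy into a fresh object
   restored = Hierarchy()
   restored.from_json(json.loads(serialized))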

update(other)[source]

Merge Hierarchies together. If other is a dictionary, then only the hierarchical data is updated.

Parameters:
other : Hierarchy or dict

Data to be merged in

values()[source]

Values (containing types) of the hierarchy.

Returns:
generator

Values of the hierarchy

word_name

Shortcut for returning the annotation type matching “word”

Returns:
str or None

Annotation type that begins with “word”
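
As a usage sketch, most of the properties and checks above can be reached through a CorpusContext; the corpus name and the printed values here are illustrative:

   from polyglotdb import CorpusContext

   with CorpusContext('my_corpus') as c:
       print(c.hierarchy.highest)            # e.g. 'utterance'
       print(c.hierarchy.lowest)             # e.g. 'phone'
       print(c.hierarchy.highest_to_lowest)  # e.g. ['utterance', 'word', 'phone']
       # Check for an enriched type property before querying it
       if c.hierarchy.has_type_property('word', 'frequency'):
           print('word types carry a frequency property')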

Corpus config class
class polyglotdb.config.CorpusConfig(corpus_name, data_dir=None, **kwargs)[source]

Class for storing configuration information about a corpus.

Parameters:
corpus_name : str

Identifier for the corpus

kwargs : keyword arguments

All keywords will be converted to attributes of the object

Attributes:
corpus_name : str

Identifier of the corpus

graph_user : str

Username for connecting to the graph database

graph_password : str

Password for connecting to the graph database

graph_host : str

Host for the graph database

graph_port : int

Port for connecting to the graph database

engine : str

Type of SQL database

base_dir : str

Base directory for storing information and temporary files for the corpus; defaults to “.pgdb” under the current user’s home directory

acoustic_connection_kwargs

Return connection parameters to use for connecting to an InfluxDB database

Returns:
dict

Connection parameters

graph_connection_string

Construct a connection string to use for Neo4j

Returns:
str

Connection string

temporary_directory(name)[source]

Create a temporary directory for use in the corpus, and return the path name.

All temporary directories are deleted upon successful exit of the context manager.

Returns:
str

Full path to temporary directory
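
A minimal configuration sketch; the corpus name and connection details are illustrative and depend on your local Neo4j setup:

   from polyglotdb import CorpusContext
   from polyglotdb.config import CorpusConfig

   # Keyword arguments become attributes of the config object
   config = CorpusConfig('my_corpus', graph_host='localhost', graph_port=7474)
   with CorpusContext(config) as c:
       print(c.hierarchy.annotation_types)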

Query API

Base
BaseQuery(corpus, to_find)
Attributes
Node(node_type[, corpus, hierarchy])
NodeAttribute(node, label)
CollectionNode(anchor_node, collected_node)
CollectionAttribute(node, label)
Clause elements
ClauseElement(attribute, value) Base class for filter elements that will be translated to Cypher.
EqualClauseElement(attribute, value)
GtClauseElement(attribute, value)
GteClauseElement(attribute, value)
LtClauseElement(attribute, value)
LteClauseElement(attribute, value)
NotEqualClauseElement(attribute, value)
InClauseElement(attribute, value)
Aggregate functions
AggregateFunction([attribute])
Average([attribute])
Count([attribute])
Max([attribute])
Min([attribute])
Stdev([attribute])
Sum([attribute])
Annotation queries
GraphQuery(corpus, to_find[, stop_check]) Base GraphQuery class.
SplitQuery(corpus, to_find[, stop_check])
Attributes
AnnotationNode(node_type[, corpus, hierarchy]) Class for annotations referenced in graph queries
AnnotationAttribute(annotation, label) Class for information about the attributes of annotations in a graph query
Clause elements
ContainsClauseElement(attribute, value) Clause for filtering based on hierarchical relations.
Lexicon queries
LexiconQuery(corpus, to_find) Class for generating a Cypher query over the lexicon (type information about annotations in the corpus)
Speaker queries
SpeakerQuery(corpus) Class for generating a Cypher query over speakers
Attributes
SpeakerNode([corpus, hierarchy])
SpeakerAttribute(node, label)
DiscourseNode([corpus, hierarchy])
DiscourseCollectionNode(speaker_node)
ChannelAttribute(node)
Discourse queries
DiscourseQuery(corpus) Class for generating a Cypher query over discourses
Attributes
DiscourseNode([corpus, hierarchy])
DiscourseAttribute(node, label)
SpeakerNode([corpus, hierarchy])
SpeakerCollectionNode(discourse_node)
ChannelAttribute(node)
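
As a sketch of how these pieces fit together, a typical annotation query builds a GraphQuery from a node, adds clause elements via filters, selects attributes as columns, and evaluates; the corpus name and phone label are illustrative:

   from polyglotdb import CorpusContext

   with CorpusContext('my_corpus') as c:
       q = c.query_graph(c.phone)             # GraphQuery over phone annotations
       q = q.filter(c.phone.label == 'aa')    # becomes an EqualClauseElement under the hood
       q = q.columns(c.phone.label.column_name('phone'),
                     c.phone.duration)
       results = q.all()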

I/O API

Contents:

I/O Types API
Parsing types
parsing.BreakIndexTier(name, linguistic_type)
parsing.GroupingTier(name, linguistic_type)
parsing.MorphemeTier(name, linguistic_type)
parsing.OrthographyTier(name, linguistic_type)
parsing.SegmentTier(name, linguistic_type[, …])
parsing.TobiTier(name, linguistic_type[, label])
parsing.TranscriptionTier(name, linguistic_type)
parsing.TextMorphemeTier(name, linguistic_type)
parsing.TextOrthographyTier(name, …[, label])
parsing.TextTranscriptionTier(name, …)
Linguistic types
standardized.PGAnnotationType(name)
Import API

Contents:

Inspect Functions
Buckeye
inspect_buckeye(word_path) Generate a BuckeyeParser for the Buckeye corpus.
Interlinear gloss text
inspect_ilg(path[, number]) Generate an IlgParser for a specified text file for parsing it as an interlinear gloss text file
TextGrids
inspect_textgrid(path) Generate a TextgridParser for a specified TextGrid file
TextGrids from forced alignment
inspect_mfa(path) Generate an MfaParser for a specified TextGrid file for parsing it as an MFA file
inspect_fave(path) Generate a FaveParser for a specified TextGrid file for parsing it as a FAVE file
inspect_maus(path) Generate a MausParser for a specified TextGrid file for parsing it as a MAUS file
TIMIT
inspect_timit(word_path) Generate a TimitParser.
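
The inspect functions above pair with CorpusContext.load for importing; a sketch for MFA-aligned TextGrids, with illustrative paths and corpus name:

   import polyglotdb.io as pgio

   from polyglotdb import CorpusContext

   parser = pgio.inspect_mfa('/path/to/corpus')
   with CorpusContext('my_corpus') as c:
       c.load(parser, '/path/to/corpus')
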
Parser Classes
Base parser
class polyglotdb.io.parsers.base.BaseParser(annotation_tiers, hierarchy, make_transcription=True, make_label=False, stop_check=None, call_back=None)[source]

Base parser, extend this class for new parsers.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

make_transcription : bool, defaults to True

If True, create a word attribute for transcription based on segments that are contained by the word

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages

match_extension(filename)[source]

Check whether the filename ends with an acceptable extension

Parameters:
filename : str

The filename of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(name, types_only=False)[source]

Parse annotations for later importing.

Parameters:
name : str

Name of the discourse

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data

parse_information(path, corpus_name)[source]

Parses types out of a corpus

Parameters:
path : str

a path to the corpus

corpus_name : str

name of the corpus

Returns:
data.types : list

a list of data types

TextGrid parser
class polyglotdb.io.parsers.textgrid.TextgridParser(annotation_tiers, hierarchy, make_transcription=True, make_label=False, stop_check=None, call_back=None)[source]

Parser for Praat TextGrid files.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

make_transcription : bool, defaults to True

If True, create a word attribute for transcription based on segments that are contained by the word

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages

load_textgrid(path)[source]

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

parse_discourse(path, types_only=False)[source]

Parse a TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

Forced alignment output parser
class polyglotdb.io.parsers.aligner.AlignerParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Base class for parsing TextGrid output from forced aligners.

Parameters:
annotation_tiers : list

List of the annotation tiers to store data from the TextGrid

hierarchy : Hierarchy

Basic hierarchy of the TextGrid

make_transcription : bool

Flag for whether to add a transcription property to words based on phones they contain

stop_check : callable

Function to check for whether parsing should stop

call_back : callable

Function to report progress in parsing

Attributes:
word_label : str

Label identifying word tiers

phone_label : str

Label identifying phone tiers

name : str

Name of the aligner the TextGrids are from

speaker_first : bool

Whether speaker names precede tier types in the TextGrid when multiple speakers are present

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Check whether the filename ends with an acceptable extension

Parameters:
filename : str

The filename of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)[source]

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

a path to the corpus

corpus_name : str

name of the corpus

Returns:
data.types : list

a list of data types

MFA
class polyglotdb.io.parsers.mfa.MfaParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids generated by the Montreal Forced Aligner.

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Check whether the filename ends with an acceptable extension

Parameters:
filename : str

The filename of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

a path to the corpus

corpus_name : str

name of the corpus

Returns:
data.types : list

a list of data types

FAVE
class polyglotdb.io.parsers.fave.FaveParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids generated by FAVE-align.

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Check whether the filename ends with an acceptable extension

Parameters:
filename : str

The filename of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

a path to the corpus

corpus_name : str

name of the corpus

Returns:
data.types : list

a list of data types

MAUS
class polyglotdb.io.parsers.maus.MausParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids generated by the Web-MAUS aligner.

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Check whether the filename ends with an acceptable extension

Parameters:
filename : str

The filename of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

a path to the corpus

corpus_name : str

name of the corpus

Returns:
data.types : list

a list of data types

TIMIT parser
class polyglotdb.io.parsers.timit.TimitParser(annotation_tiers, hierarchy, stop_check=None, call_back=None)[source]

Parser for the TIMIT corpus.

Has annotation types for word labels and surface transcription labels.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages

parse_discourse(word_path, types_only=False)[source]

Parse a TIMIT file for later importing.

Parameters:
word_path : str

Path to TIMIT .wrd file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

Buckeye parser
class polyglotdb.io.parsers.buckeye.BuckeyeParser(annotation_tiers, hierarchy, stop_check=None, call_back=None)[source]

Parser for the Buckeye corpus.

Has annotation types for word labels, word transcription, word part of speech, and surface transcription labels.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages

parse_discourse(word_path, types_only=False)[source]

Parse a Buckeye file for later importing.

Parameters:
word_path : str

Path to Buckeye .words file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data

LaBB-CAT parser
class polyglotdb.io.parsers.labbcat.LabbCatParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids exported from LaBB-CAT

Parameters:
annotation_tiers : list

List of the annotation tiers to store data from the TextGrid

hierarchy : Hierarchy

Basic hierarchy of the TextGrid

make_transcription : bool

Flag for whether to add a transcription property to words based on phones they contain

stop_check : callable

Function to check for whether parsing should stop

call_back : callable

Function to report progress in parsing

load_textgrid(path)[source]

Load a TextGrid file. Additionally, duplicated tier names are ignored, as they can sometimes be exported erroneously by LaBB-CAT.

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

Speaker parsers
Filename Speaker Parser
class polyglotdb.io.parsers.speaker.FilenameSpeakerParser(number_of_characters, left_orientation=True)[source]

Class for parsing a speaker name from a path that gets a specified number of characters from either the left or the right of the base file name.

Parameters:
number_of_characters : int

Number of characters to include in the speaker designation, set to 0 to get the full file name

left_orientation : bool

If True, pull characters from the left of the base file name; if False, from the right. Defaults to True

parse_path(path)[source]

Parses a file path and returns a speaker name

Parameters:
path : str

File path

Returns:
str

Substring of path that is the speaker name

Directory Speaker Parser
class polyglotdb.io.parsers.speaker.DirectorySpeakerParser[source]

Class for parsing a speaker name from a path that gets the directory immediately containing the file and uses its name as the speaker name

parse_path(path)[source]

Parses a file path and returns a speaker name

Parameters:
path : str

File path

Returns:
str

Directory that is the name of the speaker
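
A sketch of both speaker parsers on illustrative paths; the expected outputs in the comments assume the file extension is stripped from the base name:

   from polyglotdb.io.parsers.speaker import FilenameSpeakerParser, DirectorySpeakerParser

   name_parser = FilenameSpeakerParser(3)                     # first three characters
   print(name_parser.parse_path('/data/s01_utt1.TextGrid'))   # -> 's01'

   dir_parser = DirectorySpeakerParser()
   print(dir_parser.parse_path('/data/s01/utt1.TextGrid'))    # -> 's01'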

Exporters

Exporters are still under development.

Helper functions
DiscourseData(name, annotation_types, hierarchy) Class for collecting information about a discourse to be loaded
inspect_directory(directory) Function to inspect a directory and return the most likely type of files within it.
text_to_lines(path) Parse a text file into lines.
find_wav_path(path) Find a sound file for a given file, by looking for a .wav file with the same base name as the given path

Acoustics API

Classes
class polyglotdb.acoustics.classes.Track[source]

Track class to contain, select, and manage TimePoint objects

Attributes:
points : iterable of TimePoint

Time points with values of the acoustic track

add(point)[source]

Add a TimePoint to the track

Parameters:
point : TimePoint

Time point to add

items()[source]

Generator for returning tuples of the time point and values

Returns:
generator

Tuples of time points and values

keys()[source]

Get a list of all keys for TimePoints that the Track has

Returns:
list

All keys on TimePoint objects

slice(begin, end)[source]

Create a slice of the acoustic track between two times

Parameters:
begin : float

Begin time for the slice

end : float

End time for the slice

Returns:
Track

Track constructed from just the time points within the specified time range

times()[source]

Get a list of all time points in the track

Returns:
list

Sorted time points

class polyglotdb.acoustics.classes.TimePoint(time)[source]

Class for handling acoustic measurements at a specific time point

Attributes:
time : float

The time of the time point

values : dict

Dictionary of acoustic measures for the given time point

add_value(name, value)[source]

Add a new named measure and value to the TimePoint

Parameters:
name : str

Name of the measure

value : object

Measure value

has_value(name)[source]

Check whether a time point contains a named measure

Parameters:
name : str

Name of the measure

Returns:
bool

True if name is in values and has a value

select_values(columns)[source]

Generate a dictionary of only the specified measurement names

Parameters:
columns : iterable

Iterable of measurement names to include

Returns:
dict

Subset of values whose names are in the specified columns

update(point)[source]

Update values in this time point from another TimePoint

Parameters:
point : polyglotdb.acoustics.classes.TimePoint

TimePoint to get values from
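
A small sketch constructing a track by hand; the measure name and values are illustrative:

   from polyglotdb.acoustics.classes import Track, TimePoint

   track = Track()
   for time, f0 in [(0.10, 110.0), (0.20, 115.0), (0.30, 112.0)]:
       point = TimePoint(time)
       point.add_value('F0', f0)   # add a named measure to the point
       track.add(point)

   print(track.times())               # sorted time points: [0.1, 0.2, 0.3]
   middle = track.slice(0.15, 0.35)   # new Track containing the last two points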

Segments
polyglotdb.acoustics.segments.generate_segments(corpus_context, annotation_type='utterance', subset=None, file_type='vowel', duration_threshold=0.001, padding=0, fetch_subannotations=False)[source]

Generate segment vectors for an annotation type, to be used as input to analyze_file_segments.

Parameters:
corpus_context : CorpusContext

The CorpusContext object of the corpus

annotation_type : str, optional

The type of annotation to use in generating segments, defaults to utterance

subset : str, optional

Specify a subset to use for generating segments

file_type : str, optional

One of ‘low_freq’, ‘vowel’, or ‘consonant’, specifies the type of audio file to use

duration_threshold : float, optional

Segments with length shorter than this value (in seconds) will not be included

Returns:
SegmentMapping

Object containing segments to be analyzed

polyglotdb.acoustics.segments.generate_vowel_segments(corpus_context, duration_threshold=None, padding=0, vowel_label='vowel')[source]

Generate segment vectors for each vowel, to be used as input to analyze_file_segments.

Parameters:
corpus_context : polyglot.corpus.context.CorpusContext

The CorpusContext object of the corpus

duration_threshold : float, optional

Segments with length shorter than this value (in seconds) will not be included

Returns:
SegmentMapping

Object containing vowel segments to be analyzed

polyglotdb.acoustics.segments.generate_utterance_segments(corpus_context, file_type='vowel', duration_threshold=None, padding=0)[source]

Generate segment vectors for each utterance, to be used as input to analyze_file_segments.

Parameters:
corpus_context : polyglot.corpus.context.CorpusContext

The CorpusContext object of the corpus

file_type : str, optional

One of ‘low_freq’, ‘vowel’, or ‘consonant’, specifies the type of audio file to use

duration_threshold : float, optional

Segments with length shorter than this value (in seconds) will not be included

Returns:
SegmentMapping

Object containing utterance segments to be analyzed
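
A sketch of generating vowel segments for analysis, assuming a ‘vowel’ subset has already been encoded; the corpus name is illustrative:

   from polyglotdb import CorpusContext
   from polyglotdb.acoustics.segments import generate_vowel_segments

   with CorpusContext('my_corpus') as c:
       # Skip vowels shorter than 50 ms
       segments = generate_vowel_segments(c, duration_threshold=0.05,
                                          vowel_label='vowel')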

Formants
polyglotdb.acoustics.formants.base.analyze_formant_tracks(corpus_context, vowel_label=None, source='praat', call_back=None, stop_check=None, multiprocessing=True)[source]

Analyze formants of an entire utterance, and save the resulting formant tracks into the database.

Parameters:
corpus_context : CorpusContext

corpus context to use

vowel_label : str, optional

Optional subset of phones to compute tracks over. If None, then tracks over utterances are computed.

call_back : callable

call back function, optional

stop_check : callable

stop check function, optional

polyglotdb.acoustics.formants.base.analyze_formant_points(corpus_context, call_back=None, stop_check=None, vowel_label='vowel', duration_threshold=None, multiprocessing=True)[source]

First pass of the algorithm; generates prototypes.

Parameters:
corpus_context : polyglot.corpus.context.CorpusContext

The CorpusContext object of the corpus.

call_back : callable, optional

Function to report progress.

stop_check : callable, optional

Function to check whether processing should stop early.

vowel_label : str

The subset of phones to analyze.

duration_threshold : float, optional

Segments with length shorter than this value (in milliseconds) will not be analyzed.

Returns:
dict

Track data

polyglotdb.acoustics.formants.refined.analyze_formant_points_refinement(corpus_context, vowel_label='vowel', duration_threshold=0, num_iterations=1, call_back=None, stop_check=None, vowel_prototypes_path='', drop_formant=False, multiprocessing=True, output_tracks=False)[source]

Extracts F1, F2, F3 and B1, B2, B3.

Parameters:
corpus_context : CorpusContext

The CorpusContext object of the corpus.

vowel_label : str

The subset of phones to analyze.

duration_threshold : float, optional

Segments with length shorter than this value (in milliseconds) will not be analyzed.

num_iterations : int, optional

How many times the algorithm should iterate before returning values.

output_tracks : bool, optional

If False, save only a single formant measurement taken at one third (0.33) of the vowel’s duration; if True, save a track over the entire vowel duration.

Returns:
prototype_metadata : dict

Means of F1, F2, F3, B1, B2, B3 and covariance matrices per vowel class.
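
A usage sketch; the corpus name and iteration count are illustrative, and a ‘vowel’ subset must already be encoded:

   from polyglotdb import CorpusContext
   from polyglotdb.acoustics.formants.refined import analyze_formant_points_refinement

   with CorpusContext('my_corpus') as c:
       prototype_metadata = analyze_formant_points_refinement(
           c, vowel_label='vowel', num_iterations=3)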

Conch function generators
polyglotdb.acoustics.formants.helper.generate_base_formants_function(corpus_context, gender=None, source='praat')[source]

Generate a base formant analysis function that calls Praat.

Parameters:
corpus_context : polyglot.corpus.context.CorpusContext

The CorpusContext object of the corpus.

gender : str

The gender to use for the function; if “M” (male), the maximum frequency is 5000 Hz, otherwise 5500 Hz

source : str

The source of the function; if it is “praat”, the formants will be calculated with Praat over each segment, otherwise existing formant tracks will be used

Returns:
formant_function : Partial function object

The function used to call Praat.

polyglotdb.acoustics.formants.helper.generate_formants_point_function(corpus_context, gender=None)[source]

Generates a function used to call Praat to measure formants and bandwidths with variable num_formants.

Parameters:
corpus_context : CorpusContext

The CorpusContext object of the corpus.

gender : str, optional

The gender to use for the function; if “M” (male), the maximum frequency is 5000 Hz, otherwise 5500 Hz.

Returns:
formant_function : Partial function object

The function used to call Praat.

polyglotdb.acoustics.formants.helper.generate_variable_formants_point_function(corpus_context, min_formants, max_formants)[source]

Generates a function used to call Praat to measure formants and bandwidths with variable num_formants. This specific function returns a single point per formant at a third of the way through the segment.

Parameters:
corpus_context : CorpusContext

The CorpusContext object of the corpus.

min_formants : int

The minimum number of formants to measure with on subsequent passes (default is 4).

max_formants : int

The maximum number of formants to measure with on subsequent passes (default is 7).

Returns:
formant_function : Partial function object

The function used to call Praat.

Intensity
polyglotdb.acoustics.intensity.analyze_intensity(corpus_context, source='praat', call_back=None, stop_check=None, multiprocessing=True)[source]

Analyze intensity of an entire utterance, and save the resulting intensity tracks into the database.

Parameters:
corpus_context : CorpusContext

corpus context to use

source : str

Source program to use (only praat available)

call_back : callable

call back function, optional

stop_check : callable

stop check function, optional

multiprocessing : bool

Flag to use multiprocessing rather than threading

Conch function generators
polyglotdb.acoustics.intensity.generate_base_intensity_function(corpus_context)[source]

Generate an Intensity function from Conch

Parameters:
corpus_context : CorpusContext

CorpusContext to use for getting path to Praat (if not on the system path)

Returns:
PraatSegmentIntensityTrackFunction

Intensity analysis function

Pitch
polyglotdb.acoustics.pitch.base.analyze_pitch(corpus_context, source='praat', algorithm='base', call_back=None, absolute_min_pitch=50, absolute_max_pitch=500, adjusted_octaves=1, stop_check=None, multiprocessing=True)[source]

Analyze pitch of utterances, and save the resulting pitch tracks into the database.

Parameters:
corpus_context : AudioContext
source : str

Program to use for analyzing pitch, either praat or reaper

algorithm : str

Algorithm to use, base, gendered, or speaker_adjusted

absolute_min_pitch : int

Absolute pitch floor

absolute_max_pitch : int

Absolute pitch ceiling

adjusted_octaves : int

How many octaves around the speaker’s mean pitch to set the speaker adjusted pitch floor and ceiling

stop_check : callable

Function to check whether processing should stop early

call_back : callable

Function to report progress

multiprocessing : bool

Flag whether to use multiprocessing or threading
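
A usage sketch, assuming Praat is available to the corpus context; the corpus name is illustrative:

   from polyglotdb import CorpusContext
   from polyglotdb.acoustics.pitch.base import analyze_pitch

   with CorpusContext('my_corpus') as c:
       analyze_pitch(c, source='praat', algorithm='speaker_adjusted')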

Conch function generators
polyglotdb.acoustics.pitch.helper.generate_pitch_function(algorithm, min_pitch, max_pitch, path=None, kwargs=None)[source]

Generate a pitch analysis function from Conch for the given algorithm and pitch range.

VOT
polyglotdb.acoustics.vot.base.analyze_vot(corpus_context, classifier, stop_label='stops', vot_min=5, vot_max=100, window_min=-30, window_max=30, overwrite_edited=False, call_back=None, stop_check=None, multiprocessing=False)[source]

Analyze VOT for stops using a pretrained AutoVOT classifier.

Parameters:
corpus_context : AudioContext
classifier : str

Path to an AutoVOT classifier model

stop_label : str

Label of subset to analyze

vot_min : int

Minimum VOT in ms

vot_max : int

Maximum VOT in ms

window_min : int

Window minimum in ms

window_max : int

Window maximum in ms

overwrite_edited : bool

Whether to update VOTs that already have the property edited set to True

call_back : callable

call back function, optional

stop_check : callable

stop check function, optional

multiprocessing : bool

Flag to use multiprocessing, otherwise will use threading
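
A usage sketch; the classifier path is illustrative, AutoVOT must be installed, and ‘stops’ must be an already encoded subset:

   from polyglotdb import CorpusContext
   from polyglotdb.acoustics.vot.base import analyze_vot

   with CorpusContext('my_corpus') as c:
       analyze_vot(c, classifier='/path/to/autovot/classifier',
                   stop_label='stops')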

Other
polyglotdb.acoustics.other.analyze_track_script(corpus_context, acoustic_name, properties, script_path, duration_threshold=0.01, phone_class=None, arguments=None, call_back=None, file_type='consonant', stop_check=None, multiprocessing=True)[source]
polyglotdb.acoustics.other.analyze_script(corpus_context, phone_class=None, subset=None, annotation_type=None, script_path=None, duration_threshold=0.01, arguments=None, call_back=None, file_type='consonant', stop_check=None, multiprocessing=True)[source]

Perform acoustic analysis of phones using an input praat script.

Saves the measurement results from the praat script into the database under the same names as the Praat output columns. The Praat script must meet the following requirements (a usage sketch follows the parameter list below):

  • the only input is the full path to the sound file containing (only) the phone
  • the script prints the output to the Praat Info window in two rows (i.e. two lines).
  • the first row is a space-separated list of measurement names: these are the names that will be saved into the database
  • the second row is a space-separated list of the value for each measurement
Parameters:
corpus_context : CorpusContext

corpus context to use

phone_class : str

DEPRECATED, the name of an already encoded subset of phones on which the analysis will be run

subset : str, optional

the name of an already encoded subset of an annotation type, on which the analysis will be run

annotation_type : str

the type of annotation that the analysis will go over

script_path : str

full path to the praat script

duration_threshold : float

Minimum duration of segments to be analyzed

file_type : str

File type to use for the script (consonant = 16kHz sample rate, vowel = 11kHz, low_freq = 1200 Hz)

arguments : list

a list containing any arguments to the praat script (currently not working)

call_back : callable

call back function, optional

stop_check : callable

stop check function, optional

multiprocessing : bool

Flag to use multiprocessing, otherwise will use threading
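
The usage sketch referenced above; the subset name, annotation type, and script path are illustrative:

   from polyglotdb import CorpusContext
   from polyglotdb.acoustics.other import analyze_script

   with CorpusContext('my_corpus') as c:
       analyze_script(c, subset='sibilant', annotation_type='phone',
                      script_path='/path/to/sibilant_script.praat')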

Conch function generators
polyglotdb.acoustics.other.generate_praat_script_function(praat_path, script_path, arguments=None)[source]

Generate a partial function that calls the specified Praat script (used as input to analyze_file_segments).

Parameters:
praat_path : str

full path to praat/praatcon

script_path : str

full path to the script

arguments : list

a list containing any arguments to the praat script, optional (currently not implemented)

Returns:
function

the partial function which applies the Praat script to a phone and returns the script output

Changelog

Version 1.2

  • Upgraded Neo4j compatibility to 4.3.3
  • Upgraded InfluxDB compatibility to 1.8.9
  • Changed Praat TextGrid handling to use praatio 4.1
  • Phone parsing no longer includes blank intervals (i.e. silences), so preceding and following phone calculations have changed
  • Update speaker adjusted pitch algorithm to use octave based min and max pitch rather than the more permissive standard deviation approach

Version 1.0

  • Added functionality to analyze voice-onset-time through AutoVOT
  • Added functionality to analyze formant points and tracks using a refinement process based on vowel formant prototypes
  • Added ability to enrich tokens from CSV
  • Added parser for TextGrids generated from the Web-MAUS aligner
  • Optimized loading of corpora for lower-memory computers
  • Optimized queries involving acoustic tracks

See the PolyglotDB wiki for the full changelog.
