Welcome to PolyglotDB’s documentation!¶
Contents:
Introduction¶
General Background¶
PolyglotDB is a Python package that focuses on representing linguistic data in scalable, high-performance databases (called “Polyglot” databases here) to apply acoustic analysis and other algorithms to large speech corpora.
In general there are two ways to leverage PolyglotDB for analyzing a dataset:
- The first way, more appropriate for technically skilled users, is through a Python API: writing Python scripts that import functions and classes from PolyglotDB. (For this route, see Getting started for setting up PolyglotDB, followed by Tutorial for walk-through examples.) This way also makes more sense for users in an individual lab, where it can be assumed that all users have the same level of access to datasets (without any ethical issues).
- The second way, more appropriate for a user group dispersed across multiple sites and where some users are less comfortable with Python scripting, is by setting up an ISCAN (Integrated Speech Corpus ANalysis) server—see the ISCAN documentation for more details. ISCAN servers allow users to view information and perform most functions of PolyglotDB through a web browser. In addition, ISCAN servers include features for the use case of multiple datasets with differential access: by user/corpus permissions level, and functionality for managing multiple Polyglot databases.
This documentation site is relevant for both ways PolyglotDB can be used, but is geared towards a technically skilled user and thus focuses more on the use case of using PolyglotDB “by script” (#1).
The general workflow for working with PolyglotDB is:
- Import
- Parse and load initial data from corpus files into a Polyglot database
- This step can take a while, from a couple of minutes to hours depending on corpus size.
- Intended to be done once per corpus
- See Importing the tutorial corpus for an example
- See Importing corpora for more details on the import process
- Enrichment
- Add further information through analysis algorithms or from CSV files
- Can take a while, from a couple of minutes to hours depending on enrichment and corpus size
- Intended to be done once per corpus
- See Tutorial 2: Adding extra information for an example
- See Enrichment for more details on the enrichment process
- Query
- Find specific linguistic units
- Should be quick, from a couple of minutes to ~10 minutes depending on corpus size
- Intended to be done many times per corpus, for different queries
- See Tutorial 3: Getting information out for an example
- See Querying corpora for more details on the query process
- Export
- Generate a CSV file for data analysis with specific information extracted from the previous query
- Should be quick, and intended to be done many times per corpus (like Query)
- See Exporting a CSV file for an example
- See Exporting query results for more details on the export process
The thinking behind this workflow is explained in more detail in the ISCAN conference paper.
Note
There are also many PolyglotDB scripts written for the SPADE project that can be used as examples. These scripts are available in the SPADE GitHub repo.
High level overview¶
PolyglotDB represents language (speech and text corpora) using the annotation graph formalism put forth in Bird and Liberman (2001). Annotations are represented in a directed acyclic graph, where nodes are points in time in an audio file or points in a text file. Directed edges are labelled with annotations, and multiple levels of annotations can be included or excluded as desired. Bird and Liberman also put forth a relational formalism for annotation graphs, and later work implements that formalism in SQL. Similarly, LaBB-CAT and EMU-SDMS use the annotation graph formalism.
Recently, NoSQL databases have been rising in popularity, and one type of these is the graph database. In this type of database, nodes and relationships are primitives rather than relational tables. Graph databases map onto annotation graphs much more cleanly than relational databases do. The graph database used in PolyglotDB is Neo4j.
PolyglotDB also uses a NoSQL time-series database called InfluxDB. Acoustic measurements like F0 and formants are stored there, with a value associated with every time step (10 ms); each measurement is also associated with a speaker and a phone from the graph database.
Multiple versions of imported sound files are generated at various sampling rates (1200 Hz, 11000 Hz, and 22050 Hz) to speed up relevant algorithms. For example, pitch algorithms don’t need a highly sampled signal, and higher sample rates slow down the processing of files.
The idea of using multiple languages or technologies suited to individual problems is known as “polyglot persistence,” particularly in the realm of merging SQL and NoSQL databases.
More detailed information on specific implementation details is available in the Developer documentation, as well as in the InterSpeech proceedings paper.
Development history¶
PolyglotDB was originally conceptualized for use in Phonological CorpusTools, developed at the University of British Columbia. However, primary development shifted to the umbrella of Montreal Corpus Tools, developed by members of the Montreal Language Modelling Lab at McGill University (now part of MCQLL Lab).
A graphical program named Speech Corpus Tools was originally developed to allow users to interact with Polyglot without writing scripts. However, in the context of the Speech Across Dialects of English (SPADE) project, a more flexible solution was needed to accommodate use cases involving multiple users, with physical separation between users and data, and differing levels of permission across datasets. ISCAN has been developed within the SPADE project with these requirements in mind.
Contributors¶
- Michael McAuliffe (@mmcauliffe)
- Elias Stengel-Eskin (@esteng)
- Sarah Mihuc (@samihuc)
- Michael Goodale (@MichaelGoodale)
- Jeff Mielke (@jeffmielke)
- Arlie Coles (@a-coles)
Citation¶
A citeable paper for PolyglotDB is:
McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, and Morgan Sonderegger (2017). Polyglot and Speech Corpus Tools: a system for representing, integrating, and querying speech corpora. In Proceedings of Interspeech 2017. [PDF]
Or you can cite it via:
McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, Arlie Coles, Sarah Mihuc, Michael Goodale, and Morgan Sonderegger (2019). PolyglotDB [Computer program]. Version 0.1.0, retrieved 26 March 2019 from https://github.com/MontrealCorpusTools/PolyglotDB.
Getting started¶
PolyglotDB is the Python API for interacting with Polyglot databases and is installed through pip. There are other dependencies that must be installed prior to using a Polyglot database, depending on the user’s platform.
Note
Another way to use Polyglot functionality is through setting up an ISCAN server. An Integrated Speech Corpus Analysis (ISCAN) server can be set up on a lab’s central server, or you can run it on your local computer as well (though many of PolyglotDB’s algorithms benefit from having more processors and memory available). Please see the ISCAN documentation for more information on setting it up (http://iscan.readthedocs.io/en/latest/getting_started.html). The main feature benefits of ISCAN are multiple Polyglot databases (separating out different corpora and allowing any of them to be started or shut down), graphical interfaces for inspecting data, and a user authentication system with different levels of permission for remote access through a web application.
Installation¶
To install via pip:
pip install polyglotdb
To install from source (primarily for development):
- Clone or download the Git repository (https://github.com/MontrealCorpusTools/PolyglotDB).
- Navigate to the directory via command line and install the dependencies via pip install -r requirements.txt
- Install PolyglotDB via python setup.py install, which will install the pgdb utility that can be run anywhere and manages a local database.
Note
The use of sudo is not recommended for installation. Ideally your Python installation should be managed by either Anaconda or Homebrew (for Macs).
Set up local database¶
Installing the PolyglotDB package also installs a utility script (pgdb) that is then callable from the command line anywhere on the system. The pgdb command allows for the administration of a single Polyglot database (install/start/stop/uninstall). Using pgdb requires that several prerequisites be installed first, and the remainder of this section will detail how to install these on various platforms.
Please be aware that using the pgdb utility to set up a database is not recommended for larger groups or those needing remote access. See the ISCAN server for a more fully featured solution.
Mac¶
- Ensure Java 11 is installed inside the Anaconda distribution (conda install -c anaconda openjdk) if using Anaconda, or via Homebrew otherwise (brew cask install java)
- Check the Java version is 11 via java --version
- Once PolyglotDB is installed, run pgdb install /path/to/where/you/want/data/to/be/stored, or pgdb install to save data in the default directory.
Warning
Do not use sudo with this command on Macs, as it will lead to permissions issues later on.
Once you have installed PolyglotDB, to start it run pgdb start. Likewise, you can close PolyglotDB by running pgdb stop. To uninstall, run pgdb uninstall.
Windows¶
- Ensure Java 11 is installed (https://www.java.com/) and on the path (java --version works in the command prompt)
- Check the Java version is 11 via java --version
- Start an Administrator command prompt (right click on cmd.exe and select “Run as administrator”), as Neo4j will be installed as a Windows service.
- Run pgdb install /path/to/where/you/want/data/to/be/stored, or pgdb install to save data in the default directory.
To start or stop the database, you likewise have to use an administrator command prompt before entering the commands pgdb start or pgdb stop.
To uninstall, run pgdb uninstall (also requires an administrator command prompt).
Linux¶
Ensure Java 11 is installed. On Ubuntu:
sudo apt-get update
sudo apt-get install openjdk-11-jdk-headless
Once installed, double check that java --version returns Java 11. Then run pgdb install /path/to/where/you/want/data/to/be/stored, or pgdb install to save data in the default directory.
Once you have installed PolyglotDB, to start it run pgdb start. Likewise, you can close PolyglotDB by running pgdb stop. To uninstall, navigate to the PolyglotDB directory and type pgdb uninstall.
Tutorial¶
Under development!
Contents:
Tutorial 1: First steps¶
The main objective of this tutorial is to import a downloaded corpus consisting of sound files and TextGrids into a Polyglot database so that they can be queried. This tutorial is available as a Jupyter notebook as well.
Downloading the tutorial corpus¶
The tutorial corpus used here is a version of the LibriSpeech test-clean subset, forced aligned with the Montreal Forced Aligner (tutorial corpus download link). Extract the files to somewhere on your local machine.
Importing the tutorial corpus¶
To import the tutorial corpus, the following lines of code are necessary:
from polyglotdb import CorpusContext
import polyglotdb.io as pgio
corpus_root = '/mnt/e/Data/pg_tutorial'
parser = pgio.inspect_mfa(corpus_root)
parser.call_back = print
with CorpusContext('pg_tutorial') as c:
    c.load(parser, corpus_root)
Important
If during the running of the import code a neo4j.exceptions.ServiceUnavailable error is raised, then double check that the pgdb database is running. Once polyglotdb is installed, simply call pgdb start, assuming pgdb install has already been called. See Set up local database for more information.
The import statements at the top get the necessary classes and functions for importing, namely the CorpusContext class and the polyglot IO module. CorpusContext objects are how all interactions with the database are handled. The CorpusContext is created as a context manager in Python (the with ... as ... pattern), so that clean up and closing of connections are automatically handled both on successful completion of the code as well as if errors are encountered.
The IO module handles all import and export functionality in polyglotdb. The principal functions that a user will encounter are the inspect_X functions that generate parsers for corpus formats. In the above code, the MFA parser is used because the tutorial corpus was aligned using the MFA. See Importing corpora for more information on the inspect functions and the parser objects they generate for various formats.
Resetting the corpus¶
If at any point there’s some error or interruption in import or other stages of the tutorial, the corpus can be reset to a fresh state via the following code:
from polyglotdb import CorpusContext
with CorpusContext('pg_tutorial') as c:
    c.reset()
Warning
Be careful when running this code as it will delete any and all information in the corpus. For smaller corpora such as the one presented here, the time to set up is not huge, but for larger corpora this can result in several hours worth of time to reimport and re-enrich the corpus.
Testing some simple queries¶
To ensure that data import completed successfully, we can print the list of speakers, discourses, and phone types in the corpus, via:
from polyglotdb import CorpusContext
with CorpusContext('pg_tutorial') as c:
    print('Speakers:', c.speakers)
    print('Discourses:', c.discourses)

    q = c.query_lexicon(c.lexicon_phone)
    q = q.order_by(c.lexicon_phone.label)
    q = q.columns(c.lexicon_phone.label.column_name('phone'))
    results = q.all()
    print(results)
A more interesting summary query is perhaps looking at the count and average duration of different phone types across the corpus, via:
from polyglotdb.query.base.func import Count, Average

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.phone).group_by(c.phone.label.column_name('phone'))
    results = q.aggregate(Count().column_name('count'), Average(c.phone.duration).column_name('average_duration'))
    for r in results:
        print('The phone {} had {} occurrences and an average duration of {}.'.format(r['phone'], r['count'], r['average_duration']))
Next steps¶
You can see a full version of the script.
See Tutorial 2: Adding extra information for the next tutorial covering how to enrich the corpus and create more interesting queries.
Tutorial 2: Adding extra information¶
The main objective of this tutorial is to enrich an already imported corpus (see Tutorial 1: First steps) with additional information not present in the original audio and transcripts. This additional information will then be used for creating linguistically interesting queries in the next tutorial (Tutorial 3: Getting information out). This tutorial is available as a Jupyter notebook as well.
Note
Different kinds of enrichment, corresponding to different subsections of this section, can be performed in any order. For example, speaker enrichment is independent of syllable encoding, so you can perform either one before the other and the resulting database will be the same. Within a section, however (i.e., Encoding syllables), the ordering of steps matters. For example, syllabic segments must be specified before syllables can be encoded, because the syllable encoding algorithm builds up syllables around syllabic phones.
As in the other tutorials, import statements and the location of the corpus root must be set for the code in this tutorial to be runnable:
import os
from polyglotdb import CorpusContext
## CHANGE THIS PATH to location of pg_tutorial corpus on your system
corpus_root = '/mnt/e/Data/pg_tutorial'
Encoding syllables¶
Creating syllables requires two steps. The first is to specify the subset of phones in the corpus that are syllabic segments and function as syllabic nuclei. In general these will be vowels, but they can also include syllabic consonants. Subsets in PolyglotDB are completely arbitrary sets of labels that speed up querying and allow for simpler references; see Subset enrichment for more details.
syllabics = ["ER0", "IH2", "EH1", "AE0", "UH1", "AY2", "AW2", "UW1", "OY2", "OY1", "AO0", "AH2", "ER1", "AW1",
"OW0", "IY1", "IY2", "UW0", "AA1", "EY0", "AE1", "AA0", "OW1", "AW0", "AO1", "AO2", "IH0", "ER2",
"UW2", "IY0", "AE2", "AH0", "AH1", "UH2", "EH2", "UH0", "EY1", "AY0", "AY1", "EH0", "EY2", "AA2",
"OW2", "IH1"]
with CorpusContext('pg_tutorial') as c:
    c.encode_type_subset('phone', syllabics, 'syllabic')
The database now contains the information that each phone type above (“ER0”, etc.) is a member of a subset called “syllabic”. Thus, each phone token that belongs to one of these phone types is also a syllabic.
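The type/token relationship here can be illustrated in plain Python (the labels below are a small hypothetical subset; this is just the membership logic, not PolyglotDB’s implementation):

```python
# Phone *types* marked as syllabic (a small hypothetical subset)
syllabic_types = {"AH0", "IH1", "EH0"}

# Phone *tokens* in one word; each token belongs to a type via its label
tokens = ["AH0", "S", "T", "R", "IH1", "N", "JH", "EH0", "N", "T"]

# Every token whose type is in the subset is itself syllabic
syllabic_tokens = [t for t in tokens if t in syllabic_types]
print(syllabic_tokens)  # ['AH0', 'IH1', 'EH0']
```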
Once the syllabic segments have been marked as such in the phone inventory, the next step is to actually create the syllable annotations as follows:
with CorpusContext('pg_tutorial') as c:
    c.encode_syllables(syllabic_label='syllabic')
The encode_syllables function uses a maximum onset algorithm: it collects all word-initial sequences of phones not marked as syllabic in the corpus, and then maximizes onsets between syllabic phones. As an example, something like astringent would have a phone sequence of AH0 S T R IH1 N JH EH0 N T. In any reasonably-sized corpus of English, the list of possible onsets would include S T R and JH, but not N JH, so the sequence would be syllabified as AH0 . S T R IH1 N . JH EH0 N T.
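The maximal-onset idea can be sketched in plain Python. This is a simplified illustration, not PolyglotDB’s actual implementation, and the onset inventory below is hypothetical:

```python
def syllabify(phones, syllabics, onsets):
    """Split a phone sequence into syllables by maximizing onsets (simplified sketch)."""
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    boundaries = []
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:nxt]  # consonants between two nuclei
        # Assign the longest licit onset to the following nucleus
        for k in range(len(cluster) + 1):
            if tuple(cluster[k:]) in onsets or k == len(cluster):
                boundaries.append(prev + 1 + k)
                break
    edges = [0] + boundaries + [len(phones)]
    return [phones[s:e] for s, e in zip(edges, edges[1:])]

# Hypothetical onset inventory containing 'S T R' and 'JH' but not 'N JH'
onsets = {("S", "T", "R"), ("JH",)}
syllabics = {"AH0", "IH1", "EH0"}
word = "AH0 S T R IH1 N JH EH0 N T".split()
print([" ".join(s) for s in syllabify(word, syllabics, onsets)])
# → ['AH0', 'S T R IH1 N', 'JH EH0 N T']
```

This reproduces the syllabification of astringent described above.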
Note
See Creating syllable units for more details on syllable enrichment.
Encoding utterances¶
As with syllables, encoding utterances consists of two steps. The first is marking the “words” that are actually non-speech elements within the transcript. When a corpus is first imported, every annotation is treated as speech, so we must first encode labels like <SIL> as pause elements rather than actual speech sounds:
pause_labels = ['<SIL>']
with CorpusContext('pg_tutorial') as c:
    c.encode_pauses(pause_labels)
(Note that in the tutorial corpus <SIL> happens to be the only possible non-speech “word”, but in other corpora there will probably be others, so you’d use a different pause_labels list.)
Once pauses are encoded, the next step is to actually create the utterance annotations as follows:
with CorpusContext('pg_tutorial') as c:
    c.encode_utterances(min_pause_length=0.15)
The min_pause_length argument specifies how long (in seconds) a non-speech element has to be to act as an utterance boundary. In many cases, “pauses” that are short enough, such as those inserted by a forced alignment error, are not good utterance boundaries (or just signal a smaller unit than an “utterance”).
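The role of min_pause_length can be illustrated with a few hypothetical pause intervals (plain Python, not the actual implementation):

```python
# Hypothetical pause intervals (start, end) in seconds; only pauses at least
# min_pause_length long count as utterance boundaries.
pauses = [(1.20, 1.28), (3.50, 3.90), (6.10, 6.14)]
min_pause_length = 0.15

boundaries = [(s, e) for s, e in pauses if e - s >= min_pause_length]
print(boundaries)  # [(3.5, 3.9)]
```

Here only the 0.4 s pause survives; the two shorter ones are treated as part of the surrounding utterance.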
Note
See Creating utterance units for more details on encoding pauses and utterances.
Speaker enrichment¶
Included in the tutorial corpus is a CSV containing speaker information, namely their gender and their actual name rather than the numeric code used in LibriSpeech. This information can be imported into the corpus as follows:
speaker_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'speaker_info.csv')
with CorpusContext('pg_tutorial') as c:
    c.enrich_speakers_from_csv(speaker_enrichment_path)
Note that the CSV file could have an arbitrary name and location, in general. The command above assumes the name and location for the tutorial corpus.
Once enrichment is complete, we can then query and extract information about these characteristics of speakers.
Note
See Enriching speaker information for more details on enrichment from CSVs.
Stress enrichment¶
Important
Stress enrichment requires that the Encoding syllables step has been completed.
Once syllables have been encoded, there are a couple of ways to encode the stress level of each syllable (i.e., primary stress, secondary stress, or unstressed). The way used in this tutorial relies on a lexical enrichment file included in the tutorial corpus. This file has a field named stress_pattern that gives a pattern for the syllables based on the stress. For example, astringent will have a stress pattern of 0-1-0.
lexicon_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'iscan_lexicon.csv')
with CorpusContext('pg_tutorial') as c:
    c.enrich_lexicon_from_csv(lexicon_enrichment_path)
    c.encode_stress_from_word_property('stress_pattern')
Following this enrichment step, words will have a type property of stress_pattern and syllables will have a token property of stress that can be queried on and extracted.
Note
See Encoding stress for more details on how to encode stress in various ways.
Additional enrichment¶
Important
Speech rate enrichment requires that both the Encoding syllables and Encoding utterances steps have been completed.
One of the final enrichments in this tutorial is to encode speech rate onto utterance annotations. The speech rate measure used here is syllables per second.
with CorpusContext('pg_tutorial') as c:
    c.encode_rate('utterance', 'syllable', 'speech_rate')
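Conceptually, the encoded speech rate is just a count divided by a duration; a quick sketch with hypothetical numbers:

```python
# Hypothetical utterance: 12 syllables spanning 0.5 s to 4.5 s
utterance_begin, utterance_end = 0.5, 4.5
n_syllables = 12

speech_rate = n_syllables / (utterance_end - utterance_begin)
print(speech_rate)  # 3.0 syllables per second
```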
Next we will encode the number of syllables per word:
with CorpusContext('pg_tutorial') as c:
    c.encode_count('word', 'syllable', 'num_syllables')
Once the enrichments complete, a token property of speech_rate will be available for query and export on utterance annotations, as well as one for num_syllables on word tokens.
Note
See Hierarchical enrichment for more details on encoding properties based on the rate/count/position of lower annotations (i.e., phones or syllables) within higher annotations (i.e., syllables, words, or utterances).
Next steps¶
You can see a full version of the script which carries out all steps shown in code above.
See Tutorial 3: Getting information out for the next tutorial covering how to create and export interesting queries using the information enriched above. See Enrichment for a full list and example usage of the various enrichments possible in PolyglotDB.
Tutorial 3: Getting information out¶
The main objective of this tutorial is to export a CSV file using a query on an imported (Tutorial 1: First steps) and enriched (Tutorial 2: Adding extra information) corpus. This tutorial is available as a Jupyter notebook as well.
As in the other tutorials, import statements and the location of the corpus root must be set for the code in this tutorial to be runnable:
from polyglotdb import CorpusContext
## CHANGE FOR YOUR SYSTEM
export_path = '/path/to/export/pg_tutorial.csv'
Creating an initial query¶
The first step in generating a CSV file is to create a query that selects just the linguistic objects (“annotations”) of a particular type (e.g. words, syllables) that are of interest to our study.
For this example, we will query for all syllables, which are:
- Stressed (defined here as having a stress value equal to '1')
- At the beginning of the word
- In words that are at the end of utterances
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)
    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.
Next, we want to specify the particular information to extract for each syllable found.
# duplicated from above
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)
    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )
With the above, we extract information of interest about the syllable, the word it is in, the utterance it is in, the speaker, and the sound file (discourse in PolyglotDB’s API).
To test out the query, we can limit the results (for readability) and print them:
# duplicated from above
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)
    q = q.filter(c.syllable.stress == '1') # Stressed syllables...
    q = q.filter(c.syllable.begin == c.syllable.word.begin) # That are at the beginning of words...
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end) # that are at the end of utterances.

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )

    q = q.limit(10)
    results = q.all()
    print(results)
This will show the first ten rows that would be exported to a CSV.
Exporting a CSV file¶
Once the query is constructed with filters and columns, exporting to a CSV is a simple method call on the query object. For completeness, the full code for the query and export is given below.
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.syllable)
    q = q.filter(c.syllable.stress == '1')
    q = q.filter(c.syllable.begin == c.syllable.word.begin)
    q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end)

    q = q.columns(c.syllable.label.column_name('syllable'),
                  c.syllable.duration.column_name('syllable_duration'),
                  c.syllable.word.label.column_name('word'),
                  c.syllable.word.begin.column_name('word_begin'),
                  c.syllable.word.end.column_name('word_end'),
                  c.syllable.word.num_syllables.column_name('word_num_syllables'),
                  c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                  c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                  c.syllable.speaker.name.column_name('speaker'),
                  c.syllable.discourse.name.column_name('file'),
                  )

    q.to_csv(export_path)
The CSV file generated will then be ready to open in other programs or in R for data analysis.
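For instance, the exported file can be inspected from Python with the standard library’s csv module. The rows below are a hypothetical excerpt; the column names are those set via column_name above:

```python
import csv
import io

# A hypothetical two-row excerpt of an exported file
exported = io.StringIO(
    "syllable,syllable_duration,word,utterance_speech_rate,speaker\n"
    "IH1.N,0.21,astringent,3.0,Alice\n"
    "EH0.N.T,0.18,astringent,4.2,Bob\n"
)
rows = list(csv.DictReader(exported))
print(rows[0]["word"])                      # astringent
print(float(rows[0]["syllable_duration"]))  # 0.21
```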
Next steps¶
See the related ISCAN tutorial for R code on visualizing and analyzing the exported results.
Interacting with a local Polyglot database¶
There are two potential ways to have a local Polyglot instance up and running on your local machine. The first is the command line utility pgdb. The other option is to connect to a locally running ISCAN server instance.
pgdb utility¶
This utility provides a basic way to install/start/stop all of the required databases in a Polyglot database (see Set up local database for more details on setting up a Polyglot instance this way).
When using this set up the following ports are used (and are relevant for later connecting with the corpus):
Port | Protocol | Database
---|---|---
7474 | HTTP | Neo4j |
7475 | HTTPS | Neo4j |
7687 | Bolt | Neo4j |
8086 | HTTP | InfluxDB |
8087 | UDP | InfluxDB |
If any of those ports are in use by other programs (they’re also the default ports for the respective database software), then the Polyglot instance will not be able to start.
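A quick way to check whether those ports are free (this is not part of PolyglotDB, just plain Python sockets):

```python
import socket

def port_in_use(port, host="localhost"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# Ports a local Polyglot instance expects to be free (see the table above)
for port in (7474, 7475, 7687, 8086, 8087):
    print(port, "in use" if port_in_use(port) else "free")
```

A port reported as “in use” before pgdb start is run indicates a conflict with another program (or an already-running database).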
Once pgdb start has executed, the local Neo4j instance can be seen at http://localhost:7474/.
Connecting from a script¶
When the Polyglot instance is running locally, scripts can connect to the relevant databases through the use of parameters passed to CorpusContext objects (or CorpusConfig objects):
from polyglotdb import CorpusContext, CorpusConfig

connection_params = {'host': 'localhost',
                     'graph_http_port': 7474,
                     'graph_bolt_port': 7687,
                     'acoustic_http_port': 8086}

config = CorpusConfig('corpus_name', **connection_params)
with CorpusContext(config) as c:
    pass # replace with some task, i.e., import, enrichment, or query
These port settings are used by default, so connecting to a vanilla install of the pgdb utility can be done more simply through the following:
from polyglotdb import CorpusContext
with CorpusContext('corpus_name') as c:
    pass # replace with some task, i.e., import, enrichment, or query
See the tutorial scripts for examples that use this style of connecting to a local pgdb instance.
Local ISCAN server¶
A locally running ISCAN server is a more fully functional system that can manage multiple Polyglot databases (creating, starting, and stopping them as necessary through a graphical web interface). While ISCAN servers are intended to be run on dedicated remote servers, there will often be times when scripts need to connect to a locally running server. For this, there is a utility function ensure_local_database_running:
from polyglotdb import CorpusContext, CorpusConfig
from polyglotdb.utils import ensure_local_database_running
with ensure_local_database_running('database', port=8080, token='auth_token_from_iscan') as connection_params:
    config = CorpusConfig('corpus_name', **connection_params)
    with CorpusContext(config) as c:
        pass # replace with some task, i.e., import, enrichment, or query
Important
Replace the database, auth_token_from_iscan, and corpus_name with relevant values. In the use case of one corpus per database, database and corpus_name can be the same name, as in the SPADE analysis repository.
As compared to the example above, the only difference is the context manager use of ensure_local_database_running. What this function does is first try to connect to an ISCAN server running on the local machine. If it successfully connects, it creates a new database named "database" if it does not already exist, starts it if it is not already running, and then returns the connection parameters as a dictionary that can be used for instantiating the CorpusConfig object. Once all the work inside the context of ensure_local_database_running has been completed, the database will be stopped.
The token keyword argument should be an authentication token for a user with appropriate permissions to access the ISCAN server. This token can be found by going to the admin page for tokens within ISCAN (by default, http://localhost:8080/admin/auth_token/) and choosing an appropriate one. However, please ensure that this token is not committed or made public in any way as that would lead to security issues. One way to use this in committed code is to have the token saved in a separate text document that git does not track, and load it via a function like:
import os

base_dir = os.path.dirname(os.path.abspath(__file__))  # directory containing this script

def load_token():
    token_path = os.path.join(base_dir, 'auth_token')
    if not os.path.exists(token_path):
        return None
    with open(token_path, 'r') as f:
        token = f.read().strip()
    return token
Note
The ISCAN server keeps track of all existing databases and ensures that the ports do not overlap, so multiple databases can be run simultaneously. The ports are all in the 7400 and 8400 range, and should not (but may) conflict with other applications.
This utility is thus best for isolated work by a single user, where only they will be interacting with the particular database specified and the database only needs to be available during the running of the script.
You can see an example of connecting to local ISCAN server used in the scripts for the SPADE analysis repository, for instance the basic_queries.py script.
Importing corpora¶
Corpora can be imported from several input formats. The list of currently supported formats is:
- TextGrids from Montreal Forced Aligner or FAVE-align
- Praat TextGrids
- TextGrids exported from LaBB-CAT
- BAS Partitur format
- TIMIT
- Buckeye
Each format has an inspection function in the polyglotdb.io submodule that checks that the format of the specified directory or file matches the input format and returns the appropriate parser.
These functions are used as follows:
import polyglotdb.io as pgio
corpus_directory = '/path/to/directory'
parser = pgio.inspect_mfa(corpus_directory) # MFA output TextGrids
# OR
parser = pgio.inspect_fave(corpus_directory) # FAVE output TextGrids
# OR
parser = pgio.inspect_textgrid(corpus_directory)
# OR
parser = pgio.inspect_labbcat(corpus_directory)
# OR
parser = pgio.inspect_partitur(corpus_directory)
# OR
parser = pgio.inspect_timit(corpus_directory)
# OR
parser = pgio.inspect_buckeye(corpus_directory)
Note
For more technical detail on the inspect functions and the parser objects they return, see PolyglotDB I/O.
To import a corpus, the CorpusContext context manager has to be imported from polyglotdb:
from polyglotdb import CorpusContext
CorpusContext is the primary way through which corpora are interacted with.
Before importing a corpus, you should ensure that a Neo4j server is running.
Interacting with corpora requires submitting the connection details. The
easiest way to do this is with a utility function ensure_local_database_running
(see Interacting with a local Polyglot database for more
information):
from polyglotdb.utils import ensure_local_database_running
from polyglotdb import CorpusConfig
with ensure_local_database_running('database_name') as connection_params:
config = CorpusConfig('corpus_name', **connection_params)
The above config object contains all the configuration for the corpus.
To import a file into a corpus (in this case a TextGrid):
import polyglotdb.io as pgio
parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')
with ensure_local_database_running('database_name') as connection_params:
config = CorpusConfig('my_corpus', **connection_params)
with CorpusContext(config) as c:
c.load(parser, '/path/to/textgrid.TextGrid')
In the above code, the io module is imported and provides access to
all the importing and exporting functions. For every format, there is an
inspect function that generates a parser for that file and for others
formatted the same way. In the case of a TextGrid,
the parser has annotation types corresponding to interval and point tiers.
The inspect function tries to guess the relevant attributes of each tier.
Note
The discourse load function of CorpusContext objects takes
a parser as the first argument. Parsers contain an attribute annotation_types,
which the user can modify to change how a corpus is imported. For most standard formats, including TextGrids from
aligners, no modification is necessary.
All interaction with the databases is via the CorpusContext context manager.
Further details on import arguments can be found in the API documentation.
Once the above code is run, corpora can be queried and explored.
Enrichment¶
Following import, a corpus is often fairly bare, with just word and phone annotations. An important step in analyzing corpora is therefore enriching them with other information. Most of the methods here are automatic once a function is called.
Contents:
Subset enrichment¶
One of the most basic aspects of linguistic analysis is creating functional subsets of linguistic units. In phonology,
for instance, this would be creating classes like syllabic or coronal. For words, this might be classes like
content vs functional, or something more fine-grained like noun, adjective, verb, etc. At the core of
these analyses is the idea that we treat some subset of linguistic units separately from others. In PolyglotDB, subsets are
a fairly broad and general concept and can be applied both to linguistic types (i.e., phones or words in a lexicon) and
to tokens (i.e., actual productions in a discourse).
For instance, if we wanted to create a subset of phone types that are syllabic, we can run the following code:
syllabics = ['aa', 'ih']
with CorpusContext('corpus') as c:
c.encode_type_subset('phone', syllabics, 'syllabic')
This type subset can then be used as in Creating syllable units, or for the queries in Subsetting annotations.
Token subsets can also be created, see Enrichment via queries.
Creating syllable units¶
Syllables are groupings of phones into larger units, within words. PolyglotDB enforces a strict hierarchy, with the boundaries of words aligning with syllable boundaries (i.e., syllables cannot stretch across words).
At the moment, only one algorithm (maximal onset) is supported, because its simplicity makes it language-agnostic.
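As a rough illustration of the maximal-onset idea, here is a sketch in plain Python (this is not PolyglotDB's implementation; the phone labels and onset inventory below are hypothetical):

```python
def syllabify(phones, syllabics, legal_onsets):
    """Group a word's phones into syllables using maximal onset.

    legal_onsets is a set of tuples of phone labels that may begin a syllable.
    """
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    if not nuclei:
        return [list(phones)]
    syllables = []
    start = 0
    for j, n in enumerate(nuclei):
        if j + 1 == len(nuclei):
            # Last nucleus: everything remaining forms the final syllable
            syllables.append(list(phones[start:]))
            break
        cluster = phones[n + 1:nuclei[j + 1]]
        # Maximal onset: give the following syllable the longest legal onset,
        # i.e. the smallest split point k whose suffix is a legal onset
        split = len(cluster)
        for k in range(len(cluster) + 1):
            if k == len(cluster) or tuple(cluster[k:]) in legal_onsets:
                split = k
                break
        syllables.append(list(phones[start:n + 1]) + list(cluster[:split]))
        start = n + 1 + split
    return syllables

print(syllabify(['p', 'ae', 's', 't', 'ah'], {'ae', 'ah'}, {('t',)}))
# [['p', 'ae', 's'], ['t', 'ah']]
```

With ('s', 't') also in the onset inventory, both consonants would instead go to the second syllable's onset.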
To encode syllables, there are two steps:
Encoding syllabic segments¶
Syllabic segments are encoded via a specialized function:
syllabic_segments = ['aa', 'ae','ih']
with CorpusContext(config) as c:
c.encode_syllabic_segments(syllabic_segments)
Following this code, all phones with labels of aa, ae, ih will belong to the subset syllabic. This subset can be then queried in the future, in addition to allowing syllables to be encoded.
Encoding syllables¶
with CorpusContext(config) as c:
c.encode_syllables()
Note
The function encode_syllables can be given a keyword argument for call_back, which is a function like print that allows for progress to be output to the console.
Following encoding, syllables are available to be queried and used like any other linguistic unit. For example, to get a list of all the instances of syllables at the beginnings of words:
with CorpusContext(config) as c:
q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
print(q.all())
Encoding syllable properties from syllabics¶
Often in corpora there is information about syllables contained on the vowels. For instance, if the transcription contains stress levels, they will be specified as numbers 0-2 on the vowels (i.e. as in Arpabet). Tone is likewise similarly encoded in some transcription systems. This section details functions that strip this information from the vowel and place it on the syllable unit instead.
Note
Removing the stress/tone information from the vowel makes queries easier, as getting all AA tokens no longer requires specifying that the label is in the set of AA1, AA2, AA0. This functionality can be disabled by specifying clean_phone_label=False in the two functions that follow.
Encoding stress¶
with CorpusContext(config) as c:
c.encode_stress_to_syllables()
Note
By default, stress is taken to be numbers in the vowel label (i.e., AA1 would have a stress of 1). A different pattern to use for stress information can be specified through the optional regex keyword argument.
Encoding tone¶
with CorpusContext(config) as c:
c.encode_tone_to_syllables()
Note
As for stress, a different regex can be specified with the regex keyword argument.
Creating utterance units¶
Utterances are groups of words that are continuous in some sense. They can be thought of as similar to interpausal units or chunks in other work. The basic idea is that there are intervals in which there is no speech, a subset of which count as breaks in speech, depending on the length of these non-speech intervals.
To encode utterances, there are two steps:
Encoding non-speech elements¶
Non-speech elements in PolyglotDB are termed pause. Pauses are encoded as follows:
nonspeech_words = ['<SIL>','<IVER>']
with CorpusContext(config) as c:
c.encode_pauses(nonspeech_words)
The function encode_pauses takes a list of word labels that should not be considered speech in a discourse and marks them as such.
Note
Non-speech words can also be encoded through regular expressions, as in:
nonspeech_words = '^[<[{].*'
with CorpusContext(config) as c:
c.encode_pauses(nonspeech_words)
Where the pattern to be matched is any label that starts with <, [, or {.
Once pauses are encoded, aspects of pauses can be queried, as follows:
with CorpusContext(config) as c:
q = c.query_graph(c.pause).filter(c.pause.discourse.name == 'one_discourse')
print(q.all())
Additionally, word annotations can have previous and following pauses that can be found:
with CorpusContext(config) as c:
q = c.query_graph(c.word).columns(c.word.label,
c.word.following_pause_duration.column_name('pause_duration'))
print(q.all())
Note
Once pauses are encoded, accessing an annotation’s previous or following word via c.word.previous will skip over any pauses. So for a string like I <SIL> go…, the previous word to the word go would be I rather than <SIL>.
Encoding utterances¶
Once pauses are encoded, utterances can be encoded by specifying the minimum length of non-speech elements that count as a break between stretches of speech.
with CorpusContext(config) as c:
c.encode_utterances(min_pause_length=0.15)
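Conceptually, the grouping works like the following sketch in plain Python (an illustration, not PolyglotDB's implementation; the token tuples are hypothetical):

```python
def group_utterances(tokens, min_pause_length=0.15):
    """tokens: time-ordered (label, begin, end, is_pause) tuples.

    Pauses at least min_pause_length seconds long break utterances;
    shorter pauses are ignored.
    """
    utterances, current = [], []
    for label, begin, end, is_pause in tokens:
        if is_pause:
            # Only a sufficiently long pause ends the current utterance
            if (end - begin) >= min_pause_length and current:
                utterances.append(current)
                current = []
        else:
            current.append(label)
    if current:
        utterances.append(current)
    return utterances

tokens = [('hi', 0.0, 0.3, False), ('<SIL>', 0.3, 0.35, True),
          ('there', 0.35, 0.6, False), ('<SIL>', 0.6, 1.0, True),
          ('ok', 1.0, 1.2, False)]
print(group_utterances(tokens))
# [['hi', 'there'], ['ok']]
```

Here the 0.05 s pause is too short to count as a break, while the 0.4 s pause splits the speech into two utterances.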
Note
The function encode_utterances can be given a keyword argument for call_back, which is a function like print that allows for progress to be output to the console.
Following encoding, utterances are available to be queried and used like any other linguistic unit. For example, to get a list of all the instances of words at the beginnings of utterances:
with CorpusContext(config) as c:
q = c.query_graph(c.word).filter(c.word.begin == c.word.utterance.begin)
print(q.all())
Hierarchical enrichment¶
Hierarchical enrichment is for encoding properties that reference multiple levels of annotations. For instance, something like speech rate of an utterance requires referencing both utterances as well as the rate per second of an annotation type below, usually syllables. Likewise, encoding number of syllables in a word or the position of a phone in a word again reference multiple levels of annotation.
Note
See Annotation Graphs for details on the implementation and representations of the annotation graph hierarchy that PolyglotDB uses.
Encode count¶
Count enrichment creates a property on the higher annotation that is a measure of the number of lower annotations of a type it contains. For instance, if we want to encode how many phones there are within each word, the following code is used:
with CorpusContext('corpus') as c:
c.encode_count('word', 'phone', 'number_of_phones')
Following enrichment, all word tokens will have a property number_of_phones that can be referenced in queries and exports.
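For example, a sketch of a query filtering on the new property (following the query patterns shown elsewhere in this documentation):

```python
with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.number_of_phones >= 5)
    q = q.columns(c.word.label, c.word.number_of_phones)
    print(q.all())
```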
Encode rate¶
Rate enrichment creates a property on a higher annotation that is a measure of lower annotations per second. It is calculated as the count of units contained by the higher annotation divided by the duration of the higher annotation.
with CorpusContext('corpus') as c:
c.encode_rate('word', 'phone', 'phones_per_second')
Following enrichment, all word tokens will have a property phones_per_second that can be referenced in queries and exports.
Encode position¶
Position enrichment creates a property on the lower annotation that is the position of the element in relation to other annotations within a higher annotation. It starts at 1 for the first element.
with CorpusContext('corpus') as c:
c.encode_position('word', 'phone', 'position_in_word')
The encoded property is then queryable/exportable, as follows:
with CorpusContext('corpus') as c:
q = c.query_graph(c.phone).filter(c.phone.position_in_word == 1)
print(q.all())
The above query will match all phones in the first position (i.e., identical results to a query using alignment, see Hierarchical queries for more details on those).
Enrichment via CSV files¶
PolyglotDB supports ways of adding arbitrary information to annotations or metadata about speakers and files by specifying a local CSV file to add information from. When constructing this CSV file, the first column should be the label used to identify which element should be enriched, and all subsequent columns are used as properties to add to the corpus.
ID_column,property_one,property_two
first_item,first_item_value_one,first_item_value_two
second_item,,second_item_value_two
Enriching using this file would look up elements based on the ID_column. The element matching first_item would get
both property_one and property_two (with the respective values). The one matching second_item would only get
property_two (because the value for property_one is empty).
Enriching the lexicon¶
lexicon_csv_path = '/full/path/to/lexicon/data.csv'
with CorpusContext(config) as c:
c.enrich_lexicon_from_csv(lexicon_csv_path)
Note
The function enrich_lexicon_from_csv accepts an optional keyword argument case_sensitive, which defaults to False. Setting it to True will respect capitalization when looking up words.
Enriching the phonological inventory¶
The phone inventory can be enriched with arbitrary properties via:
inventory_csv_path = '/full/path/to/inventory/data.csv'
with CorpusContext(config) as c:
c.enrich_inventory_from_csv(inventory_csv_path)
Enriching speaker information¶
Speaker information can be added via:
speaker_csv_path = '/full/path/to/speaker/data.csv'
with CorpusContext(config) as c:
c.enrich_speakers_from_csv(speaker_csv_path)
Enriching discourse information¶
Metadata about the discourses or sound files can be added via:
discourse_csv_path = '/full/path/to/discourse/data.csv'
with CorpusContext(config) as c:
c.enrich_discourses_from_csv(discourse_csv_path)
Enriching arbitrary tokens¶
Often it’s necessary or useful to encode a new property on tokens of an annotation without directly interfacing with the database.
This could happen, for example, if you wanted to use a different language or tool than Python for a certain phonetic analysis.
In this case, it is possible to enrich any type of token via CSV, using the enrich_tokens_from_csv function.
token_csv_path = '/full/path/to/discourse/data.csv'
with CorpusContext(config) as c:
    c.enrich_tokens_from_csv(token_csv_path,
                             annotation_type="phone",
                             id_column="phone_id",
                             properties=["measurement_1", "measurement_2"])
The only requirement for the CSV is that there is a column containing the IDs of the tokens you wish to update.
You can get these IDs (along with other properties) by querying the tokens beforehand and exporting a CSV; see Export for token CSVs.
The only columns from the CSV that will be added as token properties are those included in the properties parameter.
If this parameter is left as None, then all columns of the CSV except the id_column will be included.
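The CSV itself is plain tabular data; as a sketch, it could be generated with Python's csv module (the IDs and measurement values below are hypothetical placeholders):

```python
import csv

# Hypothetical per-token measurements keyed by token ID
rows = [
    {'phone_id': 'id-0001', 'measurement_1': 1.5, 'measurement_2': 0.2},
    {'phone_id': 'id-0002', 'measurement_1': 2.1, 'measurement_2': 0.4},
]
with open('data.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['phone_id', 'measurement_1', 'measurement_2'])
    writer.writeheader()
    writer.writerows(rows)
```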
Enrichment via queries¶
Queries have the functionality to set properties and create subsets of elements based on results.
For instance, if you wanted to make word initial phones more easily queryable, you could perform the following:
with CorpusContext(config) as c:
q = c.query_graph(c.phone)
q = q.filter(c.phone.begin == c.phone.word.begin)
q.create_subset('word-initial')
Once that code completes, a subsequent query could be made of:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'word-initial')
    print(q.all())
Or instead of a subset, a property could be encoded as:
with CorpusContext(config) as c:
q = c.query_graph(c.phone)
q = q.filter(c.phone.begin == c.phone.word.begin)
q.set_properties(position='word-initial')
And then this property can be exported as a column in a csv:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.position)
    q.to_csv(some_csv_path)
Lexicon queries can also be used in the same way to create subsets and encode properties that do not vary on a token by token basis.
For instance, a subset for high vowels can be created as follows:
with CorpusContext(config) as c:
high_vowels = ['iy', 'ih','uw','uh']
q = c.query_lexicon(c.lexicon_phone)
q = q.filter(c.lexicon_phone.label.in_(high_vowels))
q.create_subset('high_vowel')
Which can then be used to query phone annotations:
with CorpusContext(config) as c:
q = c.query_graph(c.phone)
q = q.filter(c.phone.subset == 'high_vowel')
print(q.all())
Acoustic measures¶
One of the most important steps in analyzing a corpus is taking acoustic measurements of the data in the corpus. The acoustics functions listed here allow you to analyze and save acoustics into the database, to be queried later. There are several automatic acoustics functions in PolyglotDB, and you can also encode other measures using the Praat script function.
Contents:
Encoding acoustic measures¶
PolyglotDB has some built-in functions to encode certain acoustic measures, and also supports encoding measures from a Praat script. All of these functions carry out the acoustic analysis and save the results into the database. It currently contains built-in functions to encode pitch, intensity, and formants.
In order to use any of them on your own computer, you must set your CorpusContext’s config.praat_path to point to Praat, and likewise config.reaper_path to Reaper if you want to use Reaper for pitch.
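For example (the executable locations below are hypothetical; use wherever Praat and Reaper are installed on your machine):

```python
config = CorpusConfig('corpus_name', **connection_params)
config.praat_path = '/usr/local/bin/praat'    # hypothetical location of the Praat executable
config.reaper_path = '/usr/local/bin/reaper'  # only needed if using Reaper for pitch
```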
Encoding pitch¶
Pitch is encoded using the analyze_pitch function, as follows:
with CorpusContext(config) as c:
c.analyze_pitch()
Note
The function analyze_pitch
requires that utterances be encoded prior to being run.
See Creating utterance units for further details on encoding utterances.
Following encoding, pitch tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.
Pitch analysis can be configured in two ways, the source program of the measurement and the algorithm for fine tuning the pitch range.
Sources¶
The keyword argument source can be set to either 'praat' or 'reaper', depending on which program you would like PolyglotDB to use to measure pitch.
The default source is Praat.
with CorpusContext(config) as c:
c.analyze_pitch()
# OR
c.analyze_pitch(source='reaper')
If the source is praat, the Praat executable must be discoverable on the system path (i.e., a call of praat in a terminal works). Likewise, if the source is reaper, the Reaper executable must be on the path or the full path to the Reaper executable must be specified.
Algorithms¶
Similar to the source attribute, the algorithm can be toggled between "base", "gendered", and "speaker_adapted".
with CorpusContext(config) as c:
c.analyze_pitch()
# OR
c.analyze_pitch(algorithm='gendered')
# OR
c.analyze_pitch(algorithm='speaker_adapted')
The "base" algorithm uses a default minimum pitch of 50 Hz and a maximum pitch of 500 Hz, but these can be changed through the absolute_min_pitch and absolute_max_pitch parameters.
The "gendered" algorithm checks whether a Gender property is available for speakers. If a speaker has a property
value that starts with f (i.e., female), utterances by that speaker will use a minimum pitch of 100 Hz and a maximum pitch of 500 Hz. If they have a property
value of m (i.e., male), utterances by that speaker will use a minimum pitch of 50 Hz and a maximum pitch of 400 Hz.
The "speaker_adapted" algorithm does two passes of pitch estimation. The first pass is identical to "base" and uses a minimum pitch of 50 Hz and a maximum pitch of 500 Hz (or whatever the parameters have been set to).
This first pass is used to estimate by-speaker means of F0. Speaker-specific pitch floors and ceilings are then calculated by subtracting or adding the number of octaves that the adjusted_octaves parameter specifies. The default is 1, so the per-speaker pitch range will be one octave below and above the speaker’s mean pitch.
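For instance, widening the per-speaker range to two octaves on each side of the speaker mean, using the adjusted_octaves parameter described above:

```python
with CorpusContext(config) as c:
    c.analyze_pitch(algorithm='speaker_adapted', adjusted_octaves=2)
```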
Encoding intensity¶
Intensity is encoded using analyze_intensity()
, as follows:
with CorpusContext(config) as c:
c.analyze_intensity()
Note
The function analyze_intensity
requires that utterances be encoded prior to being run. See
Creating utterance units for further details on encoding utterances.
Following encoding, intensity tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.
Encoding formants¶
There are several ways of encoding formants. The first is encodes formant tracks similar to encoding pitch or intensity tracks (i.e., done over utterances). There is also support for encoding formants tracks just over specified vowel segments. Finally, point measures of formants can be encoded. Both formant tracks and points can be calculated using either just a simple one-pass algorithm or by using a multiple-pass refinement algorithm.
Basic formant tracks¶
Formant tracks over utterances are encoded using analyze_formant_tracks
, as follows:
with CorpusContext(config) as c:
c.analyze_formant_tracks()
Note
The function analyze_formant_tracks
requires that utterances be encoded prior to being run. See
Creating utterance units for further details on encoding utterances.
Following encoding, formant tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.
Formant tracks can also be encoded just for specific phones by specifying a subset of phones:
with CorpusContext(config) as c:
c.analyze_formant_tracks(vowel_label='vowel')
Note
This usage requires that a vowel subset of phone types be already encoded in the database.
See Enrichment via queries for more details on creating subsets.
These formant tracks do not do any specialised analysis to ensure that they are not false formants.
Basic formant point measurements¶
The analyze_formant_points function will generate measures for F1, F2, F3, B1, B2, and B3 at the time
point 33% of the way through the vowel, for every vowel specified.
with CorpusContext(config) as c:
c.analyze_formant_points(vowel_label='vowel')
Note
The function analyze_formant_points requires that a vowel subset of phone types be already encoded in the database.
See Enrichment via queries for more details on creating subsets.
Refined formant points and tracks¶
The other function for generating both point and track measurements is analyze_formant_points_refinement. This function computes
formant measurements for multiple values of n_formants, from 4 to 7. To pick the best measurement, the function initializes per-vowel
means and standard deviations with the F1, F2, F3, B1, B2, B3 values generated by n_formants=5. Then it performs multiple iterations, each selecting the new best track as the one that
minimizes the Mahalanobis distance to the relevant prototype.
To choose whether tracks or points are saved in the database, set the output_tracks parameter to True for tracks and False otherwise.
When operating over tracks, the algorithm still evaluates the best parameters using only the 33% point.
with CorpusContext(config) as c:
c.analyze_formant_points_refinement(vowel_label='vowel')
Following encoding, phone types that were analyzed will have properties for F1, F2, F3, B1, B2, and B3 available for query and export. See Querying acoustic point measures for more details.
Encoding Voice Onset Time (VOT)¶
Currently there is only one method to encode Voice Onset Times (VOTs) into PolyglotDB. This makes use of the AutoVOT program, which automatically calculates VOTs based on various acoustic properties.
VOTs are encoded over a specific subset of phones using analyze_vot as follows:
with CorpusContext(config) as c:
    c.analyze_vot(classifier,
                  stop_label="stops",
                  vot_min=5,
                  vot_max=100,
                  window_min=-30,
                  window_max=30)
Note
The function analyze_vot
requires that utterances and any subsets be encoded prior to being run. See
Creating utterance units for further details on encoding utterances and Subset enrichment for subsets.
Parameters¶
The analyze_vot function has a variety of parameters that are important for running it properly. classifier is a string containing the path to an AutoVOT classifier directory. A default classifier is available in /tests/data/classifier/sotc_classifiers.
stop_label refers to the name of the subset of phones that you intend to calculate VOTs for.
vot_min and vot_max refer to the minimum and maximum duration of any VOT that is calculated. The AutoVOT repository (https://github.com/mlml/autovot) has some sane defaults for English voiced and voiceless stops.
window_min and window_max refer to the edges of a given phone’s duration. So, a window_min of -30 means that AutoVOT will look up to 30 milliseconds before the start of a phone for the burst, and a window_max of 30 means that it will look up to 30 milliseconds after the end of a phone.
Encoding other measures using a Praat script¶
Other acoustic measures can be encoded by passing a Praat script to analyze_script
.
The requirements for the Praat script are:
exactly one input: the full path to the sound file containing (only) the phone. (Any other parameters can be set manually within your script, and an existing script may need some other modifications in order to work on this type of input)
print the resulting acoustic measurements (or other properties) to the Praat Info window in the following format:
- The first line should be a space-separated list of column names. These are the names of the properties that will be saved into the database.
- The second line should be a space-separated list containing one measurement for each property.
- (It is okay if there is some blank space before/after these two lines.)
An example of the Praat output:
peak slope cog spread
5540.7376 24.3507 6744.0670 1562.1936
Output format if you are only taking one measure:
cog 6013.9
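This two-line format is simply property names followed by their values. As an illustration of what PolyglotDB expects from the script (the parser below is hypothetical, not part of the PolyglotDB API):

```python
def parse_praat_info(text):
    # First non-blank line: property names; second: one value per property
    lines = [l for l in text.strip().splitlines() if l.strip()]
    names = lines[0].split()
    values = [float(v) for v in lines[1].split()]
    return dict(zip(names, values))

print(parse_praat_info("peak slope cog spread\n5540.7376 24.3507 6744.0670 1562.1936"))
```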
To run analyze_script, do the following:
- encode a phone class for the subset of phones you would like to analyze
- call analyze_script on that phone class, with the path to your script
For example, to run a script which takes measures for sibilants:
with CorpusContext(config) as c:
c.encode_class(['S', 'Z', 'SH', 'ZH'], 'sibilant')
c.analyze_script('sibilant', 'path/to/script/sibilant.praat')
Querying acoustic measures¶
Querying acoustic tracks¶
All of the built-in acoustic measures are saved as tracks with 10 ms intervals in the database (including formants created using one of the
functions in Encoding formants). To access these values, you use corpus_context.phone.MEASUREMENT.track, replacing MEASUREMENT
with the name of the measurement you want: pitch, formants, or intensity.
Example: querying for a formant track:
with CorpusContext(config) as c:
q = c.query_graph(c.phone)
q = q.columns(c.phone.begin, c.phone.end, c.phone.formants.track)
results = q.all()
q.to_csv('path/to/output.csv')
You can also find the min, max, and mean of the track for each phone, using corpus_context.phone.MEASUREMENT.min, etc.
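For example, a sketch of exporting per-phone pitch summaries (following the query patterns above; the output path is a placeholder):

```python
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.label, c.phone.pitch.mean, c.phone.pitch.max)
    q.to_csv('path/to/pitch_summary.csv')
```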
Querying acoustic point measures¶
Acoustic measures that only have one measurement per phone are termed point measures and are accessed as regular properties of the annotation.
Anything encoded using analyze_script is not saved as a track; instead, one value is recorded for each phone. These are accessed using corpus_context.phone.MEASUREMENT, replacing MEASUREMENT with the name of the measurement you want.
Example: querying for cog
(center of gravity)
with CorpusContext(config) as c:
q = c.query_graph(c.phone)
q = q.columns(c.phone.begin, c.phone.end, c.phone.cog)
results = q.all()
q.to_csv('path/to/output.csv')
Querying Voice Onset Time¶
Querying voice onset time is done in the same way as acoustic point measures; however, the vot object itself has different measures associated with it.
So you must also include what you would like from the vot measurement, as shown below.
with CorpusContext(config) as c:
q = c.query_graph(c.phone)
q = q.columns(c.phone.vot.begin, c.phone.vot.end, c.phone.vot.confidence)
results = q.all()
q.to_csv('path/to/output.csv')
Subannotation enrichment¶
Often there are details which we would like to include on a linguistic annotation (word, syllable, phone, etc.) which are not a simple measure like a single value or a one value across time. An example of this would be Voice Onset Time (VOT), where we have two distinct parts (voicing onset and burst) which cannot just be reduced to a single value.
In PolyglotDB, we refer to these more complicated structures as sub-annotations as they provide details that cannot just be a single measure like formants or pitch. These sub-annotations are always attached to a regular linguistic annotation, but they have all of their own properties.
So, for example, a given phone token could have a vot subannotation, which would consist of several different related values: the onset, burst, or confidence (of the prediction) of the VOT in question.
This allows semantically linked measurements to be treated as a single object with multiple values rather than several distinct measurements that happen to be related.
For information on querying subannotations, see Subannotation queries.
Querying corpora¶
Queries are the primary function of PolyglotDB. The goal for the query API is to provide an extensible base for more complex linguistic query systems to be built.
Contents:
Querying annotations¶
The main way of finding specific annotations is through the query_graph method of CorpusContext objects.
with CorpusContext(config) as c:
q = c.query_graph(c.word).filter(c.word.label == 'are')
results = q.all()
print(results)
The above code will find and print all instances of word annotations that are
labeled with ‘are’. The method query_graph takes one argument, which is
an attribute of the context manager corresponding to the name of the
annotation type.
The primary function for queries is filter. This function takes one or more
conditional expressions on attributes of annotations. In the above example,
word annotations have an attribute label which corresponds to the
orthography.
Conditional expressions can take on any normal Python conditional (==, !=, <, <=, >, >=). The Python
operator in does not work; a special pattern has to be used:
with CorpusContext(config) as c:
q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
results = q.all()
print(results)
The in_ conditional function can take any iterable, including another query:
with CorpusContext(config) as c:
sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
results = q.all()
print(results)
In this case, it will find all phone annotations that are in the words
listed. Using the id attribute will use unique identifiers for the filter.
In this particular instance, it does not matter, but it does in the following:
with CorpusContext(config) as c:
sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
sub_q = sub_q.filter_right_aligned(c.word.line)
q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
results = q.all()
print(results)
The above query will find all instances of the three words, but only where
they are right-aligned with a line annotation.
Note
Queries are lazily evaluated. In the above example, sub_q is
not evaluated until q.all() is called. This means that filters
can be chained across multiple lines without a performance hit.
Following and previous annotations¶
Filters can reference the surrounding local context. For instance:
with CorpusContext(config) as c:
q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
q = q.filter(c.phone.following.label == 'r')
results = q.all()
print(results)
The above query will find all the ‘aa’ phones that are followed by an ‘r’
phone. Similarly, c.phone.previous
would provide access to filtering on
preceding phones.
Subsetting annotations¶
In linguistics, it’s often useful to specify subsets of symbols as particular classes. For instance, phonemes are grouped together by whether they are syllabic, their manner/place of articulation, and vowel height/backness/rounding, and words are grouped by their parts of speech.
Suppose a subset has been created as in Subset enrichment, so that the phones ‘aa’ and ‘ih’ have been marked as syllabic. Once this category is encoded in the database, it can be used in filters.
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'syllabic')
    results = q.all()
    print(results)
Note
The results returned by the above query will be identical to the similar query:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label.in_(['aa', 'ih']))
    results = q.all()
    print(results)
The primary benefits of using subsets are performance based due to the inner workings of Neo4j. See Neo4j implementation for more details.
Another way to specify subsets is on the phone annotations themselves, as follows:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone.filter_by_subset('syllabic'))
    results = q.all()
    print(results)
Both of these queries are identical and will return all instances of ‘aa’ and ‘ih’ phones. The benefit of filter_by_subset is generally for use in Hierarchical queries.
Note
Using subsets repeatedly in queries can make them overly verbose. The objects that the queries use are normal Python objects and can therefore be assigned to variables for easier use.
with CorpusContext(config) as c:
    syl = c.phone.filter_by_subset('syllabic')
    q = c.query_graph(syl)
    q = q.filter(syl.end == syl.word.end)
    results = q.all()
    print(results)
The above query would find all phones marked as 'syllabic' that are at the ends of words.
Hierarchical queries¶
A key facet of language is that it is hierarchical. Words contain phones, and can be contained in larger utterances. There are several ways to query hierarchical information. If we want to find all aa phones in the word dogs, then we can perform the following query:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.word.label == 'dogs')
    results = q.all()
    print(results)
Starting from the word level, we might want to know what phones each word contains.
with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.label.column_name('phones'))
    results = q.all()
    print(results)
In the output of the above query, there would be a column labeled phones that contains a list of the labels of the phones belonging to the word (['d', 'aa', 'g', 'z']). Any property of phones can be queried this way (i.e., begin, end, duration, etc.).
Going down the hierarchy, we can also find all words that contain a certain phone.
with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is', 'am']))
    q = q.filter(c.word.phone.label == 'aa')
    results = q.all()
    print(results)
In this example, it will find all instances of the three words that contain an aa phone.
Special keywords exist for these containment columns. The keyword rate returns the number of elements per second for the word (i.e., phones per second). The keyword count returns the number of elements.
with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)
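Conceptually, count and rate are simple computations over the contained phones. A plain-Python equivalent, using invented phone intervals for a hypothetical token of "dogs", would be:

```python
# Hypothetical word token: a list of (label, begin, end) phone intervals
phones = [('d', 0.00, 0.05), ('aa', 0.05, 0.20), ('g', 0.20, 0.28), ('z', 0.28, 0.40)]

word_begin = phones[0][1]
word_end = phones[-1][2]

num_phones = len(phones)                                   # what `count` returns
phones_per_second = num_phones / (word_end - word_begin)   # what `rate` returns

print(num_phones)
print(phones_per_second)
```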
These keywords can also leverage subsets, as above:
with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.filter_by_subset('syllabic').count.column_name('num_syllabic_phones'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)
Additionally, there is a special keyword, position, that can be used to query the position of a contained element within a containing one.
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.word.label == 'dogs')
    q = q.columns(c.word.phone.position.column_name('position_in_word'))
    results = q.all()
    print(results)
The above query should return 2 for the value of position_in_word, as the aa phone would be the second phone.
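Conceptually, the position is just the 1-based index of the phone within its word; a plain-Python equivalent (with a made-up transcription) would be:

```python
# Phones of a hypothetical token of "dogs"
phones = ['d', 'aa', 'g', 'z']

# position is 1-based: the first phone has position 1
position_in_word = phones.index('aa') + 1
print(position_in_word)  # 2
```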
Subannotation queries¶
Annotations can have subannotations associated with them. Subannotations are not independent linguistic types, but have more information associated with them than just a single property. For instance, voice onset time (VOT) would be a subannotation of stops (as it has a begin time and an end time that are of interest). For more information on subannotations, see Subannotation enrichment. Querying such subannotations would be performed as follows:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.vot.duration.column_name('vot'))
    results = q.all()
    print(results)
In some cases, it may be desirable to have more than one subannotation of the same type associated with a single annotation. For instance, voicing during the closure of a stop can take place at both the beginning and end of closure, with an unvoiced period in the middle. Using a similar query as above would get the durations of each of these (in the order of their begin time):
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.voicing_during_closure.duration.column_name('voicing'))
    results = q.all()
    print(results)
In some cases, we might like to know the total duration of such subannotations, rather than the individual durations. To query that information, we can use an aggregate:
from polyglotdb.query.base.func import Sum

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    results = q.aggregate(Sum(c.phone.voicing_during_closure.duration).column_name('total_voicing'))
    print(results)
Miscellaneous¶
Aggregates and groups¶
Aggregate functions are available in polyglotdb.query.base.func. The aggregate functions available are:
- Average
- Count
- Max
- Min
- Stdev
- Sum
In general, these functions take a numeric attribute as an argument. The only one that does not follow this pattern is Count.
from polyglotdb.query.base.func import Count

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Count())
    print(result)
Like the all function, aggregate triggers evaluation of the query. Instead of returning rows, it returns a single number: the number of rows matching this query.
from polyglotdb.query.base.func import Average

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Average(c.phone.duration))
    print(result)
The above aggregate function will return the average duration for all ‘aa’ phones followed by ‘r’ phones.
Aggregates are particularly useful with grouping. For instance:
from polyglotdb.query.base.func import Average

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r', 'l']))
    q = q.group_by(c.phone.following.label.column_name('following_label'))
    result = q.aggregate(Average(c.phone.duration), Count())
    print(result)
The above query will return the average duration and the count of ‘aa’ phones grouped by whether they’re followed by an ‘r’ or an ‘l’.
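The grouping logic corresponds to a standard group-then-aggregate computation; a plain-Python sketch with invented durations:

```python
from collections import defaultdict
from statistics import mean

# Invented (following_label, duration) pairs for 'aa' tokens
tokens = [('r', 0.12), ('r', 0.18), ('l', 0.20), ('l', 0.12), ('r', 0.15)]

groups = defaultdict(list)
for following_label, duration in tokens:
    groups[following_label].append(duration)

# One row per group, mirroring Average(c.phone.duration) and Count()
rows = {label: (mean(durations), len(durations)) for label, durations in groups.items()}
print(rows)
```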
Note
In the above example, the group_by attribute is supplied with an alias for output. In the print statement and in the results, the column will be called ‘following_label’ instead of the default (more opaque) name.
Ordering¶
The order_by function is used to provide an ordering to the results of a query.
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r', 'l']))
    q = q.filter(c.phone.discourse == 'a_discourse')
    q = q.order_by(c.phone.begin)
    results = q.all()
    print(results)
The results for the above query will be ordered by the timepoint of the annotation. Ordering by time is most useful for when looking at single discourses (as including multiple discourses in a query would invalidate the ordering).
Note
In grouped aggregate queries, ordering defaults to the first group_by attribute. This can be changed by calling order_by before evaluating with aggregate.
Lexicon queries¶
Querying the lexicon is in many ways similar to querying annotations in graphs.
with CorpusContext(config) as c:
    q = c.query_lexicon(c.lexicon_phone).filter(c.lexicon_phone.label == 'aa')
    print(q.all())
The above query will just return one result (as there is only one phone type with a given label) as opposed to the multiple results returned when querying annotations.
Speaker queries¶
Querying speaker information is similar to querying other aspects of the corpus, and speaker queries function very similarly to discourse queries. Queries are constructed through the function query_speakers:
with CorpusContext(config) as c:
    q = c.query_speakers()
    speakers = [x['name'] for x in q.all()]
    print(speakers)
The above code will print all of the speakers in the current corpus. Like other queries, speakers can be filtered by properties that are encoded for them and specific information can be extracted.
with CorpusContext(config) as c:
    q = c.query_speakers().filter(c.speaker.name == 'Speaker 1')
    q = q.columns(c.speaker.discourses.name.column_name('discourses'))
    speaker1_discourses = q.all()[0]['discourses']
    print(speaker1_discourses)
The above query will print out all the discourses that the speaker identified as "Speaker 1" spoke in.
Discourse queries¶
Discourses can also be queried, and discourse queries function very similarly to speaker queries. Queries are constructed through the function query_discourses:
with CorpusContext(config) as c:
    q = c.query_discourses()
    discourses = [x['name'] for x in q.all()]
    print(discourses)
The above code will print all of the discourses in the current corpus. Like other queries, discourses can be filtered by properties that are encoded for them and specific information can be extracted.
with CorpusContext(config) as c:
    q = c.query_discourses().filter(c.discourse.name == 'File 1')
    q = q.columns(c.discourse.speakers.name.column_name('speakers'))
    file1_speakers = q.all()[0]['speakers']
    print(file1_speakers)
The above query will print out all the speakers that spoke in the discourse identified as "File 1".
Exporting query results¶
Exporting query results is simply a matter of calling the to_csv function of a query, rather than its all function.
csv_path = '/path/to/save/file.csv'

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label == 'are')
    q = q.columns(c.word.label.column_name('word'), c.word.duration,
                  c.word.speaker.name.column_name('speaker'))
    q.to_csv(csv_path)
All queries, including those over annotations, speakers, discourses, etc., have this method available for creating CSV files from their results. The columns function allows users to list any attributes within the query (i.e., properties of the word, or of any higher/previous/following annotation, or anything about the speaker, etc.). These attributes by default have a column header generated based on the query, but these headers can be overridden through the column_name function, as above.
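The exported file is an ordinary CSV whose header row contains the column names chosen in the query, so it can be post-processed with standard tools. For instance (simulating the shape of the export above with invented values, using an in-memory buffer instead of a file):

```python
import csv
import io

# Simulate the shape of the exported file above (invented values)
exported = io.StringIO()
writer = csv.DictWriter(exported, fieldnames=['word', 'duration', 'speaker'])
writer.writeheader()
writer.writerow({'word': 'are', 'duration': 0.25, 'speaker': 'Speaker 1'})
exported.seek(0)

# Read it back as any downstream tool would
rows = list(csv.DictReader(exported))
print(rows[0]['word'], rows[0]['speaker'])
```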
Export for token CSVs¶
If you wish to add properties to a set of tokens by means of a CSV, this can be achieved by using the token import tool explained in Enriching arbitrary tokens. In order to do this you will need a CSV that contains the ID of each token that you wish to evaluate. The following code shows how to export all phones with their ID, begin, end and sound file, which could be useful for a phonetic analysis in an external tool.
csv_path = '/path/to/save/file.csv'

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.label,
                  c.phone.id,
                  c.phone.begin,
                  c.phone.end,
                  c.phone.discourse.name)
    q = q.order_by(c.phone.begin)
    q.to_csv(csv_path)
Query Reference¶
Getting elements¶
c.phone
c.lexicon_phone
c.speaker
c.discourse
Attributes¶
In addition to any values that get added through enrichment, there are several built in attributes that allow access to different parts of the database.
Attribute type | Code | Notes
---|---|---
Label [1] | c.phone.label |
Name [2] | c.speaker.name |
Begin [3] | c.phone.begin |
End [3] | c.phone.end |
Duration [3] | c.phone.duration |
Previous annotation [3] | c.phone.previous |
Following annotation [3] | c.phone.following |
Previous pause [3] | c.phone.word.previous_pause | Must be from a word annotation
Following pause [3] | c.phone.word.following_pause | Must be from a word annotation
Speaker [3] | c.phone.speaker |
Discourse [3] | c.phone.discourse |
Pitch attribute [3] | c.phone.pitch |
Formants attribute [3] | c.phone.formants |
Intensity attribute [3] | c.phone.intensity |
Minimum value [4] | c.phone.pitch.min |
Maximum value [4] | c.phone.pitch.max |
Mean value [4] | c.phone.pitch.mean |
Raw track [4] | c.phone.pitch.track |
Sampled track [4] | c.phone.pitch.sampled_track |
Interpolated track [4] | c.phone.pitch.interpolated_track |

[1] Only available for graph annotations and lexicon annotations
[2] Only available for speakers/discourses
[3] Only available for graph annotations
[4] Only available for acoustic attributes
Filters¶
Filter type | Code | Notes
---|---|---
Equal | c.phone.label == 'aa' |
Not equal | c.phone.label != 'aa' |
Greater than | c.phone.begin > 0 |
Greater than or equal | c.phone.begin >= 0 |
Less than | c.phone.end < 10 |
Less than or equal | c.phone.end <= 10 |
In | c.phone.label.in_(['aa','ae']) | in_ can also take a query
Not in | c.phone.label.not_in_(['aa']) | not_in_ can also take a query
Is null | c.phone.label == None |
Is not null | c.phone.label != None |
Regular expression match | c.phone.label.regex('a,') |
In subset | c.phone.subset == 'syllabic' |
Not in subset | c.phone.subset != 'syllabic' |
Precedes pause | c.word.precedes_pause == True | Only available for graph annotations
Does not precede pause | c.word.precedes_pause == False | Only available for graph annotations
Follows pause | c.word.follows_pause == True | Only available for graph annotations
Does not follow pause | c.word.follows_pause == False | Only available for graph annotations
Right aligned | c.phone.end == c.phone.word.end | Only available for graph annotations
Not right aligned | c.phone.end != c.phone.word.end | Only available for graph annotations
Left aligned | c.phone.begin == c.phone.word.begin | Only available for graph annotations
Not left aligned | c.phone.begin != c.phone.word.begin | Only available for graph annotations
Developer documentation¶
This section of the documentation is devoted to explaining implementation details of PolyglotDB. In large part this is currently a brain dump of Michael McAuliffe to hopefully allow for easier implementation of new features in the future.
The overarching structure of PolyglotDB is based around two database technologies: Neo4j and InfluxDB. Both of these database systems are devoted to modelling, storing, and querying specific types of data, namely, graph and time series data. Because speech data can be modelled in each of these ways (see Annotation Graphs for more details on representing annotations as graphs), using these databases leverages their performance and scalability for increasing PolyglotDB’s ability to deal with large speech corpora. Please see the InterSpeech proceedings paper for more information on the high level motivations of PolyglotDB.
Contents:
Neo4j implementation¶
This section details how PolyglotDB saves and structures data within Neo4j.
Note
This section assumes some familiarity with the Cypher query language and Neo4j, see the Cypher documentation for more details and reference.
Annotation Graphs¶
The basic formalism in PolyglotDB for modelling and storing transcripts is that of annotation graphs, originally proposed by Bird & Liberman (1999). In this formalism, transcripts are directed acyclic graphs. Nodes in the graph represent time points in the audio file and edges between the nodes represent annotations (such as phones, words, utterances, etc). This style of graph is illustrated below.
Note
Annotation is a broad term in this conceptual framework: it covers anything that can be defined by a label and begin/end time points. Single time point annotations (something like ToBI) are not strictly covered in this framework. Annotation as phoneticians typically think of it (i.e., extra information annotated by the researcher, like VOT or categorization by a listener) is modelled in PolyglotDB as subannotations (annotations of annotations) and is handled differently from the principal annotations, which are linguistic units.

Annotation graph representation of the word “polyglot” spoken over the course of one second.
The annotation graph framework is a conceptual way to model linear time signals and interval annotations, independent of a specific implementation. The annotation graph formalism has been implemented in other speech corpus management systems, in either SQL (LaBB-CAT) or custom file-based storage systems (EMU-SDMS). One of the principal goals in developing PolyglotDB was to be scalable to large datasets (potentially hundreds of hours) and still have good performance in querying the database. Initial implementations in SQL were not as fast as I would have liked, so Neo4j was selected as the storage backend. Neo4j is a NoSQL graph database where nodes and edges are fundamental elements in both the storage and Cypher query language. Given its active development and use in enterprise systems, it is the best choice for meeting the scalability and performance considerations.
However, Neo4j prioritizes nodes far more than edges (see Neo4j’s introductory materials for more details). In general, their use case is primarily something like IMDB, for instance. In such a case, you’ll have nodes for movies, shows, actors, directors, crew, etc, each with different labels associated with them. Edges represent relationships like “acted in”, or “directed”. The nodes have the majority of the properties, like names, dates of birth, gender, etc, and relationships are sparse/empty. The annotation graph formalism has nodes being relatively sparse (just time point), and the edges containing the properties (label, token properties, speaker information, etc). Neo4j uses indices to speed up queries, but these are focused on node properties rather than edge properties (or were at the beginning of development). As such, the storage model was modified from the annotation graph formalism into something more node-based, seen below.

PolyglotDB implementation of the annotation graph formalism for a production of the word “polyglot”.
Rather than time points as nodes, the actual annotations are nodes, and relationships between them are either hierarchical (i.e., the phone P is contained by the syllable P.AA1, represented by solid lines in the figure above) or precedence (the phone P precedes the phone AA1, represented by dashed lines in the figure above). Each node has properties for begin and end time points, as well as any arbitrary encoded information (i.e., part of speech tags). Each node of a given annotation type (word, syllable, phone) is labelled as such in Neo4j, speeding up queries.
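As an illustration of this storage model, the figure's contents can be mocked up in plain Python, with annotations as nodes and precedence/containment as explicit links (a conceptual sketch only, not how Neo4j actually stores data):

```python
# Each annotation token is a node with begin/end; edges are stored
# as (source, relation, target) triples, as in the figure.
nodes = {
    'p1': {'type': 'phone', 'label': 'P', 'begin': 0.15, 'end': 0.25},
    'p2': {'type': 'phone', 'label': 'AA1', 'begin': 0.25, 'end': 0.40},
    's1': {'type': 'syllable', 'label': 'P.AA1', 'begin': 0.15, 'end': 0.40},
}
edges = [
    ('p1', 'precedes', 'p2'),        # dashed lines in the figure
    ('p1', 'contained_by', 's1'),    # solid lines in the figure
    ('p2', 'contained_by', 's1'),
]

# A query like "phones contained by syllable P.AA1" becomes edge traversal:
contained = [src for src, rel, tgt in edges
             if rel == 'contained_by' and nodes[tgt]['label'] == 'P.AA1']
print(sorted(contained))  # ['p1', 'p2']
```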
All interaction with corpus data is done through the CorpusContext class. When this class is instantiated and then used as a context manager, it connects to both a Neo4j database and an InfluxDB database (described in more detail in InfluxDB implementation). When a corpus is imported (see Importing corpora), nodes and edges are created in the Neo4j database, along with appropriate labels on the nodes to organize and aid querying. By default, from a simple import of force-aligned TextGrids, the full list of node types in a fresh PolyglotDB Neo4j database is as follows:
:phone
:phone_type
:word
:word_type
:Speaker
:Discourse
:speech
In the previous figure (Fig. 2), for instance, all green nodes would have the Neo4j label :phone, all orange nodes would have the Neo4j label :syllable, and the purple node would have the Neo4j label :word. All nodes would also have the Neo4j label :speech. Each of the nodes in the figure would additionally have links to other aspects of the graph. The word node would have a link to a node with the Neo4j label :word_type, the syllable nodes would each link to a node with the Neo4j label :syllable_type, and the phone nodes would link to nodes with the Neo4j label :phone_type. These type nodes contain any type information that is not dependent on the particular production. Each node in the figure would also link to a :Speaker node for whoever produced the word, as well as a :Discourse node for whichever file it was recorded in.
Note
A special tag for the corpus name is added to every node in the corpus, in case multiple corpora are imported in the same database. For instance, if the CorpusContext is instantiated as CorpusContext('buckeye'), then any imported annotations would have a Neo4j label of :buckeye associated with them. If another CorpusContext was instantiated as CorpusContext('not_buckeye'), then any queries for annotations in the buckeye corpus would not be found, as it would be looking only at annotations tagged with the Neo4j label :not_buckeye.
The following node types can further be added to via enrichment (see Enrichment):
:pause
:utterance
:utterance_type (never really used)
:syllable
:syllable_type
In addition to node labels, Neo4j and Cypher use relationship labels on edges in queries. In the above example, all solid lines would have the label :contained_by, as the lower annotation is contained by the higher one (see Corpus hierarchy representation below for details of the hierarchy implementation). All the dashed lines would have the Neo4j label :precedes, as the previous annotation precedes the following one.
The following is a list of all the relationship types in the Neo4j database:
:is_a (relation between type and token nodes)
:precedes (precedence relation)
:precedes_pause (precedence relation for pauses when encoded)
:contained_by (hierarchical relation)
:spoken_by (relation between tokens and speakers)
:spoken_in (relation between tokens and discourses)
:speaks_in (relation between speakers and discourses)
:annotates (relation between annotations and subannotations)
Corpus hierarchy representation¶
Neo4j is a schemaless database: each node can have arbitrary information added to it without requiring that information on any other node. However, enforcing a bit of a schema on Neo4j is helpful for dealing with corpora, which are more structured than an arbitrary graph. For a user, knowing that a typo has produced a property name that doesn't exist on any annotation they're querying is useful. Additionally, knowing the type of the data stored (string, boolean, float, etc.) allows for restricting certain operations (for instance, calculating a by-speaker z-score is only relevant for numeric properties). As such, a schema in the form of a Hierarchy is explicitly defined and used in PolyglotDB.
Each CorpusContext has a polyglotdb.structure.Hierarchy object which stores metadata about the corpus. Hierarchy objects are basically schemas for the Neo4j database, telling the user what information annotations of a given type should have (i.e., do word annotations have frequency as a type property? part_of_speech as a token property?). Additionally, it also gives the strict hierarchy between levels of annotation. A freshly imported corpus with just words and phones will have a simple hierarchy: phones are contained by words. Enrichment can add more levels to the hierarchy for syllables and utterances. All aspects of the Hierarchy object are stored in the Neo4j database and synced with the CorpusContext object.
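A minimal sketch of what such a schema buys you (an invented mini-structure for illustration, not the real polyglotdb.structure.Hierarchy API):

```python
# Invented mini-schema: annotation type -> known properties and their types
hierarchy = {
    'word': {'label': str, 'begin': float, 'end': float, 'part_of_speech': str},
    'phone': {'label': str, 'begin': float, 'end': float},
}
containment = {'phone': 'word'}  # phones are contained by words

def check_property(annotation_type, prop):
    """Catch typos before a query is ever sent to Neo4j."""
    properties = hierarchy[annotation_type]
    if prop not in properties:
        raise KeyError(f'{annotation_type!r} has no property {prop!r}')
    return properties[prop]

print(check_property('word', 'part_of_speech'))  # <class 'str'>
```

With type information available, an operation like z-scoring can be refused up front for non-numeric properties instead of failing inside the database.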
In the Neo4j graph, there is a Corpus root node, with all encoded annotations linked as they would be in an annotation graph for a given discourse (i.e., Utterance -> Word -> Syllable -> Phone in orange below). These nodes contain a list of properties that will be found on each node in the annotation graphs (i.e., label, begin, end), along with what type of data each property is (i.e., string, numeric, boolean, etc.). There is also a property for subsets that is a list of all the token subsets of that annotation type. Each of these annotations is linked to a type node (in blue below) that has a list of properties belonging to the type (i.e., in the figure below, word types have label, transcription and frequency).

In addition, if subannotations are encoded, they will be represented in the hierarchy graph as well (i.e., Burst, Closure, and Intonation in yellow above), along with all the properties they contain. Speaker and Discourse properties are encoded in the graph hierarchy object, as are any acoustics that have been encoded and stored in the InfluxDB portion of the database (see Saving acoustics for details on encoding acoustic measures).
Query implementation¶
One of the primary functions of PolyglotDB is querying information in the Neo4j databases. The fundamental job of the polyglotdb.query module is to convert Python syntax and objects (referred to as PolyglotDB queries below) into Cypher strings that extract the correct elements from the Neo4j database. There is a fair bit of “magic” behind the scenes, as much of this conversion is done by hijacking built-in Python functionality. For instance, c.phone.label == 'AA1' does not actually return a boolean, but rather a Clause object. This Clause object has functionality for generating a Cypher string like node_phone.label = 'AA1', which would then be slotted into the appropriate place in the larger Cypher query. A larger Query object has many subobjects, such as filters and columns to return, and uses these subobjects to fill a query template and generate the final Cypher query. This section attempts to break down the individual pieces that get added together to create the final Cypher query.
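The operator-hijacking idea can be shown with a toy version (simplified invented classes; the real implementations live in the polyglotdb.query submodules):

```python
class ToyAttribute:
    """Stands in for something like c.speaker.gender."""
    def __init__(self, node_alias, name):
        self.node_alias = node_alias
        self.name = name

    def for_cypher(self):
        return f'{self.node_alias}.{self.name}'

    # Hijack == so comparisons build clause objects instead of booleans
    def __eq__(self, value):
        return ToyEqualClause(self, value)

class ToyEqualClause:
    def __init__(self, attribute, value):
        self.attribute = attribute
        self.value = value

    def for_cypher(self):
        return f'{self.attribute.for_cypher()} = "{self.value}"'

gender = ToyAttribute('node_Speaker', 'gender')
clause = (gender == 'male')    # a clause object, not a bool
print(clause.for_cypher())     # node_Speaker.gender = "male"
```

The generated fragment is what gets slotted into the WHERE portion of the larger Cypher query.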
There are four principal types of queries currently implemented in PolyglotDB, based on the information desired (annotation, lexicon, speaker, and discourse queries). Annotation queries are the most common, as they search over the produced annotation tokens in discourses. For instance, finding all stops in a particular environment and returning relevant information is going to be an annotation query, with each matching token having its own result. Lexicon queries are queries over annotation types rather than tokens. Speaker and discourse queries are those over their respective entities.
Queries are constructed as Python objects (descended from polyglotdb.query.base.query.BaseQuery) and are generated from methods on a CorpusContext object, as below. Each of the four types of queries has its own submodule within the polyglotdb.query module.
Data type | CorpusContext method | Query class
---|---|---
Annotations | polyglotdb.corpus.CorpusContext.query_graph() | polyglotdb.query.annotations.query.GraphQuery
Lexicon | polyglotdb.corpus.CorpusContext.query_lexicon() | polyglotdb.query.lexicon.query.LexiconQuery
Speaker | polyglotdb.corpus.CorpusContext.query_speakers() | polyglotdb.query.speaker.query.SpeakerQuery
Discourse | polyglotdb.corpus.CorpusContext.query_discourses() | polyglotdb.query.discourse.query.DiscourseQuery
The main structure of each of the query submodules is as follows:
The following walk through of the basic components of a query submodule will use a speaker query for illustration purposes.
In this example, we’ll be trying to extract the list of male speakers (with the assumption that speakers have been encoded
for gender and that the corpus is appropriately named corpus
). In Cypher, this query would be:
MATCH (node_Speaker:Speaker:corpus)
WHERE node_Speaker.gender = "male"
RETURN node_Speaker.name AS speaker_name
This query in polyglotdb would be:
with CorpusContext('corpus') as c:
    q = c.query_speakers()  # Generate SpeakerQuery object
    q = q.filter(c.speaker.gender == 'male')  # Filter to just the speakers that have `gender` set to "male"
    q = q.columns(c.speaker.name.column_name('speaker_name'))  # Return just the speaker name (with the `speaker_name` alias)
    results = q.all()
The attributes.py file contains the definitions of classes corresponding to nodes and attributes in the Neo4j database. These classes have code for how to represent them in Cypher queries and how properties are extracted. As an example of a somewhat simple case, consider polyglotdb.query.speaker.attributes.SpeakerNode and polyglotdb.query.speaker.attributes.SpeakerAttribute. A SpeakerNode object will have an alias in the Cypher query of node_Speaker and an initial look-up definition for the query as follows:
(node_Speaker:Speaker:corpus)
The polyglotdb.query.speaker.attributes.SpeakerAttribute class is used for the gender and name attributes referenced in the query. These are created through calling c.speaker.gender (the __getattr__ method of both the CorpusContext class and the SpeakerNode class is overwritten to allow this kind of access). Speaker attributes use their node’s alias to construct how they are referenced in Cypher, i.e. for c.speaker.gender:
node_Speaker.gender
When the column_name function is called, an output alias is used when constructing RETURN statements in Cypher:
node_Speaker.name AS speaker_name
The crucial part of a query is, of course, the ability to filter. Filters are constructed using Python operators, such as == or !=, or functions replicating other operators, like .in_(). Operators on attributes return classes from the elements.py file of a query submodule. For instance, a polyglotdb.query.base.elements.EqualClauseElement is returned when == is used (as in the above query), and this object handles how to convert the operator into Cypher; in the above case of c.speaker.gender == 'male', it will generate the following Cypher code when requested:
node_Speaker.gender = "male"
The query.py file contains the definition of the Query class, descended from polyglotdb.query.base.query.BaseQuery. The filter and columns methods allow ClauseElements and Attributes to be added for the construction of the Cypher query. When all is called (or cypher, which does the actual creation of the Cypher string), the first step is to inspect the elements and attributes to see which nodes are necessary for the query. The definitions of each of these nodes are then concatenated into a list for the MATCH part of the Cypher query, giving the following for our example:
MATCH (node_Speaker:Speaker:corpus)
Next, the filtering elements are constructed into a WHERE clause (separated by AND when there is more than one element), giving the following for our example:
WHERE node_Speaker.gender = "male"
And finally, the RETURN statement is constructed from the list of columns specified (along with their specified column names):
RETURN node_Speaker.name AS speaker_name
If columns are not specified in the query, then a Python object containing all the information of the node is returned, according to the classes in the models.py file of the submodule. For our speaker query, if the columns are omitted, the returned results will have all speaker properties encoded in the corpus. In terms of implementation, the following query in polyglotdb
with CorpusContext('corpus') as c:
    q = c.query_speakers()  # Generate SpeakerQuery object
    q = q.filter(c.speaker.gender == 'male')  # Filter to just the speakers that have `gender` set to "male"
    results = q.all()
    print(results[0].name)  # Get the name of the first result
will generate the following Cypher query:
MATCH (node_Speaker:Speaker:corpus)
WHERE node_Speaker.gender = "male"
RETURN node_Speaker
Annotation queries¶
Annotation queries are the most complicated kind due to all of the relationships linking nodes. Where speaker, discourse and lexicon queries are really just lists of nodes with little linkage between them, annotation queries leverage the relationships in the annotation graph quite a bit.
Basic query¶
Given a relatively basic query like the following:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.label == 'some_word')
    q = q.columns(c.word.label.column_name('word'),
                  c.word.transcription.column_name('transcription'),
                  c.word.begin.column_name('begin'),
                  c.word.end.column_name('end'),
                  c.word.duration.column_name('duration'))
    results = q.all()
Would give a Cypher query as follows:
MATCH (node_word:word:corpus)-[:is_a]->(node_word_type:word_type:corpus)
WHERE node_word_type.label = "some_word"
RETURN node_word_type.label AS word, node_word_type.transcription AS transcription,
node_word.begin AS begin, node_word.end AS end,
node_word.end - node_word.begin AS duration
The process of converting the Python code into the Cypher query is similar to the above Speaker example, but each step has
some complications. To begin with, rather than defining a single node, the annotation node definition contains two nodes, a word token
node and a word type node linked by the is_a
relationship.
The use of type properties allows for a more efficient look up on the label
property (for convenience and debugging, word
tokens also have a label
property). The Attribute objects will look up which properties are type vs. token properties when constructing
the Cypher statement.
Additionally, duration
is a special property that is calculated based off of the token’s begin
and end
properties at query time. This way, if the time points are updated, the duration remains accurate. In terms of efficiency,
subtraction at query time is not costly, and it saves the space of storing an additional property. Duration can still be
used in filtering, i.e.:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.duration > 0.5)
    q = q.columns(c.word.label.column_name('word'),
                  c.word.begin.column_name('begin'),
                  c.word.end.column_name('end'))
    results = q.all()
which would give the Cypher query:
MATCH (node_word:word:corpus)-[:is_a]->(node_word_type:word_type:corpus)
WHERE node_word.end - node_word.begin > 0.5
RETURN node_word_type.label AS word, node_word.begin AS begin,
       node_word.end AS end
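Both routing rules above (type vs. token property look-up, and the on-the-fly duration computation) can be sketched as follows; the property sets here are assumptions for illustration, not the actual Attribute implementation:

```python
# Hypothetical sketch: route a property name to the type node or the token
# node when building a Cypher expression. The set of type properties is an
# assumption for illustration.
TYPE_PROPERTIES = {'transcription', 'frequency'}

def cypher_property(token_alias, type_alias, prop):
    if prop == 'duration':
        # duration is computed from the token's begin/end at query time
        return '{0}.end - {0}.begin'.format(token_alias)
    if prop in TYPE_PROPERTIES:
        return '{}.{}'.format(type_alias, prop)
    return '{}.{}'.format(token_alias, prop)
```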
Precedence queries¶
Properties of the preceding annotation can be queried via precedence queries. A query like the following:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label == 'AE')
    q = q.filter(c.phone.previous.label == 'K')
    results = q.all()
will result in the following Cypher query:
MATCH (node_phone:phone:corpus)-[:is_a]->(node_phone_type:phone_type:corpus),
(node_phone)<-[:precedes]-(prev_1_node_phone:phone:corpus)-[:is_a]->(prev_1_node_phone_type:phone_type:corpus)
WHERE node_phone_type.label = "AE"
AND prev_1_node_phone_type.label = "K"
RETURN node_phone, node_phone_type, prev_1_node_phone, prev_1_node_phone_type
Hierarchical queries¶
Hierarchical queries are those that reference some annotation higher or lower than the originally specified annotation. For instance, a search on phones that also includes information about the containing word can be written as follows:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label == 'AE')
    q = q.filter(c.phone.word.label == 'cat')
    results = q.all()
This will result in a Cypher query as follows:
MATCH (node_phone:phone:corpus)-[:is_a]->(node_phone_type:phone_type:corpus),
(node_phone_word:word:corpus)-[:is_a]->(node_phone_word_type:word_type:corpus),
(node_phone)-[:contained_by]->(node_phone_word)
WHERE node_phone_type.label = "AE"
AND node_phone_word_type.label = "cat"
RETURN node_phone, node_phone_type, node_phone_word, node_phone_word_type
Spoken queries¶
Queries can include aspects of speaker and discourse as well. A query like the following:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.speaker.name == 'some_speaker')
    q = q.filter(c.phone.discourse.name == 'some_discourse')
    results = q.all()
Will result in the following Cypher query:
MATCH (node_phone:phone:corpus)-[:is_a]->(node_phone_type:phone_type:corpus),
(node_phone)-[:spoken_by]->(node_phone_Speaker:Speaker:corpus),
(node_phone)-[:spoken_in]->(node_phone_Discourse:Discourse:corpus)
WHERE node_phone_Speaker.name = "some_speaker"
AND node_phone_Discourse.name = "some_discourse"
RETURN node_phone, node_phone_type
Annotation query optimization¶
There are several aspects to query optimization in polyglotdb. The first is that rather than polyglotdb.query.annotations.query.GraphQuery
objects, the default objects returned are actually polyglotdb.query.annotations.query.SplitQuery
objects. These objects split a query by either speakers or discourses and create a smaller GraphQuery
for each speaker/discourse.
The results object that gets returned then iterates over the results objects returned by each of the GraphQuery
objects.
In general, splitting functionality by speakers/discourses (and sometimes both) is the main way that Cypher queries are kept performant in polyglotdb.
Aspects such as enriching syllables and utterances are quite complicated and can result in out of memory errors if the splits are
too big (despite the recommended optimizations by Neo4j, such as using PERIODIC COMMIT
to split the transactions).
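The split-and-iterate strategy can be illustrated with a minimal sketch (a hypothetical helper, not the actual SplitQuery class): run one smaller query per speaker and chain the result sets.

```python
from itertools import chain

# Illustrative sketch: split one large query into per-speaker queries and
# iterate over their combined results, keeping each transaction small.
def split_by_speaker(speakers, run_query):
    """run_query: callable taking a speaker and returning an iterable of rows."""
    return chain.from_iterable(run_query(speaker) for speaker in speakers)

rows = list(split_by_speaker(['s01', 's02'], lambda s: [(s, 'result')]))
```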
Lexicon queries¶
Note
While the name of this type of query is lexicon
, it is really a query over types, regardless of their linguistic
type. Phone, syllable, and word types are all queried via this interface. Utterance types are not really used
for anything other than consistency with the other annotations, as the space of possible utterances is essentially infinite,
while the spaces of phones, syllables, and words are more constrained, making type properties more useful for them.
Lexicon queries are more efficient queries of annotation types than the annotation queries above. Assuming word types have been enriched with a frequency property, a polyglotdb query like:
with CorpusContext('corpus') as c:
    q = c.query_lexicon(c.word_lexicon) # Generate LexiconQuery object
    q = q.filter(c.word_lexicon.frequency > 100) # Subset of word types based on their frequency
    results = q.all()
Would result in a Cypher query like:
MATCH (node_word_type:word_type:corpus)
WHERE node_word_type.frequency > 100
RETURN node_word_type
Speaker/discourse queries¶
Speaker and discourse queries are relatively straightforward with only a few special annotation node types or attribute types. See Query implementation for an example using a SpeakerQuery.
The special speaker attribute is discourses
which will return a list of the discourses that the speaker spoke in,
and conversely, the speakers
attribute of DiscourseNode objects will return a list of speakers who spoke in that discourse.
A polyglotdb query like the following:
with CorpusContext('corpus') as c:
    q = c.query_speakers() # Generate SpeakerQuery object
    q = q.filter(c.speaker.gender == 'male') # Filter to just the speakers that have `gender` set to "male"
    q = q.columns(c.speaker.discourses.name.column_name('discourses')) # Return the names of the discourses each speaker spoke in (with the `discourses` alias)
    results = q.all()
will generate the following Cypher query:
MATCH (node_Speaker:Speaker:corpus)
WHERE node_Speaker.gender = "male"
WITH node_Speaker
MATCH (node_Speaker)-[speaks:speaks_in]->(node_Speaker_Discourse:Discourse:corpus)
WITH node_Speaker, collect(node_Speaker_Discourse) AS node_Speaker_Discourse
RETURN extract(n in node_Speaker_Discourse|n.name) AS discourses
Aggregation functions¶
In the simplest case, aggregation queries give a way to get an aggregate over the full query. For instance, given the following PolyglotDB query:
from polyglotdb.query.base.func import Average

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    result = q.aggregate(Average(c.phone.duration))
this will generate a Cypher query like the following:
MATCH (node_phone:phone:corpus)-[:is_a]->(type_node_phone:phone_type:corpus)
WHERE node_phone.label = "aa"
RETURN avg(node_phone.end - node_phone.begin) AS average_duration
In this case, there will be one result returned: the average duration of all phones in the query. If, however, you wanted
to get the average duration per phone type (i.e., for each of aa
, iy
, ih
, and so on), then aggregation functions
can be combined with group_by
clauses:
from polyglotdb.query.base.func import Average

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone).filter(c.phone.label.in_(['aa', 'iy', 'ih']))
    results = q.group_by(c.phone.label.column_name('label')).aggregate(Average(c.phone.duration))

This will generate the following Cypher query:
MATCH (node_phone:phone:corpus)-[:is_a]->(type_node_phone:phone_type:corpus)
WHERE node_phone.label IN ["aa", "iy", "ih"]
RETURN node_phone.label AS label, avg(node_phone.end - node_phone.begin) AS average_duration
Note
See Aggregate functions for more details on the aggregation functions available.
InfluxDB implementation¶
This section details how PolyglotDB saves and structures data within InfluxDB. InfluxDB is a NoSQL time series database, with a SQL-like query language.
Note
This section assumes a bit of familiarity with the InfluxDB query language, which is largely based on SQL. See the InfluxDB documentation for more details and reference to other aspects of InfluxDB.
InfluxDB Schema¶
Each measurement encoded (i.e., pitch, intensity, formants) will have a separate table in InfluxDB, similar to SQL.
When querying, the query will select columns from a table (i.e., select * from "pitch"
). Each row in InfluxDB
minimally has a time
field, as it is a time series database. In addition, each row will have queryable fields and tags, in InfluxDB parlance.
Tags are indexed and can function like separate tables, speeding up queries, while fields are simply values that are not indexed.
All InfluxDB tables will have three tags (these create different indices for the database and speed up queries) for
speaker
, discourse
, and channel
. The union of discourse
(i.e., file name) and channel
(usually 0, particularly for mono sound files)
along with the time
in seconds will always give a unique acoustic time point, and indexing by speaker
is crucial for PolyglotDB’s algorithms.
Note
The time resolution for PolyglotDB is at the millisecond level. In general, I think having measurements every 10ms is a balanced time resolution for acoustic measures. Increasing the time resolution will also increase the processing time for PolyglotDB algorithms, as well as the database size. Time resolution is generally a property of the analyses done, so greater time resolution than 10 ms is possible, but not greater than 1 ms, as millisecond time resolution is hardcoded in the current code. Any time point will be rounded/truncated to the nearest millisecond.
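The truncation described above can be sketched as follows (a hypothetical helper, not the actual storage code):

```python
# Hypothetical sketch: coerce a time point in seconds to the millisecond
# resolution that PolyglotDB's storage assumes.
def to_milliseconds(time_in_seconds):
    return round(time_in_seconds * 1000) / 1000

to_milliseconds(0.123456)  # -> 0.123
```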
In addition to these tags, there are several queryable fields which are always present in addition to the measurement fields.
First, the phone
for the time point is saved to allow for efficient aggregation across phones. Second, the utterance_id
for the time point is also saved. The utterance_id
is used for general querying, where each utterance’s track for the
requested acoustic property is queried once and then cached for any further results to use without needing to query the
InfluxDB again. For instance, a query on phone formant tracks might return 2000 phones. Without the utterance_id
, there
would be 2000 look-ups for formant tracks (each InfluxDB query would take about 0.15 seconds), but with the utterance-based caching,
the number of hits to the InfluxDB database would be a fraction of that (though each query itself would take a little bit longer).
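The caching idea can be illustrated with a small sketch (not the actual implementation; the fetch callable stands in for an InfluxDB query):

```python
# Illustrative sketch: cache each utterance's track so repeated results from
# the same utterance only hit InfluxDB once.
class UtteranceTrackCache:
    def __init__(self, fetch_track):
        self._fetch = fetch_track  # callable: utterance_id -> track
        self._cache = {}
        self.db_hits = 0

    def get(self, utterance_id):
        if utterance_id not in self._cache:
            self._cache[utterance_id] = self._fetch(utterance_id)
            self.db_hits += 1  # one database query per utterance
        return self._cache[utterance_id]

cache = UtteranceTrackCache(lambda utt_id: [(0.0, 100.0), (0.01, 101.0)])
for _ in range(2000):       # e.g., 2000 phones from the same utterance
    cache.get('utt_1')
print(cache.db_hits)        # 1 database hit instead of 2000
```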
Note
For performance reasons internal to InfluxDB, phone
and utterance_id
are fields
rather than tags
, because
the cross of them with speaker
, discourse
, and channel
would lead to an extremely large cross of possible tag
combinations. This mix of tags and fields has been found to be the most performant.
Finally, there are the actual measurements that are saved. Each acoustic track (i.e., pitch
, formants
, intensity
)
can have multiple measurements. For instance, a formants
track can have F1
, F2
, F3
, B1
, B2
, and B3
,
which are all stored together on the same time point and accessed at the same time. These measures are kept in the corpus
hierarchy in Neo4j. Each measurement track (i.e. pitch
) will be a node linked to the corpus (see the example in Corpus hierarchy representation).
That node will have each property listed along with its data type (i.e., F0
is a float
).
Optimizations for acoustic measures¶
PolyglotDB has default functions for generating pitch
, intensity
, and formants
tracks (see Reference functions for specific examples
and Saving acoustics for more details on how they are implemented). For implementing
future built-in acoustic track analysis functions, one realm of optimization lies in the differently sampled files that
PolyglotDB generates. On import, three files are generated per discourse, at 1,200Hz, 11,000Hz, and 16,000Hz. The intended
purpose of these files is acoustic analysis of different kinds of segments/measurements. The file at 1,200Hz is ideal
for pitch analysis (maximum pitch of 600Hz), and the file at 11,000Hz is ideal for formant analysis (maximum formant frequency
of 5,500Hz). The file at 16,000Hz is intended for consonantal analysis (i.e., fricative spectral analysis) or any other
algorithm requiring higher frequency information. The reason these three files are generated is that analysis functions
generally include resampling to these frequencies as part of the analysis, so performing it ahead of time can speed up
the analysis. Some programs also don't necessarily include resampling (i.e., pitch estimation in REAPER), so using the
appropriately sampled file can lead to massive speed-ups.
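The sample rates above follow directly from the Nyquist criterion: a signal sampled at rate r can only represent frequencies up to r/2. A trivial sketch:

```python
# The Nyquist criterion: to capture frequencies up to max_freq_hz, the
# sample rate must be at least twice that frequency.
def required_sample_rate(max_freq_hz):
    return 2 * max_freq_hz

required_sample_rate(600)   # pitch ceiling of 600Hz -> the 1,200Hz file
required_sample_rate(5500)  # formant ceiling of 5,500Hz -> the 11,000Hz file
```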
Query implementation¶
Given a PolyglotDB query like the following:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.label == 'some_word')
    q = q.columns(c.word.label.column_name('word'), c.word.pitch.track)
    results = q.all()
Once the Cypher query completes and returns results for a matching word, that information is used to create an InfluxDB query. The inclusion of an acoustic column like the pitch track also ensures that necessary information like the utterance ID and begin and end time points of the word are returned. The above query would result in several queries like the following being run:
SELECT "time", "F0" from "pitch"
WHERE "discourse" = 'some_discourse'
AND "utterance_id" = 'some_utterance_id'
AND "speaker" = 'some_speaker'
The above query will get all pitch points for the utterance of the word in question, and create Python objects for the
track (polyglotdb.acoustics.classes.Track
) and each time point (polyglotdb.acoustics.classes.TimePoint
).
With the begin
and end
properties of the word, a slice of the track is added to the output row.
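The final slicing step can be sketched as follows (a hypothetical helper; the real logic works with polyglotdb.acoustics.classes.Track objects):

```python
# Hypothetical sketch: take the slice of a cached utterance track that falls
# within a word's begin/end times.
def slice_track(track, begin, end):
    """track: list of (time, value) pairs sorted by time."""
    return [(t, v) for t, v in track if begin <= t <= end]

utterance_pitch = [(0.00, 100.0), (0.10, 110.0), (0.20, 120.0), (0.30, 130.0)]
word_slice = slice_track(utterance_pitch, 0.10, 0.20)
```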
Aggregation¶
Unlike for aggregation of properties in the Neo4j database (see Aggregation functions), aggregation of acoustic properties occurs in Python rather than being implemented in a query to InfluxDB, for the same performance reasons above. By caching utterance tracks as needed, and then performing aggregation over necessary slices (i.e., words or phones), the overall query is much faster.
Low level implementation¶
Saving acoustics¶
The general pipeline for generating and saving acoustic measures is as follows:
- Acoustic analysis using Conch’s analysis functions
- Format output from Conch into InfluxDB format and fill in any needed information (phone labels)
- Write points to InfluxDB
- Update the Corpus hierarchy with information about acoustic properties
Acoustic analysis is first performed in Conch, a Python package for processing sound files into acoustic and auditory
representations. To do so, segments are created in PolyglotDB through calls to polyglotdb.acoustics.segments.generate_segments()
and related functions. The generated SegmentMapping
object from Conch is an iterable of Segment
objects. Each Segment
minimally
has a path to a sound file, the begin time stamp, the end time stamp, and the channel. With these four pieces of information,
the waveform signal can be extracted and acoustic analysis can be performed. Segment
objects can also have other
properties associated with them, so that a SegmentMapping
can be grouped into sensible bits of analysis (via SegmentMapping.grouped_mapping()).
This is done in PolyglotDB to split analysis by speakers, for instance.
A SegmentMapping
(or the grouped mappings returned by grouped_mapping
) can then be passed to analyze_segments
, which in addition
to a SegmentMapping
takes a callable function that takes the minimal set of arguments above (file path, begin, end, and channel)
and returns some sort of track or point measure from the signal segment. See Reference functions below for a list of generator functions that return
a callable to be used with analyze_segments
. The analyze_segments
function uses multiprocessing to apply the callable
function to each segment, allowing for speed-ups proportional to the number of available cores on the machine.
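The shape of this interface can be sketched as follows (a serial stand-in for the multiprocessing version; names are illustrative):

```python
# Illustrative serial sketch of analyze_segments: apply a measurement
# callable to each (file_path, begin, end, channel) segment.
def analyze_segments_serial(segments, analysis_function):
    return [analysis_function(*segment) for segment in segments]

# e.g., a trivial "measure" that just returns each segment's duration
durations = analyze_segments_serial(
    [('a.wav', 0.0, 0.5, 0), ('a.wav', 0.5, 1.25, 0)],
    lambda path, begin, end, channel: end - begin)
```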
Once the Conch analysis function completes, the tracks are saved via polyglotdb.corpus.AudioContext.save_acoustic_tracks()
.
In addition to the discourse
, speaker
, channel
, and utterance_id
, phone
label information is also added to each time
point’s measurements. These points are then saved using the write_points
function of the InfluxDBClient
, returned
from the acoustic_client()
function.
Reference functions¶
Hard-coded functions for saving acoustics are:
polyglotdb.acoustics.formants.base.analyze_formant_tracks()
polyglotdb.acoustics.intensity.analyze_intensity()
polyglotdb.acoustics.other.analyze_track_script()
polyglotdb.acoustics.pitch.base.analyze_pitch()
polyglotdb.acoustics.vot.base.analyze_vot()
Additionally, point measure acoustics analysis functions that don’t involve InfluxDB (point measures are saved as Neo4j properties):
polyglotdb.acoustics.formants.base.analyze_formant_points()
polyglotdb.acoustics.other.analyze_script()
Generator functions for Conch analysis:
polyglotdb.acoustics.formants.helper.generate_variable_formants_point_function()
polyglotdb.acoustics.formants.helper.generate_formants_point_function()
polyglotdb.acoustics.formants.helper.generate_base_formants_function()
polyglotdb.acoustics.intensity.generate_base_intensity_function()
polyglotdb.acoustics.other.generate_praat_script_function()
polyglotdb.acoustics.pitch.helper.generate_pitch_function()
Querying acoustics¶
In general, the pipeline for querying is as follows:
- Construct InfluxDB query string from function arguments
- Pass this query string to an
InfluxDBClient
- Iterate over results and construct a
polyglotdb.acoustics.classes.Track
object
All audio functions, and hence all interfacing with InfluxDB, are handled through the polyglotdb.corpus.AudioContext
parent class of the CorpusContext. Any constructed InfluxDB queries get executed through an InfluxDBClient
, constructed
in the polyglotdb.corpus.AudioContext.acoustic_client()
function, which uses the InfluxDB connection parameters
from the CorpusContext. As an example, see
polyglotdb.corpus.AudioContext.get_utterance_acoustics
. First, an InfluxDB client is constructed, then a query
string is formatted from the relevant arguments passed to get_utterance_acoustics
, and the relevant property names for the acoustic
measure (i.e., F1
, F2
and F3
for formants
, see InfluxDB Schema for more details). This query string is then run via the
query
method of the InfluxDBClient. The results are iterated over and a polyglotdb.acoustics.classes.Track
object
is constructed from the results and then returned.
PolyglotDB I/O¶
In addition to documenting the IO module of PolyglotDB, this document should serve as a guide for implementing future importers for additional formats.
Import pipeline¶
Importing a corpus consists of several steps. First, a file must be
inspected with the relevant inspect function (i.e., inspect_textgrid
or
inspect_buckeye
). These functions generate Parsers for a given format
that allow annotations across many tiers to be coalesced into linguistic
types (word, segments, etc).
As an example, suppose a TextGrid has an interval tier for word labels, an interval tier for phone labels, and tiers for annotating stop information (closure duration, bursts, VOT, etc.). In this case, our parser would want to associate the stop information annotations with the phones (or rather a subset of the phones), rather than treat them as a separate linguistic type.
Following inspection, the file can be imported easily using a CorpusContext’s
load
function. Under the hood, what happens is the Parser object creates
standardized linguistic annotations from the annotations in the text file,
which are then imported into the database.
Currently the following formats are supported:
- Praat TextGrids (Inspect TextGrids)
- TextGrid output from forced aligners (Montreal Forced Aligner, FAVE-align, and Web-MAUS)
- Output from other corpus management software (LaBB-CAT)
- BAS Partitur format
- Corpus-specific formats
Inspect¶
Inspect functions (i.e., inspect_textgrid
) return a guess for
how to parse the annotations present in a given file (or files in a given
directory). They return a parser of the respective type (i.e., TextgridParser
)
with an attribute for the annotation_tiers
detected. For instance, the inspect function for TextGrids
will return a parser with annotation types for each interval and point tier in the TextGrid.
Inspect TextGrids¶
Note
See TextGrid parser for full API of the TextGrid Parser
Consider the following TextGrid with interval tiers for words and phones:

Running the inspect_textgrid
function for this file will return two annotation types. From bottom to top, it will
generate a phone
annotation type and a word
annotation type. Words and phones are two special linguistic
types in PolyglotDB. Other linguistic types can be defined in a TextGrid (i.e., grouping words into utterances or phones into syllables,
though functionality exists for computing both of those automatically), but word and phone tiers must be defined.
Note
Which tier corresponds to the special word
and phone
types is determined via heuristics. The first and most
reliable is whether the tier name contains “word” or “phone”. The second uses cutoffs
for the mean and SD of word and phone durations in the Buckeye corpus to determine whether the intervals are more likely to be
words or phones. For reference, the mean and SD used for words are 0.2465409 and 0.03175723, and those used for phones
are 0.08327773 and 0.03175723.
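The duration heuristic might be sketched as follows (an assumed form: compare the tier's mean interval duration against the Buckeye reference means quoted above; the actual decision rule may differ):

```python
# Sketch of a duration-based tier classifier; the reference means come from
# the Buckeye corpus figures quoted above, but the decision rule here is an
# assumption for illustration.
WORD_MEAN = 0.2465409
PHONE_MEAN = 0.08327773

def guess_tier_type(durations):
    mean = sum(durations) / len(durations)
    if abs(mean - WORD_MEAN) < abs(mean - PHONE_MEAN):
        return 'word'
    return 'phone'
```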
From the above TextGrid, phones will have a label
property (i.e., “Y”), a begin
property (i.e., 0.15),
and an end
property (i.e., 0.25).
Words will have a label
property (i.e., “you”), a begin
property (i.e., 0.15),
and an end
property (i.e., 0.25), as well as a computed transcription
property
made up of all of the included phones based on timings (i.e., “Y.UW1”). Any empty intervals will result in “words”
that have the label
of “<SIL>”, which can then be marked as pauses later in corpus processing
(see Encoding non-speech elements for more details).
Note
The computation of transcription uses the midpoints of phones and whether they are between the begin and end time points of words.
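The midpoint rule can be sketched as follows (a hypothetical helper, not the actual import code):

```python
# Sketch of the midpoint rule: a phone contributes to a word's transcription
# when its midpoint falls between the word's begin and end times.
def compute_transcription(word, phones):
    included = []
    for phone in phones:
        midpoint = (phone['begin'] + phone['end']) / 2
        if word['begin'] <= midpoint <= word['end']:
            included.append(phone['label'])
    return '.'.join(included)

word = {'begin': 0.15, 'end': 0.25}
phones = [{'label': 'Y', 'begin': 0.15, 'end': 0.2},
          {'label': 'UW1', 'begin': 0.2, 'end': 0.25},
          {'label': 'K', 'begin': 0.25, 'end': 0.3}]
print(compute_transcription(word, phones))  # Y.UW1
```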
Inspect forced aligned TextGrids¶
Both the Montreal Forced Aligner and FAVE-aligner generate TextGrids for files in two formats that PolyglotDB can parse. The first format
is for files with a single speaker. These files will have two tiers, one for words (named words
or word
)
and one for phones (named phones
or phone
).
The second format is for files with multiple speakers, where each speaker will have a pair of tiers for words (formatted as Speaker name - words
)
and phones (formatted as Speaker name - phones
).
TextGrids generated from Web-MAUS have a single format with a tier for words (named ORT
), a tier for the canonical
transcription (named KAN
) and a tier for phones (named MAU
). In parsing, just the tiers for words and
phones are used, as the transcription will be generated automatically.
Inspect LaBB-CAT formatted TextGrids¶
The LaBB-CAT system generates force-aligned TextGrids for files in a format that PolyglotDB can parse (though some editing may be
required due to issues in exporting single speakers in LaBB-CAT). As with the other supported aligner output formats,
PolyglotDB looks for word and phone tiers per speaker (or for just a single speaker depending on export options). The
parser will use transcript
to find the word tiers (i.e. Speaker name - transcript
) and segment
to find
the phone tiers (i.e., Speaker name - segment
).
Note
See LaBB-CAT parser for full API of the LaBB-CAT Parser
Inspect Buckeye Corpus¶
The Buckeye Corpus is stored in an idiosyncratic format that has two text files per sound file (i.e., s0101a.wav
), one detailing information
about words (i.e., s0101a.words
) and one detailing information about surface phones (i.e. s0101a.phones
). The PolyglotDB
parser extracts label, begin and end for each phone. Words have type properties for their underlying transcription and
token properties for their part of speech and begin/end.
Note
See Buckeye parser for full API of the Buckeye Parser
Inspect TIMIT Corpus¶
The TIMIT corpus is stored in an idiosyncratic format that has two text files per sound file (i.e., sa1.wav
), one detailing information
about words (i.e., sa1.WRD
) and one detailing information about surface phones (i.e. sa1.PHN
). The PolyglotDB
parser extracts label, begin and end for each phone and each word. Time stamps are converted from samples in the original text files
to seconds for use in PolyglotDB.
Note
See TIMIT parser for full API of the TIMIT Parser
Modifying aspects of parsing¶
Additional properties for linguistic units can be imported as well through the use of extra interval tiers when using a TextGrid parser (see Inspect TextGrids), as in the following TextGrid:

Here we have properties for each word’s part of speech (POS tier) and transcription. The transcription tier will overwrite the automatic calculation of transcription based on contained segments. Each of these properties will be a type property by default (see Neo4j implementation for more details). If these properties are meant to be token-level properties (i.e., the part of speech of a word varies depending on the token produced), this can be changed as follows:
from polyglotdb import CorpusContext
import polyglotdb.io as pgio
parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
parser.annotation_tiers[2].type_property = False # The index of the TextGrid tier for POS is 2
# ... code that uses the parser to import data
If the content of a tier should be ignored (i.e., if it contains information not related to any annotations in particular), then it can be manually marked to be ignored as follows:
from polyglotdb import CorpusContext
import polyglotdb.io as pgio
parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
parser.annotation_tiers[0].ignored = True # Index of 0 if the first tier should be ignored
# ... code that uses the parser to import data
Parsers created through other inspect functions (i.e. Buckeye) can be modified in similar ways, though the TextGrid parser is necessarily the most flexible.
Speaker parsers¶
There are two currently implemented schemes for parsing speaker names from a file path. The first is the Filename Speaker Parser,
which takes a number of characters in the base file name (without the extension) starting either from the left or right. For
instance, the path /path/to/buckeye/s0101a.words
for a Buckeye file would return the speaker s01
using 3 characters from the left.
The other speaker parser is the Directory Speaker Parser, which parses speakers from the directory that contains
the specified path. For instance, given the path /path/to/buckeye/s01/s0101a.words
would return s01
because the containing
folder of the file is named s01
.
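The two schemes can be sketched with standard library path handling (hypothetical helpers; the real parsers are the Filename and Directory speaker parser classes):

```python
import os

# Sketch of the two speaker-parsing schemes described above.
def speaker_from_filename(path, number_of_chars, from_left=True):
    base = os.path.splitext(os.path.basename(path))[0]
    return base[:number_of_chars] if from_left else base[-number_of_chars:]

def speaker_from_directory(path):
    return os.path.basename(os.path.dirname(path))

print(speaker_from_filename('/path/to/buckeye/s0101a.words', 3))    # s01
print(speaker_from_directory('/path/to/buckeye/s01/s0101a.words'))  # s01
```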
Load discourse¶
Loading of discourses is done via a CorpusContext’s load
function:
import polyglotdb.io as pgio
parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')
with CorpusContext(config) as c:
    c.load(parser, '/path/to/textgrid.TextGrid')
Alternatively, load_discourse
can be used with the same arguments.
The load
function automatically determines whether the input path to
be loaded is a single file or a folder, and proceeds accordingly.
Load directory¶
As stated above, a CorpusContext’s load
function will import a directory of
files as well as a single file, but the load_directory
can be explicitly
called as well:
import polyglotdb.io as pgio
parser = pgio.inspect_textgrid('/path/to/textgrids')
with CorpusContext(config) as c:
    c.load_directory(parser, '/path/to/textgrids')
Writing new parsers¶
New parsers can be created through extending either the Base parser class or one of the more specialized
parser classes. There are in general three aspects that need to be implemented. First, the _extensions
property should
be updated to reflect the file extensions that the parser will find and attempt to parse. This property should be an iterable,
even if only one extension is to be used.
Second, the __init__
function should be implemented if anything above and beyond the base class init function is required
(i.e., special speaker parsing).
Finally, the parse_discourse
function should be overwritten to implement some way of populating data on the annotation tiers
from the source data files and ultimately create a DiscourseData
object (an intermediate data representation for straightforward importing
into the Polyglot databases).
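Putting the three steps together, a new parser skeleton might look like this (a structural sketch only; the actual base class and DiscourseData construction live in polyglotdb.io):

```python
# Structural sketch of a new parser; everything other than the three
# documented pieces (_extensions, __init__, parse_discourse) is illustrative.
class MyFormatParser:
    _extensions = ['.myfmt']  # 1. file extensions to find and attempt to parse

    def __init__(self, annotation_tiers):
        # 2. any setup beyond the base class (e.g., special speaker parsing)
        self.annotation_tiers = annotation_tiers

    def parse_discourse(self, path):
        # 3. populate the annotation tiers from the source file and build a
        #    DiscourseData object for import into the database
        raise NotImplementedError
```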
Creating new parsers for forced aligned TextGrids requires simply extending the polyglotdb.io.parsers.aligner.AlignerParser
and overwriting the word_label
and phone_label
class properties. The name
property should also be
set to something descriptive, and the speaker_first
should be set to False if speakers follow word/phone labels in
the TextGrid tiers (i.e., words -Speaker name
rather than Speaker name - words
). See polyglotdb.io.parsers.mfa.MfaParser
,
polyglotdb.io.parsers.fave.FaveParser
, polyglotdb.io.parsers.maus.MausParser
, and
polyglotdb.io.parsers.labbcat.LabbcatParser
for examples.
Exporters¶
Under development.
API Reference¶
Contents:
Corpus API¶
Corpus classes¶
Base corpus¶
class polyglotdb.corpus.BaseContext(*args, **kwargs)[source]¶ Base CorpusContext class. Inherit from this and extend to create more functionality.
Parameters: - *args
If the first argument is not a CorpusConfig object, it is the name of the corpus
- **kwargs
If a CorpusConfig object is not specified, all arguments and keyword arguments are passed to a CorpusConfig object
-
annotation_types
¶ Get a list of all the annotation types in the corpus’s Hierarchy
Returns: - list
Annotation types
-
cypher_safe_name
¶ Escape the corpus name for use in Cypher queries
Returns: - str
Corpus name made safe for Cypher
-
discourses
¶ Gets a list of discourses in the corpus
Returns: - list
Discourse names in the corpus
-
encode_type_subset
(annotation_type, annotation_labels, subset_label)[source]¶ Encode a type subset from labels of annotations
Parameters: - annotation_type : str
Annotation type of labels
- annotation_labels : list
a list of labels of annotations to subset together
- subset_label : str
the label for the subset
-
execute_cypher
(statement, **parameters)[source]¶ Executes a cypher query
Parameters: - statement : str
the cypher statement
- parameters : kwargs
keyword arguments to execute a cypher statement
Returns: BoltStatementResult
Result of Cypher query
-
exists
()[source]¶ Check whether the corpus has a Hierarchy schema in the Neo4j database
Returns: - bool
True if the corpus Hierarchy has been saved to the database
-
hierarchy_path
¶ Get the path to cached hierarchy information
Returns: - str
Path to the cached hierarchy data on disk
-
lowest_annotation
¶ Returns the annotation type that is the lowest in the Hierarchy.
Returns: - str
Lowest annotation type in the Hierarchy
-
phone_name
¶ Gets the phone label
Returns: - str
phone name
-
phones
¶ Get a list of all phone labels in the corpus.
Returns: - list
All phone labels in the corpus
-
query_discourses
()[source]¶ Start a query over discourses in the corpus
Returns: DiscourseQuery
DiscourseQuery object
-
query_graph
(annotation_node)[source]¶ Start a query over the tokens of a specified annotation type (i.e. corpus.word)
Parameters: - annotation_node : polyglotdb.query.attributes.AnnotationNode
The type of annotation to look for in the corpus
Returns: SplitQuery
SplitQuery object
-
query_lexicon
(annotation_node)[source]¶ Start a query over types of a specified annotation type (i.e. corpus.lexicon_word)
Parameters: - annotation_node : polyglotdb.query.attributes.AnnotationNode
The type of annotation to look for in the corpus’s lexicon
Returns: LexiconQuery
LexiconQuery object
-
query_speakers
()[source]¶ Start a query over speakers in the corpus
Returns: SpeakerQuery
SpeakerQuery object
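A sketch contrasting the three query starters above (corpus name and filter values are hypothetical; a running Polyglot database is required to actually execute the queries):

```python
def example_queries(corpus_name='tutorial'):
    """Sketch: token-level, type-level, and speaker queries."""
    # Deferred import: needs PolyglotDB and a running database
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        # Token query: every word token labeled 'the'
        tokens = c.query_graph(c.word).filter(c.word.label == 'the').all()
        # Type query: the matching word type in the lexicon
        types = c.query_lexicon(c.lexicon_word).filter(
            c.lexicon_word.label == 'the').all()
        # Speaker query: all speakers in the corpus
        speakers = c.query_speakers().all()
        return tokens, types, speakers
```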
-
remove_discourse
(name)[source]¶ Remove the nodes and relationships associated with a single discourse in the corpus.
Parameters: - name : str
Name of the discourse to remove
-
reset
(call_back=None, stop_check=None)[source]¶ Reset the Neo4j and InfluxDB databases for a corpus
Parameters: - call_back : callable
Function to monitor progress
- stop_check : callable
Function to check whether the process should terminate early
-
reset_graph
(call_back=None, stop_check=None)[source]¶ Remove all nodes and relationships in the corpus.
-
reset_type_subset
(annotation_type, subset_label)[source]¶ Reset and remove a type subset
Parameters: - annotation_type : str
Annotation type of the subset
- subset_label : str
the label for the subset
-
speakers
¶ Gets a list of speakers in the corpus
Returns: - list
Speaker names in the corpus
-
word_name
¶ Gets the word label
Returns: - str
word name
-
words
¶ Get a list of all word labels in the corpus.
Returns: - list
All word labels in the corpus
Phonological functionality¶
-
class polyglotdb.corpus.PhonologicalContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with phones
-
encode_class
(phones, label)[source]¶ Encode a list of phones as a named class
Parameters: - phones : list
a list of phones
- label : str
the label for the class
-
encode_features
(feature_dict)[source]¶ Encode phonological features as type properties of phones
Parameters: - feature_dict : dict
features to encode
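The docstring above is terse; as a hedged sketch, feature_dict is assumed to map phone labels to dictionaries of feature names and values (the labels and features below are hypothetical examples):

```python
# Sketch: an assumed shape for feature_dict, mapping phone labels to
# feature name/value pairs. Labels and feature names are hypothetical.
feature_dict = {
    'p': {'voicing': 'voiceless', 'place': 'bilabial', 'manner': 'stop'},
    'b': {'voicing': 'voiced', 'place': 'bilabial', 'manner': 'stop'},
}

def encode_phone_features(corpus_name='tutorial'):
    # Deferred import: running this requires a running database
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        c.encode_features(feature_dict)
```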
-
enrich_features
(feature_data, type_data=None)[source]¶ Sets the data type and feature data, initializes importers for feature data, adds features to hierarchy for a phone
Parameters: - feature_data : dict
the enrichment data
- type_data : dict
By default None
-
enrich_inventory_from_csv
(path)[source]¶ Enrich the phone inventory from a CSV file
Parameters: - path : str
the path to the csv file
-
remove_pattern
(pattern='[0-2]')[source]¶ Remove a stress or tone pattern from all phones
Parameters: - pattern : str
the regular expression for the pattern to remove; defaults to ‘[0-2]’
-
reset_features
(feature_names)[source]¶ Remove the specified features from phone types
Parameters: - feature_names : list
list of names of features to remove
-
Syllabic functionality¶
-
class polyglotdb.corpus.SyllabicContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with syllables
-
encode_stress_from_word_property
(word_property_name)[source]¶ Use a property on words formatted like “0-1-0” to encode stress on syllables.
The number of syllables and the position of syllables within a word will also be encoded as a result of this function.
Parameters: - word_property_name : str
Property name of words that contains the stress pattern
-
encode_stress_to_syllables
(regex=None, clean_phone_label=True)[source]¶ Use numbers (0-9) in phone labels as the stress property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.
Parameters: - regex : str
Regular expression character set for finding stress in the phone label
- clean_phone_label : bool
Flag for removing regular expression from the phone labels
-
encode_syllabic_segments
(phones)[source]¶ Encode a list of phones as ‘syllabic’
Parameters: - phones : list
A list of vowels and syllabic consonants
-
encode_syllables
(algorithm='maxonset', syllabic_label='syllabic', call_back=None, stop_check=None)[source]¶ Encodes syllables to a corpus
Parameters: - algorithm : str, defaults to ‘maxonset’
determines which algorithm will be used to encode syllables
- syllabic_label : str
Subset to use for syllabic segments (i.e., nuclei)
- call_back : callable
Function to monitor progress
- stop_check : callable
Function to check whether the process should terminate early
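Syllable encoding is a two-step process: first mark the syllabic segments, then run the syllabification algorithm. A sketch (the vowel labels and corpus name are hypothetical):

```python
# Hypothetical vowel labels to mark as syllabic nuclei
vowels = ['aa', 'ae', 'ih', 'iy', 'uw']

def encode_syllables_sketch(corpus_name='tutorial'):
    # Deferred import: running this requires a running database
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        c.encode_syllabic_segments(vowels)        # step 1: mark nuclei
        c.encode_syllables(algorithm='maxonset')  # step 2: syllabify
```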
-
encode_tone_to_syllables
(regex=None, clean_phone_label=True)[source]¶ Use numbers (0-9) in phone labels as the tone property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.
Parameters: - regex : str
Regular expression character set for finding tone in the phone label
- clean_phone_label : bool
Flag for removing regular expression from the phone labels
-
enrich_syllables
(syllable_data, type_data=None)[source]¶ Sets the data type and syllable data, initializes importers for syllable data, adds features to the hierarchy for syllables
Parameters: - syllable_data : dict
the enrichment data
- type_data : dict
By default None
-
find_codas
(syllabic_label='syllabic')[source]¶ Gets syllable codas across the corpus
Parameters: - syllabic_label : str
Subset to use for syllabic segments (i.e., nuclei)
Returns: - data : dict
A dictionary with coda values as keys and frequency values as values
-
find_onsets
(syllabic_label='syllabic')[source]¶ Gets syllable onsets across the corpus
Parameters: - syllabic_label : str
Subset to use for syllabic segments (i.e., nuclei)
Returns: - data : dict
A dictionary with onset values as keys and frequency values as values
-
has_syllabics
¶ Check whether there is a phone subset named syllabic
Returns: - bool
True if syllabic is found as a phone subset
-
has_syllables
¶ Check whether the corpus has syllables encoded
Returns: - bool
True if the syllables are in the Hierarchy
-
Lexical functionality¶
-
class polyglotdb.corpus.LexicalContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with words
-
enrich_lexicon
(lexicon_data, type_data=None, case_sensitive=False)[source]¶ Add properties to the lexicon and to the hierarchy
Parameters: - lexicon_data : dict
the data in the lexicon
- type_data : dict
defaults to None
- case_sensitive : bool
defaults to False
-
Pause functionality¶
-
class polyglotdb.corpus.PauseContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with non-speech elements
-
encode_pauses
(pause_words, call_back=None, stop_check=None)[source]¶ Set words to be pauses, as opposed to speech.
Parameters: - pause_words : str, list, tuple, or set
Either a list of words that are pauses or a string containing a regular expression that specifies pause words
- call_back : callable
Function to monitor progress
- stop_check : callable
Function to check whether process should be terminated early
-
has_pauses
¶ Check whether corpus has encoded pauses
Returns: - bool
True if pause is in the subsets available for words
-
Utterance functionality¶
-
class polyglotdb.corpus.UtteranceContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with utterances
-
encode_speech_rate
(subset_label, call_back=None, stop_check=None)[source]¶ Encode speech rate for utterances, based on the rate of segments in the specified subset
Parameters: - subset_label : str
the name of the phone subset used to compute the rate (e.g., syllabic segments)
-
encode_utterance_position
(call_back=None, stop_check=None)[source]¶ Encodes position_in_utterance for a word
-
encode_utterances
(min_pause_length=0.5, min_utterance_length=0, call_back=None, stop_check=None)[source]¶ Encode utterance annotations based on minimum pause length and minimum utterance length. See get_pauses for more information about the algorithm.
Once this function is run, utterances will be queryable like other annotation types.
Parameters: - min_pause_length : float, defaults to 0.5
Time in seconds that is the minimum duration of a pause to count as an utterance boundary
- min_utterance_length : float, defaults to 0.0
Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
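Utterance encoding presupposes that pauses have been encoded. A sketch of the usual order (pause labels, corpus name, and the 150 ms threshold are hypothetical examples):

```python
def encode_utterances_sketch(corpus_name='tutorial'):
    # Deferred import: running this requires a running database
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        # Pause words can be a list of labels or a regular expression string
        c.encode_pauses(['sil', 'sp', '<SIL>'])
        # Utterances: stretches of speech separated by pauses >= 150 ms
        c.encode_utterances(min_pause_length=0.15)
```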
-
enrich_utterances
(utterance_data, type_data=None)[source]¶ Add properties to utterances and to the hierarchy
Parameters: - utterance_data : dict
the data to enrich with
- type_data : dict
defaults to None
-
get_utterance_ids
(discourse, min_pause_length=0.5, min_utterance_length=0)[source]¶ Algorithm to find utterance boundaries in a discourse.
Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.
Parameters: - discourse : str
String identifier for a discourse
- min_pause_length : float, defaults to 0.5
Time in seconds that is the minimum duration of a pause to count as an utterance boundary
- min_utterance_length : float, defaults to 0.0
Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
-
get_utterances
(discourse, min_pause_length=0.5, min_utterance_length=0)[source]¶ Algorithm to find utterance boundaries in a discourse.
Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.
Parameters: - discourse : str
String identifier for a discourse
- min_pause_length : float, defaults to 0.5
Time in seconds that is the minimum duration of a pause to count as an utterance boundary
- min_utterance_length : float, defaults to 0.0
Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
-
Audio functionality¶
-
class polyglotdb.corpus.AudioContext(*args, **kwargs)[source]¶ Class that contains methods for dealing with audio files for corpora
-
acoustic_client
()[source]¶ Generate a client to connect to the InfluxDB for the corpus
Returns: - InfluxDBClient
Client through which to run queries and writes
-
analyze_formant_tracks
(source='praat', stop_check=None, call_back=None, multiprocessing=True, vowel_label=None)[source]¶ Compute formant tracks and save them to the database
See polyglotdb.acoustics.formants.base.analyze_formant_tracks() for more details.
Parameters: - source : str
Program to compute formants
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
- vowel_label : str, optional
Optional subset of phones to compute tracks over. If None, then tracks over utterances are computed.
-
analyze_intensity
(source='praat', stop_check=None, call_back=None, multiprocessing=True)[source]¶ Compute intensity tracks and save them to the database
See polyglotdb.acoustics.intensity.analyze_intensity() for more details.
Parameters: - source : str
Program to compute intensity (only praat is supported)
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
-
analyze_pitch
(source='praat', algorithm='base', absolute_min_pitch=50, absolute_max_pitch=500, adjusted_octaves=1, stop_check=None, call_back=None, multiprocessing=True)[source]¶ Analyze pitch tracks and save them to the database.
See polyglotdb.acoustics.pitch.base.analyze_pitch() for more details.
Parameters: - source : str
Program to use for analyzing pitch, either praat or reaper
- algorithm : str
Algorithm to use, one of base, gendered, or speaker_adjusted
- absolute_min_pitch : int
Absolute pitch floor
- absolute_max_pitch : int
Absolute pitch ceiling
- adjusted_octaves : int
How many octaves around the speaker’s mean pitch to set the speaker adjusted pitch floor and ceiling
- stop_check : callable
Function to check whether processing should stop early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag whether to use multiprocessing or threading
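A sketch of invoking pitch analysis (the corpus name and parameter values are hypothetical; all keyword arguments are from the signature above):

```python
def pitch_sketch(corpus_name='tutorial'):
    # Deferred import: running this requires Praat or Reaper plus
    # running Neo4j and InfluxDB instances
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        c.analyze_pitch(source='praat',
                        algorithm='speaker_adjusted',
                        absolute_min_pitch=55,
                        absolute_max_pitch=480)
```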
-
analyze_script
(phone_class=None, subset=None, annotation_type=None, script_path=None, duration_threshold=0.01, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]¶ Use a Praat script to analyze annotation types in the corpus. The Praat script must return properties per phone (i.e., point measures, not a track), and these properties will be saved to the Neo4j database.
See polyglotdb.acoustics.other.analyze_script() for more details.
Parameters: - phone_class : str
DEPRECATED, the name of an already encoded subset of phones on which the analysis will be run
- subset : str, optional
the name of an already encoded subset of an annotation type, on which the analysis will be run
- annotation_type : str
the type of annotation that the analysis will go over
- script_path : str
Path to the Praat script
- duration_threshold : float
Minimum duration that phones should be to be analyzed
- arguments : list
Arguments to pass to the Praat script
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
- file_type : str
Sampling rate type to use, one of consonant, vowel, or low_freq
Returns: - list
List of the names of newly added properties to the Neo4j database
-
analyze_track_script
(acoustic_name, properties, script_path, duration_threshold=0.01, phone_class=None, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]¶ Use a Praat script to analyze phones in the corpus. The Praat script must return a track, and these tracks will be saved to the InfluxDB database.
See polyglotdb.acoustics.other.analyze_track_script() for more details.
Parameters: - acoustic_name : str
Name of the acoustic measure
- properties : list
List of tuples of the form (property_name, Type)
- script_path : str
Path to the Praat script
- duration_threshold : float
Minimum duration that phones should be to be analyzed
- phone_class : str
Name of the phone subset to analyze
- arguments : list
Arguments to pass to the Praat script
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
- file_type : str
Sampling rate type to use, one of consonant, vowel, or low_freq
-
analyze_utterance_pitch
(utterance, source='praat', **kwargs)[source]¶ Analyze a single utterance’s pitch track.
See polyglotdb.acoustics.pitch.base.analyze_utterance_pitch() for more details.
Parameters: - utterance : str
Utterance ID from Neo4j
- source : str
Program to use for analyzing pitch, either praat or reaper
- kwargs
Additional settings to use in analyzing pitch
Returns: Track
Pitch track
-
analyze_vot
(classifier, stop_label='stops', stop_check=None, call_back=None, multiprocessing=False, overwrite_edited=False, vot_min=5, vot_max=100, window_min=-30, window_max=30)[source]¶ Compute VOTs for stops and save them to the database.
See polyglotdb.acoustics.vot.base.analyze_vot() for more details.
Parameters: - classifier : str
Path to an AutoVOT classifier model
- stop_label : str
Label of subset to analyze
- vot_min : int
Minimum VOT in ms
- vot_max : int
Maximum VOT in ms
- window_min : int
Window minimum in ms
- window_max : int
Window maximum in ms
- overwrite_edited : bool
If True, overwrite VOTs that have the “edited” property set to true
- call_back : callable
call back function, optional
- stop_check : callable
stop check function, optional
- multiprocessing : bool
Flag to use multiprocessing, otherwise will use threading
-
discourse_audio_directory
(discourse)[source]¶ Return the directory for the stored audio files for a discourse
-
discourse_has_acoustics
(acoustic_name, discourse)[source]¶ Return whether a discourse has any specific acoustic values associated with it
Parameters: - acoustic_name : str
Name of the acoustic type
- discourse : str
Name of the discourse
Returns: - bool
-
discourse_sound_file
(discourse)[source]¶ Get details for the audio file paths for a specified discourse.
Parameters: - discourse : str
Name of the audio file in the corpus
Returns: - dict
Information for the audio file path
-
encode_acoustic_statistic
(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]¶ Computes and saves as type properties summary statistics on a by speaker or by phone basis (or both) for a given acoustic measure.
Parameters: - acoustic_name : str
Name of the acoustic type
- statistic : str
One of mean, median, stddev, sum, mode, count
- by_speaker : bool, defaults to False
Flag for calculating summary statistic by speaker
- by_phone : bool, defaults to True
Flag for calculating summary statistic by phone
-
execute_influxdb
(query)[source]¶ Execute an InfluxDB query for the corpus
Parameters: - query : str
Query to run
Returns: influxdb.resultset.ResultSet
Results of the query
-
genders
()[source]¶ Gets all values of the speaker property named gender in the Neo4j database
Returns: - list
List of gender values
-
generate_spectrogram
(discourse, file_type='consonant', begin=None, end=None)[source]¶ Generate a spectrogram from an audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.
Parameters: - discourse : str
Name of the audio file to load
- file_type : str
One of consonant, vowel, or low_freq
- begin : float
Timestamp in seconds
- end : float
Timestamp in seconds
Returns: - numpy.array
Spectrogram information
- float
Time step between each window
- float
Frequency step between each frequency bin
-
get_acoustic_measure
(acoustic_name, discourse, begin, end, channel=0, relative_time=False, **kwargs)[source]¶ Get an acoustic track for a given discourse and time range
Parameters: - acoustic_name : str
Name of acoustic track
- discourse : str
Name of the discourse
- begin : float
Beginning of time range
- end : float
End of time range
- channel : int, defaults to 0
Channel of the audio file
- relative_time : bool, defaults to False
Flag for retrieving relative time instead of absolute time
- kwargs : kwargs
Tags to filter on
Returns: polyglotdb.acoustics.classes.Track
Track object
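A sketch of retrieving an encoded track (the corpus name, discourse name, and the acoustic name 'pitch' are hypothetical; the measure must already have been analyzed):

```python
def pitch_slice(corpus_name='tutorial', discourse='speaker01_interview'):
    """Sketch: fetch a pitch track for the first two seconds of a discourse."""
    # Deferred import: running this requires a running database
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        # Returns a polyglotdb.acoustics.classes.Track object
        return c.get_acoustic_measure('pitch', discourse, 0.0, 2.0)
```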
-
get_acoustic_statistic
(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]¶ Computes summary statistics on a by speaker or by phone basis (or both) for a given acoustic measure.
Parameters: - acoustic_name : str
Name of the acoustic type
- statistic : str
One of mean, median, stddev, sum, mode, count
- by_speaker : bool, defaults to False
Flag for calculating summary statistic by speaker
- by_phone : bool, defaults to True
Flag for calculating summary statistic by phone
Returns: - dict
Dictionary where keys are phone/speaker/phone-speaker pairs and values are the summary statistic of the acoustic measure
-
get_utterance_acoustics
(acoustic_name, utterance_id, discourse, speaker)[source]¶ Get an acoustic track for a given utterance
Parameters: - acoustic_name : str
Name of acoustic track
- utterance_id : str
ID of the utterance from the Neo4j database
- discourse : str
Name of the discourse
- speaker : str
Name of the speaker
Returns: polyglotdb.acoustics.classes.Track
Track object
-
has_all_sound_files
()[source]¶ Check whether all discourses have a sound file
Returns: - bool
True if a sound file exists for each discourse name in corpus, False otherwise
-
has_sound_files
¶ Check whether any discourses have a sound file
Returns: - bool
True if there are any sound files at all, False otherwise
-
load_audio
(discourse, file_type)[source]¶ Loads a given audio file at the specified sampling rate type (consonant, vowel, or low_freq). Consonant files have a sampling rate of 16 kHz, vowel files a sampling rate of 11 kHz, and low-frequency files a sampling rate of 1.2 kHz.
Parameters: - discourse : str
Name of the audio file to load
- file_type : str
One of consonant, vowel, or low_freq
Returns: - numpy.array
Audio signal
- int
Sampling rate of the file
-
load_waveform
(discourse, file_type='consonant', begin=None, end=None)[source]¶ Loads a segment of a larger audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.
Parameters: - discourse : str
Name of the audio file to load
- file_type : str
One of consonant, vowel, or low_freq
- begin : float, optional
Timestamp in seconds
- end : float, optional
Timestamp in seconds
Returns: - numpy.array
Audio signal
- int
Sampling rate of the file
-
reassess_utterances
(acoustic_name)[source]¶ Update utterance IDs in InfluxDB for more efficient querying if utterances have been re-encoded after acoustic measures were encoded
Parameters: - acoustic_name : str
Name of the measure for which to update utterance IDs
-
relativize_acoustic_measure
(acoustic_name, by_speaker=True, by_phone=False)[source]¶ Relativize acoustic tracks by taking the z-score of the points (using by speaker or by phone means and standard deviations, or both by-speaker, by phone) and save them as separate measures, i.e., F0_relativized from F0.
Parameters: - acoustic_name : str
Name of the acoustic measure
- by_speaker : bool, defaults to True
Flag for relativizing by speaker
- by_phone : bool, defaults to False
Flag for relativizing by phone
-
reset_acoustic_measure
(acoustic_type)[source]¶ Reset a given acoustic measure
Parameters: - acoustic_type : str
Name of the acoustic measurement to reset
-
reset_relativized_acoustic_measure
(acoustic_name)[source]¶ Reset any relativized measures that have been encoded for a specified type of acoustics
Parameters: - acoustic_name : str
Name of the acoustic type
-
save_acoustic_track
(acoustic_name, discourse, track, **kwargs)[source]¶ Save an acoustic track for a sound file
Parameters: - acoustic_name : str
Name of the acoustic type
- discourse : str
Name of the discourse
- track : Track
Track to save
- kwargs: kwargs
Tags to save for acoustic measurements
-
save_acoustic_tracks
(acoustic_name, tracks, speaker)[source]¶ Save multiple acoustic tracks for a collection of analyzed segments
Parameters: - acoustic_name : str
Name of the acoustic type
- tracks : iterable
Iterable of Track objects to save
- speaker : str
Name of the speaker of the tracks
-
update_utterance_pitch_track
(utterance, new_track)[source]¶ Save a pitch track for the specified utterance.
See polyglotdb.acoustics.pitch.base.update_utterance_pitch_track() for more details.
Parameters: - utterance : str
Utterance ID from Neo4j
- new_track : list or Track
Pitch track
Returns: - int
Time stamp of update
-
utterance_sound_file
(utterance_id, file_type='consonant')[source]¶ Generate an audio file just for a single utterance in an audio file.
Parameters: - utterance_id : str
Utterance ID from Neo4j
- file_type : str
Sampling rate type to use, one of consonant, vowel, or low_freq
Returns: - str
Path to the generated sound file
-
Summarization functionality¶
-
class polyglotdb.corpus.SummarizedContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with summary measures for linguistic items
-
average_speech_rate
()[source]¶ Get the average speech rate for each speaker in a corpus
Returns: - result: list
the average speech rate by speaker
-
baseline_duration
(annotation, speaker=None)[source]¶ Get the baseline duration of each word in corpus. Baseline duration is determined by summing the average durations of constituent phones for a word. If there is no underlying transcription available, the longest duration is considered the baseline.
Parameters: - annotation : str
the annotation type to compute baseline durations for
- speaker : str
a speaker name, if desired (defaults to None)
Returns: - word_totals : dict
a dictionary of words and baseline durations
-
encode_baseline
(annotation_type, property_name, by_speaker=False)[source]¶ Encode a baseline measure of a property, that is, the expected value of a higher annotation given the average property value of the phones that make it up. For instance, the expected duration of a word or syllable given its phonological content.
Parameters: - annotation_type : str
Name of annotation type to compute for
- property_name : str
Property of phones to compute based off of (i.e., duration)
- by_speaker : bool
Flag for whether to use by-speaker means
-
encode_measure
(property_name, statistic, annotation_type, by_speaker=False)[source]¶ Compute and save an aggregate measure for annotation types
Available statistic names:
- mean/average/avg
- sd/stdev
Parameters: - property_name : str
Name of the property
- statistic : str
Name of the statistic to use for aggregation
- annotation_type : str
Name of the annotation type
- by_speaker : bool
Flag for whether to compute aggregation by speaker
-
encode_relativized
(annotation_type, property_name, by_speaker=False)[source]¶ Compute and save to the database a relativized measure (i.e., the property value z-scored using a mean and standard deviation computed from the corpus). The computation of means and standard deviations can be by-speaker.
Parameters: - annotation_type : str
Name of the annotation type
- property_name : str
Name of the property to relativize
- by_speaker : bool
Flag to use by-speaker means and standard deviations
-
get_measure
(data_name, statistic, annotation_type, by_speaker=False, speaker=None)[source]¶ abstract function to get statistic for the data_name of an annotation_type
Parameters: - data_name : str
the aspect to summarize (duration, pitch, formants, etc)
- statistic : str
how to summarize (mean, stdev, median, etc)
- annotation_type : str
the annotation to summarize
- by_speaker : boolean
whether to summarize by speaker or not
- speaker : str
the specific speaker to encode baseline duration for (only for baseline duration)
-
Spoken functionality¶
-
class polyglotdb.corpus.SpokenContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with speaker and sound file metadata
-
enrich_discourses
(discourse_data, type_data=None)[source]¶ Add properties about discourses to the corpus, allowing them to be queryable.
Parameters: - discourse_data : dict
the data about the discourse to add
- type_data : dict
Specifies the type of the data to be added, defaults to None
-
enrich_discourses_from_csv
(path)[source]¶ Enriches discourses from a csv file
Parameters: - path : str
the path to the csv file
-
enrich_speakers
(speaker_data, type_data=None)[source]¶ Add properties about speakers to the corpus, allowing them to be queryable.
Parameters: - speaker_data : dict
the data about the speakers to add
- type_data : dict
Specifies the type of the data to be added, defaults to None
-
enrich_speakers_from_csv
(path)[source]¶ Enriches speakers from a csv file
Parameters: - path : str
the path to the csv file
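A sketch of preparing and loading a speaker metadata CSV (all column names, speaker names, and the corpus name are hypothetical; the first column is assumed to match speaker names already in the corpus):

```python
import csv
import os
import tempfile

# Hypothetical speaker metadata; the first column is assumed to match
# speaker names already present in the corpus.
rows = [('name', 'gender', 'year_of_birth'),
        ('speaker01', 'female', 1984),
        ('speaker02', 'male', 1979)]

path = os.path.join(tempfile.mkdtemp(), 'speaker_info.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

def enrich_speakers(corpus_name='tutorial'):
    # Deferred import: running this requires a running database
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        c.enrich_speakers_from_csv(path)
```

After enrichment, the new columns become queryable speaker properties.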
-
get_channel_of_speaker
(speaker, discourse)[source]¶ Get the channel that the speaker is in
Parameters: - speaker : str
Speaker to query
- discourse : str
Discourse to query
Returns: - int
Channel of audio that speaker is in
-
get_discourses_of_speaker
(speaker)[source]¶ Get a list of all discourses that a given speaker spoke in
Parameters: - speaker : str
Speaker to query over
Returns: - list
All discourses the speaker spoke in
-
get_speakers_in_discourse
(discourse)[source]¶ Get a list of all speakers that spoke in a given discourse
Parameters: - discourse : str
Audio file to query over
Returns: - list
All speakers who spoke in the discourse
-
make_speaker_annotations_dict
(data, speaker, property)[source]¶ Helper function to reorganize annotation data into the form {speaker: {property: data}}
Parameters: - data : dict
annotations and values
- property : str
the name of the property being encoded
- speaker : str
the name of the speaker
-
Structured functionality¶
-
class polyglotdb.corpus.StructuredContext(*args, **kwargs)[source]¶ Class that contains methods for dealing specifically with metadata for the corpus
-
encode_count
(higher_annotation_type, lower_annotation_type, name, subset=None)[source]¶ Encodes the count of the lower type within the higher type
Parameters: - higher_annotation_type : str
what the higher annotation is (utterance, word)
- lower_annotation_type : str
what the lower annotation is (word, phone, syllable)
- name : str
the column name
- subset : str
the annotation subset
-
encode_position
(higher_annotation_type, lower_annotation_type, name, subset=None)[source]¶ Encodes position of lower type in higher type
Parameters: - higher_annotation_type : str
what the higher annotation is (utterance, word)
- lower_annotation_type : str
what the lower annotation is (word, phone, syllable)
- name : str
the column name
- subset : str
the annotation subset
-
encode_rate
(higher_annotation_type, lower_annotation_type, name, subset=None)[source]¶ Encodes the rate of the lower type in the higher type
Parameters: - higher_annotation_type : str
what the higher annotation is (utterance, word)
- lower_annotation_type : str
what the lower annotation is (word, phone, syllable)
- name : str
the column name
- subset : str
the annotation subset
-
generate_hierarchy
()[source]¶ Get hierarchy schema information from the Neo4j database
Returns: Hierarchy
the structure of the corpus
-
Omnibus class¶
-
class polyglotdb.corpus.CorpusContext(*args, **kwargs)[source]¶ Main corpus context, inherits from the more specialized contexts.
Parameters: - args : args
Either a CorpusConfig object or sequence of arguments to be passed to a CorpusConfig object
- kwargs : kwargs
sequence of keyword arguments to be passed to a CorpusConfig object
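CorpusContext is typically used as a context manager, which opens and closes the database connection. A sketch pulling together inventory properties from the contexts above (the corpus name is hypothetical):

```python
def corpus_summary(corpus_name='tutorial'):
    """Sketch: open a corpus and gather basic inventory information.
    Requires running Neo4j/InfluxDB instances."""
    # Deferred import: needs PolyglotDB installed and a running database
    from polyglotdb import CorpusContext
    with CorpusContext(corpus_name) as c:
        return {
            'speakers': c.speakers,
            'discourses': c.discourses,
            'phone_labels': c.phones,
            'word_labels': c.words,
        }
```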
Corpus structure class¶
-
class polyglotdb.structure.Hierarchy(data=None, corpus_name=None)[source]¶ Class containing information about how a corpus is structured.
Hierarchical data is stored in the form of a dictionary with keys for linguistic types, and values for the linguistic type that contains them. If no other type contains a given type, its value is None.
Subannotation data is stored in the form of a dictionary with keys for linguistic types, and values of sets of types of subannotations.
Parameters: - data : dict
Information about the hierarchy of linguistic types
- corpus_name : str
Name of the corpus
-
acoustics
¶ Get all currently encoded acoustic measurements in the corpus
Returns: - list
All encoded acoustic measures
-
add_acoustic_properties
(corpus_context, acoustic_type, properties)[source]¶ Add acoustic properties to an encoded acoustic measure. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.
Parameters: - corpus_context : CorpusContext
CorpusContext to use for updating Neo4j database
- acoustic_type : str
Acoustic measure to add properties for
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_annotation_type
(annotation_type, above=None, below=None)[source]¶ Adds an annotation type to the Hierarchy object along with default type and token properties for the new annotation type
Parameters: - annotation_type : str
Annotation type to add
- above : str
Annotation type that is contained by the new annotation type, leave out if new annotation type is at the bottom of the hierarchy
- below : str
Annotation type that contains the new annotation type, leave out if new annotation type is at the top of the hierarchy
-
add_discourse_properties
(corpus_context, properties)[source]¶ Adds discourse properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_speaker_properties
(corpus_context, properties)[source]¶ Adds speaker properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_subannotation_properties
(corpus_context, subannotation_type, properties)[source]¶ Adds properties for a subannotation type to the Hierarchy object and syncs it to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- subannotation_type : str
Name of the subannotation type
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_subannotation_type
(corpus_context, annotation_type, subannotation_type, properties=None)[source]¶ Adds a subannotation type for a given annotation type to the Hierarchy object and syncs it to a Neo4j database. The optional list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to add a subannotation to
- subannotation_type : str
Name of the subannotation type
- properties : iterable
Optional iterable of tuples of the form (property_name, Type)
-
add_token_properties
(corpus_context, annotation_type, properties)[source]¶ Adds token properties for an annotation type and syncs them to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to add token properties for
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_token_subsets
(corpus_context, annotation_type, subsets)[source]¶ Adds token subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to add subsets for
- subsets : iterable
List of subsets to add for the annotation tokens
-
add_type_properties
(corpus_context, annotation_type, properties)[source]¶ Adds type properties for an annotation type and syncs them to a Neo4j database. The list of properties consists of tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to add type properties for
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_type_subsets
(corpus_context, annotation_type, subsets)[source]¶ Adds type subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to add subsets for
- subsets : iterable
List of subsets to add for the annotation type
-
annotation_types
¶ Get a list of all the annotation types in the hierarchy
Returns: - list
All annotation types in the hierarchy
-
from_json
(json)[source]¶ Set all properties from a dictionary deserialized from JSON
Parameters: - json : dict
Object information
-
get_depth
(lower_type, higher_type)[source]¶ Get the distance between two annotation types in the hierarchy
Parameters: - lower_type : str
Name of the lower type
- higher_type : str
Name of the higher type
Returns: - int
Distance between the two types
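Under the dictionary representation of the hierarchy, this distance can be computed by walking container links from the lower type up to the higher one. A minimal sketch, not the library's implementation, with illustrative type names:

```python
def get_depth(data, lower_type, higher_type):
    """Count containment steps from lower_type up to higher_type."""
    depth = 0
    current = lower_type
    while current != higher_type:
        current = data[current]  # step to the containing type
        depth += 1
    return depth

data = {"phone": "word", "word": "utterance", "utterance": None}
get_depth(data, "phone", "utterance")  # 2 steps: phone -> word -> utterance
```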
-
get_higher_types
(annotation_type)[source]¶ Get all annotation types that are higher than the specified annotation type
Parameters: - annotation_type : str
Annotation type from which to get higher annotation types
Returns: - list
List of all annotation types that are higher than the specified annotation type
-
get_lower_types
(annotation_type)[source]¶ Get all annotation types that are lower than the specified annotation type
Parameters: - annotation_type : str
Annotation type from which to get lower annotation types
Returns: - list
List of all annotation types that are lower than the specified annotation type
-
has_discourse_property
(key)[source]¶ Check for whether discourses have a given property
Parameters: - key : str
Property to check for
Returns: - bool
True if discourses have the given property
-
has_speaker_property
(key)[source]¶ Check for whether speakers have a given property
Parameters: - key : str
Property to check for
Returns: - bool
True if speakers have the given property
-
has_subannotation_property
(subannotation_type, property_name)[source]¶ Check whether the Hierarchy has a property associated with a subannotation type
Parameters: - subannotation_type : str
Name of subannotation to check
- property_name : str
Name of the property to check for
Returns: - bool
True if subannotation type has the given property name
-
has_subannotation_type
(subannotation_type)[source]¶ Check whether the Hierarchy has a subannotation type
Parameters: - subannotation_type : str
Name of subannotation to check for
Returns: - bool
True if subannotation type is present
-
has_token_property
(annotation_type, key)[source]¶ Check whether a given annotation type has a given token property.
Parameters: - annotation_type : str
Annotation type to check for the given token property
- key : str
Property to check for
Returns: - bool
True if the annotation type has the given token property
-
has_token_subset
(annotation_type, key)[source]¶ Check whether a given annotation type has a given token subset.
Parameters: - annotation_type : str
Annotation type to check for the given token subset
- key : str
Subset to check for
Returns: - bool
True if the annotation type has the given token subset
-
has_type_property
(annotation_type, key)[source]¶ Check whether a given annotation type has a given type property.
Parameters: - annotation_type : str
Annotation type to check for the given type property
- key : str
Property to check for
Returns: - bool
True if the annotation type has the given type property
-
has_type_subset
(annotation_type, key)[source]¶ Check whether a given annotation type has a given type subset.
Parameters: - annotation_type : str
Annotation type to check for the given type subset
- key : str
Subset to check for
Returns: - bool
True if the annotation type has the given type subset
-
highest
¶ Get the highest annotation type of the Hierarchy
Returns: - str
Highest annotation type
-
highest_to_lowest
¶ Get a list of annotation types sorted from highest to lowest
Returns: - list
Annotation types from highest to lowest
-
lowest
¶ Get the lowest annotation type of the Hierarchy
Returns: - str
Lowest annotation type
-
lowest_to_highest
¶ Get a list of annotation types sorted from lowest to highest
Returns: - list
Annotation types from lowest to highest
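These orderings follow directly from the container links in the hierarchy dictionary: the lowest type is the one no other type names as its container, and the chain of containers gives the ordering. A sketch under that assumption (again using illustrative type names):

```python
def lowest_to_highest(data):
    """Order annotation types by following container links upward."""
    # The lowest type is the one that no other type maps to as container.
    containers = set(data.values())
    lowest = next(t for t in data if t not in containers)
    order = [lowest]
    while data[order[-1]] is not None:
        order.append(data[order[-1]])
    return order

data = {"phone": "word", "word": "utterance", "utterance": None}
lowest_to_highest(data)  # ["phone", "word", "utterance"]
```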
-
phone_name
¶ Alias function for getting the lowest annotation type
Returns: - str
Name of the lowest annotation type
-
remove_acoustic_properties
(corpus_context, acoustic_type, properties)[source]¶ Remove acoustic properties from an encoded acoustic measure.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- acoustic_type : str
Acoustic measure to remove properties for
- properties : iterable
List of property names
-
remove_annotation_type
(annotation_type)[source]¶ Removes an annotation type from the hierarchy
Parameters: - annotation_type : str
Annotation type to remove
-
remove_discourse_properties
(corpus_context, properties)[source]¶ Removes discourse properties and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
List of property names to remove
-
remove_speaker_properties
(corpus_context, properties)[source]¶ Removes speaker properties and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
List of property names to remove
-
remove_subannotation_properties
(corpus_context, subannotation_type, properties)[source]¶ Removes properties for a subannotation type from the Hierarchy object and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- subannotation_type : str
Name of the subannotation type
- properties : iterable
List of property names to remove
-
remove_subannotation_type
(corpus_context, subannotation_type)[source]¶ Remove a subannotation type from the Hierarchy object and sync it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- subannotation_type : str
Subannotation type to remove
-
remove_token_properties
(corpus_context, annotation_type, properties)[source]¶ Removes token properties for an annotation type and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to remove token properties for
- properties : iterable
List of property names to remove
-
remove_token_subsets
(corpus_context, annotation_type, subsets)[source]¶ Removes token subsets from the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to remove subsets for
- subsets : iterable
List of subsets to remove for the annotation tokens
-
remove_type_properties
(corpus_context, annotation_type, properties)[source]¶ Removes type properties for an annotation type and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to remove type properties for
- properties : iterable
List of property names to remove
-
remove_type_subsets
(corpus_context, annotation_type, subsets)[source]¶ Removes type subsets from the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to remove subsets for
- subsets : iterable
List of subsets to remove for the annotation type
-
to_json
()[source]¶ Convert the Hierarchy object to a dictionary for JSON serialization
Returns: - dict
All necessary information for the Hierarchy object
-
update
(other)[source]¶ Merge Hierarchies together. If other is a dictionary, then only the hierarchical data is updated.
Parameters: - other : Hierarchy or dict
Data to be merged in
-
values
()[source]¶ Values (containing types) of the hierarchy.
Returns: - generator
Values of the hierarchy
-
word_name
¶ Shortcut for returning the annotation type matching “word”
Returns: - str or None
Annotation type that begins with “word”
Corpus config class¶
-
class
polyglotdb.config.
CorpusConfig
(corpus_name, data_dir=None, **kwargs)[source]¶ Class for storing configuration information about a corpus.
Parameters: - corpus_name : str
Identifier for the corpus
- kwargs : keyword arguments
All keywords will be converted to attributes of the object
Attributes: - corpus_name : str
Identifier of the corpus
- graph_user : str
Username for connecting to the graph database
- graph_password : str
Password for connecting to the graph database
- graph_host : str
Host for the graph database
- graph_port : int
Port for connecting to the graph database
- engine : str
Type of SQL database
- base_dir : str
Base directory to store information and temporary files for the corpus; defaults to “.pgdb” under the current user’s home directory
-
acoustic_connection_kwargs
¶ Return connection parameters to use for connecting to an InfluxDB database
Returns: - dict
Connection parameters
-
graph_connection_string
¶ Construct a connection string to use for Neo4j
Returns: - str
Connection string
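A plain sketch of how such a Neo4j connection string is typically assembled from the host and port attributes; the bolt:// scheme shown here is the standard Neo4j driver URI format, though the exact string CorpusConfig builds is an assumption:

```python
def graph_connection_string(host, port):
    # Neo4j's bolt protocol URI has the form bolt://<host>:<port>
    return f"bolt://{host}:{port}"

graph_connection_string("localhost", 7687)  # "bolt://localhost:7687"
```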
Query API¶
Base¶
BaseQuery (corpus, to_find) |
Attributes¶
Node (node_type[, corpus, hierarchy]) |
|
NodeAttribute (node, label) |
|
CollectionNode (anchor_node, collected_node) |
|
CollectionAttribute (node, label) |
Clause elements¶
ClauseElement (attribute, value) |
Base class for filter elements that will be translated to Cypher. |
EqualClauseElement (attribute, value) |
|
GtClauseElement (attribute, value) |
|
GteClauseElement (attribute, value) |
|
LtClauseElement (attribute, value) |
|
LteClauseElement (attribute, value) |
|
NotEqualClauseElement (attribute, value) |
|
InClauseElement (attribute, value) |
Annotation queries¶
GraphQuery (corpus, to_find[, stop_check]) |
Base GraphQuery class. |
SplitQuery (corpus, to_find[, stop_check]) |
Attributes¶
AnnotationNode (node_type[, corpus, hierarchy]) |
Class for annotations referenced in graph queries |
AnnotationAttribute (annotation, label) |
Class for information about the attributes of annotations in a graph query |
Clause elements¶
ContainsClauseElement (attribute, value) |
Clause for filtering based on hierarchical relations. |
Lexicon queries¶
LexiconQuery (corpus, to_find) |
Class for generating a Cypher query over the lexicon (type information about annotations in the corpus) |
Speaker queries¶
SpeakerQuery (corpus) |
Class for generating a Cypher query over speakers |
Attributes¶
SpeakerNode ([corpus, hierarchy]) |
|
SpeakerAttribute (node, label) |
|
DiscourseNode ([corpus, hierarchy]) |
|
DiscourseCollectionNode (speaker_node) |
|
ChannelAttribute (node) |
Discourse queries¶
DiscourseQuery (corpus) |
Class for generating a Cypher query over discourses |
Attributes¶
DiscourseNode ([corpus, hierarchy]) |
|
DiscourseAttribute (node, label) |
|
SpeakerNode ([corpus, hierarchy]) |
|
SpeakerCollectionNode (discourse_node) |
|
ChannelAttribute (node) |
I/O API¶
Contents:
I/O Types API¶
Parsing types¶
parsing.BreakIndexTier (name, linguistic_type) |
|
parsing.GroupingTier (name, linguistic_type) |
|
parsing.MorphemeTier (name, linguistic_type) |
|
parsing.OrthographyTier (name, linguistic_type) |
|
parsing.SegmentTier (name, linguistic_type[, …]) |
|
parsing.TobiTier (name, linguistic_type[, label]) |
|
parsing.TranscriptionTier (name, linguistic_type) |
|
parsing.TextMorphemeTier (name, linguistic_type) |
|
parsing.TextOrthographyTier (name, …[, label]) |
|
parsing.TextTranscriptionTier (name, …) |
Linguistic types¶
standardized.PGAnnotationType (name) |
Import API¶
Contents:
Inspect Functions¶
inspect_buckeye (word_path) |
Generate a BuckeyeParser for the Buckeye corpus. |
inspect_ilg (path[, number]) |
Generate an IlgParser for a specified text file for parsing it as an interlinear gloss text file |
inspect_textgrid (path) |
Generate a TextgridParser for a specified TextGrid file |
inspect_mfa (path) |
Generate an MfaParser for a specified text file for parsing it as an MFA file
inspect_fave (path) |
Generate a FaveParser for a specified text file for parsing it as a FAVE text file
inspect_maus (path) |
Generate an MausParser for a specified text file for parsing it as a MAUS file |
inspect_timit (word_path) |
Generate a TimitParser . |
Parser Classes¶
-
class
polyglotdb.io.parsers.base.
BaseParser
(annotation_tiers, hierarchy, make_transcription=True, make_label=False, stop_check=None, call_back=None)[source]¶ Base parser, extend this class for new parsers.
Parameters: - annotation_tiers: list
Annotation types of the files to parse
- hierarchy :
Hierarchy
Details of how linguistic types relate to one another
- make_transcription : bool, defaults to True
If true, create a word attribute for transcription based on segments that are contained by the word
- stop_check : callable, optional
Function to check whether to halt parsing
- call_back : callable, optional
Function to output progress messages
-
match_extension
(filename)[source]¶ Ensures that filename ends with acceptable extension
Parameters: - filename : str
the filename of the file being checked
Returns: - boolean
True if the filename has an acceptable extension, False otherwise
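The check itself amounts to a case-insensitive suffix test. A minimal sketch; the set of accepted extensions is parser-specific, and ".textgrid" here is illustrative:

```python
def match_extension(filename, extensions=(".textgrid",)):
    """Return True if filename ends with one of the accepted extensions."""
    return filename.lower().endswith(tuple(ext.lower() for ext in extensions))

match_extension("speaker1.TextGrid")  # True
match_extension("speaker1.wav")       # False
```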
-
parse_discourse
(name, types_only=False)[source]¶ Parse annotations for later importing.
Parameters: - name : str
Name of the discourse
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data
-
class
polyglotdb.io.parsers.textgrid.
TextgridParser
(annotation_tiers, hierarchy, make_transcription=True, make_label=False, stop_check=None, call_back=None)[source]¶ Parser for Praat TextGrid files.
Parameters: - annotation_tiers: list
Annotation types of the files to parse
- hierarchy :
Hierarchy
Details of how linguistic types relate to one another
- make_transcription : bool, defaults to True
If true, create a word attribute for transcription based on segments that are contained by the word
- stop_check : callable, optional
Function to check whether to halt parsing
- call_back : callable, optional
Function to output progress messages
-
load_textgrid
(path)[source]¶ Load a TextGrid file
Parameters: - path : str
Path to the TextGrid file
Returns: TextGrid
TextGrid object
-
parse_discourse
(path, types_only=False)[source]¶ Parse a TextGrid file for later importing.
Parameters: - path : str
Path to TextGrid file
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data from the file
-
class
polyglotdb.io.parsers.aligner.
AlignerParser
(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]¶ Base class for parsing TextGrid output from forced aligners.
Parameters: - annotation_tiers : list
List of the annotation tiers to store data from the TextGrid
- hierarchy : Hierarchy
Basic hierarchy of the TextGrid
- make_transcription : bool
Flag for whether to add a transcription property to words based on phones they contain
- stop_check : callable
Function to check for whether parsing should stop
- call_back : callable
Function to report progress in parsing
Attributes: - word_label : str
Label identifying word tiers
- phone_label : str
Label identifying phone tiers
- name : str
Name of the aligner the TextGrids are from
- speaker_first : bool
Whether speaker names precede tier types in the TextGrid when multiple speakers are present
-
load_textgrid
(path)¶ Load a TextGrid file
Parameters: - path : str
Path to the TextGrid file
Returns: TextGrid
TextGrid object
-
match_extension
(filename)¶ Ensures that filename ends with acceptable extension
Parameters: - filename : str
the filename of the file being checked
Returns: - boolean
True if the filename has an acceptable extension, False otherwise
-
parse_discourse
(path, types_only=False)[source]¶ Parse a forced aligned TextGrid file for later importing.
Parameters: - path : str
Path to TextGrid file
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data from the file
-
parse_information
(path, corpus_name)¶ Parses types out of a corpus
Parameters: - path : str
a path to the corpus
- corpus_name : str
name of the corpus
Returns: - data.types : list
a list of data types
-
class
polyglotdb.io.parsers.mfa.
MfaParser
(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]¶ Parser for TextGrids generated by the Montreal Forced Aligner.
-
load_textgrid
(path)¶ Load a TextGrid file
Parameters: - path : str
Path to the TextGrid file
Returns: TextGrid
TextGrid object
-
match_extension
(filename)¶ Ensures that filename ends with acceptable extension
Parameters: - filename : str
the filename of the file being checked
Returns: - boolean
True if the filename has an acceptable extension, False otherwise
-
parse_discourse
(path, types_only=False)¶ Parse a forced aligned TextGrid file for later importing.
Parameters: - path : str
Path to TextGrid file
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data from the file
-
parse_information
(path, corpus_name)¶ Parses types out of a corpus
Parameters: - path : str
a path to the corpus
- corpus_name : str
name of the corpus
Returns: - data.types : list
a list of data types
-
-
class
polyglotdb.io.parsers.fave.
FaveParser
(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]¶ Parser for TextGrids generated by FAVE-align.
-
load_textgrid
(path)¶ Load a TextGrid file
Parameters: - path : str
Path to the TextGrid file
Returns: TextGrid
TextGrid object
-
match_extension
(filename)¶ Ensures that filename ends with acceptable extension
Parameters: - filename : str
the filename of the file being checked
Returns: - boolean
True if the filename has an acceptable extension, False otherwise
-
parse_discourse
(path, types_only=False)¶ Parse a forced aligned TextGrid file for later importing.
Parameters: - path : str
Path to TextGrid file
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data from the file
-
parse_information
(path, corpus_name)¶ Parses types out of a corpus
Parameters: - path : str
a path to the corpus
- corpus_name : str
name of the corpus
Returns: - data.types : list
a list of data types
-
-
class
polyglotdb.io.parsers.maus.
MausParser
(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]¶ Parser for TextGrids generated by the Web-MAUS aligner.
-
load_textgrid
(path)¶ Load a TextGrid file
Parameters: - path : str
Path to the TextGrid file
Returns: TextGrid
TextGrid object
-
match_extension
(filename)¶ Ensures that filename ends with acceptable extension
Parameters: - filename : str
the filename of the file being checked
Returns: - boolean
True if the filename has an acceptable extension, False otherwise
-
parse_discourse
(path, types_only=False)¶ Parse a forced aligned TextGrid file for later importing.
Parameters: - path : str
Path to TextGrid file
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data from the file
-
parse_information
(path, corpus_name)¶ Parses types out of a corpus
Parameters: - path : str
a path to the corpus
- corpus_name : str
name of the corpus
Returns: - data.types : list
a list of data types
-
-
class
polyglotdb.io.parsers.timit.
TimitParser
(annotation_tiers, hierarchy, stop_check=None, call_back=None)[source]¶ Parser for the TIMIT corpus.
Has annotation types for word labels and surface transcription labels.
Parameters: - annotation_tiers: list
Annotation types of the files to parse
- hierarchy :
Hierarchy
Details of how linguistic types relate to one another
- stop_check : callable, optional
Function to check whether to halt parsing
- call_back : callable, optional
Function to output progress messages
-
parse_discourse
(word_path, types_only=False)[source]¶ Parse a TIMIT file for later importing.
Parameters: - word_path : str
Path to TIMIT .wrd file
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data from the file
-
class
polyglotdb.io.parsers.buckeye.
BuckeyeParser
(annotation_tiers, hierarchy, stop_check=None, call_back=None)[source]¶ Parser for the Buckeye corpus.
Has annotation types for word labels, word transcription, word part of speech, and surface transcription labels.
Parameters: - annotation_tiers: list
Annotation types of the files to parse
- hierarchy :
Hierarchy
Details of how linguistic types relate to one another
- stop_check : callable, optional
Function to check whether to halt parsing
- call_back : callable, optional
Function to output progress messages
-
parse_discourse
(word_path, types_only=False)[source]¶ Parse a Buckeye file for later importing.
Parameters: - word_path : str
Path to Buckeye .words file
- types_only : bool
Flag for whether to only save type information, ignoring the token information
Returns: DiscourseData
Parsed data
-
class
polyglotdb.io.parsers.labbcat.
LabbCatParser
(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]¶ Parser for TextGrids exported from LaBB-CAT
Parameters: - annotation_tiers : list
List of the annotation tiers to store data from the TextGrid
- hierarchy : Hierarchy
Basic hierarchy of the TextGrid
- make_transcription : bool
Flag for whether to add a transcription property to words based on phones they contain
- stop_check : callable
Function to check for whether parsing should stop
- call_back : callable
Function to report progress in parsing
-
class
polyglotdb.io.parsers.speaker.
FilenameSpeakerParser
(number_of_characters, left_orientation=True)[source]¶ Class for parsing a speaker name from a path that gets a specified number of characters from either the left or the right of the base file name.
Parameters: - number_of_characters : int
Number of characters to include in the speaker designation, set to 0 to get the full file name
- left_orientation : bool
Whether to pull characters from the left or right of the base file name, defaults to True
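A sketch of that slicing logic, assuming the speaker name is cut from the extension-stripped base file name; this mirrors the description above but is not the library's exact code:

```python
import os

def parse_speaker(path, number_of_characters, left_orientation=True):
    """Take N characters from the left or right of the base file name."""
    base = os.path.splitext(os.path.basename(path))[0]
    if number_of_characters == 0:
        return base  # 0 means use the full base name
    if left_orientation:
        return base[:number_of_characters]
    return base[-number_of_characters:]

parse_speaker("/corpus/s01_interview.TextGrid", 3)  # "s01"
```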
Exporters¶
Exporters are still under development.
Helper functions¶
DiscourseData (name, annotation_types, hierarchy) |
Class for collecting information about a discourse to be loaded |
inspect_directory (directory) |
Function to inspect a directory and return the most likely type of files within it. |
text_to_lines (path) |
Parse a text file into lines. |
find_wav_path (path) |
Find a sound file for a given file, by looking for a .wav file with the same base name as the given path |
Acoustics API¶
Classes¶
-
class
polyglotdb.acoustics.classes.
Track
[source]¶ Track class to contain, select, and manage
TimePoint
objectsAttributes: - points : iterable of
TimePoint
Time points with values of the acoustic track
-
items
()[source]¶ Generator for returning tuples of the time point and values
Returns: - generator
Tuples of time points and values
-
keys
()[source]¶ Get a list of all keys for TimePoints that the Track has
Returns: - list
All keys on TimePoint objects
-
class
polyglotdb.acoustics.classes.
TimePoint
(time)[source]¶ Class for handling acoustic measurements at a specific time point
Attributes: - time : float
The time of the time point
- values : dict
Dictionary of acoustic measures for the given time point
-
add_value
(name, value)[source]¶ Add a new named measure and value to the TimePoint
Parameters: - name : str
Name of the measure
- value : object
Measure value
-
has_value
(name)[source]¶ Check whether a time point contains a named measure
Parameters: - name : str
Name of the measure
Returns: - bool
True if name is in values and has a value
-
select_values
(columns)[source]¶ Generate a dictionary of only the specified measurement names
Parameters: - columns : iterable
Iterable of measurement names to include
Returns: - dict
Subset of values if their name is in the specified columns
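Since the values live in a plain dictionary, this selection reduces to a dict comprehension. A minimal sketch with illustrative measure names:

```python
def select_values(values, columns):
    """Keep only the measurements whose names appear in columns."""
    return {name: value for name, value in values.items() if name in columns}

point_values = {"F1": 512.0, "F2": 1500.0, "intensity": 65.2}
select_values(point_values, ["F1", "F2"])  # {"F1": 512.0, "F2": 1500.0}
```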
-
update
(point)[source]¶ Update values in this time point from another TimePoint
Parameters: - point :
polyglotdb.acoustics.classes.TimePoint
TimePoint to get values from
Segments¶
-
polyglotdb.acoustics.segments.
generate_segments
(corpus_context, annotation_type='utterance', subset=None, file_type='vowel', duration_threshold=0.001, padding=0, fetch_subannotations=False)[source]¶ Generate segment vectors for an annotation type, to be used as input to analyze_file_segments.
Parameters: - corpus_context :
CorpusContext
The CorpusContext object of the corpus
- annotation_type : str, optional
The type of annotation to use in generating segments, defaults to utterance
- subset : str, optional
Specify a subset to use for generating segments
- file_type : str, optional
One of ‘low_freq’, ‘vowel’, or ‘consonant’, specifies the type of audio file to use
- duration_threshold: float, optional
Segments with length shorter than this value (in seconds) will not be included
Returns: - SegmentMapping
Object containing segments to be analyzed
-
polyglotdb.acoustics.segments.
generate_vowel_segments
(corpus_context, duration_threshold=None, padding=0, vowel_label='vowel')[source]¶ Generate segment vectors for each vowel, to be used as input to analyze_file_segments.
Parameters: - corpus_context :
polyglot.corpus.context.CorpusContext
The CorpusContext object of the corpus
- duration_threshold: float, optional
Segments with length shorter than this value (in seconds) will not be included
Returns: - SegmentMapping
Object containing vowel segments to be analyzed
-
polyglotdb.acoustics.segments.
generate_utterance_segments
(corpus_context, file_type='vowel', duration_threshold=None, padding=0)[source]¶ Generate segment vectors for each utterance, to be used as input to analyze_file_segments.
Parameters: - corpus_context :
polyglot.corpus.context.CorpusContext
The CorpusContext object of the corpus
- file_type : str, optional
One of ‘low_freq’, ‘vowel’, or ‘consonant’, specifies the type of audio file to use
- duration_threshold: float, optional
Segments with length shorter than this value (in seconds) will not be included
Returns: - SegmentMapping
Object containing utterance segments to be analyzed
Formants¶
-
polyglotdb.acoustics.formants.base.
analyze_formant_tracks
(corpus_context, vowel_label=None, source='praat', call_back=None, stop_check=None, multiprocessing=True)[source]¶ Analyze formants of an entire utterance, and save the resulting formant tracks into the database.
Parameters: - corpus_context : CorpusContext
corpus context to use
- vowel_label : str, optional
Optional subset of phones to compute tracks over. If None, then tracks over utterances are computed.
- call_back : callable
call back function, optional
- stop_check : callable
stop check function, optional
-
polyglotdb.acoustics.formants.base.
analyze_formant_points
(corpus_context, call_back=None, stop_check=None, vowel_label='vowel', duration_threshold=None, multiprocessing=True)[source]¶ First pass of the algorithm; generates prototypes.
Parameters: - corpus_context :
polyglot.corpus.context.CorpusContext
The CorpusContext object of the corpus.
- call_back : callable
Information about callback.
- stop_check : callable
Information about stop check.
- vowel_label : str
The subset of phones to analyze.
- duration_threshold : float, optional
Segments with length shorter than this value (in milliseconds) will not be analyzed.
Returns: - dict
Track data
-
polyglotdb.acoustics.formants.refined.
analyze_formant_points_refinement
(corpus_context, vowel_label='vowel', duration_threshold=0, num_iterations=1, call_back=None, stop_check=None, vowel_prototypes_path='', drop_formant=False, multiprocessing=True, output_tracks=False)[source]¶ Extracts F1, F2, F3 and B1, B2, B3.
Parameters: - corpus_context :
CorpusContext
The CorpusContext object of the corpus.
- vowel_label : str
The subset of phones to analyze.
- duration_threshold : float, optional
Segments with length shorter than this value (in milliseconds) will not be analyzed.
- num_iterations : int, optional
How many times the algorithm should iterate before returning values.
- output_tracks : bool, optional
If False, save only a single set of formant values per vowel, taken at 0.33 of its duration; if True, save a track over the entire vowel duration.
Returns: - prototype_metadata : dict
Means of F1, F2, F3, B1, B2, B3 and covariance matrices per vowel class.
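When output_tracks is False, the single formant point is taken 0.33 of the way through the vowel, as described above. A sketch of that time computation (the helper itself is illustrative, not part of the PolyglotDB API):

```python
# Sketch of where the single formant point falls when output_tracks is
# False; the 0.33 proportion comes from the documentation above.

def point_time(begin, end, proportion=0.33):
    """Time (in seconds) one third of the way through a vowel."""
    return begin + (end - begin) * proportion

print(round(point_time(1.0, 1.3), 3))  # 1.099 (a third into a 300 ms vowel)
```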
Conch function generators¶
-
polyglotdb.acoustics.formants.helper.
generate_base_formants_function
(corpus_context, gender=None, source='praat')[source]¶ Generate a partial function for measuring base formant tracks with Praat.
Parameters: - corpus_context :
polyglotdb.corpus.context.CorpusContext
The CorpusContext object of the corpus.
- gender : str
The gender to use for the function; if “M” (male), the maximum formant frequency is 5000 Hz, otherwise 5500 Hz
- source : str
The source of the function; if “praat”, formants are calculated with Praat over each segment, otherwise stored formant tracks are used
Returns: - formant_function : Partial function object
The function used to call Praat.
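The gender-based Praat ceiling described above (“M” gives a 5000 Hz maximum formant frequency, anything else 5500 Hz) can be sketched as a small helper; the function name is illustrative, not part of the PolyglotDB API:

```python
# Sketch of the gender-based maximum formant frequency rule documented
# above; the helper name is hypothetical.

def max_formant_frequency(gender=None):
    return 5000 if gender == "M" else 5500

print(max_formant_frequency("M"))   # 5000
print(max_formant_frequency("F"))   # 5500
print(max_formant_frequency(None))  # 5500
```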
-
polyglotdb.acoustics.formants.helper.
generate_formants_point_function
(corpus_context, gender=None)[source]¶ Generates a function used to call Praat to measure formants and bandwidths at a single point per segment.
Parameters: - corpus_context :
CorpusContext
The CorpusContext object of the corpus.
- gender : str, optional
The gender to use for the function; if “M” (male), the maximum formant frequency is 5000 Hz, otherwise 5500 Hz
Returns: - formant_function : Partial function object
The function used to call Praat.
-
polyglotdb.acoustics.formants.helper.
generate_variable_formants_point_function
(corpus_context, min_formants, max_formants)[source]¶ Generates a function used to call Praat to measure formants and bandwidths with variable num_formants. This specific function returns a single point per formant, a third of the way through the segment.
Parameters: - corpus_context :
CorpusContext
The CorpusContext object of the corpus.
- min_formants : int
The minimum number of formants to measure with on subsequent passes (default is 4).
- max_formants : int
The maximum number of formants to measure with on subsequent passes (default is 7).
Returns: - formant_function : Partial function object
The function used to call Praat.
Intensity¶
-
polyglotdb.acoustics.intensity.
analyze_intensity
(corpus_context, source='praat', call_back=None, stop_check=None, multiprocessing=True)[source]¶ Analyze intensity of an entire utterance, and save the resulting intensity tracks into the database.
Parameters: - corpus_context :
CorpusContext
corpus context to use
- source : str
Source program to use (only praat available)
- call_back : callable
call back function, optional
- stop_check : callable
stop check function, optional
- multiprocessing : bool
Flag to use multiprocessing rather than threading
Conch function generators¶
-
polyglotdb.acoustics.intensity.
generate_base_intensity_function
(corpus_context)[source]¶ Generate an Intensity function from Conch
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for getting path to Praat (if not on the system path)
Returns: PraatSegmentIntensityTrackFunction
Intensity analysis function
Pitch¶
-
polyglotdb.acoustics.pitch.base.
analyze_pitch
(corpus_context, source='praat', algorithm='base', call_back=None, absolute_min_pitch=50, absolute_max_pitch=500, adjusted_octaves=1, stop_check=None, multiprocessing=True)[source]¶ Parameters: - corpus_context :
AudioContext
- source : str
Program to use for analyzing pitch, either praat or reaper
- algorithm : str
Algorithm to use, one of base, gendered, or speaker_adjusted
- absolute_min_pitch : int
Absolute pitch floor
- absolute_max_pitch : int
Absolute pitch ceiling
- adjusted_octaves : int
How many octaves around the speaker’s mean pitch to set the speaker adjusted pitch floor and ceiling
- stop_check : callable
Function to check whether processing should stop early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag whether to use multiprocessing or threading
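For algorithm='speaker_adjusted', the floor and ceiling are set a number of octaves around each speaker's mean pitch, clamped to the absolute bounds (see also the version 1.2 changelog entry on octave-based min and max pitch). A sketch of that arithmetic; the helper is illustrative and PolyglotDB's internal computation may differ in detail:

```python
# Illustrative sketch of octave-based, speaker-adjusted pitch bounds;
# the helper name is hypothetical, not part of the PolyglotDB API.

def speaker_pitch_bounds(mean_pitch, adjusted_octaves=1,
                         absolute_min_pitch=50, absolute_max_pitch=500):
    """Pitch floor/ceiling a given number of octaves around the mean."""
    floor = max(absolute_min_pitch, mean_pitch / (2 ** adjusted_octaves))
    ceiling = min(absolute_max_pitch, mean_pitch * (2 ** adjusted_octaves))
    return floor, ceiling

print(speaker_pitch_bounds(200))  # (100.0, 400)
print(speaker_pitch_bounds(80))   # (50, 160) -- floor clamped to absolute_min_pitch
```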
VOT¶
-
polyglotdb.acoustics.vot.base.
analyze_vot
(corpus_context, classifier, stop_label='stops', vot_min=5, vot_max=100, window_min=-30, window_max=30, overwrite_edited=False, call_back=None, stop_check=None, multiprocessing=False)[source]¶ Analyze VOT for stops using a pretrained AutoVOT classifier.
Parameters: - corpus_context :
AudioContext
- classifier : str
Path to an AutoVOT classifier model
- stop_label : str
Label of subset to analyze
- vot_min : int
Minimum VOT in ms
- vot_max : int
Maximum VOT in ms
- window_min : int
Window minimum in ms
- window_max : int
Window maximum in ms
- overwrite_edited : bool
Whether to update VOTs that have the property edited set to True
- call_back : callable
call back function, optional
- stop_check : callable
stop check function, optional
- multiprocessing : bool
Flag to use multiprocessing, otherwise will use threading
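The window_min/window_max values are millisecond offsets around each stop's annotated boundary, defining where the classifier may search. A sketch of converting those offsets to absolute times in seconds (the helper is illustrative, not the PolyglotDB implementation):

```python
# Illustrative conversion of millisecond window offsets to absolute
# times; the helper name is hypothetical.

def vot_search_window(stop_begin, window_min=-30, window_max=30):
    """Absolute search window (in seconds) around a stop boundary,
    given millisecond offsets."""
    return stop_begin + window_min / 1000.0, stop_begin + window_max / 1000.0

print(vot_search_window(2.5))  # (2.47, 2.53)
```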
Other¶
-
polyglotdb.acoustics.other.
analyze_track_script
(corpus_context, acoustic_name, properties, script_path, duration_threshold=0.01, phone_class=None, arguments=None, call_back=None, file_type='consonant', stop_check=None, multiprocessing=True)[source]¶
-
polyglotdb.acoustics.other.
analyze_script
(corpus_context, phone_class=None, subset=None, annotation_type=None, script_path=None, duration_threshold=0.01, arguments=None, call_back=None, file_type='consonant', stop_check=None, multiprocessing=True)[source]¶ Perform acoustic analysis of phones using an input praat script.
Saves the measurement results from the Praat script into the database under the same names as the Praat output columns. Praat script requirements:
- the only input is the full path to the sound file containing (only) the phone
- the script prints the output to the Praat Info window in two rows (i.e. two lines).
- the first row is a space-separated list of measurement names: these are the names that will be saved into the database
- the second row is a space-separated list of the value for each measurement
Parameters: - corpus_context :
CorpusContext
corpus context to use
- phone_class : str
DEPRECATED (use subset and annotation_type instead); the name of an already encoded subset of phones on which the analysis will be run
- subset : str, optional
the name of an already encoded subset of an annotation type, on which the analysis will be run
- annotation_type : str
the type of annotation that the analysis will go over
- script_path : str
full path to the praat script
- duration_threshold : float
Minimum duration of segments to be analyzed
- file_type : str
File type to use for the script (consonant = 16kHz sample rate, vowel = 11kHz, low_freq = 1200 Hz)
- arguments : list
a list containing any arguments to the praat script (currently not working)
- call_back : callable
call back function, optional
- stop_check : callable
stop check function, optional
- multiprocessing : bool
Flag to use multiprocessing, otherwise will use threading
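The two-row output format required of the Praat script can be illustrated with a small parser sketch. The helper is hypothetical (PolyglotDB/Conch does this parsing internally); it only shows how the space-separated name row and value row pair up:

```python
# Sketch of parsing the required two-line Praat Info output:
# row 1 = space-separated measurement names, row 2 = their values.

def parse_praat_output(info_text):
    """Parse the two-line Praat Info output into {measurement: value}."""
    lines = info_text.strip().splitlines()
    names = lines[0].split()
    values = [float(v) for v in lines[1].split()]
    return dict(zip(names, values))

print(parse_praat_output("cog peak slope\n1523.2 880.0 -1.2"))
# {'cog': 1523.2, 'peak': 880.0, 'slope': -1.2}
```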
Conch function generators¶
-
polyglotdb.acoustics.other.
generate_praat_script_function
(praat_path, script_path, arguments=None)[source]¶ Generate a partial function that calls the praat script specified. (used as input to analyze_file_segments)
Parameters: - praat_path : str
full path to praat/praatcon
- script_path : str
full path to the script
- arguments : list
a list containing any arguments to the praat script, optional (currently not implemented)
Returns: - function
the partial function which applies the Praat script to a phone and returns the script output
Changelog¶
Version 1.2¶
- Upgraded Neo4j compatibility to 4.3.3
- Upgraded InfluxDB compatibility to 1.8.9
- Changed Praat TextGrid handling to use praatio 4.1
- Phone parsing no longer includes blank intervals (i.e. silences), so preceding and following phone calculations have changed
- Updated the speaker-adjusted pitch algorithm to use octave-based min and max pitch rather than the more permissive standard-deviation approach
Version 1.0¶
- Added functionality to analyze voice-onset-time through AutoVOT
- Added functionality to analyze formant points and tracks using a refinement process based on vowel formant prototypes
- Added ability to enrich tokens from CSV
- Added parser for TextGrids generated from the Web-MAUS aligner
- Optimized loading of corpora for lower-memory computers
- Optimized queries involving acoustic tracks
See the PolyglotDB wiki for the full changelog.