PolyglotDB is a Python package that focuses on representing linguistic data in scalable, high-performance databases (called “Polyglot” databases here) to apply acoustic analysis and other algorithms to large speech corpora.
In general there are two ways to leverage PolyglotDB for analyzing a dataset:
The first way, more appropriate for technically skilled users, is through a Python API: writing Python scripts that import functions and classes from PolyglotDB. (For this route, see Getting started for setting up PolyglotDB, followed by Tutorial for walk-through examples.) This way also makes more sense for users in an individual lab, where it can be assumed that all users have the same level of access to datasets, so ethical restrictions on who may see which data do not arise.
The second way, more appropriate for a user group dispersed across multiple sites and where some users are less comfortable with Python scripting, is by setting up an ISCAN (Integrated Speech Corpus ANalysis) server (see the ISCAN documentation for more details). ISCAN servers allow users to view information and perform most functions of PolyglotDB through a web browser. In addition, ISCAN servers include features for the use case of multiple datasets with differential access: permission levels set per user and per corpus, and functionality for managing multiple Polyglot databases.
This documentation site is relevant for both ways PolyglotDB can be used, but it is geared towards technically skilled users and thus focuses on the use case of using PolyglotDB “by script” (the first way above).
The general workflow for working with PolyglotDB is:
1. Parse and load initial data from corpus files into a Polyglot database
   - Can take a while, from a couple of minutes to hours depending on corpus size
   - Intended to be done once per corpus
   - See Importing the tutorial corpus for an example
   - See Importing corpora for more details on the import process
2. Add further information through analysis algorithms or from CSV files
   - Can take a while, from a couple of minutes to hours depending on enrichment and corpus size
   - Intended to be done once per corpus
   - See Tutorial 2: Adding extra information for an example
   - See Enrichment for more details on the enrichment process
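The import and enrichment steps above can be sketched as a short script. This is a minimal sketch rather than a definitive recipe: it assumes a PolyglotDB database is already running, that the corpus was force-aligned with the Montreal Forced Aligner (hence `inspect_mfa`), and that silences are labelled `<SIL>`. The function name `import_and_enrich`, the pause pattern, and the 0.15 s pause threshold are illustrative choices and should be adapted to the corpus at hand.

```python
def import_and_enrich(corpus_name, corpus_root):
    """Load a corpus into a Polyglot database, then add basic enrichments.

    Assumes a running PolyglotDB database and an MFA-aligned corpus
    located at `corpus_root`; both are assumptions of this sketch.
    """
    # Imported here so that defining the function does not itself
    # require PolyglotDB to be installed.
    from polyglotdb import CorpusContext
    import polyglotdb.io as pgio

    # Step 1: inspect the corpus files and load them (done once per corpus).
    parser = pgio.inspect_mfa(corpus_root)
    with CorpusContext(corpus_name) as c:
        c.load(parser, corpus_root)

    # Step 2: enrich the imported corpus (also done once per corpus).
    with CorpusContext(corpus_name) as c:
        # Treat words matching this pattern as pauses (assumed label).
        c.encode_pauses('^<SIL>$')
        # Group words into utterances separated by pauses of at least 150 ms.
        c.encode_utterances(min_pause_length=0.15)
```

A call such as `import_and_enrich('my_corpus', '/path/to/my_corpus')` would then be run once; both the corpus name and the path here are placeholders.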
The thinking behind this workflow is explained in more detail in the ISCAN conference paper.
High-level overview
PolyglotDB represents language (speech and text corpora) using the annotation graph formalism put forth in Bird and Liberman (2001). Annotations are represented in a directed acyclic graph, where nodes are points in time in an audio file or points in a text file. Directed edges are labelled with annotations, and multiple levels of annotations can be included or excluded as desired. Bird and Liberman also put forth a relational formalism for annotation graphs, and later work implements the formalism in SQL. Similarly, LaBB-CAT and EMU-SDMS use the annotation graph formalism.
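To make the formalism concrete, here is a minimal sketch of an annotation graph in plain Python. It is an illustration only, not how PolyglotDB stores data: the `Edge` tuple, the helper functions, and the one-word example with its timings are all hypothetical. Nodes are time points in seconds, and each directed edge carries an annotation level and a label.

```python
from collections import namedtuple

# A directed edge runs from one time-point node to a later one and
# carries an annotation level (e.g. "word", "phone") and a label.
Edge = namedtuple("Edge", ["start", "end", "level", "label"])

# Hypothetical one-word utterance; times are seconds into the audio file.
edges = [
    Edge(0.00, 0.35, "word", "cat"),
    Edge(0.00, 0.08, "phone", "k"),
    Edge(0.08, 0.27, "phone", "ae"),
    Edge(0.27, 0.35, "phone", "t"),
]


def annotations(edges, level):
    """Return the labels on one annotation level, ordered by start time."""
    ordered = sorted(edges, key=lambda e: e.start)
    return [e.label for e in ordered if e.level == level]


def contained_in(edges, parent, level):
    """Return edges on `level` whose time span lies within `parent`'s span."""
    return [
        e for e in edges
        if e.level == level and parent.start <= e.start and e.end <= parent.end
    ]


word = edges[0]
phones_in_word = [e.label for e in contained_in(edges, word, "phone")]
```

Because levels are just edge labels over shared time points, including or excluding a level amounts to filtering edges, and hierarchical relations (phones within a word) fall out of the time spans.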
Recently, NoSQL databases have been rising in popularity, and one type of these is the graph database. In this type of database, nodes and relationships are primitives rather than relational tables. Graph databases map onto annotation graphs much more cleanly than relational databases do. The graph database used in PolyglotDB is Neo4j.
PolyglotDB also uses a NoSQL time-series database called InfluxDB. Acoustic measurements like F0 and formants are stored here because every time step (10 ms) has a value associated with it. Each measurement is also associated with a speaker and a phone from the graph database.
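The shape of such time-series data can be sketched in plain Python. This is a minimal illustration, assuming one F0 value per 10 ms step; the `Point` class, the helper functions, and the example values are hypothetical and do not reflect the actual InfluxDB schema used by PolyglotDB.

```python
from dataclasses import dataclass

STEP = 0.01  # one measurement every 10 ms


@dataclass
class Point:
    time: float    # seconds from the start of the sound file
    f0: float      # measured value in Hz
    speaker: str   # links the measurement to a speaker in the graph
    phone: str     # links the measurement to a phone in the graph


def make_track(start_index, values, speaker, phone):
    """Build one point per time step, starting at step `start_index`."""
    return [
        Point(round((start_index + i) * STEP, 2), value, speaker, phone)
        for i, value in enumerate(values)
    ]


def mean_f0(points, phone):
    """Average F0 over all points tagged with the given phone."""
    values = [p.f0 for p in points if p.phone == phone]
    return sum(values) / len(values) if values else None


# Hypothetical F0 values for a vowel starting 80 ms into the file.
track = make_track(8, [118.0, 121.5, 123.0], speaker="spk1", phone="ae")
```

Tagging each point with a speaker and a phone is what lets a measurement be joined back to the annotations, for example to average F0 over one phone as `mean_f0(track, "ae")` does here.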
Multiple versions of imported sound files are generated at various sampling rates (1200 Hz, 11000 Hz, and 22050 Hz) to help speed up relevant algorithms. For example, pitch algorithms don’t need a highly sampled signal and higher sample rates will slow down the processing of files.
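The trade-off can be put in numbers with a small back-of-the-envelope sketch. A signal sampled at a given rate can represent frequencies only up to half that rate (the Nyquist limit), while the number of samples to process grows linearly with the rate. The helper names below are illustrative, and the one-minute duration is an arbitrary example.

```python
# Sampling rates listed above, in Hz.
RATES_HZ = [1200, 11000, 22050]


def nyquist(rate_hz):
    """Highest frequency (Hz) representable at this sampling rate."""
    return rate_hz / 2


def samples(duration_s, rate_hz):
    """Number of samples in a recording of the given duration."""
    return int(duration_s * rate_hz)


# For one minute of audio: (Nyquist limit in Hz, number of samples).
summary = {rate: (nyquist(rate), samples(60, rate)) for rate in RATES_HZ}
```

At 1200 Hz the limit is 600 Hz, which covers typical speaking F0, while one minute of audio holds 72,000 samples instead of the 1,323,000 needed at 22050 Hz.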
The idea of using multiple languages or technologies, each suited to an individual problem, is known as “polyglot persistence,” particularly in the realm of merging SQL and NoSQL databases.
PolyglotDB was originally conceptualized for use in Phonological CorpusTools, developed at the University of British Columbia. However, primary development shifted to the umbrella of Montreal Corpus Tools, developed by members of the Montreal Language Modelling Lab at McGill University (now part of MCQLL Lab).
A graphical program named Speech Corpus Tools was originally developed to allow users to interact with Polyglot without writing scripts. However, in the context of the Speech Across Dialects of English (SPADE) project, a more flexible solution was needed to accommodate use cases involving multiple users, with physical separation between users and data, and differing levels of permission across datasets. ISCAN has been developed within the SPADE project with these requirements in mind.
A citeable paper for PolyglotDB is:
McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, and Morgan Sonderegger (2017). Polyglot and Speech Corpus Tools: a system for representing, integrating, and querying speech corpora. In Proceedings of Interspeech 2017.
Or you can cite it via:
McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, Arlie Coles, Sarah Mihuc, Michael Goodale, and Morgan Sonderegger (2019). PolyglotDB [Computer program]. Version 0.1.0, retrieved 26 March 2019 from https://github.com/MontrealCorpusTools/PolyglotDB.