Introduction

General Background

PolyglotDB is a Python package that focuses on representing linguistic data in scalable, high-performance databases (called “Polyglot” databases here) to apply acoustic analysis and other algorithms to large speech corpora.

In general there are two ways to leverage PolyglotDB for analyzing a dataset:

  1. The first way, more appropriate for technically skilled users, is through a Python API: writing Python scripts that import functions and classes from PolyglotDB. (For this route, see Getting started for setting up PolyglotDB, followed by Tutorial for walk-through examples.) This route also makes more sense for users in an individual lab, where it can be assumed that all users have the same level of access to the datasets (i.e., differential access for ethical reasons is not a concern).
  2. The second way, more appropriate for a user group dispersed across multiple sites and where some users are less comfortable with Python scripting, is by setting up an ISCAN (Integrated Speech Corpus ANalysis) server—see the ISCAN documentation for more details. ISCAN servers allow users to view information and perform most functions of PolyglotDB through a web browser. In addition, ISCAN servers include features for the use case of multiple datasets with differential access: per-user and per-corpus permission levels, and functionality for managing multiple Polyglot databases.

This documentation site is relevant for both ways PolyglotDB can be used, but it is geared towards a technically skilled user and thus focuses more on the use case of using PolyglotDB “by script” (#1).

The general workflow for working with PolyglotDB is as follows (a minimal script sketch illustrating all four steps is given after the list):

  • Import
    • Parse and load initial data from corpus files into a Polyglot database
      • This step can take a while, from a couple of minutes to hours depending on corpus size.
      • Intended to be done once per corpus
    • See Importing the tutorial corpus for an example
    • See Importing corpora for more details on the import process
  • Enrichment
    • Add further information through analysis algorithms or from CSV files
      • Can take a while, from a couple of minutes to hours depending on enrichment and corpus size
      • Intended to be done once per corpus
    • See Tutorial 2: Adding extra information for an example
    • See Enrichment for more details on the enrichment process
  • Query
    • Find specific linguistic units
      • Should be quick, from a couple of minutes to ~10 minutes depending on corpus size
      • Intended to be done many times per corpus, for different queries
    • See Tutorial 3: Getting information out for an example
    • See Querying corpora for more details on the query process
  • Export
    • Generate a CSV file for data analysis with specific information extracted from the previous query
      • Should be quick, and intended to be done many times per corpus (like Query)
    • See Exporting a CSV file for an example
    • See Exporting query results for more details on the export process

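As a concrete illustration of these four steps, here is a minimal script sketch in the style of the tutorials. It assumes a corpus aligned with the Montreal Forced Aligner; the corpus directory, corpus name, pause label ('sil'), and query are placeholders that would need to be adapted to a real corpus, and the exact enrichment calls used will vary by corpus.

```python
from polyglotdb import CorpusContext
import polyglotdb.io as pgio

corpus_root = '/path/to/corpus'   # hypothetical location of an MFA-aligned corpus

# 1. Import: parse the corpus files and load them into a Polyglot database (once per corpus)
parser = pgio.inspect_mfa(corpus_root)
with CorpusContext('my_corpus') as c:
    c.load(parser, corpus_root)

# 2. Enrichment: add further information, e.g. utterances built from pauses (once per corpus)
with CorpusContext('my_corpus') as c:
    c.encode_pauses(['sil'])                    # 'sil' is a placeholder pause label
    c.encode_utterances(min_pause_length=0.15)  # utterances = stretches between long pauses

# 3. Query: find specific linguistic units, e.g. all utterance-initial words
with CorpusContext('my_corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.begin == c.word.utterance.begin)
    q = q.columns(c.word.label.column_name('word'),
                  c.word.duration.column_name('duration'),
                  c.word.speaker.name.column_name('speaker'))

    # 4. Export: write the queried information to a CSV file for later analysis
    q.to_csv('utterance_initial_words.csv')
```
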
The thinking behind this workflow is explained in more detail in the ISCAN conference paper.

Note

There are also many PolyglotDB scripts written for the SPADE project that can be used as examples. These scripts are available in the SPADE GitHub repo.

High level overview

PolyglotDB represents language (speech and text corpora) using the annotation graph formalism put forth in Bird and Liberman (2001). Annotations are represented as a directed acyclic graph, where nodes are points in time in an audio file or points in a text file. Directed edges are labelled with annotations, and multiple levels of annotation can be included or excluded as desired. Bird and Liberman also describe a relational formalism for annotation graphs, and later work implements that formalism in SQL. The LaBB-CAT and EMU-SDMS systems likewise use the annotation graph formalism.
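For intuition, the annotation graph for a single word token might look like the following toy sketch (plain Python, not PolyglotDB code): nodes are time points, and each labelled edge spans two nodes at a particular annotation level.

```python
# Toy annotation graph for one token of the word "cat" (illustrative only).
# Nodes are points in time (seconds) in an audio file.
nodes = [0.00, 0.15, 0.32, 0.45]

# Each directed edge spans two nodes and carries an annotation level and a label.
edges = [
    (0.00, 0.45, 'word',  'cat'),
    (0.00, 0.15, 'phone', 'k'),
    (0.15, 0.32, 'phone', 'ae'),
    (0.32, 0.45, 'phone', 't'),
]

# A level of annotation is included or excluded simply by filtering edges.
phones = [(b, e, label) for b, e, level, label in edges if level == 'phone']
```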

Recently, NoSQL databases have been rising in popularity, and one type of these is the graph database. In this type of database, nodes and relationships are primitives rather than relational tables. Graph databases map onto annotation graphs in a much cleaner fashion than relational databases. The graph database used in PolyglotDB is Neo4j.

PolyglotDB also uses a NoSQL time-series database called InfluxDB. Acoustic measurements like F0 and formants are stored there, since every time step (10 ms) has a value associated with it. Each measurement is also associated with a speaker and with a phone from the graph database.
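As a sketch of how this looks from the Python API (call names follow the acoustic-analysis tutorials; a pitch-analysis backend such as Praat or REAPER must be configured, and the phone label 'aa' is a placeholder), one can generate pitch tracks and then query them alongside information from the graph database:

```python
from polyglotdb import CorpusContext

with CorpusContext('my_corpus') as c:
    # Compute F0 for the corpus; values are stored in InfluxDB at 10 ms steps,
    # keyed by speaker, sound file, and time.
    c.analyze_pitch()

    # Query the stored F0 track for tokens of a placeholder phone label 'aa',
    # combining graph information (label, begin, end) with the time series.
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.columns(c.phone.label, c.phone.begin, c.phone.end, c.phone.pitch.track)
    results = q.all()
```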

Multiple versions of imported sound files are generated at various sampling rates (1200 Hz, 11000 Hz, and 22050 Hz) to help speed up relevant algorithms. For example, pitch algorithms do not need a highly sampled signal, and higher sample rates slow down the processing of files.

The idea of using multiple languages or technologies, each suited to a particular problem, is known as “polyglot persistence,” particularly in the realm of combining SQL and NoSQL databases.

More information on specific implementation details is available in the Developer documentation, as well as in the Interspeech proceedings paper.

Development history

PolyglotDB was originally conceptualized for use in Phonological CorpusTools, developed at the University of British Columbia. However, primary development shifted to the umbrella of Montreal Corpus Tools, developed by members of the Montreal Language Modelling Lab at McGill University (now part of MCQLL Lab).

A graphical program named Speech Corpus Tools was originally developed to allow users to interact with Polyglot without writing scripts. However, in the context of the Speech Across Dialects of English (SPADE) project, a more flexible solution was needed to accommodate use cases involving multiple users, with physical separation between users and data, and differing levels of permission across datasets. ISCAN has been developed within the SPADE project with these requirements in mind.

Contributors

Citation

A citeable paper for PolyglotDB is:

McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, and Morgan Sonderegger (2017). Polyglot and Speech Corpus Tools: a system for representing, integrating, and querying speech corpora. In Proceedings of Interspeech 2017. [PDF]

Or you can cite it via:

McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, Arlie Coles, Sarah Mihuc, Michael Goodale, and Morgan Sonderegger (2019). PolyglotDB [Computer program]. Version 0.1.0, retrieved 26 March 2019 from https://github.com/MontrealCorpusTools/PolyglotDB.