Developer documentation

This section of the documentation explains implementation details of PolyglotDB.

Overview

PolyglotDB represents language (speech and text corpora) using the annotation graph formalism put forth by Bird and Liberman [2001]. Annotations are represented in a directed acyclic graph, where nodes are points in time in an audio file or points in a text file. Directed edges are labelled with annotations, and multiple levels of annotations can be included or excluded as desired. Bird and Liberman put forth a relational formalism for annotation graphs, and later work implements the formalism in SQL. Similarly, the LaBB-CAT and EMU-SDMS speech database management systems use the annotation graph formalism, implemented on top of SQL databases.
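As an illustration of the formalism only, the following is a minimal sketch of an annotation graph in plain Python. It is not PolyglotDB's internal representation: the `Edge` class, the helper functions, and the toy utterance are all hypothetical. Nodes are time points, and each labelled edge spans two nodes on a named annotation level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A labelled, directed edge between two time-point nodes (hypothetical type)."""
    start: float  # source node: time in seconds
    end: float    # target node: time in seconds
    level: str    # annotation level, e.g. "word" or "phone"
    label: str    # the annotation carried by the edge


# Toy data (made up for illustration): the word "cat" and its three phones.
edges = [
    Edge(0.00, 0.30, "word", "cat"),
    Edge(0.00, 0.08, "phone", "k"),
    Edge(0.08, 0.22, "phone", "ae"),
    Edge(0.22, 0.30, "phone", "t"),
]


def nodes(edges):
    """Return the sorted set of time points that act as graph nodes."""
    return sorted({t for e in edges for t in (e.start, e.end)})


def level(edges, name):
    """Select a single annotation level, leaving the others out."""
    return [e for e in edges if e.level == name]


def contained_in(edges, parent, child_level):
    """Labels of edges on child_level that fall within the parent's time span."""
    return [
        e.label
        for e in edges
        if e.level == child_level and parent.start <= e.start and e.end <= parent.end
    ]


word = level(edges, "word")[0]
phones_in_word = contained_in(edges, word, "phone")
```

Because every level shares the same time-point nodes, including or excluding a level amounts to filtering edges, and hierarchical questions ("which phones are in this word?") reduce to comparing node positions.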

PolyglotDB takes a different approach to implementing the annotation graph formalism, using NoSQL databases. One type of NoSQL database is the graph database, where nodes and relationships are primitives rather than relational tables. Graph databases map onto annotation graphs more directly than relational databases do, allowing the database to closely match the structure of speech corpora. The graph database used in PolyglotDB is Neo4j.

PolyglotDB also uses a NoSQL time-series database called InfluxDB. Acoustic measurements such as F0 and formants are stored here because they are time series: every time step (10 ms) has a value associated with it. Each measurement is also associated with a speaker and a phone from the graph database.
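The shape of such data can be sketched as follows. This is an assumption-laden illustration in plain Python, not InfluxDB's client API and not PolyglotDB's actual schema: the `make_points` helper, the dictionary keys, the speaker name, and the F0 values are all hypothetical. Each 10 ms step yields one point carrying a measured value plus the speaker and phone it belongs to.

```python
STEP_MS = 10  # time step between measurements, as described above


def make_points(track, speaker, phones):
    """Turn a list of F0 values (one per 10 ms step) into tagged points.

    phones is a list of (start_ms, end_ms, label) intervals; each point is
    tagged with the phone whose interval contains its timestamp, or None.
    """
    points = []
    for i, f0 in enumerate(track):
        t = i * STEP_MS
        phone = next((label for s, e, label in phones if s <= t < e), None)
        points.append(
            {
                "time_ms": t,
                "tags": {"speaker": speaker, "phone": phone},
                "fields": {"F0": f0},
            }
        )
    return points


# Toy data (made up): a voiceless "k" followed by a voiced "ae".
phones = [(0, 20, "k"), (20, 50, "ae")]
track = [0.0, 0.0, 121.5, 123.0, 124.2]

points = make_points(track, "speaker_01", phones)

# Selecting by tag recovers the F0 track for a single phone.
ae_f0 = [p["fields"]["F0"] for p in points if p["tags"]["phone"] == "ae"]
```

Tagging each point with its speaker and phone is what lets a time-series query be restricted to units found through the graph database.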

Multiple versions of imported sound files are generated at various sampling rates (1200 Hz, 11000 Hz, and 22050 Hz) to help speed up relevant algorithms. For example, pitch algorithms don’t need a highly sampled signal, and higher sampling rates slow down the processing of files.
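The reasoning behind this can be sketched with the Nyquist criterion: a signal sampled at a given rate can only represent frequencies up to half that rate. The example below is a rough illustration, not PolyglotDB code. The `lowest_sufficient_rate` helper and the frequency ceilings passed to it are assumptions chosen for illustration, and nothing here states which file PolyglotDB actually uses for which algorithm.

```python
# Sampling rates of the generated sound files, as listed above.
SAMPLING_RATES_HZ = [1200, 11000, 22050]


def nyquist(rate_hz):
    """Highest frequency representable at a given sampling rate."""
    return rate_hz / 2


def lowest_sufficient_rate(max_freq_hz, rates=SAMPLING_RATES_HZ):
    """Smallest available rate whose Nyquist frequency covers max_freq_hz.

    Returns None if no available rate is high enough.
    """
    for rate in sorted(rates):
        if nyquist(rate) >= max_freq_hz:
            return rate
    return None


# Assumed ceilings (hypothetical): 500 Hz for pitch, 5000 Hz for formants,
# 10000 Hz for high-frequency consonant energy.
pitch_rate = lowest_sufficient_rate(500)
formant_rate = lowest_sufficient_rate(5000)
consonant_rate = lowest_sufficient_rate(10000)
```

Under these assumed ceilings, a pitch analysis would need roughly one eighteenth of the samples that a 22050 Hz file contains, which is where the speed-up comes from.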

The overarching structure of PolyglotDB is based around these two database technologies: Neo4j and InfluxDB. (A SQL database is also used for certain tasks.) Each of these database systems is devoted to modelling, storing, and querying a specific type of data: graph data for Neo4j and time-series data for InfluxDB. Because speech data can be modelled in each of these ways (see Annotation Graphs for more details on representing annotations as graphs), using these databases lets PolyglotDB draw on their performance and scalability when dealing with large speech corpora.

The idea of using multiple languages or technologies, each suited to an individual problem, is known as “polyglot persistence”, particularly in the realm of combining SQL and NoSQL databases; hence the name PolyglotDB.

Please see the InterSpeech proceedings paper for more information on the high-level motivations for PolyglotDB.

Note

ISCAN (Integrated Speech Corpus Analysis) is a separate project built on top of PolyglotDB that provides a web-based interface for corpus management and analysis. An ISCAN server can be set up on a lab’s central server, or you can run it on your local computer (though many of PolyglotDB’s algorithms benefit from having more processors and memory available). Please see the ISCAN documentation for more information on setting it up and the ISCAN conference paper for details. The main benefits of ISCAN are support for multiple Polyglot databases (separating out different corpora and allowing any of them to be started or shut down), graphical interfaces for inspecting data, and a user authentication system with different levels of permission for remote access through a web application. ISCAN is not actively maintained as of 2025 and may require additional effort to configure and use. It is not the recommended or default option for most users. The primary and supported way to interact with PolyglotDB remains its Python API.
