Introduction
Background
PolyglotDB is a Python package for storage, phonetic analysis, and querying of speech corpora. It stores linguistic data in scalable, high-performance databases (called “Polyglot” databases here), so that acoustic analysis and other algorithms can be applied to speech corpora. PolyglotDB can be used with corpora of any size, but it is built to scale to very large corpora. It is most often used on corpora aligned with the Montreal Forced Aligner, but it accepts corpora in other formats as well.
Users interact with PolyglotDB primarily through its Python API, by writing Python scripts that import functions and classes from PolyglotDB. See Getting started for setting up PolyglotDB, followed by Tutorials for walk-through examples. Case studies show how PolyglotDB has been used to address different kinds of phonetic research questions.
The general workflow in PolyglotDB is:
Import
Parse and load initial data from corpus files into a Polyglot database.
See Tutorial 1: First steps for an example, using a tutorial corpus.
See Importing corpora for more details on the import process.
Enrichment
Add further information through analysis algorithms or from CSV files.
See Tutorial 2: Adding extra information for an example.
See Enrichment for more details on the enrichment process.
Query
Find specific linguistic units.
See Tutorial 3: Getting information out for an example.
See Querying corpora for more details on the query process.
Export
Generate a CSV file containing the information extracted by the query, for use in data analysis.
See Exporting a CSV file for an example.
See Exporting query results for more details on the export process.
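The four steps above can be sketched in a single script. This is a minimal illustration rather than a recommended pipeline: it assumes a corpus aligned with the Montreal Forced Aligner, a running Polyglot database, and calls of the kind used in the tutorials (CorpusContext, inspect_mfa, encode_pauses, encode_utterances, query_graph). The function name run_workflow, the pause label '<SIL>', and the particular query (utterance-final words) are assumptions chosen for the example and will need adapting to your corpus.

```python
def run_workflow(corpus_root, corpus_name, export_path):
    """Sketch of the import -> enrichment -> query -> export workflow.

    corpus_root: directory holding the corpus files (hypothetical path).
    corpus_name: name under which the corpus is stored in the database.
    export_path: where the resulting CSV file is written.
    """
    # Imported inside the function so the sketch can be loaded without
    # PolyglotDB installed; running it requires PolyglotDB and a running
    # Polyglot database.
    from polyglotdb import CorpusContext
    import polyglotdb.io as pgio

    # 1. Import: inspect the corpus files, then load them into the database.
    parser = pgio.inspect_mfa(corpus_root)
    with CorpusContext(corpus_name) as c:
        c.load(parser, corpus_root)

    # 2. Enrichment: mark pauses, then group words into utterances.
    with CorpusContext(corpus_name) as c:
        c.encode_pauses(['<SIL>'])  # assumed pause label; corpus-dependent
        c.encode_utterances(min_pause_length=0.15)

    # 3. Query: find words that end an utterance (illustrative query only).
    with CorpusContext(corpus_name) as c:
        q = c.query_graph(c.word)
        q = q.filter(c.word.end == c.word.utterance.end)
        q = q.columns(
            c.word.label.column_name('word'),
            c.word.duration.column_name('word_duration'),
        )

        # 4. Export: write the query results to a CSV file.
        q.to_csv(export_path)
```

A call such as run_workflow('/path/to/corpus', 'my_corpus', 'words.csv') would run all four steps in order; the three arguments here are placeholders. In practice, import and enrichment are usually run once, while the query and export steps are repeated as the research question develops.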
More information on specific implementation details is available in the Developer documentation, as well as in the PolyglotDB conference paper and the ISCAN conference paper.
Applications
In addition to the tutorials, Case studies give worked examples of PolyglotDB being used to answer real-world research questions. Each consists of Python scripts along with an explanation of the entire workflow.
Further examples of PolyglotDB scripts used in the SPADE project are available in the SPADE GitHub repo (but without accompanying explanations). Both the case studies and the SPADE repo contain scripts that can serve as starting points for your own studies.
Some studies which have used PolyglotDB are:
Sibilant moments: Stuart-Smith et al. [2019], Gunter et al. [2021], Sonderegger et al. [2023]
Segment durations: Tanner et al. [2020], Lo and Sóskuthy [2023]
Vowel formants: Mielke et al. [2019], Tanner et al. [2022], Lipari and Sonderegger [2025]
f0: Ting et al. [2025]
Finding tokens: Johnson and Babel [2024]
Note
For those interested in a web-based interface, ISCAN (Integrated Speech Corpus ANalysis) is a separate project built on top of PolyglotDB. ISCAN is not actively maintained as of 2025. See Developer documentation for more information.
Contributors
Michael McAuliffe (@mmcauliffe)
Xiaoyi Li (@lxy2304)
Michael Haaf (@michaelhaaf)
Elias Stengel-Eskin (@esteng)
Arlie Coles (@a-coles)
Sarah Mihuc (@samihuc)
Michael Goodale (@MichaelGoodale)
Massimo Lipari (@massimolipari)
Jeff Mielke (@jeffmielke)
James Tanner (@james-tanner)
Morgan Sonderegger (@msonderegger)
Citation
If you use PolyglotDB in your research, please cite the following paper:
McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, and Morgan Sonderegger (2017). Polyglot and Speech Corpus Tools: a system for representing, integrating, and querying speech corpora. In Proceedings of Interspeech 2017, pp. 3887–3891. https://doi.org/10.21437/Interspeech.2017-1390.