Introduction

Background

PolyglotDB is a Python package for storage, phonetic analysis, and querying of speech corpora. It represents linguistic data in scalable, high-performance databases (called “Polyglot” databases here) so that acoustic analysis and other algorithms can be applied to speech corpora. PolyglotDB can be used with corpora of any size, but it is built to scale to very large ones. It is most often used on corpora aligned with the Montreal Forced Aligner, but it accepts corpora in other formats as well.

Users interact with PolyglotDB primarily through its Python API, by writing Python scripts that import functions and classes from PolyglotDB. See Getting started for setting up PolyglotDB, followed by Tutorials for walk-through examples. Case studies show how PolyglotDB has been used to address different kinds of phonetic research questions.

The general workflow for working with PolyglotDB is:

More information on implementation details is available in the Developer documentation, as well as in the PolyglotDB conference paper and the ISCAN conference paper.

Applications

In addition to the Tutorials, Case studies provides worked examples of PolyglotDB being used to answer real-world research questions. Each consists of Python scripts along with an explanation of the entire workflow.

Further examples of PolyglotDB scripts, used in the SPADE project, are available in the SPADE GitHub repo, though without accompanying explanations. Scripts from both the case studies and the SPADE repo can serve as starting points for your own studies.

Some studies which have used PolyglotDB are:

  • Sibilant moments: Stuart-Smith et al. [2019], Gunter et al. [2021], Sonderegger et al. [2023]

  • Segment durations: Tanner et al. [2020], Lo and Sóskuthy [2023]

  • Vowel formants: Mielke et al. [2019], Tanner et al. [2022], Lipari and Sonderegger [2025]

  • f0: Ting et al. [2025]

  • Finding tokens: Johnson and Babel [2024]

Note

For those interested in a web-based interface, ISCAN (Integrated Speech Corpus ANalysis) is a separate project built on top of PolyglotDB. ISCAN is not actively maintained as of 2025. See Developer documentation for more information.

Contributors

Citation

If you use PolyglotDB in your research, please cite the following paper:

McAuliffe, Michael, Elias Stengel-Eskin, Michaela Socolof, and Morgan Sonderegger (2017). Polyglot and Speech Corpus Tools: a system for representing, integrating, and querying speech corpora. In Proceedings of Interspeech 2017, pp. 3887–3891. https://doi.org/10.21437/Interspeech.2017-1390.