In addition to documenting the IO module of PolyglotDB, this document should serve as a guide for implementing future importers for additional formats.
Importing a corpus consists of several steps. First, a file must be
inspected with the relevant inspect function (i.e.,
inspect_buckeye). These functions generate Parsers for a given format
that allow annotations across many tiers to be coalesced into linguistic
types (word, segments, etc).
As an example, suppose a TextGrid has an interval tier for word labels, an interval tier for phone labels, tiers for annotating stop information (closure duration, bursts, VOT, etc). In this case, our parser would want to associate the stop information annotations with the phones (or rather a subset of the phones), and not have them as a separate linguistic type.
Following inspection, the file can be imported easily using a CorpusContext’s
load function. Under the hood, what happens is the Parser object creates
standardized linguistic annotations from the annotations in the text file,
which are then imported into the database.
Currently the following formats are supported:
- Praat TextGrids (Inspect TextGrids)
- TextGrid output from forced aligners (Montreal Forced Aligner, FAVE-align, and Web-MAUS)
- Output from other corpus management software (LaBB-CAT)
- BAS Partitur format
- Corpus-specific formats
Inspect functions (i.e.,
inspect_textgrid) return a guess for
how to parse the annotations present in a given file (or files in a given
directory). They return a parser of the respective type (i.e.,
with an attribute for the
annotation_tiers detected. For instance, the inspect function for TextGrids
will return a parser with annotation types for each interval and point tier in the TextGrid.
See TextGrid parser for full API of the TextGrid Parser
Consider the following TextGrid with interval tiers for words and phones:
inspect_textgrid function for this file will return two annotation types. From bottom to top, it will
phone annotation type and a
word annotation type. Words and phones are two special linguistic
types in PolyglotDB. Other linguistic types can be defined in a TextGrid (i.e., grouping words into utterances or phones into syllables,
though functionality exists for computing both of those automatically), but word and phone tiers must be defined.
Which tier corresponds to which special
phone type is done via heuristics. The first and most
reliable is whether the tier name contains “word” or “phone” in their tier name. The second is done by using cutoffs
for mean and SD of word and phone durations in the Buckeye corpus to determine if the intervals are more likely to be
word or phones. For reference, the mean and SD of words used is 0.2465409 and 0.03175723, and those used for phones
is 0.08327773 and 0.03175723.
From the above TextGrid, phones will have a
label property (i.e., “Y”), a
begin property (i.e., 0.15),
end property (i.e., 0.25).
Words will have a
label property (i.e., “you”), a
begin property (i.e., 0.15),
end property (i.e., 0.25), as well as a computed
made of up of all of the included phones based on timings (i.e., “Y.UW1”). Any empty intervals will result in “words”
that have the
label of “<SIL>”, which can then be marked as pause later in corpus processing
(see Encoding non-speech elements for more details).
The computation of transcription uses the midpoints of phones and whether they are between the begin and end time points of words.
Inspect forced aligned TextGrids¶
Both the Montreal Forced Aligner and FAVE-aligner generate TextGrids for files in two formats that PolyglotDB can parse. The first format
is for files with a single speaker. These files will have two tiers, one for words (named
and one for phones (named
The second format is for files with multiple speakers, where each speaker will have a pair of tiers for words (formatted as
Speaker name - words)
and phones (formatted as
Speaker name - phones).
TextGrids generated from Web-MAUS have a single format with a tier for words (named
ORT), a tier for the canonical
KAN) and a tier for phones (named
MAU). In parsing, just the tiers for words and
phones are used, as the transcription will be generated automatically.
Inspect LaBB-CAT formatted TextGrids¶
The LaBB-CAT system generates force-aligned TextGrids for files in a format that PolyglotDB can parse (though some editing may be
required due to issues in exporting single speakers in LaBB-CAT). As with the other supported aligner output formats,
PolyglotDB looks for word and phone tiers per speaker (or for just a single speaker depending on export options). The
parser will use
transcript to find the word tiers (i.e.
Speaker name - transcript) and
segment to find
the phone tiers (i.e.,
Speaker name - phones).
See LaBB-CAT parser for full API of the LaBB-CAT Parser
Inspect Buckeye Corpus¶
The Buckeye Corpus is stored in an idiosyncratic format that has two text files per sound file (i.e.,
s0101a.wav), one detailing information
about words (i.e.,
s0101a.words) and one detailing information about surface phones (i.e.
s0101a.phones). The PolyglotDB
parser extracts label, begin and end for each phone. Words have type properties for their underlying transcription and
token properties for their part of speech and begin/end.
See Buckeye parser for full API of the Buckeye Parser
Inspect TIMIT Corpus¶
The TIMIT corpus is stored in an idiosyncratic format that has two text files per sound file (i.e.,
sa1.wav), one detailing information
about words (i.e.,
sa1.WRD) and one detailing information about surface phones (i.e.
sa1.PHN). The PolyglotDB
parser extracts label, begin and end for each phone and each word. Time stamps are converted from samples in the original text files
to seconds for use in PolyglotDB.
See TIMIT parser for full API of the Buckeye Parser
Modifying aspects of parsing¶
Additional properties for linguistic units can be imported as well through the use of extra interval tiers when using a TextGrid parser (see Inspect TextGrids), as in the following TextGrid:
Here we have properties for each word’s part of speech (POS tier) and transcription. The transcription tier will overwrite the automatic calculation of transcription based on contained segments. Each of these will be properties will be type properties by default (see Neo4j implementation for more details). If these properties are meant to be token level properties (i.e., the part of speech of a word varies depending on the token produced), it can changed as follows:
from polyglotdb import CorpusContext import polyglotdb.io as pgio parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory') parser.annotation_tiers.type_property = False # The index of the TextGrid tier for POS is 2 # ... code that uses the parser to import data
If the content of a tier should be ignored (i.e., if it contains information not related to any annotations in particular), then it can be manually marked to be ignored as follows:
from polyglotdb import CorpusContext import polyglotdb.io as pgio parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory') parser.annotation_tiers.ignored = True # Index of 0 if the first tier should be ignored # ... code that uses the parser to import data
Parsers created through other inspect functions (i.e. Buckeye) can be modified in similar ways, though the TextGrid parser is necessarily the most flexible.
There are two currently implemented schemes for parsing speaker names from a file path. The first is the Filename Speaker Parser,
which takes a number of characters in the base file name (without the extension) starting either from the left or right. For
instance, the path
/path/to/buckeye/s0101a.words for a Buckeye file would return the speaker
s01 using 3 characters from the left.
The other speaker parser is the Directory Speaker Parser, which parses speakers from the directory that contains
the specified path. For instance, given the path
/path/to/buckeye/s01/s0101a.words would return
s01 because the containing
folder of the file is named
Loading of discourses is done via a CorpusContext’s
import polyglotdb.io as pgio parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid') with CorpusContext(config) as c: c.load(parser, '/path/to/textgrid.TextGrid')
load_discourse can be used with the same arguments.
load function automatically determines whether the input path to
be loaded is a single file or a folder, and proceeds accordingly.
As stated above, a CorpusContext’s
load function will import a directory of
files as well as a single file, but the
load_directory can be explicitly
called as well:
import polyglotdb.io as pgio parser = pgio.inspect_textgrid('/path/to/textgrids') with CorpusContext(config) as c: c.load_directory(parser, '/path/to/textgrids')
Writing new parsers¶
New parsers can be created through extending either the Base parser class or one of the more specialized
parser classes. There are in general three aspects that need to be implemented. First, the
_extensions property should
be updated to reflect the file extensions that the parser will find and attempt to parse. This property should be an iterable,
even if only one extension is to be used.
__init__ function should be implemented if anything above and beyond the based class init function is required
(i.e., special speaker parsing).
parse_discourse function should be overwritten to implement some way of populating data on the annotation tiers
from the source data files and ultimately create a
DiscourseData object (intermediate data representation for straight-forward importing
into the Polyglot databases).
Creating new parsers for forced aligned TextGrids requires simply extending the
and overwriting the
phone_label class properties. The
name property should also be
set to something descriptive, and the
speaker_first should be set to False if speakers follow word/phone labels in
the TextGrid tiers (i.e.,
words -Speaker name rather than
Speaker name - words). See
polyglotdb.io.parsers.labbcat.LabbcatParser for examples.