PolyglotDB I/O

In addition to documenting the IO module of PolyglotDB, this document should serve as a guide for implementing future importers for additional formats.

Import pipeline

Importing a corpus consists of several steps. First, a file must be inspected with the relevant inspect function (i.e., inspect_textgrid or inspect_buckeye). These functions generate Parsers for a given format that allow annotations across many tiers to be coalesced into linguistic types (word, segments, etc).

As an example, suppose a TextGrid has an interval tier for word labels, an interval tier for phone labels, tiers for annotating stop information (closure duration, bursts, VOT, etc). In this case, our parser would want to associate the stop information annotations with the phones (or rather a subset of the phones), and not have them as a separate linguistic type.

Following inspection, the file can be imported easily using a CorpusContext’s load function. Under the hood, what happens is the Parser object creates standardized linguistic annotations from the annotations in the text file, which are then imported into the database.

Currently the following formats are supported:

Praat TextGrids (Inspect TextGrids)
TextGrid output from forced aligners (Montreal Forced Aligner, FAVE-align, and Web-MAUS)
Output from other corpus management software (LaBB-CAT)
BAS Partitur format
Corpus-specific formats
- Buckeye
- TIMIT

Inspect

Inspect functions (i.e., inspect_textgrid) return a guess for how to parse the annotations present in a given file (or files in a given directory). They return a parser of the respective type (i.e., TextgridParser) with an attribute for the annotation_tiers detected. For instance, the inspect function for TextGrids will return a parser with annotation types for each interval and point tier in the TextGrid.

Inspect TextGrids

Note

See TextGrid parser for full API of the TextGrid Parser

Consider the following TextGrid with interval tiers for words and phones:

Image cannot be displayed in your browser

Running the inspect_textgrid function for this file will return two annotation types. From bottom to top, it will generate a phone annotation type and a word annotation type. Words and phones are two special linguistic types in PolyglotDB. Other linguistic types can be defined in a TextGrid (i.e., grouping words into utterances or phones into syllables, though functionality exists for computing both of those automatically), but word and phone tiers must be defined.

Note

Which tier corresponds to which special word and phone type is done via heuristics. The first and most reliable is whether the tier name contains “word” or “phone” in their tier name. The second is done by using cutoffs for mean and SD of word and phone durations in the Buckeye corpus to determine if the intervals are more likely to be word or phones. For reference, the mean and SD of words used is 0.2465409 and 0.03175723, and those used for phones is 0.08327773 and 0.03175723.

From the above TextGrid, phones will have a label property (i.e., “Y”), a begin property (i.e., 0.15), and a end property (i.e., 0.25). Words will have a label property (i.e., “you”), a begin property (i.e., 0.15), and a end property (i.e., 0.25), as well as a computed transcription property made of up of all of the included phones based on timings (i.e., “Y.UW1”). Any empty intervals will result in “words” that have the label of “<SIL>”, which can then be marked as pause later in corpus processing (see Encoding non-speech elements for more details).

Note

The computation of transcription uses the midpoints of phones and whether they are between the begin and end time points of words.

Inspect forced aligned TextGrids

Both the Montreal Forced Aligner and FAVE-aligner generate TextGrids for files in two formats that PolyglotDB can parse. The first format is for files with a single speaker. These files will have two tiers, one for words (named words or word) and one for phones (named phones or phone). The second format is for files with multiple speakers, where each speaker will have a pair of tiers for words (formatted as Speaker name - words) and phones (formatted as Speaker name - phones).

TextGrids generated from Web-MAUS have a single format with a tier for words (named ORT), a tier for the canonical transcription (named KAN) and a tier for phones (named MAU). In parsing, just the tiers for words and phones are used, as the transcription will be generated automatically.

Note

See MFA for full API of the MFA Parser, FAVE for full API of the FAVE Parser, and MAUS for the full API of the MAUS Parser.

Inspect LaBB-CAT formatted TextGrids

The LaBB-CAT system generates force-aligned TextGrids for files in a format that PolyglotDB can parse (though some editing may be required due to issues in exporting single speakers in LaBB-CAT). As with the other supported aligner output formats, PolyglotDB looks for word and phone tiers per speaker (or for just a single speaker depending on export options). The parser will use transcript to find the word tiers (i.e. Speaker name - transcript) and segment to find the phone tiers (i.e., Speaker name - phones).

Note

See LaBB-CAT parser for full API of the LaBB-CAT Parser

Inspect Buckeye Corpus

The Buckeye Corpus is stored in an idiosyncratic format that has two text files per sound file (i.e., s0101a.wav), one detailing information about words (i.e., s0101a.words) and one detailing information about surface phones (i.e. s0101a.phones). The PolyglotDB parser extracts label, begin and end for each phone. Words have type properties for their underlying transcription and token properties for their part of speech and begin/end.

Note

See Buckeye parser for full API of the Buckeye Parser

Inspect TIMIT Corpus

The TIMIT corpus is stored in an idiosyncratic format that has two text files per sound file (i.e., sa1.wav), one detailing information about words (i.e., sa1.WRD) and one detailing information about surface phones (i.e. sa1.PHN). The PolyglotDB parser extracts label, begin and end for each phone and each word. Time stamps are converted from samples in the original text files to seconds for use in PolyglotDB.

Note

See TIMIT parser for full API of the Buckeye Parser

Modifying aspects of parsing

Additional properties for linguistic units can be imported as well through the use of extra interval tiers when using a TextGrid parser (see Inspect TextGrids), as in the following TextGrid:

Here we have properties for each word’s part of speech (POS tier) and transcription. The transcription tier will overwrite the automatic calculation of transcription based on contained segments. Each of these will be properties will be type properties by default (see Neo4j implementation for more details). If these properties are meant to be token level properties (i.e., the part of speech of a word varies depending on the token produced), it can changed as follows:

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
parser.annotation_tiers[2].type_property = False # The index of the TextGrid tier for POS is 2

# ... code that uses the parser to import data

If the content of a tier should be ignored (i.e., if it contains information not related to any annotations in particular), then it can be manually marked to be ignored as follows:

from polyglotdb import CorpusContext
import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
parser.annotation_tiers[0].ignored = True # Index of 0 if the first tier should be ignored

# ... code that uses the parser to import data

Parsers created through other inspect functions (i.e. Buckeye) can be modified in similar ways, though the TextGrid parser is necessarily the most flexible.

Speaker parsers

There are two currently implemented schemes for parsing speaker names from a file path. The first is the Filename Speaker Parser, which takes a number of characters in the base file name (without the extension) starting either from the left or right. For instance, the path /path/to/buckeye/s0101a.words for a Buckeye file would return the speaker s01 using 3 characters from the left.

The other speaker parser is the Directory Speaker Parser, which parses speakers from the directory that contains the specified path. For instance, given the path /path/to/buckeye/s01/s0101a.words would return s01 because the containing folder of the file is named s01.

Load discourse

Loading of discourses is done via a CorpusContext’s load function:

import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')

with CorpusContext(config) as c:
    c.load(parser, '/path/to/textgrid.TextGrid')

Alternatively, load_discourse can be used with the same arguments. The load function automatically determines whether the input path to be loaded is a single file or a folder, and proceeds accordingly.

Load directory

As stated above, a CorpusContext’s load function will import a directory of files as well as a single file, but the load_directory can be explicitly called as well:

import polyglotdb.io as pgio

parser = pgio.inspect_textgrid('/path/to/textgrids')

with CorpusContext(config) as c:
    c.load_directory(parser, '/path/to/textgrids')

Writing new parsers

New parsers can be created through extending either the Base parser class or one of the more specialized parser classes. There are in general three aspects that need to be implemented. First, the _extensions property should be updated to reflect the file extensions that the parser will find and attempt to parse. This property should be an iterable, even if only one extension is to be used.

Second, the __init__ function should be implemented if anything above and beyond the based class init function is required (i.e., special speaker parsing).

Finally, the parse_discourse function should be overwritten to implement some way of populating data on the annotation tiers from the source data files and ultimately create a DiscourseData object (intermediate data representation for straight-forward importing into the Polyglot databases).

Creating new parsers for forced aligned TextGrids requires simply extending the polyglotdb.io.parsers.aligner.AlignerParser and overwriting the word_label and phone_label class properties. The name property should also be set to something descriptive, and the speaker_first should be set to False if speakers follow word/phone labels in the TextGrid tiers (i.e., words -Speaker name rather than Speaker name - words). See polyglotdb.io.parsers.mfa.MfaParser, polyglotdb.io.parsers.fave.FaveParser, polyglotdb.io.parsers.maus.MausParser, and polyglotdb.io.parsers.labbcat.LabbcatParser for examples.

Exporters

Under development.