Parser Classes

Base parser

class polyglotdb.io.parsers.base.BaseParser(annotation_tiers, hierarchy, make_transcription=True, make_label=False, stop_check=None, call_back=None)[source]

Base parser; extend this class to create new parsers.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

make_transcription : bool, defaults to True

If True, create a transcription attribute on each word from the segments that the word contains

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages
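As a rough illustration of what make_transcription describes, the sketch below pairs each word with the labels of the segments that fall inside its time span. It is not PolyglotDB's implementation: the tuple layout, the build_transcriptions name, and the "." separator are all assumptions made for this example.

```python
from typing import List, Tuple

# (label, begin, end) triples; hypothetical stand-ins for parsed annotations
Interval = Tuple[str, float, float]


def build_transcriptions(
    words: List[Interval], segments: List[Interval], sep: str = "."
) -> List[Tuple[str, str]]:
    """Pair each word label with the joined labels of the segments it contains."""
    result = []
    for w_label, w_begin, w_end in words:
        # A segment belongs to a word when it lies fully inside the word's span
        contained = [
            s_label
            for s_label, s_begin, s_end in segments
            if s_begin >= w_begin and s_end <= w_end
        ]
        result.append((w_label, sep.join(contained)))
    return result


words = [("cat", 0.0, 0.3), ("sat", 0.3, 0.6)]
segments = [
    ("k", 0.0, 0.1), ("ae", 0.1, 0.2), ("t", 0.2, 0.3),
    ("s", 0.3, 0.4), ("ae", 0.4, 0.5), ("t", 0.5, 0.6),
]
transcriptions = build_transcriptions(words, segments)
```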

match_extension(filename)[source]

Checks whether the filename ends with an acceptable extension.

Parameters:
filename : str

Name of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise
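A check of this kind can be sketched with the standard library alone. The function name, the default extension tuple, and the case-insensitive comparison are assumptions for illustration; the set of extensions each parser actually accepts is not stated here.

```python
import os


def has_acceptable_extension(filename, extensions=(".textgrid",)):
    """Return True if filename ends with one of the given extensions.

    The comparison is case-insensitive (an assumption of this sketch).
    """
    ext = os.path.splitext(filename)[1].lower()
    return ext in extensions
```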

parse_discourse(name, types_only=False)[source]

Parse annotations for later importing.

Parameters:
name : str

Name of the discourse

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data

parse_information(path, corpus_name)[source]

Parses types out of a corpus

Parameters:
path : str

Path to the corpus

corpus_name : str

Name of the corpus

Returns:
data.types : list

List of data types

TextGrid parser

class polyglotdb.io.parsers.textgrid.TextgridParser(annotation_tiers, hierarchy, make_transcription=True, make_label=False, stop_check=None, call_back=None)[source]

Parser for Praat TextGrid files.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

make_transcription : bool, defaults to True

If True, create a transcription attribute on each word from the segments that the word contains

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages
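The stop_check and call_back parameters follow a common cooperative-cancellation pattern. The sketch below shows one way a caller-supplied pair of callables might be used in a processing loop. The parse_all helper, the parse_one argument, and the message format are hypothetical, and the exact arguments PolyglotDB passes to call_back are not documented here.

```python
def parse_all(paths, parse_one, stop_check=None, call_back=None):
    """Apply parse_one to each path, honouring optional stop and progress hooks."""
    results = []
    for i, path in enumerate(paths):
        # Ask the caller whether to halt before starting each file
        if stop_check is not None and stop_check():
            break
        # Report progress as a plain message (format assumed for this sketch)
        if call_back is not None:
            call_back(f"Parsing {path} ({i + 1} of {len(paths)})")
        results.append(parse_one(path))
    return results


messages = []
parsed = parse_all(
    ["a.TextGrid", "b.TextGrid"], parse_one=str.upper, call_back=messages.append
)
```

Passing `stop_check=lambda: True` would make the loop exit before any file is processed.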

load_textgrid(path)[source]

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

parse_discourse(path, types_only=False)[source]

Parse a TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

Forced alignment output parser

class polyglotdb.io.parsers.aligner.AlignerParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Base class for parsing TextGrid output from forced aligners.

Parameters:
annotation_tiers : list

List of the annotation tiers to store data from the TextGrid

hierarchy : Hierarchy

Basic hierarchy of the TextGrid

make_transcription : bool

Flag for whether to add a transcription property to words based on phones they contain

stop_check : callable

Function to check whether parsing should stop

call_back : callable

Function to report progress in parsing

Attributes:
word_label : str

Label identifying word tiers

phone_label : str

Label identifying phone tiers

name : str

Name of the aligner the TextGrids are from

speaker_first : bool

Whether speaker names precede tier types in the TextGrid when multiple speakers are present
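To make the role of speaker_first concrete, the following sketch splits a tier name into a speaker and a tier type. The " - " separator and the split_tier_name helper are assumptions made for this example, not part of the documented API; aligners may format tier names differently.

```python
def split_tier_name(tier_name, speaker_first=True, sep=" - "):
    """Split a tier name into (speaker, tier_type).

    Returns (None, tier_name) when the name carries no speaker part.
    """
    if sep not in tier_name:
        return None, tier_name
    if speaker_first:
        # e.g. "Alice - words": speaker comes before the tier type
        speaker, tier_type = tier_name.split(sep, 1)
    else:
        # e.g. "words - Alice": tier type comes before the speaker
        tier_type, speaker = tier_name.rsplit(sep, 1)
    return speaker, tier_type
```

The returned tier type could then be compared against word_label and phone_label to decide which tiers hold words and which hold phones.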

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Checks whether the filename ends with an acceptable extension.

Parameters:
filename : str

Name of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)[source]

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

Path to the corpus

corpus_name : str

Name of the corpus

Returns:
data.types : list

List of data types

MFA

class polyglotdb.io.parsers.mfa.MfaParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids generated by the Montreal Forced Aligner.

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Checks whether the filename ends with an acceptable extension.

Parameters:
filename : str

Name of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

Path to the corpus

corpus_name : str

Name of the corpus

Returns:
data.types : list

List of data types

FAVE

class polyglotdb.io.parsers.fave.FaveParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids generated by FAVE-align.

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Checks whether the filename ends with an acceptable extension.

Parameters:
filename : str

Name of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

Path to the corpus

corpus_name : str

Name of the corpus

Returns:
data.types : list

List of data types

MAUS

class polyglotdb.io.parsers.maus.MausParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids generated by the Web-MAUS aligner.

load_textgrid(path)

Load a TextGrid file

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object

match_extension(filename)

Checks whether the filename ends with an acceptable extension.

Parameters:
filename : str

Name of the file being checked

Returns:
bool

True if the filename has an acceptable extension, False otherwise

parse_discourse(path, types_only=False)

Parse a force-aligned TextGrid file for later importing.

Parameters:
path : str

Path to TextGrid file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file

parse_information(path, corpus_name)

Parses types out of a corpus

Parameters:
path : str

Path to the corpus

corpus_name : str

Name of the corpus

Returns:
data.types : list

List of data types

TIMIT parser

class polyglotdb.io.parsers.timit.TimitParser(annotation_tiers, hierarchy, stop_check=None, call_back=None)[source]

Parser for the TIMIT corpus.

Has annotation types for word labels and surface transcription labels.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages

parse_discourse(word_path, types_only=False)[source]

Parse a TIMIT file for later importing.

Parameters:
word_path : str

Path to TIMIT .wrd file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data from the file
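For orientation, a TIMIT .wrd file lists one word per line as a begin sample, an end sample, and the word label, with audio sampled at 16 kHz. The sketch below converts such lines to times in seconds. It is an independent illustration, not the code behind parse_discourse, and the read_timit_words name is hypothetical.

```python
def read_timit_words(lines, sample_rate=16000):
    """Convert TIMIT .wrd lines into (label, begin_seconds, end_seconds) tuples."""
    words = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            # Skip blank or malformed lines
            continue
        begin, end, label = int(parts[0]), int(parts[1]), parts[2]
        words.append((label, begin / sample_rate, end / sample_rate))
    return words


timit_words = read_timit_words(["3050 5723 she", "5723 10337 had", ""])
```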

Buckeye parser

class polyglotdb.io.parsers.buckeye.BuckeyeParser(annotation_tiers, hierarchy, stop_check=None, call_back=None)[source]

Parser for the Buckeye corpus.

Has annotation types for word labels, word transcription, word part of speech, and surface transcription labels.

Parameters:
annotation_tiers : list

Annotation types of the files to parse

hierarchy : Hierarchy

Details of how linguistic types relate to one another

stop_check : callable, optional

Function to check whether to halt parsing

call_back : callable, optional

Function to output progress messages

parse_discourse(word_path, types_only=False)[source]

Parse a Buckeye file for later importing.

Parameters:
word_path : str

Path to Buckeye .words file

types_only : bool

Flag for whether to only save type information, ignoring the token information

Returns:
DiscourseData

Parsed data

LaBB-CAT parser

class polyglotdb.io.parsers.labbcat.LabbCatParser(annotation_tiers, hierarchy, make_transcription=True, stop_check=None, call_back=None)[source]

Parser for TextGrids exported from LaBB-CAT.

Parameters:
annotation_tiers : list

List of the annotation tiers to store data from the TextGrid

hierarchy : Hierarchy

Basic hierarchy of the TextGrid

make_transcription : bool

Flag for whether to add a transcription property to words based on phones they contain

stop_check : callable

Function to check whether parsing should stop

call_back : callable

Function to report progress in parsing

load_textgrid(path)[source]

Load a TextGrid file, ignoring duplicated tier names, which LaBB-CAT sometimes exports erroneously.

Parameters:
path : str

Path to the TextGrid file

Returns:
TextGrid

TextGrid object
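Ignoring duplicated tier names could look roughly like the sketch below, which keeps the first tier seen under each name. Representing tiers as (name, intervals) pairs and keeping the first rather than the last occurrence are both assumptions of this example; the library's actual handling may differ.

```python
def drop_duplicate_tiers(tiers):
    """Keep only the first tier for each tier name, preserving order."""
    seen = set()
    unique = []
    for name, intervals in tiers:
        if name in seen:
            # A tier with this name was already kept; skip the duplicate
            continue
        seen.add(name)
        unique.append((name, intervals))
    return unique


tiers = [("word", [1, 2]), ("phone", [3]), ("word", [1, 2])]
deduplicated = drop_duplicate_tiers(tiers)
```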

Speaker parsers

Filename Speaker Parser

class polyglotdb.io.parsers.speaker.FilenameSpeakerParser(number_of_characters, left_orientation=True)[source]

Parses a speaker name from a path by taking a specified number of characters from either the left or the right of the base file name.

Parameters:
number_of_characters : int

Number of characters to include in the speaker designation; set to 0 to use the full file name

left_orientation : bool

Whether to pull characters from the left or right of the base file name, defaults to True

parse_path(path)[source]

Parses a file path and returns a speaker name

Parameters:
path : str

File path

Returns:
str

Substring of path that is the speaker name
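The behaviour described for FilenameSpeakerParser can be approximated with a few lines of standard-library code. This is a sketch under assumptions, not the class's source: in particular, it assumes the file extension is stripped before characters are counted, and speaker_from_filename is a hypothetical name.

```python
import os


def speaker_from_filename(path, number_of_characters, left_orientation=True):
    """Derive a speaker name from the base file name of path."""
    # Base file name without directory or extension (assumed for this sketch)
    name = os.path.splitext(os.path.basename(path))[0]
    if number_of_characters == 0:
        return name  # 0 means use the full file name
    if left_orientation:
        return name[:number_of_characters]
    return name[-number_of_characters:]
```

For a file named s01_interview.TextGrid, three characters from the left would give "s01" as the speaker.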

Directory Speaker Parser

class polyglotdb.io.parsers.speaker.DirectorySpeakerParser[source]

Parses a speaker name from a path by using the name of the directory that immediately contains the file.

parse_path(path)[source]

Parses a file path and returns a speaker name

Parameters:
path : str

File path

Returns:
str

Directory that is the name of the speaker
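Taking the containing directory's name as the speaker is a one-liner with os.path. The helper below is a hypothetical stand-in that illustrates the idea rather than reproducing the class.

```python
import os


def speaker_from_directory(path):
    """Return the name of the directory that immediately contains the file."""
    return os.path.basename(os.path.dirname(path))


speaker = speaker_from_directory("/corpus/speaker_a/recording1.TextGrid")
```

This layout suits corpora organised with one directory per speaker.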