Corpus API

Corpus classes

Base corpus

class polyglotdb.corpus.BaseContext(*args, **kwargs)[source]

Base CorpusContext class. Inherit from this and extend to create more functionality.

Parameters:
*args

If the first argument is not a CorpusConfig object, it is the name of the corpus

**kwargs

If a CorpusConfig object is not specified, all arguments and keyword arguments are passed to a CorpusConfig object

annotation_types

Get a list of all the annotation types in the corpus’s Hierarchy

Returns:
list

Annotation types

cache_hierarchy()[source]

Save the corpus Hierarchy to disk

cypher_safe_name

Escape the corpus name for use in Cypher queries

Returns:
str

Corpus name made safe for Cypher
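
For illustration, the escaping can be sketched as backtick-quoting, the Cypher convention for identifiers that contain spaces or punctuation (a simplified sketch under that assumption, not necessarily the library's exact implementation):

```python
def cypher_safe_name(corpus_name):
    # Backtick-quote the corpus name so it can be used as a node label
    # in Cypher; backticks inside the name are doubled to escape them.
    return '`{}`'.format(corpus_name.replace('`', '``'))
```

A name like my corpus then becomes `my corpus`, which can be interpolated into a MATCH clause.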

discourses

Gets a list of discourses in the corpus

Returns:
list

Discourse names in the corpus

encode_type_subset(annotation_type, annotation_labels, subset_label)[source]

Encode a type subset from labels of annotations

Parameters:
annotation_type : str

Annotation type of labels

annotation_labels : list

a list of labels of annotations to subset together

subset_label : str

the label for the subset

execute_cypher(statement, **parameters)[source]

Execute a Cypher query

Parameters:
statement : str

the Cypher statement

parameters : kwargs

Keyword arguments passed as parameters to the Cypher statement

Returns:
BoltStatementResult

Result of Cypher query

exists()[source]

Check whether the corpus has a Hierarchy schema in the Neo4j database

Returns:
bool

True if the corpus Hierarchy has been saved to the database

hierarchy_path

Get the path to cached hierarchy information

Returns:
str

Path to the cached hierarchy data on disk

load_hierarchy()[source]

Load Hierarchy object from the cached version

lowest_annotation

Returns the annotation type that is the lowest in the Hierarchy.

Returns:
str

Lowest annotation type in the Hierarchy

phone_name

Gets the phone label

Returns:
str

phone name

phones

Get a list of all phone labels in the corpus.

Returns:
list

All phone labels in the corpus

query_discourses()[source]

Start a query over discourses in the corpus

Returns:
DiscourseQuery

DiscourseQuery object

query_graph(annotation_node)[source]

Start a query over the tokens of a specified annotation type (i.e. corpus.word)

Parameters:
annotation_node : polyglotdb.query.attributes.AnnotationNode

The type of annotation to look for in the corpus

Returns:
SplitQuery

SplitQuery object

query_lexicon(annotation_node)[source]

Start a query over types of a specified annotation type (i.e. corpus.lexicon_word)

Parameters:
annotation_node : polyglotdb.query.attributes.AnnotationNode

The type of annotation to look for in the corpus’s lexicon

Returns:
LexiconQuery

LexiconQuery object

query_speakers()[source]

Start a query over speakers in the corpus

Returns:
SpeakerQuery

SpeakerQuery object

remove_discourse(name)[source]

Remove the nodes and relationships associated with a single discourse in the corpus.

Parameters:
name : str

Name of the discourse to remove

reset(call_back=None, stop_check=None)[source]

Reset the Neo4j and InfluxDB databases for a corpus

Parameters:
call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early

reset_graph(call_back=None, stop_check=None)[source]

Remove all nodes and relationships in the corpus.

reset_type_subset(annotation_type, subset_label)[source]

Reset and remove a type subset

Parameters:
annotation_type : str

Annotation type of the subset

subset_label : str

the label for the subset

speakers

Gets a list of speakers in the corpus

Returns:
list

Speaker names in the corpus

word_name

Gets the word label

Returns:
str

word name

words

Get a list of all word labels in the corpus.

Returns:
list

All word labels in the corpus

Phonological functionality

class polyglotdb.corpus.PhonologicalContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with phones

encode_class(phones, label)[source]

Encode a list of phones as a named phone class

Parameters:
phones : list

a list of phones

label : str

the label for the class

encode_features(feature_dict)[source]

Encode features as type properties on the phones in the corpus

Parameters:
feature_dict : dict

features to encode

enrich_features(feature_data, type_data=None)[source]

Sets the data type and feature data, initializes importers for the feature data, and adds the features to the hierarchy for phones

Parameters:
feature_data : dict

the enrichment data

type_data : dict

By default None

enrich_inventory_from_csv(path)[source]

Enrich the phone inventory from a CSV file

Parameters:
path : str

the path to the csv file

remove_pattern(pattern='[0-2]')[source]

Remove a stress or tone pattern from all phone labels

Parameters:
pattern : str

the regular expression for the pattern to remove; defaults to ‘[0-2]’

reset_class(label)[source]

Reset and remove a subset

Parameters:
label : str

Subset name to remove

reset_features(feature_names)[source]

Remove specified features from phone types

Parameters:
feature_names : list

list of names of features to remove

reset_inventory_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

reset_to_old_label()[source]

Reset phones back to their old labels which include stress and tone

Syllabic functionality

class polyglotdb.corpus.SyllabicContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with syllables

encode_stress_from_word_property(word_property_name)[source]

Use a property on words formatted like “0-1-0” to encode stress on syllables.

The number of syllables and the position of syllables within a word will also be encoded as a result of this function.

Parameters:
word_property_name : str

Property name of words that contains the stress pattern
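
The positional mapping can be sketched as follows (stress_by_syllable_position is a hypothetical helper for illustration, not part of the API):

```python
def stress_by_syllable_position(pattern):
    # Split a word-level stress pattern such as '0-1-0' into one value
    # per syllable, keyed by 1-based syllable position within the word.
    return {i + 1: value for i, value in enumerate(pattern.split('-'))}
```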

encode_stress_to_syllables(regex=None, clean_phone_label=True)[source]

Use numbers (0-9) in phone labels as stress property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.

Parameters:
regex : str

Regular expression character set for finding stress in the phone label

clean_phone_label : bool

Flag for removing regular expression from the phone labels
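
The extraction step can be sketched in isolation (split_stress is a hypothetical helper; the library itself operates on phone nodes in the graph):

```python
import re

def split_stress(phone_label, regex='[0-9]'):
    # Separate a stress digit from a phone label, e.g. 'AA1' -> ('AA', '1').
    # Labels without a matching digit are returned unchanged with None.
    match = re.search(regex, phone_label)
    if match is None:
        return phone_label, None
    return re.sub(regex, '', phone_label), match.group(0)
```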

encode_syllabic_segments(phones)[source]

Encode a list of phones as ‘syllabic’

Parameters:
phones : list

A list of vowels and syllabic consonants

encode_syllables(algorithm='maxonset', syllabic_label='syllabic', call_back=None, stop_check=None)[source]

Encodes syllables to a corpus

Parameters:
algorithm : str, defaults to ‘maxonset’

determines which algorithm will be used to encode syllables

syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early
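
The default ‘maxonset’ algorithm can be illustrated in isolation: consonants between two nuclei attach to the following syllable as long as they form a permissible onset cluster. This is a simplified re-implementation for illustration only; the library operates on the Neo4j graph, and legal_onsets here is a hypothetical stand-in for the word-initial clusters attested in the corpus:

```python
def syllabify_max_onset(phones, syllabics, legal_onsets):
    # Split a phone sequence into syllables using the maximum onset
    # principle: intervocalic consonants attach to the following nucleus
    # when they form a permissible onset cluster.
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    if not nuclei:
        return [list(phones)]
    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:nxt]
        # Take the longest suffix of the cluster that is a legal onset
        split = len(cluster)
        for size in range(len(cluster), -1, -1):
            if size == 0 or tuple(cluster[-size:]) in legal_onsets:
                split = len(cluster) - size
                break
        boundaries.append(prev + 1 + split)
    boundaries.append(len(phones))
    return [phones[b:e] for b, e in zip(boundaries, boundaries[1:])]
```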

encode_tone_to_syllables(regex=None, clean_phone_label=True)[source]

Use numbers (0-9) in phone labels as tone property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.

Parameters:
regex : str

Regular expression character set for finding tone in the phone label

clean_phone_label : bool

Flag for removing regular expression from the phone labels

enrich_syllables(syllable_data, type_data=None)[source]

Sets the data type and syllable data, initializes importers for syllable data, and adds features to the hierarchy for syllables

Parameters:
syllable_data : dict

the enrichment data

type_data : dict

By default None

find_codas(syllabic_label='syllabic')[source]

Gets syllable codas across the corpus

Parameters:
syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

Returns:
data : dict

A dictionary with coda values as keys and frequency values as values

find_onsets(syllabic_label='syllabic')[source]

Gets syllable onsets across the corpus

Parameters:
syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

Returns:
data : dict

A dictionary with onset values as keys and frequency values as values

has_syllabics

Check whether there is a phone subset named syllabic

Returns:
bool

True if syllabic is found as a phone subset

has_syllables

Check whether the corpus has syllables encoded

Returns:
bool

True if the syllables are in the Hierarchy

reset_syllables(call_back=None, stop_check=None)[source]

Resets syllables, removes syllable annotation, removes onset, coda, and nucleus labels

Parameters:
call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early

Lexical functionality

class polyglotdb.corpus.LexicalContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with words

enrich_lexicon(lexicon_data, type_data=None, case_sensitive=False)[source]

Add properties to the lexicon and to the corpus Hierarchy

Parameters:
lexicon_data : dict

the data in the lexicon

type_data : dict

Defaults to None

case_sensitive : bool

Defaults to False

enrich_lexicon_from_csv(path, case_sensitive=False)[source]

Enriches lexicon from a CSV file

Parameters:
path : str

the path to the csv file

case_sensitive : bool

Defaults to False

reset_lexicon_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

Pause functionality

class polyglotdb.corpus.PauseContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with non-speech elements

encode_pauses(pause_words, call_back=None, stop_check=None)[source]

Set words to be pauses, as opposed to speech.

Parameters:
pause_words : str, list, tuple, or set

Either a list of words that are pauses or a string containing a regular expression that specifies pause words

call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether process should be terminated early
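
The matching logic for pause_words can be sketched as follows (is_pause is a hypothetical helper for illustration, not the library's internals):

```python
import re

def is_pause(word_label, pause_words):
    # pause_words is either a regex string (e.g. '^<(sil|sp)>$') or a
    # collection of word labels that should count as pauses.
    if isinstance(pause_words, str):
        return re.match(pause_words, word_label) is not None
    return word_label in pause_words
```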

has_pauses

Check whether corpus has encoded pauses

Returns:
bool

True if pause is in the subsets available for words

reset_pauses()[source]

Revert all words marked as pauses to regular words marked as speech

Utterance functionality

class polyglotdb.corpus.UtteranceContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with utterances

encode_speech_rate(subset_label, call_back=None, stop_check=None)[source]

Encode speech rate for utterances, as the rate per second of the elements of a specified subset

Parameters:
subset_label : str

the name of the subset (e.g., syllabic phones) whose rate will be encoded

encode_utterance_position(call_back=None, stop_check=None)[source]

Encode the position_in_utterance property for each word

encode_utterances(min_pause_length=0.5, min_utterance_length=0, call_back=None, stop_check=None)[source]

Encode utterance annotations based on minimum pause length and minimum utterance length. See get_pauses for more information about the algorithm.

Once this function is run, utterances will be queryable like other annotation types.

Parameters:
min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance

enrich_utterances(utterance_data, type_data=None)[source]

Add properties to utterances and to the corpus Hierarchy

Parameters:
utterance_data : dict

the data to enrich with

type_data : dict

Defaults to None

get_utterance_ids(discourse, min_pause_length=0.5, min_utterance_length=0)[source]

Algorithm to find utterance boundaries in a discourse.

Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.

Parameters:
discourse : str

String identifier for a discourse

min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance

get_utterances(discourse, min_pause_length=0.5, min_utterance_length=0)[source]

Algorithm to find utterance boundaries in a discourse.

Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.

Parameters:
discourse : str

String identifier for a discourse

min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
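
The algorithm can be illustrated on a flat list of timed intervals (a simplified sketch; the real implementation queries the graph, and here too-short utterances are merged with the preceding one rather than the closest):

```python
def find_utterances(intervals, min_pause_length=0.5, min_utterance_length=0):
    # intervals: sorted (begin, end, is_pause) tuples covering a discourse.
    # Pauses shorter than min_pause_length do not break utterances.
    utterances = []
    current = None
    for begin, end, pause in intervals:
        if pause:
            if current is not None and end - begin >= min_pause_length:
                utterances.append(current)
                current = None
            # A short pause leaves the current utterance open
        else:
            if current is None:
                current = [begin, end]
            else:
                current[1] = end
    if current is not None:
        utterances.append(current)
    # Merge utterances shorter than the minimum with the preceding one
    merged = []
    for utt in utterances:
        if merged and utt[1] - utt[0] < min_utterance_length:
            merged[-1][1] = utt[1]
        else:
            merged.append(utt)
    return [tuple(u) for u in merged]
```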

reset_speech_rate()[source]

Reset the encoded speech_rate property

reset_utterance_position()[source]

Reset the encoded position_in_utterance property

reset_utterances()[source]

Remove all utterance annotations.

Audio functionality

class polyglotdb.corpus.AudioContext(*args, **kwargs)[source]

Class that contains methods for dealing with audio files for corpora

acoustic_client()[source]

Generate a client to connect to the InfluxDB for the corpus

Returns:
InfluxDBClient

Client through which to run queries and writes

analyze_formant_tracks(source='praat', stop_check=None, call_back=None, multiprocessing=True, vowel_label=None)[source]

Compute formant tracks and save them to the database

See polyglotdb.acoustics.formants.base.analyze_formant_tracks() for more details.

Parameters:
source : str

Program to compute formants

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

vowel_label : str, optional

Optional subset of phones to compute tracks over. If None, then tracks over utterances are computed.

analyze_intensity(source='praat', stop_check=None, call_back=None, multiprocessing=True)[source]

Compute intensity tracks and save them to the database

See polyglotdb.acoustics.intensity.analyze_intensity() for more details.

Parameters:
source : str

Program to compute intensity (only praat is supported)

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

analyze_pitch(source='praat', algorithm='base', absolute_min_pitch=50, absolute_max_pitch=500, adjusted_octaves=1, stop_check=None, call_back=None, multiprocessing=True)[source]

Analyze pitch tracks and save them to the database.

See polyglotdb.acoustics.pitch.base.analyze_pitch() for more details.

Parameters:
source : str

Program to use for analyzing pitch, either praat or reaper

algorithm : str

Algorithm to use, base, gendered, or speaker_adjusted

absolute_min_pitch : int

Absolute pitch floor

absolute_max_pitch : int

Absolute pitch ceiling

adjusted_octaves : int

How many octaves around the speaker’s mean pitch to set the speaker adjusted pitch floor and ceiling

stop_check : callable

Function to check whether processing should stop early

call_back : callable

Function to report progress

multiprocessing : bool

Flag whether to use multiprocessing or threading

analyze_script(phone_class=None, subset=None, annotation_type=None, script_path=None, duration_threshold=0.01, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]

Use a Praat script to analyze annotation types in the corpus. The Praat script must return properties per phone (i.e., point measures, not a track), and these properties will be saved to the Neo4j database.

See polyglotdb.acoustics.other.analyze_script() for more details.

Parameters:
phone_class : str

DEPRECATED, the name of an already encoded subset of phones on which the analysis will be run

subset : str, optional

the name of an already encoded subset of an annotation type, on which the analysis will be run

annotation_type : str

the type of annotation that the analysis will go over

script_path : str

Path to the Praat script

duration_threshold : float

Minimum duration that phones should be to be analyzed

arguments : list

Arguments to pass to the Praat script

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

Returns:
list

List of the names of newly added properties to the Neo4j database

analyze_track_script(acoustic_name, properties, script_path, duration_threshold=0.01, phone_class=None, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]

Use a Praat script to analyze phones in the corpus. The Praat script must return a track, and these tracks will be saved to the InfluxDB database.

See polyglotdb.acoustics.other.analyze_track_script() for more details.

Parameters:
acoustic_name : str

Name of the acoustic measure

properties : list

List of tuples of the form (property_name, Type)

script_path : str

Path to the Praat script

duration_threshold : float

Minimum duration that phones should be to be analyzed

phone_class : str

Name of the phone subset to analyze

arguments : list

Arguments to pass to the Praat script

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

analyze_utterance_pitch(utterance, source='praat', **kwargs)[source]

Analyze a single utterance’s pitch track.

See polyglotdb.acoustics.pitch.base.analyze_utterance_pitch() for more details.

Parameters:
utterance : str

Utterance ID from Neo4j

source : str

Program to use for analyzing pitch, either praat or reaper

kwargs

Additional settings to use in analyzing pitch

Returns:
Track

Pitch track

analyze_vot(classifier, stop_label='stops', stop_check=None, call_back=None, multiprocessing=False, overwrite_edited=False, vot_min=5, vot_max=100, window_min=-30, window_max=30)[source]

Compute VOTs for stops and save them to the database.

See polyglotdb.acoustics.vot.base.analyze_vot() for more details.

Parameters:
classifier : str

Path to an AutoVOT classifier model

stop_label : str

Label of subset to analyze

vot_min : int

Minimum VOT in ms

vot_max : int

Maximum VOT in ms

window_min : int

Window minimum in ms

window_max : int

Window maximum in ms

overwrite_edited : bool

If True, overwrite VOTs that have their “edited” property set to true

call_back : callable

Function to report progress, optional

stop_check : callable

Function to check whether to terminate early, optional

multiprocessing : bool

Flag to use multiprocessing, otherwise will use threading

discourse_audio_directory(discourse)[source]

Return the directory for the stored audio files for a discourse

discourse_has_acoustics(acoustic_name, discourse)[source]

Return whether a discourse has any specific acoustic values associated with it

Parameters:
acoustic_name : str

Name of the acoustic type

discourse : str

Name of the discourse

Returns:
bool

discourse_sound_file(discourse)[source]

Get details for the audio file paths for a specified discourse.

Parameters:
discourse : str

Name of the audio file in the corpus

Returns:
dict

Information for the audio file path

encode_acoustic_statistic(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]

Compute summary statistics for a given acoustic measure on a by-speaker or by-phone basis (or both) and save them as type properties.

Parameters:
acoustic_name : str

Name of the acoustic type

statistic : str

One of mean, median, stddev, sum, mode, count

by_speaker : bool, defaults to False

Flag for calculating summary statistic by speaker

by_phone : bool, defaults to True

Flag for calculating summary statistic by phone

execute_influxdb(query)[source]

Execute an InfluxDB query for the corpus

Parameters:
query : str

Query to run

Returns:
influxdb.resultset.ResultSet

Results of the query

genders()[source]

Gets all values of speaker property named gender in the Neo4j database

Returns:
list

List of gender values

generate_spectrogram(discourse, file_type='consonant', begin=None, end=None)[source]

Generate a spectrogram from an audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

begin : float

Timestamp in seconds

end : float

Timestamp in seconds

Returns:
numpy.array

Spectrogram information

float

Time step between each window

float

Frequency step between each frequency bin

get_acoustic_measure(acoustic_name, discourse, begin, end, channel=0, relative_time=False, **kwargs)[source]

Get an acoustic track for a given discourse and time range

Parameters:
acoustic_name : str

Name of acoustic track

discourse : str

Name of the discourse

begin : float

Beginning of time range

end : float

End of time range

channel : int, defaults to 0

Channel of the audio file

relative_time : bool, defaults to False

Flag for retrieving relative time instead of absolute time

kwargs : kwargs

Tags to filter on

Returns:
polyglotdb.acoustics.classes.Track

Track object

get_acoustic_statistic(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]

Computes summary statistics on a by speaker or by phone basis (or both) for a given acoustic measure.

Parameters:
acoustic_name : str

Name of the acoustic type

statistic : str

One of mean, median, stddev, sum, mode, count

by_speaker : bool, defaults to False

Flag for calculating summary statistic by speaker

by_phone : bool, defaults to True

Flag for calculating summary statistic by phone

Returns:
dict

Dictionary where keys are phone/speaker/phone-speaker pairs and values are the summary statistic of the acoustic measure
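
The grouping behind these statistics can be sketched over a flat list of point measurements (illustrative only; the library computes this via InfluxDB queries):

```python
from collections import defaultdict
from statistics import mean

def summarize(measurements, statistic=mean, by_phone=True, by_speaker=False):
    # Aggregate (speaker, phone, value) measurements into a dict keyed by
    # phone, speaker, or (speaker, phone), mirroring the grouping flags.
    groups = defaultdict(list)
    for speaker, phone, value in measurements:
        if by_phone and by_speaker:
            key = (speaker, phone)
        elif by_speaker:
            key = speaker
        else:
            key = phone
        groups[key].append(value)
    return {key: statistic(values) for key, values in groups.items()}
```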

get_utterance_acoustics(acoustic_name, utterance_id, discourse, speaker)[source]

Get an acoustic track for a given utterance

Parameters:
acoustic_name : str

Name of acoustic track

utterance_id : str

ID of the utterance from the Neo4j database

discourse : str

Name of the discourse

speaker : str

Name of the speaker

Returns:
polyglotdb.acoustics.classes.Track

Track object

has_all_sound_files()[source]

Check whether all discourses have a sound file

Returns:
bool

True if a sound file exists for each discourse name in corpus, False otherwise

has_sound_files

Check whether any discourses have a sound file

Returns:
bool

True if there are any sound files at all, False otherwise

load_audio(discourse, file_type)[source]

Loads a given audio file at the specified sampling rate type (consonant, vowel or low_freq). Consonant files have a sampling rate of 16 kHz, vowel files a sampling rate of 11 kHz, and low frequency files a sampling rate of 1.2 kHz.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

Returns:
numpy.array

Audio signal

int

Sampling rate of the file

load_waveform(discourse, file_type='consonant', begin=None, end=None)[source]

Loads a segment of a larger audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

begin : float, optional

Timestamp in seconds

end : float, optional

Timestamp in seconds

Returns:
numpy.array

Audio signal

int

Sampling rate of the file

reassess_utterances(acoustic_name)[source]

Update utterance IDs in InfluxDB for more efficient querying if utterances have been re-encoded after acoustic measures were encoded

Parameters:
acoustic_name : str

Name of the measure for which to update utterance IDs

relativize_acoustic_measure(acoustic_name, by_speaker=True, by_phone=False)[source]

Relativize acoustic tracks by z-scoring their points, using means and standard deviations computed by speaker, by phone, or both, and save the result as a separate measure (e.g., F0_relativized from F0).

Parameters:
acoustic_name : str

Name of the acoustic measure

by_speaker : bool, defaults to True

Flag for relativizing by speaker

by_phone : bool, defaults to False

Flag for relativizing by phone
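
The underlying z-scoring can be illustrated on plain numbers (relativize_track is a hypothetical helper; reference_values stands in for, e.g., all of a speaker's samples of the measure):

```python
from statistics import mean, stdev

def relativize_track(values, reference_values):
    # Z-score each sample against the reference distribution's
    # mean and (sample) standard deviation.
    m = mean(reference_values)
    s = stdev(reference_values)
    return [(v - m) / s for v in values]
```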

reset_acoustic_measure(acoustic_type)[source]

Reset a given acoustic measure

Parameters:
acoustic_type : str

Name of the acoustic measurement to reset

reset_acoustics()[source]

Reset all acoustic measures currently encoded

reset_formant_points()[source]

Reset formant point measures encoded in the corpus

reset_relativized_acoustic_measure(acoustic_name)[source]

Reset any relativized measures that have been encoded for a specified type of acoustics

Parameters:
acoustic_name : str

Name of the acoustic type

reset_vot()[source]

Reset all VOT measurements in the corpus

save_acoustic_track(acoustic_name, discourse, track, **kwargs)[source]

Save an acoustic track for a sound file

Parameters:
acoustic_name : str

Name of the acoustic type

discourse : str

Name of the discourse

track : Track

Track to save

kwargs: kwargs

Tags to save for acoustic measurements

save_acoustic_tracks(acoustic_name, tracks, speaker)[source]

Save multiple acoustic tracks for a collection of analyzed segments

Parameters:
acoustic_name : str

Name of the acoustic type

tracks : iterable

Iterable of Track objects to save

speaker : str

Name of the speaker of the tracks

update_utterance_pitch_track(utterance, new_track)[source]

Save a pitch track for the specified utterance.

See polyglotdb.acoustics.pitch.base.update_utterance_pitch_track() for more details.

Parameters:
utterance : str

Utterance ID from Neo4j

new_track : list or Track

Pitch track

Returns:
int

Time stamp of update

utterance_sound_file(utterance_id, file_type='consonant')[source]

Generate an audio file just for a single utterance in an audio file.

Parameters:
utterance_id : str

Utterance ID from Neo4j

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

Returns:
str

Path to the generated sound file

Summarization functionality

class polyglotdb.corpus.SummarizedContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with summary measures for linguistic items

average_speech_rate()[source]

Get the average speech rate for each speaker in a corpus

Returns:
result : list

The average speech rate by speaker

baseline_duration(annotation, speaker=None)[source]

Get the baseline duration of each word in the corpus. Baseline duration is determined by summing the average durations of a word’s constituent phones. If there is no underlying transcription available, the longest duration is considered the baseline.

Parameters:
speaker : str

a speaker name, if desired (defaults to None)

Returns:
word_totals : dict

a dictionary of words and baseline durations
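
The computation can be sketched over flat data (baseline_durations is a hypothetical helper; the library gathers the phone durations via graph queries):

```python
from collections import defaultdict
from statistics import mean

def baseline_durations(word_transcriptions, phone_durations):
    # word_transcriptions: {word_label: list of phone labels}
    # phone_durations: (phone_label, duration) observations from the corpus.
    # Baseline duration of a word = sum of its phones' average durations.
    by_phone = defaultdict(list)
    for phone, duration in phone_durations:
        by_phone[phone].append(duration)
    averages = {p: mean(ds) for p, ds in by_phone.items()}
    return {word: sum(averages[p] for p in phones)
            for word, phones in word_transcriptions.items()}
```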

encode_baseline(annotation_type, property_name, by_speaker=False)[source]

Encode a baseline measure of a property, that is, the expected value of a higher annotation given the average property value of the phones that make it up. For instance, the expected duration of a word or syllable given its phonological content.

Parameters:
annotation_type : str

Name of annotation type to compute for

property_name : str

Property of phones to compute based off of (i.e., duration)

by_speaker : bool

Flag for whether to use by-speaker means

encode_measure(property_name, statistic, annotation_type, by_speaker=False)[source]

Compute and save an aggregate measure for annotation types

Available statistic names:

  • mean/average/avg
  • sd/stdev
Parameters:
property_name : str

Name of the property

statistic : str

Name of the statistic to use for aggregation

annotation_type : str

Name of the annotation type

by_speaker : bool

Flag for whether to compute aggregation by speaker

encode_relativized(annotation_type, property_name, by_speaker=False)[source]

Compute and save to the database a relativized measure (i.e., the property value z-scored using a mean and standard deviation computed from the corpus). The computation of means and standard deviations can be by-speaker.

Parameters:
annotation_type : str

Name of the annotation type

property_name : str

Name of the property to relativize

by_speaker : bool

Flag to use by-speaker means and standard deviations

get_measure(data_name, statistic, annotation_type, by_speaker=False, speaker=None)[source]

Abstract function to get a statistic for the data_name of an annotation_type

Parameters:
data_name : str

the aspect to summarize (duration, pitch, formants, etc)

statistic : str

how to summarize (mean, stdev, median, etc)

annotation_type : str

the annotation to summarize

by_speaker : boolean

whether to summarize by speaker or not

speaker : str

the specific speaker to encode baseline duration for (only for baseline duration)

make_dict(data, speaker=False, label=None)[source]

Turn query results into a dictionary for enrichment

Parameters:
data : list

a list returned by a Cypher query

Returns:
finaldict : dict

a dictionary in the format for enrichment

Spoken functionality

class polyglotdb.corpus.SpokenContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with speaker and sound file metadata

enrich_discourses(discourse_data, type_data=None)[source]

Add properties about discourses to the corpus, allowing them to be queryable.

Parameters:
discourse_data : dict

the data about the discourse to add

type_data : dict

Specifies the type of the data to be added, defaults to None

enrich_discourses_from_csv(path)[source]

Enriches discourses from a csv file

Parameters:
path : str

the path to the csv file

enrich_speakers(speaker_data, type_data=None)[source]

Add properties about speakers to the corpus, allowing them to be queryable.

Parameters:
speaker_data : dict

the data about the speakers to add

type_data : dict

Specifies the type of the data to be added, defaults to None

enrich_speakers_from_csv(path)[source]

Enriches speakers from a csv file

Parameters:
path : str

the path to the csv file

get_channel_of_speaker(speaker, discourse)[source]

Get the channel that the speaker is in

Parameters:
speaker : str

Speaker to query

discourse : str

Discourse to query

Returns:
int

Channel of audio that speaker is in

get_discourses_of_speaker(speaker)[source]

Get a list of all discourses that a given speaker spoke in

Parameters:
speaker : str

Speaker to query over

Returns:
list

All discourses the speaker spoke in

get_speakers_in_discourse(discourse)[source]

Get a list of all speakers that spoke in a given discourse

Parameters:
discourse : str

Audio file to query over

Returns:
list

All speakers who spoke in the discourse

make_speaker_annotations_dict(data, speaker, property)[source]

Helper function to reformat a dictionary of annotations and values into {speaker: {property: data}} format

Parameters:
data : dict

annotations and values

property : str

the name of the property being encoded

speaker : str

the name of the speaker

reset_discourse_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

reset_speaker_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

Structured functionality

class polyglotdb.corpus.StructuredContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with metadata for the corpus

encode_count(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes the count of the lower type in the higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset

encode_hierarchy()[source]

Sync the current Hierarchy to the Neo4j database and to the disk

encode_position(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes position of lower type in higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset

encode_rate(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes the rate of the lower type in the higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset
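No database is needed to illustrate the arithmetic these methods encode: a rate is the count of lower annotations divided by the duration of the higher one, and a position is a 1-based index within the higher annotation. A sketch with invented phone intervals:

```python
# Invented phone intervals (start, end) inside one utterance.
phones = [(0.0, 0.1), (0.1, 0.25), (0.25, 0.4), (0.4, 0.6)]

utterance_duration = phones[-1][1] - phones[0][0]  # 0.6 seconds

# Rate of phones within the utterance (cf. encode_rate): count / duration.
speech_rate = len(phones) / utterance_duration

# Position of each phone within the utterance (cf. encode_position).
positions = {interval: i + 1 for i, interval in enumerate(phones)}

assert round(speech_rate, 3) == 6.667
assert positions[(0.25, 0.4)] == 3
```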

generate_hierarchy()[source]

Get hierarchy schema information from the Neo4j database

Returns:
Hierarchy

the structure of the corpus

query_metadata(annotation)[source]

Start a query over metadata

Parameters:
annotation : Node
Returns:
MetaDataQuery

MetaDataQuery object

refresh_hierarchy()[source]

Save the Neo4j database schema to the disk

reset_hierarchy()[source]

Delete the Hierarchy schema in the Neo4j database

reset_property(annotation_type, name)[source]

Removes property from hierarchy

Parameters:
annotation_type : str

what is being removed

name : str

the column name

Annotation functionality

class polyglotdb.corpus.AnnotatedContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with annotations on linguistic items (termed “subannotations” in PolyglotDB)

Omnibus class

class polyglotdb.corpus.CorpusContext(*args, **kwargs)[source]

Main corpus context, inherits from the more specialized contexts.

Parameters:
args : args

Either a CorpusConfig object or sequence of arguments to be passed to a CorpusConfig object

kwargs : kwargs

sequence of keyword arguments to be passed to a CorpusConfig object

Corpus structure class

class polyglotdb.structure.Hierarchy(data=None, corpus_name=None)[source]

Class containing information about how a corpus is structured.

Hierarchical data is stored in the form of a dictionary with keys for linguistic types, and values for the linguistic type that contains them. If no other type contains a given type, its value is None.

Subannotation data is stored in the form of a dictionary with keys for linguistic types, and values of sets of types of subannotations.

Parameters:
data : dict

Information about the hierarchy of linguistic types

corpus_name : str

Name of the corpus
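For example, a typical utterance > word > phone corpus would be described by the following containment dictionary (labels are illustrative):

```python
# Keys are linguistic types; values are the types that contain them.
# The top type is contained by nothing, so its value is None.
data = {"phone": "word", "word": "utterance", "utterance": None}

# The lowest type never appears as anyone's container (value);
# the highest type is the one whose container is None.
lowest = next(t for t in data if t not in data.values())
highest = next(t for t, parent in data.items() if parent is None)

assert lowest == "phone"
assert highest == "utterance"
```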

acoustics

Get all currently encoded acoustic measurements in the corpus

Returns:
list

All encoded acoustic measures

add_acoustic_properties(corpus_context, acoustic_type, properties)[source]

Add acoustic properties to an encoded acoustic measure. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

acoustic_type : str

Acoustic measure to add properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_annotation_type(annotation_type, above=None, below=None)[source]

Adds an annotation type to the Hierarchy object along with default type and token properties for the new annotation type

Parameters:
annotation_type : str

Annotation type to add

above : str

Annotation type that is contained by the new annotation type, leave out if new annotation type is at the bottom of the hierarchy

below : str

Annotation type that contains the new annotation type, leave out if new annotation type is at the top of the hierarchy
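The effect of the above/below parameters on the containment dictionary can be sketched in plain Python; this simplified version only rewires the dictionary, whereas the real method also adds default type and token properties:

```python
hierarchy = {"phone": "word", "word": "utterance", "utterance": None}

def add_between(data, new_type, above, below):
    """Sketch: insert new_type so that it contains `above` and
    is contained by `below` (mirroring the above/below parameters)."""
    data[above] = new_type
    data[new_type] = below

# Insert "syllable" between "phone" and "word".
add_between(hierarchy, "syllable", above="phone", below="word")

assert hierarchy["phone"] == "syllable"
assert hierarchy["syllable"] == "word"
```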

add_discourse_properties(corpus_context, properties)[source]

Adds discourse properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_speaker_properties(corpus_context, properties)[source]

Adds speaker properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_subannotation_properties(corpus_context, subannotation_type, properties)[source]

Adds properties for a subannotation type to the Hierarchy object and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Name of the subannotation type

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_subannotation_type(corpus_context, annotation_type, subannotation_type, properties=None)[source]

Adds subannotation type for a given annotation type to the Hierarchy object and syncs it to a Neo4j database. The list of optional properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add a subannotation to

subannotation_type : str

Name of the subannotation type

properties : iterable

Optional iterable of tuples of the form (property_name, Type)

add_token_properties(corpus_context, annotation_type, properties)[source]

Adds token properties for an annotation type and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add token properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_token_subsets(corpus_context, annotation_type, subsets)[source]

Adds token subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to add subsets for

subsets : iterable

List of subsets to add for the annotation tokens

add_type_properties(corpus_context, annotation_type, properties)[source]

Adds type properties for an annotation type and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add type properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_type_subsets(corpus_context, annotation_type, subsets)[source]

Adds type subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to add subsets for

subsets : iterable

List of subsets to add for the annotation type

annotation_types

Get a list of all the annotation types in the hierarchy

Returns:
list

All annotation types in the hierarchy

from_json(json)[source]

Set all properties from a dictionary deserialized from JSON

Parameters:
json : dict

Object information

get_depth(lower_type, higher_type)[source]

Get the distance between two annotation types in the hierarchy

Parameters:
lower_type : str

Name of the lower type

higher_type : str

Name of the higher type

Returns:
int

Distance between the two types

get_higher_types(annotation_type)[source]

Get all annotation types that are higher than the specified annotation type

Parameters:
annotation_type : str

Annotation type from which to get higher annotation types

Returns:
list

List of all annotation types that are higher than the specified annotation type

get_lower_types(annotation_type)[source]

Get all annotation types that are lower than the specified annotation type

Parameters:
annotation_type : str

Annotation type from which to get lower annotation types

Returns:
list

List of all annotation types that are lower than the specified annotation type
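Both these lookups and get_depth amount to walking the containment dictionary upward; a minimal sketch over an invented hierarchy:

```python
data = {"phone": "syllable", "syllable": "word",
        "word": "utterance", "utterance": None}

def higher_types(data, annotation_type):
    """Collect every type above annotation_type, nearest first."""
    result = []
    current = data[annotation_type]
    while current is not None:
        result.append(current)
        current = data[current]
    return result

assert higher_types(data, "phone") == ["syllable", "word", "utterance"]
# A get_depth-style distance can be read off the upward chain:
assert higher_types(data, "phone").index("word") + 1 == 2
```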

has_discourse_property(key)[source]

Check for whether discourses have a given property

Parameters:
key : str

Property to check for

Returns:
bool

True if discourses have the given property

has_speaker_property(key)[source]

Check for whether speakers have a given property

Parameters:
key : str

Property to check for

Returns:
bool

True if speakers have the given property

has_subannotation_property(subannotation_type, property_name)[source]

Check whether the Hierarchy has a property associated with a subannotation type

Parameters:
subannotation_type : str

Name of subannotation to check

property_name : str

Name of the property to check for

Returns:
bool

True if subannotation type has the given property name

has_subannotation_type(subannotation_type)[source]

Check whether the Hierarchy has a subannotation type

Parameters:
subannotation_type : str

Name of subannotation to check for

Returns:
bool

True if subannotation type is present

has_token_property(annotation_type, key)[source]

Check whether a given annotation type has a given token property.

Parameters:
annotation_type : str

Annotation type to check for the given token property

key : str

Property to check for

Returns:
bool

True if the annotation type has the given token property

has_token_subset(annotation_type, key)[source]

Check whether a given annotation type has a given token subset.

Parameters:
annotation_type : str

Annotation type to check for the given token subset

key : str

Subset to check for

Returns:
bool

True if the annotation type has the given token subset

has_type_property(annotation_type, key)[source]

Check whether a given annotation type has a given type property.

Parameters:
annotation_type : str

Annotation type to check for the given type property

key : str

Property to check for

Returns:
bool

True if the annotation type has the given type property

has_type_subset(annotation_type, key)[source]

Check whether a given annotation type has a given type subset.

Parameters:
annotation_type : str

Annotation type to check for the given type subset

key : str

Subset to check for

Returns:
bool

True if the annotation type has the given type subset

highest

Get the highest annotation type of the Hierarchy

Returns:
str

Highest annotation type

highest_to_lowest

Get a list of annotation types sorted from highest to lowest

Returns:
list

Annotation types from highest to lowest

items()[source]

Key/value pairs for the hierarchy.

Returns:
generator

Items of the hierarchy

keys()[source]

Keys (linguistic types) of the hierarchy.

Returns:
generator

Keys of the hierarchy

lowest

Get the lowest annotation type of the Hierarchy

Returns:
str

Lowest annotation type

lowest_to_highest

Get a list of annotation types sorted from lowest to highest

Returns:
list

Annotation types from lowest to highest

phone_name

Alias function for getting the lowest annotation type

Returns:
str

Name of the lowest annotation type

remove_acoustic_properties(corpus_context, acoustic_type, properties)[source]

Remove acoustic properties from an encoded acoustic measure.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

acoustic_type : str

Acoustic measure to remove properties for

properties : iterable

List of property names

remove_annotation_type(annotation_type)[source]

Removes an annotation type from the hierarchy

Parameters:
annotation_type : str

Annotation type to remove

remove_discourse_properties(corpus_context, properties)[source]

Removes discourse properties from the Hierarchy object and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

List of property names to remove

remove_speaker_properties(corpus_context, properties)[source]

Removes speaker properties from the Hierarchy object and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

List of property names to remove

remove_subannotation_properties(corpus_context, subannotation_type, properties)[source]

Removes properties for a subannotation type from the Hierarchy object and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Name of the subannotation type

properties : iterable

List of property names to remove

remove_subannotation_type(corpus_context, subannotation_type)[source]

Remove a subannotation type from the Hierarchy object and sync it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Subannotation type to remove

remove_token_properties(corpus_context, annotation_type, properties)[source]

Removes token properties for an annotation type and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove token properties for

properties : iterable

List of property names to remove

remove_token_subsets(corpus_context, annotation_type, subsets)[source]

Removes token subsets from the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to remove subsets for

subsets : iterable

List of subsets to remove for the annotation tokens

remove_type_properties(corpus_context, annotation_type, properties)[source]

Removes type properties for an annotation type and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove type properties for

properties : iterable

List of property names to remove

remove_type_subsets(corpus_context, annotation_type, subsets)[source]

Removes type subsets from the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to remove subsets for

subsets : iterable

List of subsets to remove for the annotation type

to_json()[source]

Convert the Hierarchy object to a dictionary for JSON serialization

Returns:
dict

All necessary information for the Hierarchy object
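Since the hierarchy is plain dictionary data, the round trip through to_json and from_json is ordinary JSON serialization; a sketch with invented key names:

```python
import json

# Invented serialized form; the real to_json() output will have
# its own key names and extra fields.
state = {"corpus_name": "demo",
         "data": {"phone": "word", "word": None}}

serialized = json.dumps(state)   # what to_json() enables
restored = json.loads(serialized)  # what from_json() would consume

assert restored == state
```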

update(other)[source]

Merge Hierarchies together. If other is a dictionary, then only the hierarchical data is updated.

Parameters:
other : Hierarchy or dict

Data to be merged in

values()[source]

Values (containing types) of the hierarchy.

Returns:
generator

Values of the hierarchy

word_name

Shortcut for returning the annotation type matching “word”

Returns:
str or None

Annotation type that begins with “word”

Corpus config class

class polyglotdb.config.CorpusConfig(corpus_name, data_dir=None, **kwargs)[source]

Class for storing configuration information about a corpus.

Parameters:
corpus_name : str

Identifier for the corpus

kwargs : keyword arguments

All keywords will be converted to attributes of the object

Attributes:
corpus_name : str

Identifier of the corpus

graph_user : str

Username for connecting to the graph database

graph_password : str

Password for connecting to the graph database

graph_host : str

Host for the graph database

graph_port : int

Port for connecting to the graph database

engine : str

Type of SQL database

base_dir : str

Base directory to store information and temporary files for the corpus; defaults to “.pgdb” under the current user’s home directory

acoustic_connection_kwargs

Return connection parameters to use for connecting to an InfluxDB database

Returns:
dict

Connection parameters

graph_connection_string

Construct a connection string to use for Neo4j

Returns:
str

Connection string
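The exact format is not documented here; a hedged sketch assuming the Bolt URI scheme used by Neo4j drivers (host and port values are illustrative, and the real property may use a different scheme or include credentials):

```python
graph_host = "localhost"  # illustrative values, cf. the attributes above
graph_port = 7687

# Assumed format; not necessarily what graph_connection_string returns.
connection_string = f"bolt://{graph_host}:{graph_port}"

assert connection_string == "bolt://localhost:7687"
```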

temporary_directory(name)[source]

Create a temporary directory for use in the corpus, and return the path name.

All temporary directories are deleted upon successful exit of the context manager.

Returns:
str

Full path to temporary directory