Corpus API

Corpus classes

Base corpus

class polyglotdb.corpus.BaseContext(*args, **kwargs)[source]

Base CorpusContext class. Inherit from this and extend to create more functionality.

Parameters:
*args

If the first argument is not a CorpusConfig object, it is the name of the corpus

**kwargs

If a CorpusConfig object is not specified, all arguments and keyword arguments are passed to a CorpusConfig object

annotation_types

Get a list of all the annotation types in the corpus’s Hierarchy

Returns:
list

Annotation types

cache_hierarchy()[source]

Save the corpus Hierarchy to disk

cypher_safe_name

Escape the corpus name for use in Cypher queries

Returns:
str

Corpus name made safe for Cypher
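
For illustration, the escaping can be sketched as backtick-quoting, the Cypher convention for identifiers that contain spaces or punctuation (a simplified sketch under that assumption, not necessarily the library's exact implementation):

```python
def cypher_safe_name(corpus_name):
    # Backtick-quote the corpus name so it can be used as a node label
    # in Cypher; backticks inside the name are doubled to escape them.
    return '`{}`'.format(corpus_name.replace('`', '``'))
```

A name like my corpus then becomes `my corpus`, which can be interpolated into a MATCH clause.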

discourses

Gets a list of discourses in the corpus

Returns:
list

Discourse names in the corpus

encode_type_subset(annotation_type, annotation_labels, subset_label)[source]

Encode a type subset from labels of annotations

Parameters:
annotation_type : str

Annotation type of labels

annotation_labels : list

a list of labels of annotations to subset together

subset_label : str

the label for the subset

execute_cypher(statement, **parameters)[source]

Execute a Cypher query

Parameters:
statement : str

the Cypher statement

parameters : kwargs

Keyword arguments passed as parameters to the Cypher statement

Returns:
BoltStatementResult

Result of Cypher query

exists()[source]

Check whether the corpus has a Hierarchy schema in the Neo4j database

Returns:
bool

True if the corpus Hierarchy has been saved to the database

hierarchy_path

Get the path to cached hierarchy information

Returns:
str

Path to the cached hierarchy data on disk

load_hierarchy()[source]

Load Hierarchy object from the cached version

lowest_annotation

Returns the annotation type that is the lowest in the Hierarchy.

Returns:
str

Lowest annotation type in the Hierarchy

phone_name

Gets the phone label

Returns:
str

phone name

phones

Get a list of all phone labels in the corpus.

Returns:
list

All phone labels in the corpus

query_discourses()[source]

Start a query over discourses in the corpus

Returns:
DiscourseQuery

DiscourseQuery object

query_graph(annotation_node)[source]

Start a query over the tokens of a specified annotation type (i.e. corpus.word)

Parameters:
annotation_node : polyglotdb.query.attributes.AnnotationNode

The type of annotation to look for in the corpus

Returns:
SplitQuery

SplitQuery object

query_lexicon(annotation_node)[source]

Start a query over types of a specified annotation type (i.e. corpus.lexicon_word)

Parameters:
annotation_node : polyglotdb.query.attributes.AnnotationNode

The type of annotation to look for in the corpus’s lexicon

Returns:
LexiconQuery

LexiconQuery object

query_speakers()[source]

Start a query over speakers in the corpus

Returns:
SpeakerQuery

SpeakerQuery object

remove_discourse(name)[source]

Remove the nodes and relationships associated with a single discourse in the corpus.

Parameters:
name : str

Name of the discourse to remove

reset(call_back=None, stop_check=None)[source]

Reset the Neo4j and InfluxDB databases for a corpus

Parameters:
call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early

reset_graph(call_back=None, stop_check=None)[source]

Remove all nodes and relationships in the corpus.

reset_type_subset(annotation_type, subset_label)[source]

Reset and remove a type subset

Parameters:
annotation_type : str

Annotation type of the subset

subset_label : str

the label for the subset

speakers

Gets a list of speakers in the corpus

Returns:
list

Speaker names in the corpus

word_name

Gets the word label

Returns:
str

word name

words

Get a list of all word labels in the corpus.

Returns:
list

All word labels in the corpus

Phonological functionality

class polyglotdb.corpus.PhonologicalContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with phones

encode_class(phones, label)[source]

Encode a list of phones as a named phone class

Parameters:
phones : list

a list of phones

label : str

the label for the class

encode_features(feature_dict)[source]

Encode features as type properties on the phones in the corpus

Parameters:
feature_dict : dict

features to encode

enrich_features(feature_data, type_data=None)[source]

Sets the data type and feature data, initializes importers for the feature data, and adds the features to the hierarchy for phones

Parameters:
feature_data : dict

the enrichment data

type_data : dict

By default None

enrich_inventory_from_csv(path)[source]

Enrich the phone inventory from a CSV file

Parameters:
path : str

the path to the csv file

remove_pattern(pattern='[0-2]')[source]

Remove a stress or tone pattern from all phone labels

Parameters:
pattern : str

the regular expression for the pattern to remove; defaults to ‘[0-2]’

reset_class(label)[source]

Reset and remove a subset

Parameters:
label : str

Subset name to remove

reset_features(feature_names)[source]

Remove specified features from phone types

Parameters:
feature_names : list

list of names of features to remove

reset_inventory_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

reset_to_old_label()[source]

Reset phones back to their old labels which include stress and tone

Syllabic functionality

class polyglotdb.corpus.SyllabicContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with syllables

encode_stress_from_word_property(word_property_name)[source]

Use a property on words formatted like “0-1-0” to encode stress on syllables.

The number of syllables and the position of syllables within a word will also be encoded as a result of this function.

Parameters:
word_property_name : str

Property name of words that contains the stress pattern
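
The positional mapping can be sketched as follows (stress_by_syllable_position is a hypothetical helper for illustration, not part of the API):

```python
def stress_by_syllable_position(pattern):
    # Split a word-level stress pattern such as '0-1-0' into one value
    # per syllable, keyed by 1-based syllable position within the word.
    return {i + 1: value for i, value in enumerate(pattern.split('-'))}
```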

encode_stress_to_syllables(regex=None, clean_phone_label=True)[source]

Use numbers (0-9) in phone labels as stress property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.

Parameters:
regex : str

Regular expression character set for finding stress in the phone label

clean_phone_label : bool

Flag for removing regular expression from the phone labels
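
The extraction step can be sketched in isolation (split_stress is a hypothetical helper; the library itself operates on phone nodes in the graph):

```python
import re

def split_stress(phone_label, regex='[0-9]'):
    # Separate a stress digit from a phone label, e.g. 'AA1' -> ('AA', '1').
    # Labels without a matching digit are returned unchanged with None.
    match = re.search(regex, phone_label)
    if match is None:
        return phone_label, None
    return re.sub(regex, '', phone_label), match.group(0)
```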

encode_syllabic_segments(phones)[source]

Encode a list of phones as ‘syllabic’

Parameters:
phones : list

A list of vowels and syllabic consonants

encode_syllables(algorithm='maxonset', syllabic_label='syllabic', call_back=None, stop_check=None)[source]

Encodes syllables to a corpus

Parameters:
algorithm : str, defaults to ‘maxonset’

determines which algorithm will be used to encode syllables

syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early
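
The default ‘maxonset’ algorithm can be illustrated in isolation: consonants between two nuclei attach to the following syllable as long as they form a permissible onset cluster. This is a simplified re-implementation for illustration only; the library operates on the Neo4j graph, and legal_onsets here is a hypothetical stand-in for the word-initial clusters attested in the corpus:

```python
def syllabify_max_onset(phones, syllabics, legal_onsets):
    # Split a phone sequence into syllables using the maximum onset
    # principle: intervocalic consonants attach to the following nucleus
    # when they form a permissible onset cluster.
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    if not nuclei:
        return [list(phones)]
    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:nxt]
        # Take the longest suffix of the cluster that is a legal onset
        split = len(cluster)
        for size in range(len(cluster), -1, -1):
            if size == 0 or tuple(cluster[-size:]) in legal_onsets:
                split = len(cluster) - size
                break
        boundaries.append(prev + 1 + split)
    boundaries.append(len(phones))
    return [phones[b:e] for b, e in zip(boundaries, boundaries[1:])]
```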

encode_tone_to_syllables(regex=None, clean_phone_label=True)[source]

Use numbers (0-9) in phone labels as tone property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.

Parameters:
regex : str

Regular expression character set for finding tone in the phone label

clean_phone_label : bool

Flag for removing regular expression from the phone labels

enrich_syllables(syllable_data, type_data=None)[source]

Sets the data type and syllable data, initializes importers for syllable data, and adds features to the hierarchy for syllables

Parameters:
syllable_data : dict

the enrichment data

type_data : dict

By default None

find_codas(syllabic_label='syllabic')[source]

Gets syllable codas across the corpus

Parameters:
syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

Returns:
data : dict

A dictionary with coda values as keys and frequency values as values

find_onsets(syllabic_label='syllabic')[source]

Gets syllable onsets across the corpus

Parameters:
syllabic_label : str

Subset to use for syllabic segments (i.e., nuclei)

Returns:
data : dict

A dictionary with onset values as keys and frequency values as values

has_syllabics

Check whether there is a phone subset named syllabic

Returns:
bool

True if syllabic is found as a phone subset

has_syllables

Check whether the corpus has syllables encoded

Returns:
bool

True if the syllables are in the Hierarchy

reset_syllables(call_back=None, stop_check=None)[source]

Resets syllables, removes syllable annotation, removes onset, coda, and nucleus labels

Parameters:
call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether the process should terminate early

Lexical functionality

class polyglotdb.corpus.LexicalContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with words

enrich_lexicon(lexicon_data, type_data=None, case_sensitive=False)[source]

Add properties to the lexicon and to the corpus Hierarchy

Parameters:
lexicon_data : dict

the data in the lexicon

type_data : dict

Defaults to None

case_sensitive : bool

Defaults to False

enrich_lexicon_from_csv(path, case_sensitive=False)[source]

Enriches lexicon from a CSV file

Parameters:
path : str

the path to the csv file

case_sensitive : bool

Defaults to False

reset_lexicon_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

Pause functionality

class polyglotdb.corpus.PauseContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with non-speech elements

encode_pauses(pause_words, call_back=None, stop_check=None)[source]

Set words to be pauses, as opposed to speech.

Parameters:
pause_words : str, list, tuple, or set

Either a list of words that are pauses or a string containing a regular expression that specifies pause words

call_back : callable

Function to monitor progress

stop_check : callable

Function to check whether process should be terminated early
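
The matching logic for pause_words can be sketched as follows (is_pause is a hypothetical helper for illustration, not the library's internals):

```python
import re

def is_pause(word_label, pause_words):
    # pause_words is either a regex string (e.g. '^<(sil|sp)>$') or a
    # collection of word labels that should count as pauses.
    if isinstance(pause_words, str):
        return re.match(pause_words, word_label) is not None
    return word_label in pause_words
```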

has_pauses

Check whether corpus has encoded pauses

Returns:
bool

True if pause is in the subsets available for words

reset_pauses()[source]

Revert all words marked as pauses to regular words marked as speech

Utterance functionality

class polyglotdb.corpus.UtteranceContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with utterances

encode_speech_rate(subset_label, call_back=None, stop_check=None)[source]

Encode speech rate for utterances, as the rate per second of the elements of a specified subset

Parameters:
subset_label : str

the name of the subset (e.g., syllabic phones) whose rate will be encoded

encode_utterance_position(call_back=None, stop_check=None)[source]

Encode the position_in_utterance property for each word

encode_utterances(min_pause_length=0.5, min_utterance_length=0, call_back=None, stop_check=None)[source]

Encode utterance annotations based on minimum pause length and minimum utterance length. See get_pauses for more information about the algorithm.

Once this function is run, utterances will be queryable like other annotation types.

Parameters:
min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance

enrich_utterances(utterance_data, type_data=None)[source]

Add properties to utterances and to the corpus Hierarchy

Parameters:
utterance_data : dict

the data to enrich with

type_data : dict

Defaults to None

get_utterance_ids(discourse, min_pause_length=0.5, min_utterance_length=0)[source]

Algorithm to find utterance boundaries in a discourse.

Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.

Parameters:
discourse : str

String identifier for a discourse

min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance

get_utterances(discourse, min_pause_length=0.5, min_utterance_length=0)[source]

Algorithm to find utterance boundaries in a discourse.

Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.

Parameters:
discourse : str

String identifier for a discourse

min_pause_length : float, defaults to 0.5

Time in seconds that is the minimum duration of a pause to count as an utterance boundary

min_utterance_length : float, defaults to 0.0

Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
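
The algorithm can be illustrated on a flat list of timed intervals (a simplified sketch; the real implementation queries the graph, and here too-short utterances are merged with the preceding one rather than the closest):

```python
def find_utterances(intervals, min_pause_length=0.5, min_utterance_length=0):
    # intervals: sorted (begin, end, is_pause) tuples covering a discourse.
    # Pauses shorter than min_pause_length do not break utterances.
    utterances = []
    current = None
    for begin, end, pause in intervals:
        if pause:
            if current is not None and end - begin >= min_pause_length:
                utterances.append(current)
                current = None
            # A short pause leaves the current utterance open
        else:
            if current is None:
                current = [begin, end]
            else:
                current[1] = end
    if current is not None:
        utterances.append(current)
    # Merge utterances shorter than the minimum with the preceding one
    merged = []
    for utt in utterances:
        if merged and utt[1] - utt[0] < min_utterance_length:
            merged[-1][1] = utt[1]
        else:
            merged.append(utt)
    return [tuple(u) for u in merged]
```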

reset_speech_rate()[source]

Reset the encoded speech_rate property

reset_utterance_position()[source]

Reset the encoded position_in_utterance property

reset_utterances()[source]

Remove all utterance annotations.

Audio functionality

class polyglotdb.corpus.AudioContext(*args, **kwargs)[source]

Class that contains methods for dealing with audio files for corpora

acoustic_client()[source]

Generate a client to connect to the InfluxDB for the corpus

Returns:
InfluxDBClient

Client through which to run queries and writes

analyze_formant_tracks(source='praat', stop_check=None, call_back=None, multiprocessing=True, vowel_label=None)[source]

Compute formant tracks and save them to the database

See polyglotdb.acoustics.formants.base.analyze_formant_tracks() for more details.

Parameters:
source : str

Program to compute formants

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

vowel_label : str, optional

Optional subset of phones to compute tracks over. If None, then tracks over utterances are computed.

analyze_intensity(source='praat', stop_check=None, call_back=None, multiprocessing=True)[source]

Compute intensity tracks and save them to the database

See polyglotdb.acoustics.intensity.analyze_intensity() for more details.

Parameters:
source : str

Program to compute intensity (only praat is supported)

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

analyze_pitch(source='praat', algorithm='base', absolute_min_pitch=50, absolute_max_pitch=500, adjusted_octaves=1, stop_check=None, call_back=None, multiprocessing=True)[source]

Analyze pitch tracks and save them to the database.

See polyglotdb.acoustics.pitch.base.analyze_pitch() for more details.

Parameters:
source : str

Program to use for analyzing pitch, either praat or reaper

algorithm : str

Algorithm to use, base, gendered, or speaker_adjusted

absolute_min_pitch : int

Absolute pitch floor

absolute_max_pitch : int

Absolute pitch ceiling

adjusted_octaves : int

How many octaves around the speaker’s mean pitch to set the speaker adjusted pitch floor and ceiling

stop_check : callable

Function to check whether processing should stop early

call_back : callable

Function to report progress

multiprocessing : bool

Flag whether to use multiprocessing or threading

analyze_script(phone_class=None, subset=None, annotation_type=None, script_path=None, duration_threshold=0.01, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]

Use a Praat script to analyze annotation types in the corpus. The Praat script must return properties per phone (i.e., point measures, not a track), and these properties will be saved to the Neo4j database.

See polyglotdb.acoustics.other.analyze_script() for more details.

Parameters:
phone_class : str

DEPRECATED, the name of an already encoded subset of phones on which the analysis will be run

subset : str, optional

the name of an already encoded subset of an annotation type, on which the analysis will be run

annotation_type : str

the type of annotation that the analysis will go over

script_path : str

Path to the Praat script

duration_threshold : float

Minimum duration that phones should be to be analyzed

arguments : list

Arguments to pass to the Praat script

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

Returns:
list

List of the names of newly added properties to the Neo4j database

analyze_track_script(acoustic_name, properties, script_path, duration_threshold=0.01, phone_class=None, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')[source]

Use a Praat script to analyze phones in the corpus. The Praat script must return a track, and these tracks will be saved to the InfluxDB database.

See polyglotdb.acoustics.other.analyze_track_script() for more details.

Parameters:
acoustic_name : str

Name of the acoustic measure

properties : list

List of tuples of the form (property_name, Type)

script_path : str

Path to the Praat script

duration_threshold : float

Minimum duration that phones should be to be analyzed

phone_class : str

Name of the phone subset to analyze

arguments : list

Arguments to pass to the Praat script

stop_check : callable

Function to check whether to terminate early

call_back : callable

Function to report progress

multiprocessing : bool

Flag to use multiprocessing, defaults to True, if False uses threading

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

analyze_utterance_pitch(utterance, source='praat', **kwargs)[source]

Analyze a single utterance’s pitch track.

See polyglotdb.acoustics.pitch.base.analyze_utterance_pitch() for more details.

Parameters:
utterance : str

Utterance ID from Neo4j

source : str

Program to use for analyzing pitch, either praat or reaper

kwargs

Additional settings to use in analyzing pitch

Returns:
Track

Pitch track

analyze_vot(classifier, stop_label='stops', stop_check=None, call_back=None, multiprocessing=False, overwrite_edited=False, vot_min=5, vot_max=100, window_min=-30, window_max=30)[source]

Compute VOTs for stops and save them to the database.

See polyglotdb.acoustics.vot.base.analyze_vot() for more details.

Parameters:
classifier : str

Path to an AutoVOT classifier model

stop_label : str

Label of subset to analyze

vot_min : int

Minimum VOT in ms

vot_max : int

Maximum VOT in ms

window_min : int

Window minimum in ms

window_max : int

Window maximum in ms

overwrite_edited : bool

If True, overwrite VOTs that have their “edited” property set to true

call_back : callable

Function to report progress, optional

stop_check : callable

Function to check whether to terminate early, optional

multiprocessing : bool

Flag to use multiprocessing, otherwise will use threading

discourse_audio_directory(discourse)[source]

Return the directory for the stored audio files for a discourse

discourse_has_acoustics(acoustic_name, discourse)[source]

Return whether a discourse has any specific acoustic values associated with it

Parameters:
acoustic_name : str

Name of the acoustic type

discourse : str

Name of the discourse

Returns:
bool

discourse_sound_file(discourse)[source]

Get details for the audio file paths for a specified discourse.

Parameters:
discourse : str

Name of the audio file in the corpus

Returns:
dict

Information for the audio file path

encode_acoustic_statistic(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]

Compute summary statistics for a given acoustic measure on a by-speaker or by-phone basis (or both) and save them as type properties.

Parameters:
acoustic_name : str

Name of the acoustic type

statistic : str

One of mean, median, stddev, sum, mode, count

by_speaker : bool, defaults to False

Flag for calculating summary statistic by speaker

by_phone : bool, defaults to True

Flag for calculating summary statistic by phone

execute_influxdb(query)[source]

Execute an InfluxDB query for the corpus

Parameters:
query : str

Query to run

Returns:
influxdb.resultset.ResultSet

Results of the query

genders()[source]

Gets all values of speaker property named gender in the Neo4j database

Returns:
list

List of gender values

generate_spectrogram(discourse, file_type='consonant', begin=None, end=None)[source]

Generate a spectrogram from an audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

begin : float

Timestamp in seconds

end : float

Timestamp in seconds

Returns:
numpy.array

Spectrogram information

float

Time step between each window

float

Frequency step between each frequency bin

get_acoustic_measure(acoustic_name, discourse, begin, end, channel=0, relative_time=False, **kwargs)[source]

Get an acoustic track for a given discourse and time range

Parameters:
acoustic_name : str

Name of acoustic track

discourse : str

Name of the discourse

begin : float

Beginning of time range

end : float

End of time range

channel : int, defaults to 0

Channel of the audio file

relative_time : bool, defaults to False

Flag for retrieving relative time instead of absolute time

kwargs : kwargs

Tags to filter on

Returns:
polyglotdb.acoustics.classes.Track

Track object

get_acoustic_statistic(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]

Computes summary statistics on a by speaker or by phone basis (or both) for a given acoustic measure.

Parameters:
acoustic_name : str

Name of the acoustic type

statistic : str

One of mean, median, stddev, sum, mode, count

by_speaker : bool, defaults to False

Flag for calculating summary statistic by speaker

by_phone : bool, defaults to True

Flag for calculating summary statistic by phone

Returns:
dict

Dictionary where keys are phone/speaker/phone-speaker pairs and values are the summary statistic of the acoustic measure
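
The grouping behind these statistics can be sketched over a flat list of point measurements (illustrative only; the library computes this via InfluxDB queries):

```python
from collections import defaultdict
from statistics import mean

def summarize(measurements, statistic=mean, by_phone=True, by_speaker=False):
    # Aggregate (speaker, phone, value) measurements into a dict keyed by
    # phone, speaker, or (speaker, phone), mirroring the grouping flags.
    groups = defaultdict(list)
    for speaker, phone, value in measurements:
        if by_phone and by_speaker:
            key = (speaker, phone)
        elif by_speaker:
            key = speaker
        else:
            key = phone
        groups[key].append(value)
    return {key: statistic(values) for key, values in groups.items()}
```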

get_utterance_acoustics(acoustic_name, utterance_id, discourse, speaker)[source]

Get an acoustic track for a given utterance

Parameters:
acoustic_name : str

Name of acoustic track

utterance_id : str

ID of the utterance from the Neo4j database

discourse : str

Name of the discourse

speaker : str

Name of the speaker

Returns:
polyglotdb.acoustics.classes.Track

Track object

has_all_sound_files()[source]

Check whether all discourses have a sound file

Returns:
bool

True if a sound file exists for each discourse name in corpus, False otherwise

has_sound_files

Check whether any discourses have a sound file

Returns:
bool

True if there are any sound files at all, False otherwise

load_audio(discourse, file_type)[source]

Loads a given audio file at the specified sampling rate type (consonant, vowel or low_freq). Consonant files have a sampling rate of 16 kHz, vowel files a sampling rate of 11 kHz, and low frequency files a sampling rate of 1.2 kHz.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

Returns:
numpy.array

Audio signal

int

Sampling rate of the file

load_waveform(discourse, file_type='consonant', begin=None, end=None)[source]

Loads a segment of a larger audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.

Parameters:
discourse : str

Name of the audio file to load

file_type : str

One of consonant, vowel or low_freq

begin : float, optional

Timestamp in seconds

end : float, optional

Timestamp in seconds

Returns:
numpy.array

Audio signal

int

Sampling rate of the file

reassess_utterances(acoustic_name)[source]

Update utterance IDs in InfluxDB for more efficient querying if utterances have been re-encoded after acoustic measures were encoded

Parameters:
acoustic_name : str

Name of the measure for which to update utterance IDs

relativize_acoustic_measure(acoustic_name, by_speaker=True, by_phone=False)[source]

Relativize acoustic tracks by z-scoring their points, using means and standard deviations computed by speaker, by phone, or both, and save the result as a separate measure (e.g., F0_relativized from F0).

Parameters:
acoustic_name : str

Name of the acoustic measure

by_speaker : bool, defaults to True

Flag for relativizing by speaker

by_phone : bool, defaults to False

Flag for relativizing by phone
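
The underlying z-scoring can be illustrated on plain numbers (relativize_track is a hypothetical helper; reference_values stands in for, e.g., all of a speaker's samples of the measure):

```python
from statistics import mean, stdev

def relativize_track(values, reference_values):
    # Z-score each sample against the reference distribution's
    # mean and (sample) standard deviation.
    m = mean(reference_values)
    s = stdev(reference_values)
    return [(v - m) / s for v in values]
```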

reset_acoustic_measure(acoustic_type)[source]

Reset a given acoustic measure

Parameters:
acoustic_type : str

Name of the acoustic measurement to reset

reset_acoustics()[source]

Reset all acoustic measures currently encoded

reset_formant_points()[source]

Reset formant point measures encoded in the corpus

reset_relativized_acoustic_measure(acoustic_name)[source]

Reset any relativized measures that have been encoded for a specified type of acoustics

Parameters:
acoustic_name : str

Name of the acoustic type

reset_vot()[source]

Reset all VOT measurements in the corpus

save_acoustic_track(acoustic_name, discourse, track, **kwargs)[source]

Save an acoustic track for a sound file

Parameters:
acoustic_name : str

Name of the acoustic type

discourse : str

Name of the discourse

track : Track

Track to save

kwargs: kwargs

Tags to save for acoustic measurements

save_acoustic_tracks(acoustic_name, tracks, speaker)[source]

Save multiple acoustic tracks for a collection of analyzed segments

Parameters:
acoustic_name : str

Name of the acoustic type

tracks : iterable

Iterable of Track objects to save

speaker : str

Name of the speaker of the tracks

update_utterance_pitch_track(utterance, new_track)[source]

Save a pitch track for the specified utterance.

See polyglotdb.acoustics.pitch.base.update_utterance_pitch_track() for more details.

Parameters:
utterance : str

Utterance ID from Neo4j

new_track : list or Track

Pitch track

Returns:
int

Time stamp of update

utterance_sound_file(utterance_id, file_type='consonant')[source]

Generate an audio file just for a single utterance in an audio file.

Parameters:
utterance_id : str

Utterance ID from Neo4j

file_type : str

Sampling rate type to use, one of consonant, vowel, or low_freq

Returns:
str

Path to the generated sound file

Summarization functionality

class polyglotdb.corpus.SummarizedContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with summary measures for linguistic items

average_speech_rate()[source]

Get the average speech rate for each speaker in a corpus

Returns:
result : list

The average speech rate by speaker

baseline_duration(annotation, speaker=None)[source]

Get the baseline duration of each word in the corpus. Baseline duration is determined by summing the average durations of a word’s constituent phones. If there is no underlying transcription available, the longest duration is considered the baseline.

Parameters:
speaker : str

a speaker name, if desired (defaults to None)

Returns:
word_totals : dict

a dictionary of words and baseline durations
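
The computation can be sketched over flat data (baseline_durations is a hypothetical helper; the library gathers the phone durations via graph queries):

```python
from collections import defaultdict
from statistics import mean

def baseline_durations(word_transcriptions, phone_durations):
    # word_transcriptions: {word_label: list of phone labels}
    # phone_durations: (phone_label, duration) observations from the corpus.
    # Baseline duration of a word = sum of its phones' average durations.
    by_phone = defaultdict(list)
    for phone, duration in phone_durations:
        by_phone[phone].append(duration)
    averages = {p: mean(ds) for p, ds in by_phone.items()}
    return {word: sum(averages[p] for p in phones)
            for word, phones in word_transcriptions.items()}
```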

encode_baseline(annotation_type, property_name, by_speaker=False)[source]

Encode a baseline measure of a property, that is, the expected value of a higher annotation given the average property value of the phones that make it up. For instance, the expected duration of a word or syllable given its phonological content.

Parameters:
annotation_type : str

Name of annotation type to compute for

property_name : str

Property of phones to compute based off of (i.e., duration)

by_speaker : bool

Flag for whether to use by-speaker means

encode_measure(property_name, statistic, annotation_type, by_speaker=False)[source]

Compute and save an aggregate measure for annotation types

Available statistic names:

  • mean/average/avg
  • sd/stdev
Parameters:
property_name : str

Name of the property

statistic : str

Name of the statistic to use for aggregation

annotation_type : str

Name of the annotation type

by_speaker : bool

Flag for whether to compute aggregation by speaker

encode_relativized(annotation_type, property_name, by_speaker=False)[source]

Compute and save to the database a relativized measure (i.e., the property value z-scored using a mean and standard deviation computed from the corpus). The computation of means and standard deviations can be by-speaker.

Parameters:
annotation_type : str

Name of the annotation type

property_name : str

Name of the property to relativize

by_speaker : bool

Flag to use by-speaker means and standard deviations

get_measure(data_name, statistic, annotation_type, by_speaker=False, speaker=None)[source]

Abstract function to get a statistic for the data_name of an annotation_type

Parameters:
data_name : str

the aspect to summarize (duration, pitch, formants, etc)

statistic : str

how to summarize (mean, stdev, median, etc)

annotation_type : str

the annotation to summarize

by_speaker : boolean

whether to summarize by speaker or not

speaker : str

the specific speaker to encode baseline duration for (only for baseline duration)

make_dict(data, speaker=False, label=None)[source]

Turn query results into a dictionary for enrichment

Parameters:
data : list

a list returned by a Cypher query

Returns:
finaldict : dict

a dictionary in the format for enrichment

Spoken functionality

class polyglotdb.corpus.SpokenContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with speaker and sound file metadata

enrich_discourses(discourse_data, type_data=None)[source]

Add properties about discourses to the corpus, allowing them to be queryable.

Parameters:
discourse_data : dict

the data about the discourse to add

type_data : dict

Specifies the type of the data to be added, defaults to None

enrich_discourses_from_csv(path)[source]

Enriches discourses from a csv file

Parameters:
path : str

the path to the csv file

enrich_speakers(speaker_data, type_data=None)[source]

Add properties about speakers to the corpus, allowing them to be queryable.

Parameters:
speaker_data : dict

the data about the speakers to add

type_data : dict

Specifies the type of the data to be added, defaults to None

enrich_speakers_from_csv(path)[source]

Enriches speakers from a csv file

Parameters:
path : str

the path to the csv file

get_channel_of_speaker(speaker, discourse)[source]

Get the channel that the speaker is in

Parameters:
speaker : str

Speaker to query

discourse : str

Discourse to query

Returns:
int

Channel of audio that speaker is in

get_discourses_of_speaker(speaker)[source]

Get a list of all discourses that a given speaker spoke in

Parameters:
speaker : str

Speaker to query over

Returns:
list

All discourses the speaker spoke in

get_speakers_in_discourse(discourse)[source]

Get a list of all speakers that spoke in a given discourse

Parameters:
discourse : str

Audio file to query over

Returns:
list

All speakers who spoke in the discourse

make_speaker_annotations_dict(data, speaker, property)[source]

Helper function to reformat a dictionary of annotations and values into {speaker: {property: data}} format

Parameters:
data : dict

annotations and values

property : str

the name of the property being encoded

speaker : str

the name of the speaker

reset_discourse_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

reset_speaker_csv(path)[source]

Remove properties that were encoded via a CSV file

Parameters:
path : str

CSV file to get property names from

Structured functionality

class polyglotdb.corpus.StructuredContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with metadata for the corpus

encode_count(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes the count of the lower type in the higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset

encode_hierarchy()[source]

Sync the current Hierarchy to the Neo4j database and to the disk

encode_position(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes position of lower type in higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset

encode_rate(higher_annotation_type, lower_annotation_type, name, subset=None)[source]

Encodes the rate of the lower type in the higher type

Parameters:
higher_annotation_type : str

what the higher annotation is (utterance, word)

lower_annotation_type : str

what the lower annotation is (word, phone, syllable)

name : str

the column name

subset : str

the annotation subset
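No database is needed to illustrate the arithmetic these methods encode: a rate is the count of lower annotations divided by the duration of the higher one, and a position is a 1-based index within the higher annotation. A sketch with invented phone intervals:

```python
# Invented phone intervals (start, end) inside one utterance.
phones = [(0.0, 0.1), (0.1, 0.25), (0.25, 0.4), (0.4, 0.6)]

utterance_duration = phones[-1][1] - phones[0][0]  # 0.6 seconds

# Rate of phones within the utterance (cf. encode_rate): count / duration.
speech_rate = len(phones) / utterance_duration

# Position of each phone within the utterance (cf. encode_position).
positions = {interval: i + 1 for i, interval in enumerate(phones)}

assert round(speech_rate, 3) == 6.667
assert positions[(0.25, 0.4)] == 3
```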

generate_hierarchy()[source]

Get hierarchy schema information from the Neo4j database

Returns:
Hierarchy

the structure of the corpus

query_metadata(annotation)[source]

Start a query over metadata

Parameters:
annotation : Node
Returns:
MetaDataQuery

MetaDataQuery object

refresh_hierarchy()[source]

Save the Neo4j database schema to the disk

reset_hierarchy()[source]

Delete the Hierarchy schema in the Neo4j database

reset_property(annotation_type, name)[source]

Removes property from hierarchy

Parameters:
annotation_type : str

what is being removed

name : str

the column name

Annotation functionality

class polyglotdb.corpus.AnnotatedContext(*args, **kwargs)[source]

Class that contains methods for dealing specifically with annotations on linguistic items (termed “subannotations” in PolyglotDB)

Omnibus class

class polyglotdb.corpus.CorpusContext(*args, **kwargs)[source]

Main corpus context, inherits from the more specialized contexts.

Parameters:
args : args

Either a CorpusConfig object or sequence of arguments to be passed to a CorpusConfig object

kwargs : kwargs

sequence of keyword arguments to be passed to a CorpusConfig object

Corpus structure class

class polyglotdb.structure.Hierarchy(data=None, corpus_name=None)[source]

Class containing information about how a corpus is structured.

Hierarchical data is stored in the form of a dictionary with keys for linguistic types, and values for the linguistic type that contains them. If no other type contains a given type, its value is None.

Subannotation data is stored in the form of a dictionary with keys for linguistic types, and values of sets of types of subannotations.

Parameters:
data : dict

Information about the hierarchy of linguistic types

corpus_name : str

Name of the corpus
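For example, a typical utterance > word > phone corpus would be described by the following containment dictionary (labels are illustrative):

```python
# Keys are linguistic types; values are the types that contain them.
# The top type is contained by nothing, so its value is None.
data = {"phone": "word", "word": "utterance", "utterance": None}

# The lowest type never appears as anyone's container (value);
# the highest type is the one whose container is None.
lowest = next(t for t in data if t not in data.values())
highest = next(t for t, parent in data.items() if parent is None)

assert lowest == "phone"
assert highest == "utterance"
```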

acoustics

Get all currently encoded acoustic measurements in the corpus

Returns:
list

All encoded acoustic measures

add_acoustic_properties(corpus_context, acoustic_type, properties)[source]

Add acoustic properties to an encoded acoustic measure. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

acoustic_type : str

Acoustic measure to add properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_annotation_type(annotation_type, above=None, below=None)[source]

Adds an annotation type to the Hierarchy object along with default type and token properties for the new annotation type

Parameters:
annotation_type : str

Annotation type to add

above : str

Annotation type that is contained by the new annotation type, leave out if new annotation type is at the bottom of the hierarchy

below : str

Annotation type that contains the new annotation type, leave out if new annotation type is at the top of the hierarchy
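The effect of the above/below parameters on the containment dictionary can be sketched in plain Python; this simplified version only rewires the dictionary, whereas the real method also adds default type and token properties:

```python
hierarchy = {"phone": "word", "word": "utterance", "utterance": None}

def add_between(data, new_type, above, below):
    """Sketch: insert new_type so that it contains `above` and
    is contained by `below` (mirroring the above/below parameters)."""
    data[above] = new_type
    data[new_type] = below

# Insert "syllable" between "phone" and "word".
add_between(hierarchy, "syllable", above="phone", below="word")

assert hierarchy["phone"] == "syllable"
assert hierarchy["syllable"] == "word"
```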

add_discourse_properties(corpus_context, properties)[source]

Adds discourse properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_speaker_properties(corpus_context, properties)[source]

Adds speaker properties to the Hierarchy object and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_subannotation_properties(corpus_context, subannotation_type, properties)[source]

Adds properties for a subannotation type to the Hierarchy object and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Name of the subannotation type

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_subannotation_type(corpus_context, annotation_type, subannotation_type, properties=None)[source]

Adds subannotation type for a given annotation type to the Hierarchy object and syncs it to a Neo4j database. The list of optional properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add a subannotation to

subannotation_type : str

Name of the subannotation type

properties : iterable

Optional iterable of tuples of the form (property_name, Type)

add_token_properties(corpus_context, annotation_type, properties)[source]

Adds token properties for an annotation type and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add token properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_token_subsets(corpus_context, annotation_type, subsets)[source]

Adds token subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to add subsets for

subsets : iterable

List of subsets to add for the annotation tokens

add_type_properties(corpus_context, annotation_type, properties)[source]

Adds type properties for an annotation type and syncs it to a Neo4j database. The list of properties are tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, like bool, str, list, or float.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to add type properties for

properties : iterable

Iterable of tuples of the form (property_name, Type)

add_type_subsets(corpus_context, annotation_type, subsets)[source]

Adds type subsets to the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to add subsets for

subsets : iterable

List of subsets to add for the annotation type

annotation_types

Get a list of all the annotation types in the hierarchy

Returns:
list

All annotation types in the hierarchy

from_json(json)[source]

Set all properties from a dictionary deserialized from JSON

Parameters:
json : dict

Object information

get_depth(lower_type, higher_type)[source]

Get the distance between two annotation types in the hierarchy

Parameters:
lower_type : str

Name of the lower type

higher_type : str

Name of the higher type

Returns:
int

Distance between the two types

get_higher_types(annotation_type)[source]

Get all annotation types that are higher than the specified annotation type

Parameters:
annotation_type : str

Annotation type from which to get higher annotation types

Returns:
list

List of all annotation types that are higher than the specified annotation type

get_lower_types(annotation_type)[source]

Get all annotation types that are lower than the specified annotation type

Parameters:
annotation_type : str

Annotation type from which to get lower annotation types

Returns:
list

List of all annotation types that are lower than the specified annotation type
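Both these lookups and get_depth amount to walking the containment dictionary upward; a minimal sketch over an invented hierarchy:

```python
data = {"phone": "syllable", "syllable": "word",
        "word": "utterance", "utterance": None}

def higher_types(data, annotation_type):
    """Collect every type above annotation_type, nearest first."""
    result = []
    current = data[annotation_type]
    while current is not None:
        result.append(current)
        current = data[current]
    return result

assert higher_types(data, "phone") == ["syllable", "word", "utterance"]
# A get_depth-style distance can be read off the upward chain:
assert higher_types(data, "phone").index("word") + 1 == 2
```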

has_discourse_property(key)[source]

Check for whether discourses have a given property

Parameters:
key : str

Property to check for

Returns:
bool

True if discourses have the given property

has_speaker_property(key)[source]

Check for whether speakers have a given property

Parameters:
key : str

Property to check for

Returns:
bool

True if speakers have the given property

has_subannotation_property(subannotation_type, property_name)[source]

Check whether the Hierarchy has a property associated with a subannotation type

Parameters:
subannotation_type : str

Name of subannotation to check

property_name : str

Name of the property to check for

Returns:
bool

True if subannotation type has the given property name

has_subannotation_type(subannotation_type)[source]

Check whether the Hierarchy has a subannotation type

Parameters:
subannotation_type : str

Name of subannotation to check for

Returns:
bool

True if subannotation type is present

has_token_property(annotation_type, key)[source]

Check whether a given annotation type has a given token property.

Parameters:
annotation_type : str

Annotation type to check for the given token property

key : str

Property to check for

Returns:
bool

True if the annotation type has the given token property

has_token_subset(annotation_type, key)[source]

Check whether a given annotation type has a given token subset.

Parameters:
annotation_type : str

Annotation type to check for the given token subset

key : str

Subset to check for

Returns:
bool

True if the annotation type has the given token subset

has_type_property(annotation_type, key)[source]

Check whether a given annotation type has a given type property.

Parameters:
annotation_type : str

Annotation type to check for the given type property

key : str

Property to check for

Returns:
bool

True if the annotation type has the given type property

has_type_subset(annotation_type, key)[source]

Check whether a given annotation type has a given type subset.

Parameters:
annotation_type : str

Annotation type to check for the given type subset

key : str

Subset to check for

Returns:
bool

True if the annotation type has the given type subset

highest

Get the highest annotation type of the Hierarchy

Returns:
str

Highest annotation type

highest_to_lowest

Get a list of annotation types sorted from highest to lowest

Returns:
list

Annotation types from highest to lowest

items()[source]

Key/value pairs for the hierarchy.

Returns:
generator

Items of the hierarchy

keys()[source]

Keys (linguistic types) of the hierarchy.

Returns:
generator

Keys of the hierarchy

lowest

Get the lowest annotation type of the Hierarchy

Returns:
str

Lowest annotation type

lowest_to_highest

Get a list of annotation types sorted from lowest to highest

Returns:
list

Annotation types from lowest to highest

phone_name

Alias function for getting the lowest annotation type

Returns:
str

Name of the lowest annotation type

remove_acoustic_properties(corpus_context, acoustic_type, properties)[source]

Remove acoustic properties from an encoded acoustic measure.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

acoustic_type : str

Acoustic measure to remove properties for

properties : iterable

List of property names

remove_annotation_type(annotation_type)[source]

Removes an annotation type from the hierarchy

Parameters:
annotation_type : str

Annotation type to remove

remove_discourse_properties(corpus_context, properties)[source]

Removes discourse properties from the Hierarchy object and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

List of property names to remove

remove_speaker_properties(corpus_context, properties)[source]

Removes speaker properties from the Hierarchy object and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

properties : iterable

List of property names to remove

remove_subannotation_properties(corpus_context, subannotation_type, properties)[source]

Removes properties for a subannotation type from the Hierarchy object and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Name of the subannotation type

properties : iterable

List of property names to remove

remove_subannotation_type(corpus_context, subannotation_type)[source]

Remove a subannotation type from the Hierarchy object and sync it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

subannotation_type : str

Subannotation type to remove

remove_token_properties(corpus_context, annotation_type, properties)[source]

Removes token properties for an annotation type and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove token properties for

properties : iterable

List of property names to remove

remove_token_subsets(corpus_context, annotation_type, subsets)[source]

Removes token subsets from the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to remove subsets for

subsets : iterable

List of subsets to remove for the annotation tokens

remove_type_properties(corpus_context, annotation_type, properties)[source]

Removes type properties for an annotation type and syncs it to a Neo4j database.

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type : str

Annotation type to remove type properties for

properties : iterable

List of property names to remove

remove_type_subsets(corpus_context, annotation_type, subsets)[source]

Removes type subsets from the Hierarchy object for a corpus, and syncs it to the hierarchy schema in a Neo4j database

Parameters:
corpus_context : CorpusContext

CorpusContext to use for updating Neo4j database

annotation_type: str

Annotation type to remove subsets for

subsets : iterable

List of subsets to remove for the annotation type

to_json()[source]

Convert the Hierarchy object to a dictionary for JSON serialization

Returns:
dict

All necessary information for the Hierarchy object
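Since the hierarchy is plain dictionary data, the round trip through to_json and from_json is ordinary JSON serialization; a sketch with invented key names:

```python
import json

# Invented serialized form; the real to_json() output will have
# its own key names and extra fields.
state = {"corpus_name": "demo",
         "data": {"phone": "word", "word": None}}

serialized = json.dumps(state)   # what to_json() enables
restored = json.loads(serialized)  # what from_json() would consume

assert restored == state
```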

update(other)[source]

Merge Hierarchies together. If other is a dictionary, then only the hierarchical data is updated.

Parameters:
other : Hierarchy or dict

Data to be merged in

values()[source]

Values (containing types) of the hierarchy.

Returns:
generator

Values of the hierarchy

word_name

Shortcut for returning the annotation type matching “word”

Returns:
str or None

Annotation type that begins with “word”

Corpus config class

class polyglotdb.config.CorpusConfig(corpus_name, data_dir=None, **kwargs)[source]

Class for storing configuration information about a corpus.

Parameters:
corpus_name : str

Identifier for the corpus

kwargs : keyword arguments

All keywords will be converted to attributes of the object

Attributes:
corpus_name : str

Identifier of the corpus

graph_user : str

Username for connecting to the graph database

graph_password : str

Password for connecting to the graph database

graph_host : str

Host for the graph database

graph_port : int

Port for connecting to the graph database

engine : str

Type of SQL database

base_dir : str

Base directory to store information and temporary files for the corpus; defaults to “.pgdb” under the current user’s home directory

acoustic_connection_kwargs

Return connection parameters to use for connecting to an InfluxDB database

Returns:
dict

Connection parameters

graph_connection_string

Construct a connection string to use for Neo4j

Returns:
str

Connection string
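The exact format is not documented here; a hedged sketch assuming the Bolt URI scheme used by Neo4j drivers (host and port values are illustrative, and the real property may use a different scheme or include credentials):

```python
graph_host = "localhost"  # illustrative values, cf. the attributes above
graph_port = 7687

# Assumed format; not necessarily what graph_connection_string returns.
connection_string = f"bolt://{graph_host}:{graph_port}"

assert connection_string == "bolt://localhost:7687"
```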

temporary_directory(name)[source]

Create a temporary directory for use in the corpus, and return the path name.

All temporary directories are deleted upon successful exit of the context manager.

Returns:
str

Full path to temporary directory