Corpus API¶
Corpus classes¶
Base corpus¶
-
class polyglotdb.corpus.BaseContext(*args, **kwargs)
Base CorpusContext class. Inherit from this and extend to create more functionality.
Parameters: - *args
If the first argument is not a CorpusConfig object, it is the name of the corpus
- **kwargs
If a CorpusConfig object is not specified, all arguments and keyword arguments are passed to a CorpusConfig object
-
annotation_types
¶ Get a list of all the annotation types in the corpus’s Hierarchy
Returns: - list
Annotation types
-
cypher_safe_name
¶ Escape the corpus name for use in Cypher queries
Returns: - str
Corpus name made safe for Cypher
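Escaping matters because corpus names can contain spaces or punctuation that are not valid in bare Cypher identifiers. A minimal sketch of backtick escaping, the standard way to quote identifiers in Cypher (the exact implementation in polyglotdb may differ):

```python
def cypher_safe_name(corpus_name: str) -> str:
    # Wrap the name in backticks and double any internal backticks,
    # the standard identifier-quoting rule in Cypher.
    return "`{}`".format(corpus_name.replace("`", "``"))

print(cypher_safe_name("my corpus"))  # → `my corpus`
```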
-
discourses
¶ Gets a list of discourses in the corpus
Returns: - list
Discourse names in the corpus
-
encode_type_subset
(annotation_type, annotation_labels, subset_label)[source]¶ Encode a type subset from labels of annotations
Parameters: - annotation_type : str
Annotation type of labels
- annotation_labels : list
a list of labels of annotations to subset together
- subset_label : str
the label for the subset
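A type subset is essentially a label attached to a chosen set of annotation types. As a rough sketch of the idea, using a toy in-memory lexicon in place of the real Neo4j-backed storage:

```python
def encode_type_subset(lexicon, annotation_labels, subset_label):
    """Tag every type whose label is in annotation_labels with subset_label."""
    for label, properties in lexicon.items():
        if label in annotation_labels:
            properties.setdefault("subsets", set()).add(subset_label)
    return lexicon

phones = {"AA": {}, "IY": {}, "T": {}}
encode_type_subset(phones, ["AA", "IY"], "vowel")
```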
-
execute_cypher(statement, **parameters)
Executes a Cypher query
Parameters: - statement : str
the Cypher statement
- parameters : kwargs
keyword arguments for the Cypher statement
Returns: BoltStatementResult
Result of the Cypher query
-
exists
()[source]¶ Check whether the corpus has a Hierarchy schema in the Neo4j database
Returns: - bool
True if the corpus Hierarchy has been saved to the database
-
hierarchy_path
¶ Get the path to cached hierarchy information
Returns: - str
Path to the cached hierarchy data on disk
-
lowest_annotation
¶ Returns the annotation type that is the lowest in the Hierarchy.
Returns: - str
Lowest annotation type in the Hierarchy
-
phone_name
¶ Gets the phone label
Returns: - str
phone name
-
phones
¶ Get a list of all phone labels in the corpus.
Returns: - list
All phone labels in the corpus
-
query_discourses
()[source]¶ Start a query over discourses in the corpus
Returns: DiscourseQuery
DiscourseQuery object
-
query_graph(annotation_node)
Start a query over the tokens of a specified annotation type (i.e. corpus.word)
Parameters: - annotation_node : polyglotdb.query.attributes.AnnotationNode
The type of annotation to look for in the corpus
Returns: SplitQuery
SplitQuery object
-
query_lexicon(annotation_node)
Start a query over types of a specified annotation type (i.e. corpus.lexicon_word)
Parameters: - annotation_node : polyglotdb.query.attributes.AnnotationNode
The type of annotation to look for in the corpus’s lexicon
Returns: LexiconQuery
LexiconQuery object
-
query_speakers
()[source]¶ Start a query over speakers in the corpus
Returns: SpeakerQuery
SpeakerQuery object
-
remove_discourse
(name)[source]¶ Remove the nodes and relationships associated with a single discourse in the corpus.
Parameters: - name : str
Name of the discourse to remove
-
reset
(call_back=None, stop_check=None)[source]¶ Reset the Neo4j and InfluxDB databases for a corpus
Parameters: - call_back : callable
Function to monitor progress
- stop_check : callable
Function to check whether the process should terminate early
-
reset_graph
(call_back=None, stop_check=None)[source]¶ Remove all nodes and relationships in the corpus.
-
reset_type_subset
(annotation_type, subset_label)[source]¶ Reset and remove a type subset
Parameters: - annotation_type : str
Annotation type of the subset
- subset_label : str
the label for the subset
-
speakers
¶ Gets a list of speakers in the corpus
Returns: - list
Speaker names in the corpus
-
word_name
¶ Gets the word label
Returns: - str
word name
-
words
¶ Get a list of all word labels in the corpus.
Returns: - list
All word labels in the corpus
Phonological functionality¶
-
class polyglotdb.corpus.PhonologicalContext(*args, **kwargs)
Class that contains methods for dealing specifically with phones
-
encode_class(phones, label)
Encode a class (type subset) from a list of phone labels
Parameters: - phones : list
a list of phones
- label : str
the label for the class
-
encode_features(feature_dict)
Encode phonological features as type properties on phones in the corpus
Parameters: - feature_dict : dict
features to encode
-
enrich_features
(feature_data, type_data=None)[source]¶ Sets the data type and feature data, initializes importers for feature data, adds features to hierarchy for a phone
Parameters: - feature_data : dict
the enrichment data
- type_data : dict
By default None
-
enrich_inventory_from_csv
(path)[source]¶ Enriches corpus from a csv file
Parameters: - path : str
the path to the csv file
-
remove_pattern(pattern='[0-2]')
Removes a stress or tone pattern from all phone labels
Parameters: - pattern : str
the regular expression for the pattern to remove; defaults to ‘[0-2]’
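The default pattern ‘[0-2]’ matches ARPABET-style stress digits. A self-contained sketch of the same operation over a list of labels, using Python's re module:

```python
import re

def remove_pattern(labels, pattern="[0-2]"):
    # Strip stress/tone digits (or any matching pattern) from each phone label.
    return [re.sub(pattern, "", label) for label in labels]

remove_pattern(["AA1", "IY0", "T"])  # → ["AA", "IY", "T"]
```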
-
reset_features
(feature_names)[source]¶ resets features
Parameters: - feature_names : list
list of names of features to remove
-
Syllabic functionality¶
-
class polyglotdb.corpus.SyllabicContext(*args, **kwargs)
Class that contains methods for dealing specifically with syllables
-
encode_stress_from_word_property
(word_property_name)[source]¶ Use a property on words formatted like “0-1-0” to encode stress on syllables.
The number of syllables and the position of syllables within a word will also be encoded as a result of this function.
Parameters: - word_property_name : str
Property name of words that contains the stress pattern
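The "0-1-0" format can be unpacked into per-syllable stress values plus the positional information mentioned above, roughly as follows (a standalone sketch, not the library's code):

```python
def stress_per_syllable(stress_pattern, sep="-"):
    # "0-1-0" → one record per syllable, with stress, position, and count.
    values = stress_pattern.split(sep)
    return [
        {"position": i + 1, "stress": v, "num_syllables": len(values)}
        for i, v in enumerate(values)
    ]

stress_per_syllable("0-1-0")
```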
-
encode_stress_to_syllables(regex=None, clean_phone_label=True)
Use numbers (0-9) in phone labels as stress property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.
Parameters: - regex : str
Regular expression character set for finding stress in the phone label
- clean_phone_label : bool
Flag for removing regular expression from the phone labels
-
encode_syllabic_segments
(phones)[source]¶ Encode a list of phones as ‘syllabic’
Parameters: - phones : list
A list of vowels and syllabic consonants
-
encode_syllables
(algorithm='maxonset', syllabic_label='syllabic', call_back=None, stop_check=None)[source]¶ Encodes syllables to a corpus
Parameters: - algorithm : str, defaults to ‘maxonset’
determines which algorithm will be used to encode syllables
- syllabic_label : str
Subset to use for syllabic segments (i.e., nuclei)
- call_back : callable
Function to monitor progress
- stop_check : callable
Function to check whether the process should terminate early
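The ‘maxonset’ algorithm assigns intervocalic consonants to the following syllable whenever they form an onset that is attested word-initially. A simplified, self-contained sketch; here legal_onsets is passed in directly, whereas the real implementation derives it from the corpus:

```python
def syllabify_maxonset(phones, syllabic, legal_onsets):
    """Split phones into syllables, giving each nucleus the longest legal onset."""
    nuclei = [i for i, p in enumerate(phones) if p in syllabic]
    if not nuclei:
        return [phones]
    boundaries = [0]
    for prev, cur in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:cur]   # consonants between two nuclei
        split = len(cluster)             # default: everything to the coda
        for size in range(len(cluster), -1, -1):
            if tuple(cluster[len(cluster) - size:]) in legal_onsets:
                split = len(cluster) - size  # longest legal onset wins
                break
        boundaries.append(prev + 1 + split)
    boundaries.append(len(phones))
    return [phones[b:e] for b, e in zip(boundaries, boundaries[1:])]

syls = syllabify_maxonset(
    ["k", "ae", "s", "t", "ax"], {"ae", "ax"}, {("t",), ("s", "t"), ()}
)  # → [["k", "ae"], ["s", "t", "ax"]]
```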
-
encode_tone_to_syllables(regex=None, clean_phone_label=True)
Use numbers (0-9) in phone labels as tone property for syllables. If clean_phone_label is True, the numbers will be removed from the phone labels.
Parameters: - regex : str
Regular expression character set for finding tone in the phone label
- clean_phone_label : bool
Flag for removing regular expression from the phone labels
-
enrich_syllables(syllable_data, type_data=None)
Sets the data type and syllable data, initializes importers for syllable data, and adds syllable properties to the hierarchy
Parameters: - syllable_data : dict
the enrichment data
- type_data : dict
By default None
-
find_codas
(syllabic_label='syllabic')[source]¶ Gets syllable codas across the corpus
Parameters: - syllabic_label : str
Subset to use for syllabic segments (i.e., nuclei)
Returns: - data : dict
A dictionary with coda values as keys and frequency values as values
-
find_onsets
(syllabic_label='syllabic')[source]¶ Gets syllable onsets across the corpus
Parameters: - syllabic_label : str
Subset to use for syllabic segments (i.e., nuclei)
Returns: - data : dict
A dictionary with onset values as keys and frequency values as values
-
has_syllabics
¶ Check whether there is a phone subset named
syllabic
Returns: - bool
True if
syllabic
is found as a phone subset
-
has_syllables
¶ Check whether the corpus has syllables encoded
Returns: - bool
True if the syllables are in the Hierarchy
-
Lexical functionality¶
-
class polyglotdb.corpus.LexicalContext(*args, **kwargs)
Class that contains methods for dealing specifically with words
-
enrich_lexicon(lexicon_data, type_data=None, case_sensitive=False)
Adds properties to the lexicon and to the corpus hierarchy
Parameters: - lexicon_data : dict
the data in the lexicon
- type_data : dict
default to None
- case_sensitive : bool
default to False
-
Pause functionality¶
-
class polyglotdb.corpus.PauseContext(*args, **kwargs)
Class that contains methods for dealing specifically with non-speech elements
-
encode_pauses
(pause_words, call_back=None, stop_check=None)[source]¶ Set words to be pauses, as opposed to speech.
Parameters: - pause_words : str, list, tuple, or set
Either a list of words that are pauses or a string containing a regular expression that specifies pause words
- call_back : callable
Function to monitor progress
- stop_check : callable
Function to check whether process should be terminated early
-
has_pauses
¶ Check whether corpus has encoded pauses
Returns: - bool
True if pause is in the subsets available for words
-
Utterance functionality¶
-
class polyglotdb.corpus.UtteranceContext(*args, **kwargs)
Class that contains methods for dealing specifically with utterances
-
encode_speech_rate(subset_label, call_back=None, stop_check=None)
Encodes speech rate for each utterance, as the number of phones in the given subset (e.g., syllabic segments) per second
Parameters: - subset_label : str
the name of the phone subset used to compute the rate
-
encode_utterance_position(call_back=None, stop_check=None)
Encodes position_in_utterance as a token property on each word
-
encode_utterances
(min_pause_length=0.5, min_utterance_length=0, call_back=None, stop_check=None)[source]¶ Encode utterance annotations based on minimum pause length and minimum utterance length. See get_pauses for more information about the algorithm.
Once this function is run, utterances will be queryable like other annotation types.
Parameters: - min_pause_length : float, defaults to 0.5
Time in seconds that is the minimum duration of a pause to count as an utterance boundary
- min_utterance_length : float, defaults to 0.0
Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
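The segmentation logic above can be sketched as follows. Here intervals is a toy list of (begin, end, is_pause) tuples in time order, and too-short utterances are merged into the preceding utterance only, a simplification of the "closest utterance" rule used by the library:

```python
def find_utterances(intervals, min_pause_length=0.5, min_utterance_length=0.0):
    """Group speech intervals into utterances, breaking only at long pauses."""
    utterances, current = [], None
    for begin, end, is_pause in intervals:
        if is_pause and current is not None and end - begin >= min_pause_length:
            # A long enough pause closes the current utterance.
            utterances.append(current)
            current = None
        elif not is_pause:
            # Extend the current utterance, or start a new one.
            current = (current[0], end) if current is not None else (begin, end)
    if current is not None:
        utterances.append(current)
    # Merge too-short utterances (simplified: into the preceding one).
    merged = []
    for u in utterances:
        if merged and u[1] - u[0] < min_utterance_length:
            merged[-1] = (merged[-1][0], u[1])
        else:
            merged.append(u)
    return merged
```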
-
enrich_utterances(utterance_data, type_data=None)
Adds properties to utterances and to the corpus hierarchy
Parameters: - utterance_data : dict
the data to enrich with
- type_data : dict
default to None
-
get_utterance_ids
(discourse, min_pause_length=0.5, min_utterance_length=0)[source]¶ Algorithm to find utterance boundaries in a discourse.
Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.
Parameters: - discourse : str
String identifier for a discourse
- min_pause_length : float, defaults to 0.5
Time in seconds that is the minimum duration of a pause to count as an utterance boundary
- min_utterance_length : float, defaults to 0.0
Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
-
get_utterances
(discourse, min_pause_length=0.5, min_utterance_length=0)[source]¶ Algorithm to find utterance boundaries in a discourse.
Pauses with duration less than the minimum will not count as utterance boundaries. Utterances that are shorter than the minimum utterance length (such as ‘okay’ surrounded by silence) will be merged with the closest utterance.
Parameters: - discourse : str
String identifier for a discourse
- min_pause_length : float, defaults to 0.5
Time in seconds that is the minimum duration of a pause to count as an utterance boundary
- min_utterance_length : float, defaults to 0.0
Time in seconds that is the minimum duration of a stretch of speech to count as an utterance
-
Audio functionality¶
-
class polyglotdb.corpus.AudioContext(*args, **kwargs)
Class that contains methods for dealing with audio files for corpora
-
acoustic_client
()[source]¶ Generate a client to connect to the InfluxDB for the corpus
Returns: - InfluxDBClient
Client through which to run queries and writes
-
analyze_formant_tracks
(source='praat', stop_check=None, call_back=None, multiprocessing=True, vowel_label=None)[source]¶ Compute formant tracks and save them to the database
See
polyglotdb.acoustics.formants.base.analyze_formant_tracks()
for more details.Parameters: - source : str
Program to compute formants
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
- vowel_label : str, optional
Optional subset of phones to compute tracks over. If None, then tracks over utterances are computed.
-
analyze_intensity(source='praat', stop_check=None, call_back=None, multiprocessing=True)
Compute intensity tracks and save them to the database
See polyglotdb.acoustics.intensity.analyze_intensity() for more details.
Parameters: - source : str
Program to compute intensity (only praat is supported)
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
-
analyze_pitch(source='praat', algorithm='base', absolute_min_pitch=50, absolute_max_pitch=500, adjusted_octaves=1, stop_check=None, call_back=None, multiprocessing=True)
Analyze pitch tracks and save them to the database.
See polyglotdb.acoustics.pitch.base.analyze_pitch() for more details.
Parameters: - source : str
Program to use for analyzing pitch, either praat or reaper
- algorithm : str
Algorithm to use, one of base, gendered, or speaker_adjusted
- absolute_min_pitch : int
Absolute pitch floor
- absolute_max_pitch : int
Absolute pitch ceiling
- adjusted_octaves : int
How many octaves around the speaker’s mean pitch to set the speaker adjusted pitch floor and ceiling
- stop_check : callable
Function to check whether processing should stop early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag whether to use multiprocessing or threading
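For the speaker_adjusted algorithm, the floor and ceiling are derived from the speaker's mean pitch using adjusted_octaves. A sketch of the arithmetic, assuming the result is clamped to the absolute floor and ceiling (the exact behavior in polyglotdb may differ):

```python
def speaker_adjusted_range(mean_pitch, adjusted_octaves=1,
                           absolute_min_pitch=50, absolute_max_pitch=500):
    # An octave is a doubling of frequency, so move adjusted_octaves octaves
    # below and above the speaker's mean, clamped to the absolute limits.
    floor = max(mean_pitch / 2 ** adjusted_octaves, absolute_min_pitch)
    ceiling = min(mean_pitch * 2 ** adjusted_octaves, absolute_max_pitch)
    return floor, ceiling

speaker_adjusted_range(200)  # → (100.0, 400)
```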
-
analyze_script(phone_class=None, subset=None, annotation_type=None, script_path=None, duration_threshold=0.01, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')
Use a Praat script to analyze annotation types in the corpus. The Praat script must return properties per phone (i.e., point measures, not a track), and these properties will be saved to the Neo4j database.
See polyglotdb.acoustics.other.analyze_script() for more details.
Parameters: - phone_class : str
DEPRECATED, the name of an already encoded subset of phones on which the analysis will be run
- subset : str, optional
the name of an already encoded subset of an annotation type, on which the analysis will be run
- annotation_type : str
the type of annotation that the analysis will go over
- script_path : str
Path to the Praat script
- duration_threshold : float
Minimum duration that phones should be to be analyzed
- arguments : list
Arguments to pass to the Praat script
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
- file_type : str
Sampling rate type to use, one of consonant, vowel, or low_freq
Returns: - list
List of the names of newly added properties to the Neo4j database
-
analyze_track_script(acoustic_name, properties, script_path, duration_threshold=0.01, phone_class=None, arguments=None, stop_check=None, call_back=None, multiprocessing=True, file_type='consonant')
Use a Praat script to analyze phones in the corpus. The Praat script must return a track, and these tracks will be saved to the InfluxDB database.
See polyglotdb.acoustics.other.analyze_track_script() for more details.
Parameters: - acoustic_name : str
Name of the acoustic measure
- properties : list
List of tuples of the form (property_name, Type)
- script_path : str
Path to the Praat script
- duration_threshold : float
Minimum duration that phones should be to be analyzed
- phone_class : str
Name of the phone subset to analyze
- arguments : list
Arguments to pass to the Praat script
- stop_check : callable
Function to check whether to terminate early
- call_back : callable
Function to report progress
- multiprocessing : bool
Flag to use multiprocessing, defaults to True, if False uses threading
- file_type : str
Sampling rate type to use, one of consonant, vowel, or low_freq
-
analyze_utterance_pitch(utterance, source='praat', **kwargs)
Analyze a single utterance’s pitch track.
See polyglotdb.acoustics.pitch.base.analyze_utterance_pitch() for more details.
Parameters: - utterance : str
Utterance ID from Neo4j
- source : str
Program to use for analyzing pitch, either praat or reaper
- kwargs
Additional settings to use in analyzing pitch
Returns: Track
Pitch track
-
analyze_vot(classifier, stop_label='stops', stop_check=None, call_back=None, multiprocessing=False, overwrite_edited=False, vot_min=5, vot_max=100, window_min=-30, window_max=30)
Compute VOTs for stops and save them to the database.
See polyglotdb.acoustics.vot.base.analyze_vot() for more details.
Parameters: - classifier : str
Path to an AutoVOT classifier model
- stop_label : str
Label of subset to analyze
- vot_min : int
Minimum VOT in ms
- vot_max : int
Maximum VOT in ms
- window_min : int
Window minimum in ms
- window_max : int
Window maximum in ms
- overwrite_edited : bool
Flag for whether to overwrite VOTs whose “edited” property is set to true
- call_back : callable
call back function, optional
- stop_check : callable
stop check function, optional
- multiprocessing : bool
Flag to use multiprocessing, otherwise will use threading
-
discourse_audio_directory
(discourse)[source]¶ Return the directory for the stored audio files for a discourse
-
discourse_has_acoustics
(acoustic_name, discourse)[source]¶ Return whether a discourse has any specific acoustic values associated with it
Parameters: - acoustic_name : str
Name of the acoustic type
- discourse : str
Name of the discourse
Returns: - bool
-
discourse_sound_file
(discourse)[source]¶ Get details for the audio file paths for a specified discourse.
Parameters: - discourse : str
Name of the audio file in the corpus
Returns: - dict
Information for the audio file path
-
encode_acoustic_statistic
(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]¶ Computes and saves as type properties summary statistics on a by speaker or by phone basis (or both) for a given acoustic measure.
Parameters: - acoustic_name : str
Name of the acoustic type
- statistic : str
One of mean, median, stddev, sum, mode, count
- by_speaker : bool, defaults to False
Flag for calculating summary statistic by speaker
- by_phone : bool, defaults to True
Flag for calculating summary statistic by phone
-
execute_influxdb
(query)[source]¶ Execute an InfluxDB query for the corpus
Parameters: - query : str
Query to run
Returns: influxdb.resultset.ResultSet
Results of the query
-
genders()
Gets all values of the speaker property named gender in the Neo4j database
Returns: - list
List of gender values
-
generate_spectrogram(discourse, file_type='consonant', begin=None, end=None)
Generate a spectrogram from an audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.
Parameters: - discourse : str
Name of the audio file to load
- file_type : str
One of consonant, vowel, or low_freq
- begin : float
Timestamp in seconds
- end : float
Timestamp in seconds
Returns: - numpy.array
Spectrogram information
- float
Time step between each window
- float
Frequency step between each frequency bin
-
get_acoustic_measure(acoustic_name, discourse, begin, end, channel=0, relative_time=False, **kwargs)
Get acoustic track values for a given discourse and time range
Parameters: - acoustic_name : str
Name of acoustic track
- discourse : str
Name of the discourse
- begin : float
Beginning of time range
- end : float
End of time range
- channel : int, defaults to 0
Channel of the audio file
- relative_time : bool, defaults to False
Flag for retrieving relative time instead of absolute time
- kwargs : kwargs
Tags to filter on
Returns: polyglotdb.acoustics.classes.Track
Track object
-
get_acoustic_statistic
(acoustic_name, statistic, by_phone=True, by_speaker=False)[source]¶ Computes summary statistics on a by speaker or by phone basis (or both) for a given acoustic measure.
Parameters: - acoustic_name : str
Name of the acoustic type
- statistic : str
One of mean, median, stddev, sum, mode, count
- by_speaker : bool, defaults to False
Flag for calculating summary statistic by speaker
- by_phone : bool, defaults to True
Flag for calculating summary statistic by phone
Returns: - dict
Dictionary where keys are phone/speaker/phone-speaker pairs and values are the summary statistic of the acoustic measure
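The grouping arithmetic behind these summary statistics can be sketched with a toy list of (key, value) pairs standing in for database query results; keys here are hypothetical (phone, speaker) pairs:

```python
import statistics
from collections import defaultdict

def summarize_by_group(points, statistic="mean"):
    """points: iterable of (group_key, value); returns {group_key: statistic}."""
    funcs = {"mean": statistics.mean, "median": statistics.median,
             "stddev": statistics.stdev, "sum": sum,
             "mode": statistics.mode, "count": len}
    grouped = defaultdict(list)
    for key, value in points:
        grouped[key].append(value)
    return {k: funcs[statistic](v) for k, v in grouped.items()}

summarize_by_group([(("AA", "s01"), 110.0), (("AA", "s01"), 130.0)])
```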
-
get_utterance_acoustics(acoustic_name, utterance_id, discourse, speaker)
Get acoustic track values for a given utterance
Parameters: - acoustic_name : str
Name of acoustic track
- utterance_id : str
ID of the utterance from the Neo4j database
- discourse : str
Name of the discourse
- speaker : str
Name of the speaker
Returns: polyglotdb.acoustics.classes.Track
Track object
-
has_all_sound_files
()[source]¶ Check whether all discourses have a sound file
Returns: - bool
True if a sound file exists for each discourse name in corpus, False otherwise
-
has_sound_files
¶ Check whether any discourses have a sound file
Returns: - bool
True if there are any sound files at all, false if there aren’t
-
load_audio(discourse, file_type)
Loads a given audio file at the specified sampling rate type (consonant, vowel, or low_freq). Consonant files have a sampling rate of 16 kHz, vowel files a sampling rate of 11 kHz, and low frequency files a sampling rate of 1.2 kHz.
Parameters: - discourse : str
Name of the audio file to load
- file_type : str
One of consonant, vowel, or low_freq
Returns: - numpy.array
Audio signal
- int
Sampling rate of the file
-
load_waveform(discourse, file_type='consonant', begin=None, end=None)
Loads a segment of a larger audio file. If begin is unspecified, the segment will start at the beginning of the audio file, and if end is unspecified, the segment will end at the end of the audio file.
Parameters: - discourse : str
Name of the audio file to load
- file_type : str
One of consonant, vowel, or low_freq
- begin : float, optional
Timestamp in seconds
- end : float, optional
Timestamp in seconds
Returns: - numpy.array
Audio signal
- int
Sampling rate of the file
-
reassess_utterances
(acoustic_name)[source]¶ Update utterance IDs in InfluxDB for more efficient querying if utterances have been re-encoded after acoustic measures were encoded
Parameters: - acoustic_name : str
Name of the measure for which to update utterance IDs
-
relativize_acoustic_measure(acoustic_name, by_speaker=True, by_phone=False)
Relativize acoustic tracks by z-scoring their points, using means and standard deviations computed by speaker, by phone, or both, and save them as separate measures, e.g., F0_relativized from F0.
Parameters: - acoustic_name : str
Name of the acoustic measure
- by_speaker : bool, defaults to True
Flag for relativizing by speaker
- by_phone : bool, defaults to False
Flag for relativizing by phone
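The z-scoring itself is straightforward; a standalone sketch on a toy track of (time, value) points, where the by-speaker or by-phone mean and standard deviation would come from database queries in the real implementation:

```python
import statistics

def relativize(track, mean=None, sd=None):
    # z-score each point: (value - mean) / standard deviation.
    values = [v for _, v in track]
    mean = statistics.mean(values) if mean is None else mean
    sd = statistics.stdev(values) if sd is None else sd
    return [(t, (v - mean) / sd) for t, v in track]

relativize([(0, 90.0), (1, 100.0), (2, 110.0)])  # → [(0, -1.0), (1, 0.0), (2, 1.0)]
```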
-
reset_acoustic_measure
(acoustic_type)[source]¶ Reset a given acoustic measure
Parameters: - acoustic_type : str
Name of the acoustic measurement to reset
-
reset_relativized_acoustic_measure
(acoustic_name)[source]¶ Reset any relativized measures that have been encoded for a specified type of acoustics
Parameters: - acoustic_name : str
Name of the acoustic type
-
save_acoustic_track
(acoustic_name, discourse, track, **kwargs)[source]¶ Save an acoustic track for a sound file
Parameters: - acoustic_name : str
Name of the acoustic type
- discourse : str
Name of the discourse
- track :
Track
Track to save
- kwargs: kwargs
Tags to save for acoustic measurements
-
save_acoustic_tracks
(acoustic_name, tracks, speaker)[source]¶ Save multiple acoustic tracks for a collection of analyzed segments
Parameters: - acoustic_name : str
Name of the acoustic type
- tracks : iterable
Iterable of
Track
objects to save- speaker : str
Name of the speaker of the tracks
-
update_utterance_pitch_track
(utterance, new_track)[source]¶ Save a pitch track for the specified utterance.
See
polyglotdb.acoustics.pitch.base.update_utterance_pitch_track()
for more details.Parameters: - utterance : str
Utterance ID from Neo4j
- new_track : list or
Track
Pitch track
Returns: - int
Time stamp of update
-
utterance_sound_file(utterance_id, file_type='consonant')
Generate an audio file just for a single utterance in an audio file.
Parameters: - utterance_id : str
Utterance ID from Neo4j
- file_type : str
Sampling rate type to use, one of consonant, vowel, or low_freq
Returns: - str
Path to the generated sound file
-
Summarization functionality¶
-
class polyglotdb.corpus.SummarizedContext(*args, **kwargs)
Class that contains methods for dealing specifically with summary measures for linguistic items
-
average_speech_rate
()[source]¶ Get the average speech rate for each speaker in a corpus
Returns: - result: list
the average speech rate by speaker
-
baseline_duration(annotation, speaker=None)
Get the baseline duration of each word in the corpus. Baseline duration is determined by summing the average durations of constituent phones for a word. If there is no underlying transcription available, the longest duration is considered the baseline.
Parameters: - annotation : str
the annotation type for which to compute baseline durations
- speaker : str
a speaker name, if desired (defaults to None)
Returns: - word_totals : dict
a dictionary of words and baseline durations
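The summation described above is simple to illustrate; here the phone mean durations are hypothetical values standing in for corpus queries:

```python
def baseline_durations(words, phone_means):
    """Sum the mean duration of each constituent phone per word."""
    return {
        word: sum(phone_means[p] for p in transcription)
        for word, transcription in words.items()
    }

baseline_durations({"cat": ["k", "ae", "t"]},
                   {"k": 0.05, "ae": 0.12, "t": 0.06})
```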
-
encode_baseline
(annotation_type, property_name, by_speaker=False)[source]¶ Encode a baseline measure of a property, that is, the expected value of a higher annotation given the average property value of the phones that make it up. For instance, the expected duration of a word or syllable given its phonological content.
Parameters: - annotation_type : str
Name of annotation type to compute for
- property_name : str
Property of phones to compute the baseline from (e.g., duration)
- by_speaker : bool
Flag for whether to use by-speaker means
-
encode_measure
(property_name, statistic, annotation_type, by_speaker=False)[source]¶ Compute and save an aggregate measure for annotation types
Available statistic names:
- mean/average/avg
- sd/stdev
Parameters: - property_name : str
Name of the property
- statistic : str
Name of the statistic to use for aggregation
- annotation_type : str
Name of the annotation type
- by_speaker : bool
Flag for whether to compute aggregation by speaker
-
encode_relativized
(annotation_type, property_name, by_speaker=False)[source]¶ Compute and save to the database a relativized measure (i.e., the property value z-scored using a mean and standard deviation computed from the corpus). The computation of means and standard deviations can be by-speaker.
Parameters: - annotation_type : str
Name of the annotation type
- property_name : str
Name of the property to relativize
- by_speaker : bool
Flag to use by-speaker means and standard deviations
-
get_measure(data_name, statistic, annotation_type, by_speaker=False, speaker=None)
Abstract function to get a statistic for the data_name of an annotation_type
Parameters: - data_name : str
the aspect to summarize (duration, pitch, formants, etc)
- statistic : str
how to summarize (mean, stdev, median, etc)
- annotation_type : str
the annotation to summarize
- by_speaker : boolean
whether to summarize by speaker or not
- speaker : str
the specific speaker to encode baseline duration for (only for baseline duration)
-
Spoken functionality¶
-
class polyglotdb.corpus.SpokenContext(*args, **kwargs)
Class that contains methods for dealing specifically with speaker and sound file metadata
-
enrich_discourses
(discourse_data, type_data=None)[source]¶ Add properties about discourses to the corpus, allowing them to be queryable.
Parameters: - discourse_data : dict
the data about the discourse to add
- type_data : dict
Specifies the type of the data to be added, defaults to None
-
enrich_discourses_from_csv
(path)[source]¶ Enriches discourses from a csv file
Parameters: - path : str
the path to the csv file
-
enrich_speakers
(speaker_data, type_data=None)[source]¶ Add properties about speakers to the corpus, allowing them to be queryable.
Parameters: - speaker_data : dict
the data about the speakers to add
- type_data : dict
Specifies the type of the data to be added, defaults to None
-
enrich_speakers_from_csv
(path)[source]¶ Enriches speakers from a csv file
Parameters: - path : str
the path to the csv file
-
get_channel_of_speaker
(speaker, discourse)[source]¶ Get the channel that the speaker is in
Parameters: - speaker : str
Speaker to query
- discourse : str
Discourse to query
Returns: - int
Channel of audio that speaker is in
-
get_discourses_of_speaker
(speaker)[source]¶ Get a list of all discourses that a given speaker spoke in
Parameters: - speaker : str
Speaker to query over
Returns: - list
All discourses the speaker spoke in
-
get_speakers_in_discourse
(discourse)[source]¶ Get a list of all speakers that spoke in a given discourse
Parameters: - discourse : str
Audio file to query over
Returns: - list
All speakers who spoke in the discourse
-
make_speaker_annotations_dict(data, speaker, property)
Helper function to nest a flat dictionary of annotation data into {speaker: {property: data}} format
Parameters: - data : dict
annotations and values
- property : str
the name of the property being encoded
- speaker : str
the name of the speaker
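The reshaping amounts to two levels of nesting; a sketch of what that implies:

```python
def make_speaker_annotations_dict(data, speaker, property):
    # Nest flat annotation data under speaker and property keys.
    # ("property" mirrors the documented parameter name, shadowing the builtin.)
    return {speaker: {property: data}}

make_speaker_annotations_dict({"AA": 1.0}, "s01", "cog")
```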
-
Structured functionality¶
-
class polyglotdb.corpus.StructuredContext(*args, **kwargs)
Class that contains methods for dealing specifically with metadata for the corpus
-
encode_count(higher_annotation_type, lower_annotation_type, name, subset=None)
Encodes the count of the lower type within the higher type
Parameters: - higher_annotation_type : str
what the higher annotation is (utterance, word)
- lower_annotation_type : str
what the lower annotation is (word, phone, syllable)
- name : str
the column name
- subset : str
the annotation subset
-
encode_position
(higher_annotation_type, lower_annotation_type, name, subset=None)[source]¶ Encodes position of lower type in higher type
Parameters: - higher_annotation_type : str
what the higher annotation is (utterance, word)
- lower_annotation_type : str
what the lower annotation is (word, phone, syllable)
- name : str
the column name
- subset : str
the annotation subset
-
encode_rate
(higher_annotation_type, lower_annotation_type, name, subset=None)[source]¶ Encodes the rate of the lower type in the higher type
Parameters: - higher_annotation_type : str
what the higher annotation is (utterance, word)
- lower_annotation_type : str
what the lower annotation is (word, phone, syllable)
- name : str
the column name
- subset : str
the annotation subset
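Both count and rate reduce to simple arithmetic over token intervals; a sketch with toy (begin, end) intervals, where rate is the number of lower tokens per second of the higher token:

```python
def count_and_rate(higher, lower):
    """higher: (begin, end); lower: list of (begin, end) contained in it."""
    count = len(lower)
    duration = higher[1] - higher[0]
    return count, count / duration

count_and_rate((0.0, 2.0), [(0.0, 0.5), (0.5, 1.2), (1.2, 2.0)])  # → (3, 1.5)
```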
-
generate_hierarchy
()[source]¶ Get hierarchy schema information from the Neo4j database
Returns: Hierarchy
the structure of the corpus
-
Omnibus class¶
-
class polyglotdb.corpus.CorpusContext(*args, **kwargs)
Main corpus context, inherits from the more specialized contexts.
Parameters: - args : args
Either a CorpusConfig object or sequence of arguments to be passed to a CorpusConfig object
- kwargs : kwargs
sequence of keyword arguments to be passed to a CorpusConfig object
Corpus structure class¶
-
class
polyglotdb.structure.
Hierarchy
(data=None, corpus_name=None)[source]¶ Class containing information about how a corpus is structured.
Hierarchical data is stored as a dictionary with keys for linguistic types and values for the linguistic type that contains them; if no other type contains a given type, its value is None. Subannotation data is stored as a dictionary with keys for linguistic types and values that are sets of subannotation types.
Parameters: - data : dict
Information about the hierarchy of linguistic types
- corpus_name : str
Name of the corpus
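The containment dictionary described above can be illustrated without the library. Assuming a typical utterance > word > phone structure (the type names are illustrative), the highest and lowest types fall out of the dictionary directly:

```python
# Containment dict: each type maps to the type that contains it;
# the topmost type maps to None.
data = {"phone": "word", "word": "utterance", "utterance": None}

# The highest type is the one whose container is None.
highest = next(t for t, container in data.items() if container is None)

# The lowest type is the one that contains nothing else,
# i.e. it never appears as a value.
lowest = next(t for t in data if t not in data.values())

print(highest, lowest)  # utterance phone
```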
-
acoustics
¶ Get all currently encoded acoustic measurements in the corpus
Returns: - list
All encoded acoustic measures
-
add_acoustic_properties
(corpus_context, acoustic_type, properties)[source]¶ Add acoustic properties to an encoded acoustic measure. Properties are given as tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- acoustic_type : str
Acoustic measure to add properties for
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_annotation_type
(annotation_type, above=None, below=None)[source]¶ Adds an annotation type to the Hierarchy object along with default type and token properties for the new annotation type
Parameters: - annotation_type : str
Annotation type to add
- above : str
Annotation type that is contained by the new annotation type, leave out if new annotation type is at the bottom of the hierarchy
- below : str
Annotation type that contains the new annotation type, leave out if new annotation type is at the top of the hierarchy
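Under the containment-dict representation described for the Hierarchy class, inserting a type into the middle of the hierarchy amounts to re-pointing the neighbors. A sketch of adding a syllable type above phone and below word, using a plain dict rather than the actual method:

```python
data = {"phone": "word", "word": "utterance", "utterance": None}

def add_type(data, new_type, above=None, below=None):
    """Insert new_type so it contains `above` and is contained by `below`."""
    data = dict(data)
    if above is not None:
        data[above] = new_type  # the lower neighbor now points at the new type
    data[new_type] = below      # the new type points at its container (None if topmost)
    return data

new = add_type(data, "syllable", above="phone", below="word")
print(new)  # {'phone': 'syllable', 'word': 'utterance', 'utterance': None, 'syllable': 'word'}
```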
-
add_discourse_properties
(corpus_context, properties)[source]¶ Adds discourse properties to the Hierarchy object and syncs them to a Neo4j database. Properties are given as tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_speaker_properties
(corpus_context, properties)[source]¶ Adds speaker properties to the Hierarchy object and syncs them to a Neo4j database. Properties are given as tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_subannotation_properties
(corpus_context, subannotation_type, properties)[source]¶ Adds properties for a subannotation type to the Hierarchy object and syncs them to a Neo4j database. Properties are given as tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- subannotation_type : str
Name of the subannotation type
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_subannotation_type
(corpus_context, annotation_type, subannotation_type, properties=None)[source]¶ Adds a subannotation type for a given annotation type to the Hierarchy object and syncs it to a Neo4j database. Optional properties are given as tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to add a subannotation to
- subannotation_type : str
Name of the subannotation type
- properties : iterable
Optional iterable of tuples of the form (property_name, Type)
-
add_token_properties
(corpus_context, annotation_type, properties)[source]¶ Adds token properties for an annotation type and syncs them to a Neo4j database. Properties are given as tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to add token properties for
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_token_subsets
(corpus_context, annotation_type, subsets)[source]¶ Adds token subsets to the Hierarchy object for a corpus and syncs them to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to add subsets for
- subsets : iterable
List of subsets to add for the annotation tokens
-
add_type_properties
(corpus_context, annotation_type, properties)[source]¶ Adds type properties for an annotation type and syncs them to a Neo4j database. Properties are given as tuples of the form (property_name, Type), where property_name is a string and Type is a Python type class, such as bool, str, list, or float.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to add type properties for
- properties : iterable
Iterable of tuples of the form (property_name, Type)
-
add_type_subsets
(corpus_context, annotation_type, subsets)[source]¶ Adds type subsets to the Hierarchy object for a corpus and syncs them to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to add subsets for
- subsets : iterable
List of subsets to add for the annotation type
-
annotation_types
¶ Get a list of all the annotation types in the hierarchy
Returns: - list
All annotation types in the hierarchy
-
from_json
(json)[source]¶ Set all properties from a dictionary deserialized from JSON
Parameters: - json : dict
Object information
-
get_depth
(lower_type, higher_type)[source]¶ Get the distance between two annotation types in the hierarchy
Parameters: - lower_type : str
Name of the lower type
- higher_type : str
Name of the higher type
Returns: - int
Distance between the two types
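Under the containment-dict representation, the distance between two types is just the number of upward steps from the lower type to the higher one. A sketch using a plain dict, assuming the two types are actually related in the hierarchy:

```python
data = {"phone": "syllable", "syllable": "word", "word": "utterance", "utterance": None}

def get_depth(data, lower_type, higher_type):
    """Count the upward steps from lower_type to higher_type."""
    depth = 0
    current = lower_type
    while current != higher_type:
        current = data[current]  # step up to the containing type
        depth += 1
    return depth

print(get_depth(data, "phone", "word"))  # 2
```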
-
get_higher_types
(annotation_type)[source]¶ Get all annotation types that are higher than the specified annotation type
Parameters: - annotation_type : str
Annotation type from which to get higher annotation types
Returns: - list
List of all annotation types that are higher than the specified annotation type
-
get_lower_types
(annotation_type)[source]¶ Get all annotation types that are lower than the specified annotation type
Parameters: - annotation_type : str
Annotation type from which to get lower annotation types
Returns: - list
List of all annotation types that are lower than the specified annotation type
-
has_discourse_property
(key)[source]¶ Check for whether discourses have a given property
Parameters: - key : str
Property to check for
Returns: - bool
True if discourses have the given property
-
has_speaker_property
(key)[source]¶ Check for whether speakers have a given property
Parameters: - key : str
Property to check for
Returns: - bool
True if speakers have the given property
-
has_subannotation_property
(subannotation_type, property_name)[source]¶ Check whether the Hierarchy has a property associated with a subannotation type
Parameters: - subannotation_type : str
Name of subannotation to check
- property_name : str
Name of the property to check for
Returns: - bool
True if subannotation type has the given property name
-
has_subannotation_type
(subannotation_type)[source]¶ Check whether the Hierarchy has a subannotation type
Parameters: - subannotation_type : str
Name of subannotation to check for
Returns: - bool
True if subannotation type is present
-
has_token_property
(annotation_type, key)[source]¶ Check whether a given annotation type has a given token property.
Parameters: - annotation_type : str
Annotation type to check for the given token property
- key : str
Property to check for
Returns: - bool
True if the annotation type has the given token property
-
has_token_subset
(annotation_type, key)[source]¶ Check whether a given annotation type has a given token subset.
Parameters: - annotation_type : str
Annotation type to check for the given token subset
- key : str
Subset to check for
Returns: - bool
True if the annotation type has the given token subset
-
has_type_property
(annotation_type, key)[source]¶ Check whether a given annotation type has a given type property.
Parameters: - annotation_type : str
Annotation type to check for the given type property
- key : str
Property to check for
Returns: - bool
True if the annotation type has the given type property
-
has_type_subset
(annotation_type, key)[source]¶ Check whether a given annotation type has a given type subset.
Parameters: - annotation_type : str
Annotation type to check for the given type subset
- key : str
Subset to check for
Returns: - bool
True if the annotation type has the given type subset
-
highest
¶ Get the highest annotation type of the Hierarchy
Returns: - str
Highest annotation type
-
highest_to_lowest
¶ Get a list of annotation types sorted from highest to lowest
Returns: - list
Annotation types from highest to lowest
-
lowest
¶ Get the lowest annotation type of the Hierarchy
Returns: - str
Lowest annotation type
-
lowest_to_highest
¶ Get a list of annotation types sorted from lowest to highest
Returns: - list
Annotation types from lowest to highest
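Both orderings can be produced by walking the containment chain, starting from the type that nothing else is contained by. A sketch over a plain dict:

```python
data = {"phone": "syllable", "syllable": "word", "word": "utterance", "utterance": None}

def lowest_to_highest(data):
    # The lowest type never appears as a value; follow containers upward from it.
    current = next(t for t in data if t not in data.values())
    order = []
    while current is not None:
        order.append(current)
        current = data[current]
    return order

print(lowest_to_highest(data))  # ['phone', 'syllable', 'word', 'utterance']
# highest_to_lowest is simply the reverse of this walk.
```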
-
phone_name
¶ Alias function for getting the lowest annotation type
Returns: - str
Name of the lowest annotation type
-
remove_acoustic_properties
(corpus_context, acoustic_type, properties)[source]¶ Remove acoustic properties from an encoded acoustic measure.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- acoustic_type : str
Acoustic measure to remove properties for
- properties : iterable
List of property names
-
remove_annotation_type
(annotation_type)[source]¶ Removes an annotation type from the hierarchy
Parameters: - annotation_type : str
Annotation type to remove
-
remove_discourse_properties
(corpus_context, properties)[source]¶ Removes discourse properties and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
List of property names to remove
-
remove_speaker_properties
(corpus_context, properties)[source]¶ Removes speaker properties and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- properties : iterable
List of property names to remove
-
remove_subannotation_properties
(corpus_context, subannotation_type, properties)[source]¶ Removes properties for a subannotation type from the Hierarchy object and syncs the change to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- subannotation_type : str
Name of the subannotation type
- properties : iterable
List of property names to remove
-
remove_subannotation_type
(corpus_context, subannotation_type)[source]¶ Remove a subannotation type from the Hierarchy object and sync it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- subannotation_type : str
Subannotation type to remove
-
remove_token_properties
(corpus_context, annotation_type, properties)[source]¶ Removes token properties for an annotation type and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to remove token properties for
- properties : iterable
List of property names to remove
-
remove_token_subsets
(corpus_context, annotation_type, subsets)[source]¶ Removes token subsets from the Hierarchy object for a corpus and syncs the change to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to remove subsets for
- subsets : iterable
List of subsets to remove for the annotation tokens
-
remove_type_properties
(corpus_context, annotation_type, properties)[source]¶ Removes type properties for an annotation type and syncs it to a Neo4j database.
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type : str
Annotation type to remove type properties for
- properties : iterable
List of property names to remove
-
remove_type_subsets
(corpus_context, annotation_type, subsets)[source]¶ Removes type subsets from the Hierarchy object for a corpus and syncs the change to the hierarchy schema in a Neo4j database
Parameters: - corpus_context :
CorpusContext
CorpusContext to use for updating Neo4j database
- annotation_type: str
Annotation type to remove subsets for
- subsets : iterable
List of subsets to remove for the annotation type
-
to_json
()[source]¶ Convert the Hierarchy object to a dictionary for JSON serialization
Returns: - dict
All necessary information for the Hierarchy object
-
update
(other)[source]¶ Merge Hierarchies together. If other is a dictionary, then only the hierarchical data is updated.
Parameters: - other : Hierarchy or dict
Data to be merged in
-
values
()[source]¶ Values (containing types) of the hierarchy.
Returns: - generator
Values of the hierarchy
-
word_name
¶ Shortcut for returning the annotation type matching “word”
Returns: - str or None
Annotation type that begins with “word”
Corpus config class¶
-
class
polyglotdb.config.
CorpusConfig
(corpus_name, data_dir=None, **kwargs)[source]¶ Class for storing configuration information about a corpus.
Parameters: - corpus_name : str
Identifier for the corpus
- kwargs : keyword arguments
All keywords will be converted to attributes of the object
Attributes: - corpus_name : str
Identifier of the corpus
- graph_user : str
Username for connecting to the graph database
- graph_password : str
Password for connecting to the graph database
- graph_host : str
Host for the graph database
- graph_port : int
Port for connecting to the graph database
- engine : str
Type of SQL database
- base_dir : str
Base directory for storing information and temporary files for the corpus; defaults to “.pgdb” under the current user’s home directory
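The keyword-to-attribute behavior noted above can be mimicked in a few lines. This is a sketch of the pattern only, not the real class; the default host and port values here are assumptions for illustration:

```python
class ConfigSketch:
    """Mimics CorpusConfig's documented behavior: keyword arguments
    become attributes of the object, on top of a few defaults."""
    def __init__(self, corpus_name, **kwargs):
        self.corpus_name = corpus_name
        self.graph_host = "localhost"  # assumed default
        self.graph_port = 7687         # assumed default (standard Neo4j Bolt port)
        for key, value in kwargs.items():
            setattr(self, key, value)  # every keyword becomes an attribute

c = ConfigSketch("my_corpus", graph_host="db.example.com", graph_user="neo4j")
print(c.graph_host, c.graph_user)  # db.example.com neo4j
```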
-
acoustic_connection_kwargs
¶ Return connection parameters to use for connecting to an InfluxDB database
Returns: - dict
Connection parameters
-
graph_connection_string
¶ Construct a connection string to use for Neo4j
Returns: - str
Connection string
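Neo4j’s Bolt protocol uses URIs of the form bolt://host:port. Assuming graph_connection_string follows that convention (the exact scheme may differ), it can be sketched from the config attributes as:

```python
graph_host = "localhost"
graph_port = 7687

# Assumed form of the connection string built from the config attributes.
connection_string = "bolt://{}:{}".format(graph_host, graph_port)
print(connection_string)  # bolt://localhost:7687
```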