Encoding acoustic measures

PolyglotDB has some built-in functions to encode certain acoustic measures, and also supports encoding measures from a Praat script. All of these functions carry out the acoustic analysis and save the results into the database. It currently contains built-in functions to encode pitch, intensity, and formants.

In order to use any of them on your own computer, you must set your CorpusContext’s config.praat_path to point to Praat, and likewise config.reaper_path to Reaper if you want to use Reaper for pitch.

Encoding pitch

Pitch is encoded using the analyze_pitch function.

This is done as follows:

with CorpusContext(config) as c:
    c.analyze_pitch()

Note

The function analyze_pitch requires that utterances be encoded prior to being run. See Creating utterance units for further details on encoding utterances.

Following encoding, pitch tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.

Pitch analysis can be configured in two ways, the source program of the measurement and the algorithm for fine tuning the pitch range.

Sources

The keyword argument source can be set to either 'praat' or 'reaper', depending on which program you would like PolyglotDB to use to measure pitch. The default source is Praat.

with CorpusContext(config) as c:
    c.analyze_pitch()
    # OR
    c.analyze_pitch(source='reaper')

If the source is praat, the Praat executable must be discoverable on the system path (i.e., a call of praat in a terminal works). Likewise, if the source is reaper, the Reaper executable must be on the path or the full path to the Reaper executable must be specified.

Algorithms

Similar to the source, attribute, the algorithm can be toggled between "base", "gendered", and "speaker_adapted".

with CorpusContext(config) as c:
    c.analyze_pitch()

    # OR

    c.analyze_pitch(algorithm='gendered')

    # OR

    c.analyze_pitch(algorithm='speaker_adapted')

The "base" algorithm uses a default minimum pitch of 50 Hz and a maximum pitch of 500 Hz, but these can be changed through the absolute_min_pitch and absolute_max_pitch parameters.

The "gendered" algorithm checks whether a Gender property is available for speakers. If a speaker has a property value that starts with f (i.e., female), utterances by that speakers will use a minimum pitch of 100 Hz and a maximum pitch of 500 Hz. If they have a property value of m (i.e., male), utterances by that speakers will use a minimum pitch of 50 Hz and a maximum pitch of 400 Hz.

The "speaker_adapted" algorithm does two passes of pitch estimation. The first is identical to "base" and uses a minimum pitch of 50 Hz and a maximum pitch of 500 Hz (or whatever the parameters have been set to). This first pass is used to estimate by-speaker means of F0. Speaker-specific pitch floors and ceilings are calculated by adding or subtracting the number of octaves that the adjusted_octaves parameter specifies. The default is 1, so the per-speaker pitch range will be one octave below and above the speaker’s mean pitch.

Encoding intensity

Intensity is encoded using analyze_intensity(), as follows:

with CorpusContext(config) as c:
    c.analyze_intensity()

Note

The function analyze_intensity requires that utterances be encoded prior to being run. See Creating utterance units for further details on encoding utterances.

Following encoding, intensity tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.

Encoding formants

There are several ways of encoding formants. The first is encodes formant tracks similar to encoding pitch or intensity tracks (i.e., done over utterances). There is also support for encoding formants tracks just over specified vowel segments. Finally, point measures of formants can be encoded. Both formant tracks and points can be calculated using either just a simple one-pass algorithm or by using a multiple-pass refinement algorithm.

Basic formant tracks

Formant tracks over utterances are encoded using analyze_formant_tracks, as follows:

with CorpusContext(config) as c:
    c.analyze_formant_tracks()

Note

The function analyze_formant_tracks requires that utterances be encoded prior to being run. See Creating utterance units for further details on encoding utterances.

Following encoding, formant tracks and summary statistics will be available for export for every annotation type. See Querying acoustic tracks for more details.

Formant tracks can also be encoded just for specific phones by specifying a subset of phones:

with CorpusContext(config) as c:
    c.analyze_formant_tracks(vowel_label='vowel')

Note

This usage requires that a vowel subset of phone types be already encoded in the database. See Enrichment via queries for more details on creating subsets

These formant tracks do not do any specialised analysis to ensure that they are not false formants.

Basic formant point measurements

The analyze_formant_points function will generate measure for F1, F2, F3, B1, B2, and B3 at the time point 33% of the way through the vowel for every vowel specified.

with CorpusContext(config) as c:
    c.analyze_formant_points(vowel_label='vowel')

Note

The function analyze_formant_points requires that a vowel subset of phone types be already encoded in the database. See Enrichment via queries for more details on creating subsets

Refined formant points and tracks

The other function for generating both point and track measurements is the analyze_formant_points_refinement. This function computes formant measurementss for multiple values of n_formants from 4 to 7. To pick the best measurement, the function initializes per-vowel means and standard deviations with the F1, F2, F3, B1, B2, B3 values generated by n_formants=5. Then, it performs multiple iterations that select the new best track as the one that minimizes the Mahalanobis distance to the relevant prototype. In order to choose whether you wish to save tracks or points in the database, just change the output_tracks parameter to true if you would like tracks, and false otherwise. When operating over tracks, the algorithm still only evaluates the best parameters by using the 33% point.

with CorpusContext(config) as c:
    c.analyze_formant_points_refinement(vowel_label='vowel')

Following encoding, phone types that were analyzed will have properties for F1, F2, F3, B1, B2, and B3 available for query and export. See Querying acoustic point measures for more details.

Encoding Voice Onset Time(VOT)

Currently there is only one method to encode Voice Onset Times(VOTs) into PolyglotDB. This makes use of the AutoVOT program which automatically calculates VOTs based on various acoustic properties.

VOTs are encoded over a specific subset of phones using analyze_vot as follows:

with CorpusContext(config) as c:
    c.analyze_vot(self, classifier,
                stop_label="stops",
                vot_min=5,
                vot_max=100,
                window_min=-30,
                window_max=30):

Note

The function analyze_vot requires that utterances and any subsets be encoded prior to being run. See Creating utterance units for further details on encoding utterances and Subset enrichment for subsets.

Parameters

The analyze_vot function has a variety of parameters that are important for running the function properly. classifier is a string which has a paht to an AutoVOT classifier directory. A default classifier is available in /tests/data/classifier/sotc_classifiers.

stop_label refers to the name of the subset of phones that you intend to calculate VOTs for.

vot_min and vot_max refer to the minimum and maximum duration of any VOT that is calculated. The AutoVOT repo <https://github.com/mlml/autovot> has some sane defaults for English voiced and voiceless stops.

window_min and window_max refer to the edges of a given phone’s duration. So, a window_min of -30 means that AutoVOT will look up to 30 milliseconds before the start of a phone for the burst, and a window_max of 30 means that it will look up to 30 milliseconds after the end of a phone.

Encoding other measures using a Praat script

You can encode additional acoustic measures by passing a Praat script to either analyze_script or analyze_track_script. It is essential to follow the exact input and output format for your Praat script to ensure compatibility with the system.

  • analyze_script: Designed for single-point measurements. This function works for user-specific measurements that occur at exactly one point in time for any target annotation type (or a defined subset of that type) in the hierarchy, such as a predefined set of vowels within all phones.

  • analyze_track_script: Use this for continuous measurements or when measurements are required at multiple time points per annotation. This function allows you to configure your Praat script to output results for multiple time points.

analyze_script

There are two input formats available for designing your Praat script:

Format 1:

This is sufficient for most use cases and should be your default choice unless runtime efficiency is critical. In this format, the system generates temporary sound files, each containing one instance of your chosen annotation type.

Input Requirements:

  • One required input: the full path to the sound file. This input will be automatically filled by the system. You can define additional attributes as needed.

Example Praat script using Format 1 can be found here. This script computes the mean F0 (pitch) over a sound file.

Format 2 (for optimized analysis):

This format is more efficient as it reuses the same discourse sound file for all annotations in the same discourse, avoiding the creation of extra files.

Input Requirements:

  • Five required inputs:
    • Full path to the long sound file

    • begin time

    • end time

    • channel

    • padding

Do not assign values to these five fields; the system will populate them during processing. You may include additional attributes beyond these five, but ensure that values are passed as an array via the API.

Example Praat script using Format 2 can be found here. Similar to the previous example script, this script computes the mean F0 (pitch) over a sound file, but this time includes the extra four inputs.

Key Notes:

  • Always use Open long sound file to ensure compatibility with the system.

  • Must manually extract the segment within the script using the begin` and end inputs.

  • The padding field allows flexibility by extending the actual begin and end times of the segment (default is 0).

  • Channel indexing starts at 0 in the system, so increment by 1 for use in Praat (Praat uses 1-based indexing).

Output Requirements:

  • Print results to the Praat Info window in this format:
    • The first line contains space-separated column names (property names to be saved in the database).

    • The second line contains space-separated measurements for each property.

An example of the Praat output:

peak slope cog spread
5540.7376 24.3507 6744.0670 1562.1936

Output format if you are only taking one measure:

cog
6013.9

To run analyze_script, follow these steps:

  1. (Optional) Encode a subset for the annotation type you want to analyze.

  2. Call analyze_script with the annotation type, the subset name and the path to your script.

with CorpusContext(config) as c:
    # Defines a subset of phones called "sibilant"
    c.encode_type_subset('phone', ['S', 'Z', 'SH', 'ZH'], 'sibilant')

    # Uses a praat script that takes as input a filename and begin/end time, and outputs measures we'd like to take for sibilants
    # The analyze_script call then applies this script to every phone of type "sibilant" in the corpus.
    c.analyze_script(subset='sibilant', annotation_type="phone", script_path='path/to/script/sibilant.praat')

analyze_track_script

This function shares the same input formats and functionality as analyze_script. However, analyze_track_script is specifically designed for continuous measurements. Before using this functionality, you must add utterance encoding. When calling the API, you will need to specify an annotation type (e.g., phone, syllable, or word) to perform the analysis. The script will then run separately for each instance of the selected annotation type in a multiprocessing manner.

Output Requirements:

  • Print results to the Praat Info window in the following format:
    • The first line begins with time, followed by space-separated column names.

    • Subsequent lines contain timestamps and measurements for each property.

Example output:

time    H1_A1  H1_A2  H1_A3  H1_H2
0.242   1.378   -4.326  14.369  8.522
0.277   -3.169  -10.276 9.383   3.002
0.312   -0.217  -4.195  3.497   7.215
with CorpusContext(config) as c:
    script_path = 'voice_quality.praat'
    c.config.praat_path = '/path/to/your/praat/executable'
    props = [('H1_H2', float), ('H1_A1',float), ('H1_A2',float), ('H1_A3',float)]
    c.analyze_track_script('voice_quality', props, script_path, annotation_type='phone')

A detailed example of using this functionality for voice quality analysis, along with a sample Praat script, is provided in the tutorial. See (Tutorial 6: Custom Script) for more details.

Encoding acoustic tracks from CSV

Sometimes you may want to use external software to generate measurement tracks. Examples include:

  • F0 (pitch) tracks computed by an external library, across entire files

  • Voice quality tracks for each vowel, computed using VoiceSauce

  • Vowel formant tracks, e.g. using FastTrack.

If you have generated tracks using other software, you can import them into PolyglotDB using the functions save_track_from_csvs and save_track_from_csv as long as the files follow the expected structure.

CSV Format:

time, measurement1, measurement2, measurement3, ...

Additionally, the file name should match the name of the discourse for which the track should be saved.

Calling the function save_track_from_csv with the file path will save the track. You must also provide a list of the columns that the system should read. It is assumed that all columns are of type float.

To load multiple CSV files at once, pass a directory path to save_track_from_csvs.

Example (FastTrack output):

images/fasttrack_csvoutput.png

To load all the measures from the generated tracks:

with CorpusContext(config) as c:
    # loading one file
    c.save_track_from_csv('formants', '/path/to/csv', ['f1','b1','f2','b2','f3','b3','f1p','f2p','f3p','f0','intensity','harmonicity'])
    # loading multiple csv files
    c.save_track_from_csvs('formants', '/path/to/directory', ['f1','b1','f2','b2','f3','b3','f1p','f2p','f3p','f0','intensity','harmonicity'])

Encoding acoustic track statistics

After encoding an acoustic track measurement—either through the built-in algorithms or custom Praat scripts— you can perform statistical aggregation on these data tracks. The supported statistical measures are: mean, median, standard deviation (stddev), sum, mode, and count.

Aggregation can be performed on a specified annotation type, such as phones, words, or syllables (if syllable encoding is available). The aggregation is conducted for all annotations with the same label.

Aggregation can be performed by speaker, in which case the results will be grouped by speaker, and each (annotation_label, speaker) pair will have its corresponding statistical measure computed.

Once encoded, the computed statistics are stored and can be queried later.

with CorpusContext(config) as c:
    # Encode a statistic for an acoustic measure
    c.encode_acoustic_statistic('voice_quality', 'mean', by_annotation='phone', by_speaker=True)

    # Alternatively, call the get function directly; it will encode the statistic if not already available
    results = c.get_acoustic_statistic('voice_quality', 'mean', by_annotation='phone', by_speaker=True)
    # This would compute, save, and return the mean values for all voice quality measurements on a by speaker and by phone basis.
    # for example ('speaker1', 'AO1'): [1.4283178345991416, 5.21375241700153, 28.8672225446156, 18.57861883658481]