InfluxDB implementation

This section details how PolyglotDB saves and structures data within InfluxDB. InfluxDB is a NoSQL time series database, with a SQL-like query language.


This section assumes a bit of familiarity with the InfluxDB query language, which is largely based on SQL. See the InfluxDB documentation for more details and reference to other aspects of InfluxDB.

InfluxDB Schema

Each measurement encoded (i.e., pitch, intensity, formants) will have a separate table in InfluxDB, similar to SQL. When querying, the query will select columns from a a table (i.e., select * from "pitch"). Each row in InfluxDB minimally has a time field, as it is a time series database. In addition, each row will have queryable fields and tags, in InfluxDB parlance. Tags can function as separate tables, speeding up queries, while fields are simply values that are indexed. All InfluxDB tables will have three tags (these create different indices for the database and speed up queries) for speaker, discourse, and channel. The union of discourse (i.e., file name) and channel (usually 0, particularly for mono sound files) along with the time in seconds will always give a unique acoustic time point, and indexing by speaker is crucial for PolyglotDB’s algorithms.


The time resolution for PolyglotDB is at the millisecond level. In general, I think having measurements every 10ms is a balanced time resolution for acoustic measures. Increasing the time resolution will also increase the processing time for PolyglotDB algorithms, as well as the database size. Time resolution is generally a property of the analyses done, so greater time resolution than 10 ms is possible, but not greater than 1 ms, as millisecond time resolution is hardcoded in the current code. Any time point will be rounded/truncated to the nearest millisecond.

In addition to these tags, there are several queryable fields which are always present in addition to the measurement fields. First, the phone for the time point is saved to allow for efficient aggregation across phones. Second, the utterance_id for the time point is also saved. The utterance_id is used for general querying, where each utterance’s track for the requested acoustic property is queried once and then cached for any further results to use without needing to query the InfluxDB again. For instance, a query on phone formant tracks might return 2000 phones. Without the utterance_id, there would be 2000 look ups for formant tracks (each InfluxDB query would take about 0.15 seconds), but using the utterance-based caching, the number of hits to the InfluxDB database would be a fraction (though the queries themselves would take a little bit longer).


For performance reasons internal to InfluxDB, phone and utterance_id are fields rather than tags, because the cross of them with speaker, discourse, and channel would lead to an extremely large cross of possible tag combinations. This mix of tags and fields has been found to be the most performant.

Finally, there are the actual measurements that are saved. Each acoustic track (i.e., pitch, formants, intensity) can have multiple measurements. For instance, a formants track can have F1, F2, F3, B1, B2, and B3, which are all stored together on the same time point and accessed at the same time. These measures are kept in the corpus hierarchy in Neo4j. Each measurement track (i.e. pitch) will be a node linked to the corpus (see the example in Corpus hierarchy representation). That node will have each property listed along with its data type (i.e., F0 is a float).

Optimizations for acoustic measures

PolyglotDB has default functions for generating pitch, intensity, and formants tracks (see Reference functions for specific examples and Saving acoustics for more details on how they are implemented). For implementing future built in acoustic track analysis functions, one realm of optimization lays in the differently sampled files that PolyglotDB generates. On import, three files are generated per discourse at 1,200Hz, 11,000Hz, and 16,000Hz. The intended purpose of these files are for acoustic analysis of different kinds of segments/measurements. The file at 1,200Hz is ideal for pitch analysis (maximum pitch of 600Hz), the file at 11,000Hz is ideal for formant analysis (maximum formant frequency of 5,500Hz). The file at 16,000Hz is intended for consonantal analysis (i.e., fricative spectral analysis) or any other algorithm requiring higher frequency information. The reason these three files are generated is that analysis functions generally include the resampling to these frequencies as part of the analysis, so performing it ahead of time can speed up the analysis. Some programs also don’t necessarily include resampling (i.e., pitch estimation in REAPER), so using the appropriate file can lead to massive speed ups.

Query implementation

Given a PolyglotDB query like the following:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.label == 'some_word')
    q = q.columns(c.word.label.column_name('word'), c.word.pitch.track)
    results = q.all()

Once the Cypher query completes and returns results for a matching word, that information is used to create an InfluxDB query. The inclusion of an acoustic column like the pitch track also ensures that necessary information like the utterance ID and begin and end time points of the word are returned. The above query would result in several queries like the following being run:

SELECT "time", "F0" from "pitch"
WHERE "discourse" = 'some_discourse'
AND "utterance_id" = 'some_utterance_id'
AND "speaker" = 'some_speaker'

The above query will get all pitch points for the utterance of the word in question, and create Python objects for the track (polyglotdb.acoustics.classes.Track) and each time point (polyglotdb.acoustics.classes.TimePoint). With the begin and end properties of the word, a slice of the track is added to the output row.


Unlike for aggregation of properties in the Neo4j database (see Aggregation functions), aggregation of acoustic properties occurs in Python rather than being implemented in a query to InfluxDB, for the same performance reasons above. By caching utterance tracks as needed, and then performing aggregation over necessary slices (i.e., words or phones), the overall query is much faster.

Low level implementation

Saving acoustics

The general pipeline for generating and saving acoustic measures is as follows:

  • Acoustic analysis using Conch’s analysis functions
  • Format output from Conch into InfluxDB format and fill in any needed information (phone labels)
  • Write points to InfluxDB
  • Update the Corpus hierarchy with information about acoustic properties

Acoustic analysis is first performed in Conch, a Python package for processing sound files into acoustic and auditory representations. To do so, segments are created in PolyglotDB through calls to polyglotdb.acoustics.segments.generate_segments() and related functions. The generated SegmentMapping object from Conch is an iterable of Segment objects. Each Segment minimally has a path to a sound file, the begin time stamp, the end time stamp, and the channel. With these four pieces of information, the waveform signal can be extracted and acoustic analysis can be performed. Segment objects can also have other properties associated with them, so that the SegmentMapping can be grouped into sensible bits of analysis (SegmentMapping.grouped_mapping(). This is done in PolyglotDB to split analysis by speakers, for instance.

SegmentMapping and those returned by the grouped_mapping can then be passed to analyze_segments, which in addition to a SegmentMapping take a callable function that takes the minimal set of arguments above (file path, begin, end, and channel) and return some sort of track or point measure from the signal segment. Below for a list of generator functions that return a callable to be used with analyze_segments. The analyze_segments function uses multiprocessing to apply the callable function to each segment, allowing for speed ups for the number of available cores on the machine.

Once the Conch analysis function completes, the tracks are saved via polyglotdb.corpus.AudioContext.save_acoustic_tracks(). In addition to the discourse, speaker, channel, and utterance_id, phone label information is also added to each time point’s measurements. These points are then saved using the write_points function of the InfluxDBClient, returned from the acoustic_client() function.

Querying acoustics

In general, the pipeline for querying is as follows:

  • Construct InfluxDB query string from function arguments
  • Pass this query string to an InfluxDBClient
  • Iterate over results and construct a polyglotdb.acoustics.classes.Track object

All audio functions, and hence all interface with InfluxDB, is handled through the polyglotdb.corpus.AudioContext parent class for the CorpusContext. Any constructed InfluxDB queries will get executed through an InfluxDBClient, constructed in the polyglotdb.corpus.AudioContext.acoustic_client() function, which uses the InfluxDB connection parameters from the CorpusContext. As an example, see polyglotdb.corpus.AudioContext.get_utterance_acoustics. First, a InfluxDB client is constructed, then a query string is formatted from the relevant arguments passed to get_utterance_acoustics, and the relevant property names for the acoustic measure (i.e., F1, F2 and F3 for formants, see InfluxDB Schema for more details). This query string is then run via the query method of the InfluxDBClient. The results are iterated over and a polyglotdb.acoustics.classes.Track object is constructed from the results and then returned.