InfluxDB implementation
This section details how PolyglotDB saves and structures data within InfluxDB. InfluxDB is a NoSQL time series database, with a SQL-like query language.
Note
This section assumes a bit of familiarity with the InfluxDB query language, which is largely based on SQL. See the InfluxDB documentation for more details and reference to other aspects of InfluxDB.
InfluxDB Schema
Each measurement encoded (i.e., pitch, intensity, formants) will have a separate table in InfluxDB, similar to SQL. When querying, columns are selected from a table (i.e., select * from "pitch"). Each row in InfluxDB minimally has a time field, as it is a time series database. In addition, each row will have queryable fields and tags, in InfluxDB parlance. Tags effectively split the data into separately indexed series, speeding up queries, while fields are simply stored values that are not indexed.
All InfluxDB tables will have three tags (these create different indices for the database and speed up queries): speaker, discourse, and channel. The combination of discourse (i.e., file name) and channel (usually 0, particularly for mono sound files) along with the time in seconds will always give a unique acoustic time point, and indexing by speaker is crucial for PolyglotDB's algorithms.
Note
The time resolution for PolyglotDB is at the millisecond level. In general, measurements every 10 ms are a reasonable balance of time resolution for acoustic measures. Increasing the time resolution will also increase the processing time for PolyglotDB algorithms, as well as the database size. Time resolution is generally a property of the analyses done, so a resolution finer than 10 ms is possible, but not finer than 1 ms, as millisecond time resolution is hardcoded in the current code. Any time point will be rounded/truncated to the nearest millisecond.
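For example (a hypothetical one-liner for illustration, not PolyglotDB's actual code), a raw time stamp of 2.71828 seconds would be stored only to millisecond precision:

# Illustration only: sub-millisecond detail is discarded when points are stored.
stored_time = round(2.71828, 3)  # 2.718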
In addition to these tags, there are several queryable fields which are always present in addition to the measurement fields. First, the phone for the time point is saved to allow for efficient aggregation across phones. Second, the utterance_id for the time point is also saved. The utterance_id is used for general querying, where each utterance's track for the requested acoustic property is queried once and then cached for any further results to use without needing to query InfluxDB again. For instance, a query on phone formant tracks might return 2000 phones. Without the utterance_id, there would be 2000 look-ups for formant tracks (each InfluxDB query takes about 0.15 seconds), but with the utterance-based caching, the number of hits to the InfluxDB database is a fraction of that (though each query takes a little longer).
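The caching strategy can be sketched roughly as follows; the helper function and the argument order shown for get_utterance_acoustics are assumptions for illustration, not PolyglotDB's actual internals:

# Sketch of utterance-level caching (hypothetical helper, not PolyglotDB's code).
_utterance_cache = {}

def cached_utterance_track(c, acoustic_name, utterance_id, discourse, speaker):
    # Query InfluxDB once per utterance; later results from the same utterance
    # reuse the cached track instead of hitting the database again.
    key = (acoustic_name, utterance_id)
    if key not in _utterance_cache:
        _utterance_cache[key] = c.get_utterance_acoustics(
            acoustic_name, utterance_id, discourse, speaker)
    return _utterance_cache[key]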
Note
For performance reasons internal to InfluxDB, phone and utterance_id are fields rather than tags, because crossing them with speaker, discourse, and channel would lead to an extremely large number of possible tag combinations. This mix of tags and fields has been found to be the most performant.
Finally, there are the actual measurements that are saved. Each acoustic track (i.e., pitch, formants, intensity) can have multiple measurements. For instance, a formants track can have F1, F2, F3, B1, B2, and B3, which are all stored together on the same time point and accessed at the same time. These measures are kept in the corpus hierarchy in Neo4j. Each measurement track (i.e., pitch) will be a node linked to the corpus (see the example in Corpus hierarchy representation). That node will have each property listed along with its data type (i.e., F0 is a float).
Optimizations for acoustic measures
PolyglotDB has default functions for generating pitch, intensity, and formants tracks (see Reference functions for specific examples and Saving acoustics for more details on how they are implemented). For implementing future built-in acoustic track analysis functions, one avenue of optimization lies in the differently sampled files that PolyglotDB generates. On import, three files are generated per discourse, at 1,200 Hz, 11,000 Hz, and 16,000 Hz. The intended purpose of these files is acoustic analysis of different kinds of segments/measurements. The file at 1,200 Hz is ideal for pitch analysis (maximum pitch of 600 Hz), and the file at 11,000 Hz is ideal for formant analysis (maximum formant frequency of 5,500 Hz). The file at 16,000 Hz is intended for consonantal analysis (i.e., fricative spectral analysis) or any other algorithm requiring higher frequency information. These three files are generated because analysis functions generally include resampling to these frequencies as part of the analysis, so performing it ahead of time can speed up the analysis. Some programs also do not include resampling themselves (i.e., pitch estimation in REAPER), so using the appropriate file can lead to massive speed-ups.
Query implementation
Given a PolyglotDB query like the following:
from polyglotdb import CorpusContext

with CorpusContext('corpus') as c:
    q = c.query_graph(c.word)
    q = q.filter(c.word.label == 'some_word')
    q = q.columns(c.word.label.column_name('word'), c.word.pitch.track)
    results = q.all()
Once the Cypher query completes and returns results for a matching word, that information is used to create an InfluxDB query. The inclusion of an acoustic column like the pitch track also ensures that necessary information, such as the utterance ID and the begin and end time points of the word, is returned. The above query would result in several queries like the following being run:
SELECT "time", "F0" from "pitch"
WHERE "discourse" = 'some_discourse'
AND "utterance_id" = 'some_utterance_id'
AND "speaker" = 'some_speaker'
The above query will get all pitch points for the utterance of the word in question, and create Python objects for the track (polyglotdb.acoustics.classes.Track) and each time point (polyglotdb.acoustics.classes.TimePoint). With the begin and end properties of the word, a slice of the track is added to the output row.
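That slicing step can be pictured as follows (a sketch; the point attribute name is an assumption based on the description above):

def slice_track(utterance_track, begin, end):
    # Keep only the time points of the cached utterance track that fall
    # within the word's begin and end times.
    return [point for point in utterance_track if begin <= point.time <= end]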
Aggregation
Unlike aggregation of properties in the Neo4j database (see Aggregation functions), aggregation of acoustic properties occurs in Python rather than in a query to InfluxDB, for the same performance reasons as above. By caching utterance tracks as needed, and then performing aggregation over the necessary slices (i.e., words or phones), the overall query is much faster.
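A minimal sketch of such Python-side aggregation (the data layout of the track slice is assumed for illustration):

import statistics

def mean_f0(track_slice):
    # Aggregate over a cached track slice in Python rather than in an
    # InfluxDB query; each point is assumed to map measure names to values.
    values = [p['F0'] for p in track_slice if p.get('F0') is not None]
    return statistics.mean(values) if values else None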
Low level implementation
Saving acoustics
The general pipeline for generating and saving acoustic measures is as follows:
1. Acoustic analysis using Conch's analysis functions
2. Format output from Conch into InfluxDB format and fill in any needed information (phone labels)
3. Write points to InfluxDB
4. Update the Corpus hierarchy with information about acoustic properties
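Running one of the built-in analysis functions triggers this whole pipeline; for example (a sketch assuming the default pitch analysis function referenced under Reference functions):

from polyglotdb import CorpusContext

with CorpusContext('corpus') as c:
    # Runs Conch analysis, formats the output, writes points to InfluxDB,
    # and updates the corpus hierarchy with the new acoustic property.
    c.analyze_pitch()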
Acoustic analysis is first performed in Conch, a Python package for processing sound files into acoustic and auditory representations. To do so, segments are created in PolyglotDB through calls to polyglotdb.acoustics.segments.generate_segments() and related functions. The generated SegmentMapping object from Conch is an iterable of Segment objects. Each Segment minimally has a path to a sound file, the begin time stamp, the end time stamp, and the channel. With these four pieces of information, the waveform signal can be extracted and acoustic analysis can be performed. Segment objects can also have other properties associated with them, so that the SegmentMapping can be grouped into sensible bits of analysis (SegmentMapping.grouped_mapping()). This is done in PolyglotDB to split analysis by speakers, for instance.
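A sketch of this step, based on the pattern described above (the exact helper and its keyword arguments are assumptions and may differ from the current API):

from polyglotdb import CorpusContext
from polyglotdb.acoustics.segments import generate_utterance_segments

with CorpusContext('corpus') as c:
    # Build a SegmentMapping over utterances, then group it so that each
    # speaker's segments can be analysed separately.
    segment_mapping = generate_utterance_segments(c)
    grouped = segment_mapping.grouped_mapping('speaker')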
A SegmentMapping, or the mappings returned by grouped_mapping(), can then be passed to analyze_segments, which in addition to a SegmentMapping takes a callable that accepts the minimal set of arguments above (file path, begin, end, and channel) and returns some sort of track or point measure from the signal segment. See Reference functions below for a list of generator functions that return a callable to be used with analyze_segments. The analyze_segments function uses multiprocessing to apply the callable to each segment, giving speed-ups proportional to the number of available cores on the machine.
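The shape of such a callable can be sketched with a trivial measure (this assumes, per the description above, that the callable receives the file path, begin, end, and channel, and that analyze_segments is importable from the conch package):

from conch import analyze_segments

def trivial_measure(file_path, begin, end, channel):
    # Stand-in for a real analysis function: a real one would read the signal
    # between begin and end from the given file/channel and return a track
    # (e.g., time -> F0) or a point measure.
    return {'duration': end - begin}

# segment_mapping as built in the earlier sketch; applied across processes.
output = analyze_segments(segment_mapping, trivial_measure)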
Once the Conch analysis function completes, the tracks are saved via polyglotdb.corpus.AudioContext.save_acoustic_tracks(). In addition to the discourse, speaker, channel, and utterance_id, phone label information is also added to each time point's measurements. These points are then saved using the write_points function of the InfluxDBClient, returned from the acoustic_client() function.
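A condensed sketch of that final write step (points is assumed to be a list of dicts shaped like the formants example in InfluxDB Schema; the batch size is illustrative):

# c is a CorpusContext; acoustic_client() returns a configured InfluxDBClient.
client = c.acoustic_client()
client.write_points(points, batch_size=1000)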
Reference functions
Hard-coded functions for saving acoustics are:
Additionally, there are point measure acoustic analysis functions that don't involve InfluxDB (point measures are saved as Neo4j properties):
Generator functions for Conch analysis:
Querying acoustics
In general, the pipeline for querying is as follows:
1. Construct an InfluxDB query string from function arguments
2. Pass this query string to an InfluxDBClient
3. Iterate over the results and construct a polyglotdb.acoustics.classes.Track object
All audio functions, and hence all interaction with InfluxDB, are handled through the polyglotdb.corpus.AudioContext parent class of the CorpusContext. Any constructed InfluxDB queries are executed through an InfluxDBClient, constructed in the polyglotdb.corpus.AudioContext.acoustic_client() function, which uses the InfluxDB connection parameters from the CorpusContext. As an example, see polyglotdb.corpus.AudioContext.get_utterance_acoustics(). First, an InfluxDB client is constructed; then a query string is formatted from the relevant arguments passed to get_utterance_acoustics and the relevant property names for the acoustic measure (i.e., F1, F2, and F3 for formants; see InfluxDB Schema for more details). This query string is then run via the query method of the InfluxDBClient. The results are iterated over, and a polyglotdb.acoustics.classes.Track object is constructed from the results and then returned.
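As a usage sketch (the argument order shown for get_utterance_acoustics is an assumption; see the API documentation for the exact signature):

from polyglotdb import CorpusContext

with CorpusContext('corpus') as c:
    # Returns a polyglotdb.acoustics.classes.Track for one utterance's pitch.
    track = c.get_utterance_acoustics('pitch', 'some_utterance_id',
                                      'some_discourse', 'some_speaker')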
Reference functions
polyglotdb.corpus.AudioContext.get_utterance_acoustics()
polyglotdb.corpus.AudioContext.get_acoustic_measure()