Querying annotations

The main way of finding specific annotations is through the query_graph method of CorpusContext objects.

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label == 'are')
    results = q.all()
    print(results)

The above code will find and print all instances of word annotations that are labeled with ‘are’. The method query_graph takes one argument, which is an attribute of the context manager corresponding to the name of the annotation type.

The primary function for queries is filter. This function takes one or more conditional expressions on attributes of annotations. In the above example, word annotations have an attribute label which corresponds to the orthography.

Conditional expressions can use any of the standard Python comparison operators (==, !=, <, <=, >, >=). The Python operator in does not work; the in_ method has to be used instead:

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    results = q.all()
    print(results)
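
Why a dedicated in_ method is needed can be illustrated with a small sketch in plain Python. It assumes that comparisons on attributes are overloaded to return clause objects rather than booleans; the Clause and Attribute classes below are hypothetical stand-ins, not PolyglotDB's actual classes.

```python
class Clause:
    """A recorded comparison; a hypothetical stand-in for a filter clause."""

    def __init__(self, attribute, operator, value):
        self.attribute = attribute
        self.operator = operator
        self.value = value

    def __repr__(self):
        return f"Clause({self.attribute!r}, {self.operator!r}, {self.value!r})"


class Attribute:
    """A hypothetical annotation attribute that overloads comparisons."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Describe the comparison instead of answering True/False.
        return Clause(self.name, "==", other)

    def in_(self, values):
        # An explicit method is needed: the `in` operator always yields a bool.
        return Clause(self.name, "in", list(values))


label = Attribute("label")

eq_clause = label == "are"                  # a Clause object
in_result = label in ["are", "is", "am"]    # collapses to a plain bool
in_clause = label.in_(["are", "is", "am"])  # a Clause object

print(eq_clause)
print(in_result)
print(in_clause)
```

In this sketch the in operator produces a bare True, so the information about which labels were requested is lost before a filter could ever see it.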

The in_ conditional function can take any iterable, including another query:

with CorpusContext(config) as c:
    sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
    results = q.all()
    print(results)

In this case, it will find all phone annotations that are in the words listed. Using the id attribute will use unique identifiers for the filter. In this particular instance, it does not matter, but it does in the following:

with CorpusContext(config) as c:
    sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    sub_q = sub_q.filter_right_aligned(c.word.line)
    q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
    results = q.all()
    print(results)

The above query will find all instances of the three words, but only where they are right-aligned with a line annotation.

Note

Queries are lazily evaluated. In the above example, sub_q is not evaluated until q.all() is called. This means that filters can be chained across multiple lines without a performance hit.
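
The idea of lazy evaluation can be sketched without a database. The LazyQuery class below is a hypothetical toy, not PolyglotDB's query object: filter only records a condition, and the data is read once, when all is called.

```python
class LazyQuery:
    """A toy query that stores filters and applies them only in all()."""

    def __init__(self, rows):
        self.rows = rows
        self.predicates = []
        self.scans = 0  # number of times the data was actually read

    def filter(self, predicate):
        # Only remember the condition; no data is touched here.
        self.predicates.append(predicate)
        return self

    def all(self):
        # Evaluation happens here, in a single pass over the rows.
        self.scans += 1
        return [r for r in self.rows if all(p(r) for p in self.predicates)]


# Invented example rows
words = [
    {"label": "are", "begin": 0.0},
    {"label": "is", "begin": 0.5},
    {"label": "are", "begin": 1.2},
]

q = LazyQuery(words)
q = q.filter(lambda w: w["label"] == "are")
q = q.filter(lambda w: w["begin"] > 1.0)
scans_before = q.scans  # nothing has been evaluated yet
results = q.all()
print(scans_before, q.scans, results)
```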

Following and previous annotations

Filters can reference the surrounding local context. For instance:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    results = q.all()
    print(results)

The above query will find all the ‘aa’ phones that are followed by an ‘r’ phone. Similarly, c.phone.previous would provide access to filtering on preceding phones.
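
What following and previous select can be pictured on a plain list of phone labels. This is only an illustration with invented data; it does not use PolyglotDB.

```python
# An invented sequence of phone labels
phones = ["k", "aa", "r", "d", "aa", "g", "aa", "r"]

# Pair each phone with its successor; the last phone has no following phone.
aa_before_r = [
    i for i, (current, following) in enumerate(zip(phones, phones[1:]))
    if current == "aa" and following == "r"
]

# The same idea looking backwards: phones whose previous phone is 'aa'.
after_aa = [
    phones[i] for i in range(1, len(phones))
    if phones[i - 1] == "aa"
]

print(aa_before_r)
print(after_aa)
```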

Subsetting annotations

In linguistics, it’s often useful to specify subsets of symbols as particular classes. For instance, phonemes are grouped together by whether they are syllabic, their manner/place of articulation, and vowel height/backness/rounding, and words are grouped by their parts of speech.

Suppose a subset has been created as in Subset enrichment, so that the phones ‘aa’ and ‘ih’ have been marked as syllabic. Once this category is encoded in the database, it can be used in filters.

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'syllabic')
    results = q.all()
    print(results)

Note

The results returned by the above query will be identical to the similar query:

with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label.in_(['aa', 'ih']))
    results = q.all()
    print(results)

The primary benefit of using subsets is performance, due to the inner workings of Neo4j. See Neo4j implementation for more details.

Another way to specify subsets is on the phone annotations themselves, as follows:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone.filter_by_subset('syllabic'))
    results = q.all()
    print(results)

Both of these queries are identical and will return all instances of ‘aa’ and ‘ih’ phones. The benefit of filter_by_subset is generally for use in Hierarchical queries.

Note

Using the same subset repeatedly in a query can make it overly verbose. The objects that the queries use are normal Python objects and can therefore be assigned to variables for easier use.

with CorpusContext(config) as c:
    syl = c.phone.filter_by_subset('syllabic')
    q = c.query_graph(syl)
    q = q.filter(syl.end == syl.word.end)
    results = q.all()
    print(results)

The above query would find all phones in the 'syllabic' subset that are at the ends of words.
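
The condition in that filter can be illustrated in plain Python with invented timings. The sketch assumes that a phone is word-final exactly when its end time equals the end time of its word, which is what the filter above expresses.

```python
syllabic = {"aa", "ih"}  # the subset from the running example

# (label, phone end time, end time of the containing word); invented values
phones = [
    ("d", 0.10, 0.45),
    ("aa", 0.25, 0.45),
    ("g", 0.45, 0.45),
    ("s", 0.60, 0.80),
    ("ih", 0.80, 0.80),
]

word_final_syllabic = [
    label for label, phone_end, word_end in phones
    if label in syllabic and phone_end == word_end
]
print(word_final_syllabic)
```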

Hierarchical queries

A key facet of language is that it is hierarchical. Words contain phones, and can be contained in larger utterances. There are several ways to query hierarchical information. If we want to find all aa phones in the word dogs, then we can perform the following query:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.word.label == 'dogs')
    results = q.all()
    print(results)

Starting from the word level, we might want to know what phones each word contains.

with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.label.column_name('phones'))
    results = q.all()
    print(results)

In the output of the above query, there would be a column labeled phones that contains a list of the labels of phones that belong to the word (for example, ['d', 'aa', 'g', 'z']). Any property of phones can be queried this way (e.g., begin, end, duration).
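
The shape of such a result can be sketched with plain Python data. The example below uses invented timings and assumes, purely for illustration, that a phone belongs to a word when its time span lies inside the word's span; it says nothing about how PolyglotDB stores the hierarchy.

```python
# (label, begin, end) for words and phones; invented values
words = [("the", 0.0, 0.25), ("dogs", 0.25, 0.75)]
phones = [
    ("dh", 0.0, 0.125), ("ah", 0.125, 0.25),
    ("d", 0.25, 0.375), ("aa", 0.375, 0.5),
    ("g", 0.5, 0.625), ("z", 0.625, 0.75),
]

rows = [
    {
        "label": word,
        # Collect the labels of all phones contained in this word.
        "phones": [p for p, begin, end in phones
                   if word_begin <= begin and end <= word_end],
    }
    for word, word_begin, word_end in words
]
print(rows)
```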

Going down the hierarchy, we can also find all words that contain a certain phone.

with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am']))
    q = q.filter(c.word.phone.label == 'aa')
    results = q.all()
    print(results)

In this example, it will find all instances of the three words that contain an aa phone.

Special keywords exist for these containment columns. The keyword rate will return the elements per second for the word (i.e., phones per second). The keyword count will return the number of elements.

with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)
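
As a worked example of what these two keywords report, the arithmetic can be done by hand. The sketch assumes that rate is the element count divided by the duration of the containing word, following the "elements per second" description above; the timings are invented.

```python
# One word with its time span and contained phones (invented values)
word = {"label": "dogs", "begin": 1.0, "end": 1.5}
phones = ["d", "aa", "g", "z"]

num_phones = len(phones)                   # what count reports
duration = word["end"] - word["begin"]     # length of the word in seconds
phones_per_second = num_phones / duration  # what rate reports
print(num_phones, phones_per_second)
```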

These keywords can also leverage subsets, as above:

with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.filter_by_subset('syllabic').count.column_name('num_syllabic_phones'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)

Additionally, the special keyword position can be used to query the position of a contained element within its containing annotation.

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.word.label == 'dogs')
    q = q.columns(c.word.phone.position.column_name('position_in_word'))
    results = q.all()
    print(results)

The above query should return 2 for the value of position_in_word, as the aa phone would be the second phone.
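
The counting convention can be made explicit with a short plain-Python sketch. It assumes that positions are counted from 1, which matches the value 2 for the second phone.

```python
word_phones = ["d", "aa", "g", "z"]

# Number the phones from 1, so the first phone has position 1.
positions = {label: position
             for position, label in enumerate(word_phones, start=1)}
position_in_word = positions["aa"]
print(position_in_word)
```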

Subannotation queries

Annotations can have subannotations associated with them. Subannotations are not independent linguistic types, but have more information associated with them than just a single property. For instance, voice onset time (VOT) would be a subannotation of stops (as it has a begin time and an end time that are of interest). For more information on subannotations, see Subannotation enrichment. Querying such subannotations would be performed as follows:

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.vot.duration.column_name('vot'))
    results = q.all()
    print(results)

In some cases, it may be desirable to have more than one subannotation of the same type associated with a single annotation. For instance, voicing during the closure of a stop can take place at both the beginning and end of closure, with an unvoiced period in the middle. Using a similar query as above would get the durations of each of these (in the order of their begin time):

with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.voicing_during_closure.duration.column_name('voicing'))
    results = q.all()
    print(results)

In some cases, we might like to know the total duration of such subannotations, rather than the individual durations. To query that information, we can use an aggregate:

from polyglotdb.query.base.func import Sum
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    results = q.aggregate(Sum(c.phone.voicing_during_closure.duration).column_name('total_voicing'))
    print(results)
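
The difference between the individual durations and their total can be shown with invented numbers in plain Python. The intervals below are hypothetical begin and end times of two voicing periods within one stop closure.

```python
# (begin, end) of each voicing period; invented values
voicing_during_closure = [(0.75, 0.875), (0.25, 0.5)]

# Individual durations, ordered by begin time
durations = [end - begin for begin, end in sorted(voicing_during_closure)]

# The total that a sum over the durations would report
total_voicing = sum(durations)
print(durations, total_voicing)
```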

Miscellaneous

Aggregates and groups

Aggregate functions are available in polyglotdb.query.base.func. The following functions are provided:

  • Average
  • Count
  • Max
  • Min
  • Stdev
  • Sum

In general, these functions take a numeric attribute as an argument. The only one that does not follow this pattern is Count.

from polyglotdb.query.base.func import Count
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Count())
    print(result)

Like the all function, aggregate triggers evaluation of the query. Instead of returning rows, it will return a single number, which is the number of rows matching this query.

from polyglotdb.query.base.func import Average
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Average(c.phone.duration))
    print(result)

The above aggregate function will return the average duration for all ‘aa’ phones followed by ‘r’ phones.

Aggregates are particularly useful with grouping. For instance:

from polyglotdb.query.base.func import Average, Count
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r','l']))
    q = q.group_by(c.phone.following.label.column_name('following_label'))
    result = q.aggregate(Average(c.phone.duration), Count())
    print(result)

The above query will return the average duration and the count of ‘aa’ phones grouped by whether they’re followed by an ‘r’ or an ‘l’.
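
The computation behind a grouped aggregate can be sketched with the standard library. The durations below are invented, and the sketch only illustrates grouping followed by averaging and counting; it does not reproduce the format of PolyglotDB's results.

```python
from collections import defaultdict
from statistics import mean

# (following label, duration of the 'aa' phone); invented values
tokens = [("r", 0.125), ("l", 0.25), ("r", 0.375), ("l", 0.5), ("r", 0.25)]

# Group the durations by the label of the following phone.
groups = defaultdict(list)
for following_label, duration in tokens:
    groups[following_label].append(duration)

# One summary row per group: average duration and number of tokens.
summary = {
    label: {"average_duration": mean(durations), "count": len(durations)}
    for label, durations in groups.items()
}
print(summary)
```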

Note

In the above example, the group_by attribute is supplied with an alias for output. In the print statement and in the results, the column will be called ‘following_label’ instead of the default (more opaque) one.

Ordering

The order_by function is used to provide an ordering to the results of a query.

with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r','l']))
    q = q.filter(c.phone.discourse == 'a_discourse')
    q = q.order_by(c.phone.begin)
    results = q.all()
    print(results)

The results of the above query will be ordered by the begin time of each annotation. Ordering by time is most useful when looking at a single discourse, as including multiple discourses in a query would invalidate the ordering.
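
Why mixing discourses invalidates a time ordering can be seen in a small plain-Python sketch with invented rows: sorting by begin time alone interleaves annotations from unrelated recordings.

```python
# Invented rows from two discourses
rows = [
    {"discourse": "a_discourse", "begin": 0.5},
    {"discourse": "b_discourse", "begin": 0.2},
    {"discourse": "a_discourse", "begin": 0.1},
    {"discourse": "b_discourse", "begin": 0.9},
]

# Sorting everything by begin time mixes the two discourses together.
mixed = [r["discourse"] for r in sorted(rows, key=lambda r: r["begin"])]

# Restricting to one discourse first gives a meaningful timeline.
single = sorted(
    (r for r in rows if r["discourse"] == "a_discourse"),
    key=lambda r: r["begin"],
)
print(mixed)
print([r["begin"] for r in single])
```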

Note

In grouped aggregate queries, ordering is by default by the first group_by attribute. This can be changed by calling order_by before evaluating with aggregate.