Querying annotations
The main way of finding specific annotations is through the query_graph method of CorpusContext objects.
from polyglotdb import CorpusContext
with CorpusContext(config) as c:
    # Find every word annotation whose label is 'are'
    q = c.query_graph(c.word).filter(c.word.label == 'are')
    results = q.all()
    print(results)
The above code will find and print all instances of word annotations that are labeled with ‘are’. The method query_graph takes one argument, which is an attribute of the context manager corresponding to the name of the annotation type.
The primary function for queries is filter. This function takes one or more conditional expressions on attributes of annotations. In the above example, word annotations have an attribute label which corresponds to the orthography.
Conditional expressions can use any of the standard Python comparison operators (==, !=, <, <=, >, >=). The Python operator in does not work; a special pattern has to be used:
with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is', 'am']))
    results = q.all()
    print(results)
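One likely reason a dedicated in_ method is needed is that Python always converts the result of the in operator to a plain bool, so a library cannot return a query expression from it, whereas == can be overloaded to return anything. The sketch below shows this with a toy Attribute class; the class and its tuple-based expressions are hypothetical illustrations, not PolyglotDB's implementation.

```python
class Attribute:
    """Toy stand-in for an annotation attribute such as c.word.label.

    Hypothetical illustration only; this is not PolyglotDB's implementation.
    """

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Overloaded == returns an expression object instead of a bool.
        return ("==", self.name, other)

    def in_(self, values):
        # An explicit method can return an expression describing membership.
        return ("in", self.name, list(values))


label = Attribute("label")

eq_expr = label == "are"                  # an expression tuple
in_result = label in ["are", "is", "am"]  # Python forces this to a bool
in_expr = label.in_(["are", "is", "am"])  # an expression tuple

print(eq_expr)
print(in_result)
print(in_expr)
```

The plain in test collapses to a bare truth value, so the list of labels never reaches the query builder, while in_ keeps it available as data.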
The in_ conditional function can take any iterable, including another query:
with CorpusContext(config) as c:
    sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is', 'am']))
    q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
    results = q.all()
    print(results)
In this case, the query will find all phone annotations that belong to the listed words. Using the id attribute makes the filter match on unique identifiers. In this particular instance it does not matter, but it does in the following:
with CorpusContext(config) as c:
    sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is', 'am']))
    sub_q = sub_q.filter_right_aligned(c.word.line)
    q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q))
    results = q.all()
    print(results)
The above query will find all instances of the three words, but only where they are right-aligned with a line annotation.
Note
Queries are lazily evaluated. In the above example, sub_q is not evaluated until q.all() is called. This means that filters can be chained across multiple lines without a performance hit.
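To make the idea of lazy evaluation concrete, here is a minimal sketch of a query object that only records its filters and scans the data when all() is called. The LazyQuery class, its log attribute, and the sample rows are all hypothetical; this models the behaviour described in the note, not how PolyglotDB builds its database queries.

```python
class LazyQuery:
    """Toy lazily evaluated query over a list of dicts (illustrative only)."""

    def __init__(self, rows, predicates=(), log=None):
        self._rows = rows
        self._predicates = tuple(predicates)
        # Shared log records every time the data is actually scanned.
        self.log = log if log is not None else []

    def filter(self, predicate):
        # Only records the predicate; no rows are examined here.
        return LazyQuery(self._rows, self._predicates + (predicate,), self.log)

    def all(self):
        # Evaluation happens here, applying every recorded predicate at once.
        self.log.append("evaluated")
        return [r for r in self._rows if all(p(r) for p in self._predicates)]


words = [
    {"label": "are", "right_aligned": True},
    {"label": "is", "right_aligned": False},
    {"label": "dogs", "right_aligned": True},
]

q = LazyQuery(words)
q = q.filter(lambda w: w["label"] in ("are", "is", "am"))
q = q.filter(lambda w: w["right_aligned"])
evaluations_before = len(q.log)  # nothing has been scanned yet
results = q.all()
evaluations_after = len(q.log)
print(evaluations_before, evaluations_after, results)
```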
Following and previous annotations
Filters can reference the surrounding local context. For instance:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    results = q.all()
    print(results)
The above query will find all the ‘aa’ phones that are followed by an ‘r’ phone. Similarly, c.phone.previous would provide access to filtering on preceding phones.
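As a rough mental model of what following and previous refer to, the sketch below pairs each phone in a plain Python list with its neighbours and keeps the ‘aa’ phones followed by ‘r’. The phone sequence is invented for illustration, and the real query runs in the database rather than over a list.

```python
# Invented phone sequence for illustration.
phones = ["d", "aa", "r", "k", "aa", "g", "aa", "r"]

# Pair each phone with its neighbours; None marks the edges of the sequence.
previous = [None] + phones[:-1]
following = phones[1:] + [None]

matches = [
    (i, prev, cur, nxt)
    for i, (prev, cur, nxt) in enumerate(zip(previous, phones, following))
    if cur == "aa" and nxt == "r"
]
print(matches)
```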
Subsetting annotations
In linguistics, it’s often useful to specify subsets of symbols as particular classes. For instance, phonemes are grouped together by whether they are syllabic, their manner/place of articulation, and vowel height/backness/rounding, and words are grouped by their parts of speech.
Suppose a subset has been created as in Subset enrichment, so that the phones ‘aa’ and ‘ih’ have been marked as syllabic. Once this category is encoded in the database, it can be used in filters.
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.subset == 'syllabic')
    results = q.all()
    print(results)
Note
The results returned by the above query will be identical to the similar query:
with CorpusContext('corpus') as c:
    q = c.query_graph(c.phone)
    q = q.filter(c.phone.label.in_(['aa', 'ih']))
    results = q.all()
    print(results)
The primary benefits of using subsets are performance based due to the inner workings of Neo4j. See Neo4j implementation for more details.
Another way to specify subsets is on the phone annotations themselves, as follows:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone.filter_by_subset('syllabic'))
    results = q.all()
    print(results)
Both of these queries are identical and will return all instances of ‘aa’ and ‘ih’ phones. The benefit of filter_by_subset is generally for use in Hierarchical queries.
Note
Repeating the same subset throughout a query can make it overly verbose. The objects that the queries use are normal Python objects and can therefore be assigned to variables for easier use.
with CorpusContext(config) as c:
    syl = c.phone.filter_by_subset('syllabic')
    q = c.query_graph(syl)
    q = q.filter(syl.end == syl.word.end)
    results = q.all()
    print(results)
The above query would find all phones in the syllabic subset that are at the ends of words.
Hierarchical queries
A key facet of language is that it is hierarchical. Words contain phones, and can be contained in larger utterances. There are several ways to query hierarchical information. If we want to find all aa phones in the word dogs, then we can perform the following query:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.word.label == 'dogs')
    results = q.all()
    print(results)
Starting from the word level, we might want to know what phones each word contains.
with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.label.column_name('phones'))
    results = q.all()
    print(results)
In the output of the above query, there would be a column labeled phones that contains a list of the labels of the phones that belong to each word (for example, ['d', 'aa', 'g', 'z'] for dogs). Any property of phones can be queried this way (e.g., begin, end, duration).
Going down the hierarchy, we can also find all words that contain a certain phone.
with CorpusContext(config) as c:
    q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is', 'am']))
    q = q.filter(c.word.phone.label == 'aa')
    results = q.all()
    print(results)
In this example, it will find all instances of the three words that contain an aa phone.
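The containment filter can be pictured as a membership test on each word's list of phones. The following sketch applies the same two conditions to a handful of in-memory words; the transcriptions are made up for illustration and are not taken from any corpus.

```python
# Invented words and transcriptions, for illustration only.
words = [
    {"label": "are", "phones": ["aa", "r"]},
    {"label": "is", "phones": ["ih", "z"]},
    {"label": "am", "phones": ["ae", "m"]},
    {"label": "dogs", "phones": ["d", "aa", "g", "z"]},
]
targets = {"are", "is", "am"}

# Keep words whose label is one of the targets and that contain an 'aa' phone.
matches = [
    w["label"]
    for w in words
    if w["label"] in targets and "aa" in w["phones"]
]
print(matches)
```

The word dogs contains an aa phone but is excluded by the label filter, which mirrors how the two chained filters combine.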
Special keywords exist for these containment columns. The keyword rate will return the elements per second for the word (i.e., phones per second). The keyword count will return the number of elements.
with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)
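As a worked example of what these two columns represent, the sketch below computes them by hand for a single word. It assumes that rate is the number of contained elements divided by the duration of the containing word, which is one reading of "elements per second"; the time values are invented.

```python
# Invented timings for one token of the word 'dogs' (seconds).
word = {"label": "dogs", "begin": 1.0, "end": 1.5}
phones = [
    ("d", 1.0, 1.1),
    ("aa", 1.1, 1.3),
    ("g", 1.3, 1.4),
    ("z", 1.4, 1.5),
]

# count: number of phones contained in the word.
num_phones = len(phones)

# rate (assumed definition): contained elements per second of the word.
word_duration = word["end"] - word["begin"]
phones_per_second = num_phones / word_duration

print(num_phones, phones_per_second)
```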
These keywords can also leverage subsets, as above:
with CorpusContext(config) as c:
    q = c.query_graph(c.word)
    q = q.columns(c.word.phone.rate.column_name('phones_per_second'))
    q = q.columns(c.word.phone.filter_by_subset('syllabic').count.column_name('num_syllabic_phones'))
    q = q.columns(c.word.phone.count.column_name('num_phones'))
    results = q.all()
    print(results)
Additionally, a special keyword can be used to query the position of a contained element within the element that contains it.
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.word.label == 'dogs')
    q = q.columns(c.word.phone.position.column_name('position_in_word'))
    results = q.all()
    print(results)
The above query should return 2 for the value of position_in_word, as the aa phone would be the second phone.
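The value 2 implies that positions are counted from 1 rather than 0. A small sketch of that numbering, using the phone labels of dogs given earlier:

```python
# Phone labels of the word 'dogs', in order.
word_phones = ["d", "aa", "g", "z"]

# Number the phones starting from 1, matching the expected result above.
positions = {label: i for i, label in enumerate(word_phones, start=1)}
position_in_word = positions["aa"]

print(position_in_word)
```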
Subannotation queries
Annotations can have subannotations associated with them. Subannotations are not independent linguistic types, but have more information associated with them than just a single property. For instance, voice onset time (VOT) would be a subannotation of stops (as it has a begin time and an end time that are of interest). For more information on subannotations, see Subannotation enrichment. Querying such subannotations would be performed as follows:
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.vot.duration.column_name('vot'))
    results = q.all()
    print(results)
In some cases, it may be desirable to have more than one subannotation of the same type associated with a single annotation. For instance, voicing during the closure of a stop can take place at both the beginning and end of closure, with an unvoiced period in the middle. Using a similar query as above would get the durations of each of these (in the order of their begin time):
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    q = q.columns(c.phone.voicing_during_closure.duration.column_name('voicing'))
    results = q.all()
    print(results)
In some cases, we might like to know the total duration of such subannotations, rather than the individual durations. To query that information, we can use an aggregate:
from polyglotdb.query.base.func import Sum
with CorpusContext(config) as c:
    q = c.query_graph(c.phone)
    results = q.aggregate(Sum(c.phone.voicing_during_closure.duration).column_name('total_voicing'))
    print(results)
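The difference between the two subannotation queries can be illustrated with plain Python: the first corresponds to a list of individual durations ordered by begin time, the second to their sum. The intervals below are invented values for a single stop closure.

```python
import math

# Invented voicing intervals within one stop closure (seconds), unsorted.
voicing_during_closure = [
    {"begin": 0.5, "end": 0.5625},  # voicing at the end of the closure
    {"begin": 0.125, "end": 0.25},  # voicing at the start of the closure
]

# Individual durations, ordered by begin time.
ordered = sorted(voicing_during_closure, key=lambda v: v["begin"])
durations = [v["end"] - v["begin"] for v in ordered]

# Total duration, analogous to a Sum aggregate over the subannotations.
total_voicing = math.fsum(durations)

print(durations, total_voicing)
```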
Miscellaneous
Aggregates and groups
Aggregate functions are available in polyglotdb.query.base.func. The available functions are:
Average
Count
Max
Min
Stdev
Sum
In general, these functions take a numeric attribute as an argument. The only one that does not follow this pattern is Count.
from polyglotdb.query.base.func import Count
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Count())
    print(result)
Like the all function, aggregate triggers evaluation of the query. Instead of returning rows, it will return a single number, which is the number of rows matching this query.
from polyglotdb.query.base.func import Average
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label == 'r')
    result = q.aggregate(Average(c.phone.duration))
    print(result)
The above aggregate function will return the average duration for all ‘aa’ phones followed by ‘r’ phones.
Aggregates are particularly useful with grouping. For instance:
from polyglotdb.query.base.func import Average, Count
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r', 'l']))
    q = q.group_by(c.phone.following.label.column_name('following_label'))
    result = q.aggregate(Average(c.phone.duration), Count())
    print(result)
The above query will return the average duration and the count of ‘aa’ phones grouped by whether they’re followed by an ‘r’ or an ‘l’.
Note
In the above example, the group_by attribute is supplied with an alias for output. In the print statement and in the results, the column will be called ‘following_label’ instead of the default (more opaque) one.
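For readers who want to see what a grouped aggregate amounts to, the sketch below reproduces the same grouping, average, and count over a few in-memory rows using only the standard library. The rows and the output key names (average_duration, count) are invented for illustration; the shape of PolyglotDB's actual result may differ.

```python
from collections import defaultdict
from statistics import mean

# Invented 'aa' phone tokens with the label of the following phone.
rows = [
    {"following_label": "r", "duration": 0.25},
    {"following_label": "l", "duration": 0.5},
    {"following_label": "r", "duration": 0.75},
]

# Group durations by the following phone's label.
groups = defaultdict(list)
for row in rows:
    groups[row["following_label"]].append(row["duration"])

# One output row per group, ordered by the grouping key.
summary = [
    {"following_label": key, "average_duration": mean(vals), "count": len(vals)}
    for key, vals in sorted(groups.items())
]
print(summary)
```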
Ordering
The order_by function is used to provide an ordering to the results of a query.
with CorpusContext(config) as c:
    q = c.query_graph(c.phone).filter(c.phone.label == 'aa')
    q = q.filter(c.phone.following.label.in_(['r', 'l']))
    q = q.filter(c.phone.discourse == 'a_discourse')
    q = q.order_by(c.phone.begin)
    results = q.all()
    print(results)
The results for the above query will be ordered by the begin timepoint of each annotation. Ordering by time is most useful when looking at a single discourse (as including multiple discourses in a query would invalidate the ordering).
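A short sketch of why mixing discourses undermines a time ordering, assuming that begin times are measured from the start of each discourse: sorting on begin alone interleaves tokens from unrelated recordings, while restricting to one discourse gives a meaningful sequence. The second discourse name and all time values are invented.

```python
# Invented tokens; begin times are assumed to be relative to each discourse.
phones = [
    {"discourse": "a_discourse", "label": "aa", "begin": 2.5},
    {"discourse": "b_discourse", "label": "aa", "begin": 0.5},
    {"discourse": "a_discourse", "label": "aa", "begin": 1.0},
]

# Sorting across discourses interleaves unrelated timelines.
mixed = sorted(phones, key=lambda p: p["begin"])
mixed_discourses = [p["discourse"] for p in mixed]

# Restricting to one discourse first yields a coherent chronological order.
single = sorted(
    (p for p in phones if p["discourse"] == "a_discourse"),
    key=lambda p: p["begin"],
)
single_begins = [p["begin"] for p in single]

print(mixed_discourses)
print(single_begins)
```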
Note
In grouped aggregate queries, ordering is by default by the first group_by attribute. This can be changed by calling order_by before evaluating with aggregate.