.. _annotation_queries: ******************** Querying annotations ******************** The main way of finding specific annotations is through the :code:`query_graph` method of :code:`CorpusContext` objects. .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.word).filter(c.word.label == 'are') results = q.all() print(results) The above code will find and print all instances of :code:`word` annotations that are labeled with 'are'. The method :code:`query_graph` takes one argument, which is an attribute of the context manager corresponding to the name of the annotation type. The primary function for queries is :code:`filter`. This function takes one or more conditional expressions on attributes of annotations. In the above example, :code:`word` annotations have an attribute :code:`label` which corresponds to the orthography. Conditional expressions can take on any normal Python conditional (:code:`==`, :code:`!=`, :code:`<`, :code:`<=`, :code:`>`, :code:`>=`). The Python operator :code:`in` does not work; a special pattern has to be used: .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am'])) results = q.all() print(results) The :code:`in_` conditional function can take any iterable, including another query: .. code-block:: python with CorpusContext(config) as c: sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am'])) q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q)) results = q.all() print(results) In this case, it will find all :code:`phone` annotations that are in the words listed. Using the :code:`id` attribute will use unique identifiers for the filter. In this particular instance, it does not matter, but it does in the following: .. code-block:: python with CorpusContext(config) as c: sub_q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am'])) sub_q = sub_q.filter_right_aligned(c.word.line) q = c.query_graph(c.phone).filter(c.phone.word.id.in_(sub_q)) results = q.all() print(results) The above query will find all instances of the three words, but only where they are right-aligned with a :code:`line` annotation. .. note:: Queries are lazy evaluated. In the above example, :code:`sub_q` is not evaluated until :code:`q.all()` is called. This means that filters can be chained across multiple lines without a performance hit. .. _following_previous: Following and previous annotations ---------------------------------- Filters can reference the surrounding local context. For instance: .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone).filter(c.phone.label == 'aa') q = q.filter(c.phone.following.label == 'r') results = q.all() print(results) The above query will find all the 'aa' phones that are followed by an 'r' phone. Similarly, :code:`c.phone.previous` would provide access to filtering on preceding phones. .. _query_annotation_subset: Subsetting annotations ---------------------- In linguistics, it's often useful to specify subsets of symbols as particular classes. For instance, phonemes are grouped together by whether they are syllabic, their manner/place of articulation, and vowel height/backness/rounding, and words are grouped by their parts of speech. Suppose a subset has been created as in :ref:`enrichment_subsets`, so that the phones 'aa' and 'ih' have been marked as `syllabic`. Once this category is encoded in the database, it can be used in filters. .. code-block:: python with CorpusContext('corpus') as c: q = c.query_graph(c.phone) q = q.filter(c.phone.subset=='syllabic') results = q.all() print(results) .. note:: The results returned by the above query will be identical to the similar query: .. code-block:: python with CorpusContext('corpus') as c: q = c.query_graph(c.phone) q = q.filter(c.phone.label.in_(['aa', 'ih'])) results = q.all() print(results) The primary benefits of using subsets are performance based due to the inner workings of Neo4j. See :ref:`neo4j_implementation` for more details. Another way to specify subsets is on the phone annotations themselves, as follows: .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone.filter_by_subset('syllabic')) results = q.all() print(results) Both of these queries are identical and will return all instances of 'aa' and 'ih' phones. The benefit of `filter_by_subset` is generally for use in :ref:`hierarchical_queries`. .. note:: Using repeated subsets repeatedly in queries can make them overly verbose. The objects that the queries use are normal Python objects and can therefore be assigned to variables for easier use. .. code-block:: python with CorpusContext(config) as c: syl = c.phone.filter_by_subset('syllabic') q = c.query_graph(syl) q = q.filter(syl.end == syl.word.end) results = q.all() print(results) The above query would find all phones marked by '+syllabic' that are at the ends of words. .. _hierarchical_queries: Hierarchical queries -------------------- A key facet of language is that it is hierarchical. Words contain phones, and can be contained in larger utterances. There are several ways to query hierarchical information. If we want to find all ``aa`` phones in the word ``dogs``, then we can perform the following query: .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone).filter(c.phone.label == 'aa') q = q.filter(c.phone.word.label == 'dogs') results = q.all() print(results) Starting from the word level, we might want to know what phones each word contains. .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.word) q = q.columns(c.word.phone.label.column_name('phones')) results = q.all() print(results) In the output of the above query, there would be a column labeled ``phones`` that contains a list of the labels of phones that belong to the word (``['d', 'aa', 'g', 'z']``). Any property of phones can be queried this way (i.e., ``begin``, ``end``, ``duration``, etc). Going down the hierarchy, we can also find all words that contain a certain phone. .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.word).filter(c.word.label.in_(['are', 'is','am'])) q = q.filter(c.word.phone.label == 'aa') results = q.all() print(results) In this example, it will find all instances of the three words that contain an ``aa`` phone. Special keywords exist for these containment columns. The keyword ``rate`` will return the elements per second for the word (i.e., phones per second). The keyword ``count`` will return the number of elements. .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.word) q = q.columns(c.word.phone.rate.column_name('phones_per_second')) q = q.columns(c.word.phone.count.column_name('num_phones')) results = q.all() print(results) These keywords can also leverage subsets, as above: .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.word) q = q.columns(c.word.phone.rate.column_name('phones_per_second')) q = q.columns(c.word.phone.filter_by_subset('+syllabic').count.column_name('num_syllabic_phones')) q = q.columns(c.word.phone.count.column_name('num_phones')) results = q.all() print(results) Additionally, there is a special keyword can be used to query the ``position`` of a contained element in a containing one. .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone).filter(c.phone.label == 'aa') q = q.filter(c.word.label == 'dogs') q = q.columns(c.word.phone.position.column_name('position_in_word')) results = q.all() print(results) The above query should return ``2`` for the value of ``position_in_word``, as the ``aa`` phone would be the second phone. .. _queries_subannotations: Subannotation queries --------------------- Annotations can have subannotations associated with them. Subannotations are not independent linguistic types, but have more information associated with them than just a single property. For instance, voice onset time (VOT) would be a subannotation of stops (as it has a begin time and an end time that are of interest). For mor information on subannotations, see :ref:`enrichment_subannotations`. Querying such subannotations would be performed as follows: .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone) q = q.columns(c.phone.vot.duration.column_name('vot')) results = q.all() print(results) In some cases, it may be desirable to have more than one subannotation of the same type associated with a single annotation. For instance, voicing during the closure of a stop can take place at both the beginning and end of closure, with an unvoiced period in the middle. Using a similar query as above would get the durations of each of these (in the order of their begin time): .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone) q = q.columns(c.phone.voicing_during_closure.duration.column_name('voicing')) results = q.all() print(results) In some cases, we might like to know the total duration of such subannotations, rather than the individual durations. To query that information, we can use an ``aggregate``: .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone) results = q.aggregate(Sum(c.phone.voicing_during_closure.duration).column_name('total_voicing')) print(results) Miscellaneous ============= .. _aggregates_and_groups: Aggregates and groups --------------------- Aggregate functions are available in :code:`polyglotdb.query.base.func`. Aggregate functions available are: * Average * Count * Max * Min * Stdev * Sum In general, these functions take a numeric attribute as an argument. The only one that does not follow this pattern is :code:`Count`. .. code-block:: python from polyglotdb.query.base.func import Count with CorpusContext(config) as c: q = c.query_graph(c.phone).filter(c.phone.label == 'aa') q = q.filter(c.phone.following.label == 'r') result = q.aggregate(Count()) print(result) Like the :code:`all` function, :code:`aggregate` triggers evaluation of the query. Instead of returning rows, it will return a single number, which is the number of rows matching this query. .. code-block:: python from polyglotdb.query.base.func import Average with CorpusContext(config) as c: q = c.query_graph(c.phone).filter(c.phone.label == 'aa') q = q.filter(c.phone.following.label == 'r') result = q.aggregate(Average(c.phone.duration)) print(result) The above aggregate function will return the average duration for all 'aa' phones followed by 'r' phones. Aggregates are particularly useful with grouping. For instance: .. code-block:: python from polyglotdb.query.base.func import Average with CorpusContext(config) as c: q = c.query_graph(c.phone).filter(c.phone.label == 'aa') q = q.filter(c.phone.following.label.in_(['r','l'])) q = q.group_by(c.phone.following.label.column_name('following_label')) result = q.aggregate(Average(c.phone.duration), Count()) print(result) The above query will return the average duration and the count of 'aa' phones grouped by whether they're followed by an 'r' or an 'l'. .. note:: In the above example, the :code:`group_by` attribute is supplied with an alias for output. In the print statment and in the results, the column will be called 'following_label' instead of the default (more opaque) one. .. _ordering: Ordering -------- The :code:`order_by` function is used to provide an ordering to the results of a query. .. code-block:: python with CorpusContext(config) as c: q = c.query_graph(c.phone).filter(c.phone.label == 'aa') q = q.filter(c.phone.following.label.in_(['r','l'])) q = q.filter(c.phone.discourse == 'a_discourse') q = q.order_by(c.phone.begin) results = q.all() print(results) The results for the above query will be ordered by the timepoint of the annotation. Ordering by time is most useful for when looking at single discourses (as including multiple discourses in a query would invalidate the ordering). .. note:: In grouped aggregate queries, ordering is by default by the first :code:`group_by` attribute. This can be changed by calling :code:`order_by` before evaluating with :code:`aggregate`.