.. _Montreal Forced Aligner: https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner

.. _FAVE-align: https://github.com/JoFrhwld/FAVE

.. _Web-MAUS: https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/WebMAUSGeneral

.. _LaBB-CAT: http://labbcat.sourceforge.net/

.. _TIMIT: https://catalog.ldc.upenn.edu/LDC93S1

.. _Buckeye: https://buckeyecorpus.osu.edu/

.. _BAS Partitur: http://www.bas.uni-muenchen.de/forschung/publikationen/Granada-98-Partitur.pdf

.. _pgdb_io:

**************
PolyglotDB I/O
**************

In addition to documenting the IO module of PolyglotDB, this document
should serve as a guide for implementing future importers for additional
formats.

Import pipeline
===============

Importing a corpus consists of several steps.  First, a file must be
inspected with the relevant inspect function (e.g., ``inspect_textgrid`` or
``inspect_buckeye``).  These functions generate Parsers for a given format
that allow annotations across many tiers to be coalesced into linguistic
types (words, segments, etc.).

As an example, suppose a TextGrid has an interval tier for word labels,
an interval tier for phone labels, and tiers for annotating stop information
(closure duration, bursts, VOT, etc.).  In this case, our parser would want
to associate the stop information annotations with the phones (or rather a
subset of the phones), and not have them as a separate linguistic type.

Following inspection, the file can be imported using a CorpusContext's
``load`` function.  Under the hood, the Parser object creates
standardized linguistic annotations from the annotations in the text file,
which are then imported into the database.

Currently the following formats are supported:

- Praat TextGrids (:ref:`inspect_textgrids`)
- TextGrid output from forced aligners (`Montreal Forced Aligner`_, `FAVE-align`_, and `Web-MAUS`_)
- Output from other corpus management software (`LaBB-CAT`_)
- `BAS Partitur`_ format
- Corpus-specific formats

  - `Buckeye`_
  - `TIMIT`_

Inspect
-------

Inspect functions (e.g., :code:`inspect_textgrid`) return a guess for
how to parse the annotations present in a given file (or files in a given
directory).  They return a parser of the respective type (e.g., :code:`TextgridParser`)
with an attribute for the :code:`annotation_tiers` detected.  For instance, the inspect function for TextGrids
will return a parser with annotation types for each interval and point tier in the TextGrid.


.. _inspect_textgrids:

Inspect TextGrids
`````````````````

.. note::

   See :ref:`io_tg_parser_api` for full API of the TextGrid Parser


Consider the following TextGrid with interval tiers for words and phones:

.. figure:: _static/img/io_example.png
    :align: center
    :alt: Image cannot be displayed in your browser

Running the :code:`inspect_textgrid` function for this file will return two annotation types. From bottom to top, it will
generate a :code:`phone` annotation type and a :code:`word` annotation type.  Words and phones are two special linguistic
types in PolyglotDB.  Other linguistic types can be defined in a TextGrid (e.g., grouping words into utterances or phones into syllables,
though functionality exists for computing both of those automatically), but word and phone tiers must be defined.

.. note::

   The special :code:`word` and :code:`phone` types are assigned to tiers via heuristics.  The first and most
   reliable is whether the tier name contains "word" or "phone".  The second uses cutoffs
   based on the mean and SD of word and phone durations in the Buckeye corpus to determine whether the intervals are more likely to be
   words or phones.  For reference, the mean and SD used for words are 0.2465409 and 0.03175723, and those used for phones
   are 0.08327773 and 0.03175723.
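
The two heuristics can be illustrated with a minimal standalone sketch.  It does not use PolyglotDB itself: the function name
:code:`guess_tier_type` is hypothetical, and the duration check is a simplification that only compares a tier's mean interval
duration with the two reference means quoted above, so the cutoffs actually used by PolyglotDB may differ.

.. code-block:: python

    from statistics import mean

    # Reference mean durations in seconds, as quoted in the note above
    WORD_MEAN = 0.2465409
    PHONE_MEAN = 0.08327773


    def guess_tier_type(tier_name, intervals):
        """Guess 'word' or 'phone' for a tier (hypothetical helper).

        ``intervals`` is a list of (begin, end, label) tuples.
        """
        lowered = tier_name.lower()
        # Heuristic 1: trust the tier name when it is informative
        if "word" in lowered:
            return "word"
        if "phone" in lowered:
            return "phone"
        # Heuristic 2 (simplified assumption): compare the mean duration of
        # the non-empty intervals with the two reference means
        durations = [end - begin for begin, end, label in intervals if label]
        if not durations:
            return None
        average = mean(durations)
        if abs(average - WORD_MEAN) < abs(average - PHONE_MEAN):
            return "word"
        return "phone"


    print(guess_tier_type("Speaker words", []))
    print(guess_tier_type("tier 2", [(0.0, 0.05, "Y"), (0.05, 0.15, "UW1")]))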

From the above TextGrid, phones will have a :code:`label` property (e.g., "Y"), a :code:`begin` property (e.g., 0.15),
and an :code:`end` property (e.g., 0.25).
Words will have a :code:`label` property (e.g., "you"), a :code:`begin` property (e.g., 0.15),
and an :code:`end` property (e.g., 0.25), as well as a computed :code:`transcription`
property made up of all of the included phones based on timings (e.g., "Y.UW1").  Any empty intervals will result in "words"
that have the :code:`label` of "<SIL>", which can then be marked as pauses later in corpus processing
(see :ref:`encoding_pauses` for more details).

.. note::

   The computation of transcription uses the midpoints of phones and whether they are between the begin and end time
   points of words.
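
As a rough illustration of the midpoint rule, the following standalone sketch assigns phones to a word and joins their
labels with periods.  The function name :code:`transcription_for`, the tuple layout, the example timings, and the choice to
treat the word boundaries as inclusive are all assumptions made for this example, not PolyglotDB's actual implementation.

.. code-block:: python

    def transcription_for(word, phones):
        """Join the labels of phones whose midpoint lies within the word.

        ``word`` and each element of ``phones`` are (begin, end, label) tuples.
        """
        word_begin, word_end, _ = word
        labels = []
        for phone_begin, phone_end, phone_label in phones:
            midpoint = (phone_begin + phone_end) / 2
            # Assumption: boundaries are inclusive on both sides
            if word_begin <= midpoint <= word_end:
                labels.append(phone_label)
        return ".".join(labels)


    # Made-up timings for illustration
    word = (0.15, 0.25, "you")
    phones = [
        (0.10, 0.15, "sil"),
        (0.15, 0.20, "Y"),
        (0.20, 0.25, "UW1"),
        (0.25, 0.32, "AA1"),
    ]
    print(transcription_for(word, phones))  # Y.UW1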


Inspect forced aligned TextGrids
````````````````````````````````

Both the Montreal Forced Aligner and FAVE-align generate TextGrids for files in two formats that PolyglotDB can parse.  The first format
is for files with a single speaker.  These files will have two tiers, one for words (named :code:`words` or :code:`word`)
and one for phones (named :code:`phones` or :code:`phone`).
The second format is for files with multiple speakers, where each speaker will have a pair of tiers for words (formatted as :code:`Speaker name - words`)
and phones (formatted as :code:`Speaker name - phones`).

TextGrids generated from `Web-MAUS`_ have a single format with a tier for words (named :code:`ORT`), a tier for the canonical
transcription (named :code:`KAN`) and a tier for phones (named :code:`MAU`).  In parsing, just the tiers for words and
phones are used, as the transcription will be generated automatically.

.. note::

   See :ref:`io_mfa_parser_api` for full API of the MFA Parser, :ref:`io_fave_parser_api` for full API of the FAVE Parser,
   and :ref:`io_maus_parser_api` for the full API of the MAUS Parser.


Inspect LaBB-CAT formatted TextGrids
````````````````````````````````````

The LaBB-CAT system generates force-aligned TextGrids for files in a format that PolyglotDB can parse (though some editing may be
required due to issues in exporting single speakers in LaBB-CAT).  As with the other supported aligner output formats,
PolyglotDB looks for word and phone tiers per speaker (or for just a single speaker depending on export options).  The
parser will use :code:`transcript` to find the word tiers (e.g., :code:`Speaker name - transcript`) and :code:`segment` to find
the phone tiers (e.g., :code:`Speaker name - segment`).

.. note::

   See :ref:`io_labbcat_parser_api` for full API of the LaBB-CAT Parser

Inspect Buckeye Corpus
``````````````````````

The `Buckeye`_ Corpus is stored in an idiosyncratic format that has two text files per sound file (e.g., :code:`s0101a.wav`), one detailing information
about words (e.g., :code:`s0101a.words`) and one detailing information about surface phones (e.g., :code:`s0101a.phones`).  The PolyglotDB
parser extracts label, begin and end for each phone.  Words have type properties for their underlying transcription and
token properties for their part of speech and begin/end.

.. note::

   See :ref:`io_buckeye_parser_api` for full API of the Buckeye Parser

Inspect TIMIT Corpus
````````````````````

The `TIMIT`_ corpus is stored in an idiosyncratic format that has two text files per sound file (e.g., :code:`sa1.wav`), one detailing information
about words (e.g., :code:`sa1.WRD`) and one detailing information about surface phones (e.g., :code:`sa1.PHN`).  The PolyglotDB
parser extracts label, begin and end for each phone and each word.  Time stamps are converted from samples in the original text files
to seconds for use in PolyglotDB.
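
The sample-to-second conversion can be sketched as follows.  This is a standalone illustration rather than the parser's
actual code: the helper name :code:`parse_timit_line` is hypothetical, and it assumes the usual TIMIT layout of one
annotation per line (begin sample, end sample, label) with audio sampled at 16 kHz.

.. code-block:: python

    TIMIT_SAMPLE_RATE = 16000  # Hz; assumed sampling rate of the TIMIT audio


    def parse_timit_line(line, sample_rate=TIMIT_SAMPLE_RATE):
        """Convert one '<begin> <end> <label>' line to (begin_s, end_s, label)."""
        begin, end, label = line.strip().split(maxsplit=2)
        return int(begin) / sample_rate, int(end) / sample_rate, label


    # Made-up lines in the style of a .WRD or .PHN file
    for line in ["0 3200 h#", "8000 16000 she"]:
        print(parse_timit_line(line))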

.. note::

   See :ref:`io_timit_parser_api` for full API of the TIMIT Parser

.. _modifying_parsers:

Modifying aspects of parsing
----------------------------

Additional properties for linguistic units can be imported as well through the use of extra interval tiers when using a
TextGrid parser (see :ref:`inspect_textgrids`), as in the following TextGrid:

.. figure:: _static/img/io_example_extra_word_props.png
    :align: center
    :alt: Image cannot be displayed in your browser

Here we have properties for each word's part of speech (POS tier) and transcription.  The transcription tier will overwrite
the automatic calculation of transcription based on contained segments.  Each of these will be a type property
by default (see :ref:`neo4j_implementation` for more details).  If these properties are meant to be token-level properties (e.g.,
the part of speech of a word varies depending on the token produced), this can be changed as follows:

.. code-block:: python

    from polyglotdb import CorpusContext
    import polyglotdb.io as pgio

    parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
    parser.annotation_tiers[2].type_property = False # The index of the TextGrid tier for POS is 2

    # ... code that uses the parser to import data

If the content of a tier should be ignored (e.g., if it contains information not related to any annotations in particular),
then it can be manually marked to be ignored as follows:

.. code-block:: python

    from polyglotdb import CorpusContext
    import polyglotdb.io as pgio

    parser = pgio.inspect_textgrid('/path/to/textgrid/file/or/directory')
    parser.annotation_tiers[0].ignored = True # Index of 0 if the first tier should be ignored

    # ... code that uses the parser to import data

Parsers created through other inspect functions (e.g., Buckeye) can be modified in similar ways, though the TextGrid parser
is necessarily the most flexible.

Speaker parsers
```````````````

There are two currently implemented schemes for parsing speaker names from a file path.  The first is the :ref:`filename_speaker_parser`,
which takes a number of characters in the base file name (without the extension) starting either from the left or right. For
instance, the path :code:`/path/to/buckeye/s0101a.words` for a Buckeye file would return the speaker :code:`s01` using 3 characters from the left.

The other speaker parser is the :ref:`directory_speaker_parser`, which parses speakers from the directory that contains
the specified path.  For instance, the path :code:`/path/to/buckeye/s01/s0101a.words` would return :code:`s01` because the containing
folder of the file is named :code:`s01`.
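
The logic of the two schemes can be sketched with the standard library alone.  The function names and parameters below
are hypothetical stand-ins chosen for this example, not the signatures of PolyglotDB's speaker parser classes.

.. code-block:: python

    import os


    def speaker_from_filename(path, num_characters, from_left=True):
        """Take characters from the base file name, without its extension."""
        name = os.path.splitext(os.path.basename(path))[0]
        if from_left:
            return name[:num_characters]
        return name[-num_characters:]


    def speaker_from_directory(path):
        """Use the name of the directory that contains the file."""
        return os.path.basename(os.path.dirname(path))


    print(speaker_from_filename('/path/to/buckeye/s0101a.words', 3))       # s01
    print(speaker_from_directory('/path/to/buckeye/s01/s0101a.words'))     # s01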

Load discourse
--------------

Loading of discourses is done via a CorpusContext's ``load`` function:

.. code-block:: python

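   from polyglotdb import CorpusContext
   import polyglotdb.io as pgio

   parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')

   # config (the corpus configuration) is assumed to be defined elsewhere
   with CorpusContext(config) as c:
       c.load(parser, '/path/to/textgrid.TextGrid')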

Alternatively, ``load_discourse`` can be used with the same arguments.
The ``load`` function automatically determines whether the input path to
be loaded is a single file or a folder, and proceeds accordingly.

Load directory
--------------

As stated above, a CorpusContext's ``load`` function will import a directory of
files as well as a single file, but ``load_directory`` can also be called
explicitly:

.. code-block:: python

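   from polyglotdb import CorpusContext
   import polyglotdb.io as pgio

   parser = pgio.inspect_textgrid('/path/to/textgrids')

   # config (the corpus configuration) is assumed to be defined elsewhere
   with CorpusContext(config) as c:
       c.load_directory(parser, '/path/to/textgrids')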

Writing new parsers
-------------------

New parsers can be created through extending either the :ref:`io_base_parser_api` class or one of the more specialized
parser classes. There are in general three aspects that need to be implemented.  First, the :code:`_extensions` property should
be updated to reflect the file extensions that the parser will find and attempt to parse.  This property should be an iterable,
even if only one extension is to be used.

Second, the :code:`__init__` function should be implemented if anything above and beyond the base class init function is required
(e.g., special speaker parsing).

Finally, the :code:`parse_discourse` function should be overridden to implement some way of populating data on the annotation tiers
from the source data files and ultimately create a :code:`DiscourseData` object (an intermediate data representation for straightforward importing
into the Polyglot databases).

Creating new parsers for force-aligned TextGrids simply requires extending the :class:`polyglotdb.io.parsers.aligner.AlignerParser`
and overriding the :code:`word_label` and :code:`phone_label` class properties.  The :code:`name` property should also be
set to something descriptive, and the :code:`speaker_first` property should be set to False if speakers follow word/phone labels in
the TextGrid tiers (e.g., :code:`words - Speaker name` rather than :code:`Speaker name - words`). See :class:`polyglotdb.io.parsers.mfa.MfaParser`,
:class:`polyglotdb.io.parsers.fave.FaveParser`, :class:`polyglotdb.io.parsers.maus.MausParser`, and
:class:`polyglotdb.io.parsers.labbcat.LabbcatParser` for examples.
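
To show how these four properties interact, here is a standalone sketch of tier-name matching.  The
:code:`TierNamingScheme` class is a hypothetical stand-in written for this example so that it runs without PolyglotDB; it
is not :code:`AlignerParser`, it assumes exact matches on the labels, and the real class may match tier names differently.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class TierNamingScheme:
        """Hypothetical stand-in mirroring the properties described above."""

        name: str
        word_label: str
        phone_label: str
        speaker_first: bool = True

        def classify(self, tier_name):
            """Return (speaker, 'word' or 'phone'), or None if nothing matches."""
            parts = [part.strip() for part in tier_name.split(" - ")]
            if len(parts) == 1:
                # Single-speaker file: the tier name is just the label
                speaker, label = None, parts[0]
            elif self.speaker_first:
                speaker, label = " - ".join(parts[:-1]), parts[-1]
            else:
                speaker, label = " - ".join(parts[1:]), parts[0]
            if label.lower() == self.word_label:
                return speaker, "word"
            if label.lower() == self.phone_label:
                return speaker, "phone"
            return None


    labbcat_style = TierNamingScheme("LaBB-CAT style", "transcript", "segment")
    reversed_style = TierNamingScheme("custom", "words", "phones", speaker_first=False)

    print(labbcat_style.classify("Speaker name - segment"))
    print(reversed_style.classify("words - Speaker name"))

In an actual parser, the equivalent step would be a subclass that sets only these class properties and inherits the rest
of the parsing behavior.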

Exporters
=========

Under development.