.. _Montreal Forced Aligner: https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner .. _FAVE-align: https://github.com/JoFrhwld/FAVE .. _LaBB-CAT: http://labbcat.sourceforge.net/ .. _TIMIT: https://catalog.ldc.upenn.edu/LDC93S1 .. _Buckeye: https://buckeyecorpus.osu.edu/ .. _BAS Partitur: http://www.bas.uni-muenchen.de/forschung/publikationen/Granada-98-Partitur.pdf .. _importing: ***************** Importing corpora ***************** Corpora can be imported from several input formats. The list of currently supported formats is: * TextGrids from `Montreal Forced Aligner`_ or `FAVE-align`_ * Praat TextGrids * TextGrids exported from `LaBB-CAT`_ * `BAS Partitur`_ format * `TIMIT`_ * `Buckeye`_ Each format has a inspection function in the :code:`polyglot.io` submodule that will check that format of the specified directory or file matches the input format and return the appropriate parser. These functions would be used as follows: .. code-block:: python import polyglotdb.io as pgio corpus_directory = '/path/to/directory' parser = pgio.inspect_mfa(corpus_directory) # MFA output TextGrids # OR parser = pgio.inspect_fave(corpus_directory) # FAVE output TextGrids # OR parser = pgio.inspect_textgrid(corpus_directory) # OR parser = pgio.inspect_labbcat(corpus_directory) # OR parser = pgio.inspect_partitur(corpus_directory) # OR parser = pgio.inspect_timit(corpus_directory) # OR parser = pgio.inspect_buckeye(corpus_directory) .. note:: For more technical detail on the inspect functions and the parser objects they return, see :ref:`pgdb_io`. To import a corpus, the :code:`CorpusContext` context manager has to be imported from :code:`polyglotdb`: .. code-block:: python from polyglotdb import CorpusContext :code:`CorpusContext` is the primary way through which corpora can be interacted with. Before importing a corpus, you should ensure that a Neo4j server is running. Interacting with corpora requires submitting the connection details. The easiest way to do this is with a utility function :code:`ensure_local_database_running` (see :ref:`local` for more information): .. code-block:: python from polyglotdb.utils import ensure_local_database_running from polyglotdb import CorpusConfig with ensure_local_database_running('database_name') as connection_params: config = CorpusConfig('corpus_name', **connection_params) The above :code:`config` object contains all the configuration for the corpus. To import a file into a corpus (in this case a TextGrid): .. code-block:: python import polyglotdb.io as pgio parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid') with ensure_local_database_running('database_name') as connection_params: config = CorpusConfig('my_corpus', **connection_params) with CorpusContext(config) as c: c.load(parser, '/path/to/textgrid.TextGrid') In the above code, the :code:`io` module is imported and provides access to all the importing and exporting functions. For every format, there is an inspect function to generate a parser for that file and other ones that are formatted the same. In the case of a TextGrid, the parser has annotation types correspond to interval and point tiers. The inspect function tries to guess the relevant attributes of each tier. .. note:: The discourse load function of :code:`Corpuscontext` objects takes a parser as the first argument. Parsers contain an attribute :code:`annotation_types`, which the user can modify to change how a corpus is imported. For most standard formats, including TextGrids from aligners, no modification is necessary. All interaction with the databases is via the :code:`CorpusContext` context manager. Further details on import arguments can be found in the API documentation. Once the above code is run, corpora can be queried and explored.