Corpora can be imported from several input formats. The list of currently supported formats is:
- TextGrids from Montreal Forced Aligner or FAVE-align
- Praat TextGrids
- TextGrids exported from LaBB-CAT
- BAS Partitur format
- TIMIT
- Buckeye
Each format has an inspection function in the polyglotdb.io submodule that checks that the format of the specified directory or file matches the input format and returns the appropriate parser.
These functions are used as follows:
    import polyglotdb.io as pgio

    corpus_directory = '/path/to/directory'

    parser = pgio.inspect_mfa(corpus_directory)  # MFA output TextGrids
    # OR
    parser = pgio.inspect_fave(corpus_directory)  # FAVE output TextGrids
    # OR
    parser = pgio.inspect_textgrid(corpus_directory)
    # OR
    parser = pgio.inspect_labbcat(corpus_directory)
    # OR
    parser = pgio.inspect_partitur(corpus_directory)
    # OR
    parser = pgio.inspect_timit(corpus_directory)
    # OR
    parser = pgio.inspect_buckeye(corpus_directory)
For more technical detail on the inspect functions and the parser objects they return, see PolyglotDB I/O.
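When the format is only known at run time (for example, from a command-line option), the choice between inspect functions can be expressed as a lookup table. The following is a minimal sketch: the helper name get_parser and the short format keys are hypothetical, and only the inspect function names come from the listing above. The io module is passed in as an argument, so the sketch itself does not import PolyglotDB.

```python
# Hypothetical short labels mapped to the pgio inspect function names
# shown in the listing above.
INSPECT_FUNCTIONS = {
    'mfa': 'inspect_mfa',
    'fave': 'inspect_fave',
    'textgrid': 'inspect_textgrid',
    'labbcat': 'inspect_labbcat',
    'partitur': 'inspect_partitur',
    'timit': 'inspect_timit',
    'buckeye': 'inspect_buckeye',
}


def get_parser(io_module, corpus_format, path):
    """Return a parser for `path` from the inspect function for `corpus_format`.

    `io_module` is expected to be polyglotdb.io (imported by the caller).
    """
    try:
        function_name = INSPECT_FUNCTIONS[corpus_format.lower()]
    except KeyError:
        supported = ', '.join(sorted(INSPECT_FUNCTIONS))
        raise ValueError(
            f'Unknown format {corpus_format!r}; expected one of: {supported}'
        ) from None
    # Look the function up by name on the module and call it on the path.
    return getattr(io_module, function_name)(path)
```

With PolyglotDB installed, a call would look like get_parser(pgio, 'mfa', corpus_directory), where pgio is polyglotdb.io imported as above. An unrecognized key raises a ValueError that lists the supported labels.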
To import a corpus, first import the CorpusContext context manager:
    from polyglotdb import CorpusContext
CorpusContext is the primary way through which corpora can be interacted with.
Before importing a corpus, ensure that a Neo4j server is running.
Interacting with corpora requires submitting the connection details. The easiest way to do this is with the utility function ensure_local_database_running (see Interacting with a local Polyglot database for more information):
    from polyglotdb.utils import ensure_local_database_running
    from polyglotdb import CorpusConfig

    with ensure_local_database_running('database_name') as connection_params:
        config = CorpusConfig('corpus_name', **connection_params)
The config object contains all the configuration for the corpus.
To import a file into a corpus (in this case a TextGrid):
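    import polyglotdb.io as pgio
    from polyglotdb import CorpusConfig, CorpusContext
    from polyglotdb.utils import ensure_local_database_running

    # Inspect the file to get a parser suited to its format
    parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')

    with ensure_local_database_running('database_name') as connection_params:
        config = CorpusConfig('my_corpus', **connection_params)
        # Load the file into the corpus using the parser
        with CorpusContext(config) as c:
            c.load(parser, '/path/to/textgrid.TextGrid')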
In the above code, the io module is imported and provides access to all the importing and exporting functions. For every format, there is an inspect function that generates a parser for that file and for other files formatted the same way. In the case of a TextGrid, the parser has annotation types that correspond to interval and point tiers. The inspect function tries to guess the relevant attributes of each tier.
The discourse load function of
CorpusContext objects takes
a parser as the first argument. Parsers contain an attribute
which the user can modify to change how a corpus is imported. For most standard formats, including TextGrids from
aligners, no modification is necessary.
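Because the load step only needs a parser and a path, the connection and loading logic can be collected into one reusable function that accepts a parser from any of the inspect functions. The following is a sketch under stated assumptions: load_with_parser is a hypothetical helper name, not part of PolyglotDB, and it uses only the calls shown in the example above. The PolyglotDB imports sit inside the function, so defining it does not require the library, although calling it does.

```python
def load_with_parser(corpus_name, parser, path, database_name='database_name'):
    """Load `path` into `corpus_name` using an already-inspected parser.

    Hypothetical helper for illustration; not part of PolyglotDB.
    """
    # Imported here so that defining the helper does not require PolyglotDB;
    # calling it does.
    from polyglotdb import CorpusConfig, CorpusContext
    from polyglotdb.utils import ensure_local_database_running

    # The local database supplies the connection details for the config.
    with ensure_local_database_running(database_name) as connection_params:
        config = CorpusConfig(corpus_name, **connection_params)
        # All interaction with the database goes through CorpusContext.
        with CorpusContext(config) as c:
            c.load(parser, path)
```

A call would then take the form load_with_parser('my_corpus', parser, '/path/to/textgrid.TextGrid'), with parser obtained from pgio.inspect_textgrid on the same path. Keeping inspection separate from loading leaves room to modify the parser before it is passed in.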
All interaction with the databases is via the
CorpusContext context manager.
Further details on import arguments can be found
in the API documentation.
Once the above code is run, corpora can be queried and explored.