Importing corpora
Corpora can be imported from several input formats. The list of currently supported formats is:
TextGrids from Montreal Forced Aligner or FAVE-align
Praat TextGrids
TextGrids exported from LaBB-CAT
BAS Partitur format
TIMIT and Buckeye corpus formats
Each format has an inspection function in the polyglotdb.io
submodule that checks that the format of the specified directory
or file matches the input format and returns the appropriate parser.
These functions are used as follows:
import polyglotdb.io as pgio
corpus_directory = '/path/to/directory'
parser = pgio.inspect_mfa(corpus_directory) # MFA output TextGrids
# OR
parser = pgio.inspect_fave(corpus_directory) # FAVE output TextGrids
# OR
parser = pgio.inspect_textgrid(corpus_directory)
# OR
parser = pgio.inspect_labbcat(corpus_directory)
# OR
parser = pgio.inspect_partitur(corpus_directory)
# OR
parser = pgio.inspect_timit(corpus_directory)
# OR
parser = pgio.inspect_buckeye(corpus_directory)
Note
For more technical detail on the inspect functions and the parser objects they return, see PolyglotDB I/O.
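When the input format is only known at run time (for example, from a command-line option), the inspect functions listed above can be selected by name. The following is a minimal sketch, not part of PolyglotDB: the INSPECT_FUNCTIONS table, its short labels, and the get_parser helper are hypothetical names introduced here for illustration, and only the inspect functions already shown above are assumed to exist.

```python
# Map a short format label to the name of the inspect function shown above.
# The labels themselves are arbitrary choices made for this sketch.
INSPECT_FUNCTIONS = {
    'mfa': 'inspect_mfa',
    'fave': 'inspect_fave',
    'textgrid': 'inspect_textgrid',
    'labbcat': 'inspect_labbcat',
    'partitur': 'inspect_partitur',
    'timit': 'inspect_timit',
    'buckeye': 'inspect_buckeye',
}


def get_parser(corpus_format, corpus_path):
    """Return the parser produced by the inspect function for corpus_format."""
    try:
        function_name = INSPECT_FUNCTIONS[corpus_format]
    except KeyError:
        known = ', '.join(sorted(INSPECT_FUNCTIONS))
        raise ValueError(
            f'Unknown format {corpus_format!r}; expected one of: {known}'
        ) from None
    # Imported here so that an unknown label fails before polyglotdb is needed.
    import polyglotdb.io as pgio
    return getattr(pgio, function_name)(corpus_path)
```

With this helper, get_parser('mfa', '/path/to/directory') would be equivalent to calling pgio.inspect_mfa('/path/to/directory') directly.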
To import a corpus, the CorpusContext context manager has to be imported from polyglotdb:
from polyglotdb import CorpusContext
CorpusContext is the primary way of interacting with corpora.
Before importing a corpus, you should ensure that a Neo4j server is running.
Interacting with corpora requires supplying the connection details for that
server. The easiest way to do this is with the utility function
ensure_local_database_running (see Interacting with a local Polyglot database
for more information):
from polyglotdb.utils import ensure_local_database_running
from polyglotdb import CorpusConfig
with ensure_local_database_running('database_name') as connection_params:
    config = CorpusConfig('corpus_name', **connection_params)
The above config object contains all the configuration for the corpus.
To import a file into a corpus (in this case a TextGrid):
import polyglotdb.io as pgio
parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')
with ensure_local_database_running('database_name') as connection_params:
    config = CorpusConfig('my_corpus', **connection_params)
    with CorpusContext(config) as c:
        c.load(parser, '/path/to/textgrid.TextGrid')
In the above code, the io module is imported and provides access to
all the importing and exporting functions. For every format, there is an
inspect function that generates a parser for that file and for other files
formatted the same way. In the case of a TextGrid, the parser has annotation
types that correspond to interval and point tiers. The inspect function
tries to guess the relevant attributes of each tier.
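The import steps above can be collected into one function so that the same path is used for both inspection and loading. This is a sketch under stated assumptions rather than a PolyglotDB utility: import_textgrids is a hypothetical helper, 'database_name' is the placeholder used in the examples above, and the existence check on the path is an addition made here so that a mistyped path fails before the database is started.

```python
from pathlib import Path


def import_textgrids(corpus_name, path, database_name='database_name'):
    """Inspect path as TextGrid data and load it into the corpus corpus_name."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f'No such file or directory: {path}')

    # Imported lazily so the path check above runs even without polyglotdb.
    import polyglotdb.io as pgio
    from polyglotdb import CorpusConfig, CorpusContext
    from polyglotdb.utils import ensure_local_database_running

    # Same sequence as the example above: inspect, configure, then load.
    parser = pgio.inspect_textgrid(str(path))
    with ensure_local_database_running(database_name) as connection_params:
        config = CorpusConfig(corpus_name, **connection_params)
        with CorpusContext(config) as c:
            c.load(parser, str(path))
```

A call such as import_textgrids('my_corpus', '/path/to/textgrid.TextGrid') would then reproduce the example above.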
Note
The load function of CorpusContext objects takes a parser as its first
argument. Parsers contain an attribute, annotation_types, which the user can
modify to change how a corpus is imported. For most standard formats,
including TextGrids from aligners, no modification is necessary.
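Before modifying a parser, it can help to look at what the inspect function guessed. The sketch below assumes only what the note states, namely that the parser exposes an iterable attribute called annotation_types. The helper describe_annotation_types is a hypothetical name, and no assumption is made about the attributes of the individual entries, which is why it falls back on repr.

```python
def describe_annotation_types(parser):
    """Return one printable description per entry in parser.annotation_types."""
    return [repr(annotation_type) for annotation_type in parser.annotation_types]
```

Printing each returned string before calling load shows the guessed annotation types, so that any changes can be made to the parser first.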
All interaction with the databases is via the CorpusContext context manager.
Further details on import arguments can be found
in the API documentation.
Once the above code is run, corpora can be queried and explored.