Enrichment via CSV files

PolyglotDB supports adding arbitrary information to annotations, as well as metadata about speakers and files, from a local CSV file. When constructing this CSV file, the first column should contain the labels used to identify which elements to enrich, and all subsequent columns are added to the corpus as properties.

ID_column,property_one,property_two
first_item,first_item_value_one,first_item_value_two
second_item,,second_item_value_two

Enriching with this file looks up elements by the ID_column: the element matching first_item gets both property_one and property_two (with the respective values), while the element matching second_item only gets property_two (because its value for property_one is empty).
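
If you generate the enrichment file programmatically, any CSV writer will do; the sketch below uses Python's standard csv module to produce the layout above (the output path and values are purely illustrative):

import csv

# Write an enrichment file with an ID column and two property columns;
# the empty property_one cell for second_item is simply skipped at enrichment time.
with open('/full/path/to/data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['ID_column', 'property_one', 'property_two'])
    writer.writerow(['first_item', 'first_item_value_one', 'first_item_value_two'])
    writer.writerow(['second_item', '', 'second_item_value_two'])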

Enriching the lexicon

lexicon_csv_path = '/full/path/to/lexicon/data.csv'
with CorpusContext(config) as c:
    c.enrich_lexicon_from_csv(lexicon_csv_path)

Note

The function enrich_lexicon_from_csv accepts an optional case_sensitive keyword argument, which defaults to False. Setting it to True makes word lookup respect capitalization.
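
For example, to respect capitalization when matching words:

with CorpusContext(config) as c:
    c.enrich_lexicon_from_csv(lexicon_csv_path, case_sensitive=True)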

Enriching the phonological inventory

The phone inventory can be enriched with arbitrary properties via:

inventory_csv_path = '/full/path/to/inventory/data.csv'
with CorpusContext(config) as c:
    c.enrich_inventory_from_csv(inventory_csv_path)
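
As an illustration, the first column of an inventory CSV would hold phone labels as they appear in the corpus, and every other column becomes a phone property. The feature columns and values below are hypothetical:

phone,height,backness
iy,high,front
aa,low,back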

Enriching speaker information

Speaker information can be added via:

speaker_csv_path = '/full/path/to/speaker/data.csv'
with CorpusContext(config) as c:
    c.enrich_speakers_from_csv(speaker_csv_path)
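
Here the first column should contain speaker names as they appear in the corpus; the metadata columns below are hypothetical examples:

name,gender,birth_year
speaker01,female,1985
speaker02,male,1990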

Enriching discourse information

Metadata about the discourses or sound files can be added via:

discourse_csv_path = '/full/path/to/discourse/data.csv'
with CorpusContext(config) as c:
    c.enrich_discourses_from_csv(discourse_csv_path)
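
In this case the first column should contain the discourse names (typically the names of the sound files as they appear in the corpus); the metadata columns below are hypothetical:

discourse,recording_date,noise_level
interview01,2016-01-12,low
interview02,2016-02-03,high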

Enriching arbitrary tokens

Often it is necessary or useful to encode a new property on the tokens of an annotation without directly interfacing with the database. This can happen, for example, if you want to use a language or tool other than Python for a particular phonetic analysis. In this case, any type of token can be enriched via CSV, using the corpus_context.enrich_tokens_with_csv function.

token_csv_path = '/full/path/to/token/data.csv'
with CorpusContext(config) as c:
    c.enrich_tokens_with_csv(token_csv_path,
            annotation_type="phone",
            id_column="phone_id",
            properties=["measurement_1", "measurement_2"])

The only requirement for the CSV is that it contains a column with the IDs of the tokens you wish to update. You can get these IDs (along with other columns) by querying the tokens beforehand and exporting a CSV; see Export for token CSVs. The only columns from the CSV that will be added as token properties are those included in the properties parameter. If this parameter is left as None, all columns of the CSV except the id_column will be included.
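
For instance, if every column other than the ID column should be imported as a token property, properties can be left as None (a sketch reusing the hypothetical phone_id column from above):

with CorpusContext(config) as c:
    c.enrich_tokens_with_csv(token_csv_path,
            annotation_type="phone",
            id_column="phone_id",
            properties=None)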