The ETL command requires that all data be entered into SWIFT templates - one template per resource type. Typically, a "main" resource type in eagle-i (e.g. Plasmid) will require several secondary resource types (e.g. Person, Journal Article, Construct insert, ...).

Entering data correctly is key for the ETL process to succeed without creating duplicates or non-conforming data. To add data to a SWIFT template, you may enter it row by row, or take data that already exists electronically, pre-process it (see below), and copy individual columns into the SWIFT template.


Understanding the original data and pre-processing it

Data that exists electronically will typically be stored in a relational database and be accessible via a database dump or through an API (e.g. in JSON format). It is usually necessary to perform a few transformations on the original data in order to fit it into a SWIFT template. This step is highly dependent on the nature of the original data, and hence will need to be developed on a case-by-case basis. In mapping the data from its original schema to the eagle-i ontology, the following scenarios may be encountered:

  • Original data may need to be split into multiple eagle-i resource types
  • For a given type, there may be a one-to-one correspondence between a field (e.g. a column) in the original data and a property in the eagle-i ontology
  • For a given type, a field in the original data may need to be split into multiple eagle-i ontology properties
  • For a given type, fields in the original data may need to be combined into one eagle-i ontology property
  • A controlled vocabulary field in the original data may need to be mapped to a term in one of eagle-i's referenced taxonomies

We usually write ad-hoc scripts that perform such transformations on the original data before copy-pasting individual columns into a SWIFT template. For controlled vocabulary fields, we produce mapping tables with the help of our domain experts and use them during these pre-processing steps.
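As an illustration only, a pre-processing script of this kind might look like the following sketch. It assumes a CSV dump with the hypothetical columns target, clone, contact and host, and a hypothetical mapping table HOST_SPECIES_MAP; none of these names come from eagle-i, and a real script will depend entirely on the original data. It shows one field combination, one field split, and one controlled vocabulary lookup.

```python
import csv
import io

# Hypothetical mapping table (source value -> eagle-i term label),
# of the kind agreed on with domain experts.
HOST_SPECIES_MAP = {"mouse": "Mus musculus", "rat": "Rattus norvegicus"}


def preprocess(record, unmapped):
    """Turn one source record (a dict) into a dict of template-ready columns."""
    # Combine: two source fields become one property.
    name = " ".join(part for part in (record["target"], record["clone"]) if part)
    # Split: one source field ("Last, First") becomes two properties.
    last, _, first = record["contact"].partition(",")
    # Map: look up the controlled vocabulary value; remember values with no mapping.
    host = HOST_SPECIES_MAP.get(record["host"].strip().lower(), "")
    if not host:
        unmapped.add(record["host"])
    return {
        "name": name,
        "contact_first": first.strip(),
        "contact_last": last.strip(),
        "host_species": host,
    }


def preprocess_csv(text):
    """Pre-process a CSV dump given as a string; return (rows, unmapped values)."""
    unmapped = set()
    rows = [preprocess(r, unmapped) for r in csv.DictReader(io.StringIO(text))]
    return rows, unmapped
```

Collecting the unmapped values, rather than failing on the first one, gives a list that can be taken back to the domain experts to extend the mapping table before the columns are copied into a template.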

Templates

SWIFT templates need to be generated using the toolkit version that corresponds to your eagle-i repository version. This is very important: if the versions don't match, you might end up with ontology terms that are not found during the ETL process.

When ETLing a "main" type, there are usually resources of other types that are related to it (e.g. People, Organizations, Publications). Information for these related types must be entered in templates of their own. For example, when ETLing a Monoclonal Antibody, you'll have separate files for related Hybridoma Cell Lines, People and Publications.

Generate the following templates:

  • A template for your main resource - use the most specific type possible (e.g. Monoclonal Antibody and not Reagent)
  • Templates for the linked resources you need - consult the eagle-i ontology browser page of your main resource type to understand what types can be linked

Guidelines for filling the templates

Templates include different kinds of columns:

  • Plain text columns - you may enter any text. Do not include semicolons, as they are used to separate multiple values within a field (more on this below)
  • Resource columns - represent linked resources. 
    • A value that matches the label of a resource in the repository (or a new resource) is expected. If you know the resource's URI, you may enter it here and thus avoid a look-up operation during ETL. 
    • The main resource type template will contain references to instances from the secondary resource type templates. References in the main (primary) file must use the exact name (ignoring case) entered in the secondary file for the correct linkage to occur
    • Resource columns are followed by Resource type columns; sometimes the resource type column is omitted if there are no possible values (i.e. the resource type has no subclasses)
  • Referenced taxonomies - a value that matches an eagle-i ontology term or synonym is expected

Consult the tooltips in the header rows for more information about what is expected.
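Because linkage depends on names matching exactly (ignoring case), it can help to check the references before running the ETL. The following is a minimal sketch, not part of the eagle-i toolkit: the function name unresolved_references is hypothetical, and it assumes you have already extracted the resource column of the main template and the name column of the secondary template as lists of strings. It also assumes that values starting with http:// or https:// are URIs and need no name check, and that surrounding whitespace can be ignored.

```python
def unresolved_references(primary_refs, secondary_names):
    """Return references from the main template that match no secondary name.

    Matching ignores case. Cells may hold several values separated by
    semicolons. Values that look like URIs are skipped (an assumption).
    """
    known = {name.strip().casefold() for name in secondary_names}
    missing = []
    for cell in primary_refs:
        for ref in cell.split(";"):
            ref = ref.strip()
            if not ref or ref.startswith(("http://", "https://")):
                continue
            if ref.casefold() not in known:
                missing.append(ref)
    return missing
```

A name reported here is not necessarily an error: it may refer to a resource that already exists in the repository rather than in the secondary file. The list is only a set of candidates to review by hand.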

A few more guidelines and tips:

  • You must always enter the Organization to which the resource is associated, either by name or by URI
    • It is best to use the SWEET to add the Organization (e.g. lab) to which these resources are associated, and then reference this name or URI in the files.
  • If there is more than one value for a given column, enter the values separated by ; (semicolon). Conversely, check your input file for ; in values that are not meant to be split, and replace it with a different character.
  • The first two columns (hidden) of a template are reserved for metadata. Please do not modify them, or the names of the tabs.
  • Every resource (primary or secondary) needs to have a name and a type as a minimum. For simplification, the type column is omitted if there are no possible subclasses (e.g. Person, Human Subject). If the template has a type column, you must enter a value.
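The semicolon guideline above can be applied during pre-processing with a small helper such as the sketch below. The function name join_values and the choice of a comma as the replacement character are assumptions made for illustration; pick whatever replacement suits your data.

```python
def join_values(values, replacement=","):
    """Join several values into one template cell, separated by semicolons.

    Any semicolon inside an individual value is replaced first, so that the
    value is not split unintentionally. Empty values are dropped.
    """
    cleaned = []
    for value in values:
        value = value.strip()
        if value:
            cleaned.append(value.replace(";", replacement))
    return ";".join(cleaned)
```

For example, join_values(["Jane Doe", "Sam Roe"]) yields a single cell holding both names, while a description such as "anti-CD4; clone GK1.5" passed as one value keeps its text together with the semicolon swapped for a comma.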

The actual ETL process

Name the files meaningfully (expand)

First ETL the secondary resource type files (i.e. linked resources), then the main resource type file.

Special rules for embedded instances

 
