SWIFT ETL commands require that all data be entered into SWIFT templates. One SWIFT template must be used for each resource type to be ETLd. Typically, the primary resource type you are ETLing (e.g. Plasmid) will require several secondary resource types (e.g. Person, Journal Article, ...) to be fully described in eagle-i -- the secondary resources will be linked from the primary resources.

Entering data correctly into SWIFT templates is key for the ETL process to succeed without creating duplicates or non-conforming data. You may enter data into SWIFT templates in the following ways:

...

Data that exists electronically will typically be stored in a relational database and accessible via a database dump, or accessible through an API (e.g. in JSON format). It is usually necessary to perform a few transformations on the original data in order to fit it into a SWIFT template. This step is highly dependent on the nature of the original data, and hence the procedures will need to be developed on a case-by-case basis. In mapping the data from its original schema to the eagle-i ontology, the following scenarios may be encountered:

...

We usually write ad-hoc scripts that perform such transformations on the original data before copy-pasting individual columns into a SWIFT template. For controlled vocabulary fields, we produce mapping tables with the help of our domain experts and use them during these pre-processing steps.
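As an illustration of such a pre-processing step, the following is a minimal Python sketch that applies a mapping table to one controlled vocabulary column. The mapping entries, the column name `organism` and the function names are hypothetical; a real mapping table would come from your domain experts and the target terms from the eagle-i ontology.

```python
# Hypothetical mapping table: source value (lowercased) -> target term label.
# Real entries must be agreed with domain experts and checked against the ontology.
ORGANISM_MAP = {
    "mouse": "Mus musculus",
    "human": "Homo sapiens",
}


def map_term(raw, mapping):
    """Return the mapped term for a source value, or None if it is unmapped."""
    return mapping.get(raw.strip().lower())


def transform_rows(rows, column, mapping):
    """Map one column of every row; collect unmapped source values for review."""
    transformed = []
    unmapped = set()
    for row in rows:
        term = map_term(row[column], mapping)
        if term is None:
            # Leave the cell empty rather than guessing, and report the value.
            unmapped.add(row[column])
            term = ""
        transformed.append({**row, column: term})
    return transformed, unmapped


rows = [
    {"name": "pX1", "organism": " Mouse "},
    {"name": "pX2", "organism": "yeast"},
]
mapped_rows, unmapped_values = transform_rows(rows, "organism", ORGANISM_MAP)
print(mapped_rows)
print("Needs review:", sorted(unmapped_values))
```

Collecting unmapped values instead of failing on the first one lets the whole list be sent back to the domain experts in one pass.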

Templates

SWIFT templates need to be generated using the toolkit version that corresponds to your eagle-i repository version (this is very important because ontology versions matter - if the versions don't match, you might end up with non-conforming or incomplete resource descriptions).

When ETLing a primary type, there are usually resources of other types that are related to it (e.g. People, Organizations, Publications). It is necessary to enter information for these related types in a template of their own. For example, when ETLing a Monoclonal Antibody, you'll have separate files for related Hybridoma Cell Lines, People and Publications.

Generate the following templates:

  • A template for your primary resource type - use the most specific type possible (e.g. Monoclonal Antibody and not Reagent)
  • Templates for all the linked secondary resources you need - consult the eagle-i ontology browser page of your primary resource type to understand what types can be linked

Adding your data to SWIFT templates

We recommend that before adding resource data to a template, you rename the file such that the name is meaningful, since all resources ETLd from a file will be tagged with that file's name. We've found it useful to use a name that reflects the date of the ETL, e.g. 20150627-NYSCF-human-study.xls.
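If you prepare many files, the naming convention can be generated rather than typed. This is a small sketch, assuming the date-source-description pattern of the example above; the function name and its parameters are hypothetical and not part of SWIFT.

```python
from datetime import date


def etl_file_name(etl_date, source, description, extension="xls"):
    """Build a file name of the form YYYYMMDD-source-description.extension."""
    return f"{etl_date:%Y%m%d}-{source}-{description}.{extension}"


file_name = etl_file_name(date(2015, 6, 27), "NYSCF", "human-study")
print(file_name)  # 20150627-NYSCF-human-study.xls
```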

...

  • Plain text columns - you may enter any text. Do not include semi-colons, as they are used as the field separator (more on this below)
  • Resource columns - represent secondary, linked resources. 
    • A value that matches the label of a resource in the repository (or a new resource) is expected. If you know the resource's URI, you may enter it here and thus avoid a look-up operation during ETL - use this option with caution, as all checks are also bypassed (in particular, the ETL process does not verify that the resource exists). We have found that the URI option is most useful when all values in the column are the same - otherwise it makes it more difficult to visually inspect the data.
    • The primary resource type template will contain references to instances from a secondary resource type template - references in the primary file need to use the exact name (ignoring case) entered in the secondary file for the correct linkage to occur
    • Resource columns are followed by Resource type columns; sometimes the resource type column is omitted if there are no possible values (i.e. the resource type has no subclasses) - if the column exists, it must be used.
  • Referenced taxonomies - a value that matches an eagle-i ontology term or synonym is expected - if you know the term URI, you may enter it here and thus avoid a look-up operation during ETL - use this option with caution, as all checks are also bypassed (in particular, the ETL process does not verify that the term exists in the eagle-i ontology). We have found that the URI option is most useful when all values in the column are the same - otherwise it makes it more difficult to visually inspect the data.

Consult the tooltips in the header rows for more information about what is expected.
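Because linkage depends on the primary file using the exact name (ignoring case) entered in the secondary file, it can help to compare the two sets of names before running the ETL. The sketch below assumes you have already extracted the referenced names and the secondary resource names into Python lists; the function name is hypothetical and this check is not part of SWIFT itself.

```python
def unresolved_references(primary_refs, secondary_names):
    """Return references that match no secondary name, ignoring case only."""
    known = {name.casefold() for name in secondary_names}
    return sorted({ref for ref in primary_refs if ref and ref.casefold() not in known})


secondary_names = ["Jane Doe", "John Smith"]
primary_refs = ["Jane Doe", "john smith", "A. Nother", ""]
missing = unresolved_references(primary_refs, secondary_names)
print(missing)  # ['A. Nother']
```

A reported name is only a candidate for review: it may legitimately refer to a resource that already exists in the repository, or it may be a URI. The comparison deliberately does not trim whitespace, so stray leading or trailing spaces show up as mismatches.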

...

  • You must always enter the Organization to which the resource is associated, either by name or by URI
    • It is best to use the Sweet to add the Organization (e.g. lab) to which these resources are associated, and then reference this name or URI in the files.
  • If there is more than one value for a given column, enter values separated by ; (semicolon). Conversely, check your input file for the presence of ; in values that are not meant to be split, and substitute a different character.
  • The first two columns (hidden) of a template are reserved for metadata. Please do not modify them or the name of the Tabs.
  • Every resource (primary or secondary) needs to have a name and a type as a minimum. For simplification, the type column is omitted if there are no possible subclasses (e.g. Person, Human Subject). If the template has a type column, you must enter a value.
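The semicolon and required-field rules above can be checked in a pre-processing script before values are pasted into a template. The following is a minimal sketch; the dictionary keys `name`, `type` and `organization`, the comma as replacement character, and both function names are assumptions made for illustration.

```python
SEPARATOR = ";"


def join_values(values, replacement=","):
    """Join several values into one cell, replacing semicolons inside each value."""
    cleaned = [value.replace(SEPARATOR, replacement).strip() for value in values]
    return SEPARATOR.join(value for value in cleaned if value)


def missing_required(row, required=("name", "type", "organization")):
    """Return the required fields that are absent or blank in a row."""
    return [field for field in required if not str(row.get(field, "")).strip()]


cell = join_values(["Smith; J", "Doe"])
print(cell)  # Smith, J;Doe

problems = missing_required({"name": "pX1", "type": " "})
print(problems)  # ['type', 'organization']
```

For a template without a type column (e.g. Person), pass `required=("name", "organization")` instead.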

...

  • Run the ETL process using dedicated credentials (e.g. create an "automated curator" user in your eagle-i repository), such that it is later possible to easily isolate resources that were bulk loaded.
  • ETL needs a Level 4 role
  • ETL first the secondary resource type files (i.e. linked resources), then the primary resource type file.

Special rules for embedded instances

Embedded resources are instances of Embedded Classes (see the eagle-i ontology browser). Embedded resources only make sense as part of a "containing" resource and as such they can only be created or updated as part of a creation or update operation for that resource. Examples of embedded classes are Construct Inserts (embedded in Constructs) or Diagnosis (embedded in Human Subjects).

...