The SWIFT ETL commands require that all data be entered into SWIFT templates, one template per resource type. Typically, the primary resource type you are ETLing (e.g. Plasmid) will require several secondary resource types (e.g. Person, Journal Article, Construct Insert, ...) to be fully described in eagle-i.

Entering data correctly into SWIFT templates is key for the ETL process to succeed without creating duplicates or non-conforming data. You may enter data into SWIFT templates in the following ways:

  • Manually add data, row by row
  • If data exists electronically: obtain a data dump, pre-process it (see below) and copy individual columns into the SWIFT template.

Understanding the original data and pre-processing it

Data that exists electronically will typically be stored in a relational database and accessible via a database dump, or accessible through an API (e.g. in JSON format). It is usually necessary to perform a few transformations on the original data in order to fit it into a SWIFT template. This step is highly dependent on the nature of the original data, so procedures will need to be developed on a case-by-case basis. In mapping the data from its original schema to the eagle-i ontology, the following scenarios may be encountered:

  • Original data may need to be split into multiple eagle-i resource types
  • For a given type, there may be a one-to-one correspondence between the original field (e.g. column) and a property in the eagle-i ontology
  • For a given type, a field in the original data may need to be split into multiple eagle-i ontology properties
  • For a given type, fields in the original data may need to be combined into one eagle-i ontology property
  • A controlled vocabulary field in the original data may need to be mapped to a term in one of eagle-i's referenced taxonomies
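
The mapping scenarios above can be illustrated with a small pre-processing sketch in Python. Everything in it is an assumption made for illustration: the source field names, the template column names and the vocabulary table are hypothetical and would need to be replaced by the fields of your own data and the columns of your generated templates.

```python
# Hypothetical source row, e.g. one record from a database dump or a JSON API.
source_row = {
    "plasmid_name": "pXYZ-1",
    "contact": "Jane Doe <jdoe@example.org>",  # one field -> two properties
    "backbone": "pUC19",
    "backbone_vendor": "ExampleVendor",        # two fields -> one property
    "organism_code": "hs",                     # local controlled vocabulary code
}

# Assumed mapping from local codes to taxonomy term labels (illustrative only).
ORGANISM_TERMS = {"hs": "Homo sapiens", "mm": "Mus musculus"}


def split_contact(value):
    """Split 'Name <email>' into separate name and email values."""
    name, _, rest = value.partition("<")
    return name.strip(), rest.rstrip(">").strip()


def preprocess(row):
    """Return (plasmid_row, person_row), one row per resource type."""
    name, email = split_contact(row["contact"])
    # The original record is split into two resource types.
    person = {"Name": name, "Email": email}
    plasmid = {
        "Name": row["plasmid_name"],                                  # one-to-one
        "Backbone": f'{row["backbone"]} ({row["backbone_vendor"]})',  # combined
        "Organism": ORGANISM_TERMS[row["organism_code"]],             # vocabulary mapping
        "Contact": name,                                              # reference to the Person row
    }
    return plasmid, person


plasmid_row, person_row = preprocess(source_row)
```

Each resulting dictionary corresponds to one row in the template for its resource type; the values can then be copied into the matching columns.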

...

  • A template for your main resource - use the most specific type possible (e.g. Monoclonal Antibody and not Reagent)
  • Templates for all the linked resources you need - consult the eagle-i ontology browser page of your main resource type to understand what types can be linked

Adding your data to SWIFT templates

We recommend that, before adding resource data to a template, you rename the file so that its name is meaningful, since all resources ETLd from a file will be tagged with that file's name. We've found it useful to use a name that reflects the date of the ETL, e.g. 20150627-NYSCF-human-study.xls.
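
As a minimal sketch of this naming convention, the following Python helper builds a date-prefixed file name. The function name and its parameters are hypothetical; only the resulting pattern is taken from the example above.

```python
from datetime import date


def template_file_name(etl_date, source, description, extension="xls"):
    """Build a name of the form YYYYMMDD-source-description.extension."""
    slug = "-".join(description.split())  # replace whitespace with hyphens
    return f"{etl_date:%Y%m%d}-{source}-{slug}.{extension}"


print(template_file_name(date(2015, 6, 27), "NYSCF", "human study"))
# prints 20150627-NYSCF-human-study.xls
```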

SWIFT Templates include different kinds of columns:

...

  • You must always enter the Organization to which the resource is associated, either by name or by URI
    • It is best to first use the SWEET to add the Organization (e.g. lab) with which these resources are associated, and then reference its name or URI in the files.
  • If there is more than one value for a given column, enter the values separated by ; (semicolon). Conversely, check your input file for the presence of ; in values that are not meant to be split, and replace it with a different character.
  • The first two columns (hidden) of a template are reserved for metadata. Please do not modify them or the names of the tabs.
  • Every resource (primary or secondary) needs to have a name and a type as a minimum. For simplification, the type column is omitted if there are no possible subclasses (e.g. Person, Human Subject). If the template has a type column, you must enter a value.
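
The semicolon rule above lends itself to an automated check before data is copied into a template. The sketch below is one possible approach in Python, assuming the rows are held as dictionaries keyed by column name; the helper names and column names are hypothetical.

```python
SEPARATOR = ";"


def find_stray_separators(rows, single_valued_columns):
    """Return (row_index, column) pairs where a single-valued cell contains ';'."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col in single_valued_columns
        if SEPARATOR in str(row.get(col, ""))
    ]


def join_values(values):
    """Join several values into one cell, refusing values that contain ';'."""
    cleaned = [v.strip() for v in values]
    for v in cleaned:
        if SEPARATOR in v:
            raise ValueError(f"value contains the separator: {v!r}")
    return SEPARATOR.join(cleaned)


rows = [{"Name": "Buffer A; pH 7"}, {"Name": "Buffer B"}]
problems = find_stray_separators(rows, ["Name"])  # cells to fix by hand
authors = join_values(["Jane Doe", "John Roe"])   # one multi-valued cell
```

Cells reported by `find_stray_separators` would be split by the ETL unless the semicolon is replaced first.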

The actual ETL process

  • Run the ETL process using dedicated credentials (e.g. create an "automated curator" user in your eagle-i repository), so that it is easy to isolate bulk-loaded resources later.
  • The ETL user needs a Level 4 role.
  • ETL the secondary resource type files (i.e. linked resources) first, then the main resource type file.

Special rules for embedded instances

Embedded resources are instances of Embedded Classes (see the eagle-i ontology browser). Embedded resources only make sense as part of a "containing" resource and as such they can only be created or updated as part of a creation or update operation for that resource. Examples of embedded classes are Construct Inserts (embedded in Constructs) or Diagnosis (embedded in Human Subjects).

In the context of ETL and SWIFT templates, embedded resources are a special case of referenced resources. If there is only one embedded resource in a given resource, it is possible to enter its information directly in the SWIFT template row for the containing resource. If there is more than one embedded resource in a given resource, the following procedure must be used:

  • In the template for the containing resource, enter a list of semicolon-separated labels for the embedded resources, and fill the type column. This will result in the creation of empty embedded resources
  • ETL this file
  • In order to ETL the rest of the properties for the embedded resource:
    • Generate a template/map for the embedded resource class
      • Templates generated for embedded classes will have two additional columns: Main Resource Name and Main Resource Type
    • Fill the template with the information for the embedded resource (one row per embedded resource), making sure the entries for Main Resource Name and Main Resource Type match the label and type previously used when ETLing the main resource.
    • Run the ETL command with the embedded resource file as input. Note that the -p and -eid parameters will be ignored if they are present.
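
The two-pass procedure above can be sketched in Python as follows. This is only an illustration of how the rows for the two files relate to each other: apart from Main Resource Name and Main Resource Type, all column names (Name, Type, Construct Insert, Insert Size) are hypothetical, and the actual columns should be taken from the generated templates.

```python
def containing_row(name, resource_type, embedded_labels):
    """First pass: row for the containing resource, listing embedded labels."""
    return {
        "Name": name,
        "Type": resource_type,
        # Hypothetical column holding the semicolon-separated embedded labels.
        "Construct Insert": ";".join(embedded_labels),
    }


def embedded_rows(main_name, main_type, embedded):
    """Second pass: one row per embedded resource, pointing back to its container."""
    rows = []
    for label, properties in embedded.items():
        row = {
            "Name": label,
            "Main Resource Name": main_name,  # must match the containing row
            "Main Resource Type": main_type,  # must match the containing row
        }
        row.update(properties)
        rows.append(row)
    return rows


inserts = {
    "insert-A": {"Insert Size": "1200"},
    "insert-B": {"Insert Size": "850"},
}
main = containing_row("pXYZ-1", "Plasmid", list(inserts))
children = embedded_rows(main["Name"], main["Type"], inserts)
```

Deriving the Main Resource Name and Main Resource Type values from the containing row, rather than typing them again, helps keep the two files consistent.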