Introduction

SWIFT ETL commands require that all data be entered into SWIFT templates. One SWIFT template must be used for each resource type to be ETLd. Typically, the primary resource type you are ETLing (e.g. Plasmid) will require several secondary resource types (e.g. Person, Journal Article, ...) to be fully described in eagle-i; the secondary resources will be linked from the primary resources.

Entering data correctly into SWIFT templates is key for the ETL process to succeed without creating duplicates or non-conforming data. You may enter data into SWIFT templates in the following ways:

Understanding the original data and pre-processing it

Data that exists electronically will typically be stored in a relational database and accessible via a database dump, or accessible through an API (e.g. in JSON format). It is usually necessary to perform a few transformations on the original data in order to fit it into a SWIFT template. This step is highly dependent on the nature of the original data, so the procedures need to be developed on a case-by-case basis. In mapping the data from its original schema to the eagle-i ontology, the following scenarios may be encountered:

We usually write ad-hoc scripts that perform such transformations on the original data before copy-pasting individual columns into a SWIFT template. For controlled vocabulary fields, we produce mapping tables with the help of our domain experts and use them during these pre-processing steps.
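
As an illustration of such a pre-processing step, here is a minimal Python sketch that applies a mapping table to a controlled vocabulary column. Everything in it is an assumption made for the example: the column names, the sample rows and the contents of ORGANISM_MAP are hypothetical, and a real mapping table would be prepared with your domain experts. Values that have no mapping are left untouched and reported, so they can be reviewed before anything is pasted into a template.

```python
import csv
import io

# Hypothetical mapping table: source value (lowercased) -> ontology label.
ORGANISM_MAP = {
    "mouse": "Mus musculus",
    "human": "Homo sapiens",
}


def map_column(rows, column, mapping):
    """Translate the values of one column using a mapping table.

    Returns the transformed rows and the set of source values that had
    no entry in the mapping (these are left unchanged for manual review).
    """
    mapped_rows = []
    unmapped = set()
    for row in rows:
        raw = row[column].strip()
        new_row = dict(row)
        key = raw.lower()
        if key in mapping:
            new_row[column] = mapping[key]
        else:
            unmapped.add(raw)
        mapped_rows.append(new_row)
    return mapped_rows, unmapped


# Hypothetical extract of the original data (e.g. from a database dump).
SOURCE_CSV = """name,organism
pX-GFP,Mouse
pY-RFP,human
pZ-CFP,zebrafish
"""

rows = list(csv.DictReader(io.StringIO(SOURCE_CSV)))
mapped_rows, unmapped = map_column(rows, "organism", ORGANISM_MAP)

for row in mapped_rows:
    print(row["name"], "|", row["organism"])
if unmapped:
    print("Values needing a mapping:", sorted(unmapped))
```

Reporting unmapped values instead of guessing keeps non-conforming terms out of the template; the mapping table can then be extended and the script re-run.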

Generating the templates

SWIFT templates and maps need to be generated using the toolkit version that corresponds to your eagle-i repository version. This is very important because ontology versions matter: if the versions don't match, you might end up with non-conforming or incomplete resource descriptions.

When ETLing a primary type, there are usually resources of other types that are related to it (e.g. People, Organizations, Publications). It is necessary to enter information for these related types in separate templates. For example, when ETLing a Monoclonal Antibody, you'll have separate files for related Hybridoma Cell Lines, People and Publications.

Use the generate-inputs command described in the SWIFT toolkit guide to generate the following templates:

Do not modify the directories that are created by this command.

Adding your data to SWIFT templates

Create a directory dedicated to your data files, with a subdirectory for each resource type. Copy the templates you will use to the appropriate subdirectory. Note that more than one file of a given resource type may be used (e.g. you could have a file per lab if you're ETLing multiple labs). We recommend that before adding resource data to a template, you rename the file such that the name is meaningful, since all resources ETLd from a file will be tagged with that file's name. We've found it useful to use a name that reflects the date of the ETL, e.g. 20150627-NYSCF-human-study.xls.
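
The directory setup and renaming described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names dated_name and stage_template are hypothetical, as are the paths in the usage comment. The generated template is copied rather than moved, so the directories created by the toolkit are left unmodified.

```python
import shutil
from datetime import date
from pathlib import Path


def dated_name(etl_date, source, description, suffix=".xls"):
    """Build a file name such as 20150627-NYSCF-human-study.xls."""
    return f"{etl_date:%Y%m%d}-{source}-{description}{suffix}"


def stage_template(template, data_root, resource_type, new_name):
    """Copy a generated template into <data_root>/<resource_type>/<new_name>.

    The original template is not touched. An existing target file is
    never overwritten, to avoid losing data that was already entered.
    """
    target_dir = Path(data_root) / resource_type
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copyfile(template, target)
    return target


# Example usage (hypothetical paths; adjust to your own layout):
# name = dated_name(date(2015, 6, 27), "NYSCF", "human-study")
# stage_template("generated/human-study.xls", "etl-data", "human-study", name)
```

Since all resources ETLd from a file are tagged with that file's name, building the name in one place helps keep it consistent across the files of a single ETL run.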

SWIFT Templates include different kinds of columns:

Consult the tooltips in the header rows for more information about what is expected.

It is highly recommended you delete any columns from the template that are not needed. This eliminates potential confusion and makes the sheets much less unwieldy.

Rules and guidelines

Special rules for generic organisms

Generic organisms are instances of an Organism subclass that are meant to represent that subclass as a whole, when no specific instances are needed in the data - for example the generic organism https://global.eagle-i.net/i/Mus_musculus represents the Organism subclass http://purl.obolibrary.org/obo/NCBITaxon_10090. Generic organisms typically reside in eagle-i's Centrally Curated Resources.
In your template, if a column can accept an Organism value, you may use a specific instance or a generic instance. If you are using a generic instance, enter its label (e.g. Mus musculus) in both the name and the type columns.
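
The generic organism rule lends itself to a quick check on pre-processed rows before they are pasted into a template. The Python sketch below is illustrative only: the column names organism_name and organism_type, the sample rows and the set of generic labels are assumptions for the example, not names taken from an actual SWIFT template.

```python
# Hypothetical set of labels known to be generic organisms.
GENERIC_ORGANISMS = {"Mus musculus"}


def check_generic_organisms(rows, name_col, type_col, generic_labels):
    """Return (row_number, name, type) for each row that names a generic
    organism but does not repeat the same label in the type column."""
    problems = []
    for number, row in enumerate(rows, start=1):
        name = row[name_col].strip()
        type_value = row[type_col].strip()
        if name in generic_labels and type_value != name:
            problems.append((number, name, type_value))
    return problems


# Hypothetical rows: the second one breaks the rule.
sample_rows = [
    {"organism_name": "Mus musculus", "organism_type": "Mus musculus"},
    {"organism_name": "Mus musculus", "organism_type": "Organism"},
]

problems = check_generic_organisms(
    sample_rows, "organism_name", "organism_type", GENERIC_ORGANISMS
)
for number, name, type_value in problems:
    print(f"Row {number}: generic organism {name!r} has type {type_value!r}")
```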

Special rules for embedded instances

Embedded resources are instances of Embedded Classes (see the eagle-i ontology browser). Embedded resources only make sense as part of another "main" resource and as such they can only be created or updated as part of a creation or update operation for that resource. Examples of embedded classes are Construct Inserts (embedded in Constructs) or Diagnosis (embedded in Human Subjects).

In the context of ETL and SWIFT templates, embedded resources are a special case of linked resources. If there is only one embedded resource in a given resource, it is possible to enter its information directly in the SWIFT template row for the main resource. If there is more than one embedded resource in a given resource, the following procedure must be used:

Examples

This Google Drive folder contains a few annotated examples of SWIFT templates with data.

Running the actual ETL

Post ETL tasks