Info

The toolkit described herein is currently not user-friendly (though it works well – we use it routinely to bulk-upload data for the Consortium members). If you encounter issues, please do not hesitate to contact us.

SWIFT (Semantic Web Ingest from Tables) is a toolkit that allows experienced users to bulk-upload data into an eagle-i repository via ETL (Extract, Transform and Load). The figure below is a high-level depiction of the ETL process. The toolkit currently supports only Excel spreadsheets and CSV files as input files (both need to conform to a SWIFT template, see below).

[Figure: high-level depiction of the ETL process (Gliffy diagram "etl-high-level")]


This guide provides an overview of system administrator tasks pertaining to ETL and the usage of the SWIFT toolkit. The ETL workflow requires a person with domain knowledge and understanding of the eagle-i resource ontology to prepare the input files for optimal upload, and a person with basic knowledge of Unix to run the commands and troubleshoot potential errors. A detailed description is provided in the page Preparing Input Data and Running an ETL Process.

The SWIFT toolkit comprises:

  • an ETL Input Generator - command-line program that generates Excel spreadsheet templates and mapping files for the various resource types of the eagle-i resource ontology (e.g. a template/map for antibodies, for instruments, etc.)
  • an ETLer - command-line program that executes a bulk upload
  • a De-ETLer - command-line program that reverses a previous ETL upload
  • a Bulk workflow - command-line program that executes workflow transitions on groups of resources, e.g. Publish, Return to curation, Withdraw

Prerequisites

Running SWIFT Toolkit commands requires:

  • A Unix-like environment including a terminal for executing commands
    • MacOS and Linux users don't need to install anything extra. For MacOS, use the Terminal app under Applications/Utilities
    • cygwin is recommended for Windows users
  • A Java 1.7 runtime environment
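The prerequisites above can be checked from the terminal before running any SWIFT command. The following is a minimal sketch, not part of the toolkit: the helper name `have_cmd` is hypothetical, and the check only confirms that `java` and `unzip` are on the PATH. It prints the Java version so that you can compare it manually.

```shell
# Hypothetical helper (not part of SWIFT): succeed if a command is on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report each tool used by the steps in this guide.
for tool in java unzip; do
  if have_cmd "$tool"; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done

# Print the first line of the Java version banner for a manual comparison.
if have_cmd java; then
  java -version 2>&1 | head -n 1
fi
```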

Download

The SWIFT toolkit is packaged as a zip file, and can be downloaded from our software repository. We have also manually packaged a zip file that contains shell scripts to wrap the main commands and a few pre-generated templates and maps (only for the most recently released eagle-i software and ontology versions); the following is a temporary location, while we integrate the generation of this package into our build processes:

https://open.med.harvard.edu/svn/eagle-i-dev/apps/trunk/dev-resources

Download the SWIFT toolkit distribution that matches the version of your eagle-i repository, named eagle-i-datatools-swift-[version]-dist.zip, unzip it into a dedicated directory, and navigate to it. For example:

Code Block
languagebash
mkdir ~/eagle-i
unzip -d ~/eagle-i eagle-i-datatools-swift-2.0MS3.01-dist.zip
cd ~/eagle-i/swift-2.0MS3.01

Available commands

Input generation

...

To generate ETL templates and maps, navigate to the dedicated directory (above) and run the script:

Code Block
languagebash
./generate-inputs.sh --type INSTRUMENT|SERVICE|PERSON|ORGANIZATION|typeURI

...

t typeURI
  • You may obtain the type URI from the eagle-i ontology browser. Use the left bar to find the most specific type you need, select it and grab its URI, e.g. http://purl.obolibrary.org/obo/ERO_0000229 for Monoclonal Antibodies.
Info

Innocuous warnings are produced when generating and uploading the templates; these may safely be ignored. If you encounter errors or issues, please do not hesitate to contact us.

This script will create/use two directories with obvious meanings: ./maps and ./templates. Do not modify them.

Transformation maps will be contained in a subdirectory of ./maps named after the type and ontology version, e.g.:

No Format

./maps/instrument_ont_v1.1.0

At the moment a third input, mapfileinfo.properties, is not generated. The ETLer looks for it under the type and version subdirectory, e.g. ./maps/instrument_ont_v1.1.0/mapfileinfo.properties

The zip file contains pre-generated maps and templates for a few types, and includes these property files. If you need to generate inputs for a different type, please make sure to copy this property file to the appropriate subdirectory and edit it. We are working on generating this file automatically, so this step will go away soon.

ETL

...

Warning

The ETLer expects data to be entered into one of the generated templates, and a few conventions to be respected (see Preparing Input Data and Running an ETL Process). A data curator usually makes sure that the template is correctly filled. In particular, the location of the resources to be ETLd (e.g. Lab or Core facility name) must be provided in every row of data and must correspond to a location already entered in the eagle-i repository via SWEET.

...



  • A detailed report of the ETL results is generated in the ~/eagle-i/swift-2.0MS3.01/logs directory; please inspect it to verify that all rows were correctly uploaded. The RDF version of generated resources is also logged in this directory.
  • To further verify the data upload, log on to the SWEET application and select the lab to which the ETLd resources belong.
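To locate the report of the most recent run, one option is to list the newest entry in the logs directory. This is a sketch only: the helper name `latest_file` is hypothetical, and it assumes nothing about how the ETLer names its log files, so it simply picks the most recently modified entry.

```shell
# Hypothetical helper (not part of SWIFT): print the most recently modified
# entry in the directory given as the first argument.
latest_file() (
  CDPATH= cd -- "$1" || exit 1
  ls -t | head -n 1
)

# Use the logs directory named above, if it exists on this machine.
logs_dir="$HOME/eagle-i/swift-2.0MS3.01/logs"
if [ -d "$logs_dir" ]; then
  latest_file "$logs_dir"
fi
```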

Assumptions

  • Templates that were generated by generate-inputs.sh are completed and in a directory, e.g. dataDirectory.

    Note

    All files in the dataDirectory will be processed by the ETLer. Please be sure all secondary resource templates are in their own directories.

  • Maps (*.rdf) that were generated by generate-inputs.sh have been copied to the SWIFT executable directory's maps folder, e.g. ~/eagle-i/swift-2.0MS3.01/maps.
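Because every file in the data directory is processed, it can help to look for stray files before running the ETLer. The sketch below is not part of the toolkit. The helper name `unexpected_files` is hypothetical, and the extension list (.xls, .xlsx, .csv) is an assumption based on the supported input types. Matching is case-sensitive, and subdirectories are skipped.

```shell
# Hypothetical helper (not part of SWIFT): list files in a data directory whose
# extension is not .xls, .xlsx or .csv. The extension list is an assumption.
unexpected_files() (
  for f in "$1"/*; do
    [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
    case "$f" in
      *.xls|*.xlsx|*.csv) ;;         # expected input types: nothing to report
      *) basename "$f" ;;            # anything else is reported by name
    esac
  done
)

# Example run against a throwaway directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/antibodies.xlsx" "$demo_dir/instruments.csv" "$demo_dir/notes.txt"
unexpected_files "$demo_dir"   # prints: notes.txt
rm -r "$demo_dir"
```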

ETL to create new resources

Code Block
languagebash
titleETL new resources
./ETLer.sh -d dataDirectory [-p DRAFT|CURATION|PUBLISH] -c username:password -r repositoryURL
  • This command will not attempt to determine if matching resources exist already in the eagle-i repository; it is therefore not idempotent - if it is run two times with the same data file, duplicate resources will be created.
  • The value of the -p (promote) parameter indicates the desired workflow state for all resources. We recommend choosing CURATION, verifying that the resources were ETLd correctly, and then publishing with the bulk workflow command (see below). If you have already run a test ETL in a staging environment, choose PUBLISH directly.

  • To avoid classpath confusion, please use the fully qualified path for the dataDirectory, e.g. /Users/juliane/swift-4.3.0/mcow_ipsc/test
  • Make sure your terminal's working directory is the SWIFT directory when you execute the command.
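The advice about fully qualified paths can be sketched as a small wrapper that resolves the data directory and prints the resulting command for review instead of running it. The function names `abs_dir` and `etl_dry_run` are hypothetical and not part of the toolkit; the credentials and repository URL are the same placeholders used in the command above.

```shell
# Hypothetical helper (not part of SWIFT): print the absolute path of a directory.
abs_dir() (
  CDPATH= cd -- "$1" && pwd
)

# Build the ETLer command line with an absolute data directory and print it
# rather than executing it, so that it can be reviewed first.
# Arguments: dataDirectory, workflow state, credentials, repository URL.
etl_dry_run() {
  data_dir=$(abs_dir "$1") || return 1
  echo "./ETLer.sh -d $data_dir -p $2 -c $3 -r $4"
}

# Placeholder values; "." stands in for a real data directory.
etl_dry_run . CURATION username:password repositoryURL
```

Once the printed command looks right, it can be pasted into a terminal whose working directory is the SWIFT directory.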

ETL to replace existing or create new resources

Code Block
languagebash
titleETL command for replacing existing resources or creating new resources
./ETLer.sh -d dataDirectory [-p DRAFT|CURATION|PUBLISH] -c username:password -r repositoryURL -eid property-uri
  • Use this command if the input file represents resources that have been previously uploaded or created in eagle-i

  • The value of the -eid parameter (external identifier) is the URI of a property that uniquely identifies the resource outside eagle-i. This property will be used to match the input to a resource in the eagle-i repository. Grab the property URI from the eagle-i ontology browser (expand the property name to see all information about a property). Example properties are: 

    • Catalog number, -eid http://purl.obolibrary.org/obo/ERO_0001528
    • Inventory number, -eid http://purl.obolibrary.org/obo/ERO_0000044
    • RDFS label, use the shorthand syntax -eid label
  • If the ETL process finds a matching resource, it will replace all its properties with the values from the input file; the URI of the matched resource will be preserved.
  • If the ETL process does not find a matching resource, a new resource will be created.
  • The value of the -p (promote) parameter indicates the desired workflow state for newly created resources. Existing resources will retain their workflow state.
  • To avoid classpath confusion, please use the fully qualified path for the dataDirectory.
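As an illustration of the -eid values listed above, the sketch below maps short names to the example property URIs and prints a replace-or-create command. The function name `eid_for` and the short names `catalog` and `inventory` are hypothetical conveniences, not toolkit features; only the URIs and the `label` shorthand come from this guide.

```shell
# Hypothetical lookup (not part of SWIFT): map a short name to the -eid value
# given in the examples above.
eid_for() {
  case "$1" in
    catalog)   echo "http://purl.obolibrary.org/obo/ERO_0001528" ;;  # Catalog number
    inventory) echo "http://purl.obolibrary.org/obo/ERO_0000044" ;;  # Inventory number
    label)     echo "label" ;;                                       # RDFS label shorthand
    *)         return 1 ;;                                           # unknown short name
  esac
}

# Print (do not run) a replace-or-create command that matches on catalog number.
echo "./ETLer.sh -d dataDirectory -c username:password -r repositoryURL -eid $(eid_for catalog)"
```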

Practicing ETL

...

If you are practicing the ETL process, you may wish to upload your data to the common eagle-i training node.

...

For example, if your directory is named dataDirectory and you wish to practice creating new resources, the script would be executed as follows (default workflow state is DRAFT):

Code Block
languagebash
./ETLer.sh -d dataDirectory -c L4:Level4 -r https://training.eagle-i.net

...

Note that the data that is uploaded to the training node CAN be viewed and modified by others even in a draft state (even if you subsequently lock the records). Note also that the information in the training node is not persistent, as the node is refreshed periodically.

De-ETL

Resources that are uploaded to an eagle-i repository via ETL are tagged with the name of the file from which they were extracted. It is therefore relatively simple to de-ETL an entire file. To do so, execute the following command:

Code Block
languagebash
 ./deETLer.sh -f filename -c username:password -r repositoryURL

Bulk Workflow

Execute the following command to perform workflow actions (e.g. send to curation, publish, unpublish) on all resources ETLd from a particular file (i.e. resources that are tagged with that filename in the eagle-i repository):

Code Block
languagebash
 ./bulk-workflow -f filename -p DRAFT|CURATION|PUBLISH -c username:password -r repositoryURL

Note the following limitations of bulk workflow:

  • All the resources tagged with the filename must be in the same state
  • You must choose a final state that is reachable from the resources' current state
    • if the resources are in draft, choose CURATION
    • if the resources are in curation, choose PUBLISH or DRAFT
    • if the resources are published, choose CURATION
    • if you want to publish resources that are currently in draft, you'll need to run the bulk workflow command twice
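The reachability rules above can be written down as a small lookup, which makes it easy to see why moving from draft to published takes two runs. This is a sketch only: the function names `allowed_targets` and `can_transition` are hypothetical, and the state names simply reuse the values accepted by the -p parameter.

```shell
# States reachable in a single bulk-workflow run, per the list above.
allowed_targets() {
  case "$1" in
    DRAFT)    echo "CURATION" ;;
    CURATION) echo "PUBLISH DRAFT" ;;
    PUBLISH)  echo "CURATION" ;;
    *)        return 1 ;;
  esac
}

# Succeed if state $2 can be reached from state $1 in one run.
can_transition() {
  for target in $(allowed_targets "$1"); do
    [ "$target" = "$2" ] && return 0
  done
  return 1
}

# PUBLISH is not reachable from DRAFT directly, so print the two runs needed.
if ! can_transition DRAFT PUBLISH; then
  echo "./bulk-workflow -f filename -p CURATION -c username:password -r repositoryURL"
  echo "./bulk-workflow -f filename -p PUBLISH -c username:password -r repositoryURL"
fi
```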

...