Info

The toolkit described herein is currently not user-friendly (though it works well – we use it routinely to bulk-upload data for the Consortium members). If you encounter issues, please do not hesitate to contact us.

...

This guide provides an overview of ETL tasks and the usage of the SWIFT toolkit. The ETL workflow requires a person with domain knowledge and an understanding of the eagle-i resource ontology to prepare the input files for optimal upload, and a person with basic knowledge of Unix to run the commands and troubleshoot potential errors. A detailed description is provided in the page Preparing Input Data and Running an ETL Process.

The SWIFT toolkit consists of:

  • an ETL Input Generator - a command-line program that generates spreadsheet templates and mapping files for the various resource types of the eagle-i resource ontology (e.g. a template/map for antibodies, for instruments, etc.)
  • an ETLer - a command-line program that executes a bulk upload
  • a deETLer - a command-line program that deletes a previous ETL upload
  • Bulk workflow - a command-line program that executes workflow transitions on groups of resources, e.g. Publish, Return to curation, Withdraw

Prerequisites

Running SWIFT toolkit commands requires:

  • A Unix-like environment including a terminal for executing commands
    • macOS and Linux users don't need to install anything extra. On macOS, use the Terminal app under Applications/Utilities
    • Cygwin is recommended for Windows users
  • A Java 1.7 runtime environment
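
To confirm the Java prerequisite before running any toolkit command, a check along the following lines may help. This is a minimal sketch: the helper name parse_java_version is our own and not part of the SWIFT toolkit, and it assumes the usual `java -version` output in which the version appears in double quotes.

```shell
# Pull the quoted version string out of `java -version` output read from stdin.
# (Hypothetical helper, not part of the SWIFT toolkit.)
parse_java_version() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr, so merge it into stdout first
  echo "Java found: $(java -version 2>&1 | parse_java_version)"
else
  echo "No java executable found on PATH"
fi
```

A reported version beginning with 1.7 corresponds to the Java 1.7 runtime listed above.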

...

Download the SWIFT toolkit distribution that matches the version of your eagle-i repository, named eagle-i-datatools-swift-[version]-dist.zip. Unzip it into a dedicated directory and navigate to it. For example:

...

This script creates (or reuses) two directories, ./maps for mapping files and ./templates for spreadsheet templates. Do not modify them.

ETL

Warning

The ETLer expects data to be entered into one of the generated templates, and a few conventions to be respected (see Preparing Input Data and Running an ETL Process). A data curator usually makes sure that the template is filled in correctly. In particular, the location of the resources to be ETLd (e.g. Lab or Core facility name) must be provided in every row of data.

  • Place your input data files (i.e. the completed templates) in a directory of your choice, e.g. dataDirectory. All files contained in this directory will be processed by the ETLer.

  • To run an ETL, execute one of the two commands below. 

  • A detailed report of the ETL results is generated in the ./logs directory; please inspect it to verify that all rows were correctly uploaded. The RDF version of generated resources is also logged in this directory.
  • To further verify the data upload, log on to the SWEET application and select the lab to which the ETLd resources belong.
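
Because every file in the data directory is processed, it can be worth listing its contents before an ETL run, and looking at the newest entries in ./logs afterwards. The sketch below uses only standard Unix tools; the helper name list_files is our own, dataDirectory is the example name used above, and whether the ETLer also descends into subdirectories is not covered here.

```shell
# Print the regular files directly inside a directory, one per line, sorted.
# (Hypothetical helper, not part of the SWIFT toolkit.)
list_files() {
  find "$1" -maxdepth 1 -type f | sort
}

# Before the ETL: review which files sit in the data directory.
if [ -d dataDirectory ]; then list_files dataDirectory; fi

# After the ETL: show the five most recently modified entries in ./logs.
if [ -d ./logs ]; then ls -t ./logs | head -n 5; fi
```

Stray files such as backups or drafts are easiest to spot and remove at this stage, before anything is uploaded.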

...

  • This command will not attempt to determine if matching resources already exist in the eagle-i repository; it is therefore not idempotent: if it is run twice with the same input data file, duplicate resources will be created.
  • The value of the -p (promote) parameter indicates the desired workflow state for all resources. We recommend choosing CURATION, verifying that the resources were ETLd correctly, and then publishing with the bulk workflow command (see below). If you have already run a test ETL in a staging environment, choose PUBLISH directly.
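
Since a repeated run with the same input file creates duplicates, one possible safeguard is a local record of files that have already been uploaded. The following is a sketch under our own assumptions: the manifest file and the three helper functions are hypothetical and not part of the SWIFT toolkit, and the check only recognizes byte-identical files, not edited files that describe the same resources.

```shell
# Hypothetical local manifest of files already uploaded; not part of SWIFT.
MANIFEST="${MANIFEST:-./etl-uploaded.txt}"

# Content signature of a file: cksum reading stdin prints "<checksum> <size>".
file_signature() {
  cksum < "$1"
}

# Succeeds if a file with identical content was recorded before.
already_uploaded() {
  [ -f "$MANIFEST" ] && grep -qxF "$(file_signature "$1")" "$MANIFEST"
}

# Call this after a successful ETL run for the given file.
record_upload() {
  file_signature "$1" >> "$MANIFEST"
}
```

Before an ETL, files for which already_uploaded succeeds can be set aside; after a successful run, record_upload adds the file's signature to the manifest.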

...

  • Use this command if the input file represents resources that have been previously uploaded or created in eagle-i.

  • The value of the -eid parameter (external identifier) is the URI of a property that uniquely identifies the resource outside eagle-i. This property will be used to match the input to a resource in the eagle-i repository. Grab the property URI from the eagle-i ontology browser (expand the property name to see all information about a property). Example properties are: 

    • Catalog number, -eid http://purl.obolibrary.org/obo/ERO_0001528
    • Inventory number, -eid http://purl.obolibrary.org/obo/ERO_0000044
    • RDFS label, use the shorthand syntax -eid label
  • If the ETL process finds a matching resource, it will replace all its properties with the values from the input file; the URI of the matched resource will be preserved.
  • If the ETL process does not find a matching resource, a new resource will be created.
  • The value of the -p (promote) parameter indicates the desired workflow state for newly created resources. Existing resources will retain their workflow state.
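
To avoid retyping property URIs, the example -eid values listed above can be kept in a small lookup function. This is only a convenience sketch: the function name eid_for and the short keys catalog, inventory and label are our own inventions; only the URIs and the label shorthand come from the list above.

```shell
# Map a short, made-up key to one of the -eid values listed above.
# (Hypothetical helper, not part of the SWIFT toolkit.)
eid_for() {
  case "$1" in
    catalog)   echo "http://purl.obolibrary.org/obo/ERO_0001528" ;;
    inventory) echo "http://purl.obolibrary.org/obo/ERO_0000044" ;;
    label)     echo "label" ;;
    *)         echo "unknown property key: $1" >&2; return 1 ;;
  esac
}

# Print the argument pair to paste into the ETL command line.
echo "-eid $(eid_for catalog)"
```

Other identifying properties can be added as further case branches once their URIs have been looked up in the eagle-i ontology browser.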

...

Execute the following command to perform workflow actions (e.g. send to curation, publish, unpublish) on all resources ETLd from a particular file (i.e. resources that are tagged with that filename in the eagle-i repository):

...