...

SWIFT (Semantic Web Ingest from Tables) is a toolkit that allows experienced users to bulk-upload data into an eagle-i repository, via ETL (Extract, Transform and Load). Currently the toolkit supports only Excel spreadsheets as input files.

...

The figure below is a high-level depiction of the ETL process for spreadsheets.

[Figure: the ETL process for spreadsheets]

The SWIFT toolkit consists of:

  • an ETLer - command-line program that executes a bulk upload
  • a deETLer - command-line program that deletes a previous ETL upload
  • a bulk workflow - command-line program that executes workflow transitions on groups of records, e.g. Publish, Return to curation, Withdraw
  • an ETL Input Generator - command-line program that allows system administrators to generate spreadsheet templates and map files for the various resource types of the eagle-i resource ontology (e.g. a template/map for antibodies, for instruments, etc.)

The SWIFT toolkit is packaged as a zip file and can be downloaded from our software repository. We have also manually packaged a zip file that contains shell scripts to wrap the main commands and a few pre-generated templates and maps (only for the most recently released eagle-i software and ontology versions). The following is a temporary location, while we integrate the generation of this package into our build processes: https://open.med.harvard.edu/svn/eagle-i-dev/apps/trunk/dev-resources

Download the SWIFT toolkit distribution eagle-i-datatools-etl-dist-[version].zip, unzip it into a dedicated directory, and navigate to it. For example:

No Format
mkdir ~/eagle-i
unzip -d ~/eagle-i eagle-i-etl-dist-1.7MS2.01.zip
cd ~/eagle-i/eagle-i-etl-dist-1.7MS2.01

...

No Format
./generate-inputs.sh -t typeURI

You may obtain the type URI from the eagle-i ontology browser. Use the left bar to find the most specific type you need, select it, and copy its URI from the browser's address bar, e.g. http://purl.obolibrary.org/obo/ERO_0000229 for Monoclonal Antibodies. However, in this case you will need to add mapfileinfo.properties (see below).

Info

Innocuous warnings are produced when generating and uploading the templates; these may safely be ignored. If you encounter errors or issues, please do not hesitate to contact us.

...

No Format
./maps/instrument_ont_v1.1.0

...

ETL instructions

Warning

The ETLer expects data to be entered into one of the generated templates, and a few conventions to be respected (see Appendix A). A data curator usually makes sure that the template is filled in correctly. In particular, the location of the resources to be ETLd (e.g. Lab or Core facility name) must be provided in every row of data and must correspond to a location already entered in the eagle-i repository via SWEET.

  1. Place your input files (i.e. the completed templates) in a directory of your choice, e.g. dataDirectory. All files contained in this directory will be processed by the ETLer.
  2. To run an ETL, execute the script:
    No Format
    ./ETLer.sh -d dataDirectory [-p DRAFT|CURATION|PUBLISH] -c username:password -r repositoryURL
    
    Info

    If you are practicing the ETL process, you may wish to upload your data to the common eagle-i training node. In this case, if your directory is named dataDirectory, the script would be executed as follows (default workflow state is DRAFT):

    No Format
    ./ETLer.sh -d dataDirectory -c L4:Level4 -r https://training.eagle-i.net
    

    Note that data uploaded to the training node can be viewed and modified by others even in a draft state (even if you subsequently lock the records). Note also that the information in the training node is not persistent, as the node is refreshed periodically.

    All the data will be uploaded to the requested workflow state. 
  3. To verify the data upload, log on to the SWEET application and select the lab to which the ETLd resources belong.
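Because the ETLer processes every file in the input directory, it can be worth listing the directory's contents before running step 2, so that stray files are not uploaded by accident. The following is a minimal sketch and not part of the SWIFT toolkit: the helper name list_input_files is hypothetical, and it assumes that only files directly inside the directory matter (whether the ETLer descends into subdirectories is not stated here).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of SWIFT): print every regular file
# found directly inside the given directory, one per line, sorted by name.
list_input_files() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # -maxdepth 1 restricts the listing to the directory itself (assumption:
    # nested folders are not intended as ETL input).
    find "$dir" -maxdepth 1 -type f | sort
}
```

For example, running list_input_files dataDirectory and reviewing the output before invoking the ETLer shows exactly which files sit in the directory.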

Appendix A. Input file odds and ends

  • When ETLing a primary type, there are usually resources of other types that are related to it (e.g. People, Organizations, Publications). It is best to enter information for these related types in a template of their own. For example, when ETLing a Monoclonal Antibody, it is best to have separate files for related Hybridoma Cell Lines, People and Publications. The primary file (Monoclonal Antibody) will contain references to instances from these secondary files. References in the primary file need to use the exact name (ignoring case) entered in the secondary file for the correct linkage to occur. ETL the secondary files first and then the primary file.
  • It is best to use SWEET to add the Organization (e.g. lab) with which these resources are associated, and then reference this name in the files.
  • If there is more than one value for a given column, enter the values separated by ; (semicolon). Conversely, check your input file for the presence of ; in values that are not meant to be split, and substitute a different character.
  • The first two columns (hidden) of a template are reserved for metadata. Please do not modify them or the names of the tabs.
  • Every resource needs to have a name and a type as a minimum. If the template has a type column, you must enter a value even if it is superfluous (e.g. in a template for Journal Article, you still need to enter Journal Article as a type). This should be fixed in the near future.
  • You must always enter the Organization to which the resource is associated.
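Two of the conventions above, semicolon-separated values and case-insensitive name references, can be illustrated with a small sketch. It is not part of the SWIFT toolkit and does not reproduce the ETLer's actual parsing; the function names split_values and name_matches are hypothetical, and trimming whitespace around each item is an assumption rather than documented ETLer behavior.

```shell
#!/usr/bin/env bash
# Split one cell value on ';' and print each item on its own line.
# Assumption: surrounding whitespace is trimmed and empty items are dropped.
split_values() {
    printf '%s\n' "$1" | tr ';' '\n' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Succeed if the reference ($1) equals one of the remaining arguments,
# compared without regard to case.
name_matches() {
    local ref name
    ref=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    shift
    for name in "$@"; do
        if [ "$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')" = "$ref" ]; then
            return 0
        fi
    done
    return 1
}
```

A check of this kind shows why a stray ; inside a single value is a problem: split_values would report it as two separate items, and neither half would match the name entered in the secondary file.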