...

  • Manually add data, row by row.
  • If data exists electronically: obtain a data dump, pre-process it (see below) and copy individual columns into the SWIFT template.

...

  • A template for your primary resource type - use the most specific type possible (e.g. Monoclonal Antibody and not Reagent).
  • Templates for all the secondary resources you need - consult the eagle-i ontology browser page of your primary resource type to understand what types can be linked.

Do not modify the directories that are created by this command.

...

  • Plain text columns - you may enter any text. Do not include semicolons, as they are used as the field separator (more on this below).
  • Resource columns - represent secondary, linked resources. 
    • A value that matches the label of a resource in the repository or in a secondary file, or of a new resource, is expected. If you know the resource's URI, you may enter it here and thus avoid a look-up operation during ETL. Use this option with caution, as all checks are also by-passed (in particular, the ETL process does not verify that the resource exists). We have found that the URI option is most useful when all values in the column are the same; otherwise it makes it more difficult to visually inspect the data. The primary resource type template will contain references to instances from a secondary resource type template; references in the primary file need to use the exact name (ignoring case) entered in the secondary file for the correct linkage to occur.
    • Resource columns are followed by Resource type columns; sometimes the resource type column is omitted if there are no possible values (i.e. the resource type has no subclasses) - if the column exists, it must be used.
  • Referenced taxonomies - a value that matches an eagle-i ontology term or synonym is expected. If you know the term URI, you may enter it here and thus avoid a look-up operation during ETL. Use this option with caution, as all checks are also by-passed (in particular, the ETL process does not verify that the term exists in the eagle-i ontology). We have found that the URI option is most useful when all values in the column are the same; otherwise it makes it more difficult to visually inspect the data.
  • The first two columns (hidden) of a template are reserved for metadata. Please do not modify them.

Consult the tooltips in the header rows for more information about what is expected. A few more guidelines and tips:

Tip

It is highly recommended you delete any columns from the template that are not needed. This eliminates potential confusion and makes the sheets much less unwieldy.

Rules and guidelines

  • You must always enter the Organization with which the resource is associated, either by name or by URI.

    Tip
    It is best to use the SWEET to add the Organization (e.g. lab) with which these resources are associated, and then reference its name or URI in the files.
  • References in the primary file, either to resources represented in a secondary file or resources in the repository, need to use the exact label (ignoring case) for the correct linkage to occur. 
  • If there is more than one value for a given column, enter the values separated by ; (semicolon). Conversely, check your input file for the presence of ; in values that are not meant to be split, and replace it with a different character.
  • Every resource (primary or secondary) needs to have a name and a type as a minimum. For simplification, the type column is omitted if there are no possible subclasses (e.g. Person, Human Subject). If the template has a type column, you must enter a value.
  • Use URIs instead of labels with caution, as all checks are also by-passed (in particular, the ETL process does not verify that the resource exists in the repository or that the term exists in the ontology, so you may end up with non-conforming data).

  • Generate ETL templates for the most specific type needed, i.e. Antibody instead of parent type Reagent, Core Facility instead of parent type Organization, Biological process phenotype instead of parent type Phenotype, Journal Article instead of parent type Document, etc. This is because the properties generated for each of these more specific sub-types will be different from those generated for the root class.

    Tip

    We have found that the URI option is most useful when all values in the column are the same - otherwise it makes visual inspection of the data more difficult.
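
    The semicolon and label-matching rules above can be checked before running ETL. The following is a minimal sketch, not part of the SWIFT toolkit: it assumes the template data has been read into Python as one dictionary per row, and the column names ("Name", "Located In") are hypothetical. Trimming whitespace around split values is also an assumption; the source only states that label matching ignores case.

```python
SEPARATOR = ";"  # field separator used inside multi-valued template cells


def split_values(cell):
    """Split a multi-valued cell on ';' (whitespace trimming is an assumption)."""
    return [part.strip() for part in cell.split(SEPARATOR) if part.strip()]


def find_unmatched_references(primary_refs, known_labels):
    """Return references that have no case-insensitive match among known labels."""
    known = {label.casefold() for label in known_labels}
    return [ref for ref in primary_refs if ref.casefold() not in known]


def cells_containing_separator(rows, single_valued_columns):
    """List (row_number, column, value) for single-valued cells that contain ';'."""
    hits = []
    for row_number, row in enumerate(rows, start=1):
        for column in single_valued_columns:
            value = row.get(column, "")
            if SEPARATOR in value:
                hits.append((row_number, column, value))
    return hits


if __name__ == "__main__":
    # Hypothetical rows from a primary template.
    rows = [
        {"Name": "Anti-CD4; clone X", "Located In": "smith lab"},
        {"Name": "Anti-CD8", "Located In": "Jones Lab"},
    ]
    # Labels entered in a (hypothetical) secondary Organization file.
    organization_labels = ["Smith Lab"]

    print(cells_containing_separator(rows, ["Name"]))
    references = [row["Located In"] for row in rows]
    print(find_unmatched_references(references, organization_labels))
```

    Any cell reported by cells_containing_separator would be split into several values during ETL, and any reference reported by find_unmatched_references would not link to a row in the secondary file.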

Special rules for generic organisms

Generic organisms are instances of an Organism subclass that are meant to represent that subclass as a whole, when no specific instances are needed in the data - for example the generic organism https://global.eagle-i.net/i/Mus_musculus represents the Organism subclass http://purl.obolibrary.org/obo/NCBITaxon_10090. Generic organisms typically reside in eagle-i's Centrally Curated Resources.
In your template, if a column can accept an Organism value, you may use a specific instance or a generic instance. If you are using a generic instance, enter the label (e.g. Mus musculus) in both name and type columns.

Special rules for embedded instances

...

  • In the template for the main resource, enter a list of semicolon-separated labels for the embedded resources, and fill in the type column if it exists (only one value is required, not a semicolon-separated list). This will result in the creation of "empty" embedded resources.
  • ETL this file (see below)
  • In order to ETL the rest of the properties for the embedded resources:
    • Generate a template/map for the embedded resource class
      • Templates generated for embedded classes will have two additional columns: Main Resource Name and Main Resource Type
    • Fill the template with the information for the embedded resource (one row per embedded resource), making sure the entries for Main Resource Name and Main Resource Type match the label and type previously used when ETLing the main resource. If there are different sub-types for different embedded resources (for example, for Phenotypes), those can be specified in the type field here.
    • Run the ETL command with the embedded resource file as input. Note that the -p and -eid parameters will be ignored if they are present.

...

  • Determine the type of ETL command that you need to use (create all new or replace/create). ETL commands are described in the SWIFT toolkit guide. 
  • Note that the ETL process may need to run for a long time (depending on the number of rows in your input, the number of properties filled per row and the network speed), so plan accordingly, e.g. leave it running overnight.

    Tip
    It is good practice to run a test ETL with a subset of rows to detect possible errors in the template data. Run it against a staging server if you have access to one or against the eagle-i training node. If neither of these options is desirable, use your live eagle-i node but ETL the data in the DRAFT state and de-ETL it afterwards.
  • We recommend that the ETL process be run using dedicated credentials (e.g. create an "automated curator" user in your eagle-i repository), such that it is later possible to easily isolate resources that were bulk loaded. ETL needs a Level 4 role.
  • ETL first the secondary resource type files (i.e. linked resources), then the primary resource type file.

    Tip
    We find it useful to name the data subdirectories according to the order in which they need to be ETLd, e.g. for iPS Cells:
    1-human-subject
    2-diagnosis
    3-biological-specimen
    4-primary-cell-line
    5-induced-pluripotent-stem-cell-line
  • We recommend ETLing resources into the CURATION state, verifying that they were ETLd correctly, and then publishing them using the bulk workflow command. The PUBLISH state may be used directly if you are satisfied with your testing.
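
If the data subdirectories follow the numeric-prefix naming convention from the tip above, the ETL order can be derived from the names. The sketch below is only an illustration: the helper name etl_order is hypothetical, and it sorts by the integer prefix rather than alphabetically, so that a directory such as 10-... is not placed before 2-.... It does not invoke the ETL command itself.

```python
import re


def etl_order(directory_names):
    """Sort names such as '2-diagnosis' by their leading integer prefix."""
    def prefix(name):
        match = re.match(r"(\d+)-", name)
        if match is None:
            raise ValueError(f"no numeric prefix in directory name: {name!r}")
        return int(match.group(1))

    return sorted(directory_names, key=prefix)


if __name__ == "__main__":
    names = [
        "3-biological-specimen",
        "1-human-subject",
        "5-induced-pluripotent-stem-cell-line",
        "2-diagnosis",
        "4-primary-cell-line",
    ]
    for name in etl_order(names):
        print(name)
```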

Post ETL tasks

  • Examine the logs generated by the ETL command for possible errors. A log file will be generated per ETLd file, and the row where the error occurred will be indicated. You may need to re-ETL these rows.
  • If an ontology term is not resolved during the ETL process, a triple will be added with the predicate http://eagle-i.org/ont/datatools/1.0/temp_term_not_found and the literal value found in the data file.
    • Issue the following SPARQL query against your repository to find these instances, and correct them via the SWEET:

       select * where {?s <http://eagle-i.org/ont/datatools/1.0/temp_term_not_found> ?o}
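
      To work through a long result list, it can help to group the unresolved literals. The sketch below assumes the query results have been saved in the standard SPARQL JSON results format; how to obtain them from your repository is not shown here, and the sample URIs and literals are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample in the W3C SPARQL JSON results format.
SAMPLE = """
{
  "head": {"vars": ["s", "o"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://example.org/i/0001"},
     "o": {"type": "literal", "value": "Mus muscullus"}},
    {"s": {"type": "uri", "value": "http://example.org/i/0002"},
     "o": {"type": "literal", "value": "Mus muscullus"}},
    {"s": {"type": "uri", "value": "http://example.org/i/0003"},
     "o": {"type": "literal", "value": "Western blott"}}
  ]}
}
"""


def unresolved_terms(results_json):
    """Return (resource URI, unresolved literal) pairs from the query results."""
    data = json.loads(results_json)
    return [
        (binding["s"]["value"], binding["o"]["value"])
        for binding in data["results"]["bindings"]
    ]


def count_by_literal(pairs):
    """Count how many resources carry each unresolved literal."""
    return Counter(literal for _, literal in pairs)


if __name__ == "__main__":
    pairs = unresolved_terms(SAMPLE)
    for literal, count in count_by_literal(pairs).most_common():
        print(f"{count} resource(s) with unresolved term: {literal}")
```

      Literals that occur many times usually point to a single misspelling or missing synonym in the input file, which may be quicker to fix at the source and re-ETL than to correct resource by resource.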

...