SWIFT ETL commands require that all data be entered into SWIFT templates. One SWIFT template must be used for each resource type to be ETLd. Typically, the primary resource type you are ETLing (e.g. Plasmid) will require several secondary resource types (e.g. Person, Journal Article, ...) to be fully described in eagle-i -- the secondary resources will be linked from the primary resources.
Entering data correctly into SWIFT templates is key for the ETL process to succeed without creating duplicates or non-conforming data. You may enter data into SWIFT templates in the following ways:
Data that exists electronically will typically be stored in a relational database and accessible via a database dump, or accessible through an API (e.g. in JSON format). It is usually necessary to perform a few transformations on the original data in order to fit it into a SWIFT template. This step is highly dependent on the nature of the original data, and hence the procedures will need to be developed on a case-by-case basis. In mapping the data from its original schema to the eagle-i ontology, the following scenarios may be encountered:
We usually write ad-hoc scripts that perform such transformations on the original data before copy-pasting individual columns into a SWIFT template. For controlled vocabulary fields, we produce mapping tables with the help of our domain experts and use them during these pre-processing steps.
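As an illustration of the pre-processing step described above, here is a minimal Python sketch that applies a mapping table to one controlled vocabulary column. The column names, the sample rows and the contents of SPECIES_MAP are hypothetical; a real mapping table would be prepared with domain experts, and the real source data would come from your own database dump or API.

```python
import csv
import io

# Hypothetical mapping table: source value (lowercased) -> ontology label.
SPECIES_MAP = {
    "mouse": "Mus musculus",
    "human": "Homo sapiens",
}


def map_column(rows, column, mapping):
    """Map one controlled-vocabulary column; return (mapped_rows, unmapped_values)."""
    mapped, unmapped = [], set()
    for row in rows:
        raw = row[column].strip()
        new_row = dict(row)
        if raw.lower() in mapping:
            new_row[column] = mapping[raw.lower()]
        elif raw:
            # Leave the value untouched and report it for manual review.
            unmapped.add(raw)
        mapped.append(new_row)
    return mapped, sorted(unmapped)


# Hypothetical source data, standing in for a database dump exported as CSV.
SOURCE = "name,species\nAb-1,Mouse\nAb-2,human\nAb-3,rat\n"
rows = list(csv.DictReader(io.StringIO(SOURCE)))
mapped_rows, unmapped = map_column(rows, "species", SPECIES_MAP)
print("Values still needing a mapping:", unmapped)
```

Reporting unmapped values instead of silently passing them through makes it easier to extend the mapping table before the column is pasted into a template.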
SWIFT templates and maps need to be generated using the toolkit version that corresponds to your eagle-i repository version (this is very important because ontology versions matter - if versions don't match, you might end up with non-conforming or incomplete resource descriptions).
When ETLing a primary type, there are usually resources of other types that are related to it (e.g. People, Organizations, Publications). It is necessary to enter information for these related types in a separate template. For example, when ETLing a Monoclonal Antibody, you'll have separate files for related Hybridoma Cell Lines, People and Publications.
Use the generate-inputs command described in the SWIFT toolkit guide to generate the following templates:
Do not modify the directories that are created by this command.
Create a directory dedicated to your data files, with a subdirectory for each resource type. Copy the templates you will use to the appropriate subdirectory. Note that more than one file of a given resource type may be used (e.g. you could have a file per lab if you're ETLing multiple labs). We recommend that, before adding resource data to a template, you rename the file so that the name is meaningful, since all resources ETLd from a file will be tagged with that file's name. We've found it useful to use a name that reflects the date of the ETL, e.g. 20150627-NYSCF-human-study.xls.
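The directory layout and naming convention above could be scripted roughly as follows. This is only a sketch: the function name stage_template, the directory names and the stand-in template file are all hypothetical, and the generated template is copied rather than moved so that the directories created by the toolkit stay unmodified.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def stage_template(template, data_root, resource_type, label, etl_date):
    """Copy a template into data_root/<resource_type>/ under a dated, meaningful name."""
    template = Path(template)
    target_dir = Path(data_root) / resource_type
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{etl_date:%Y%m%d}-{label}{template.suffix}"
    if target.exists():
        # Refuse to overwrite a data file that may already contain resources.
        raise FileExistsError(target)
    shutil.copy2(template, target)  # the generated template itself is left untouched
    return target


# Demonstration in a throwaway directory; all paths below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    generated = Path(tmp) / "generated" / "template.xls"  # stand-in for a generated template
    generated.parent.mkdir()
    generated.write_bytes(b"")
    staged = stage_template(
        generated,
        Path(tmp) / "etl-data",
        "human-study",
        "NYSCF-human-study",
        date(2015, 6, 27),
    )
    staged_name = staged.name
    print(staged_name)
```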
SWIFT Templates include different kinds of columns:
Consult the tooltips in the header rows for more information about what is expected.
It is highly recommended you delete any columns from the template that are not needed. This eliminates potential confusion and makes the sheets much less unwieldy.
You must always enter the Organization with which the resource is associated, either by name or by URI.
Use URIs instead of labels with caution, as all checks are bypassed when URIs are used (in particular, the ETL process does not verify that the resource exists in the repository or that the term exists in the ontology, so you may end up with non-conforming data).
Generate ETL templates for the most specific type needed, i.e. Antibody instead of parent type Reagent, Core Facility instead of parent type Organization, Biological process phenotype instead of parent type Phenotype, Journal Article instead of parent type Document, etc. This is because the properties generated for each of these more specific subtypes will be different from those generated for the root class.
We have found that the URI option is most useful when all values in the column are the same - otherwise it makes visual inspection of the data more difficult.
Generic organisms are instances of an Organism subclass that are meant to represent that subclass as a whole, when no specific instances are needed in the data - for example the generic organism https://global.eagle-i.net/i/Mus_musculus represents the Organism subclass http://purl.obolibrary.org/obo/NCBITaxon_10090. Generic organisms typically reside in eagle-i's Centrally Curated Resources.
In your template, if a column can accept an Organism value, you may use a specific instance or a generic instance. If you are using a generic instance, enter the label (e.g. Mus musculus) in both name and type columns.
Embedded resources are instances of Embedded Classes (see the eagle-i ontology browser). Embedded resources only make sense as part of another "main" resource and as such they can only be created or updated as part of a creation or update operation for that resource. Examples of embedded classes are Construct Inserts (embedded in Constructs) or Diagnosis (embedded in Human Subjects).
In the context of ETL and SWIFT templates, embedded resources are a special case of linked resources. If there is only one embedded resource in a given resource, it is possible to enter its information directly in the SWIFT template row for the main resource. If there is more than one embedded resource in a given resource, the following procedure must be used:
The -p and -eid parameters will be ignored if they are present.
This Google Drive folder contains a few annotated examples of SWIFT templates with data.
The ETL process may need to run for a long time (depending on the number of rows in your input, the number of properties filled per row and the network speed), so plan accordingly - e.g. leave it running overnight.
ETL the secondary resource type files (i.e. linked resources) first, then the primary resource type file.
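The ordering rule above can be expressed as a small helper that lists the data files in the order they should be ETLd. This is a sketch only: the function name etl_order, the resource type keys and the file paths are hypothetical, and the actual ETL command to run on each file is deliberately left out.

```python
def etl_order(files_by_type, primary_type):
    """Return data files with all secondary (linked) types first and the primary type last."""
    secondary = [
        path
        for resource_type in sorted(files_by_type)
        if resource_type != primary_type
        for path in sorted(files_by_type[resource_type])
    ]
    return secondary + sorted(files_by_type.get(primary_type, []))


# Hypothetical data files, grouped by resource type.
files = {
    "Plasmid": ["plasmid/20150627-lab-a.xls"],
    "Person": ["person/people.xls"],
    "Journal Article": ["journal-article/articles.xls"],
}
order = etl_order(files, "Plasmid")
for path in order:
    print(path)
```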
http://eagle-i.org/ont/datatools/1.0/temp_term_not_found and the literal value found in the data file.
Issue the following SPARQL query against your repository to find these instances and correct them via the SWEET:
select * where {?s <http://eagle-i.org/ont/datatools/1.0/temp_term_not_found> ?o}
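If you prefer to issue the query from a script, the sketch below builds a standard SPARQL protocol POST request for it using only the Python standard library. The endpoint URL is a placeholder, not the actual address of any eagle-i repository, and the sketch assumes an endpoint that accepts a URL-encoded query parameter; authentication, which your repository may require, is not handled here.

```python
from urllib.parse import urlencode
from urllib.request import Request

TEMP_TERM = "http://eagle-i.org/ont/datatools/1.0/temp_term_not_found"
QUERY = f"select * where {{?s <{TEMP_TERM}> ?o}}"


def build_request(endpoint):
    """Build a POST request that sends QUERY as a URL-encoded 'query' parameter."""
    body = urlencode({"query": QUERY}).encode("utf-8")
    return Request(
        endpoint,
        data=body,  # supplying a body makes urllib issue a POST
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )


# Placeholder endpoint: replace with your repository's SPARQL endpoint.
request = build_request("https://repository.example.org/sparql")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is omitted because it depends on the endpoint and credentials of your own repository.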