
OpenCGA Catalog Data Models

Nacho edited this page May 11, 2015 · 19 revisions

Catalog Data Models Definition

This section explains all the data models used in Catalog.

For more detailed information about the Java data models you can browse the source code at Java beans or take a look at the JSON Schemas.

  • User
    • Project
      • Study
        • File
        • Job
        • VariableSet
          • Variable
        • Sample
          • AnnotationSet
            • Annotation
          • Individual
        • ACL
        • Experiment
        • Dataset
        • Cohort
    • Tool
      • Manifest

Common fields

Some fields appear in many of the data models:

  • id: a positive numeric identifier that is unique across the whole Catalog. This id can be used in the API and REST web services.
  • attributes: applications using OpenCGA can store custom information in this field; any well-formed JSON object is accepted.
  • lastActivity: reports when the data was last updated. This is useful when refreshing a web client interface.
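
As a rough illustration of how a web client might use lastActivity, the sketch below keeps the last value it saw and refreshes only when the stored value differs. The helper name needs_refresh and the plain-dict representation are assumptions made for this example; they are not part of the OpenCGA API.

```python
def needs_refresh(cached_last_activity, stored_object):
    """Return True when the stored object changed since the client last saw it."""
    # Hypothetical helper: compares the cached lastActivity string
    # with the one currently stored in the Catalog object.
    return stored_object.get("lastActivity") != cached_last_activity


user = {"id": "jcoll", "lastActivity": "20141215182938676"}

print(needs_refresh("20141215182938676", user))  # False: nothing changed
print(needs_refresh("20141215162449000", user))  # True: the object was updated
```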

Session

Registers every login and logout made by the user. The sessionId is valid only while the logout field is empty.
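
The validity rule can be sketched as a small check. This is only an illustration: the function name is hypothetical, and it assumes that an "empty" logout may be stored as a missing key, null or an empty string.

```python
def is_session_active(session):
    """Return True while the session has no logout timestamp."""
    # Assumption: a missing key, None and "" all count as an empty logout.
    return not session.get("logout")


closed_session = {
    "id": "Sq6JKQ5Uv8MOwK5jmgyd",
    "ip": "10.0.0.14",
    "login": "20141215162449",
    "logout": "20141215165837",
}
open_session = dict(closed_session, logout="")

print(is_session_active(closed_session))  # False
print(is_session_active(open_session))    # True
```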

User and Project

This is the root level of the hierarchy. A User represents any person registered in the system.

Most relevant fields are:

  • id: Alphanumerical string identifier. This is the only non-numerical Id.
  • status: Accepted account status:
    • ACTIVE:
    • BANNED:
    • DELETED:
    • ACTIVATION_PENDING:
  • role: Accepted role values:
    • ADMIN:
    • USER:
    • ANONYMOUS:
  • tools:

Example:

{
  "id": "jcoll",
  "name": "jacobo",
  "email": "[email protected]",
  "password": "dWArAxd6QlNqzL9qGchg",
  "organization": "ACME",
  "role": "USER",
  "status": "ACTIVE",
  "sessions": [
    {
      "id" : "Sq6JKQ5Uv8MOwK5jmgyd",
      "ip" : "10.0.0.14",
      "login" : "20141215162449",
      "logout" : "20141215165837"
    }
  ],
  "lastActivity": "20141215182938676",
  "tools": [],
  "configs": {},
  "attributes": {},
  "projects": [
    {
      "id": 14,
      "name": "Project1",
      "alias": "proj1",
      "creationDate": "20141215182727",
      "description": "Test project",
      "organization": "ACME",
      "status": "",
      "studies": [  ],
      "attributes": {}
    }
  ]
}

* In this example the contents of the studies and tools arrays have been omitted. They are explained below.

Study

Main Catalog object. A study groups files, jobs and samples, among other things. All the files in a study share the same location, cipher and sharing options (ACLs).

Most relevant fields are:

  • type: (to cohort?)
    • CASE_SET:
    • CONTROL_SET:
    • CASE_CONTROL:
    • PAIRED:
    • FAMILY:
    • TRIO:
  • stats: (to cohort?)
  • status:
    • ACTIVE:
  • diskUsage: Sum of the diskUsage of all files in the study.
  • cipher: Mechanism used to encrypt all files in the study. Accepted values:
    • NONE: Without encryption.
    • AES_256: not implemented yet
  • uri: Location of the study. A URI is required instead of a path because studies can live on different hosts and file systems.
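
A minimal sketch of two of these fields, assuming the study's files are available as plain dictionaries: diskUsage is computed as the sum over the files, and the uri is split with Python's standard urllib.parse to see which file system it points to. The helper names are hypothetical and not part of OpenCGA.

```python
from urllib.parse import urlparse


def study_disk_usage(files):
    """Sum the diskUsage of every file entry in a study."""
    # Assumption: entries without a diskUsage field count as 0.
    return sum(f.get("diskUsage", 0) for f in files)


def study_location(uri):
    """Split a study URI into (scheme, host, path)."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


files = [
    {"id": 3, "diskUsage": 24276833},
    {"id": 4, "diskUsage": 100},
]
print(study_disk_usage(files))  # 24276933

print(study_location("hdfs:///data/opencga/catalog2/users/jcoll/projects/14/15/"))
# ('hdfs', '', '/data/opencga/catalog2/users/jcoll/projects/14/15/')
```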

Example:

{
  "id": 15,
  "name": "Study test 1",
  "alias": "std1",
  "type": "FAMILY",
  "creatorId": "jcoll",
  "creationDate": "20141215182938",
  "description": "",
  "status": "ACTIVE",
  "lastActivity": "20141215182938",
  "diskUsage": 0,
  "cipher": "NONE",
  "acl": [ ],
  "experiments": [ ],
  "files": [ ],
  "jobs": [ ],
  "samples": [ ],
  "uri": "hdfs:///data/opencga/catalog2/users/jcoll/projects/14/15/",
  "datasets": [
    {
      "id": 0,
      "name": "bam_test_files",
      "creationDate": "20141215182938",
      "description": " ... ",
      "files": [ 26, 27, 28, 29, 35, 36, 38],
      "attributes": { }
    }
  ],
  "cohorts": [ ],
  "variableSets": [ ],
  "stats": { },
  "attributes": { }
}

* In this example the contents of the files, jobs and samples arrays have been omitted. They are explained below.

Dataset
Cohort
VariableSet and Variable

File

Most relevant fields are:

  • type: Accepted values:
    • FILE: Any real file stored in the file system.
    • FOLDER: File container.
    • INDEX: Not a real file. Represents an indexed file in an OpenCGA-Storage engine. Removed in v0.6.0.
  • format:
    • PLAIN:
    • GZIP:
    • BINARY:
    • IMAGE:
    • EXECUTABLE:
  • bioformat:
    • VARIANT:
    • ALIGNMENT:
    • SEQUENCE:
    • NONE:
  • status: File status. For more information, go to File life cycle. Accepted values:
    • UPLOADING: The file is being uploaded.
    • UPLOADED: Whole file uploaded. It has to be moved to the final destination.
    • INDEXING: The file is being indexed. Removed in v0.6.0.
    • READY: File is ready to use.
    • DELETING: Deletion pending.
    • DELETED: Deleted file. Irreversible deletion.
  • jobId and experimentId: Specifies the source of the file. A file can be generated from a job or an experiment.
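
The sketch below shows one way to read the source of a file from jobId and experimentId. It assumes, based on the example in this section, that -1 means "not set"; the function name and the return shape are hypothetical.

```python
def file_source(file_entry):
    """Return ("job", id), ("experiment", id), or None when no source is recorded."""
    # Assumption: negative ids (such as -1 in the File example) mean "not set".
    job_id = file_entry.get("jobId", -1)
    experiment_id = file_entry.get("experimentId", -1)
    if job_id >= 0:
        return ("job", job_id)
    if experiment_id >= 0:
        return ("experiment", experiment_id)
    return None


uploaded = {"id": 3, "status": "READY", "jobId": -1, "experimentId": -1}
job_output = {"id": 658, "status": "READY", "jobId": 138, "experimentId": -1}

print(file_source(uploaded))    # None
print(file_source(job_output))  # ('job', 138)
```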

Example:

{
  "id" : 3,
  "name" : "chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.AMR.vcf.gz",
  "type" : "FILE",
  "format" : "GZIP",
  "bioformat" : "VARIANT",
  "path" : "data/vcf/chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.AMR.vcf.gz",
  "ownerId" : "jcoll",
  "creationDate" : "20141215162449",
  "description" : " ... ",
  "status" : "READY",
  "diskUsage" : 24276833,
  "experimentId" : -1,
  "sampleIds" : [ ],
  "jobId" : -1,
  "acl" : [ ],
  "stats" : { },
  "attributes" : { }
}

Job

Example:

{
  "id" : 138,
  "name" : "Test job",
  "userId" : "jcoll",
  "toolName" : "network-miner",
  "date" : "20141031151537",
  "description" : " ... ",
  "startTime" : 1415632245213,
  "endTime" : 1415632258708,
  "outputError" : "",
  "commandLine" : "/opt/opencga/analysis/network-miner/babelomics/babelomics.sh --tool network-miner  --seedlist 150140.chrom20.ILLUMINA.bwa.CHM1.20131218.bam.bai --significant-value 0.05 --list HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam --list-tags gene --intermediate 1 --outdir /home/cafetero/opencga/catalog/jobs/J_KrOrWfEwkx/ --order ascending --interactome hsa --randoms 500 --components false --group all --o-name result",
  "visits" : -1,
  "status" : "READY",
  "outDirId" : "6",
  "tmpOutDirUri" : "file:///home/cafetero/opencga/catalog/jobs/J_KrOrWfEwkx/",
  "input" : [
    66
  ],
  "tags" : [ ],
  "output" : [
    658,
    659
  ],
  "attributes" : { },
  "executionAttributes" : {
    "type" : "analysis",
    "jobExecutionId" : "268",
    "executionManager" : "SGE",
    "qname" : "normal.q",
    "group" : "cafetero",
    "jobname" : "network-miner_Test_job",
    "end_time" : "Wed Dec 10 11:10:06 2014",
    "jobnumber" : 268,
    "failed" : 0,
    "start_time" : "Wed Dec 10 11:10:06 2014",
    "hostname" : "host001",
    "qsub_time" : "Wed Dec 10 11:10:00 2014",
    "mem" : "0.000",
    "cpu" : "0.049",
    "exit_status" : 0
  }
}

Sample

Example:

{
  "id" : 19,
  "name" : "SMP00096",
  "source" : "",
  "individual" : null,
  "description" : " ... ",
  "annotationSets" : [
    {
      "name" : "Basic annotation",
      "variableSetId" : 21,
      "annotations" : [
        { "id" : "NAME",      "value" : "Glennie the platypus" },
        { "id" : "BORN-DATE", "value" : "20071000000000" },
        { "id" : "GENDER",    "value" : "FEMALE" },
        { "id" : "PHEN",      "value" : "CASE" },
        { "id" : "WEIGHT",    "value" : 25.38 }
      ],
      "date" : "20141216135957",
      "attributes" : { }
    }
  ]
}
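
Each entry in annotations pairs a variable id with a value. As an illustrative sketch only (the helper name is hypothetical), an annotation set can be flattened into a dictionary for easier lookup:

```python
def annotations_by_id(annotation_set):
    """Map each annotation id to its value."""
    # Assumption: annotation ids are unique within one annotation set.
    return {a["id"]: a["value"] for a in annotation_set.get("annotations", [])}


annotation_set = {
    "name": "Basic annotation",
    "variableSetId": 21,
    "annotations": [
        {"id": "GENDER", "value": "FEMALE"},
        {"id": "PHEN", "value": "CASE"},
        {"id": "WEIGHT", "value": 25.38},
    ],
}

values = annotations_by_id(annotation_set)
print(values["PHEN"])    # CASE
print(values["WEIGHT"])  # 25.38
```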