Data storage and AmiGO2 Working Group

From Planteome.org

=Goals and Objectives:=
== Aim-3: Develop an online informatics portal and data warehouse for ontology-based, annotated plant genome data and plant genomes. ==
*Deliverables: A centralized portal for the common reference ontologies for plants and their associated data sets, plus a novel data store and web user interface.
===3.1 Planteome Web Portal Development===
* A Drupal portal will host the AmiGO browser, the ontology database (similar to the one developed by the PO and GO consortia), and a BioMart instance
* Transition to AmiGO 2.0 with new features


===3.2 Planteome Data Warehouse Development===
* Novel data warehouse for storing both the ontologies and the annotation data, based on NoSQL technologies (e.g. MongoDB, http://www.mongodb.org, and Apache™ Hadoop®, http://hadoop.apache.org)
* Integrate the MapReduce algorithm to increase scalability and performance
* Investigate using HDF ([http://en.wikipedia.org/wiki/Hierarchical_Data_Format Hierarchical Data Format]) as a storage format for any numerical or sequence-based data
* Create an efficient way to add annotations incrementally to the database, which is not possible with the current AmiGO load process (a hedged sketch of one possible approach follows this list)
* Implementation of OLAP (Online Analytical Processing) data cubes (http://en.wikipedia.org/wiki/OLAP_cube)
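
To make the incremental-loading requirement concrete, the sketch below shows how annotation records could be upserted into MongoDB, one of the NoSQL candidates listed above. It is a minimal sketch under stated assumptions, not a committed design: the database, collection, and field names (planteome, annotations, gene_id, term_id, evidence_code) are illustrative only.

<syntaxhighlight lang="python">
# Minimal sketch of incremental annotation loading into MongoDB (pymongo).
# The database/collection/field names below are illustrative assumptions,
# not the project's actual schema.
from pymongo import ASCENDING, MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
annotations = client["planteome"]["annotations"]

# A unique compound index lets repeated loads update existing records
# instead of duplicating them, which is the incremental behaviour the
# current AmiGO load process does not support.
annotations.create_index(
    [("gene_id", ASCENDING), ("term_id", ASCENDING), ("evidence_code", ASCENDING)],
    unique=True,
)

def upsert_annotations(records):
    """Insert new annotation records and update changed ones in one batch."""
    ops = [
        UpdateOne(
            {"gene_id": r["gene_id"], "term_id": r["term_id"],
             "evidence_code": r["evidence_code"]},
            {"$set": r},
            upsert=True,
        )
        for r in records
    ]
    result = annotations.bulk_write(ops, ordered=False)
    return result.upserted_count, result.modified_count

# Example: load a small batch; re-running it only updates, never duplicates.
inserted, updated = upsert_annotations([
    {"gene_id": "AT1G01010", "term_id": "PO:0025034",
     "evidence_code": "IEP", "source": "TAIR"},
])
print(f"inserted {inserted}, updated {updated}")
</syntaxhighlight>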


===3.3 Integration with the iPlant infrastructure===
* Initial design and testing will happen locally at the Center for Genome Research and Biocomputing at Oregon State University
* Use of virtual machine (VM) images in the iPlant cloud computing environments
* Utilization of high-performance computing resources, such as:
** The supercomputer 'Stampede' at the Texas Advanced Computing Center (TACC)
** Use of iRODS at iPlant for data file storage and retrieval
** Image hosting via Bisque on the iPlant infrastructure (see 3.4, below)
* Interaction with resources such as CoGE, Bisque, and the Integrated Breeding Platform (IBP)


===3.4 Library of Publicly-Accessible, Annotated Digital Images===
* Design a relational data schema to support large-scale storage of annotated images and their associated metadata (a hedged sketch follows this list)
* The image library's main goal is to provide a training set for a new auto-segmentation and annotation active-learning algorithm
* Support other visual analysis tools and the integration of image data with ontology data
* Will also function as a home for community-contributed image data
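
The relational schema itself has not been designed yet; as a rough illustration of what the first bullet above describes, the sketch below models images and their ontology-term annotations with SQLAlchemy. Every table and column name here is an assumption for illustration only.

<syntaxhighlight lang="python">
# Hypothetical sketch of a relational schema for the annotated image library
# using SQLAlchemy; all table and column names are illustrative assumptions.
from sqlalchemy import Column, ForeignKey, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Image(Base):
    __tablename__ = "image"
    id = Column(Integer, primary_key=True)
    uri = Column(String(1024), nullable=False)   # e.g. an iRODS or Bisque location
    contributor = Column(String(255))            # community contributor, if any
    taxon = Column(String(255))                  # organism the image depicts
    annotations = relationship("ImageAnnotation", back_populates="image")

class ImageAnnotation(Base):
    __tablename__ = "image_annotation"
    id = Column(Integer, primary_key=True)
    image_id = Column(Integer, ForeignKey("image.id"), nullable=False)
    term_id = Column(String(64), nullable=False)  # ontology term, e.g. a PO identifier
    region = Column(Text)                         # optional segmentation geometry, e.g. polygon JSON
    image = relationship("Image", back_populates="annotations")

# Create the tables in a throwaway in-memory database to check the schema.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
</syntaxhighlight>

Keeping the ontology links in a separate annotation table would let the same image records feed both the active-learning training set and other visual analysis tools.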


===3.5 Application Programming Interfaces (APIs)===
* Develop publicly available APIs for both internal and external access to ontology terms and annotations
* Extend the existing lightweight web services that provide Plant Ontology terms, synonyms, and definitions into the Planteome APIs, including direct web service access to annotated data (a hedged client sketch follows this list)
* Potential users:
** Gramene project: information about annotations and ontologies
** DOE KBase project (http://kbase.science.energy.gov/)
** iPlant tools and services
* Integrate our data with other external APIs, for example:
** EBI (the Gene Expression Atlas, Ensembl Plants, IntAct)
** ERA-CAPS (genotype-to-phenotype data)
** DOE KBase
** GCP Integrated Breeding Platform
** Agave on iPlant, which provides web-focused developer access to the iPlant data store and other integration services, including a direct link to high-performance computing systems such as those at TACC
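
To give a sense of how an external consumer such as Gramene or KBase might use these APIs once they exist, the sketch below shows a minimal HTTP client. The base URL, paths, query parameters, and response fields are hypothetical placeholders; the real endpoints will be defined as the APIs are developed.

<syntaxhighlight lang="python">
# Hypothetical client for the planned Planteome APIs.  The base URL, paths,
# and JSON fields are placeholders, not real endpoints.
import requests

BASE_URL = "https://api.example.org/planteome"  # placeholder, not a real service

def get_term(term_id):
    """Fetch one ontology term (name, synonyms, definition) as JSON."""
    resp = requests.get(f"{BASE_URL}/term/{term_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def get_annotations(term_id, limit=50):
    """Fetch annotations attached to a term, e.g. for Gramene or KBase."""
    resp = requests.get(
        f"{BASE_URL}/annotations",
        params={"term": term_id, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Example usage with a Plant Ontology term identifier.
term = get_term("PO:0025034")
print(term.get("name"), term.get("definition"))
for ann in get_annotations("PO:0025034"):
    print(ann.get("gene_id"), ann.get("evidence_code"))
</syntaxhighlight>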


=Participants=
* Jaiswal Lab (OSU, BPP): Justin Elser
* Mungall Group (Lawrence Berkeley National Laboratory): Chris Mungall (Co-PI), Seth Carbon
* Zhang Lab (OSU, EECS): Eugene Zhang (Co-PI), Botong Qu (CS Ph.D. student)


=Data storage and AmiGO 2 Working Group Meetings:=
See also: [[Data storage and AmiGO 2 Working Group Meetings]]

* Data storage and AmiGO 2 call, 1-30-15
** Who: PJ, CM, Seth, EZ, LC, JP, JE
** Discussion of the planned transition to the AmiGO 2.0 platform
** JE is working on installing the Solr database; view details and progress reports here: [[AmiGO2_install]]

* Data Storage and AmiGO2 call, 2-18-15 [[Media:Data_Storage_2-18-15.mp4]]
** Who: JE, JP, EZ
** Further discussion of AmiGO2 progress and an overview of the AmiGO2 interface

Relevant links:
* https://github.com/geneontology/amigo
* Demo: http://amigo.geneontology.org/
* Dev server: http://amigo2.berkeleybop.org/
