Data storage and AmiGO2 Working Group
Latest revision as of 19:04, 2 June 2015
Goals and Objectives:
Aim-3: Develop an online informatics portal and data warehouse for ontology-based, annotated plant genome data and plant genomes.
- Deliverables: A centralized portal for common reference ontologies for plants and the associated data sets. Novel data store and web user interface.
3.1 Planteome Web Portal Development
- Drupal portal will host the AmiGO browser, the ontology database (similar to the one developed by the PO and the GO), and a BioMart
- Transition to AmiGO 2.0 with new features
3.2 Planteome Data Warehouse Development
- Novel data warehouse for storing both the ontologies and annotation data based on NoSQL (e.g. MongoDB, http://www.mongodb.org, and Apache™ Hadoop®, http://hadoop.apache.org)
- Integrate the MapReduce algorithm to increase scalability and performance
- Investigate using HDF (Hierarchical Data Format) as a storage format for any numerical or sequence-based data.
- Create an efficient way to add annotations incrementally to the database (not possible in the current AmiGO database)
- Implementation of OLAP (Online Analytical Processing) data cubes (http://en.wikipedia.org/wiki/OLAP_cube)
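As a rough illustration of the MapReduce item above, the following minimal sketch counts annotations per ontology term with explicit map, shuffle, and reduce steps in plain Python. It is not the project's implementation: the record layout, the function names, and the `PO:0000001`-style identifiers are placeholders assumed for the example, and a real deployment would run the same pattern on a framework such as Hadoop.

```python
from collections import defaultdict
from itertools import chain


def map_annotation(record):
    """Map step: emit one (term_id, 1) pair per annotation record."""
    yield (record["term_id"], 1)


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def count_annotations_per_term(records):
    """Run map, shuffle, and reduce in sequence over the records."""
    pairs = chain.from_iterable(map_annotation(r) for r in records)
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Hypothetical annotation records; gene names and term IDs are placeholders.
annotations = [
    {"gene": "geneA", "term_id": "PO:0000001"},
    {"gene": "geneB", "term_id": "PO:0000001"},
    {"gene": "geneA", "term_id": "PO:0000002"},
]
counts = count_annotations_per_term(annotations)
```

Because the reduce step is a plain sum, counts from a newly added batch of annotations could be merged into earlier totals without reprocessing the existing records, which is one possible route to the incremental loading described above.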
3.3 Integration with the iPlant infrastructure
- Initial design and testing will happen locally at the Center for Genome Research and Biocomputing at Oregon State University
- Use of virtual machine (VM) images in the iPlant cloud computing environments
- Utilization of high performance computing resources, such as:
  - The supercomputer 'Stampede' at Texas Advanced Computing Center (TACC)
  - Use of iRODS at iPlant for data file storage and retrieval
  - Image hosting via Bisque hosted on the iPlant infrastructure (See 3.4, below)
- Interaction with resources such as CoGE, Bisque, and the Integrated Breeding Platform (IBP)
3.4 Library of Publicly-Accessible, Annotated Digital Images
- Design a relational data schema to support the large-scale storage of annotated images (and their associated metadata)
- Image library main goal: A training set for a new auto-segmentation and annotation active-learning algorithm
- Support other visual analysis tools and the integration of image data with ontology data
- Will also function as a home for community-contributed image data
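One possible shape for the relational schema mentioned above is sketched below using SQLite from the Python standard library. The table names, column names, and sample rows are assumptions made for illustration only; the actual Planteome schema is not specified here.

```python
import sqlite3

# Hypothetical schema: one row per image, one row per ontology annotation.
SCHEMA = """
CREATE TABLE image (
    image_id    INTEGER PRIMARY KEY,
    uri         TEXT NOT NULL UNIQUE,
    contributor TEXT,
    captured_on TEXT
);
CREATE TABLE image_annotation (
    annotation_id   INTEGER PRIMARY KEY,
    image_id        INTEGER NOT NULL REFERENCES image(image_id),
    term_id         TEXT NOT NULL,
    segment_polygon TEXT
);
CREATE INDEX idx_annotation_term ON image_annotation(term_id);
"""


def build_demo_db():
    """Create an in-memory database and load placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO image (image_id, uri, contributor) VALUES (?, ?, ?)",
        (1, "demo/leaf.png", "example-user"),
    )
    conn.executemany(
        "INSERT INTO image_annotation (image_id, term_id, segment_polygon) "
        "VALUES (?, ?, ?)",
        [
            (1, "PO:0000001", "[[0,0],[10,0],[10,10]]"),
            (1, "PO:0000002", None),
        ],
    )
    conn.commit()
    return conn


def images_for_term(conn, term_id):
    """Return the URIs of all images annotated with the given term."""
    rows = conn.execute(
        "SELECT i.uri FROM image AS i "
        "JOIN image_annotation AS a ON a.image_id = i.image_id "
        "WHERE a.term_id = ? ORDER BY i.uri",
        (term_id,),
    )
    return [uri for (uri,) in rows]


conn = build_demo_db()
```

Keeping annotations in their own table lets one image carry many ontology terms, and the index on `term_id` supports the lookup from an ontology term to its images.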
3.5 Application Programming Interfaces (APIs)
- Develop publicly available APIs for both internal and external data access to ontology terms and annotations
- Extend the existing lightweight web services providing Plant Ontology terms, synonyms, and definitions to the Planteome APIs, including direct web service access to annotated data
- Potential users:
  - Gramene project: information about annotations and ontologies
  - DOE KBase project (http://kbase.science.energy.gov/)
  - iPlant tools and services
- Integrate our data with external APIs, for example:
  - EBI (the Gene Expression Atlas, Ensembl Plants, IntAct)
  - ERA-CAPS (genotype-to-phenotype data)
  - DOE KBase
  - GCP Integrated Breeding Platform
  - Agave on iPlant, which provides web-focused developer access to the iPlant data store and other integration services, with a direct link to high-performance computing systems such as those at TACC
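To make the idea of a lightweight term service concrete, here is a minimal sketch of a lookup that returns a term's name, synonyms, and definition as JSON. The in-memory `TERMS` store, the `term_response` function, the response fields, and the placeholder term are all assumptions for illustration; they do not describe the existing Plant Ontology web services.

```python
import json

# Hypothetical in-memory term store; a real service would query the warehouse.
TERMS = {
    "PO:0000001": {
        "name": "example term",
        "synonyms": ["example synonym"],
        "definition": "A placeholder definition.",
    },
}


def term_response(term_id, store=TERMS):
    """Return an (HTTP-style status code, JSON body) pair for a term lookup."""
    term = store.get(term_id)
    if term is None:
        return 404, json.dumps({"error": "term not found", "id": term_id})
    return 200, json.dumps({"id": term_id, **term}, sort_keys=True)


status, body = term_response("PO:0000001")
```

Returning the status and body as plain values keeps the lookup independent of any particular web framework, so the same function could sit behind whichever HTTP layer the portal adopts.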
Participants
- Jaiswal Lab (OSU, BPP): Justin Elser
- Mungall Group (Lawrence Berkeley National Laboratory): Chris Mungall (Co-PI), Seth Carbon
- Zhang Lab (OSU, EECS): Eugene Zhang (Co-PI), Botong Qu (CS Ph.D. student)