CardioVascular Research Grid Software Architecture

From CVRG Wiki

Revision as of 19:43, 8 November 2008 by Kurc (Talk | contribs)

CVRG Architecture Overview 

The main objectives of the CardioVascular Research Grid (CVRG) infrastructure are to enable

  • resource providers to implement and deploy databases and analysis programs as interoperable services and enforce access control policies,
  • application developers to implement client programs that can access such services, and
  • researchers to discover such services, carry out federated queries against these resources, and compose analysis workflows which can link multiple data and analytical resources together.

This section presents an overview of the CVRG architecture, which is designed to provide this functionality.


Motivation and Challenges

An example use case supported by the CVRG infrastructure is shown in Figure 1. In this use case the researcher would like to 1) access different types of data (e.g., ECG data, image data), 2) analyze these datasets using existing analysis programs (e.g., LDDMM for images, Berger QT for ECG data), and 3) integrate the analyzed data with clinical data and other data types (e.g., SNP data) in order to, for instance, look for common and unique trends across groups of subjects. Figure 2 shows an example workflow involving the query and analysis of ECG data.

CVRG Arch Fig 1

Figure 1. An example data exploration and integration scenario.

CVRG Arch Fig 2

Figure 2. An example data analysis workflow. ECG data is queried and retrieved from two datasets. The data is stored in different data formats (Norav ECG and Physionet ECG) in these datasets. The data in each format is processed through analysis methods that understand the data format; the Norav ECG data is processed by Berger QT Algorithm and the Physionet ECG data by Physionet QT Algorithm.

In a multi-institutional setting, the datasets and analysis programs in this example use case may be hosted at different locations (as shown in Figure 3). The datasets may be stored in different database systems using different database schemas. The analysis programs may need to be invoked in different ways, even when they are implemented to carry out similar types of analyses. The programs may have different input and output data formats. Figure 2 illustrates the case in which ECG data is stored in different formats (HL7aECG and WFDB). The analysis programs accept one or the other of these formats. In such a setting, without architecture support, the client program has to be designed to understand different data formats and program invocation mechanisms. When new data sources and/or analysis programs are added to the environment, the client may need to be updated to be able to interact with the new sources. Moreover, the researcher will have no means of discovering other data and analytical resources that may be available in the environment.
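
The client-side burden described above can be made concrete with a small sketch. It assumes a hypothetical record type and two stand-in analysis functions (none of these names come from the CVRG software, and the computations are placeholders, not QT algorithms). The point is only that, without architecture support, the client itself carries a format-to-program table that must be edited whenever a new data source or analysis program appears.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EcgRecord:
    """Hypothetical ECG record; the fields are illustrative only."""
    subject_id: str
    data_format: str      # e.g. "HL7aECG" or "WFDB"
    samples: List[float]


# Stand-ins for format-specific analysis programs (placeholder arithmetic).
def analyze_hl7aecg(record: EcgRecord) -> float:
    return max(record.samples) - min(record.samples)


def analyze_wfdb(record: EcgRecord) -> float:
    return sum(record.samples) / len(record.samples)


# The client has to know which program understands which format.
ANALYZERS: Dict[str, Callable[[EcgRecord], float]] = {
    "HL7aECG": analyze_hl7aecg,
    "WFDB": analyze_wfdb,
}


def run_workflow(records: List[EcgRecord]) -> Dict[str, float]:
    """Route each record to the analysis program that accepts its format."""
    results: Dict[str, float] = {}
    for record in records:
        analyzer = ANALYZERS.get(record.data_format)
        if analyzer is None:
            # A new format means the client code must be updated.
            raise ValueError(f"no analysis program for format {record.data_format!r}")
        results[record.subject_id] = analyzer(record)
    return results
```

Adding a third format in this sketch requires a new entry in `ANALYZERS` on every client, which is the maintenance problem the service-oriented design aims to remove.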

CVRG Arch Fig 3

Figure 3. Challenges in carrying out data exploration, analysis, and integration when databases and analytical programs are heterogeneous and hosted at multiple locations. Different data formats and invocation mechanisms create barriers to interoperability of these resources. It is also necessary to enable resource providers to implement and enforce access control mechanisms so that they can restrict access to their resources to only those users who have the appropriate privileges (e.g., a user may be a collaborator using ECG and image data, or a clinician who has permission to access clinical information).


CVRG Architecture Design

In order to address these issues and accomplish its main objectives, the architecture of the CVRG is designed as a service-oriented, model/metadata driven, layered system. This layered architecture is shown in Figure 4. At the lowest layer are databases, analysis programs, and other resources hosted at one or more institutions. These resources are wrapped as domain analytical and data services using common development tools. The core Grid middleware infrastructure of the CVRG provides support for deploying secure services and accessing these services from client programs. Client programs and the CVRG portal environment constitute the highest layer of the architecture. This layer is the entry point for researchers and scientists to the CVRG environment. Developers of client programs and portal plug-ins (portlets) should use domain services, the core Grid middleware infrastructure, and the common tools provided with the middleware infrastructure in order to implement their client applications or portlets. Service developers will need to use the core middleware infrastructure and its common tools in order to implement, secure, and deploy domain specific services. In the next sections, we describe the core middleware infrastructure, common tools, domain services, and the CVRG portal environment.

CVRG Arch Fig 4

Figure 4. The layered architecture of the CVRG.

Core Grid Middleware Infrastructure and Common Tools

The core Grid middleware infrastructure of the CVRG provides runtime support and a suite of tools and infrastructure services for resource providers to develop Grid services for their resources and deploy these services in the CVRG environment. The CVRG extensively leverages the caGrid infrastructure, which has been developed in the cancer Biomedical Informatics Grid (caBIG) program of the National Cancer Institute. More detailed information on caGrid can be found at the caGrid Knowledge Center: https://cabig-kc.nci.nih.gov/CaGrid/KC/index.php/Main_Page. Here we provide a brief overview of caGrid.

caGrid Overview

caGrid is a service-oriented Grid software infrastructure that builds on the Grid Services architecture. The current release of caGrid is version 1.2 (caGrid 1.2) and can be downloaded from the caGrid Knowledge Center. caGrid is built on open source software and is publicly and freely available under a liberal open source license for use by software development and research teams.

caGrid builds on Grid Services technologies and on several Grid systems, including the Globus Toolkit (http://www.globus.org) and Mobius, as well as tools developed by the NCI such as the caCORE infrastructure. Because open standards are a primary principle of caBIG™, caGrid is built as a service-oriented architecture upon the Grid Services standards, and more specifically upon the Web Services Resource Framework (WSRF) standards. Each data and analytical resource in caGrid is implemented as a Grid Service, which interacts with other resources and clients using Grid Service protocols. caGrid services are standard WSRF v1.2 services and can be accessed by any specification-compliant client. The caGrid infrastructure also includes coordination services, a runtime environment to support the deployment, execution, and invocation of data and analytical services, and tools for easier development of services, management of security, and composition of services into workflows. The coordination services support common Grid-wide operations required by clients and other services: metadata management, advertisement and discovery, federated query, workflow management, and security. The coordination services can be replicated and distributed to achieve better performance and scalability to large numbers of clients. Users can access these services via web portals or application-specific client programs.

caGrid follows model-driven architecture best practices. Client and service APIs in caGrid represent an object-oriented view of data and analytical resources. These APIs operate on registered data models, expressed in UML as object classes and relationships between the classes. Each caGrid service can describe itself using service metadata. When a service is deployed, its service metadata is registered with an indexing registry service, called the Index Service, which is provided by the Globus Toolkit and used in the caGrid infrastructure. The Index Service can be thought of as the repository of information about all advertised and available services in the environment. A researcher can discover services of interest by looking them up in this registry. caGrid provides support for rich service metadata. The infrastructure implements a series of high-level APIs for performing searches on service metadata, facilitating discovery of resources based on data models and the semantic information associated with them. caGrid also provides a comprehensive set of services for security. These services enable Grid-wide management of user credentials, support for grouping of users into virtual organizations for role-based access control, and management of the trust fabric in the Grid.
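
The advertise-then-discover pattern described above can be illustrated with a minimal in-memory sketch. This is not the Globus Index Service or the caGrid discovery API; `ServiceMetadata`, `IndexRegistry`, the metadata fields, and the example URLs are all hypothetical and chosen only to show how a registry keyed on service metadata lets a client find services by type or by data model class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ServiceMetadata:
    """Hypothetical self-description a service registers when deployed."""
    url: str
    kind: str                                   # "data" or "analytical"
    model_classes: Set[str] = field(default_factory=set)


class IndexRegistry:
    """Toy stand-in for an indexing registry of advertised services."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceMetadata] = {}

    def register(self, metadata: ServiceMetadata) -> None:
        # Re-registering the same URL replaces the earlier metadata.
        self._services[metadata.url] = metadata

    def discover(self, kind: Optional[str] = None,
                 model_class: Optional[str] = None) -> List[str]:
        """Return URLs of services matching every criterion that is given."""
        matches = []
        for meta in self._services.values():
            if kind is not None and meta.kind != kind:
                continue
            if model_class is not None and model_class not in meta.model_classes:
                continue
            matches.append(meta.url)
        return sorted(matches)


registry = IndexRegistry()
registry.register(ServiceMetadata("https://a.example.org/ecg", "data", {"EcgRecord"}))
registry.register(ServiceMetadata("https://b.example.org/snp", "data", {"SnpCall"}))
registry.register(ServiceMetadata("https://c.example.org/qt", "analytical", {"EcgRecord"}))

ecg_services = registry.discover(model_class="EcgRecord")
```

In this sketch a client never hard-codes service locations; it asks the registry which services expose the data model it needs.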

Service Development and Deployment in caGrid: Introduce Toolkit. Introduce is the service development and deployment toolkit provided as part of the core caGrid infrastructure. The toolkit provides a graphical development environment for developers to create service interfaces, secure services with authentication and authorization extensions, and deploy implemented services. It hides the details of low level processes, libraries, and compilation steps from the developer, allowing them to focus on the implementation of the application logic in their services. It should be noted that while Introduce assists in generating service interfaces and deploying services, it is the responsibility of the developer to implement the necessary application code for these interfaces. Introduce is extensible in that application specific common service patterns (e.g., a particular data service style) can be incorporated into the toolkit as plug-in extensions.

Security Support in caGrid: GAARDS Tools and Services. The Grid Authentication and Authorization with Reliably Distributed Services (GAARDS) infrastructure is the security infrastructure of caGrid. GAARDS provides services and tools for the administration and enforcement of security policy in an enterprise Grid. The CVRG infrastructure extensively leverages GAARDS to support security requirements in the CVRG environment. GAARDS is developed on top of the Globus Toolkit and extends its GSI component to provide enterprise services and administrative tools for 1) Grid user management, 2) identity federation, 3) trust management, 4) group/VO management, 5) access control policy management and enforcement, and 6) integration between existing security domains and the Grid security domain. GAARDS services can be used individually or in concert to meet authentication and authorization needs. These services include:

  • Dorian: A Grid service for the provisioning and management of Grid user accounts. Dorian provides an integration point between external security domains and the Grid, allowing accounts managed in external domains to be federated and managed in the Grid. It allows users to use their existing credentials (which may be external to the Grid) to authenticate to the Grid.
  • Grid Trust Service (GTS): A Grid-wide mechanism for maintaining and provisioning a federated trust fabric consisting of trusted certificate authorities, allowing Grid services to make authentication decisions against the most recent information.
  • Grid Grouper: A group-based authorization solution for the Grid. Grid services and applications enforce authorization policy based on membership in Grid-level groups.
  • Authentication Service: A framework for issuing SAML assertions for existing credential providers so that they may easily integrate with Dorian and other Grid credential providers. The Authentication Service also provides a uniform authentication interface upon which applications can be built.
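
Group-based authorization of the kind described above can be sketched as follows. The classes below are hypothetical illustrations written for this page, not the Grid Grouper or GAARDS APIs, and the group name and user identities are invented. The sketch only shows the idea that a service checks a caller's membership in a Grid-level group before answering a request.

```python
from typing import Dict, Set


class GroupDirectory:
    """Toy stand-in for a Grid-level group membership service."""

    def __init__(self) -> None:
        self._members: Dict[str, Set[str]] = {}

    def add_member(self, group: str, identity: str) -> None:
        self._members.setdefault(group, set()).add(identity)

    def is_member(self, group: str, identity: str) -> bool:
        return identity in self._members.get(group, set())


class ProtectedDataService:
    """Hypothetical data service that enforces a group-based policy."""

    def __init__(self, directory: GroupDirectory, required_group: str) -> None:
        self._directory = directory
        self._required_group = required_group

    def query(self, caller_identity: str, subject_id: str) -> str:
        # Authorization decision: the caller must belong to the required group.
        if not self._directory.is_member(self._required_group, caller_identity):
            raise PermissionError(
                f"{caller_identity} is not a member of {self._required_group}"
            )
        return f"ECG data for {subject_id}"


directory = GroupDirectory()
directory.add_member("ecg-collaborators", "alice")
service = ProtectedDataService(directory, "ecg-collaborators")
alice_result = service.query("alice", "s1")
```

Because the policy refers to a group rather than to individual users, a resource provider in this sketch can grant or revoke access by changing group membership without touching the service.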

CVRG Extensions and Additional Support

XML Data Services Extensions. We have developed plug-in extensions to Introduce to support the development of data services backed by XML databases. This requirement stems from the fact that HL7aECG datasets are stored in XML documents and results from ECG analyses are managed in XML documents. The current implementation of the XML Data Services extension enables the creation of a data service from an XML schema. The generated data service allows XML documents conforming to the schema to be stored in the backend XML database system and queried from Grid-enabled clients. Presently, the backend XML database system is built on Oracle Berkeley DB XML, which is an open-source, fast XML database system.
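
The store-and-query behavior of such a data service can be approximated with a short sketch based on Python's standard `xml.etree.ElementTree` module. Everything here is an assumption made for illustration: the `XmlDataService` class, the element names, and the simplified structural check that stands in for real XML Schema validation (the standard library does not validate against a schema). It does not use the Berkeley DB XML API or the Introduce extension.

```python
import xml.etree.ElementTree as ET
from typing import List


class XmlDataService:
    """Toy XML document store with a structural check and path queries."""

    def __init__(self, root_tag: str, required_children: List[str]) -> None:
        self._root_tag = root_tag
        self._required_children = required_children
        self._documents: List[ET.Element] = []

    def store(self, xml_text: str) -> None:
        root = ET.fromstring(xml_text)
        # Simplified stand-in for schema validation: check the root element
        # name and the presence of required direct children.
        if root.tag != self._root_tag:
            raise ValueError(f"expected root <{self._root_tag}>, got <{root.tag}>")
        missing = [c for c in self._required_children if root.find(c) is None]
        if missing:
            raise ValueError(f"missing required elements: {missing}")
        self._documents.append(root)

    def query(self, path: str) -> List[str]:
        """Return the text of every element matching an ElementTree path."""
        hits = []
        for doc in self._documents:
            for element in doc.findall(path):
                hits.append((element.text or "").strip())
        return hits


ecg_store = XmlDataService("ecg", ["subject", "lead"])
ecg_store.store('<ecg><subject>s1</subject><lead name="II">0.1 0.2</lead></ecg>')
ecg_store.store('<ecg><subject>s2</subject><lead name="V1">0.3 0.4</lead></ecg>')
lead_ii = ecg_store.query(".//lead[@name='II']")
```

ElementTree supports only a subset of XPath, so richer queries against a real XML database would need that database's own query language.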

Support for Biomedical Image Datasets. We have developed software components to support management, sharing, and integration of image datasets in the CVRG environment. The development of this support is funded in part by the CVRG project as well as other initiatives, including the caBIG program. The CVRG image management support extensively leverages the In-vivo Imaging (IVI) middleware infrastructure, which is implemented on top of the caGrid infrastructure. The IVI middleware provides 1) DICOM-Grid interoperability to facilitate exchange of image data between existing DICOM PACS and Grid entities, 2) support for efficient transport of large amounts of binary image data, and 3) a Service Development Kit to facilitate development and deployment of image-related Grid services. The CVRG image data management support utilizes the functionality implemented in the IVI middleware and the VirtualPACS application, which has been developed using the IVI middleware. VirtualPACS allows a radiologist to use a DICOM workstation to access PACS systems exposed in a Grid environment.

CVRG Domain Analytical and Data Services

Application developers and researchers can implement and deploy services that manage and securely expose their databases and analysis applications. These services are referred to as Domain Services. We have developed a number of domain services that can be used by research groups and developers of client programs. These services are listed below:

  • HL7aECG Data Service: This service is used to manage ECG datasets stored in HL7aECG format. It provides secure Grid enabled access to these datasets.
  • OpenClinica Data Service: This service is used to manage de-identified clinical information. It uses OpenClinica database as the underlying database to store the clinical information.
  • QTVIni Data Service: This service is used to manage the "ini" files used by the Berger QTVi analysis. The analysis takes HL7aECG files (which are managed in the HL7aECG Data Service along with additional metadata) and "ini" files, which describe manual Q and T point identification. In an analysis workflow, a CVRG client can query an HL7aECG Data Service instance for the ECG data of interest, query an instance of the QTViIni service for an "ini" file, and pass both to an instance of the qtviAnalysisService, which implements a Grid service interface to the Berger QTVi analysis.
  • SNP Data Service: It provides Grid-enabled access to SNP data. The SNP Data Service uses a MySQL 5.0.22 database and caCORE 3.2.1.
  • WFDB Data Service: It is used to manage ECG datasets stored in WFDB format. It provides secure Grid enabled access to these datasets.
  • AutoQRS Data Service: It is used to manage datasets analyzed by Auto QRS Algorithms. It provides secure Grid enabled access to these datasets.
  • DICOM Data Service: This service provides Grid-enabled access to image data stored in DICOM format.
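
The QTVi analysis workflow described in the list above (query the ECG data service, query the "ini" service, pass both results to the analysis service) can be sketched as three cooperating objects. The classes and the `run_qtvi_workflow` function are hypothetical in-memory stand-ins, not the CVRG service clients, and the `analyze` method returns placeholder values, not a QT variability index.

```python
from typing import Dict


class InMemoryDataService:
    """Toy data service that returns a stored document for a subject."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self._documents = dict(documents)

    def query(self, subject_id: str) -> str:
        # Raises KeyError when the subject has no document in this service.
        return self._documents[subject_id]


class PlaceholderAnalysisService:
    """Stand-in for an analytical service; performs no real QTVi analysis."""

    def analyze(self, ecg_document: str, ini_document: str) -> Dict[str, int]:
        return {
            "ecg_characters": len(ecg_document),
            "ini_characters": len(ini_document),
        }


def run_qtvi_workflow(subject_id: str,
                      ecg_service: InMemoryDataService,
                      ini_service: InMemoryDataService,
                      analysis_service: PlaceholderAnalysisService) -> Dict[str, int]:
    """Retrieve both inputs, then hand them to the analysis service."""
    ecg_document = ecg_service.query(subject_id)
    ini_document = ini_service.query(subject_id)
    return analysis_service.analyze(ecg_document, ini_document)


ecg_service = InMemoryDataService({"s1": "<AnnotatedECG/>"})
ini_service = InMemoryDataService({"s1": "q=10\nt=42"})
result = run_qtvi_workflow("s1", ecg_service, ini_service, PlaceholderAnalysisService())
```

The client in this sketch only coordinates calls; each service keeps ownership of its own data format and logic.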

CVRG Portal

Users can access CVRG domain services through the CVRG portal. The portal provides a high-level, graphical entry point to the CVRG environment. A user can use the portal to locate relevant services and applications. Because the portal is web-based, it can be reached from anywhere and provides secure access to services and applications in the CVRG environment.

Domain-specific applications are integrated into the portal environment as portlets. The CVRG portal has been developed using the Liferay portal infrastructure. We have implemented a Web Single Sign On (WebSSO) mechanism in the portal environment. This allows a user to log on to the portal once, use different applications (through portlets), and access different CVRG Grid services based on his/her credentials without having to log on to each individual application or service. The workflow for logging on and using CVRG services through the portal is illustrated in Figures 5 and 6.

CVRG Arch Fig 5

Figure 5. The basic portal workflow.

CVRG Arch Fig 6

Figure 6. Logging on to the environment through the portal and WebSSO.
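
The single sign-on idea behind the portal (log on once, then reuse the established identity across portlets) can be sketched as follows. This is a simplified illustration under stated assumptions, not the WebSSO implementation used by the CVRG portal or any Liferay API: `SignOnService` and `Portlet` are hypothetical, and the plain-text password table exists only to keep the example short.

```python
import secrets
from typing import Dict


class SignOnService:
    """Toy sign-on service that issues one session token per log-in."""

    def __init__(self, passwords: Dict[str, str]) -> None:
        # Plain-text passwords are for illustration only.
        self._passwords = dict(passwords)
        self._sessions: Dict[str, str] = {}

    def log_in(self, user: str, password: str) -> str:
        if self._passwords.get(user) != password:
            raise PermissionError("invalid credentials")
        token = secrets.token_hex(16)
        self._sessions[token] = user
        return token

    def identity_for(self, token: str) -> str:
        if token not in self._sessions:
            raise PermissionError("unknown or expired session")
        return self._sessions[token]


class Portlet:
    """Hypothetical portlet that trusts the sign-on service for identity."""

    def __init__(self, name: str, sign_on: SignOnService) -> None:
        self._name = name
        self._sign_on = sign_on

    def open(self, token: str) -> str:
        # No password prompt here: the session token carries the identity.
        user = self._sign_on.identity_for(token)
        return f"{self._name} opened for {user}"


sign_on = SignOnService({"alice": "correct horse"})
token = sign_on.log_in("alice", "correct horse")
ecg_view = Portlet("ECG browser", sign_on).open(token)
image_view = Portlet("Image viewer", sign_on).open(token)
```

Both portlets in the sketch accept the same token, so the user authenticates once and each application obtains the identity from the shared sign-on service.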

Putting It All Together

Using these services and the caGrid infrastructure, the example setup in Figure 1 and Figure 2 can be implemented as illustrated in Figure 7. The data sources and analysis programs are wrapped as caGrid data and analytical services. Using the caGrid infrastructure services and tools, the client can log on to the environment and discover the services that are available in it. The client can then query these services or compose workflows for data analysis.

CVRG Arch Fig 7

Figure 7. The CVRG environment built on the caGrid infrastructure and CVRG domain services.
