GeneTegra: Semantic Integration of Biomedical Information

*E. Patrick Shironoshita, Yves R. Jean Mary, Patricia Buendia, Ray M. Bradley, Mansur R. Kabuk
INFOTECH Soft, Inc.
Demo-Interactive Presentation – Business Track
Saturday, Sept 29, 2012: 4:05 PM – 4:18 PM – Demo Pavilion

*Presenting Speaker

GeneTegra is a novel information integration solution to explore and query diverse data sources from a single graphical user interface. It leverages Semantic Web standards to resolve the semantic and syntactic diversity of the large and increasingly complex body of public and private biomedical research data.

Data sources are modeled in GeneTegra using OWL, the Web Ontology Language, to provide a uniform syntactic representation, capture the semantics of the data, and facilitate the integration and querying of multiple sources. The system automatically generates ontology models by extracting the metadata contained in relational database, XML, or RDF schemas, or by detecting the organization of structured text files. Classes depict conceptual entities such as tables, object properties depict the relationships between these entities, and datatype properties depict the values associated with them. GeneTegra also provides facilities to create or use ontology models that are decoupled from the underlying data schemas, linking data fields from multiple tables into single conceptual entities or separating a single table into multiple entities. In this manner, an existing ontology can be used as the model for an RDF or relational data source. Integrated models are created by merging classes and properties from the models of multiple underlying sources, matching and joining those classes that represent the same semantic concepts with the aid of the ASMOV ontology alignment algorithm.
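The mapping from relational metadata to an ontology model can be illustrated with a minimal sketch. This is not GeneTegra's implementation: it assumes a SQLite database as the source, and the function name `schema_to_owl`, the `http://example.org/model#` namespace, the `table_column` property naming, and the two-table schema are all hypothetical. Tables become classes, foreign-key columns become object properties, and the remaining columns become datatype properties, emitted as Turtle text.

```python
import sqlite3


def schema_to_owl(conn, base="http://example.org/model#"):
    """Return Turtle lines for an OWL model derived from SQLite metadata.

    Hypothetical helper for illustration only; names and namespace are assumptions.
    """
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    ]
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:
        # Each table is modeled as a class.
        lines.append(f":{table} a owl:Class .")
        # foreign_key_list rows: id, seq, table, from, to, on_update, on_delete, match
        fk_targets = {fk[3]: fk[2] for fk in
                      conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()}
        # table_info rows: cid, name, type, notnull, dflt_value, pk
        for col in conn.execute(f"PRAGMA table_info({table})").fetchall():
            name = col[1]
            if name in fk_targets:
                # A foreign key links two entities: object property.
                lines.append(
                    f":{table}_{name} a owl:ObjectProperty ; "
                    f"rdfs:domain :{table} ; rdfs:range :{fk_targets[name]} .")
            else:
                # A plain column holds a value: datatype property.
                lines.append(
                    f":{table}_{name} a owl:DatatypeProperty ; "
                    f"rdfs:domain :{table} .")
    return lines


# Hypothetical two-table schema used only to exercise the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE transcript (
    id INTEGER PRIMARY KEY,
    gene_id INTEGER REFERENCES gene(id),
    length INTEGER
);
""")
model = schema_to_owl(conn)
print("\n".join(model))
```

A decoupled model, as described above, would replace this one-to-one mapping with a hand-written or existing ontology plus explicit links back to the tables and columns.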

Query building is done through a graphical user interface, where drag-and-drop procedures are used to select classes, object properties, and datatype properties, and to establish filtering and joining conditions between data. Underneath the interface, the system constructs a query using the SPARQL Query Language for RDF. If the query is defined against an integrated model, GeneTegra distributes it into multiple sub-queries against the underlying sources, using the semQA query algebra to rearrange the query while ensuring that the intended result will be obtained. The queries against the sources are transformed using different mechanisms depending on the type of source: for relational databases, the emerging R2RML standard is used for conversion, and for RDF sources the SPARQL query is sent directly. Mechanisms for querying XML and structured text sources directly are under development; currently, these types of data are loaded into RDF or a relational database prior to querying. The results obtained from these different sources are converted to a variable binding table, aggregated according to the conditions defined in the query, and presented to the user in both tabular and graphical formats.
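As a rough illustration of the two ends of this pipeline, the sketch below builds a SPARQL SELECT query from a set of selections and then combines the variable binding tables returned by two sources. It is an assumption-laden simplification: `build_select` and `join_bindings` are hypothetical names, the IRIs and result rows are placeholders, and the plain natural join shown here stands in for, and is much simpler than, the semQA algebra and the aggregation step described above.

```python
def build_select(class_iri, properties, filters=()):
    """Build a SPARQL SELECT query for one class.

    properties maps a variable name to a property IRI; filters are raw
    SPARQL filter expressions. Hypothetical helper for illustration.
    """
    variables = " ".join(f"?{name}" for name in properties)
    patterns = [f"?s a <{class_iri}> ."]
    patterns += [f"?s <{iri}> ?{name} ." for name, iri in properties.items()]
    patterns += [f"FILTER ({expr})" for expr in filters]
    body = "\n  ".join(patterns)
    return f"SELECT {variables}\nWHERE {{\n  {body}\n}}"


def join_bindings(left, right):
    """Natural join of two variable binding tables (lists of dicts).

    Rows are combined when they agree on every variable they share.
    """
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[v] == r_row[v] for v in shared):
                joined.append({**l_row, **r_row})
    return joined


# Placeholder IRIs; a real model would supply its own.
query = build_select(
    "http://example.org/model#Gene",
    {"symbol": "http://example.org/model#symbol"},
    filters=['?symbol = "geneA"'],
)
print(query)

# Placeholder binding tables standing in for results from two sources.
source_a_rows = [
    {"gene": "geneA", "location": "locus1"},
    {"gene": "geneB", "location": "locus2"},
]
source_b_rows = [
    {"gene": "geneA", "annotation": "term1"},
    {"gene": "geneC", "annotation": "term2"},
]
result = join_bindings(source_a_rows, source_b_rows)
print(result)
```

Only the row for `geneA` survives the join, because it is the only value of the shared variable `gene` present in both tables.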

In this paper, we present a comprehensive description of the GeneTegra system and its modeling, integration, and query building and execution mechanisms. To show how the system works, we present two illustrative examples: one that retrieves information on a list of breast cancer genes from Ensembl, the Gene Ontology (GO), and the UCSC Genome Database, and one that searches these same sources for experimental data supporting expression in brain tissues for a list of genes identified from putative exonic splicing regulatory sequences (ESRs) in Drosophila.