What's Linked Life Data?

Linked Life Data (LLD) is a data-as-a-service platform that provides access to 25 public biomedical databases through a single access point. With the service you can write complex analytical queries that answer bioinformatics questions such as 'give me all human genes located on the Y chromosome with known molecular interactions', simply navigate through the information, or export subsets such as 'all approved drugs and their brand names'.
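
As a rough illustration of what querying a single SPARQL access point looks like, the sketch below builds a standard SPARQL Protocol GET request. The endpoint URL is an assumption made for this example, not taken from the text above, and the query deliberately uses only the standard rdfs:label property rather than any LLD-specific predicates.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URL, for illustration only; check the service documentation.
ENDPOINT = "http://linkedlifedata.com/sparql"

# A generic query that relies only on standard vocabulary (rdfs:label).
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource ?label
WHERE { ?resource rdfs:label ?label }
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

The request can then be sent with urllib.request.urlopen(request, timeout=30). A client-side timeout of this kind is a sensible companion to any server-side limit on query execution time.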

The service offers two different access levels:

  • LLD Public - completely free anonymous access for developing proof-of-concept applications with no hosting and data setup costs.
  • LLD Enterprise - premium service access for mature applications, which guarantees extra features.

Feature                                   | LLD Public                                   | LLD Enterprise
SPARQL Endpoint                           | Limited to queries executed in 30s           | Unlimited and exclusive server access
Secure HTTPS access                       | No                                           | Yes
Write access                              | No                                           | Yes
Data sources                              | Copyright-limited and copyright-free sources | Sources with commercial license
Updates                                   | Annual                                       | Monthly
Subscription fee                          | No                                           | Yes
Professional services to improve the data | No                                           | Yes
Support                                   | None                                         | Commercial support regulated by service level agreement

We provide enterprise support of the linked data cloud. To contact us, please send an email to life-sciences@ontotext.com.

Linked Data

LLD uses a distributed graph data model to represent complex heterogeneous information. Think of it as linked data: a method of publishing structured data so that it can be easily interlinked. The picture below uses different colours (i.e. different physical locations) to show how the information is represented and identified by global identifiers (URIs). To create linked data, follow these few simple steps:

  • Use URIs to identify all resources (conceptual notions)
  • Make sure that all identifiers can be resolved ("dereferenced") by people and computers
  • Provide adequate support for all public W3C standards such as RDF and SPARQL.
  • Include links to other related things (using their URIs), when publishing data
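
To make the idea of interlinking through URIs concrete, here is a minimal in-memory sketch. All URIs under example.org and the two tiny datasets are hypothetical and stand in for data published at different physical locations. Only the rdfs:label URI is a real, standard identifier.

```python
# Hypothetical identifiers; each dataset could be hosted on a different server.
GENE = "http://example.org/genes/G1"
PROTEIN = "http://example.org/proteins/P1"
ENCODES = "http://example.org/vocab/encodes"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Triples are (subject, predicate, object) tuples.
gene_dataset = {(GENE, LABEL, "Example gene"), (GENE, ENCODES, PROTEIN)}
protein_dataset = {(PROTEIN, LABEL, "Example protein")}


def describe(subject, *datasets):
    """Return all (predicate, object) pairs for a subject across datasets."""
    return sorted((p, o) for ds in datasets for (s, p, o) in ds if s == subject)


# Following the link from the gene to the protein crosses the dataset boundary,
# which works only because both datasets use the same global identifier.
linked = [o for (p, o) in describe(GENE, gene_dataset, protein_dataset) if p == ENCODES]
protein_facts = describe(linked[0], gene_dataset, protein_dataset)
```

Because the protein is identified by the same URI in both datasets, no extra mapping step is needed to join them.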

[Figure: Linked Data]

GraphDB database

GraphDB is a semantic repository - a software component for storing and manipulating huge quantities of RDF and linked data. This is the database instance used to power every LLD server node. More specifically, every node operates GraphDB-SE. GraphDB-SE is suitable for handling massive volumes of data and very intensive querying activities. It is designed as an enterprise-grade database management system. This has been made possible through:

  • File-based indices, which enable it to scale to billions of statements even on desktop machines
  • Special-purpose index and query optimization techniques, ensuring fast query evaluation against very large volumes of data
  • Optimized handling of owl:sameAs (identifier equality) to boost efficiency for data integration tasks
  • Efficient retraction of explicit statements and their inferences, which allows efficient delete operations
  • A range of powerful advanced features, including full-text search (node search, RDF search), ranking, selection and notifications
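
The role of owl:sameAs in data integration can be illustrated with a small union-find structure that groups equivalent identifiers into one class and picks a single representative. This is a generic sketch of the idea of identifier equality, not a description of how GraphDB implements its optimization. The class name and the example URIs are hypothetical.

```python
class SameAsIndex:
    """Groups URIs declared equal via owl:sameAs into equivalence classes."""

    def __init__(self):
        self.parent = {}

    def find(self, uri):
        """Return the representative URI of the class that contains uri."""
        self.parent.setdefault(uri, uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[uri] != root:
            self.parent[uri], uri = root, self.parent[uri]
        return root

    def add_same_as(self, a, b):
        """Record the statement: a owl:sameAs b."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the lexicographically smallest URI as the representative.
            self.parent[max(ra, rb)] = min(ra, rb)


index = SameAsIndex()
index.add_same_as("http://example.org/a/gene1", "http://example.org/b/gene1")
index.add_same_as("http://example.org/b/gene1", "http://example.org/c/gene1")
```

After these two statements, all three identifiers resolve to the same representative, so a query about any one of them can be answered against a single merged resource.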

GraphDB allows all 10 billion RDF statements to be loaded on a single machine, and guarantees very fast query response times. The database also supports federated queries and links to URIs hosted by external systems.
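
In standard SPARQL 1.1, a federated query delegates part of its graph pattern to a remote endpoint with the SERVICE keyword. The helper below is a minimal sketch that assembles such a query as text. The function name and the remote endpoint URL are hypothetical, and the patterns use only standard rdfs and owl terms.

```python
def federated_query(remote_endpoint: str, local_pattern: str, remote_pattern: str) -> str:
    """Combine a local pattern with a pattern evaluated at a remote endpoint."""
    return (
        "SELECT * WHERE {\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        "}"
    )


query = federated_query(
    "http://example.org/sparql",  # hypothetical remote endpoint
    "?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .",
    "?s <http://www.w3.org/2002/07/owl#sameAs> ?other .",
)
print(query)
```

The local pattern is evaluated by the server that receives the query, while the block inside SERVICE is sent to the remote endpoint and the results are joined on the shared variable ?s.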

Data Conventions

In the process of integrating the 25 databases, which are part of the public service, the following linked data generation conventions were used:

  • Preserve the original RDF structure, if distributed by the owner
  • Use resolvable URIs for the data sources with no RDF distribution
  • Construct the generated URIs in the form of lld:resource/db/type/id
  • Identify the graph names with lld:resource/db
  • Name all generated predicate URIs lld:resource/db/predicate
  • Generate stable new URIs based on unique labels, which describe the resources (see dataset provenance and updates)
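
The URI conventions above can be sketched as a few small helper functions. The expansion of the lld: prefix, the function names, and the example database, type and identifier values are assumptions made for illustration only.

```python
from urllib.parse import quote

# Assumed expansion of the lld: prefix; not confirmed by the text above.
LLD = "http://linkedlifedata.com/"


def _segment(value: str) -> str:
    """Percent-encode a single path segment so it cannot break the URI."""
    return quote(value, safe="")


def resource_uri(db: str, type_: str, id_: str) -> str:
    """Resource URI of the form lld:resource/db/type/id."""
    return f"{LLD}resource/{_segment(db)}/{_segment(type_)}/{_segment(id_)}"


def graph_uri(db: str) -> str:
    """Named graph URI of the form lld:resource/db."""
    return f"{LLD}resource/{_segment(db)}"


def predicate_uri(db: str, predicate: str) -> str:
    """Predicate URI of the form lld:resource/db/predicate."""
    return f"{LLD}resource/{_segment(db)}/{_segment(predicate)}"


# Hypothetical example values.
example = resource_uri("drugbank", "drug", "DB00001")
```

Keeping the database name as the first path segment means the graph URI is always a prefix of every resource and predicate URI generated for that database.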

Once all data is represented in RDF, many additional connections between the resources must be made before it becomes truly “linked” data. In the figure below, the blue lines and the blue caption text (used either as part of the URI or as literals) mark the criteria for linking the information. The mapping rules are applied only to the specified subsets of information. Another form of connection is Semantic Annotation, which uses NLP analysis to generate links between entities and unstructured textual fragments.

[Figure: Linked Data Alignment Rules]

Current Databases

The service covers the full path of data: gene, protein, molecular interaction, pathway, target, drug, disease and clinical trial related information. These are the primary entities of a knowledge base composed of structured databases (NCBI Gene, UniProt, DrugBank, BioPAX and many more), terminologies (UMLS, OBO), and semi-structured documents (PubMed, ClinicalTrials.gov). The public service integrates only free databases, while others, used in the Enterprise edition, have further license restrictions. To check the complete list, visit the current list of processed databases.

Acknowledgement

This work was partially funded by Linked Life Data and the EU IST projects KHRESMOI (FP7-257528) and LarKC (FP7-215535).