Shivani Sharma — September 1, 2021
Advanced Data Engineering Database NoSQL Project SQL

This article was published as a part of the Data Science Blogathon

Transcriptome sequencing (RNA-seq) has become a routine method for studying model organisms as well as crops. Bioinformatic processing of such experiments produces large volumes of heterogeneous data, comprising the nucleotide sequences of transcripts, amino acid sequences, and their structural and functional annotation. It is important to make these data available to a wide range of researchers in the form of databases (DB).

In this article, we consider a hybrid approach to building molecular genetic databases that contain information about transcript sequences and their structural and functional annotations. This approach was used to build a database of transcriptomes of agricultural plants. The article discusses the features of implementing this approach and gives examples of both simple and complex SQL queries to such a database.

Introduction

The study of plant transcriptomes using high-throughput sequencing is widely used to solve problems such as assessing gene expression for various genotypes and under different environmental conditions, identifying RNA sequences (for non-model organisms), and identifying markers for functionally important genes. The raw output of such an experiment is a set of short fragments of nucleotide sequences. It is the results of their bioinformatic processing that are of interest to a biologist and can be interpreted in terms of gene functions, gene products, expression levels, genetic variations, etc.

It should be noted that a transcriptome experiment yields a large amount of data (tens and hundreds of thousands of sequences), and it is important to provide free and convenient access to it for a wide range of biologists far removed from routine bioinformatic processing of sequencing results. This purpose is served by databases that offer a convenient user interface and organize links between biological sequences and their functional annotation.

Gene expression databases exist for particular species: TodoFirGene for the fir Abies sachalinensis; a gene expression atlas for the rose; EuroPineDB, a database of annotated transcriptomes of the maritime pine; etc. The results of processing RNA-seq experiments presented in such databases are complex and include transcript sequences, their localization in the genome, classification by gene type (mRNA, lncRNA, miRNA, tRNA, etc.), functional annotation of transcripts, estimates of expression levels, and transcript isoform variants. These results are stored as binary and text files in various formats: sequence files (FASTA, FASTQ), alignments (BAM, SAM, PSL, etc.), or markup files (BED, GFF, GTF).

The results of differential gene expression analysis are usually presented as tables (TSV, XLS formats). Such well-structured data are conveniently described by the classic relational model of an RDBMS (Relational Database Management System).

Note, however, that along with well-structured data, analysis of RNA-seq experiments generates loosely structured and unstructured data that cannot be described using a relational model. Difficulties can arise from the heterogeneity of data produced by bioinformatics computational pipelines. This heterogeneity is due to the variety of methods involved in the pipelines that process transcriptome data.

The nodes of such pipelines are programs that implement data processing methods. In practice, the pipeline is often changed while work is in progress: for example, some nodes are replaced with new ones that implement more accurate or more efficient methods. It is therefore impossible to fully declare in advance the structure of the data that will result from processing. Such semi-structured data are more conveniently described using NoSQL ("not only SQL") technologies.

In this work, we propose an integrated approach to describing transcriptome data: RDBMS elements are used for well-structured data, and NoSQL is used for poorly structured data obtained from large-scale bioinformatic analysis of transcriptome experiments in five agricultural plants (maize, rice, barley, tomato, and potato). The analysis of these data was aimed at identifying new transcripts that either do not align to the reference plant genome or align to its unannotated regions and thus represent a new, previously unexplored part of the transcriptome.

Using the problem of mass analysis of agricultural plant transcriptomes as an example, we propose sets of relational relations to describe the main entities: study, experiment, and nucleotide and protein sequences. At the same time, for each of these entities we provide the ability to attach semi-structured annotation data sets whose presentation format may not be known in advance.

Based on the proposed hybrid approach, the OORT (Out Of Reference Transcripts) database has been developed. It allows users, via search queries, to extract information about the structure and functions of a previously unannotated part of the transcriptome of agricultural plants: in particular, to identify new genes for resistance to diseases and abiotic stress, long non-coding RNAs, and mRNA sequences, and to obtain estimates of the expression level of these transcripts. The database is based on the analysis of 1241 transcriptome experiments and contains information on 20,440,228 nucleotide and 4,055,996 annotated amino acid sequences. The functionality of the OORT database is demonstrated below with several queries.

Database content

The content of the OORT database includes:

  • meta-information about libraries of transcriptome experiments;

  • nucleotide sequences of transcripts obtained as a result of de novo assembly;

  • amino acid sequences obtained as a result of translation of nucleotide sequences of transcripts encoding proteins;

  • annotations of nucleotide sequences (prediction of the coding potential, alignment to the reference genome, assessment of the expression level);

  • amino acid sequence annotations (prediction of functional domains, sequence-associated gene ontology terms).

When working with the database, the user needs to solve a number of search problems related to identifying sequences by experiment meta-information, source organism, homology, functional characteristics, and transcript expression level. Such a search can be performed for both nucleotide and amino acid sequences, and it must take into account the relationship between an amino acid sequence and the nucleotide sequence of the transcript from which it was obtained.

Database formation, indexing

When building the OORT database, we used PostgreSQL version 12 as the DBMS. It was used to build a relational data schema linking studies, experiments, contigs, and proteins (see the section “Database structure”). The DBMS makes it possible to define relational links using primary and foreign keys.

Using the Python programming language, the data from the files with the results of bioinformatic processing (see the section “Database content”) were transformed into dictionary structures, which were then serialized to JSON. The SQLAlchemy library translates Python code into SQL commands and passes them to the DBMS for execution, returning query results as Python objects. The JSON data were then stored as JSONB structures for further storage and access.
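A minimal sketch of this dictionary-to-JSON step (the loading code itself is an assumption; the field names follow the Kallisto example shown later in the article):

```python
import json

# Hypothetical bioinformatics output for one transcript, collected into a
# Python dictionary (keys follow the Kallisto example shown later).
record = {
    "kallisto": {
        "g": {"tpm": 8.44, "eff_length": 341.61, "est_counts": 35.0},
        "i": {"tpm": 8.43596, "eff_length": 341.614, "est_counts": 35.0},
    }
}

# Serialize the dictionary to a JSON string; this is the text that would be
# stored in a JSONB column of the database.
payload = json.dumps(record)

# Round-trip check: parsing the JSON yields the original structure.
assert json.loads(payload)["kallisto"]["g"]["tpm"] == 8.44
```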

The database schema was designed to support searching both typed (well-structured) and semi-structured fields. Typed fields describe entities whose types are known and uniquely defined: for example, a sequence is a text string, the length of a sequence is a field of type int, and the name of an organism is a text field. To search typed data (int, real, and text), indexing was carried out using the B-tree structure, which made it possible to perform, among other things, search operations involving the relations “equal”, “greater than”, and “less than”. The B-tree structure optimizes the execution of queries with these operations.

In addition to typed, well-structured data, the database content includes loosely structured information. Such data include annotations of amino acid sequences in terms of gene ontology (GO) and protein domains; the results of aligning contigs to reference genomes; the results of estimating transcript expression levels; etc. This kind of data is presented as text strings in JSON format: key-value records without predetermined restrictions. The PostgreSQL 12 DBMS can store information in JSON format within a relational database as fields of the special type JSONB: the JSON data is kept in binary form for quick access without re-parsing, and the keys and values in such a field are not described in SQL terms. Indexes can be created over JSONB field keys to find data quickly, and some indexes also enable additional search operations on values, for example, finding a desired object in an array. The JSON fields were indexed with GIN indexes, which allow fast searching for text strings in a specified array of terms, for example, gene ontology terms and protein domains.

To index the coordinates of contigs in the genome, we used a materialized view, a technique implemented in the PostgreSQL DBMS that saves the result of a query as a separate database table [41]. After the query has been executed, the materialized view can be referred to as a separate table in the relational database. This significantly speeds up the execution of queries (especially complex ones), since a repeated query reads only this table rather than the underlying fields of the database.
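SQLite has no materialized views, but the idea can be illustrated in Python with the built-in sqlite3 module: the result of a coordinate query is saved once as a plain table and indexed, so repeated lookups read only that table. The table and column names here are invented for illustration:

```python
import sqlite3

# Toy schema and data standing in for contig coordinates in the genome.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contig (id INTEGER PRIMARY KEY, chrom TEXT, start INTEGER, stop INTEGER);
INSERT INTO contig VALUES (1, '2', 2500, 4800), (2, '3', 100, 900);
-- "materialize" the coordinate lookup once as a separate table
CREATE TABLE contig_coords AS
    SELECT id, chrom, start, stop FROM contig;
CREATE INDEX coords_idx ON contig_coords (chrom, start, stop);
""")

# Repeated queries now hit only the materialized table and its index.
rows = conn.execute(
    "SELECT id FROM contig_coords WHERE chrom='2' AND start>=2000 AND stop<=5000"
).fetchall()
print(rows)  # [(1,)]
```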

Access to information in the database

Data in the OORT database is retrieved and modified using the SQL query language. A typical search query includes the following sections of this language:

  • SELECT: the field-selection section. List the names of the columns to output, separated by commas, or use * to output all fields.

  • FROM: the table section. Specify the tables from which the results are to be retrieved; the JOIN operator can also be used here to join tables.

  • WHERE: the condition section. The user specifies search conditions here and combines them with the logical operators AND, OR, and NOT.

It should be noted that PostgreSQL version 12 extends the SQL:2016 dialect with additional operators for retrieving data from JSON fields. These include the operator ‘->’, which retrieves a nested JSON value by key, and the operator ‘->>’, which retrieves a value as text by key. The use of these operators is demonstrated below in the section “Examples of implementation of search queries to the database”. The SQL interface to the OORT database can be used through applications such as DataGrip, pgAdmin, or PostgreSQL’s psql console.

Database structure

The relational database schema describes the metadata of a transcriptome experiment and the results of de novo transcript assembly in the form of the following objects: study, experiment (exp), nucleotide sequence (contig), and amino acid sequence (pep) (Fig. 1). As mentioned in the section “Database formation, indexing”, links between the database tables were created using primary and foreign keys, as follows:

  • each database table stores an id field, which is its primary key;

  • the exp table contains a foreign key study_id that refers to the study table;

  • the contig table contains a foreign key exp_id that refers to the exp table;

  • the pep table contains foreign keys contig_id and exp_id, which refer to the contig and exp tables, respectively.

As a result, relational links were created between these tables, making it possible to search efficiently for information about a study, an experiment, and the sequences obtained from it. The structure of the relationships between tables in the OORT database is shown in Figure 1.
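The four tables and their key relationships can be sketched with Python's built-in sqlite3 module (the column sets here are simplified assumptions; only the primary/foreign-key links follow the description above):

```python
import sqlite3

# Illustrative schema: columns are guesses, but the primary/foreign keys
# mirror the OORT table relationships (study <- exp <- contig <- pep).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE study  (id INTEGER PRIMARY KEY, scientific_name TEXT, abstract TEXT);
CREATE TABLE exp    (id INTEGER PRIMARY KEY,
                     study_id INTEGER REFERENCES study(id), ann TEXT);
CREATE TABLE contig (id INTEGER PRIMARY KEY,
                     exp_id INTEGER REFERENCES exp(id), seq TEXT, ann TEXT);
CREATE TABLE pep    (id INTEGER PRIMARY KEY,
                     contig_id INTEGER REFERENCES contig(id),
                     exp_id INTEGER REFERENCES exp(id),
                     md5 TEXT, ann TEXT);
""")

# Confirm that all four tables were created.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['contig', 'exp', 'pep', 'study']
```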

Fig. 1. OORT relational database structure. Database tables and relationships between them are shown.

The amount of data in the OORT database is presented in the table below. The total amount of data is 45.5 gigabytes.

Table name                   Number of records   Data size
Description of research      2766                3.0 MB
Description of experiments   1241                7.4 MB
Nucleotide sequences         20440228            39000.0 MB
Amino acid sequences         4055996             6495.0 MB

 

An example of describing semi-structured data

Let’s take a closer look at how weakly structured data is described in the OORT database. In the ‘exp’, ‘pep’, and ‘contig’ tables listed above, such data is stored in fields of type jsonb that contain the results of sequence annotation. Consider, as an example, the description of the expression-level estimate of the TRINITY_DN3631_c0_g1_i1 transcript produced by the Kallisto program, stored in JSON format in the ‘ann’ field of the ‘contig’ table of the OORT database:

"kallisto": {
  "g": {
    "tpm": 8.44,
    "eff_length": 341.61,
    "est_counts": 35.0
  },
  "i": {
    "tpm": 8.43596,
    "eff_length": 341.614,
    "est_counts": 35.0
  }
}

Shown above is part of the content of the ‘ann’ field of the ‘contig’ table for the TRINITY_DN3631_c0_g1_i1 transcript, describing the expression-level estimate obtained with the Kallisto program. This information is identified in the ‘ann’ field by the “kallisto” key (the first line of the record). Kallisto estimates the expression level both for the gene to which the transcript belongs (the part of the record under the “g” key) and for the transcript itself, which represents one isoform of this gene (the part under the “i” key). For each such estimate, Kallisto outputs three parameters: “tpm”, the expression level in transcripts per million (TPM); “eff_length”, the effective sequence length; and “est_counts”, the estimated number of reads aligned to the given transcript. All of these parameters appear as the values of the corresponding keys in the JSON record.
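The way the -> and ->> operators walk such a structure can be mimicked in Python (the helper functions are illustrative, not part of any library):

```python
import json

# Sample 'ann' content from the article's Kallisto example.
ann = json.loads("""
{"kallisto": {"g": {"tpm": 8.44, "eff_length": 341.61, "est_counts": 35.0},
              "i": {"tpm": 8.43596, "eff_length": 341.614, "est_counts": 35.0}}}
""")

def jsonb_get(obj, key):
    """Rough analogue of PostgreSQL's '->': return the nested JSON value."""
    return obj.get(key) if isinstance(obj, dict) else None

def jsonb_get_text(obj, key):
    """Rough analogue of '->>': return the value rendered as text."""
    value = jsonb_get(obj, key)
    if value is None:
        return None
    return json.dumps(value) if isinstance(value, (dict, list)) else str(value)

# ann -> 'kallisto' -> 'g' ->> 'tpm'
gene = jsonb_get(jsonb_get(ann, "kallisto"), "g")
print(jsonb_get_text(gene, "tpm"))  # 8.44
```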

Amino acid sequence annotation is described in a similar manner. The description of the results of protein domain annotation by the InterProScan program contains the following sections:

  • tsv – a two-dimensional array of InterProScan program output;

  • go_ann – an array of strings with GO protein annotations;

  • pathways_ann – an array of strings listing the signaling and metabolic pathways in which the protein is involved.

The description of the results of open reading frame prediction and its parameters by the TransDecoder program contains the following sections:

  • type – a string indicating the type of the predicted reading frame (complete, from start to stop codon, or a fragment);

  • chain – a string indicating the strand from which translation occurs (“+” or “-”);

  • score – a number characterizing the reliability of the identified open reading frame in the nucleotide sequence;

  • cds_end – a number: the coordinate of the end of the open reading frame in the nucleotide sequence;

  • cds_start – a number: the coordinate of the beginning of the open reading frame.
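A hypothetical record with these TransDecoder sections, together with the kind of consistency checks a loader might apply, could look like this (all values are invented for illustration):

```python
# Illustrative record only: the values are made up, but the keys follow the
# TransDecoder annotation sections described above.
transdecoder_ann = {
    "type": "complete",   # full ORF, from start to stop codon
    "chain": "+",         # translated from the forward strand
    "score": 57.3,        # reliability of the ORF call
    "cds_start": 120,     # ORF start coordinate in the transcript
    "cds_end": 1263,      # ORF end coordinate in the transcript
}

# Basic consistency checks a loader might apply before storing the record.
assert transdecoder_ann["chain"] in ("+", "-")
assert transdecoder_ann["cds_start"] < transdecoder_ann["cds_end"]
orf_length = transdecoder_ann["cds_end"] - transdecoder_ann["cds_start"] + 1
print(orf_length)  # 1144
```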

Examples of implementation of search queries to the database

Let us give examples of several queries to the OORT database in the SQL dialect of the PostgreSQL 12 DBMS. The first query retrieves all transcript sequences obtained from the rice (Oryza sativa) libraries. It uses only standard SQL statements for accessing database fields:

Query 1: Get the transcript sequences for the species ‘Oryza sativa’

select contig.seq
from contig join exp on exp.id = contig.exp_id
where exp.scientific_name = 'Oryza sativa'

The second query retrieves a list of identifiers of all genes whose expression level lies in the range from 3 to 7 TPM. This query accesses the JSON field ‘ann’ and extracts values for the “kallisto” key using the -> and ->> operators.

 

Query 2: Find all genes with expression levels ranging from 3.0 to 7.0 TPM

select contig.id
from contig
where ((ann -> 'kallisto' -> 'g' ->> 'tpm')::real)
between 3.0 and 7.0

The third query searches for genes whose amino acid sequences are identical (i.e., have the same md5 hash code) while their expression levels differ between experiments.

 

Query 3: List genes encoding identical peptides with different levels of expression in different experiments

select subseq.exp_id, subseq.study_id,
       max(subseq.contig_g_tpm) as contig_g_tpm,
       study.scientific_name, tissue_type
from (
    select (contig.ann -> 'kallisto' -> 'g' ->> 'tpm')::real
               as contig_g_tpm,
           contig.id as contig_id, pep.id as pep_id,
           contig.exp_id, e.study_id, pep.md5,
           e.ann -> 'ebi' ->> 'tissue_type' as tissue_type
    from pep join contig on pep.contig_id = contig.id
             join exp e on contig.exp_id = e.id
    where pep.md5 = '857c1634328c5333ab73efcc11a45038'
) subseq join study on subseq.study_id = study.id
group by exp_id, study_id, study.scientific_name, tissue_type

Query 4 finds amino acid sequences whose annotations contain certain gene ontology terms predicted by the InterProScan program. Since the annotation information is not typed, before executing this query a GIN index must be built on the ‘ann’ field of the ‘pep’ table for the values of the “go_ann” key. This makes it possible to search for records in the indexed array.

Query 4: Find all amino acid sequences with a certain set of gene ontology terms

-- creating an index to execute the query
create index ix_pep_ann_interpro_go_ann on pep using
gin (((ann -> 'interpro'::text) -> 'go_ann'::text))

-- search for proteins by gene ontology annotation using the created index

select * from pep
where (ann -> 'interpro' -> 'go_ann')
?& array['GO:0055085', 'GO:0006811']
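The ?& operator tests whether a JSON array contains every string in the given array. A minimal Python sketch of the same containment test (the GO identifiers are taken from Query 4; the helper name is illustrative):

```python
import json

# A 'go_ann' array as it might appear inside the 'ann' JSONB field
# (the third GO term is invented for illustration).
go_ann = json.loads('["GO:0055085", "GO:0006811", "GO:0016020"]')

def contains_all(json_array, required):
    """Rough analogue of PostgreSQL's ?& operator: True if the JSON
    array contains every string in `required`."""
    return set(required).issubset(json_array)

print(contains_all(go_ann, ["GO:0055085", "GO:0006811"]))  # True
print(contains_all(go_ann, ["GO:0055085", "GO:0099999"]))  # False
```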

Query 5 searches for all studies in the OORT database whose description contains the term ‘Categorizing’. To implement it, a GIN index is first created on the vector of words produced by the to_tsvector function from the ‘abstract’ field, where the brief descriptions of the studies are stored. The query is then matched against these vectors.

 

Query 5: Search for a word in study descriptions

-- creating an index for full-text search
create index study_abstract_idx on study using
gin (to_tsvector('english'::regconfig, abstract));

-- performing full-text search
select *
from study
where to_tsvector('english'::regconfig, study.abstract)
@@ plainto_tsquery('english', 'Categorizing')
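The effect of this full-text query can be approximated in Python (a toy corpus and a deliberately simplified to_tsvector analogue, with no stemming or stop-word removal):

```python
import re

# Toy corpus standing in for the 'abstract' field of the study table
# (the texts are invented for illustration).
abstracts = {
    1: "Categorizing drought stress responses in barley transcriptomes.",
    2: "Expression atlas of tomato fruit development.",
}

def to_tsvector(text):
    """Very rough analogue of to_tsvector: a lowercase word set
    (real to_tsvector also stems words and drops stop words)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Find studies whose abstract contains the word 'Categorizing'.
matches = [sid for sid, ab in abstracts.items()
           if "categorizing" in to_tsvector(ab)]
print(matches)  # [1]
```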

Query 6 searches for proteins synthesized from transcripts that the gmap program aligned to a genome region with specified coordinates. Since the “gmap” key has the structure of a two-dimensional array, these arrays are first normalized in a materialized view so that each row of the table holds one row of the original array. Three positions of the resulting table are then indexed, and the search for the desired sequences is performed.

-- implementation of the materialized view contig_gmap_view_sqlalchemy

create materialized view
contig_gmap_view_sqlalchemy as
select db.id, jsonb_array_elements(
    db.ann -> 'gmap'::text) as gmap
from (
    select contig.id, contig.ann
    from contig order by contig.id
) db;
-- creating an index on top of the gmap columns
create index gmap_start_end_chr_idx
on contig_gmap_view_sqlalchemy
(((gmap ->> 15)::integer), ((gmap ->> 16)::integer),
(gmap ->> 13));

-- search for proteins synthesized from positions 2000 to 5000
-- of chromosome 2 in the genome

select pep.id as pep_id
from exp join contig on exp.id = contig.exp_id
join contig_gmap_view_sqlalchemy
    on contig.id = contig_gmap_view_sqlalchemy.id
join pep on pep.contig_id = contig.id
where
cast(contig_gmap_view_sqlalchemy.gmap ->> 15
    as integer) >= 2000
and cast(contig_gmap_view_sqlalchemy.gmap ->> 16
    as integer) <= 5000
and (contig_gmap_view_sqlalchemy.gmap ->> 13) = '2'
and exp.scientific_name like 'Zea mays'

These examples show that quite complex queries can be composed against the database, combining information about a sequence, its localization in the genome, and its annotations. Queries can also compare various characteristics of the sequences described in the database.

The database we created, based on a hybrid method of describing complex molecular biological data as both well-structured (typed) data and loosely structured information, has several advantages. First of all, this organization of data storage simplifies maintenance: when the set of annotation programs, their versions, or their output parameters change, the developer can compose loosely coupled JSON structures instead of modifying the relational schema of the database and reloading the data.

The proposed database architecture preserves the connectivity of tables through primary and foreign keys. Note that the structure of the JSON description of untyped data is not strictly defined and may differ between records even within one table; ultimately it depends on what data was entered into a particular field. Maintaining the uniformity of JSON data under such an organization is the developer’s responsibility: the structure of JSON fields is controlled either at data export time and/or by handler functions that run on every update and insertion of data.
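Such a handler function might be sketched as follows (the required key set is an assumption for illustration, based on the Kallisto example above):

```python
# Sketch of a handler that enforces a minimal structure on the 'kallisto'
# part of an annotation before it is written to the database. The required
# key set is an assumption for illustration.
REQUIRED = {"tpm", "eff_length", "est_counts"}

def validate_kallisto(ann):
    """Return True if every 'kallisto' section carries the expected keys."""
    kallisto = ann.get("kallisto", {})
    return all(REQUIRED <= set(section) for section in kallisto.values())

good = {"kallisto": {"g": {"tpm": 1.0, "eff_length": 2.0, "est_counts": 3.0}}}
bad = {"kallisto": {"g": {"tpm": 1.0}}}
print(validate_kallisto(good), validate_kallisto(bad))  # True False
```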

Conclusions

This article proposes a hybrid approach to creating databases of molecular genetic data that contain information about both transcript sequences and their structural and functional annotation. The essence of the approach lies in the simultaneous storage of structured information and weakly structured data in the same database. This scheme has several advantages for data maintenance: instead of modifying the relational schema of the database and reloading the data, the developer can compose loosely coupled structures in JSON format and index fields within these structures.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
