Summary of Commonly Used Databases for Metabolomics Study

The rapid development of metabolomics, in particular advances in analytical techniques, larger sample sizes, more diverse sample types, and the combined use of multiple detection platforms, has dramatically increased the volume and complexity of metabolome data. Metabolomics databases are important for organizing these large datasets, increasing data reuse, supporting deeper analysis, and revealing the biological mechanisms behind the data.

Currently, the databases used in metabolomics research can be roughly divided into two types: the original database, which stores raw experimental data, and the metabolite library, which stores information about metabolites and metabolic pathways. The metabolite library appeared first and is relatively mature. Early metabolite libraries mainly stored basic information about individual metabolites, including metabolite descriptions, chemical formulas, molecular weights, chemical classifications, chemical properties, metabolic pathways, and mass spectra. A user can compare the measured properties of an unknown substance with the entries in the library to identify the substance and look up the metabolic pathways it takes part in. The Human Metabolome Database (HMDB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), Metabolite Link (Metlin), the Golm Metabolome Database (GMD), and the Small Molecule Pathway Database (SMPDB) are all libraries of this kind. Among them, the Human Metabolome Database is relatively mature and widely used.
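The library lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, not how HMDB or any other library actually implements its search: the three-entry `LIBRARY`, the `match_mz` function, the choice of the [M+H]+ adduct, and the 10 ppm tolerance are all assumptions made for the example. The monoisotopic masses are standard values rounded to five decimals.

```python
from dataclasses import dataclass

PROTON_MASS = 1.007276  # mass of a proton in Da, used for the [M+H]+ adduct


@dataclass(frozen=True)
class Metabolite:
    name: str
    formula: str
    monoisotopic_mass: float  # neutral monoisotopic mass in Da


# Tiny hand-made library (a hypothetical stand-in for a real metabolite library)
LIBRARY = [
    Metabolite("D-Glucose", "C6H12O6", 180.06339),
    Metabolite("Citric acid", "C6H8O7", 192.02700),
    Metabolite("L-Alanine", "C3H7NO2", 89.04768),
]


def match_mz(observed_mz, library=LIBRARY, tolerance_ppm=10.0):
    """Return (metabolite, ppm_error) pairs whose [M+H]+ m/z lies within the tolerance."""
    hits = []
    for met in library:
        theoretical_mz = met.monoisotopic_mass + PROTON_MASS
        ppm_error = (observed_mz - theoretical_mz) / theoretical_mz * 1e6
        if abs(ppm_error) <= tolerance_ppm:
            hits.append((met, ppm_error))
    # Best match (smallest absolute mass error) first
    return sorted(hits, key=lambda hit: abs(hit[1]))


# An observed positive-mode peak; the value is made up for the example
candidates = match_mz(181.0707)
for met, err in candidates:
    print(f"{met.name} ({met.formula}), mass error {err:.2f} ppm")
```

In practice a mass match alone is rarely enough for identification. Retention index and fragmentation spectra, which the libraries also store, are normally used to confirm a candidate.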

Since 2010, with the development of precision medicine and bioinformatics, original databases have begun to emerge, advocated and actively promoted by several international organizations. Establishing, improving, standardizing, and promoting such databases is difficult, because it depends on how research practices and technologies develop around the world. The successful construction of genomic databases offers a useful precedent and reference for this kind of database. The emergence and standardization of original databases will give researchers more opportunities to exchange data and collaborate, and is an effective way to improve data utilization and depth of analysis. This will greatly promote the advancement of metabolomics technology, and will also lay the data foundation for integrated multi-omics analysis and for cross-disciplinary studies between omics and other fields.

Therefore, although building and improving such databases is difficult, it is an inevitable trend in the development of omics. Since 2010, a number of institutions in Europe and the United States have established original databases and formed professional teams dedicated to maintaining them and promoting their use. Currently, the four representative databases are the Metabolomics Workbench of the US NIH, MetaboLights of the European Bioinformatics Institute, the Metabolic Phenotype Database (MetaPhen, part of MetabolomeExpress), and the Metabolomic Repository Bordeaux (MeRy-B). The first two are widely used and accept data from a variety of instrument platforms and species. Metabolomics Workbench also allows exploratory statistical analysis of publicly available data. MetaboLights is more focused on data management, and its standards for data submission are more stringent. MetaPhen and MeRy-B are smaller and focus on plant metabolomics. MeRy-B is dominated by 1H-NMR data, while MetaPhen focuses on GC-MS data.

At present, the recognized standards for original database construction are MSI (Metabolomics Standards Initiative, European Bioinformatics Institute, http://msi-workgroups.sourceforge.net/) and COSMOS (Coordination of Standards in Metabolomics, European Union, http://cosmosfp7.eu). The databases above basically conform to these two standards. Some organizations have also published their own standards, but these are highly consistent with MSI and COSMOS. Under MSI and COSMOS, an authorized resource provider must submit the raw data in a specified format (such as ISA-Tab) together with the following information: basic information about the submitter, the experimental design, the study subjects and their treatment, sample collection and storage conditions, sample preparation, instrument platform and analysis conditions, clinical information about the samples, and metabolite information. The metabolite information includes a basic description, external database identifiers, chemical formula, simplified molecular-input line-entry system (SMILES) string, IUPAC International Chemical Identifier (InChI), and peak or intensity values, as well as the information used to identify each metabolite, such as m/z, retention index, and fragmentation data. If the submitted resource has already been published, the full text of the article is also required. Only resources that meet these requirements are added to the database.
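To make the checklist above concrete, here is a minimal sketch of a pre-submission completeness check. It is not the MSI or COSMOS specification and does not use the real ISA-Tab field names: the section keys in `REQUIRED_SECTIONS`, the `missing_sections` helper, and the example record are all hypothetical and simply mirror the categories listed in the paragraph above.

```python
# Hypothetical section keys mirroring the categories listed above;
# these are NOT the real ISA-Tab column or file names.
REQUIRED_SECTIONS = (
    "submitter",
    "experimental_design",
    "study_subject_and_treatment",
    "sample_collection_and_storage",
    "sample_preparation",
    "instrument_and_analysis_conditions",
    "clinical_information",
    "metabolite_information",
)


def missing_sections(record, published=False):
    """Return the required sections that are absent or empty in a submission record."""
    required = list(REQUIRED_SECTIONS)
    if published:
        # Published resources must also supply the full text of the article
        required.append("article_full_text")
    return [key for key in required if not record.get(key)]


# Made-up, incomplete submission used only to exercise the check
draft = {
    "submitter": "A. Researcher, Example Institute",
    "experimental_design": "case vs. control, n = 20 per group",
    "study_subject_and_treatment": "human plasma, untreated",
    "sample_collection_and_storage": "collected fasting, stored at -80 C",
    "sample_preparation": "methanol protein precipitation",
    "instrument_and_analysis_conditions": "LC-MS, positive mode",
    "clinical_information": "",  # present but empty, so it counts as missing
}

print(missing_sections(draft))
print(missing_sections(draft, published=True))
```

A real repository validates far more than presence, for example file formats and controlled vocabulary terms, so a check like this is only a first filter before submission.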

At present, the major metabolite libraries are relatively mature in application, and their contribution to the development of metabolomics is widely recognized. The original databases are powerful, but they are still at an early stage of construction, and few large-scale applications have been reported. The good news is that some researchers have integrated multiple original databases, or multiple resources within a single database, to improve the utilization of data resources. In 2015, Leiden University in the Netherlands, the European Bioinformatics Institute, and the Leibniz Institute of Phytochemistry in Germany jointly established a cross-database raw data retrieval platform, MetabolomeXchange (http://metabolomexchange.org/site/), which provides another quick way to integrate and extend database resources.
