Hackathon Project Title: "An LLM-based Retrieval Augmented Generation Pipeline for Multi-omics Rice and Wheat Data"

Project Summary / Project Theme
◎ LLM-based Retrieval Augmented Generation for Multi-omics Data Resource of Rice and Wheat
Rice and wheat are important food crops worldwide. The data resources for rice and wheat are rich and diverse, covering the genome, transcriptome, proteome, and metabolome, which are collectively known as multi-omics data. Known data resources for characterizing rice genes include Rice Gene Index, Oryzabase, Rice-Alterome, and Rice Trait Ontology. Those for wheat include WheatOmics, Triticeae-Gene Tribe, and Wheat Trait Ontology. The Ensembl Plants database covers the genetic information of various crops, including both rice and wheat (Tab. 1).
A dilemma arises when researchers search for comprehensive information about a biological object, because multi-omics data are usually scattered across different databases and lack a unified format. For example, a researcher interested in a drought-resistant phenotype and its associated genes, DNA sequences, protein sequences, transcripts, etc. has to move between different databases to obtain the data for each omics layer. Moreover, data from diverse resources always need to be aligned to a common format before they can be combined.
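As a minimal sketch of what format alignment could look like, the snippet below maps records from two sources onto one shared schema. The source names (`db_one`, `db_two`), the field names, and the gene identifier `GENE_A` are all hypothetical placeholders and do not reflect the actual schema of any database listed in Table 1.

```python
# Hypothetical per-source field mappings onto one shared schema.
# None of these names come from a real database; they only illustrate the idea.
FIELD_MAPS = {
    "db_one": {"gene_id": "gene", "chr": "chromosome", "desc": "description"},
    "db_two": {"GeneID": "gene", "Chromosome": "chromosome", "Annotation": "description"},
}


def normalize(source, raw):
    """Rename the fields of one raw record to the shared schema.

    Fields without a mapping are dropped; values are stored as stripped strings.
    The originating source is kept so that provenance is never lost.
    """
    mapping = FIELD_MAPS[source]
    record = {"source": source}
    for key, value in raw.items():
        if key in mapping:
            record[mapping[key]] = str(value).strip()
    return record


# Two made-up records that describe the same gene with different field names.
rec_one = normalize("db_one", {"gene_id": "GENE_A", "chr": 1, "desc": " putative kinase "})
rec_two = normalize("db_two", {"GeneID": "GENE_A", "Chromosome": "1", "Annotation": "uncharacterized protein"})
```

After normalization both records share the keys `gene`, `chromosome`, and `description`, so they can be indexed and compared side by side.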
To avoid the complexity of shifting from one database to another, Q&A platforms based on Large Language Models (LLMs) can integrate the existing resources and provide fast, easy-to-understand responses to users' questions using the Retrieval Augmented Generation (RAG) strategy. RAG gives an LLM the ability to retrieve information from data sources and use it as the basis for generating responses.
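The retrieve-then-generate idea can be illustrated with a small, self-contained sketch. It is not the project's pipeline: retrieval here is a simple bag-of-words cosine similarity standing in for whatever index is eventually chosen, the corpus consists of made-up placeholder snippets, and the call to an actual LLM is deliberately left out, so the sketch stops at assembling the prompt.

```python
import math
import re
from collections import Counter

# Toy corpus of placeholder snippets; these are NOT real database records.
DOCS = [
    {"source": "db_traits", "text": "GENE_A is annotated with a drought tolerance trait term"},
    {"source": "db_sequences", "text": "GENE_A has a coding sequence and a protein sequence record"},
    {"source": "db_expression", "text": "GENE_B transcript level rises under cold stress"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs, k=2):
    """Return up to k documents with non-zero similarity to the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(d["text"]))), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_prompt(question, hits):
    """Assemble the text that would be passed to an LLM, citing each source."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


question = "Which genes relate to drought tolerance?"
hits = retrieve(question, DOCS)
prompt = build_prompt(question, hits)
```

Keeping the source label next to each retrieved snippet lets the generated answer point back to the database it came from, which matters when several resources describe the same gene.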
Table 1. Resources to include for data integration.
◎ Standardized Annotation Integration and Consolidation of Multi-omics Data
Supported by BLAH9, we are calling for the construction of “An LLM-based Retrieval Augmented Generation Pipeline for Multi-omics Rice and Wheat Data”. The project steps and issue points are as follows:
- How to semantically parse multi-omics data through accurate and reliable indexing, thereby facilitating resource linking and personalized Q&A generation.
- How to smoothly handle database disagreements during resource linking when using the RAG strategy.
- How to employ RAG and LLMs to build a reliable pipeline offering user-friendly, knowledge-grounded query services.
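One possible way to approach the database-disagreement issue in the list above is to keep every reported value together with its provenance and flag the conflict, rather than silently picking a winner. The sketch below assumes records have already been brought into a common shape; the source names, gene identifier, and annotation values are invented for illustration only.

```python
from collections import defaultdict

# Made-up records in a shared shape; the values are illustrative, not real annotations.
RECORDS = [
    {"source": "db_one", "gene": "GENE_A", "field": "chromosome", "value": "1"},
    {"source": "db_two", "gene": "GENE_A", "field": "chromosome", "value": "1"},
    {"source": "db_one", "gene": "GENE_A", "field": "description", "value": "putative kinase"},
    {"source": "db_two", "gene": "GENE_A", "field": "description", "value": "uncharacterized protein"},
]


def link_records(records):
    """Group values by (gene, field) and flag fields where sources disagree.

    Each distinct value keeps the sorted list of sources reporting it, so a
    downstream prompt can present the disagreement instead of hiding it.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for rec in records:
        grouped[(rec["gene"], rec["field"])][rec["value"]].append(rec["source"])

    linked = {}
    for key, values in grouped.items():
        linked[key] = {
            "values": {value: sorted(sources) for value, sources in values.items()},
            "conflict": len(values) > 1,
        }
    return linked


linked = link_records(RECORDS)
conflicts = [key for key, entry in linked.items() if entry["conflict"]]
```

Flagged entries could then be passed to the LLM with an explicit note that the sources differ, leaving the choice of a resolution policy (for example a source priority list) as a separate design decision.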