Navigating Metagenomics Research: A High School Student's Journey at USC BUGS Jr. Summer Research Program 2023

Diya Sreedhar, a rising junior at Troy High School in Fullerton, CA, successfully completed a novel research project in the niche field of metagenomics as part of the USC BUGS 2023 Summer Research Program. She conducted her research at the Mangul Lab under the guidance of her mentor, Dr. Nitesh Sharma, a postdoctoral researcher. Diya’s research project, titled ‘Metagenomic Analysis and Development of Comprehensive Gene Sequences for Bacteria across Public Genomic Databases’, involved analyzing large, publicly available bacterial genomic datasets to understand the coverage, overlap, and quality of their gene sequence data. The study produced a consolidated view of all the datasets and a single, unified, high-quality dataset for future research. To the best of our knowledge, no prior research has ventured into this domain, making this study truly pioneering.

Metagenomics is the study of the structure and function of the entire set of nucleotide sequences recovered from a community of microorganisms, such as bacteria. It allows researchers to study the diversity of microbial communities and helps identify novel genes and metabolic pathways. The most popular bacterial genome datasets available for research and analysis are the Bacterial and Viral Bioinformatics Resource Center (BV-BRC), the National Center for Biotechnology Information Reference Sequence Database (NCBI RefSeq), and EnsemblGenomes. A key challenge in sourcing data from these public databases is that not every bacterial species is available in all of the datasets, and where a species does appear in more than one, its gene sequence data may differ in completeness and quality.

In this research study, Diya acquired all three genomic datasets and conducted computational analysis to identify the bacterial strains present in all of the databases. For these shared strains and sub-strains, she extracted the most complete nucleotide sequences. She was able to develop a unified, curated, and annotated amalgamation of the three databases, offering researchers a high-quality resource for investigating bacterial strains. Integrating these datasets can facilitate cross-species comparison, support drug discovery efforts, and enable the discovery of new therapeutic targets and biomarkers.
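The consolidation step described above can be pictured with a small sketch. It is an illustration only, not the project's actual code: the function names `unambiguous_length` and `unify` are hypothetical, the sequences are made-up toy strings, and "most complete" is assumed here to mean "most unambiguous A/C/G/T bases", a criterion the article does not specify.

```python
from __future__ import annotations


def unambiguous_length(seq: str) -> int:
    """Count A/C/G/T bases, ignoring N and other ambiguity codes."""
    return sum(1 for base in seq.upper() if base in "ACGT")


def unify(databases: dict[str, dict[str, str]]) -> dict[str, tuple[str, str]]:
    """Return {strain: (source_db, sequence)} for strains found in every database.

    Assumption: completeness is approximated by the number of unambiguous bases.
    """
    # Strains present in all databases = intersection of their key sets.
    shared = set.intersection(*(set(records) for records in databases.values()))
    unified = {}
    for strain in sorted(shared):
        # Pick the database whose copy has the most unambiguous bases.
        best_db = max(
            databases,
            key=lambda db: unambiguous_length(databases[db][strain]),
        )
        unified[strain] = (best_db, databases[best_db][strain])
    return unified


# Toy placeholder records, not real genome data.
databases = {
    "BV-BRC": {"Escherichia coli K-12": "ACGTNNACGT", "Bacillus subtilis 168": "ACGT"},
    "NCBI RefSeq": {"Escherichia coli K-12": "ACGTACGTAC", "Bacillus subtilis 168": "ACG"},
    "EnsemblGenomes": {"Escherichia coli K-12": "ACGTAC"},
}
unified = unify(databases)
```

In this toy input only the E. coli entry appears in all three sources, so it is the only strain kept. In practice other completeness measures, such as assembly level or annotation coverage, could replace the base count.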

Diya used her prior knowledge of programming in Python to download the datasets and read the sequencing data (FASTA format) using the Biopython package. Because of the large dataset size (over 1.5 terabytes) and file count (over 1 million files), she leveraged the USC supercomputing cluster environment to process her data in batches. Diya also learned how to use UNIX-based job submission and management, and developed several Bash scripts to automate the data processing. By using fuzzy matching and machine learning techniques such as clustering and classification, she was able to develop the unified genome dataset over a single summer.
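Fuzzy matching is useful here because the same organism can be labeled slightly differently in each database. The article does not say which matching method was used, so the sketch below substitutes Python's standard-library `difflib` as a stand-in. The helper names `normalize` and `match_names`, the 0.85 similarity cutoff, and the example labels are all assumptions made for illustration.

```python
import difflib
import re


def normalize(name: str) -> str:
    """Lowercase the name and replace runs of punctuation/whitespace with one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def match_names(query_names, reference_names, cutoff=0.85):
    """Map each query name to its closest reference name, or None if none is close enough."""
    # Index reference names by their normalized form (first occurrence wins).
    normalized_refs = {}
    for ref in reference_names:
        normalized_refs.setdefault(normalize(ref), ref)

    matches = {}
    for query in query_names:
        hits = difflib.get_close_matches(
            normalize(query), list(normalized_refs), n=1, cutoff=cutoff
        )
        matches[query] = normalized_refs[hits[0]] if hits else None
    return matches


# Hypothetical labels as they might appear in two different databases.
reference = ["Escherichia coli K-12", "Bacillus subtilis 168"]
queries = ["Escherichia coli str. K-12", "Vibrio cholerae O1"]
matches = match_names(queries, reference)
```

Here the first query is paired with "Escherichia coli K-12" despite the extra "str." token, while the second has no sufficiently similar reference name and maps to `None`. A stricter or looser cutoff trades missed matches against false pairings, so any real threshold would need checking against known identifiers.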

Her future work will focus on developing an automated mechanism to maintain the completeness and accuracy of these sequences, ensuring a sustained level of data quality over time. Careful consideration of data compatibility and mapping of gene identifiers will be crucial to ensuring the reliability of the resulting unified dataset.

Diya is a highly motivated and talented student who devoted her summer to learning about complex gene sequence analysis. With strong programming and data analysis skills, she deftly navigated large public databases, uncovering novel techniques to interpret genome information. Her efforts are helping us develop a comprehensive genome database for bacteria, which will be the first of its kind, supporting researchers in future studies.

Her foray into genomic data analysis provided the foundation to pursue a longer-term research project under the mentorship of Dr. Mangul. She is currently working on genomic research to better understand the genetic and molecular changes that may cause cancer.

Diya’s dedication, passion, and ingenuity have left a lasting impact, exemplifying the true potential of young scientists. Her passion for research will inspire and motivate others in their pursuit of scientific excellence.