Du, Zhenjiao
2024-11-05
2024-11-05
2024
https://hdl.handle.net/2097/44662
The increasing global demand for sustainable and natural health-promoting food ingredients has led to a growing interest in bioactive peptides derived from food proteins. These peptides exhibit various beneficial biological effects, such as antioxidant and antihypertensive activities. Traditional methods for producing, characterizing, and identifying bioactive peptides are often labor-intensive, time-consuming, and resource-intensive. The goal of this study was to employ advanced computational techniques for prediction model development to accelerate bioactive peptide discovery and extend applications into broader biological domains. Specific objectives were to: 1) develop traditional chemometric methods for antioxidant peptide discovery: investigate predictive quantitative structure-activity relationship (QSAR) model development for antioxidant dipeptides and tripeptides, experimentally validate model performance, design an in silico hydrolysis simulation tool, and develop a workflow for applying these predictive models in peptide discovery from food proteins; 2) pre-train protein language models (pLMs) for bioactive peptide discovery: build a universal architecture for peptide prediction model development and evaluate it across different peptide datasets; develop state-of-the-art (SOTA) models for allergenic proteins and peptides and investigate the effects of pLM size on prediction model performance; build a SOTA prediction model for high-activity angiotensin-I converting enzyme (ACE) inhibitory peptides, explore the superiority of pLMs over traditional peptide representation methods, and assess the compatibility of different machine learning classifiers with pLMs; 3) fuse knowledge between language models for performance enhancement: develop multimodal prediction models for enzyme-substrate pair prediction using protein and chemistry language models to enhance performance through knowledge
fusion; 4) improve the accessibility and user-friendliness of prediction models: deploy the developed prediction models to user-friendly web servers to make them and their code easily accessible for scientific discovery. Two machine-learning-based QSAR models were developed with SOTA performance for predicting the antioxidant activity of tripeptides and dipeptides, respectively. Notably, the tripeptide model achieved an R² of 0.847 on the test dataset. The predicted antioxidant activities of peptides were confirmed through experimental validation. Subsequent feature analysis revealed that C-terminal residues in tripeptides and N-terminal residues in dipeptides contributed more significantly to antioxidant activity. Finally, a hydrolysis simulation tool (R-PeptideCutter) was developed to connect food proteins with predictive models for antioxidant peptide discovery from sorghum proteins, leading to the identification of a high-antioxidant-activity dipeptide, YR. Utilizing pLMs, our universal architecture UniDL4BioPep outperformed existing SOTA models in 15 out of 20 peptide datasets, with accuracy improvements ranging from 0.7% to 7%, and demonstrated great potential for other peptide prediction tasks with custom datasets. To enhance usability, a well-annotated template was designed for non-programming users, and an advanced version was introduced for extremely imbalanced datasets. The pLM-based models exhibited SOTA performance in predicting high-activity ACE inhibitory peptides with an accuracy of 88.3% and allergenic proteins/peptides with an accuracy of 95.1%, offering solutions to health concerns like hypertension and allergies. The pLMs demonstrated superiority over traditional peptide descriptors in peptide representation and showed strong compatibility with support vector machine, multilayer perceptron, and logistic regression classifiers on the ACE inhibitory peptide dataset.
A positive correlation between pLM size and prediction model performance was observed in the allergenic protein and peptide dataset. All developed prediction models were deployed on user-friendly web servers for scientific exploration. Beyond the application of pLMs in peptide discovery, a multimodal model, FusionESP, was presented for enzyme-substrate interaction prediction. By employing a novel projection head based on a contrastive learning strategy for knowledge fusion, FusionESP achieved SOTA performance with an accuracy of 94.7% by integrating protein and chemistry language models, while requiring fewer computational resources and data points. Overall, this work demonstrates the transformative potential of integrating advanced computational techniques, such as machine learning and deep learning, into bioactive peptide research and broader biological domains. The developed models and tools can significantly accelerate bioactive peptide discovery and provide valuable resources for scientific exploration. By making these tools accessible and user-friendly, this work contributes to sustainable agriculture and improved health outcomes and fosters innovation across food science, biochemistry, and computational biology.
en-US
Bioactive peptides
Antioxidant peptides
Quantitative structure-activity relationship
Protein language models
Enzyme-substrate interaction
Functional foods
Machine learning empowered discovery of bioactive peptides from food proteins and beyond
Dissertation
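
The in silico hydrolysis step mentioned in the abstract (cutting a food protein into candidate peptides before scoring them with a QSAR model) can be illustrated with a minimal sketch. It assumes the commonly cited simplified trypsin rule (cleave on the C-terminal side of K or R unless the next residue is P); this is not necessarily the rule set that R-PeptideCutter implements. The function names `trypsin_digest` and `short_peptides` and the toy sequence are hypothetical and do not come from the dissertation.

```python
import re
from typing import List


def trypsin_digest(sequence: str) -> List[str]:
    """Split a protein sequence with a simplified trypsin rule (assumption).

    Cleaves after K or R unless the following residue is P.
    """
    # Zero-width split point: preceded by K/R and not followed by P.
    fragments = re.split(r"(?<=[KR])(?!P)", sequence)
    # A cleavage site at the very end yields an empty string; drop it.
    return [frag for frag in fragments if frag]


def short_peptides(fragments: List[str], min_len: int = 2, max_len: int = 3) -> List[str]:
    """Keep only di- and tripeptides, the lengths covered by the QSAR models."""
    return [frag for frag in fragments if min_len <= len(frag) <= max_len]


if __name__ == "__main__":
    toy_sequence = "MKWRAGRPLK"  # made-up sequence, not a sorghum protein
    fragments = trypsin_digest(toy_sequence)
    print(fragments)
    print(short_peptides(fragments))
```

In a workflow like the one described, the short fragments would then be passed to a trained predictor; that step is omitted here because the abstract does not specify the model inputs.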