Journal articles on the topic 'Protein language models'

Consult the top 50 journal articles for your research on the topic 'Protein language models.'

1. Tang, Lin. "Protein language models using convolutions." Nature Methods 21, no. 4 (2024): 550. http://dx.doi.org/10.1038/s41592-024-02252-3.

2. Lee, Jin Sub, Osama Abdin, and Philip M. Kim. "Language models for protein design." Current Opinion in Structural Biology 92 (June 2025): 103027. https://doi.org/10.1016/j.sbi.2025.103027.

3. Ali, Sarwan, Prakash Chourasia, and Murray Patterson. "When Protein Structure Embedding Meets Large Language Models." Genes 15, no. 1 (2023): 25. http://dx.doi.org/10.3390/genes15010025.

Abstract:
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature…

4. Ferruz, Noelia, and Birte Höcker. "Controllable protein design with language models." Nature Machine Intelligence 4, no. 6 (2022): 521–32. http://dx.doi.org/10.1038/s42256-022-00499-z.

5. Li, Xiang, Zhuoyu Wei, Yueran Hu, and Xiaolei Zhu. "GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models." International Journal of Biological Macromolecules 280 (November 2024): 135599. http://dx.doi.org/10.1016/j.ijbiomac.2024.135599.

6. Singh, Arunima. "Protein language models guide directed antibody evolution." Nature Methods 20, no. 6 (2023): 785. http://dx.doi.org/10.1038/s41592-023-01924-w.

7. Tran, Chau, Siddharth Khadkikar, and Aleksey Porollo. "Survey of Protein Sequence Embedding Models." International Journal of Molecular Sciences 24, no. 4 (2023): 3775. http://dx.doi.org/10.3390/ijms24043775.

Abstract:
Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to d…
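
A concrete illustration of what such embedding models produce: the sketch below mean-pools per-residue ESM-2 representations into one fixed-size vector per protein. It assumes the fair-esm package and its small esm2_t6_8M_UR50D checkpoint; this is a generic recipe, not the pipeline of any of the surveyed tools.

    import torch
    import esm

    # Load a small ESM-2 checkpoint (6 layers, 320-dim representations).
    model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
    labels, strs, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[6])
    reps = out["representations"][6]             # (batch, seq_len + 2, 320)
    # Drop BOS/EOS positions, then average over residues -> fixed-size embedding.
    embedding = reps[0, 1:len(strs[0]) + 1].mean(dim=0)
    print(embedding.shape)                       # torch.Size([320])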

8. Pokharel, Suresh, Pawel Pratyush, Hamid D. Ismail, Junfeng Ma, and Dukka B. KC. "Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction." International Journal of Molecular Sciences 24, no. 21 (2023): 16000. http://dx.doi.org/10.3390/ijms242116000.

Abstract:
O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite the great progress in experimentally mapping O-GlcNAc sites, there is an unmet need to develop robust prediction tools that can effectively locate the presence of O-GlcNAc sites in protein sequences of interest. In this work, we performed a…
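
The fusion step this abstract describes, combining embeddings from multiple PLMs at a candidate site, reduces to feature concatenation followed by a supervised classifier. A minimal sketch; the embed_with_plm_* functions are hypothetical stand-ins returning random arrays in place of real per-residue embeddings, and the toy labels are invented:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-ins for per-residue embeddings from two different PLMs.
    def embed_with_plm_a(seq):
        return np.random.default_rng(len(seq)).normal(size=(len(seq), 320))

    def embed_with_plm_b(seq):
        return np.random.default_rng(len(seq) + 1).normal(size=(len(seq), 1024))

    def site_features(seq, pos):
        # Fuse the models by concatenating their vectors at the candidate S/T site.
        return np.concatenate([embed_with_plm_a(seq)[pos], embed_with_plm_b(seq)[pos]])

    # Toy labelled sites: (sequence, 0-based S/T position, O-GlcNAcylated or not).
    sites = [("MKTSAYIA", 3, 1), ("GGSLLKTE", 2, 0), ("ATTLSGRR", 4, 1), ("PLTSSGEK", 3, 0)]
    X = np.stack([site_features(s, p) for s, p, _ in sites])
    y = np.array([label for _, _, label in sites])
    clf = LogisticRegression(max_iter=1000).fit(X, y)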

9. Hu, An, Linai Kuang, and Dinghai Yang. "LPBERT: A Protein–Protein Interaction Prediction Method Based on a Pre-Trained Language Model." Applied Sciences 15, no. 6 (2025): 3283. https://doi.org/10.3390/app15063283.

Abstract:
The prediction of protein–protein interactions is a key task in proteomics. Since protein sequences are easily available and understandable, they have become the primary data source for predicting protein–protein interactions. With the development of natural language processing technology, language models have become a research hotspot in recent years, and protein language models have also been developed accordingly. Compared with single-encoding methods, such as Word2Vec and one-hot, language models specifically designed for proteins are expected to extract more comprehensive information from…

10. Pang, Yihe, and Bin Liu. "IDP-LM: Prediction of protein intrinsic disorder and disorder functions based on language models." PLOS Computational Biology 19, no. 11 (2023): e1011657. http://dx.doi.org/10.1371/journal.pcbi.1011657.

Abstract:
Intrinsically disordered proteins (IDPs) and regions (IDRs) are a class of functionally important proteins and regions that lack stable three-dimensional structures under the native physiologic conditions. They participate in critical biological processes and thus are associated with the pathogenesis of many severe human diseases. Identifying the IDPs/IDRs and their functions will be helpful for a comprehensive understanding of protein structures and functions, and inform studies of rational drug design. Over the past decades, the exponential growth in the number of proteins with sequence info…

11. Weissenow, Konstantin, and Burkhard Rost. "Are protein language models the new universal key?" Current Opinion in Structural Biology 91 (April 2025): 102997. https://doi.org/10.1016/j.sbi.2025.102997.

12. Wang, Wenkai, Zhenling Peng, and Jianyi Yang. "Single-sequence protein structure prediction using supervised transformer protein language models." Nature Computational Science 2, no. 12 (2022): 804–14. http://dx.doi.org/10.1038/s43588-022-00373-3.

13. Zhang, Zitong, Quan Zou, Chunyu Wang, Junjie Wang, and Lingling Zhao. "Improving protein–protein interaction modulator predictions via knowledge-fused language models." Information Fusion 123 (November 2025): 103227. https://doi.org/10.1016/j.inffus.2025.103227.

14. Kenlay, Henry, Frédéric A. Dreyer, Aleksandr Kovaltsuk, Dom Miketa, Douglas Pires, and Charlotte M. Deane. "Large scale paired antibody language models." PLOS Computational Biology 20, no. 12 (2024): e1012646. https://doi.org/10.1371/journal.pcbi.1012646.

Abstract:
Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consis…

15. Zhao, Long, Qiang He, Huijia Song, et al. "Protein A-like Peptide Design Based on Diffusion and ESM2 Models." Molecules 29, no. 20 (2024): 4965. http://dx.doi.org/10.3390/molecules29204965.

Abstract:
Proteins are the foundation of life, and designing functional proteins remains a key challenge in biotechnology. Before the development of AlphaFold2, the focus of design was primarily on structure-centric approaches such as using the well-known open-source software Rosetta3. Following the development of AlphaFold2, deep-learning techniques for protein design gained prominence. This study proposes a new method to generate functional proteins using the diffusion model and ESM2 protein language model. Diffusion models, which are widely used in image and natural language generation, are used here…

16. Qu, Yang, Zitong Niu, Qiaojiao Ding, et al. "Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction." International Journal of Molecular Sciences 24, no. 22 (2023): 16496. http://dx.doi.org/10.3390/ijms242216496.

Abstract:
Machine learning has been increasingly utilized in the field of protein engineering, and research directed at predicting the effects of protein mutations has attracted increasing attention. Among them, so far, the best results have been achieved by related methods based on protein language models, which are trained on a large number of unlabeled protein sequences to capture the generally hidden evolutionary rules in protein sequences, and are therefore able to predict their fitness from protein sequences. Although numerous similar models and methods have been successfully employed in practical…
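
One standard way a PLM trained only on unlabeled sequences yields a fitness estimate is the masked-marginal score: mask the mutated position and compare the model's log-probabilities of the mutant and wild-type residues. The sketch below uses fair-esm and a small ESM-2 checkpoint; it is a common baseline recipe, not necessarily the ensemble method of this paper.

    import torch
    import esm

    model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    def masked_marginal_score(seq, pos, wt, mut):
        """log P(mut) - log P(wt) at 0-based `pos`, with that position masked."""
        assert seq[pos] == wt
        _, _, tokens = batch_converter([("wt", seq)])
        tokens[0, pos + 1] = alphabet.mask_idx       # +1 accounts for the BOS token
        with torch.no_grad():
            logits = model(tokens)["logits"]
        logp = torch.log_softmax(logits[0, pos + 1], dim=-1)
        return (logp[alphabet.get_idx(mut)] - logp[alphabet.get_idx(wt)]).item()

    print(masked_marginal_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 3, "A", "W"))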

17. Weber, Leon, Kirsten Thobe, Oscar Arturo Migueles Lozano, Jana Wolf, and Ulf Leser. "PEDL: extracting protein–protein associations using deep language models and distant supervision." Bioinformatics 36, Supplement_1 (2020): i490–i498. http://dx.doi.org/10.1093/bioinformatics/btaa430.

Abstract:
Motivation: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. Results: We propose PPA Extraction with Deep…

18. Wang, Yang. "Enhanced protein function prediction by fusion embedding based on protein language models." Highlights in Science, Engineering and Technology 66 (September 20, 2023): 177–84. http://dx.doi.org/10.54097/hset.v66i.11697.

Abstract:
Natural language models can accomplish non-natural language tasks such as protein prediction, but the actual prediction effect is low and occupies large computational resources. In this paper, a fusion embedding model is proposed to improve the prediction effect of the model and reduce the computational cost of the model by fusing information of different dimensions. The paper is validated by the downstream task of protein function prediction, which provides a reference for solving practical tasks using fusion embedding methods.

19. Sun, Yuanfei, and Yang Shen. "Variant effect prediction using structure-informed protein language models." Biophysical Journal 122, no. 3 (2023): 473a. http://dx.doi.org/10.1016/j.bpj.2022.11.2537.

20. Thumuluri, Vineet, Hannah-Marie Martiny, Jose J. Almagro Armenteros, Jesper Salomon, Henrik Nielsen, and Alexander Rosenberg Johansen. "NetSolP: predicting protein solubility in Escherichia coli using language models." Bioinformatics 38, no. 4 (2021): 941–46. http://dx.doi.org/10.1093/bioinformatics/btab801.

Abstract:
Motivation: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. Results: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves…

21. Wang, Bo, and Wenjin Li. "Advances in the Application of Protein Language Modeling for Nucleic Acid Protein Binding Site Prediction." Genes 15, no. 8 (2024): 1090. http://dx.doi.org/10.3390/genes15081090.

Abstract:
Protein and nucleic acid binding site prediction is a critical computational task that benefits a wide range of biological processes. Previous studies have shown that feature selection holds particular significance for this prediction task, making the generation of more discriminative features a key area of interest for many researchers. Recent progress has shown the power of protein language models in handling protein sequences, in leveraging the strengths of attention networks, and in successful applications to tasks such as protein structure prediction. This naturally raises the question of…

22. Mardikoraem, Mehrsa, and Daniel Woldring. "Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods." Pharmaceutics 15, no. 5 (2023): 1337. http://dx.doi.org/10.3390/pharmaceutics15051337.

Abstract:
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating t…

23. Liu, Yubao, Benrui Wang, Bocheng Yan, Haiyue Jiang, and Yinfei Dai. "POSA-GO: Fusion of Hierarchical Gene Ontology and Protein Language Models for Protein Function Prediction." International Journal of Molecular Sciences 26, no. 13 (2025): 6362. https://doi.org/10.3390/ijms26136362.

Abstract:
Protein function prediction plays a crucial role in uncovering the molecular mechanisms underlying life processes in the post-genomic era. However, with the widespread adoption of high-throughput sequencing technologies, the pace of protein function annotation significantly lags behind that of sequence discovery, highlighting the urgent need for more efficient and reliable predictive methods. To address the problem of existing methods ignoring the hierarchical structure of gene ontology terms and making it challenging to dynamically associate protein features with functional contexts, we propo…

24. Deutschmann, Nicolas, Aurelien Pelissier, Anna Weber, Shuaijun Gao, Jasmina Bogojeska, and María Rodríguez Martínez. "Do domain-specific protein language models outperform general models on immunology-related tasks?" ImmunoInformatics 14 (June 2024): 100036. http://dx.doi.org/10.1016/j.immuno.2024.100036.

25. Bhat, Suhaas, Garyk Brixi, Kalyan Palepu, et al. "Abstract C118: Design of programmable peptide-guided oncoprotein degraders via generative language models." Molecular Cancer Therapeutics 22, no. 12_Supplement (2023): C118. http://dx.doi.org/10.1158/1535-7163.targ-23-c118.

Abstract:
Targeted protein degradation of pathogenic proteins represents a powerful new treatment strategy for multiple cancers. Unfortunately, a sizable portion of these proteins are considered “undruggable” by standard small molecule-based approaches, including PROTACs and molecular glues, largely due to their disordered nature, instability, and lack of binding site accessibility. As a more modular strategy, we have developed a genetically-encoded protein architecture by fusing target-specific peptides to E3 ubiquitin ligase domains for selective and potent intracellular degradation of oncopr…

26. Liu, Dan, Francesca Young, Kieran D. Lamb, David L. Robertson, and Ke Yuan. "Prediction of virus-host associations using protein language models and multiple instance learning." PLOS Computational Biology 20, no. 11 (2024): e1012597. http://dx.doi.org/10.1371/journal.pcbi.1012597.

Abstract:
Predicting virus-host associations is essential to determine the specific host species that viruses interact with, and discover if new viruses infect humans and animals. Currently, the host of the majority of viruses is unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species for viruses from viral sequences only. It also identifies important viral proteins that significantly contribute to host prediction. The method combines a pre-trained large protein language model (ESM) and attention-based multiple instance…
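
Attention-based multiple instance learning, as named in this abstract, treats each virus as a bag of its proteins' embeddings and learns per-protein attention weights that both pool the bag and flag influential proteins. A minimal PyTorch sketch (the dimensions and single-logit head are illustrative assumptions, not EvoMIL's exact architecture):

    import torch
    import torch.nn as nn

    class AttentionMIL(nn.Module):
        """Pool a bag of instance embeddings with learned attention weights."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            self.head = nn.Linear(dim, 1)

        def forward(self, bag):                        # bag: (n_instances, dim)
            a = torch.softmax(self.attn(bag), dim=0)   # per-instance weights
            z = (a * bag).sum(dim=0)                   # weighted bag embedding
            return self.head(z), a.squeeze(-1)         # host logit + attention

    bag = torch.randn(7, 320)        # e.g. 7 viral proteins, 320-dim PLM embeddings
    logit, weights = AttentionMIL(320)(bag)

The returned attention weights are what make the influential-protein readout possible.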

27. Yadalam, Pradeep Kumar, Ramya Ramadoss, Pradeep Kumar R, and Jishnu Krishna Kumar. "Pre-Trained Language Models Based Sequence Prediction of Wnt-Sclerostin Protein Sequences in Alveolar Bone Formation." Journal of Pioneering Medical Science 12, no. 3 (2023): 55–60. http://dx.doi.org/10.61091/jpms202312311.

Abstract:
Background and Introduction: Osteocytes, the most numerous bone cells, create sclerostin. The sclerostin protein sequence predictive model helps create novel medications and produce alveolar bone in periodontitis and other oral bone illnesses, including osteoporosis. Neural networks examine protein variants for protein engineering and predict their structure and function impacts. Proteins with improved function and stability have been engineered using LLMs and CNNs. Sequence-based models, especially protein LLMs, predict variation effects, fitness, post-translational modifications, biophysical…

28. Nana Teukam, Yves Gaetan, Loïc Kwate Dassi, Matteo Manica, Daniel Probst, Philippe Schwaller, and Teodoro Laino. "Language models can identify enzymatic binding sites in protein sequences." Computational and Structural Biotechnology Journal 23 (December 2024): 1929–37. http://dx.doi.org/10.1016/j.csbj.2024.04.012.

29. Hoang, Minh, and Mona Singh. "Locality-aware pooling enhances protein language model performance across varied applications." Bioinformatics 41, Supplement_1 (2025): i217–i226. https://doi.org/10.1093/bioinformatics/btaf178.

Abstract:
Motivation: Protein language models (PLMs) are amongst the most exciting recent advances for characterizing protein sequences, and have enabled a diverse set of applications, including structure determination, functional property prediction, and mutation impact assessment, all from single protein sequences alone. State-of-the-art PLMs leverage transformer architectures originally developed for natural language processing, and are pre-trained on large protein databases to generate contextualized representations of individual amino acids. To harness the power of these PLMs to predict pro…
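
To make the pooling question concrete: the default way to get one vector per protein is to average the per-residue representations, which discards where along the sequence a signal occurs. The sketch below contrasts plain mean pooling with a crude window-wise alternative that keeps some locality; it illustrates the idea only and is not this paper's pooling scheme.

    import numpy as np

    def mean_pool(reps):
        """Global average of per-residue representations -> one vector."""
        return reps.mean(axis=0)

    def windowed_pool(reps, n_windows=4):
        """Average within contiguous windows and concatenate, so signals
        from different sequence regions remain separable."""
        chunks = np.array_split(reps, n_windows, axis=0)
        return np.concatenate([c.mean(axis=0) for c in chunks])

    reps = np.random.randn(200, 1280)   # hypothetical per-residue embeddings
    print(mean_pool(reps).shape, windowed_pool(reps).shape)   # (1280,) (5120,)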

30. Valentini, Giorgio, Dario Malchiodi, Jessica Gliozzo, et al. "The promises of large language models for protein design and modeling." Frontiers in Bioinformatics 3 (November 23, 2023). http://dx.doi.org/10.3389/fbinf.2023.1304099.

Abstract:
The recent breakthroughs of Large Language Models (LLMs) in the context of natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the “language of proteins” invite the application and adaptation of LLMs to protein modelling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Indeed, protein language models have been already trained to accu…

31. Mall, Raghvendra, Rahul Kaushik, Zachary A. Martinez, Matt W. Thomson, and Filippo Castiglione. "Benchmarking protein language models for protein crystallization." Scientific Reports 15, no. 1 (2025). https://doi.org/10.1038/s41598-025-86519-5.

Abstract:
The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to overcome the high attrition rate, cost of experiments and extensive trial-and-error settings, for predicting the crystallization propensities of proteins based on their sequences. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework democratizing the usage of PLMs for the task of predicting crystallization propensities of proteins. By comparing LightGBM / XGBoost…

32. Unsal, Serbulent, Heval Ataş, Muammer Albayrak, Kemal Turhan, Aybar C. Acar, and Tunca Doğan. "Learning Functional Properties of Proteins with Language Models." October 28, 2020. https://doi.org/10.5281/zenodo.5795850.

Abstract:
This dataset includes: precomputed representation vectors of human proteins from various protein embedding models; precomputed representation vectors of the SKEMPI dataset from various protein embedding models; MSAs of human proteins computed with HHblits (the split tar.gz files can be reassembled and extracted with the command cat human_protein_msa.tar.gz.* | tar xzvf -); and MSAs of the protein sequences of the SKEMPI dataset, also computed with HHblits.

33. Avraham, Orly, Tomer Tsaban, Ziv Ben-Aharon, Linoy Tsaban, and Ora Schueler-Furman. "Protein language models can capture protein quaternary state." BMC Bioinformatics 24, no. 1 (2023). http://dx.doi.org/10.1186/s12859-023-05549-w.

Abstract:
Background: Determining a protein’s quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of de…

34. Boshar, Sam, Evan Trop, Bernardo P. de Almeida, Liviu Copoiu, and Thomas Pierrot. "Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks." Bioinformatics, August 30, 2024. http://dx.doi.org/10.1093/bioinformatics/btae529.

Abstract:
Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, due to few tasks pairing proteins with the coding DNA sequences (CDS) that can be processed by gLMs. Results: In this work, we curated five such datasets and used them to…

35. An, Jingmin, and Xiaogang Weng. "Collectively encoding protein properties enriches protein language models." BMC Bioinformatics 23, no. 1 (2022). http://dx.doi.org/10.1186/s12859-022-05031-z.

Abstract:
Pre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing con…

36. Narang, Kush, Abhigyan Nath, William Hemstrom, and Simon K. S. Chu. "HaloClass: Salt-Tolerant Protein Classification with Protein Language Models." Protein Journal, October 21, 2024. http://dx.doi.org/10.1007/s10930-024-10236-7.

Abstract:
Salt-tolerant proteins, also known as halophilic proteins, have unique adaptations to function in high-salinity environments. These proteins have naturally evolved in extremophilic organisms, and more recently, are being increasingly applied as enzymes in industrial processes. Due to an abundance of salt-tolerant sequences and a simultaneous lack of experimental structures, most computational methods to predict stability are sequence-based only. These approaches, however, are hindered by a lack of structural understanding of these proteins. Here, we present HaloClass, an SVM classifier…

37. Tule, Sanjana, Gabriel Foley, and Mikael Bodén. "Do protein language models learn phylogeny?" Briefings in Bioinformatics 26, no. 1 (2024). https://doi.org/10.1093/bib/bbaf047.

Abstract:
Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends…

38. Vitale, Rosario, Leandro A. Bugnon, Emilio Luis Fenoy, Diego H. Milone, and Georgina Stegmayer. "Evaluating large language models for annotating proteins." Briefings in Bioinformatics 25, no. 3 (2024). http://dx.doi.org/10.1093/bib/bbae177.

Abstract:
In UniProtKB, up to date, there are more than 251 million proteins deposited. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a…

39. McWhite, Claire Darnell, Isabel Armour-Garb, and Mona Singh. "Leveraging protein language models for accurate multiple sequence alignments." Genome Research, July 6, 2023, gr.277675.123. http://dx.doi.org/10.1101/gr.277675.123.

Abstract:
Multiple sequence alignment is a critical step in the study of protein sequence and function. Typically, multiple sequence alignment algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. While successful, standard methods struggle on sets of proteins with low sequence identity - the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a pow…
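
The extra source of information the abstract points toward drops straight into classic alignment machinery: replace the substitution-matrix entry for a residue pair with the similarity of the two residues' PLM embeddings. A minimal Needleman-Wunsch sketch scored by cosine similarity (random arrays stand in for real embeddings; traceback omitted for brevity):

    import numpy as np

    def alignment_score(A, B, gap=-1.0):
        """Global alignment where match scores are cosine similarities of
        per-residue embeddings rather than substitution-matrix entries."""
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        S = A @ B.T                                   # (lenA, lenB) similarities
        n, m = S.shape
        F = np.zeros((n + 1, m + 1))
        F[:, 0] = gap * np.arange(n + 1)
        F[0, :] = gap * np.arange(m + 1)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                F[i, j] = max(F[i-1, j-1] + S[i-1, j-1],
                              F[i-1, j] + gap,
                              F[i, j-1] + gap)
        return F[n, m]

    a = np.random.randn(40, 1280)
    b = np.random.randn(35, 1280)
    print(alignment_score(a, b))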

40. Jing, Xiaoyang, Fandi Wu, Xiao Luo, and Jinbo Xu. "Single-sequence protein structure prediction by integrating protein language models." Proceedings of the National Academy of Sciences 121, no. 13 (2024). http://dx.doi.org/10.1073/pnas.2308788121.

Abstract:
Protein structure prediction has been greatly improved by deep learning in the past few years. However, the most successful methods rely on multiple sequence alignment (MSA) of the sequence homologs of the protein under prediction. In nature, a protein folds in the absence of its sequence homologs and thus, a MSA-free structure prediction method is desired. Here, we develop a single-sequence-based protein structure prediction method RaptorX-Single by integrating several protein language models and a structure generation module and then study its advantage over MSA-based methods. Our experiment…

41. Lin, Peicong, Huanyu Tao, Hao Li, and Sheng-You Huang. "Protein-protein contact prediction by geometric triangle-aware protein language models." August 31, 2023. https://doi.org/10.5281/zenodo.8304327.

Abstract:
First version of the DeepInter standalone package. DeepInter is a deep learning-based method developed to predict contacts across the interfaces of dimers using geometric triangle-aware protein language models.

42. Haselbeck, Florian, Maura John, Yuqi Zhang, et al. "Superior protein thermophilicity prediction with protein language model embeddings." NAR Genomics and Bioinformatics 5, no. 4 (2023). http://dx.doi.org/10.1093/nargab/lqad087.

Abstract:
Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein l…

43. Lin, Peicong, Huanyu Tao, Hao Li, and Sheng-You Huang. "Protein–protein contact prediction by geometric triangle-aware protein language models." Nature Machine Intelligence, October 19, 2023. http://dx.doi.org/10.1038/s42256-023-00741-2.

44. Kabir, Anowarul, Asher Moldwin, Yana Bromberg, and Amarda Shehu. "In the Twilight Zone of Protein Sequence Homology: Do Protein Language Models Learn Protein Structure?" Bioinformatics Advances, August 17, 2024. http://dx.doi.org/10.1093/bioadv/vbae119.

Abstract:
Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence…

45. Ieremie, Ioan, Rob M. Ewing, and Mahesan Niranjan. "Protein language models meet reduced amino acid alphabets." Bioinformatics, February 3, 2024. http://dx.doi.org/10.1093/bioinformatics/btae061.

Abstract:
Motivation: Protein Language Models (PLMs), which borrowed ideas for modelling and inference from Natural Language Processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This led to significant performance improvement in several downstream tasks. Clustering amino acids based on their physical-chemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored. Results: Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in captu…

46. Pudžiuvelytė, Ieva, Kliment Olechnovič, Egle Godliauskaite, et al. "TemStaPro: protein thermostability prediction using sequence representations from protein language models." Bioinformatics, March 20, 2024. http://dx.doi.org/10.1093/bioinformatics/btae157.

Abstract:
Motivation: Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using machine learning and by taking advantage of the recent blossoming of deep learning methods for sequence analysis. These methods can facilitate training on more data and, possibly, enable development of more versatile thermostability predictors for multiple ranges of temperatures. Results: We applied the principle of transfer learning to predict protein thermostability using embeddings generated by protein languag…
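
The transfer-learning recipe named here is, at its core, a small supervised model trained on frozen PLM embeddings. A minimal scikit-learn sketch, with random arrays standing in for real precomputed embeddings and thermostability labels (not TemStaPro's actual architecture or data):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Hypothetical per-protein embeddings and binary thermostability labels.
    X = np.random.randn(500, 1024)
    y = np.random.randint(0, 2, size=500)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300).fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))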

47. Chen, Bo, Ziwei Xie, Jiezhong Qiu, Zhaofeng Ye, Jinbo Xu, and Jie Tang. "Improved the heterodimer protein complex prediction with protein language models." Briefings in Bioinformatics, June 16, 2023. http://dx.doi.org/10.1093/bib/bbad221.

Abstract:
AlphaFold-Multimer has greatly improved the protein complex structure prediction, but its accuracy also depends on the quality of the multiple sequence alignment (MSA) formed by the interacting homologs (i.e. interologs) of the complex under prediction. Here we propose a novel method, ESMPair, that can identify interologs of a complex using protein language models. We show that ESMPair can generate better interologs than the default MSA generation method in AlphaFold-Multimer. Our method results in better complex structure prediction than AlphaFold-Multimer by a large margin (+10.7% i…

48. Xiang, Wenkai, Zhaoping Xiong, Huan Chen, et al. "FAPM: Functional Annotation of Proteins using Multi-Modal Models Beyond Structural Modeling." Bioinformatics, November 14, 2024. http://dx.doi.org/10.1093/bioinformatics/btae680.

Abstract:
Motivation: Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and “tail labels” with few known examples. Previous methods mainly focused on protein sequence features, overlooking the semantic meaning of protein labels. Results: We introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional…

49. Tang, Tian-Yi, Yi-Ming Xiong, Rui-Ge Zhang, et al. "Progress in Protein Pre-training Models Integrated with Structural Knowledge." Acta Physica Sinica, 2024. http://dx.doi.org/10.7498/aps.73.20240811.

Abstract:
The AI revolution sparked by natural language and image processing has brought new ideas and research paradigms to the field of protein computing. One significant advancement is the development of pre-trained protein language models through self-supervised learning from massive protein sequences. These pre-trained models encode various information about protein sequences, evolution, structures, and even functions, which can be easily transferred to various downstream tasks and demonstrate robust generalization capabilities. Recently, researchers are further developing multimodal pre-trained mo…

50. Sun, Yuanfei, and Yang Shen. "Structure-Informed Protein Language Models are Robust Predictors for Variants Effects." July 30, 2023. https://doi.org/10.5281/zenodo.8197882.

Abstract:
Data and model repository for Structure-Informed Protein Language Models are Robust Predictors for Variants Effects. Included files upon decompression: Data: pretrain_seq_inputs.tar.bz2, finetune_seq_only_inputs.tar.bz2, finetuen_seq_struct_inputs.tar.bz2, mutation_fitness_set_data.tar.bz2. Model: pretrained_models.tar.bz2, seq_finetuned_models.tar.bz2, seq_struct_finetuned_models.tar.bz2. File descriptions: pretrain_seq_inputs contains Pfam domain sequence data for pre-training; two compressed files are included, 'seq_json_rp75_all.tar.bz2' and 'seq_json_rp15_all.tar…'