Rosetta ab initio prediction and protein-protein interaction fitness help

I have designed several proteins which I predict interact with another protein, using the sequence-based Conjoint Triad Method. I would like to know which ones are structurally predicted to dock and interact. I have Rosetta installed and have looked at the examples a little. Rosetta doesn't have much "how to" documentation on protein design and fitness selection. Could someone please walk me through the process, line by line, of taking a FASTA file with a series of proteins through Rosetta's ab initio PDB generation, and then explain how and where to look for a fitness score for each protein's interaction/docking? Also, I read in "Design of Protein-Protein Interaction Specificity Using Computational Methods and Experimental Library Screening", page 91, that it is possible to generate specific protein combinations that map to a structure using Rosetta. How do you do that? Thank you :)

RosettaCommons links to additional software add-ins but omits the crucial Sparks-X. I found Sparks-X at this webpage. I also found that make_fragments.pl needs many changes before it runs properly. Several files are listed in make_fragments.pl; two of them that lack links, pdb_seqres.txt and entries.idx, can be found at the RCSB (link below).

The NCBI nr database requires about 100 GB of free space to download and format properly; anything less results in errors.
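For reference, a minimal sketch of fetching and formatting nr for the legacy BLAST toolchain (blastpgp/formatdb) that make_fragments.pl drives; the FTP path is the usual NCBI location but may change:

    wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
    gunzip nr.gz
    formatdb -i nr -p T -o T   # -o T builds the seq-id indices blastpgp needs

The formatted database plus its indices is what takes up the roughly 100 GB.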

make_fragments.pl requires a single sequence per FASTA file; multiple sequences in one file cause it to crash.
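A quick way to split a multi-sequence FASTA into one file per design before fragment picking (the output naming scheme is illustrative):

    awk '/^>/{n++; close(out); out=sprintf("design_%02d.fasta", n)} {print > out}' all_designs.fasta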

PSIPRED no longer ships a weights.dat4 file, so this line in make_fragments.pl needs the dat4 entry removed:

"$PSIPRED sstmp.mtx $PSIPRED_DATA/weights.dat $PSIPRED_DATA/weights.dat2 $PSIPRED_DATA/weights.dat3 $PSIPRED_DATA/weights.dat4 > psipred_ss",

My make_fragments.pl run does not always finish; it sometimes gives this error:

ERROR: Error reading in FragmentPicker::read_spine_x(): does not match size of query!

This error was caused by a ^M (carriage return) in the FASTA file, which I removed with vi.
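To check for and strip carriage returns from the shell instead (GNU sed; the vi equivalent is :%s/\r//g):

    grep -c $'\r' design_01.fasta    # count lines carrying a CR
    sed -i 's/\r$//' design_01.fasta # strip trailing CRs in place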

I'm attempting to bypass the fragment picker with the CS-Rosetta toolkit from csrosetta.org.

The 32-bit version of libstdc++6 must be used for TALOS+. I used a 32-bit Ubuntu machine to copy /usr/lib/libstdc++.so.6 and uploaded it here: http://www.mediafire.com/?j0133qqwiilsuz1

CS-Rosetta does not create fragments, for reasons unknown. I have emailed the creator and RosettaCommons support as well. For now I recommend the online Robetta server.

There are examples of how to use prepacking and docking in rosetta_tests.
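A minimal prepack-then-dock sketch along those lines (the executable suffix, chain letters and nstruct are illustrative; check the integration tests for the exact flags your build expects):

    docking_prepack_protocol.linuxgccrelease -s AB_start.pdb -partners A_B -ex1 -ex2aro
    docking_protocol.linuxgccrelease -s AB_prepacked.pdb -partners A_B \
        -dock_pert 3 8 -nstruct 500 -ex1 -ex2aro -out:file:scorefile docking_scores.sc

In docking_scores.sc, the interface score column (I_sc) is the usual per-model docking fitness to compare between designs: the lower (more negative) the I_sc of the best decoys, the better the predicted interaction.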

PPIs can be designed by docking one protein to another and then running sequence tolerance, though the resfile doesn't seem to be able to limit the changes to particular residue identities at this point.
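For reference, a small resfile of the kind used in such a run (residue numbers and chains are illustrative). NATAA in the header keeps native identities and only repacks by default, start separates the header from the body, ALLAA opens a position to all twenty amino acids, and PIKAA restricts a position to the listed set:

    NATAA
    start
    45 A ALLAA
    62 B PIKAA DEKR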

Sequence tolerance has an extra-rotamer bug and fills up memory and the page file for certain proteins if the "ex"-series flags are used. It also sometimes ignores the resfile, so double-check the initial script's results before beginning a multi-generational study.

Ab initio relax has a bug which prevents it from running twice in the same directory. Remove default.out to rerun the structure prediction.
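A minimal sketch of such a run (fragment files follow the usual Robetta naming; the executable suffix and nstruct are illustrative):

    rm -f default.out   # clear stale output, or the rerun aborts
    AbinitioRelax.linuxgccrelease -in:file:fasta design_01.fasta \
        -in:file:frag3 aat000_03_05.200_v1_3 -in:file:frag9 aat000_09_05.200_v1_3 \
        -abinitio:relax -nstruct 100 -out:file:silent default.out

The lowest-scoring models in default.out can then be extracted to PDB files with the extract_pdbs application and carried into the docking steps above.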

Docking can be forced to one location by using a constraints file, but the randomize and spin options must be turned off for it to work properly.

-constraints:cst_weight <integer> -constraints:cst_file <cstfile>
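For example, a one-line constraint file in the format documented at the link below (atom names, residue numbers and harmonic parameters are illustrative), restraining the CA atoms of residues 15 and 203 to sit roughly 8 Å apart with a standard deviation of 1 Å:

    AtomPair CA 15 CA 203 HARMONIC 8.0 1.0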

http://www.rosettacommons.org/manuals/archive/rosetta3.4_user_guide/de/d50/constraint_file.html

http://www.rcsb.org/pdb/static.do?p=general_information/about_pdb/summaries.html


Rosetta May Hold Key To Predicting Protein Folding

February 12, 2001 - A computational method developed by Howard Hughes Medical Institute investigator David A. Baker and his colleagues has proven quite successful in predicting the three-dimensional structure of a folded protein from its linear sequence of amino acids.

Rosetta, the name of the computational technique developed by Baker and his colleagues at the University of Washington, showed striking success in predicting the three-dimensional structure of proteins during the fourth Critical Assessment of Techniques for Protein Structure Prediction (CASP4).

In the CASP4 experiment (http://predictioncenter.llnl.gov/casp4), which began in April 2000, more than 100 research groups generated three-dimensional structures for 40 candidate proteins. A candidate protein, or target, was considered to be eligible for CASP4 if its three-dimensional structure had been deduced through structural analysis but not yet published by researchers or made public in a protein structure database. Each research group was given the amino acid sequence of the target proteins, and they were asked to develop three-dimensional models of the folded proteins. Results of CASP4 were presented and discussed at a conference in Asilomar, California in early December.

Even a few years ago, says Baker, success in predicting how proteins assume their intricate three-dimensional forms was considered highly unlikely if there was no related protein of known structure. For those proteins whose sequence resembles a protein of known structure, the three-dimensional structure of the known protein can be used as a "template" to deduce the unknown protein structure. However, about 60 percent of protein sequences arising from the genome sequencing projects have no homologs of known structure.

Despite the lack of past success, researchers have pursued the problem of predicting three-dimensional protein structure only from the amino acid sequence, called ab initio prediction, because it is one of the central problems in computational molecular biology. Recently, the problem has taken on more importance as human gene sequencing efforts have provided researchers with massive amounts of raw gene sequence data.

"One of the problems with structure prediction is that it is all too easy to produce a program that correctly predicts the structure of a protein if you know the correct structure in advance," Baker said. "By challenging researchers to produce models before knowing the right answer, the CASP experiments have provided an invaluable boost to the field."

The Rosetta computer algorithm for predicting protein folding draws on experimental studies of protein folding by Baker's laboratory and many others. "During folding, each local segment of the chain flickers between a different subset of local conformations," said Baker. "Folding to the native structure occurs when the conformations adopted by the local segments and their relative orientations allow burial of the hydrophobic residues, pairing of the beta strands, and other low energy features of native protein structures. In the Rosetta algorithm, the distribution of conformations observed for each short sequence segment in known protein structures is taken as an approximation of the set of local conformations that sequence segment would sample during folding. The program then searches for the combination of these local conformations that has the lowest overall energy."

The results reported using Rosetta at the CASP4 meeting revealed that enormous progress has been made in ab initio structure prediction, said Baker. For example, four years ago, at the CASP2 meeting, there were few reasonable ab initio structure predictions, he said. "In contrast, in the CASP4 experiment, analysis of the predicted structures showed that for the majority of proteins with no homology to proteins of known structure, we had produced reasonable low-resolution models for large fragments of up to about 90 amino acids.

"Interestingly, some of our predicted structures were quite similar to structures of proteins that had already been solved, and which turned out to have similar functions to the target protein, even though there was no significant sequence similarity. Thus, our predicted structures provided clues about function that could not be obtained by traditional sequence comparison methods," Baker said.

Peter Kollman, an expert in computational molecular modeling at the University of California, San Francisco, who participated in the CASP4 experiment, gives some additional perspective: "The evaluators of the structures for the ab initio predictions gave two points for a structure which was 'among the very best,' one point for a structure that was 'pretty good' and zero if the structure was reasonably far from the correct one.

"The amazing thing is that David Baker's group had 31 points and the next best group had 8 points. It is like baseball in 1927, when Babe Ruth hit 60 home runs and the runner up hit 14 [and] some teams didn't hit as many as he.

"Nonetheless, there is still some way to go in predicting these structures to experimental accuracy," said Kollman, "but all of us are hopeful this will advance also."

Baker concurs: "While these three-dimensional structures are not detailed enough, for example, for structure-based drug design, they can yield invaluable insights into the function of unknown proteins," said Baker. "So, our aim is to use our ab initio structure prediction method to produce three-dimensional models for proteins of unknown function. And using those models, we can search the database of protein structures to determine whether they are similar to proteins of known function. From this similarity, it might be possible to draw functional inferences about what those proteins do.

"We&rsquore very excited now about trying to do this on a large scale, to make functional inferences for the large fraction of proteins about which one cannot currently say anything at all," said Baker. "The power of these methods is that, since no information is needed other than the amino acid sequence, one can conceive of going through a genome and generating structures and possibly functional insights for every protein."


1. Introduction

The protein structure prediction and design work in my group is carried out using a computer program called Rosetta. At the core of Rosetta are potential functions for computing the energies of interactions within and between macromolecules, and optimization methods for finding the lowest energy structure for an amino acid sequence (protein structure prediction) or a protein–protein complex, and for finding the lowest energy amino acid sequence for a protein or protein–protein complex (protein design). Both the potential functions and the search algorithms are continually being improved based on feedback from the prediction and design tests (see schematic in figure 1). There are considerable advantages in developing one computer program to treat these quite diverse problems: first, the different applications provide very complementary tests of the underlying physical model (the fundamental physics/physical chemistry is of course the same in all cases), and second, many problems of current interest, such as flexible backbone protein design and protein–protein docking with backbone flexibility, involve a combination of the different optimization methods.

Schematic diagram of Rosetta structure prediction and design efforts.

In the following sections, I summarize recent progress and highlights in each of the different areas and illustrate the development of the physical model. I will put particular emphasis on the results from each of the areas that suggest real progress is being made in high-resolution modelling.

(a) Design of protein structure

Over the past several years, we have used our computational protein design method to dramatically stabilize several small proteins by completely redesigning every residue of their sequences (Dantas et al. 2003), to redesign protein backbone conformation (Nauli et al. 2001), to convert a monomeric protein to a strand-swapped dimer (Kuhlman et al. 2002), and to thermostabilize an enzyme (Korkegian et al. 2005). A highlight was the redesign of the folding pathway of protein G, a small protein containing two beta hairpins separated by an alpha helix. In the naturally occurring protein, the first hairpin is disrupted and the second hairpin is formed at the rate-limiting step in folding, but in a redesigned variant in which the first hairpin was significantly stabilized and the second hairpin destabilized, the order of events is reversed: the first hairpin is formed and the second hairpin disrupted in the folding transition state (Nauli et al. 2002). The ability to rationally redesign protein folding pathways shows that our understanding of the determinants of protein folding has advanced considerably.

Particularly exciting more recently is the achievement of a grand challenge of computational protein design: the creation of novel proteins with arbitrarily chosen three-dimensional structures. We developed a general computational strategy for creating such novel protein structures that incorporates full backbone flexibility into rotamer-based sequence optimization. This was accomplished by integrating ab initio protein structure prediction, atomic-level energy refinement, and sequence design in Rosetta. The procedure was used to design a 93-residue protein called Top7 with a novel sequence and topology. Top7 was found experimentally to be monomeric and folded, and the X-ray crystal structure of Top7 is strikingly similar (r.m.s.d.=1.2 Å) to the design model (figure 2; Kuhlman et al. 2003). The successful design of a new globular protein fold and the very close correspondence of the crystal structure to the design model have broad implications for protein design and protein structure prediction, and open the door to the exploration of the large regions of the protein universe not yet observed in nature.

Comparison of Top7 X-ray crystal structure (red) and design model (blue). (a) Cα overlay; (b) detail of sidechain packing in the core.

(b) Design of protein–protein interactions

To explore the extension of these methods to protein–protein interactions, and in particular to the redesign of interaction specificity, we chose as a model system the high-affinity complex between colicin E7 DNase and its cognate inhibitory immunity protein. Novel DNase–inhibitor protein pairs predicted to interact tightly with one another but not with the wild-type proteins were generated using the physical model described above and a modification of our rotamer search-based computational design strategy incorporating elements of both positive and negative design. Experimental characterization demonstrated that the designed protein complexes have sub-nanomolar affinities, are functional and specific in vivo, and have more than an order of magnitude affinity difference between cognate and non-cognate pairs in vitro (Kortemme et al. 2004). The approach should be applicable to the design of interacting protein pairs with novel specificities for delineating and reengineering protein interaction networks in living cells.

In collaboration with Dr Barry Stoddard's and Dr Ray Monnat's research groups, we generated an artificial, highly specific endonuclease by fusing domains of the homing endonucleases I-DmoI and I-CreI through computational optimization of a new domain–domain interface between these normally non-interacting proteins. The resulting enzyme, E-DreI (Engineered I-DmoI/I-CreI), binds a long chimeric DNA target site with nanomolar affinity, cleaving it precisely at a rate equivalent to its natural parents (Chevalier et al. 2002). We are currently trying to develop a whole new generation of endonucleases by redesigning the protein–DNA interface using an extension of our design methodology to protein–nucleic acid interfaces (Havranek et al. 2004).

In both of these systems, it has been possible to determine X-ray crystal structures of the designed complexes. As in the Top7 case, the actual structures are very close to the design models, which is an independent and important validation of the accuracy of our approach to high-resolution modelling.

(c) Prediction of protein structure

The picture of protein folding that motivates our approach to ab initio protein tertiary structure prediction is that sequence-dependent local interactions bias segments of the chain to sample distinct sets of local structures, and that non-local interactions select the lowest free-energy tertiary structures from the many conformations compatible with these local biases. In implementing the strategy suggested by this picture, we use different models to treat the local and non-local interactions. Rather than attempting a physical model for local sequence–structure relationships, we turn to the protein database and take the distribution of local structures adopted by short sequence segments (fewer than 10 residues in length) in known three-dimensional structures as an approximation to the distribution of structures sampled by isolated peptides with the corresponding sequences. The primary non-local interactions considered are hydrophobic burial, electrostatics, main-chain hydrogen bonding and excluded volume. Structures that are simultaneously consistent with both the local sequence structure biases and the non-local interactions are generated by minimizing the non-local interaction energy in the space defined by the local structure distributions using simulated annealing.

Rosetta has been tested in the biennial CASP protein structure prediction experiments, in which predictors are challenged to make blind predictions of the structures of sequences whose structures have been determined but not yet published. Since CASP3 in 1998, Rosetta has consistently been the top-performing method for ab initio prediction, as can be seen in the published reports of the independent assessors. For example, in the CASP4 experiment Rosetta was tested on 21 proteins whose structures had been determined but were not yet published. The predictions for these proteins, which lack detectable sequence similarity to any protein with a previously determined structure, were of unprecedented accuracy and consistency (Bonneau et al. 2002). Excellent predictions were also made in the CASP5 experiment (Bradley et al. 2003). Encouraged by these promising results, we generated models for all large protein families of fewer than 150 amino acids in length (Bonneau et al. 2002). For CASP6 (December 2004), we developed improved methods for beta-sheet protein prediction, and I was also delighted that many of the other top groups used the Rosetta software, which has been freely available (source code in addition to executable) for the past several years.

Since CASP4 I have been convinced that real progress in structure prediction (both de novo prediction and comparative modelling) would only come from progress in high-resolution refinement. While Rosetta predictions in CASP have been quite good on a relative scale, they have been poor on an absolute scale, with the topology roughly correct in favourable cases in at least one out of five submitted predictions but the high-resolution details for the most part completely wrong. Refinement of these rough models is critical for improving the accuracy of the models and, perhaps even more critically, for improving their reliability. The stability of proteins in large part derives from the close complementary packing of sidechains in the protein core, and hence evaluating the physical plausibility of a model requires modelling these interactions. Unfortunately, complementary sidechain packing is disrupted by changes in backbone conformation of the magnitude of the errors in typical Rosetta low-resolution models. Hence, a major focus of our work in the past 5 years has been to develop high-resolution all-atom refinement methods which can drive the rough de novo models towards the native structure and thus transform our predictions from educated low-resolution guesses to confident high-resolution models. While we have been able to make steady progress on both the sampling problem and the energy function, measurable progress on de novo prediction refinement was small until recently. However, the improved methods turned out to be very useful both for the design of Top7, described above, where they were critical in the backbone optimization step, and for the protein–protein docking method, described below, which utilizes the same energy function and much of the same optimization methodology.

A highlight of CASP6 for me was Target 281, the first de novo blind prediction which utilized our high-resolution refinement methodology to achieve close to high-resolution accuracy. As the sequence was relatively short (76 residues), during CASP we had time to apply our all-atom refinement methodology not only to the native sequence but also to the sequences of many homologues. The centre of the lowest energy cluster of structures turned out to be remarkably close to the native structure (1.5 Å). The high-resolution refinement protocol decreased the r.m.s.d. from 2.2 to 1.5 Å, and the sidechains pack in a somewhat native-like manner in the protein core. Since last summer, we have used this protocol on a number of other very small proteins and the results are very promising. There is still a huge amount to do on this very challenging problem, and improving refinement methods will continue to be a focus of our work for the next 5-year period. A very concrete problem of considerable practical importance is the closely related comparative modelling refinement problem: for proteins with sequence similarity to proteins of known structure, models can be built by essentially 'copying' the coordinates of the homologue, but most efforts to improve on this starting template structure have failed (we have had some success recently using evolutionary information to guide the sampling; Qian et al. 2004). Hence comparative models typically do not accurately represent the structural features that differ between the homologues, which is a serious shortcoming that impairs prediction of interaction specificity and other uses of the models. Thus, as we develop improved methods we will test them on both the de novo structure refinement problem and the comparative modelling problem. The goal is simple: to be able to produce sufficiently accurate models, either with or without a starting template structure, to allow structure-based biological insights without the need for tedious and expensive experimental structure determination; or, even more simply put, to solve the protein folding problem.

We have extended the Rosetta ab initio structure prediction strategy to the problem of generating models of proteins using limited experimental data. By incorporating chemical shift and nuclear Overhauser effect (NOE) information (Bowers et al. 2000), and more recently dipolar coupling information (Rohl & Baker 2002), into the Rosetta structure generation procedure, it has been possible to generate much more accurate models than with ab initio structure prediction alone or using the same limited data sets with conventional NMR structure generation methodology. An exciting recent development is that the Rosetta procedure can also take advantage of unassigned NMR data and hence circumvent the difficult and tedious step of assigning NMR spectra (Meiler et al. 2003).

The Rosetta ab initio structure prediction method, the Rosetta-based NMR structure determination method, and a new method for comparative modelling (Rohl & Baker 2003) that uses the Rosetta de novo modelling approach to model the parts of a structure (primarily long loops) that cannot be accurately modelled based on a homologous structure template have all been implemented in a public server called Robetta, which was one of the best all-around fully automated structure prediction servers in the CASP5 and CASP6 tests (Chivian et al. 2005) and has a constant backlog of users worldwide.

(d) Prediction of protein–protein interactions

As described above, we have been working for a number of years on protein structure refinement, which is challenging because of the very large number of degrees of freedom. I became interested in the protein–protein docking problem because, with the approximation that the two partners do not undergo significant conformational changes during docking, the space to be searched is much smaller—only the 6 rigid body degrees of freedom in addition to the sidechain degrees of freedom, and thus it seemed like a good stepping stone to the harder structure refinement problem while being important in its own right.

We developed a new method to predict protein–protein complexes from the coordinates of the unbound monomer components (Gray et al. 2003) that employs a low-resolution, rigid-body, Monte Carlo search followed by simultaneous optimization of backbone displacement and sidechain conformations with the Monte Carlo minimization procedure and physical model used in our high-resolution structure prediction work. The simultaneous optimization of sidechain and rigid body degrees of freedom contrasts with most other current approaches which model protein–protein docking as a rigid body shape matching problem with the sidechains kept fixed. We have recently improved the method (RosettaDock) further (Wang et al. 2005) by developing an algorithm which allows efficient sampling of off rotamer sidechain conformations during docking.

The power of RosettaDock was highlighted in the very recent blind CAPRI protein–protein docking challenge, which was held in December of 2004. In CAPRI, predictors are given the structures of two proteins known to form a complex, and challenged to predict the structure of the complex. RosettaDock predictions for targets without significant backbone conformational changes were quite striking, as shown in figure 3. Not only were the rigid body orientations of the two partners predicted nearly perfectly, but also almost all the interface sidechains were modelled very accurately. Importantly, these correct models clearly stood out as lower in energy than all other models we generated, which suggests the potential function is not too far off. These predictions were qualitatively better than predictions made using standard grid-based methods which keep protein sidechains fixed during docking.

CAPRI protein–protein docking results. (a) (i) Energy spectrum of models generated in global docking calculations carried out before the experimental structures were released; (ii) free-energy landscape mapped out by starting trajectories at the lowest-energy points sampled in the global docking runs. (b) Comparison of the predicted rigid-body orientation (blue) with the X-ray crystal structure (red and yellow). (c) Close-up of the interface showing that, in addition to the rigid-body orientation, the detailed conformations of the sidechains were correctly predicted. The predicted models are those submitted to the CAPRI organizers and are the lowest-energy models found in the global and local searches shown in (a).

These very promising results suggest that the method may soon be useful for generating models of biologically important complexes from the structures of the isolated components, and more generally suggest that high-resolution modelling of structures and interactions is within reach. A clear goal for our monomeric structure prediction work is to approach the level of accuracy of these models.


Ab initio prediction of peptide-MHC binding geometry for diverse class I MHC allotypes

Since determining the crystallographic structure of all peptide-MHC complexes is infeasible, an accurate prediction of the conformation is a critical computational problem. These models can be useful for determining binding energetics, predicting the structures of specific ternary complexes with T-cell receptors, and designing new molecules interacting with these complexes. The main difficulties are (1) adequate sampling of the large number of conformational degrees of freedom for the flexible peptide, (2) predicting subtle changes in the MHC interface geometry upon binding, and (3) building models for numerous MHC allotypes without known structures. Whereas previous studies have approached the sampling problem by dividing the conformational variables into different sets and predicting them separately, we have refined the Biased-Probability Monte Carlo docking protocol in internal coordinates to optimize a physical energy function for all peptide variables simultaneously. We also imitated the induced fit by docking into a more permissive smooth grid representation of the MHC followed by refinement and reranking using an all-atom MHC model. Our method was tested by a comparison of the results of cross-docking 14 peptides into HLA-A*0201 and 9 peptides into H-2Kb as well as docking peptides into homology models for five different HLA allotypes with a comprehensive set of experimental structures. The surprisingly accurate prediction (0.75 Å backbone RMSD) for cross-docking of a highly flexible decapeptide, dissimilar to the original bound peptide, as well as docking predictions using homology models for two allotypes with low average backbone RMSDs of less than 1.0 Å illustrate the method's effectiveness. Finally, energy terms calculated using the predicted structures were combined with supervised learning on a large data set to classify peptides as either HLA-A*0201 binders or nonbinders. In contrast with sequence-based prediction methods, this model was also able to predict the binding affinity for peptides to a different MHC allotype (H-2Kb), not used for training, with comparable prediction accuracy. Proteins 2006. © 2006 Wiley-Liss, Inc.

The Supplementary Materials referred to in this article can be found at http://www.interscience.wiley.com/jpages/0887-3585/suppmat/

Filename                  Size    Description
jws-prot.20831.dat1.dat   3 KB    A0201 binders
jws-prot.20831.dat2.dat   3 KB    A0201 non-binders
jws-prot.20831.dat3.dat   250 B   Kb binders
jws-prot.20831.dat4.dat   250 B   Kb non-binders



SUMMARY

We describe here an easy-to-use web server interface to the Rosetta FlexPepDock protocol for the high-resolution modeling of peptide–protein interactions. FlexPepDock has recently been used by us to successfully address several 'real world' modeling tasks (34–37) and we expect that increasing its usability through this web server will open the door for a wide range of new systems and applications.

We have recently extended the FlexPepDock protocol and introduced 'FlexPepDock ab initio', a powerful protocol for simultaneous de novo folding and docking of peptides at a known binding site that does not require an initial peptide backbone conformation. FlexPepDock ab initio performed well on a benchmark of peptide–protein interactions (38). This protocol is, however, computationally expensive and therefore not yet available on the web server. It can be downloaded as part of the next Rosetta release.


4 Conclusions

InterPep2 applies structural templates for docking peptide fragments, using a random forest regressor to score plausible interaction models. Because InterPep2 uses a residue-order-independent structural alignment to position the peptide, it is not limited to peptide–protein interaction templates but can use any protein–protein interaction surface as a template to model peptide–protein interaction complexes.

InterPep2-Refined achieves state-of-the-art performance on a large set of 251 bound peptide–protein complexes with peptides up to 25 residues long, placing the peptide within 4.0 Å LRMSD of its native conformation in 50 of the structures when considering the top 10 predictions, and with the highest precision across all recall levels; for example, at 50% recall its precision is 61.5%, compared to 47.8% for the second-best method. This performance is maintained when testing on a new set (PDB16-19) of 252 complexes from structures deposited after the complexes used in the construction of the InterPep2 training and template sets, for which 67 peptides were placed in the correct conformation.

On a frequently used dataset of 27 unbound-to-bound complexes, InterPep2-Refined performed second-best, successfully placing the peptide within 4.0 Å LRMSD in 15 of the 27 peptide conformations and modeling it with an fnat of at least 0.6 in 13 of the 27, without the use of templates with similar sequence to the target. More interesting, however, is that a method combining the template-based InterPep2-Refined with the ab initio method PIPER-FlexPepDock vastly outperformed both methods it was derived from, successfully generating models with the peptide within 4.0 Å LRMSD of its native position for 22 of the 27 complexes, with an fnat of at least 0.6 in 19 of the 27.



The Rosetta macromolecular modeling software is a versatile, rapidly developing set of tools that are now being routinely utilized to address state-of-the-art research challenges in academia and industrial research settings. The software is being co-developed by 44 laboratories from universities, government labs, and research centers in the United States, Europe, Asia, and Australia. The Rosetta software package is the result of a collaborative effort among these research institutions, building upon shared discoveries and free exchange of knowledge and software tools. Every institution with a participating laboratory is a member of an organization called RosettaCommons that facilitates code development and collaboration (http://www.rosettacommons.org). To enhance this collaborative development effort, RosettaCommons holds an annual conference in Leavenworth, WA, USA in the last week of July or the first week of August. Every two years, a Rosetta Conference (RosettaCon) special collection describing the results presented at the conference by participating RosettaCommons labs is published by the Public Library of Science (PLOS). As organizers of the 2014 Rosetta Conference, we are pleased to introduce the third RosettaCon 2014 Special Collection published by PLOS.

The applications of Rosetta software can be broadly divided into two themes: modeling or predicting the structures of natural biological polymers [1,2], and the design of novel biomacromolecules [3,4] using, in some cases, an expanded alphabet that includes non-natural sidechain and/or backbone functional groups [5,6]. These diverse applications, however, use the same underlying conceptual and software framework, consisting of generating various conformations of a molecule (sampling) and scoring these conformations to identify optimal atomic-resolution arrangements (energy function). A crucial early insight was that both scoring and sampling techniques should ideally be independent of the problem under consideration and trained on experimental data [7]. Examples of these datasets include the distributions of protein backbone conformations or side chain rotamers seen in the Protein Data Bank [1,8], or the measured changes in free energies upon mutation in protein cores [9]. In this framework, the successes and failures of each structural modeling or design exercise provide valuable feedback for improving the underlying methods to iteratively recapitulate a greater proportion of experimental results. Therefore, reproducibility, verification and generalizability of new Rosetta computational algorithms are crucial.

A recent report extrapolates that fully 50% of biological research is not reproducible [10]. The accessibility of new techniques to outside users can significantly impact reproducibility [11]. In principle, computational biology simulations should offer greater control over both accessibility and reproducibility than "wet" lab experiments, as the number of uncontrolled ingredients (reagents etc.) is lower. Yet in practice both reproducibility and accessibility can suffer. This is because academic labs often develop shortcuts and shorthand in the day-to-day practice of a newly developed technique, and often omit these details from their publications, which in turn may contribute negatively to reproducibility. Additionally, the structural and design complexity of multi-purpose software such as Rosetta is high (currently 2.7 million lines of code), and new software developments are usually made in academic laboratories by non-professional software developers who are focused on solving a specific scientific problem. For example, the use of specific data structures that assume molecular connectivity corresponding to canonical L-amino acids can frustrate the extension of a structure prediction algorithm to non-canonical side chains or backbone groups.

One idea for achieving reproducibility and accessibility was explored in the previous Rosetta collections: the Protocol Capture [12]. In a Protocol Capture, all individual steps in a newly developed protocol are listed as a step-by-step flowchart [13]. Input and expected output files, along with a reference to the code executable (or version number), are provided to the user. In this manner, the user can identify what was actually done in the simulation. This helps both scientific reproducibility (by reporting exactly what was done) and accessibility (by allowing non-specialists to reproduce the main findings of the work). However, the issues of laboratories using their own shorthand and assumptions, as well as insufficient attention to generalizability, remained. In this collection, we sought to address these issues by requiring an author from an external (but still RosettaCommons) laboratory to serve as a "tester". This follows the well-established practice in the software industry of separating testing and development. For the Rosetta community, this approach provides the additional benefit that the external "tester" author, while an expert in the general area, is sufficiently removed from the laboratory-specific jargon and project-specific scientific goals. Thus, the tester's perspective should increase the clarity of the description as well as the generalizability of the underlying code itself.

This year’s collection contains 12 papers published in PLOS One and PLOS Computational Biology. These papers characterize the diversity of modeling applications present in the Rosetta Macromolecular Code framework, including structure prediction, protein design, modeling of conformational states, and enzyme redesign. We have grouped the papers into four broad categories: structure prediction, membrane proteins, scientific benchmarks, and docking. Many of these categories are artificial, as some of the papers in the collection can fit into multiple categories. Nevertheless, they serve as a useful rubric for appreciating the depth and breadth of the Rosetta Macromolecular software package.

Protein Structure Prediction

The structural prediction of monomeric, soluble proteins is still an unsolved problem, notwithstanding notable recent advances. One important necessity in computational prediction protocols is reducing the high-dimensional search space during simulations. An increasingly successful approach is the incorporation of structural restraints derived from phylogeny or low-resolution experiments; both provide valuable but sparse and/or noisy information, and the challenge is to use these data productively. For example, Braun et al. demonstrate that evolutionary information on the protein fold can be discretized as residue-residue "contact maps", and that these can be combined with iterative sampling techniques for more accurate protein structure prediction [14]. In another example, Huber and colleagues show the integration of Rosetta with sparse EPR constraints to model conformational states in a model protein [15]. One technical issue that arises with the incorporation of multiple experimentally derived restraints is that individual sets can be incompatible with each other, requiring manual intervention from the coder. To address this problem, Porter et al. developed a computational framework that simplifies combined sampling strategies in Rosetta [16]. They then demonstrated this powerful framework on a range of modeling problems, including domain insertion and ab initio structure prediction with multiple sets of experimental restraints.

Membrane Proteins

The design and modeling of membrane proteins is an emerging research area. Gray and colleagues present an integrated framework for membrane protein modeling and design [17]. In this work they showed application of the modeling framework to predict free energy changes upon mutation, high-resolution structural refinement, protein-protein docking, and assembly of symmetric protein complexes.

Docking

A significant issue limiting the success of both protein-protein and protein-small molecule docking is the large size and ruggedness of the search space. To sample conformational space efficiently, several approximations are made in the Rosetta approach: a low-resolution Monte Carlo search, typically with a coarse-grained representation of the molecules and an approximate energy function, is performed first, followed by high-resolution Monte Carlo refinement at atomic resolution [18]. In spite of these approximations, sampling remains computationally inefficient. Furthermore, the energy functions used in the high-resolution step, while more accurate than those in the low-resolution step, are still built for speed over accuracy, and often suffer from incorrect modeling of interactions between polar groups, and of the protein with the solvent. More specifically, in the Rosetta high-resolution energy function, the balance of hydrogen bonding, electrostatics and desolvation forces is a known contributor to energy function inaccuracy [8,19]. It should be noted that the limitations in scoring and sampling are related: enhanced sampling exposes false-positive conformations, whereas more accurate scoring makes true-positive solutions easier to identify by locating the more optimal basins more efficiently. Several papers tackle the sampling and scoring issues in docking:

Zhang et al. show the application of replica exchange and other advanced sampling techniques to increase the efficiency of the Monte Carlo search during docking. Using a benchmark set of 20 protein-protein complexes, they identified an advanced sampling strategy that showed better performance with equivalent computational resources. A new sampling approach was used by DeLuca et al. [20] to improve the accuracy and decrease the computational cost of the RosettaLigand docking protocol used in the prediction of protein-small molecule interactions [21]. For protein-small molecule docking, the Karanicolas group report several significant improvements to a previously developed "ray casting" docking approach [22] used for the prediction of small molecules that disrupt protein-protein interactions [23]. Bazzoli et al. show that the use of two recent enhancements to the Rosetta energy function, explicitly including a Coulombic electrostatic term and using a modified form of the implicit solvation potential, can markedly improve the ability to identify small-molecule inhibitors of protein-protein interactions [24].

Protein Multispecificity Design

The design of multispecific proteins is important in applications ranging from structural vaccine design to bispecific antibody therapy and combinatorial biocatalysis. Many computational design strategies rely on genetic algorithms, which are slow and limit the search space. To address this problem, the Meiler group developed a new algorithm that can find multistate minima without relying on search-space-limiting techniques such as a fixed-backbone approximation [25].

Scientific Benchmarks

Many of the above protocols were developed by evaluating performance against a benchmark set. The development of accessible, standard benchmarks for different end uses has the potential to increase the speed of method development and aid reproducibility. For that reason, the Kortemme lab has developed a centralized web resource for standardized benchmark datasets (https://kortemmelab.ucsf.edu/benchmarks) [26]. This web resource includes analysis scripts, Rosetta command lines, and tutorials for each benchmark. There are three main sets of benchmarks in this resource: tests estimating the energetic effects of mutations, tests for structure prediction, and tests for protein design. As a further example of the utility of benchmark sets, Ollikainen et al. developed a benchmark to test different protein design protocols on the redesign of enzyme substrate specificity [27]. They then showed that a protocol coupling backbone with side-chain flexibility improves sequence recovery over a competing fixed-backbone approach.

Taken together, the articles in this collection highlight the utility of the Rosetta approach in tackling wide-ranging problems in biomolecular modeling and design using a common platform that allows the accessible and reproducible re-utilization of software. The common framework also provides an inherent feedback loop where new algorithms for sampling and scoring can be widely utilized and benchmarked for diverse scientific problems, in the process highlighting limitations of the approaches and areas where further developments are needed. We hope that through this collection readers will get a taste of the excitement and the unity in diversity that we enjoyed at RosettaCon 2014!


Protein Loop Modeling

Loop modeling is a complex and central element of protein structure prediction and design. There are two typical biological problems:

  • modeling loops into regions of low electron density in crystal structures
  • modeling loops into regions of low homology or with no secondary structure in homology models

A variety of tools exist for approaching these tasks. For an overview of loop modeling in Rosetta, please see this page.

Modeling Loops in Regions of Low Electron Density

For explicit refinement of crystallography data, see here.

loops from density is a script that takes poorly fit electron density data and a cutoff specifying how much of the pose you are willing to rebuild, and generates input "loops" files for loop modeling.

For modeling of missing loops on existent protein structures, you can use any of the methods in the section below.

Modeling Loops in Regions of Low Homology or with No Secondary Structure

What if I am building a homology model and there are regions with low homology or no predicted secondary structure? These are the typical problems solved by loop modeling algorithms. Most loop modeling algorithms in Rosetta are contained within a single executable and run by setting different flags. The fastest, but least accurate, method is cyclic coordinate descent (CCD). CCD closes a loop by iteratively solving for phi/psi angles that move the mobile terminus closer to the target anchor after fragment insertion. CCD is generally not recommended but can be used in specific cases (e.g. when time is a limiting factor). The currently (June 10th, 2015) accepted method of loop modeling is next-generation KIC (NGK). KIC sampling can also be combined with fragment insertion (KIC with fragments). There is also an alternative Monte Carlo method, stepwise loop modeling, which can be applied to proteins and RNA; unfortunately, it tends to be slow.
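As a sketch, an NGK run needs a loops file defining the segment to rebuild (columns: LOOP start end cutpoint skip-rate extend-flag), for example residues 23-30 with the cutpoint at 26, built from scratch:

    LOOP 23 30 26 0 1

and a command along roughly these lines (executable suffix, input names and nstruct are illustrative; add the NGK-specific flags from the documentation to enable the full next-generation sampling):

    loopmodel.linuxgccrelease -in:file:s model.pdb -loops:loop_file input.loops \
        -loops:remodel perturb_kic -loops:refine refine_kic -nstruct 50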

What if I am modeling a protein with a disordered region?

You probably should not be doing this with Rosetta, if at all. Disordered proteins are dynamic in the context of a cell, and it is unlikely that any static, in silico model of a disordered protein or protein region will be very accurate. Rosetta's scorefunctions are parameterized on crystallized proteins, not disordered ones. However, if you have a specific question, such as "can my disordered tail of 20 residues plausibly interact with this other region of my protein?", then you may begin to approach it with FloppyTail.
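A hypothetical FloppyTail invocation for that 20-residue-tail question (the flag names below follow the app's documented options, but treat them as assumptions and verify them against your Rosetta version):

    FloppyTail.linuxgccrelease -in:file:s protein.pdb \
        -flexible_start_resnum 181 -flexible_stop_resnum 200 -nstruct 100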


Protein Structure Prediction: Conventional and Deep Learning Perspectives

Protein structure prediction is a way to bridge the sequence-structure gap, one of the main challenges in computational biology and chemistry. Accurately predicting any protein's structure is of paramount importance for the scientific community, as these structures govern protein function. Moreover, it is one of the most complicated optimization problems computational biologists have ever faced. Experimental protein structure determination methods include X-ray crystallography, nuclear magnetic resonance spectroscopy and electron microscopy. All of these are tedious and time-consuming procedures that require expertise. To make the process less cumbersome, scientists use predictive tools as part of computational methods, drawing on data consolidated in the protein repositories. In recent years, machine learning approaches have raised the interest of the structure prediction community. Most machine learning approaches for protein structure prediction are centred on co-evolution based methods, whose accuracy depends on the number of homologous protein sequences available in the databases. The prediction problem becomes challenging for many proteins, especially those without enough sequence homologs. Deep learning methods allow the extraction of intricate features from protein sequence data without relying on hand-crafted intuitions. Accurately predicted protein structures are employed in drug discovery, antibody design, understanding protein–protein interactions, and interactions with other molecules. This article provides a review of conventional and deep learning approaches in protein structure prediction. We conclude by outlining a few publicly available datasets and deep learning architectures currently employed for protein structure prediction tasks.



CONCLUSION

We report recent advancements made to the online COFACTOR server for hybrid protein function annotations. In general, the biological function of a protein can be intricate and often contains multiple levels of categorizations. The COFACTOR server focuses on the three most widely-used and computationally amenable categories of function: GO, EC number and ligand-binding sites. Compared with the previous version of COFACTOR, which generated function annotations purely based on structural homology transfer, the updated server introduced several new pipelines built on sequence profile and PPI network information to enhance the accuracy and coverage of the structure-based function predictions. Accordingly, new sources of function templates, including sequence homologs and PPI partners, have been incorporated into the default function library (BioLiP) of the COFACTOR server. Our large-scale benchmark tests have shown that the new composite pipelines can generate function predictions with accuracy outperforming the former version of COFACTOR, as well as many state-of-the-art methods in the literature.

To facilitate the use and interpretation of the prediction results, a confidence scoring system has been introduced (as calibrated in Figure 2), which helps users quantitatively estimate the accuracy of the predictions. Meanwhile, new DAG (directed acyclic graph) visualizations combined with animation software are introduced to facilitate the viewing, analysis and manipulation of the prediction models. These developments and updates significantly enhance the accuracy and usability of an already widely applied structure-based function annotation service and will keep it a powerful tool, powered by new state-of-the-art algorithms, both for rapid annotation of uncharacterized proteins and for providing a starting point to understand and further characterize targets identified in high-throughput experimental studies.