Teknos


Making Sense of Big Molecular Data: Clustering Algorithms to Identify Near-Native Protein Decoys

Rohan Pandit
Computer Systems Laboratory
Thomas Jefferson High School for Science and Technology

Abstract

Modeling and simulation are now well-established tools to characterize biological molecules central to the inner workings of a cell. Typically, molecular simulation packages, whether based on Newtonian mechanics to accurately albeit inefficiently simulate molecular dynamics, or heuristic-based stochastic optimization to efficiently albeit less accurately model structures and motions, generate large volumes of molecular data. Many computational techniques have been developed over the years to simplify and reduce such data. Predominantly, studies have focused on dimensionality reduction techniques capable of extracting reaction coordinates that succinctly describe the essential dynamics hidden in the myriad of molecular data. In this project, we investigate using such techniques in a novel application setting, de novo protein structure prediction. The challenge here is to select which decoy/structure is the true native structure among thousands of possibilities. We demonstrate here that feature-based clustering of low-dimensional embeddings of data is superior to high-dimensional clustering, and we specifically evaluate a recently introduced non-linear dimensionality reduction technique: the locally scaled diffusion map. We show that the locally scaled diffusion map, combined with agglomerative clustering, is effective at addressing the open problem of decoy selection for the purpose of making blind predictions.

Introduction

Advances in processing speeds and capabilities gave rise to modeling and simulation in silico as a viable and effective approach to grow and improve our knowledge of biological molecules typically studied in the wet laboratory [1, 2, 3]. Over the past half century, great algorithmic advances have been made in characterizing the sequence-structure-function relationship in biomolecules central to the inner workings of a cell, such as peptides, proteins, RNA, and DNA [4, 5]. Nowadays, some of the most complex yet ubiquitous molecules in the cell, peptides and proteins, are routinely the subject of computational studies focused on unraveling the pervasive role of these molecules in the healthy and diseased cell.

Often, elucidating the molecular basis of disease rests on the ability of computational approaches to capture the intrinsic dynamics of these molecules [6, 7, 8]. Many simulation packages have been developed for this purpose over the years, such as CHARMM [9], AMBER [10], GROMOS [11], and others. By applying Newton’s laws of motion to molecular mechanics, these packages follow the movements of the hundreds of thousands of atoms constituting a peptide or protein molecule as the molecule figuratively tumbles down an energy landscape drawn to potential energy minima [12]. The process, based on repeated differentiation and integration over time, is time-consuming. Moreover, thousands or more end-point structures (positions for the atoms of the molecule) need to be collected in order to obtain a comprehensive picture of the intrinsic atomic motions and of the structural flexibility of a peptide or protein under investigation.

On the other hand, the focus on decoding genomes of species paid off but also exposed a growing number of protein sequences for which no structural or functional characterization exists [13]. With the time demands of molecular dynamics simulation packages making it impractical to apply these packages to every newly-decoded protein sequence, significant computational research was devoted to alternative approaches [14]. Stochastic optimization methods based on heuristics rather than on Newton’s laws of motion were and continue to be proposed to effectively sample many low-energy (end-point) structures of a protein or peptide without having to navigate the energy landscape. These methods are efficient albeit less accurate and are now the preferred tools to compute the structure of a protein sequence when no other template structures are available [15]. Such methods, which are said to address the problem of de novo structure prediction, are able to generate thousands or more structures of a protein given only knowledge of the protein’s amino-acid sequence [16, 17, 18, 19, 20, 21, 22, 23].

Whether the computational method employed to model different low-energy structures is based on simulating the structural dynamics or on heuristic-based stochastic optimization, large volumes of structural data are produced. A natural question arises at the end of application of such methods. Given N structures produced for a peptide or protein under investigation, with N being in the thousands and with one structure being an instantiation of Cartesian coordinates for hundreds or thousands of atoms, what is one supposed to do with the data? How should the data be interpreted? The interest is often in revealing the stable and semi-stable structural states of a flexible peptide or protein, and the transitions between such states. The thermodynamic hypothesis states that such states are associated with low rather than high energies [24]; they can be exposed if one can extract from the molecular simulation the collective, slow dynamics without being distracted by the fast atomic motions.

Significant computational research across different scientific communities (chemistry, biology, mathematics, and computer science) is predicated on the hypothesis that motions of proteins and peptides over a long time scale can be described in terms of a few collective coordinates; the latter are also often referred to as reaction coordinates [25]. This hypothesis has been validated on many molecular systems [26, 27, 28]. Essentially, studies have shown that even though the space of different structures of a protein or peptide may indeed be high-dimensional, for instance 3M dimensions for a protein of M atoms, the actual structures that comply with physics-based constraints, and are thus of relevance for biological function, occupy a much lower-dimensional space. Many of the coordinates may be dependent variables; for instance, bonded atoms will move together. Inspection of variable couplings is needed to identify the independent ones. In essence, the majority of the 3M-dimensional space is ambient space, and dimensionality reduction techniques need to be developed to separate the non-noise coordinates from the noise.

Extracting the non-noise subspace is the subject of much research on dimensionality reduction. One of the most intuitive techniques is Principal Component Analysis (PCA), which identifies new orthogonal axes that preserve variance in the data [29]. PCA, however, as we describe in section 2, assumes that the space is linear and cannot be applied to reduce non-linear spaces. Other methods have been developed to address non-linear dimensionality reduction, such as kernel PCA [30], Isomap [31, 32], DPES-ScIMap [33], and LLE [34]. More recently, a new set of methods based on the principle of diffusion has been proposed as more rigorous than Isomap and its variants at estimating the effective dimensionality of a molecular structure space. The locally-scaled diffusion map (LDFMap) [35], proposed as an improvement of the global diffusion map (GDFMap) method originally put forth in [36], has been demonstrated to estimate such dimensionality and reveal the major stable and semi-stable structural states in structure data computed via molecular dynamics simulation.

In our previous Siemens submission, we investigated LDFMap on conformational spaces produced via molecular dynamics simulation as well as those produced via de novo structure prediction techniques, and pitched it against PCA, finding that the LDFMap projections were qualitatively superior to those produced by PCA. We additionally reported the arbitrariness of results that could be obtained via GDFMap. For these reasons, in this submission we focus on collective variables that can be extracted from molecular structure data via linear techniques, such as PCA, and non-linear diffusion-based techniques, such as LDFMap.

Specifically, we investigate feature-based clustering in a novel setting, on protein conformations obtained by the Rosetta de novo structure prediction protocol [20], to address the challenging problem of decoy selection. Clustering for decoy selection typically uses pairwise least root-mean-square deviations (lRMSDs), which we use as a baseline to evaluate feature-based clustering. We pitch LDFMap and PCA against each other, investigating the quality of their clusters in comparison to the lRMSD baseline.

Our analysis here contains a qualitative and a quantitative component. Qualitatively, we look at low-dimensional embeddings of conformations obtained by the Rosetta protocol. These conformations are endpoints of independent minimizations, so they are local minima. We investigate any potential co-localization of native-like conformations in these embeddings. It is worth noting that our ensembles are large, about five times larger than recommended by the Rosetta developers. This decision is based on the fact that the Rosetta protocol is a stochastic minimization protocol and has no guarantee of locating the global minimum or the native basin. The quantitative component of our analysis investigates whether the collective variables can be used as features in clustering-based decoy selection. We investigate and compare a diverse array of clustering protocols via the publicly-available WEKA platform [37]. The comparison focuses on several measures of performance, including rank and a normalized mutual information score.

To the best of our knowledge, this is the first study to investigate feature-based clustering for the purpose of decoy selection and to do so in a comprehensive setting, considering diverse features and diverse clustering algorithms. We show that feature-based clustering creates clusters superior to those from lRMSD-based methods. We also demonstrate that the clusters generated via LDFMap features outperform those generated via PCA. In addition, we demonstrate that agglomerative clustering reliably achieves top performance across conformational ensembles of several diverse proteins.

Methodology

As summarized briefly in section 1, we investigate here feature-based clustering of physically realistic protein structures produced with de novo structure prediction protocols. There are two important avenues to consider. First, it is not exactly clear how to summarize a three-dimensional protein structure; that is, what features and characteristics to extract from it so that a clustering algorithm can group near-native structures together and apart from non-native ones. We therefore first describe in this section three different sets of features. The second avenue we consider is clustering. There are by now many different clustering algorithms, as clustering itself is an ill-defined task. We consider an array of diverse clustering algorithms with implementations available in the WEKA software platform. These algorithms are briefly summarized after our discussion of features.

Extracting Features from Three-Dimensional Protein Structure

It is worth noting that significant research in computational molecular biology and biophysics is devoted to extracting features from protein structures. These go by different names, such as reaction coordinates, collective variables, and more. Often, work focused on a specific protein system uses intimate knowledge about that system to construct features that allow separating structures participating in an event of choice and interest from those that do not.

In this project, we do not use information or knowledge specific to a protein system. Instead, the goal is to devise a decoy selection methodology that is general and works across the board. For this reason, we investigate three types of features that can be extracted from a three-dimensional protein structure. The first set of features is simply the Cartesian coordinates of the heavy (non-hydrogen) atoms. This is typically the information used by most decoy selection algorithms. A Euclidean-based distance function is used to measure the dissimilarity between two structures represented as vectors of x, y, z coordinates of heavy atoms.

The distance function is used to determine which structures are more similar and should be grouped together in a cluster and which ones should be placed in different clusters. The distance function typically used is the least root-mean-square deviation (lRMSD), which first finds the optimal superposition of two structures that removes differences due to rigid-body motions (translation and rotation in three-dimensional space). The ’least’ attribute refers to this process. After such a superposition, the squared deviations between corresponding heavy atoms are summed, the sum is divided by the number of atoms, and the square root of the result is the lRMSD between the two structures. Computing lRMSD is an expensive process due to the superposition that must be performed before the deviation is calculated.
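
To make this concrete, the following is a minimal sketch of the lRMSD computation, assuming two conformations are given as N x 3 numpy arrays of heavy-atom coordinates. The superposition is computed with the Kabsch algorithm, one standard way to obtain the optimal rigid-body alignment; this is an illustrative implementation, not the one used in the study.

```python
import numpy as np

def lrmsd(P, Q):
    """Least RMSD between two conformations P and Q (each an N x 3 array)."""
    # Remove translation by centering both conformations at the origin.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch algorithm: find the rotation that optimally superimposes P onto Q.
    H = P.T @ Q
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T
    # Sum squared deviations over atoms, divide by N, and take the square root.
    return np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1)))
```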

In contrast to lRMSD and coordinates readily extracted as features from a protein structure, we consider here two additional ways to extract features. There is valuable information about a protein’s structure space that is encoded in the multitude of protein structures returned from a de novo structure prediction algorithm. Collective coordinates can be extracted that tell us about the shape and dimensionality of the space. So, two different ways are considered here to extract collective coordinates. First, the PCA method is used to extract Principal Components (PCs) as features. PCA is a linear dimensionality reduction method that assumes the structure space is linear. PCA is effective when the number of PCs that cumulatively capture most of the variance in the original data is low; in such a case, the PCs can be considered effective new dimensions or basis vectors, and the projections of the protein structures onto the PCs offer new coordinates in the new basis and can thus be used as features for clustering.

PCA

PCA is a dimensionality reduction technique that projects possibly-correlated data onto a new orthogonal basis of principal components. Due to its popularity, we will omit the details of the implementation, which can be found in [38] and our previous Siemens submission. From a procedural point of view, we run PCA on the heavy-atom coordinates extracted from protein conformations obtained via the Rosetta protocol. These structures are projected onto the top 3000 eigenvectors/PCs (ranked by eigenvalues sorted in descending order). Effectively, each protein structure is represented as a vector of 3000 features.
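
A minimal sketch of this step follows, assuming `X` is the (n_structures x 3M) matrix of flattened heavy-atom coordinates; it is an illustration rather than the MPI/Fortran implementation used in the study, and the number of retained components is a parameter.

```python
import numpy as np

def pca_features(X, n_components=3000):
    # Center the data; PCA operates on mean-subtracted coordinates.
    Xc = X - X.mean(axis=0)
    # Thin SVD of the centered data; the right singular vectors are the PCs,
    # ordered by decreasing singular value (and hence variance captured).
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Project each structure onto the top principal components.
    return Xc @ Vt[:n_components].T
```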

PCA assumes the space is linear, and this assumption may not always be valid. Hence, we consider collective coordinates that are able to capture the non-linearity of the structure space. There are many non-linear dimensionality reduction methods one can consider, such as kernel-based PCA, Isomap, and locally-linear embedding (LLE), but here we consider diffusion-based methods. The reason we do so is that recent work has shown these methods to be more powerful than the alternatives. As summarized in section 1, we consider global and local versions of such methods, which we have implemented as described in [39]. Our analysis, as presented in our previous Siemens report, shows that LDFMap is indeed more accurate than GDFMap. Hence, we only employ the collective coordinates extracted through LDFMap as features. For the sake of clarity, we summarize both GDFMap and LDFMap and how features can be extracted through them for each protein structure in an ensemble of structures obtained via de novo structure prediction algorithms.

GDFMap

Diffusion map methods differ from PCA in that they can capture nonlinearity in data. In the global diffusion map method, first introduced by [40], points are represented by their pairwise lRMSDs from one another rather than by Cartesian coordinates. After this similarity matrix is constructed, a matrix Kε is created, in which Kε[i, j] = exp(−||xi − xj||² / (2ε²)). This matrix gives a notion of transition probability between any two structures xi and xj: if the distance ||xi − xj|| between the two structures is large, their transition probability is lower. Furthermore, the 2ε² term provides a notion of local scale, which adjusts the impact of the distance penalty. The ε parameter can be interpreted as the distance around each point xi that is considered locally flat and therefore linearly reducible. As described in [40], a renormalization of Kε yields a Markov transition matrix that represents the true transition probability between any two structures. An eigendecomposition of this transition matrix yields eigenvalues and eigenvectors that form an alternative reduction of the data.
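
The construction can be sketched as follows, assuming `D` is the precomputed n x n matrix of pairwise lRMSDs and `eps` is the global scale parameter ε. This is a simplified illustration (row normalization is one way to obtain a Markov matrix), not the authors' implementation.

```python
import numpy as np

def global_diffusion_map(D, eps, n_components=10):
    # Gaussian kernel: K[i, j] = exp(-||xi - xj||^2 / (2 * eps^2)).
    K = np.exp(-(D ** 2) / (2.0 * eps ** 2))
    # Renormalize rows so that each row sums to one (a Markov transition matrix).
    P = K / K.sum(axis=1, keepdims=True)
    # Eigendecomposition; the leading eigenvectors serve as diffusion coordinates.
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)[:n_components]
    return evals.real[order], evecs.real[:, order]
```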

As described above, GDFMap is able to capture nonlinearity in data because it limits its assumption of linearity to a ball of radius ε around each point. The question arises: how does one choose this value, and why should it be constant for all points? The work in [39] introduces a method of determining the parameter ε for each data point, as opposed to arbitrarily picking a single value for all data points. It does so by attempting increasingly large ε-radius balls, conducting multidimensional scaling (MDS), determining where a natural cutoff is obtained between noise and non-noise eigenvalues, and then assigning ε the value at which the noise eigenvalues clearly separate from the non-noise ones [39]. In this participant’s last Siemens paper, we showed that in GDFMap, different values of ε lead to vastly different results; GDFMap is therefore not a reliable technique, and LDFMap should be used instead. After the ε parameter is determined, LDFMap proceeds as GDFMap does. The final eigendecomposition of the Markov matrix is too computationally expensive for our large, 50,000-decoy datasets. Fortunately, not all eigenvectors are needed; our study from last year found that the first 6% of the eigenvectors from the LDFMap captured, on average, 90% of the accumulated variance of the original dataset. We used ARPACK, a Fortran library for large-scale eigenvalue problems [41], to extract just the first 3000 eigenvalues and eigenvectors rather than conduct a full eigendecomposition. This results in massive speedups but a loss of information on how much accumulated variance is actually captured.
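
The partial eigendecomposition step can be sketched as below; SciPy's `eigs` routine wraps ARPACK, so only the leading k eigenpairs are computed instead of the full spectrum. Here `P` (the Markov matrix) and k = 3000 are assumptions standing in for the quantities described above.

```python
import numpy as np
from scipy.sparse.linalg import eigs

def top_diffusion_coordinates(P, k=3000):
    # Request the k eigenpairs of largest magnitude; ARPACK works iteratively,
    # avoiding the cost of a full dense eigendecomposition of P.
    evals, evecs = eigs(P, k=k, which='LM')
    order = np.argsort(-evals.real)
    # Each structure's projections on the top eigenvectors become its features.
    return evecs.real[:, order]
```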

PCA and LDFMap provide us with projections of the protein structures on each PC or diffusion coordinate (DC) eigenvector. These projections are features, and thus each structure is now reduced to a feature vector.

Clustering Algorithms

The clustering algorithms summarized below use Euclidean distance to measure the distance between two feature vectors when fed feature vectors obtained via PCA or LDFMap. When these algorithms are fed the raw heavy-atom coordinates used as baseline features for comparison, the distance function used is lRMSD.

K-Means

K-Means is a simple, widely used clustering algorithm that separates a data set into K groups of roughly equal variance. Its optimization objective is to minimize the within-cluster sum of squared distances. Computationally, the problem is NP-hard, but it can be vastly sped up by heuristics. After a random initialization of K cluster centroids, the algorithm consists of two steps. First, each point is assigned to its nearest centroid based on Euclidean distance. Then, the K centroids are recalculated as the mean of the points assigned to each respective cluster. These two steps are repeated until the clusters cease to change. In this project, we use the mini-batch version of K-means [42] available in WEKA, as it offers drastic speedups on large datasets through subsampling while producing results with little difference in quality compared to the traditional algorithm [43]. While K-means is simple to implement and scales easily with the mini-batch modification, the algorithm requires a priori knowledge of the number of clusters, K. We set this value to 10 in our analysis.
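
As an illustration of the mini-batch variant, the sketch below uses scikit-learn's MiniBatchKMeans as a stand-in for the WEKA implementation used in the study; `features` (the matrix of PC or DC projections) and `batch_size` are assumptions.

```python
from sklearn.cluster import MiniBatchKMeans

def kmeans_clusters(features, k=10, batch_size=1000):
    # Each iteration works on a random mini-batch of the data, giving large
    # speedups over standard k-means with little loss in cluster quality.
    model = MiniBatchKMeans(n_clusters=k, batch_size=batch_size, random_state=0)
    return model.fit_predict(features)
```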

Affinity Propagation

First introduced by [44] in 2007, affinity propagation addresses many of the shortcomings of K-means, namely that it chooses the number of clusters based on the data provided. Affinity propagation is also immune to poor random initialization, as it clusters a similarity matrix through message passing [45].

The algorithm begins by assuming all points are cluster centers, or exemplars, and creates two matrices: a “responsibility” matrix R and an “availability” matrix A. Each element r(i,k) quantifies the evidence that sample k should be the exemplar for sample i, while each element a(i,k) quantifies the evidence that sample i should belong to exemplar k, taking into account the support that all other samples lend to k as an exemplar. r(i,k) is initially set to s(i,k), the similarity between samples i and k. In later iterations, as other candidate exemplars become more available to sample i, the responsibility r(i,k) is effectively decreased and k becomes less likely to be chosen as i’s exemplar.

More formally, the responsibility matrix is computed via the following rule: r(i,k) ← s(i,k) − max over k′ ≠ k of [a(i,k′) + s(i,k′)].

s(k,k) may be set to an optional “preference” parameter, representing the likelihood that a given sample should be chosen as an exemplar. There was no principled way to determine a preference in our study, so this field was left at 0. The availability a(i,k) of a particular sample i is set to exemplar k’s self-responsibility r(k,k), which quantifies whether k should remain an exemplar or would be a better fit belonging to another exemplar, plus the sum of the positive responsibilities from all other samples i′ to sample k, which quantifies how much support k receives from other samples: a(i,k) ← min{0, r(k,k) + Σ over i′ ∉ {i,k} of max(0, r(i′,k))}.

Only positive responsibilities are summed because we care about how well an exemplar explains its own data points, and not how poorly it explains samples that end up belonging to other exemplars. The entire sum itself is not allowed to go above zero, so as to prevent a snowball effect that would lead to drastically uneven cluster sizes. As the responsibility and availability matrices are repeatedly updated, a set of “true” exemplars emerges, computed in the following manner: given a point i, consider the k that maximizes a(i,k) + r(i,k). Point i belongs to exemplar k, and if i = k, then i is itself an exemplar. The algorithm terminates when this set remains unchanged for a set number of iterations, which we set to 15 in our analysis. Affinity propagation scales quadratically with the number of samples when the similarity matrix is not precomputed, requiring us to select the 20,000 lowest-energy decoys for clustering.
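
A minimal sketch of this procedure follows, using scikit-learn's AffinityPropagation as a stand-in for the implementation used here; `S` is assumed to be a precomputed similarity matrix (e.g., negative squared distances) over the 20,000 lowest-energy decoys, with the preference and convergence settings from the text.

```python
from sklearn.cluster import AffinityPropagation

def affinity_clusters(S):
    model = AffinityPropagation(
        affinity='precomputed',   # use the supplied similarity matrix directly
        preference=0,             # per the text, no prior preference for exemplars
        convergence_iter=15,      # stop after 15 iterations without change
        random_state=0)
    return model.fit_predict(S)
```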

Agglomerative Clustering

Agglomerative clustering is a relatively simple hierarchical clustering algorithm. The algorithm begins much like affinity propagation, by assuming each sample is its own cluster, then iteratively merges clusters based on a distance measure to compare individual samples and a linkage criterion to compare different clusters. We use lRMSD as the distance measure, and an average linkage criterion to measure the similarity between two clusters, defined as the average pairwise distance between samples in the two clusters under comparison. The closest clusters by these criteria are successively merged until the desired number of clusters, which we choose as 10, is attained. Agglomerative clustering, whose implementation is described further in [46], has a time complexity of O(N³), where N is the number of samples in consideration. This required us to subsample to the 15,000 lowest-energy decoys before clustering.
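
For illustration, the sketch below performs average-linkage agglomerative clustering with SciPy's hierarchical clustering routines, a stand-in for the implementation used in the study; `D` is assumed to be the precomputed pairwise lRMSD matrix over the retained decoys.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def agglomerative_clusters(D, n_clusters=10):
    # Convert the square distance matrix to condensed form, build the
    # average-linkage tree, and cut it into the desired number of clusters.
    condensed = squareform(D, checks=False)
    tree = linkage(condensed, method='average')
    return fcluster(tree, t=n_clusters, criterion='maxclust')
```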

Other clustering algorithms, including density-based spatial clustering of applications with noise (DBSCAN) and mean shift, were considered but not used. DBSCAN was unable to find more than a single cluster; this is likely due to the fact that DBSCAN is intended for data where clusters have similar density [47]. Mean shift was not used because it is not scalable to high dimensions [48]. Other clustering algorithms, such as SPICKER [49], were considered as well. SPICKER is an algorithm based on pairwise lRMSD calculations. As a result, it carries a heavy memory and computational cost, and could not be applied to more than 10,000 conformations.

Evaluation

Once the various clustering algorithms have been run on each of the feature vectors, a natural question arises: how does one evaluate the quality of the clusters produced? The proteins studied in our analysis have known native structures, so we can calculate the lRMSD of each decoy from the native structure for use in our metrics. We define a near-native decoy as one in the lowest 25% of lRMSDs from the native, a medium-distance decoy as one between 25% and 75%, and a far decoy as one above 75%. With these definitions of classes, one simplistic clustering evaluation we could use is purity. Purity, as defined in [50], rewards intra-cluster similarity and punishes inter-cluster similarity.

Purity is computed as purity(Ω, C) = (1/N) Σ over k of max over j of |ωk ∩ cj|, where Ω is the set of clusters {ω1, ω2, ..., ωk}, C is the set of classes {c1, c2, ..., cj}, and N is the number of samples. Purity by itself is a flawed metric, however, as purity tends to increase as the number of clusters increases and reaches a perfect score when the number of clusters is equal to the number of samples, which is evidently not a good clustering. We address this problem by using a more comprehensive metric, normalized mutual information (NMI), discussed below [50].

NMI

NMI is defined as NMI(Ω, C) = I(Ω; C) / [(H(Ω) + H(C)) / 2]. Here I(Ω; C), the mutual information, captures the amount of knowledge about the classes we are able to extract from the clusters alone, while the entropy terms H(Ω) and H(C) in the denominator normalize the score and penalize clusterings with an excessively large number of clusters.
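
Both metrics can be computed directly from the cluster and class label arrays; the sketch below is a minimal illustration, with scikit-learn's normalized_mutual_info_score standing in for the NMI definition above, and `clusters` and `classes` assumed to be integer label arrays over the same decoys.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(clusters, classes):
    # For each cluster, count its most common class; sum and normalize by N.
    total = 0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        total += np.bincount(members).max()
    return total / len(classes)

def nmi(clusters, classes):
    # Mutual information between cluster and class assignments, normalized
    # by the average of the two entropies so that the score lies in [0, 1].
    return normalized_mutual_info_score(classes, clusters)
```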

Ranks

We use two other measures of performance in addition to NMI: the rank of the cluster containing the decoy with the lowest lRMSD to the native, where clusters are sorted by population in descending order; and the rank, in this same ordering, of the cluster with the most near-native decoys, as defined above. These measures, while not as sophisticated as NMI, are more relevant to our interests; our ultimate goal is to blindly locate near-native clusters, and we do not necessarily care about the organization of structures far from the native. Ideally, the two ranks should be low and close to one another, as this would describe a clustering algorithm that reliably produces clusters such that the most populous cluster contains the closest-to-native decoy. Rank can easily be fooled, however. For example, an algorithm that produced a single populated cluster would receive ranks of 1 and 1, but would clearly not be a successful algorithm.
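
The two rank measures can likewise be computed from the cluster labels and the per-decoy lRMSDs to the native; the sketch below assumes these arrays and a near-native cutoff (the 25th lRMSD percentile) as inputs, and is illustrative only.

```python
import numpy as np

def rank_measures(clusters, lrmsd_to_native, near_native_cutoff):
    ids, counts = np.unique(clusters, return_counts=True)
    # Clusters sorted by population, most populous first.
    by_population = ids[np.argsort(-counts)]
    # Rank of the cluster containing the decoy closest to the native.
    best_cluster = clusters[np.argmin(lrmsd_to_native)]
    rank_closest = int(np.where(by_population == best_cluster)[0][0]) + 1
    # Rank of the cluster containing the most near-native decoys.
    near = lrmsd_to_native <= near_native_cutoff
    near_counts = [np.sum(near[clusters == c]) for c in by_population]
    rank_near = int(np.argmax(near_counts)) + 1
    return rank_closest, rank_near
```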

Results

Figure 1. Results presented here are the structural data for 1HHP, the aspartyl protease from HIV-1 isolate BRU, projected on the top two diffusion coordinates. The amount of variance captured by the coordinates is unknown due to the use of ARPACK. Points are color coded based on their lRMSD from the native, marked by the red ‘x’.

Figure 2. Results presented here are the structural data for 1HHP, projected on the top two principal components, which capture approximately 36% of the accumulated variance.

Our first application of PCA and LDFMap was in reducing the data provided by de novo structure prediction methods. Figure 1 shows decoys of 1HHP projected onto the first two diffusion coordinates. It is unknown how much accumulated variance these two capture, as ARPACK returned only the first 3000 eigenvalues and eigenvectors rather than the full spectrum. The decoys are color coded by their lRMSD from the native, marked as a red ‘x’. As depicted in Figure 1, there is a clear gradient from structures with a large lRMSD from the native to structures with a small lRMSD, and the native structure itself is located close to these structures in the reduced space. Figure 2, depicting the same data reduced via PCA and projected onto the first two principal components, shows the same properties. This is to be expected and confirms that the reduction techniques were applied correctly.

Figure 3. Results presented here are the cluster sizes of k-means applied to 1HHP projected onto the top 3000 diffusion coordinates. The cluster highlighted in red contains the native structure.

Figure 4. Here the decoys of 1HHP from the previous figure are broken down into three classes: close, medium, and far lRMSD from the native structure.

Now that collective features have been identified, we can apply clustering. Figure 3 shows the cluster counts resulting from k-means clustering on the diffusion coordinate embedding of 1HHP. We can see that the cluster containing the native structure, colored red, is the 9th most populous, a poor result, but one typical of k-means. Figure 4 breaks down the plot further into “close”, “medium”, and “far” sets, where close contains the 25% of structures with the lowest lRMSD, medium contains the middle 50%, and far contains the largest 25%. Figure 4 shows that clusters are largely homogeneous in that they are either primarily medium/far distance or primarily close/medium distance, which is a good result. The rank of the cluster with the most near-native structures, however, is two, which is not an ideal result.

A similar analysis is applied to all 9 proteins for each of the 3 clustering algorithms and each of the 3 coordinate types; the results are summarized by the normalized mutual information and rank values reported in Tables 1 and 2.

Table 1. The results presented here summarize the quality of clusters obtained via the three clustering algorithms (columns) on various protein conformations generated via de novo structure prediction techniques. The clusters generated from the principal components, the diffusion coordinates, and the lRMSDs are highlighted in distinct colors. Ideally, a low rank and a high NMI are desired.

Table 2. The results presented are identical to Table 1, except that rather than using NMI as a metric for the quality of a clustering, the rank of the cluster with the native structure and the rank of the cluster with the most near-native structures are provided instead. For example, “6, 3 of 10” means the native structure is in the 6th most populous cluster and the cluster with the most near-native structures, defined as structures within the lowest 25% of lRMSDs, is the 3rd most populous cluster out of 10 total clusters.

First, we analyze the clustering algorithms in general. While k-means obtains the highest NMIs for all three coordinate types, it consistently obtains poor ranks and is therefore not a good clustering technique for this application. While affinity propagation sometimes resulted in very high quality clusters both in terms of rank and NMI, such as for 1PGB and 1SAP, its effectiveness varied greatly from protein to protein and, on average, it performed worse than agglomerative clustering in both rank and NMI. Agglomerative clustering performed relatively reliably, obtaining median ranks slightly better than affinity propagation and an average NMI somewhat worse than k-means. Given the poor ranks of k-means and the instability of affinity propagation, however, agglomerative clustering is clearly the leading technique for our application. Moreover, it is interesting to note that for proteins such as 1ALY and 1BQ9, poor results were obtained regardless of the clustering algorithm or coordinates used.

Now that it has been established that k-means and affinity propagation are poor clustering choices, we can focus our attention on the more important evaluation of the different coordinate types. While the lRMSDs attained good ranks under agglomerative clustering, they obtained the poorest NMIs by far. This is because they tended to yield a single populated cluster, with the rest of the clusters sparsely populated, as described in section 2.3.2. The PCs obtain good ranks but poor NMIs, likely due to the formation of too few clusters, which leads to polluted clusters. Finally, the DCs achieve very high NMIs as well as good median ranks. Agglomerative clustering on the DCs is the only combination to obtain both good rank and good NMI simultaneously, so the DCs are likely the best coordinates to use.

We can test this hypothesis more rigorously using a two-sample t-test on the NMI scores for agglomerative clustering. The test indicates that the DCs are superior to the lRMSDs with a p-value of just 0.0247. An additional test finds that the DCs are superior to the PCs with a p-value of 0.072. It is important to note, however, that these statistics leave out the rank observations, which offer further evidence that the DCs are superior to both the PCs and the lRMSDs.
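
The comparison can be reproduced with a standard two-sample t-test, for example via SciPy; the sketch below assumes arrays of per-protein NMI scores for two coordinate types and is illustrative rather than the exact procedure used in the study.

```python
from scipy.stats import ttest_ind

def compare_nmi(nmi_a, nmi_b):
    """One-sided two-sample t-test: is coordinate type A's mean NMI higher than B's?"""
    t, p = ttest_ind(nmi_a, nmi_b, alternative='greater')
    return t, p
```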

Discussion

The analysis of a diverse array of clustering algorithms and coordinate representations yields several insights into the effectiveness of these techniques. It provides strong evidence that LDFMap, with its ability to capture the non-linearity in data, provides a better embedding from which to cluster than PCA. Moreover, it shows that agglomerative clustering holds the most promise as a decoy-selection tool and that k-means and affinity propagation are unreliable for this purpose. In general, it provides strong evidence that feature-based clustering is superior to the traditionally used lRMSDs. It is important to note that the detailed analysis presented in this paper is, to the best of our knowledge, the first attempt to evaluate the effectiveness of various feature-based clustering algorithms on molecular data obtained from de novo structure prediction protocols. The results presented in this paper open an intriguing direction of research in the open problem of blind decoy selection and are worthy of further investigation.

Conclusions

The results presented in this paper provide strong evidence for further research into the use of diffusion map based methods for decoy selection. While we were able to reduce the entire data set under both the LDFMap and PCA reductions by utilizing MPI in Fortran, we were forced to apply a priori filtering of the reduced data to lower the computational time of the affinity propagation and agglomerative clustering algorithms, which were written in single-threaded Cython. Future studies could further use parallel and distributed computing to conduct clustering on complete decoy sets, which may yield higher-quality clusters. Further study could also analyze a larger set of proteins to confirm the results determined here.


References

[1] W. F. van Gunsteren and H. J. C. Berendsen. “Computer simulation of molecular dynamics: methodology, applications and perspectives in chemistry”. In: Angew. Chem. Int. 29.9 (1990), pp. 992–1023.

[2] O. M. Becker and M. Karplus. “The topology of multidimensional potential energy surfaces: Theory and application to peptide structure and kinetics”. In: J. Chem. Phys. 106.4 (1997), pp. 1495–1517.

[3] M. Karplus and J. A. McCammon. “Molecular dynamics simulations of biomolecules”. In: Nat. Struct. Biol. 9.9 (2002), pp. 646–652.

[4] M. Karplus and J. Kuriyan. “Molecular Dynamics and protein function”. In: Proc. Natl. Acad. Sci. USA 102.19 (2005), pp. 6679–6685.

[5] W. F. van Gunsteren et al. “Biomolecular modeling: Goals, problems, perspectives”. In: Angew. Chem. Int. Ed. Engl. 45.25 (2006), pp. 4064–4092.

[6] D. C. Chatfield, A. Szabo, and B. R. Brooks. “Molecular dynamics of staphylococcal nuclease: comparison of simulation with 15N and 13C NMR relaxation data”. In: J. Am. Chem. Soc. 120.21 (1998), pp. 5301–5311.

[7] A. Bhinge et al. “Accurate detection of protein:ligand binding sites using molecular dynamics simulations”. In: Structure 12.11 (2004), pp. 1989–1999.

[8] Rudy Clausen and Amarda Shehu. “A multiscale hybrid evolutionary algorithm to obtain sample-based representations of multi-basin protein energy landscapes”. In: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM. 2014, pp. 269–278.

[9] B. R. Brooks et al. “CHARMM: a program for macromolecular energy, minimization, and dynamics calculations”. In: J. Comput. Chem. 4.2 (1983), pp. 187–217.

[10] D.A. Case et al. AMBER 14. University of California, 2014.

[11] Walter R. P. Scott et al. “The GROMOS Biomolecular Simulation Program”. In: J. Phys. Chem. A 103 (1999), pp. 3596–3607.

[12] T. Hansson, C. Oostenbrink, and W. F. van Gunsteren. “Molecular dynamics simulations”. In: Curr. Opinion Struct. Biol. 12.2 (2002), pp. 190–196.

[13] M. Levitt. “Growth of novel protein structural data”. In: Proc. Natl. Acad. Sci. USA 104.9 (2007), pp. 3183–3188.

[14] Z. Li and H. A. Scheraga. “Monte Carlo-minimization approach to the multiple-minima problem in protein folding”. In: Proc. Natl. Acad. Sci. USA 84.19 (1987), pp. 6611–6615.

[15] A. Shehu. “Conformational Search for the Protein Native State”. In: Protein Structure Prediction: Methods and Algorithms. Ed. by H. Rangwala and G. Karypis. Fairfax, VA: Wiley Book Series on Bioinformatics, 2010. Chap. 21.

[16] C. A. Rohl et al. “Protein structure prediction using Rosetta”. In: Methods Enzymol. 383 (2004), pp. 66–93.

[17] T. J. Brunette and O. Brock. “Guiding conformation space search with an all-atom energy potential”. In: Proteins: Struct. Funct. Bioinf. 73.4 (2009), pp. 958–972.

[18] A. Shmygelska and M. Levitt. “Generalized ensemble methods for de novo structure prediction”. In: Proc. Natl. Acad. Sci. USA 106.5 (2009), pp. 94305–95126.

[19] J. DeBartolo et al. “Protein structure prediction enhanced with evolutionary diversity: SPEED”. In: Protein Sci. 19.3 (2010), pp. 520–534.

[20] A. Leaver-Fay et al. “ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules”. In: Methods Enzymol 487 (2011), pp. 545–574.

[21] D. Xu and Y. Zhang. “Ab initio protein structure assembly using continuous structure frag- ments and optimized knowledge-based force field”. In: Proteins: Struct. Funct. Bioinf. 80.7 (2012), pp. 1715–1735.

[22] Brian Olson and Amarda Shehu. “Populating local minima in the protein conformational space”. In: Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on. IEEE. 2011, pp. 114–117.

[23] Brian Olson and Amarda Shehu. “Multi-objective optimization techniques for conformational sampling in template-free protein structure prediction”. In: Intl Conf on Bioinf and Comp Biol (BICoB), Las Vegas, NV. 2014.

[24] C. B. Anfinsen. “Principles that govern the folding of protein chains”. In: Science 181.4096 (1973), pp. 223–230.

[25] C. Clementi. “Coarse-grained models of protein folding: Toy-models or predictive tools?” In: Curr. Opinion Struct. Biol. 18 (2008), pp. 10–15.

[26] B. Qi et al. “Extracting physically intuitive reaction coordinates from transition networks of a β-sheet miniprotein”. In: J. Phys. Chem. 114 (2010), pp. 6979–6989.

[27] R. B. Best and G. Hummer. “Coordinate-dependent diffusion in protein folding”. In: Proc. Natl. Acad. Sci. USA 107 (2010), pp. 1088–1093.

[28] A. Berezhkovskii and A. Szabo. “One-dimensional reaction coordinates for diffusive activated rate processes in many dimensions”. In: J. Chem. Phys. 122 (2005), p. 014503.

[29] J. Shlens. A tutorial on Principal Component Analysis. University of California, San Diego, 2003.

[30] K.-R. Müller et al. “An Introduction to Kernel-Based Learning Algorithms”. In: IEEE Trans Neural Networks 12.2 (2001), pp. 181–201.

[31] J. B. Tenenbaum, V. de Silva, and J. C. Langford. “A Global Geometric Framework for Nonlinear Dimensionality Reduction”. In: Science 290 (2000), pp. 2319–2323.

[32] H. Stamati, C. Clementi, and L. E. Kavraki. “Application of nonlinear dimensionality reduction to characterize the conformational landscape”. In: Proteins: Struct. Funct. Bioinf. 78.2 (2010), pp. 223–235.

[33] E. Plaku et al. “Fast and Reliable Analysis of Molecular Motions Using Proximity Relations and Dimensionality Reduction”. In: Proteins: Struct. Funct. Bioinf. 67.4 (2007), pp. 897–907.

[34] S. T. Roweis and L. K. Saul. “Nonlinear dimensionality reduction by locally linear embedding”. In: Science 290 (2000), pp. 2323–2326.

[35] M.A. Rohrdanz et al. “Determination of reaction coordinates via locally scaled diffusion map”. In: J. Chem. Phys. 134.12 (2011), p. 124116.

[36] R. R. Coifman and S. Lafon. “Diffusion maps”. In: Appl. Comput. Harmon. Anal. 21 (2006), pp. 5–30.

[37] Waikato Machine Learning Group. WEKA. 2010. URL: http://weka.org.

[38] Jonathon Shlens. “A tutorial on principal component analysis”. In: arXiv preprint arXiv:1404.1100 (2014).

[39] M. A. Rohrdanz et al. “Determination of reaction coordinates via locally scaled diffusion map”. In: J. Chem. Phys. 134.12 (2011), p. 124116.

[40] Ronald R. Coifman and Stéphane Lafon. “Diffusion maps”. In: Applied and Computational Harmonic Analysis 21.1 (2006), pp. 5–30.

[41] K. J. Maschhoff and D. C. Sorensen. “PARPACK: An Efficient Portable Large Scale Eigenvalue Package for Distributed Memory Parallel Architectures”. In: Applied Parallel Computing in Industrial Problems and Optimization, Lecture Notes in Computer Science 1184 (1996).

[42] David Sculley. “Web-scale k-means clustering”. In: Proceedings of the 19th international conference on World wide web. ACM. 2010, pp. 1177–1178.

[43] David Arthur and Sergei Vassilvitskii. “k-means++: The advantages of careful seeding”. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics. 2007, pp. 1027–1035.

[44] Brendan J Frey and Delbert Dueck. “Clustering by passing messages between data points”. In: science 315.5814 (2007), pp. 972–976.

[45] Delbert Dueck. “Affinity Propagation:Clustering Data by Passing Messages”. PhD thesis. University of Toronto, 2009.

[46] Ian Davidson and SS Ravi. “Agglomerative hierarchical clustering with constraints: Theoretical and empirical results”. In: Knowledge Discovery in Databases: PKDD 2005. Springer, 2005, pp. 59–70.

[47] Hans-Peter Kriegel et al. “Density-based clustering”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.3 (2011), pp. 231–240.

[48] D. Comaniciu and P. Meer. “Mean shift: a robust approach toward feature space analysis”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 24.5 (May 2002), pp. 603–619. ISSN: 0162-8828.

[49] Y. Zhang and J. Skolnick. “SPICKER: A clustering approach to identify near-native protein folds”. In: J Comput Chem 25.6 (May 2004), pp. 865–871.

[50] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.