Molecular Dynamics (MD) simulates the behaviour of molecules which are allowed to interact with one another over a fixed period of time; it outputs what is known as a trajectory of the behaviour observed. This trajectory is a record of the movement of every atom during the simulation, with respective changes in position over time calculated by solving Newton’s equations of motion. Since the effects of molecular interactions occur so rapidly, positions are often updated at the femtosecond timescale. Therefore, trajectories containing even just a few nanoseconds worth of time frames amount to millions of locational data points to observe.
MD simulations play an important role for researchers interested in the effects molecular conformations can have in their applications, some of which include drug-receptor interaction and protein folding. Consequently, it is imperative that effective methods exist to gather insight from trajectories as efficiently as possible. A common way to do so is through the use of clustering algorithms that can partition the trajectory into a set of dominant conformations that are deemed structurally unique from one another. This project looks to explore different clustering techniques for clustering flexible carbohydrate molecules like polysaccharides.
We aim to investigate and compare the performance of the various clustering algorithms to identify which are effective in clustering trajectories of flexible molecules, specifically, carbohydrate molecules. In addition, we investigate the use of dimensionality reduction techniques on the trajectory data and examine whether this improves clustering results. An effective clustering algorithm is one that generates high-quality clusters with good partitioning. This means that the clusters have high intra-cluster similarity and low inter-cluster similarity. Intra-cluster similarity indicates how well a cluster is formed and how similar the objects within a specific cluster are. The inter-cluster similarity is used to determine how dissimilar different clusters are from one another. This is also referred to as cluster separation. We aim to address the following research question: which clustering algorithms are most effective in clustering MD simulation data of highly flexible molecules? Each member has also outlined their individual research questions which pertain specifically to their contribution.
Nicholas Limbert - HIERARCHICAL | QT
Wen Kang Lu - HDBSCAN | UMAP
Robyn McKenzie - HDBSCAN | IMWK-MEANS
Clustering analysis provides a means to classify similar molecular conformations into a limited number of distinct groups thereby reducing the amount of information to be manually analysed. A wide variety of readily available clustering algorithms have not been tested on highly flexible molecules.
Hierarchical clustering is a well-developed clustering technique with numerous resources available. We utilise four different linkage criteria for hierarchical clustering - single, complete, average and Ward’s method. These are standard, widely available methods that provide an intuitive understanding of how clusters are formed.
The Quality Threshold (QT) algorithm aims to generate high-quality clusters by iteratively adding observations to a cluster while not exceeding the pre-defined cluster diameter or threshold. Observations are added to minimise the cluster diameter. The Quality Threshold (QT) algorithm has previously been implemented for clustering a range of MD data, however, there are some inconsistencies related to certain implementations. We, therefore, aim to make use of the implementation from González-Alemán et al. which is in line with the originally proposed algorithm from Heyer et al. In addition, we make use of a vectorised version by Melvin et al. in order to understand their differences and determine the effectiveness.
Intelligent Mincowski Weighted K-Means (iMWK-Means) is a k-means variant which does not require a desired cluster count, as it begins by overestimating the number of clusters. It is also deterministic, as it intelligently chooses initial centroids, rather than selecting random values. Although iMWK-Means cannot directly label instances as noise, as HDBSCAN can, it assigns higher weights to the instances closest to the centroids, and lower weights to those further away. The result of this is that noise and outliers have less of an impact on the final clustering.
The Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm makes use of a density-based hierarchy where the most dominant clusters are extracted, i.e, it clusters based on regions of high data density and separated by areas of lower data density. Data that cannot be placed in clusters is labelled as noise. The algorithm is relatively fast when compared to other clustering methods and has already seen experimentation in the field of biochemistry.
One issue with HDBSCAN, however, is that it suffers from "the curse of dimensionality". The phenomenon is observed in the case where some high dimensional data has an insufficient amount of recorded data points. When dimensionality becomes increasingly large, the amount of data required to form the same meaningful patterns and groups as lower dimensions increases at an exponential rate. This occurs because the feature space itself increases exponentially, so the probability of two data points being similar experiences exponential decay - the data simply becomes too sparse. Since each frame of an MD trajectory consists of a molecule’s locational data in a three dimensional space, the number of dimensions for a given frame is 3N, where N is the number of atoms present in the molecule. Frames can often contain hundreds if not thousands of atoms, which gives rise to this phenomenon.
Uniform Manifold Approximation and Projection (UMAP) is a state-of-the-art dimensionality reduction technique. UMAP looks to estimate the underlying topology of some given data and project it to a manifold at some lower dimension, i.e., it aims to preserve as much of the variance explained in the original data while reducing its dimensional complexity.
We developed a framework whereby the user can specify instructions, including the input file, algorithm and validation index, via command-line arguments. As the result of each clustering algorithm is dependent on the selection of the atoms and frames to be clustered on, the framework includes preprocessing functions to facilitate this as well as post-processing functions.
The framework is structured as a pipeline to allow for additional algorithms, and other functionality, to be easily implemented at a later stage. Python was selected as the programming language as it has access to many data science and machine learning libraries, including the algorithms needed. Additionally, Python code is more easily understood than that of many other languages and will be simple to maintain and extend in the future.
The program accepts a Protein Data Bank (.pdb) file as the input dataset. These files contain three-dimensional structures of molecules and their change in conformation over a time period.
This project is focused on clustering highly flexible molecules, therefore we will use MD simulation data of carbohydrates. The user will provide the input file name as a command-line argument.
The user may wish to preprocess the files. We implement features which allow atom selection, frame selection and frame downsampling prior to clustering. The user may also generate new trajectory files from existing ones using these preprocessing features. The preprocessing phase outputs the edited trajectory data in the format required by the clustering algorithm implementations.
Dimensionality reduction in the form of UMAP is also considered as a preprocessing tool if used. UMAP takes in a trajectory file and produces an array readable by the clustering algorithms. This array has has lower dimensionality and is easier to work with.
The preprocessed data is clustered with the algorithm specified by the user. The user has a range of clustering algorithms to choose from. These include Hierarchical (various linkage criterion), Quality Threshold variations, HDBSCAN and iMWK-Means. All clustering algorithms are configured to accept data in the format output by the preprocessing phase. The clustering output is also uniform across each algorithm implementation so that the validation techniques and visualisation can be used.
There are multiple validation measures available to the user. We recommend using multiple Cluster Validation Indices (CVIs). If the user has specified a CVI, or set of CVIs, to be used, this will be applied to the output and results will be written to a file. We have implemented the Calinski Harabasz, Davies-Bouldin and Silhouette indices as validation measures.
The output from each clustering job includes a text file with cluster labels per frame that allows the user to see which frames have been placed in which cluster. The CVI results, noise counts, cluster counts and atom selection statements are also written to a text file. In addition, the user may also specify whether they want to output the largest 𝑛 clusters to .pdb files. This allows the user to easily view the dominant confirmations using trajectory visualisation tools such as VMD.
The algorithms were applied to trajectories of four molecules - MenY, MenW, S.flexneri Y 3RU and S.flexneri 6 6RU. MenY and MenW have a similar structure, but MenY is generally stable while MenW has been previously shown to visit multiple distinct conformations. Compared to MenW and MenY, S.flexneri Y 3RU and S.flexneri 6 6RU are substantially more flexible, and we expect them to exhibit a number of different conformations.
Almost all of the algorithms require one or two user-supplied parameters. In these cases the parameters had to be tuned by running the algorithms multiple times with different arguments.
The overall results obtained from the majority of hierarchical clustering for MenW and MenY indicate that no individual linkage criterion produces significantly better results. When inspecting the clusters visually through VMD it is evident that many clusters contain outliers and significantly different conformations. This would negatively impact the CVI’s values and indicate why the values are relatively poor.
The validation indices should only be used to compare results with an individual varying parameter, in our case this would be the linkage criteria. While this allows us to determine which linkage may be most suitable it does not thoroughly verify or validate our results.
Linkage criteria (Ward’s and average) that take multiple data points into account (compared to single and complete which only use single data points) produce significantly better results for all statistics. Our results demonstrate that some linkage criteria are more acceptable than others.
QT original produces the best results for MenW with a cutoff value of 3.5 Å. It has a high Calinski-Harabasz index, and a relatively low amount of noise compared to lower cutoff values. The vectorised version produces a smaller number of clusters for the same cutoff values. Overall the original version produces more favourable CVI values and cluster counts for all cutoff values when compared to the vectorised version.
MenY represents a more stable molecule as there is consistently an individual large cluster with one smaller yet significant cluster also present. All evidence from both QT and QT vector suggests that there is a single dominant conformation present for MenY. Once again a cutoff value of 3.5Å for the original version produces more favourable CVI’s with a reduced amount of noise.
The S. flexneri molecules are highly flexible with many different conformations. S. flexneri 6 6R has many more atoms to cluster on compared to S. flexneri Y 3RU. Visual representations indicate that both algorithms can produce acceptable clusters of conformations provided the correct cutoff values are selected.
Below are comparisons between clustering (HDBSCAN) meningococcal W (MenW) with and without UMAP as a preprocessing step. Plots generated by UMAP are used to visualise how well the clusters were identified. Note that the cluster label -1 represents the noise cluster.
Here we can see that many of the frames in MenW were considered as noise (red points).
So, while HDBSCAN is able to find the main clusters, determining conformation dominance is an issue since many of the frames were not clustered.
When looking at the result output, we found that the noise cluster contained almost 40% of the trajectory.
That means, 40% of the frames were not labeled as part of one of the clusters.
The validity indices for estimating clustering performance were poor for this approach.
This further suggested that improvements could be made.
When using UMAP to reduce the dimensionality of MenW first, we find considerably improved results.
Here we can see that far fewer frames in MenW were considered as noise.
When examining the result output, we found that less that 1% of the frames in MenW were classified as noise.
Therefore, conformation dominance would be more accurately determined with this method as almost all frames were accounted for.
The validity indices for this modified approach were also far better than when using HDBSCAN alone.
This further suggests that these clustering results are an improvement.
The fact that HDBSCAN can disregard frames as noise means that it can produce cohesive clusters containing only highly similar frames. For MenY, it identifies a single dominant cluster which is well-formed as it contains few frames which stray from the concentrated area. For MenW, it identifies four distinct clusters, however, a large proportion of frames (28%) have been labelled as noise. Ideally, it is preferable for as few frames to be labelled as noise as possible. If more frames are conserved and placed into clusters, the result is a more complete picture of the conformations of the molecule.
The clusters produced by HDBSCAN for MenW and MenY indicate the difference in stability between MenY, with a single conformation, and MenW, with multiple conformations. The single MenY cluster also bares resemblence to the second MenW cluster. As the two molecules are similar in structure, it makes sense that they will assume a similar conformation some of the time, despite the obvious difference in their respective levels of stability.
As iMWK-Means cannot classify frames as noise, the clusters it produces often include more frame variation than HDBSCAN. This is because all frames, even those which may be outliers, must be placed in a cluster. In the case of MenW, iMWK-Means identifies three distinct clusters, however they are not as distinct as the four clusters produced by HDBSCAN and contain more variance. As the HDBSCAN clusters to not account for all of the data, due to the high proportion classfied as noise, the iMWK-Means clusters can be analysed for a more complete overview of the molecule, althought the quality of the clusters is not as good.
For molecules that are generally stable iMWK-Means can detect subtle differences that HDBSCAN does not. This is evident in the results for MenY, where iMWK-Means produces two clusters with a subtle difference.
Hierarchical clustering is unable to classify noise separately, however, it is still able to produce clusters with relatively good CVI values. A limitation to Hierarchical clustering is that it requires prior knowledge about the number of conformations within a simulation. This information is usually not available unless prior research has been conducted on the specified data set. As expected, single linkage is unable to produce clusters of significance and merges most frames into one cluster. Complete, however can produce reasonable results given that only individual data points are taken into account when merging clusters. Complete also has better CVI values and a greater cluster count spread to those produced by single linkage. Ward’s method and average linkage produce much better results in comparison to single and complete linkages. Linkage criteria (Ward’s and average) that take multiple data points into account produce significantly better results and should be implemented when using Hierarchical clustering.
Quality Threshold produces good clusters for highly flexible molecules as is evident from the results obtained for S. flexneri, MenW and to an extent MenY. Notably, results are only informative once the correct cutoff value has been found. Lower cut-off values impose a stricter constraint on the QT variations and produce more noise data. A balance between clusters produced and noise should be determined. Minimising noise by increasing the cut-off value leads to larger clusters containing dissimilar conformations.
It is evident that both QT variants (original and vectorised) can cluster highly flexible molecules and can produce high-quality clusters with minimal noise albeit the correct cut-off values have to be determined. Selection of a cut-off value should be based on iteratively evaluating different CVI’s while changing the cut-off values. One should evaluate the noise and clusters produced to determine the most suitable cut-off value. This further attests to QT’s existing popularity within the domain of clustering MD data.
Without UMAP, HDBSCAN was found to provide good conformational results but was unable to accurately determine dominance due to its inability to cluster many of the frames. By using UMAP to preprocess the trajectories beforehand, HDBSCAN was able to cluster almost all of the frames present in both meningococcal Y and W trajectories while still accurately determining their conformational behaviours, allowing us to deduce the dominance of each conformation present. These findings suggest that using UMAP as a preprocessing step greatly benefits the clustering of MD trajectories with HDBSCAN, as it no longer suffers from sparse data resulting from high dimensionality.
iMWK-Means is useful for detecting finer details in generally stable molecules, and was able to detect a difference in conformation for MenY where HDBSCAN could not. Although iMWK-Means can also produce useful clusters for less stable molecules, in these cases it will work well with HDBSCAN, as HDBSCAN produces more distinct clusters with lower variance.
HDBSCAN can produce clusters with low variation due to its ability to detect and classify noise. This means that it can produce highly cohesive clusters. However, for less stable molecules, a large proportion of frames can be labelled as noise. Once again using HDBSCAN with iMWK-Means can be helpful, as iMWK-Means does not classify frames as noise. The iMWK-Means clusters will therefore account for all of the frames, including those that HDBSCAN classified as noise.