Contributions

Gianluca

Manually- and Automatically-Optimised Learning Algorithms for Improved Warfarin Dosing

This research evaluated the accuracy of 17 learning algorithms on two datasets. The first 10 algorithms used default or manually-optimised hyperparameters, while the remaining 7 were developed using genetic programming. These automated algorithms produced the most accurate models and outperformed the best published results in this field. This research also examined the effects of different parameter sets and missing-data treatments on model accuracy, which informed guidelines on how to implement dosing models in a South African clinical context.

Data Gathering and Transformation

Two datasets of warfarin records were used for this study. The globally-standard IWPC dataset was used for comparing new techniques to those in the literature, whilst the proprietary PathCare dataset was used to evaluate model performance against human experts in a South African context. The PathCare data was provided as a MySQL dump file, so extensive data cleaning and extraction was performed to produce a research-focused CSV file.

Machine learning algorithms require a two-dimensional matrix of input values and a one-dimensional vector of target values. To perform the requisite mathematical operations, all values must be numeric. Whilst some data fields – INR values, height, and warfarin dose – were already in continuous format, other data was in categorical or text format and required vectorisation. Bespoke algorithms were created to perform the vectorisation in a manner that encoded context-specific information, such as drug names.
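The vectorisation step can be illustrated with a minimal one-hot encoding sketch. The field names, drug names, and INR values below are illustrative, not the study's actual schema, and the study's bespoke algorithms encoded more context than this.

```python
# Sketch of vectorising a categorical field such as drug names.
def build_vocabulary(records, field):
    """Collect the sorted set of values observed for a categorical field."""
    return sorted({r[field] for r in records if r.get(field) is not None})

def one_hot(value, vocabulary):
    """Encode a single categorical value as a one-hot vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

records = [
    {"drug": "amiodarone", "inr": 2.4},
    {"drug": "simvastatin", "inr": 2.9},
    {"drug": "amiodarone", "inr": 3.1},
]
vocab = build_vocabulary(records, "drug")
# Each row: one-hot drug columns followed by the continuous INR value.
rows = [one_hot(r["drug"], vocab) + [r["inr"]] for r in records]
```

Concatenating the one-hot columns with the already-continuous fields yields the two-dimensional numeric matrix the algorithms require.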

Dataset Splitting

Records were selected at random to ensure that value distributions were similar between each training set and its corresponding validation set. The PathCare dataset was scaled and split in the same ratio as the IWPC data, producing datasets of near-identical size and value distributions, which allowed direct comparison between the two with limited confounding factors. A novel technique handled records of patients that appeared in both sets by swapping each such record with a randomly-chosen record from the other set. This emulated the effect of splitting the PathCare data patient-wise, without the arduous task of ensuring that the 80/20 split ratio was preserved.
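A minimal sketch of the swapping procedure, assuming each record carries a `patient_id` field. The field name and the single-pass loop are illustrative simplifications; patients with many records may require repeated passes.

```python
import random

def resolve_overlap(train, valid, rng=random):
    """Swap each validation record whose patient also appears in the
    training set with a randomly chosen training record whose patient
    does not appear in the validation set. Set sizes are preserved."""
    train, valid = list(train), list(valid)
    train_ids = {r["patient_id"] for r in train}
    for i, rec in enumerate(valid):
        if rec["patient_id"] in train_ids:
            valid_ids = {r["patient_id"] for r in valid}
            candidates = [j for j, t in enumerate(train)
                          if t["patient_id"] not in valid_ids]
            if not candidates:
                break
            j = rng.choice(candidates)
            valid[i], train[j] = train[j], valid[i]
            train_ids = {r["patient_id"] for r in train}
    return train, valid
```

Because records are exchanged one-for-one, the 80/20 ratio is preserved automatically, which is the point of the technique.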

Experiments

Experiment 1: Humans vs. Algorithms. This experiment compared the accuracy of experienced human clinicians with that of ML models, which was possible because the PathCare dataset contained records of multiple visits for a small subset of patients. Those records were compared with the performance of models trained on a subset of the data. Since no clinical experiment could be run in which models dosed real patients, predictions were compared to final therapeutic doses. As those final doses were only available in cases where the patient was successfully dosed, estimated therapeutic doses were needed for the remaining cases. A novel technique imputed those estimates by random sampling from a Gaussian distribution with the same mean and variance as the PathCare data.
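The Gaussian imputation step can be sketched in a few lines. The dose values, seed, and zero-clipping below are illustrative assumptions, not details taken from the study.

```python
import numpy as np

def impute_therapeutic_doses(observed_doses, n_missing, rng):
    """Sample estimated therapeutic doses from a Gaussian with the same
    mean and standard deviation as the observed doses. Clipping at zero
    is an added safeguard, not a step described in the study."""
    observed = np.asarray(observed_doses, dtype=float)
    samples = rng.normal(observed.mean(), observed.std(), size=n_missing)
    return np.clip(samples, 0.0, None)

observed = [3.5, 5.0, 4.25, 6.0, 4.5]   # illustrative weekly doses
estimates = impute_therapeutic_doses(observed, n_missing=3,
                                     rng=np.random.default_rng(0))
```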

Experiment 2: Data Manipulations. This two-part experiment evaluated the effects of data manipulations on the resulting models, investigating three different parameter sets for each dataset as well as different treatments for missing data.

Experiment 3: New Techniques. This experiment evaluated new techniques for warfarin dose prediction by comparing them to the best results in the academic literature (Liu et al. 2015). The techniques evaluated included manually optimising promising learning algorithms, as well as using an autoML approach to optimise learning pipelines with genetic programming.

Training ML Models

Most algorithms were implemented using the scikit-learn library in Python 3.6 and optimised by tuning the hyperparameters. In many cases, two instances of the algorithm were used, where the first was manually optimised for the PathCare data and the second for the IWPC data. This allowed the best performances for each algorithm on each dataset to be compared directly. Feature selection was performed manually using domain knowledge. Manually-optimised algorithms were enhanced with preprocessing tools like StandardScaler and RobustScaler.
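A minimal sketch of the two-instance pattern described above, using scikit-learn pipelines. Ridge regression and the hyperparameter values are illustrative stand-ins, not the algorithms or tuned values from the study.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Two instances of the same algorithm, each manually tuned per dataset.
pathcare_model = Pipeline([("scale", StandardScaler()),
                           ("learn", Ridge(alpha=1.0))])
iwpc_model = Pipeline([("scale", RobustScaler()),
                       ("learn", Ridge(alpha=10.0))])

# Toy data: two manually-selected features, a roughly linear target.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 4.0]])
y = np.array([2.1, 3.0, 4.2, 4.9])
pathcare_model.fit(X, y)
predictions = pathcare_model.predict(X)
```

Keeping the scaler inside the pipeline ensures it is fitted only on training folds during cross-validation.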

AutoML With Genetic Programming

The Tree-based Pipeline Optimisation Tool (TPOT) was used to generate high-performing pipelines through genetic programming. Cleaned versions of the training sets were given as input, and many generations of supervised learning yielded the best performers – optimised meta-algorithms that would likely never have been found through manual implementation and tuning. TPOT accepts bespoke scoring functions as its fitness function; functions for PW20 and MAE were tested, as was a hybrid of the two.
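TPOT's fitness function is supplied as a scikit-learn-style scorer, so the PW20 and MAE objectives can be written as plain functions. The hybrid weighting below is an illustrative guess; the study's exact combination is not reproduced here.

```python
import numpy as np

def pw20(y_true, y_pred):
    """Percentage of predictions within ±20% of the true dose."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_pred - y_true) <= 0.2 * y_true)

def mae(y_true, y_pred):
    """Mean absolute error of the predicted doses."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))

def hybrid(y_true, y_pred, weight=1.0):
    """Illustrative hybrid: reward PW20, penalise MAE (weight is a guess)."""
    return pw20(y_true, y_pred) - weight * mae(y_true, y_pred)
```

Any of these can be wrapped with `sklearn.metrics.make_scorer` and passed to TPOT via its `scoring` argument.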

Evaluating performance

INR has a therapeutic range of 2.0–3.0 in most patients, and using tighter target ranges for maintenance dosing achieves no therapeutic advantage. The chosen metrics accounted for that, and were consistent with metrics used in related studies, allowing direct comparisons.

Two techniques were used to estimate performance during training. The first was standard k-fold cross-validation and the second was Monte Carlo cross-validation. This combined approach was utilised to facilitate rapid yet robust evaluation of models during manual optimisation.
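The two cross-validation schemes map onto scikit-learn's `KFold` and `ShuffleSplit`; a minimal sketch, with the fold counts and split ratio as illustrative values rather than the study's settings:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(40).reshape(20, 2)   # 20 toy records

# Standard k-fold: each record lands in exactly one validation fold.
kfold_splits = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# Monte Carlo CV: independently drawn random 80/20 splits, repeated often.
mc_splits = list(ShuffleSplit(n_splits=25, test_size=0.2,
                              random_state=0).split(X))
```

Monte Carlo CV allows the number of repeats to be chosen independently of the split ratio, which is what makes the combination both rapid and robust.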

Findings

The experiments demonstrated that learning algorithms can produce models at least as effective as human experts at prescribing warfarin maintenance doses.

Currently, PathCare collects very few of the clinical metrics relevant to accurate warfarin dosing. The results indicate that new policies to collect height, weight, race, and smoking status would be relatively easy to implement, and both this study and many others have shown that these features drastically improve the dosing accuracy of models.

Imputation methods were found to be an effective means of dealing with missing data. This is especially important in warfarin dose prediction, where datasets are both small and incomplete. Moreover, the ability to impute missing features is essential to clinical application of these models, as it allows them to dose future patients even if some parameters are not available – which is a frequent occurrence in clinical practice.
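As one example of feature imputation at prediction time, a mean imputer fitted on training data can fill gaps in a future patient's record. The feature columns and values below are illustrative, and mean imputation is only one of the possible methods.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Training records with columns [age, weight, height]; values illustrative.
X_train = np.array([
    [60.0, 70.0, 1.70],
    [45.0, 80.0, 1.80],
    [75.0, 60.0, 1.60],
])
imputer = SimpleImputer(strategy="mean").fit(X_train)

# A future patient with no recorded weight can still be dosed once the
# missing value is filled with the training-set mean (70.0 here).
new_patient = np.array([[50.0, np.nan, 1.75]])
completed = imputer.transform(new_patient)
```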

This study found that autoML techniques – in this case optimisation through genetic programming – were an effective method of producing accurate models with limited domain knowledge. This reduces the need for machine learning expertise, which improves both the resource efficiency and availability of warfarin dose predictions. The generated learners were not only simpler to attain than manually-tuned algorithms, but also performed better. If this trend is not unique to warfarin dosing, autoML is a promising method for automated dosing in general – a result of broad significance for clinical machine learning.

Neville

Automated Warfarin Dosing for South Africans with Evolutionary Ensemble Learning

The main objective of this research was to determine the efficacy of ensemble machine learning methods at predicting a patient’s ideal warfarin dose from various subsets of available data. In machine learning, ensemble methods aggregate multiple “base” models to achieve better results than any of them would individually. This definition is purposefully broad, as ensembles can be configured in a multitude of ways. The most notable possibilities for variation are as follows:

  • The base models can all be of the same type (homogeneous) or of different types (heterogeneous).
  • Each of the base models can be trained with the entirety of the dataset or a sample thereof.
  • Each of the base models can be trained using all or some of the features included in the dataset.
  • The sampling of data points or features can be done with or without replacement.
  • The final output of the ensemble can be a simple average of the base models' predictions, a weighted average, or the result of passing the input through a sequence of estimators.

Here we will focus on homogeneous ensembles that make use of simple averaging. When this is done alongside sampling the dataset with replacement, it is known as bootstrap aggregating (bagging). When features are sampled with replacement, it is referred to as the Random Subspaces method. Finally, a combination of the two is known as the Random Patches method. The parameters of three ensembles of multilayer perceptrons (using bagging, Random Subspaces, and Random Patches respectively) were optimised using a (μ + λ) evolution strategy. Unoptimised versions of linear regression, a single multilayer perceptron, a random forest, and gradient boosting were also evaluated.
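The three configurations map directly onto scikit-learn's `BaggingRegressor` flags. A minimal sketch, with illustrative hyperparameters and the (μ + λ) evolution strategy used to tune them omitted:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.neural_network import MLPRegressor

def mlp():
    # Base learner; the architecture is illustrative, not the study's values.
    return MLPRegressor(hidden_layer_sizes=(8,), max_iter=300, random_state=0)

# Bagging: sample records with replacement, keep every feature.
bagging = BaggingRegressor(mlp(), n_estimators=5, bootstrap=True,
                           bootstrap_features=False, random_state=0)
# Random Subspaces: keep every record, sample features.
subspaces = BaggingRegressor(mlp(), n_estimators=5, max_features=0.5,
                             bootstrap=False, bootstrap_features=True,
                             random_state=0)
# Random Patches: sample both records and features.
patches = BaggingRegressor(mlp(), n_estimators=5, max_samples=0.8,
                           max_features=0.5, bootstrap=True,
                           bootstrap_features=True, random_state=0)

# Toy data just to show the ensembles train and predict end-to-end.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X[:, 0] + 0.5 * X[:, 1]
patches.fit(X, y)
preds = patches.predict(X)
```

The final prediction is the simple average of the base perceptrons' outputs, matching the averaging scheme described above.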

All of the models performed better on a local dataset than on an international one when features were limited to age, sex, and INR value; this can be explained by implicit similarities within the local population.

Compared to the aforementioned performance on the international dataset, five of the seven models fared better by as much as 8% when the feature set was expanded to include race, height, weight, and whether the patient was a current smoker. Unexpectedly, however, all models performed considerably worse when the feature set was further expanded to include genetic factors. In fact, the model that performed best during training and validation (MLP Random Patches) performed worst on this test. Several explanations are possible: the models were more liable to overfit once the feature set grew large enough, the training and testing data were imbalanced, or genotypic data is in fact not highly correlated with a patient’s ideal warfarin dose.

From the above we can conclude that it is worthwhile to use localised data for this application. Furthermore, collecting basic clinical information such as race, height, weight, and smoking habits can substantially improve the performance of machine learning models when predicting warfarin dose. Comparison of the results of the various models indicates that simpler methods such as linear regression can contend with more sophisticated techniques on datasets of this relatively small scale.