BatchMatch’s Major Capabilities
Ability to Use a Reference Batch
BatchMatch provides the user with options for using one, a few, or all batches to generate a single reference batch. This allows the batch effect between the reference batch and the non-reference batch(es) to be removed. When an existing dataset is used to develop a predictive model for future datasets, the existing data must be kept unchanged so that it can serve as the reference against which the potential batch effect in the future datasets is evaluated and removed.
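The reference-batch idea can be illustrated with a minimal sketch. It shifts a non-reference batch so that its per-feature means match those of the reference batch, and it leaves the reference data untouched. Plain mean-centering is an assumption made here for illustration only; it is not presented as BatchMatch's actual algorithm, and the names `feature_means` and `align_to_reference` are hypothetical.

```python
from statistics import mean

def feature_means(batch):
    """Per-feature means of a batch given as a list of samples (rows)."""
    return [mean(column) for column in zip(*batch)]

def align_to_reference(reference, other):
    """Shift `other` so its feature means equal those of `reference`.

    Hypothetical helper: simple mean-centering, assumed for illustration.
    A new list is returned; the reference batch is never modified.
    """
    ref_means = feature_means(reference)
    other_means = feature_means(other)
    shifts = [r - o for r, o in zip(ref_means, other_means)]
    return [[x + s for x, s in zip(sample, shifts)] for sample in other]

# Toy data: two samples per batch, two features per sample.
reference = [[1.0, 10.0], [3.0, 14.0]]   # existing data, kept unchanged
future = [[6.0, 20.0], [8.0, 26.0]]      # new batch with an offset

adjusted = align_to_reference(reference, future)
print(feature_means(adjusted) == feature_means(reference))
```

Because only the non-reference batch is shifted, a model trained on the reference data can be applied to the adjusted batch without retraining.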
Applicability to Multiple Batches
The software can evaluate and remove potential batch effects in datasets with multiple batches. This allows the user to combine and study datasets comprising two or more batches.
Applicability to Multiple Classes
BatchMatch is applicable to multi-class problems such as those involving multiple compounds, dosage levels, and/or time points for typical toxicogenomics studies. The biological differences in all classes are preserved while the batch effect is removed.
Optional Use of Class Labels
The software provides various options for using or not using the sample class labels in each batch. This is useful for studies with different objectives. For identifying potential biomarkers or developing classification models with existing datasets, the use of sample class label information for all batches is recommended. For removing the batch effects between the existing set (for which all the class label information is known, and based upon which a predictive model is to be developed) and the future set (for which the class labels are unknown and need to be predicted), users have the option of using class label information ONLY from the reference batch or NOT using class label information at all.
Consideration of the Unbalanced Class-Sample-Size Problem
Special treatment has been implemented in BatchMatch for the unbalanced class-sample-size problem commonly encountered in biomedical datasets. For example, the number of control samples could be much greater than the number of treated samples in one batch, and vice versa in another. In this case, special treatment is necessary to avoid removing biological differences inappropriately during the batch effect removal process.
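The following sketch shows why unbalanced class sizes matter. Two toy batches contain identical control and treated values and therefore have no true batch effect, yet their raw means differ because the class proportions differ. Averaging the per-class means instead gives each class equal weight. This class-balanced mean is one common remedy, assumed here for illustration; the source does not specify BatchMatch's own treatment, and the function name `balanced_batch_mean` is hypothetical.

```python
from statistics import mean

def balanced_batch_mean(values, labels):
    """Mean of the per-class means, so each class counts equally.

    Hypothetical helper for a single feature; not BatchMatch's API.
    """
    class_means = []
    for cls in sorted(set(labels)):
        members = [v for v, lab in zip(values, labels) if lab == cls]
        class_means.append(mean(members))
    return mean(class_means)

# Toy single-feature data: control = 0.0, treated = 10.0 in both batches.
batch_a_values = [0.0, 0.0, 0.0, 0.0, 10.0]
batch_a_labels = ["control"] * 4 + ["treated"]
batch_b_values = [0.0, 10.0, 10.0, 10.0, 10.0]
batch_b_labels = ["control"] + ["treated"] * 4

# Raw means differ only because the class proportions differ.
raw_gap = mean(batch_b_values) - mean(batch_a_values)

# Class-balanced means agree, so no spurious batch effect is reported.
balanced_gap = (balanced_batch_mean(batch_b_values, batch_b_labels)
                - balanced_batch_mean(batch_a_values, batch_a_labels))
print(raw_gap, balanced_gap)
```

Subtracting the raw gap from one batch would erase part of the genuine control-versus-treated difference, which is the inappropriate removal described above.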
Reduction of Sensitivity to Outliers
A sample that behaves very differently from the other samples in its group is an outlier and may unduly bias the quantification and removal of the batch effect. BatchMatch includes special handling that reduces this bias.
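A small sketch illustrates how a single outlier can distort an estimate of a group's center. It compares the mean with the median on toy values. The median is used here as a stand-in robust estimator and is an assumption for illustration; the source does not state which technique BatchMatch uses.

```python
from statistics import mean, median

# Toy single-feature values for one group within one batch.
group = [5.0, 5.2, 4.8, 5.1, 4.9]
with_outlier = group + [25.0]   # one aberrant sample added

# How far each estimate of the group center moves because of the outlier.
mean_shift = mean(with_outlier) - mean(group)
median_shift = median(with_outlier) - median(group)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
```

If the batch effect were estimated from the mean, the outlier alone would shift the correction applied to every sample in the group, whereas the median-based estimate barely moves.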
Tools to Quantify Batch Effect and Evaluate the Usefulness of Removing It
Three sets of tools can be used to identify the batch effect and evaluate the usefulness of removing it:
(1) Four types of visualization tools are available to examine the separation of batches from various perspectives: correlation coefficient heat maps, principal components analysis (PCA), hierarchical clustering (H.C.), and analysis of variance (ANOVA). Four corresponding measures quantify the batch effect: silhouette width, Hotelling's T-squared p-value, H.C. rank-sum p-value, and the ratio of the variance due to batch effect to the variance due to biological effect.
(2) Concordance of two feature lists: one list is selected using the reference batch(es) only, and the other is selected using the combination of the reference and the non-reference batch(es). Comparing the concordance with and without batch effect removal gives an indication of whether or not the reproducibility of the selected features improves. Successful batch effect removal improves this reproducibility.
(3) Cross-batch prediction performance: a reference batch is used to develop a model for the prediction of samples from the non-reference batch(es), and the prediction performances with and without batch effect removal are compared. Successful batch effect removal enhances the cross-batch prediction performance.
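Two of the quantities in the list above can be sketched in a few lines: the silhouette width computed with batches treated as clusters, and the concordance of two feature lists. Both functions are hypothetical illustrations rather than BatchMatch's implementation. In particular, the concordance definition (shared features divided by the length of the shorter list) is an assumption, and the coordinates and feature names are toy values.

```python
from math import dist
from statistics import mean

def silhouette_width(samples, batch_labels):
    """Average silhouette width with batches treated as clusters.

    Values near 1 mean samples group tightly by batch (strong batch effect);
    values near or below 0 mean the batches are intermixed.
    Assumes at least two batches, each with at least two samples.
    """
    labels = sorted(set(batch_labels))
    scores = []
    for i, x in enumerate(samples):
        own = batch_labels[i]
        # Mean distance from x to the samples of each batch, excluding x itself.
        mean_dist = {}
        for lab in labels:
            distances = [dist(x, y) for j, y in enumerate(samples)
                         if batch_labels[j] == lab and j != i]
            mean_dist[lab] = mean(distances)
        a = mean_dist[own]                                   # within-batch
        b = min(d for lab, d in mean_dist.items() if lab != own)  # nearest other batch
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

def concordance(list_a, list_b):
    """Fraction of shared features relative to the shorter list (assumed definition)."""
    set_a, set_b = set(list_a), set(list_b)
    return len(set_a & set_b) / min(len(set_a), len(set_b))

batches = ["A", "A", "A", "B", "B", "B"]
raw = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]   # batch B is offset
adjusted = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]    # offset removed

sw_raw = silhouette_width(raw, batches)
sw_adjusted = silhouette_width(adjusted, batches)

# Toy feature lists: reference-only versus reference plus non-reference.
reference_only = ["g1", "g2", "g3", "g4"]
combined_raw = ["g1", "g7", "g8", "g9"]
combined_adjusted = ["g1", "g2", "g3", "g9"]

print(sw_raw > sw_adjusted)
print(concordance(reference_only, combined_raw),
      concordance(reference_only, combined_adjusted))
```

In this toy setting the silhouette width drops once the offset is removed, and the concordance with the reference-only list rises, which is the pattern the text associates with successful batch effect removal.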