Batch integration graph
Removing batch effects while preserving biological variation (graph output)
Description
This is a sub-task of the overall batch integration task. Batch (or data) integration methods integrate datasets across batches that arise from various biological and technical sources. Methods that integrate batches typically have three different types of output: a corrected feature matrix, a joint embedding across batches, and/or an integrated cell-cell similarity graph (e.g., a kNN graph). This sub-task focuses on all methods that can output integrated graphs, and includes methods that canonically output the other two data formats with subsequent postprocessing to generate a graph. Other sub-tasks for batch integration can be found for:
This sub-task was taken from a benchmarking study of data integration methods.
Summary
Metrics
- ARI [1]: The Adjusted Rand Index (ARI) compares the overlap of two clusterings. It counts both correct overlaps and correct disagreements between the two clusterings.
- Graph connectivity [1]: The graph connectivity metric assesses whether the kNN graph representation, G, of the integrated data connects all cells with the same cell identity label.
- Isolated label F1 [1]: Isolated cell labels are identified as the labels present in the fewest batches in the integration task. The score evaluates how well these isolated labels separate from other cell identities based on clustering.
- NMI [1]: Normalized mutual information (NMI) compares the overlap of two clusterings. We used NMI to compare the cell-type labels with Louvain clusters computed on the integrated dataset.
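As a rough illustration of two of the metrics above, the following standard-library sketch computes ARI from a contingency table and a graph connectivity score. It assumes graph connectivity is the mean, over cell identity labels, of the fraction of a label's cells that fall in the largest connected component of that label's subgraph; that formula, the function names, and the neighbour-dictionary input format are assumptions made for this example and are not taken from the benchmark code.

```python
from collections import Counter, deque
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings given as equal-length label lists."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no cell pairs to compare
    # Contingency table: number of cells sharing each (cluster_a, cluster_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def graph_connectivity(neighbors, labels):
    """Mean over labels of (largest connected component size / label size).

    neighbors: dict mapping a cell index to an iterable of neighbour indices
    (edges are treated as undirected). labels: one identity label per cell.
    """
    adj = {i: set() for i in range(len(labels))}
    for i, nbrs in neighbors.items():
        for j in nbrs:
            adj[i].add(j)
            adj[j].add(i)

    scores = []
    for label in set(labels):
        cells = {i for i, lab in enumerate(labels) if lab == label}
        unvisited = set(cells)
        largest = 0
        # Breadth-first search restricted to cells carrying this label.
        while unvisited:
            queue = deque([unvisited.pop()])
            size = 1
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v in unvisited:
                        unvisited.remove(v)
                        queue.append(v)
                        size += 1
            largest = max(largest, size)
        scores.append(largest / len(cells))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    cell_types = ["T", "T", "T", "B", "B"]
    knn = {0: [1], 1: [2]}  # the two B cells are not connected to each other
    print(graph_connectivity(knn, cell_types))
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

In the toy graph, all T cells form one component while the two B cells are disconnected, so the score falls below 1. ARI ignores cluster names, so relabelled but identical clusterings score 1.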
Results
Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.
Details
Methods
- BBKNN (full/scaled) [8]: BBKNN (batch balanced k nearest neighbours) builds a graph by identifying each cell's k nearest neighbours within each defined batch separately, creating independent neighbour sets for each cell in each batch. These sets are then combined and processed with the UMAP algorithm for visualisation. Links: Docs.
- BBKNN (full/unscaled) [8]: BBKNN (batch balanced k nearest neighbours) builds a graph by identifying each cell's k nearest neighbours within each defined batch separately, creating independent neighbour sets for each cell in each batch. These sets are then combined and processed with the UMAP algorithm for visualisation. Links: Docs.
- BBKNN (hvg/scaled) [8]: BBKNN (batch balanced k nearest neighbours) builds a graph by identifying each cell's k nearest neighbours within each defined batch separately, creating independent neighbour sets for each cell in each batch. These sets are then combined and processed with the UMAP algorithm for visualisation. Links: Docs.
- BBKNN (hvg/unscaled) [8]: BBKNN (batch balanced k nearest neighbours) builds a graph by identifying each cell's k nearest neighbours within each defined batch separately, creating independent neighbour sets for each cell in each batch. These sets are then combined and processed with the UMAP algorithm for visualisation. Links: Docs.
- ComBat (full/scaled) [4]: ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results. Links: Docs.
- ComBat (full/unscaled) [4]: ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results. Links: Docs.
- ComBat (hvg/scaled) [4]: ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results. Links: Docs.
- ComBat (hvg/unscaled) [4]: ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results. Links: Docs.
- FastMNN embed (full/scaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- FastMNN embed (full/unscaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- FastMNN embed (hvg/scaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- FastMNN embed (hvg/unscaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- FastMNN feature (full/scaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- FastMNN feature (full/unscaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- FastMNN feature (hvg/scaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- FastMNN feature (hvg/unscaled) [10]: fastMNN performs a multi-sample PCA to reduce dimensionality, identifies mutual nearest neighbour (MNN) pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step. Links: Docs.
- Harmony (full/scaled) [7]: Harmony uses PCA to group the cells into multi-dataset clusters, and then computes cluster-specific linear correction factors. Each cell is then corrected by its cell-specific linear factor using the cluster-weighted average. The method keeps iterating these steps until cell clusters are stable. Links: Docs.
- Harmony (full/unscaled) [7]: Harmony uses PCA to group the cells into multi-dataset clusters, and then computes cluster-specific linear correction factors. Each cell is then corrected by its cell-specific linear factor using the cluster-weighted average. The method keeps iterating these steps until cell clusters are stable. Links: Docs.
- Harmony (hvg/scaled) [7]: Harmony uses PCA to group the cells into multi-dataset clusters, and then computes cluster-specific linear correction factors. Each cell is then corrected by its cell-specific linear factor using the cluster-weighted average. The method keeps iterating these steps until cell clusters are stable. Links: Docs.
- Harmony (hvg/unscaled) [7]: Harmony uses PCA to group the cells into multi-dataset clusters, and then computes cluster-specific linear correction factors. Each cell is then corrected by its cell-specific linear factor using the cluster-weighted average. The method keeps iterating these steps until cell clusters are stable. Links: Docs.
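The batch-balanced neighbour search described for BBKNN above can be sketched in a few lines. This is a minimal brute-force illustration under stated assumptions, not the BBKNN package's implementation: the function name `batch_balanced_neighbors`, the Euclidean distance, and the toy coordinates are all hypothetical, and the subsequent UMAP-based processing of the combined sets is left out.

```python
from math import dist


def batch_balanced_neighbors(embedding, batches, k=3):
    """For every cell, take its k nearest neighbours within each batch
    separately and concatenate the per-batch neighbour sets.

    embedding: list of coordinate tuples, one per cell.
    batches: list of batch labels, one per cell.
    Returns a dict mapping each cell index to a list of neighbour indices.
    """
    # Group cell indices by batch label.
    by_batch = {}
    for idx, batch in enumerate(batches):
        by_batch.setdefault(batch, []).append(idx)

    graph = {}
    for i, point in enumerate(embedding):
        combined = []
        for members in by_batch.values():
            # Brute-force search inside one batch, excluding the cell itself.
            candidates = [j for j in members if j != i]
            candidates.sort(key=lambda j: dist(point, embedding[j]))
            combined.extend(candidates[:k])
        graph[i] = combined
    return graph


if __name__ == "__main__":
    coords = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.0), (5.2, 5.0)]
    batch = ["a", "a", "a", "a", "b", "b"]
    for cell, nbrs in batch_balanced_neighbors(coords, batch, k=1).items():
        print(cell, nbrs)
```

Because each batch contributes its own k neighbours, every cell ends up linked to cells from all batches, which is what pulls the batches together in the resulting graph.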
Baseline methods
- Random Integration by Batch: Feature values, embedding coordinates, and graph connectivity are all randomly permuted within each batch label.
- Random Graph by Celltype: Cells are embedded as a one-hot encoding of celltype labels. A graph is then built on this embedding.
- Random Integration by Celltype: Feature values, embedding coordinates, and graph connectivity are all randomly permuted within each celltype label.
- No Integration: Cells are embedded by PCA on the unintegrated data. A graph is built on this PCA embedding.
- Random Integration: Feature values, embedding coordinates, and graph connectivity are all randomly permuted.
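The random baselines above share one operation: shuffling per-cell values within groups defined by a label. A minimal sketch of that operation follows, assuming values are stored as one entry per cell; the helper name `permute_within_groups` and the seed handling are hypothetical and not taken from the benchmark code.

```python
import random


def permute_within_groups(values, groups, seed=0):
    """Return a copy of `values` shuffled only among cells that share a group.

    values: one entry per cell (for example an embedding row or a feature vector).
    groups: one label per cell (batch labels, cell type labels, or a constant).
    """
    rng = random.Random(seed)

    # Collect the positions belonging to each group.
    positions = {}
    for idx, group in enumerate(groups):
        positions.setdefault(group, []).append(idx)

    permuted = list(values)
    for idxs in positions.values():
        shuffled = idxs[:]
        rng.shuffle(shuffled)
        # Cell at `target` receives the value that belonged to `source`.
        for target, source in zip(idxs, shuffled):
            permuted[target] = values[source]
    return permuted


if __name__ == "__main__":
    cells = ["c0", "c1", "c2", "c3", "c4", "c5"]
    batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
    print(permute_within_groups(cells, batch))        # shuffled within each batch
    print(permute_within_groups(cells, ["all"] * 6))  # shuffled across all cells
```

Passing batch labels corresponds to the by-batch baseline, cell type labels to the by-celltype baseline, and a single constant label to the fully random baseline.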
Datasets
- Immune (by batch) [1]: Human immune cells from peripheral blood and bone marrow taken from 5 datasets comprising 10 batches across technologies (10X, Smart-seq2).
- Lung (Vieira Braga et al.) [1]: Human lung scRNA-seq data from 3 datasets with 32,472 cells. From Vieira Braga et al. Technologies: 10X and Drop-seq.
- Pancreas (by batch) [1]: Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq).
Quality control results
✓ All checks succeeded!