Agglomerative hierarchical clustering builds a tree-like structure by successively merging the most similar samples. Here, we construct a distance matrix, typically from the copy number states of bins, then feed that distance matrix to AGNES clustering.
Usage
build_aggo_tree(
reads_df,
cut_k = 8,
state_col = "state",
sample_col = "cell_id",
chrom_col = "chr",
by_ploidy_change = FALSE,
by_cn_change = FALSE
)
Arguments
- reads_df
data.frame. Bin-based reads data.
- cut_k
int. Level (k) at which to cut the tree. Default: 8
- state_col
str. Column to use for the clustering. Default: "state"
- sample_col
str. Column of the sample labels. Default: "cell_id"
- chrom_col
str. Column of the chromosome labels. Default: "chr"
- by_ploidy_change
bool. Use the CN change from each cell's mode ploidy as the feature to cluster samples by. This can help group WGD clones with the lower-ploidy clones they derive from. Default: FALSE
- by_cn_change
bool. Cluster based on changes in CN state along chromosomes. Default: FALSE
Value
list. Two elements. $phylo: the tree; $clones: tibble of clone ids of tip labels based on tree cutting.
Details
This function will also do a preliminary tree cutting to provide groups
within the tree, which could be interpreted as copy-number clones of cells.
You can always re-cut the returned tree with stats::cutree(), which is all
that is used internally here.
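As a minimal sketch of re-cutting, the example below builds a tree from a toy state matrix and cuts it at two different values of k. The data are made up, the Manhattan distance is an assumption, and stats::hclust() is used as a base-R stand-in for the agglomerative step, so it is not necessarily what this function runs internally. A tree returned as $phylo would first need to be coerced to an hclust object before stats::cutree() can be applied.

```r
# Toy copy-number states: 6 cells (rows) x 5 bins (columns). Values are made up.
states <- rbind(
  cell_a = c(2, 2, 2, 2, 2),
  cell_b = c(2, 2, 2, 2, 3),
  cell_c = c(2, 3, 3, 2, 2),
  cell_d = c(2, 3, 3, 2, 3),
  cell_e = c(4, 4, 4, 4, 4),
  cell_f = c(4, 4, 4, 4, 5)
)

# Pairwise distances between cells (metric chosen for illustration only).
d <- stats::dist(states, method = "manhattan")

# Agglomerative clustering; hclust() is a stand-in for the AGNES step.
tree <- stats::hclust(d, method = "average")

# Cut the same tree at two levels without re-clustering.
clones_k2 <- stats::cutree(tree, k = 2)
clones_k3 <- stats::cutree(tree, k = 3)

print(clones_k2)
print(clones_k3)
```

Cutting at a smaller k merges neighbouring groups rather than reshuffling cells, because every cut is taken from the same tree.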
There are three options for the values used in clustering:
1. states
2. by_cn_change: changes from one state to another state are marked as 1, and all other bins as 0. In essence, it marks the breakpoints and clusters based on shared breakpoints.
3. by_ploidy_change: finds the mode ploidy of each cell, and then converts bin states into their difference from the mode ploidy.
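The three encodings described above can be illustrated on a single toy cell. This is only a sketch under assumptions: the helper names mode_state() and cn_change() are hypothetical, a breakpoint is marked on the first bin after a state change, and changes are not counted across chromosome boundaries. The function's actual implementation may differ on each of these points.

```r
# Toy data: one cell's states across bins on two chromosomes (made-up values).
bins <- data.frame(
  chr   = c("1", "1", "1", "1", "2", "2", "2"),
  state = c(2, 2, 3, 3, 2, 4, 4)
)

# Hypothetical helper: most frequent state in a cell (ties go to the lowest state).
mode_state <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical helper: 1 where the state differs from the previous bin, else 0.
cn_change <- function(x) c(0, as.integer(diff(x) != 0))

# Option 1: the raw states.
feat_states <- bins$state

# Option 2: breakpoint indicator, computed separately within each chromosome.
feat_cn_change <- ave(bins$state, bins$chr, FUN = cn_change)

# Option 3: difference of each bin from the cell's mode ploidy.
feat_ploidy_change <- bins$state - mode_state(bins$state)

print(cbind(bins, feat_cn_change, feat_ploidy_change))
```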
Options 2 and 3 (by_cn_change and by_ploidy_change) seem to work well to group WGD cells with the lower ploidy clones they derive from, placing them together in clades. They also circumvent the problem that a straight distance between states doesn't always correspond to biological reality. For example, from a state of 2, a state of 3 and a state of 4 can both be 1 mutational step away, the former being an amplification event and the latter being a WGD event. Thus, bins of 3 and 4 are potentially equally distant from a bin of 2, making distance-based metrics using CN states dubious (see example).
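To make the distance argument concrete, the sketch below compares pairwise distances on raw states with distances on the difference from mode ploidy. The three cells, the Manhattan metric, and the helper mode_state() are illustrative assumptions, not taken from the package.

```r
# Three toy cells over 6 bins (made-up values):
# a diploid cell, the same cell with one amplified segment, and a WGD of the diploid.
states <- rbind(
  diploid   = c(2, 2, 2, 2, 2, 2),
  amplified = c(2, 2, 3, 3, 2, 2),
  wgd       = c(4, 4, 4, 4, 4, 4)
)

# Hypothetical helper: most frequent state in a cell.
mode_state <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Distances on raw states: the WGD cell looks far from its diploid parent.
d_states <- as.matrix(stats::dist(states, method = "manhattan"))

# Distances on difference from mode ploidy: the WGD cell collapses onto the parent.
ploidy_change <- t(apply(states, 1, function(x) x - mode_state(x)))
d_ploidy <- as.matrix(stats::dist(ploidy_change, method = "manhattan"))

print(d_states)
print(d_ploidy)
```

On raw states the single WGD event costs far more distance than the single amplification event, while on the ploidy-change encoding the WGD cell sits at distance 0 from its diploid parent.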