build tree with AGNES clustering — build_aggo

Agglomerative hierarchical clustering produces a tree-like structure. Here, we construct a distance matrix, typically from the copy number states of bins, and then feed that distance matrix to AGNES (agglomerative nesting) clustering.
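
As a rough illustration of the idea (not the package's internal code), the sketch below builds a toy cells-by-bins matrix of copy number states, computes a distance matrix, and clusters it agglomeratively. The toy data, the Euclidean distance (the dist() default), and the use of stats::hclust() with average linkage as a stand-in for an AGNES implementation such as cluster::agnes() are all assumptions made for this example.

```r
# Toy cells-by-bins matrix of copy number states (hypothetical data).
states <- rbind(
  cell_A = c(2, 2, 3, 3, 2, 2),
  cell_B = c(2, 2, 3, 3, 2, 2),
  cell_C = c(2, 2, 2, 2, 4, 4),
  cell_D = c(2, 2, 2, 2, 4, 4)
)

# Pairwise distances between cells (Euclidean by default).
d <- dist(states)

# Agglomerative clustering. hclust() and average linkage are assumptions
# chosen for illustration, standing in for an AGNES implementation.
tree <- hclust(d, method = "average")

# Dendrogram with cells as tips.
plot(tree)
```

The resulting object is a standard hclust tree, so it can be cut into groups or converted to other tree classes as needed.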

Usage

build_aggo_tree(
  reads_df,
  cut_k = 8,
  state_col = "state",
  sample_col = "cell_id",
  chrom_col = "chr",
  by_ploidy_change = FALSE,
  by_cn_change = FALSE
)

Arguments

reads_df: data.frame. Bin-based reads data.
cut_k: int. Level to cut the tree at. Default: 8
state_col: str. Column to use for the clustering. Default: "state"
sample_col: str. Column of the sample labels. Default: "cell_id"
chrom_col: str. Column of the chromosome labels. Default: "chr"
by_ploidy_change: bool. Use the CN change from each cell's mode ploidy as the feature to cluster samples by. This can help group clones that are whole-genome-doubled (WGD) versions of lower-ploidy clones. Default: FALSE
by_cn_change: bool. Cluster based on changes in CN state along chromosomes. Default: FALSE

Value

list with two elements. $phylo: the tree; $clones: tibble of clone ids for the tip labels, based on tree cutting.

Details

This function will also do a preliminary tree cutting to provide groups within the tree, which could be interpreted as copy-number clones of cells. You can always re-cut the returned tree with stats::cutree(), which is all that is used internally here.
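
A minimal sketch of re-cutting a tree at different levels with stats::cutree() follows. It uses a toy hclust object built from hypothetical data rather than the tree returned by this function; how the returned tree needs to be converted before calling cutree() depends on its class and is not shown here.

```r
# Toy hierarchical tree over six cells (hypothetical one-feature data).
x <- matrix(c(1, 1.2, 5, 5.3, 9, 9.1), ncol = 1,
            dimnames = list(paste0("cell-", LETTERS[1:6]), NULL))
tree <- hclust(dist(x), method = "average")

# Cut the same tree at two different levels.
clones_k2 <- cutree(tree, k = 2)
clones_k3 <- cutree(tree, k = 3)

# Cross-tabulate to see how groups split when k increases.
table(k2 = clones_k2, k3 = clones_k3)
```

Because cutting does not modify the tree, several values of k can be compared cheaply before settling on a clone assignment.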

There are 3 options for the values used in clustering:

1. states: the copy number states themselves (used when both by_cn_change and by_ploidy_change are FALSE).
2. by_cn_change: changes from one state to another state are marked as 1, and all other bins as 0. In essence, it marks the breakpoints and clusters based on shared breakpoints.
3. by_ploidy_change: finds the mode ploidy of each cell, and then converts bin states into their difference from the mode ploidy.
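
The two derived encodings can be sketched on a single cell's vector of bin states as below. This is an illustrative approximation, not the package's implementation: the helper names (mode_state, cn_change, ploidy_change) are hypothetical, the mode is taken over bin states, and chromosome boundaries are ignored for brevity.

```r
# Hypothetical helper: most frequent state in a cell (ties -> first seen).
mode_state <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Breakpoint encoding (sketch): 1 where the state differs from the
# previous bin, else 0. Chromosome boundaries are ignored here.
cn_change <- function(x) c(0, as.integer(diff(x) != 0))

# Ploidy-change encoding (sketch): difference of each bin from the
# cell's mode state.
ploidy_change <- function(x) x - mode_state(x)

# A diploid cell with one gain, and a whole-genome-doubled copy of it.
diploid <- c(2, 2, 3, 3, 2, 2)
wgd     <- diploid * 2

cn_change(diploid)      # breakpoint encoding of the diploid cell
cn_change(wgd)          # breakpoint encoding after doubling
ploidy_change(diploid)  # offsets from the diploid cell's mode state
ploidy_change(wgd)      # offsets from the doubled cell's mode state
```

In this toy case the two cells have identical breakpoint encodings, and their ploidy-change encodings differ only in the gained region, whereas their raw states differ in every bin.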

Options 2 and 3 seem to work well to group WGD cells of lower ploidy clones together into clades. They also circumvent the problem that a straight distance between states doesn't always correspond to biological reality. For example, from a state of 2, a state of 3 and a state of 4 can both be 1 mutational step away, the former being an amplification event and the latter a WGD event. Thus, bins of 3 and 4 are potentially equally distant from a bin of 2, making distance-based metrics using CN states dubious (see example).

Examples

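# Three cells with a single bin each, at copy number states 2, 3, and 4
mm <- matrix(c(2, 3, 4), ncol = 1)
rownames(mm) <- paste0("cell-", LETTERS[1:3])
# cell-C ends up 2 units from cell-A, although a WGD could link them in one step
dist(mm)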
#>        cell-A cell-B
#> cell-B      1       
#> cell-C      2      1