measure string distances between sibling tips — compute_tip_sibling

This function is useful for assessing whether one tree groups similar tips together better than another tree does.

Usage

compute_tip_sibling_distances(
  states_df,
  tree,
  states_col = "state",
  cell_id_col = "cell_id"
)

Arguments

states_df: long-format data frame of read bin state data
tree: phylo object to be checked
states_col: name of the column containing state data
cell_id_col: name of the column containing the tip labels

Details

For sibling tips, the function measures the distance between their states by treating each cell's states across the genome as a string and computing a string distance between those strings.

States for a cell id are first converted to letters (to prevent double-digit states from counting as 2 characters) and then joined into a single string across the genome for each cell. For example, 2 2 2 3 3 3 10 becomes C C C D D D K. See dlptools::map_states_to_letters() for details.
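The conversion step can be sketched in base R as follows. This is a minimal illustration, not the dlptools implementation: the helper name states_to_string and the toy data are hypothetical, and the mapping (state 0 to A, 1 to B, and so on) is assumed from the example above.

```r
# Hypothetical helper, not dlptools::map_states_to_letters().
# Assumes integer states 0..25 map to letters A..Z, so that 2 -> C and 10 -> K.
states_to_string <- function(states) {
  stopifnot(all(states >= 0), all(states <= 25))
  paste(LETTERS[states + 1], collapse = "")
}

# Toy long-format table: one row per cell and bin (made-up values).
states_df <- data.frame(
  cell_id = rep(c("cell_1", "cell_2"), each = 7),
  state   = c(2, 2, 2, 3, 3, 3, 10,
              2, 2, 3, 3, 3, 3, 10)
)

# Build one string per cell, keeping bins in row order.
cell_strings <- vapply(
  split(states_df$state, states_df$cell_id),
  states_to_string,
  character(1)
)
```

Because each state becomes exactly one character, a state of 10 contributes one position to the string rather than two.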

Then, for each tip, its sister tip is found and the string distance between the two is measured. If the sister to a tip is a clade, the mean distance to all tips in that clade is computed. E.g., in the tree (A, (B, C)), the sister to A is the clade containing both B and C, so the distance for A is the mean of its distances to B and to C. See dlptools::get_dist_to_sibs() for details.

Finally, a mean distance across all sibling clades is computed and returned.
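The sibling lookup and averaging steps can be sketched as below for the tree (A, (B, C)). This is an illustration under stated assumptions, not the package's code: the sister relationships are written out by hand rather than derived from a phylo object, the strings are made up, the helper dist_to_sibs is hypothetical, and utils::adist() (edit distance) stands in for whichever string distance dlptools actually uses. The final mean is taken over tips here, which may differ from how the package aggregates sibling clades.

```r
# Letter strings for three tips (hypothetical values).
cell_strings <- c(A = "CCCDDDK", B = "CCCDDDD", C = "CCDDDDK")

# Sister relationships for the tree (A, (B, C)), written out by hand.
# The sister of A is the clade {B, C}; B and C are sisters of each other.
sisters <- list(A = c("B", "C"), B = "C", C = "B")

# Mean string distance from one tip to every tip in its sister clade.
# utils::adist() computes generalized Levenshtein (edit) distance;
# the metric used by dlptools is an assumption here.
dist_to_sibs <- function(tip, sisters, cell_strings) {
  d <- utils::adist(cell_strings[[tip]], cell_strings[sisters[[tip]]])
  mean(d)
}

# One value per tip, then a single summary value for the tree.
per_tip <- vapply(names(sisters), dist_to_sibs, numeric(1),
                  sisters = sisters, cell_strings = cell_strings)
overall <- mean(per_tip)
```

A lower overall value means sibling tips carry more similar state strings, so computing it for two candidate trees over the same cells gives a simple basis for comparison.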