This function extracts copy number features in the style of Wu et al. (2025); see Details for the reference.
Usage
extract_wu_features(
  segs_df,
  sample_col = "cell_id",
  state_bin_max = 5,
  bin_breaks = NA,
  annotate_input = FALSE,
  return_matrix = FALSE,
  ...
)
Arguments
- segs_df
dataframe. CN segments
- sample_col
string. Name of the column with cell_id/other sample name
- state_bin_max
int. Maximum CN to consider for bins. All CNs of this value and higher are grouped together. Default of 5 follows paper.
- bin_breaks
numeric vector. Break points for segment sizes; there is one more bin than there are breaks. The default (NA) follows the paper: < 5Mb, 5–10Mb, > 10Mb, specified as c(5e6, 10e6 + 1). Internally, base::cut() is used, so 2 breaks produce 3 bins.
- annotate_input
boolean. Return the input dataframe with each segment annotated with the feature categories it falls into.
- return_matrix
boolean. Return a cell-by-feature matrix of counts.
- ...
Additional arguments. change_split_val can be passed to alter the critical value for the AA/BB split.
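The binning described for bin_breaks and state_bin_max can be sketched in base R. This is a minimal illustration, not the package's internal code: the padding of the breaks with -Inf/Inf, the use of right = FALSE, the bin labels, and the example values are all assumptions made for this sketch.

```r
# Illustrative segment lengths in base pairs (made-up values).
seg_len <- c(1e6, 5e6, 7.5e6, 10e6, 10e6 + 1, 50e6)

# Two interior breaks, matching the documented default.
bin_breaks <- c(5e6, 10e6 + 1)

# Assumption: breaks are padded with -Inf/Inf and intervals are left-closed,
# so exactly 5 Mb and exactly 10 Mb both land in the middle bin.
size_bin <- cut(
  seg_len,
  breaks = c(-Inf, bin_breaks, Inf),
  labels = c("<5Mb", "5-10Mb", ">10Mb"),  # hypothetical labels
  right  = FALSE
)

# Two breaks give three bins.
nlevels(size_bin)
table(size_bin)

# state_bin_max: copy numbers at or above the cap are grouped together.
cn            <- c(0, 1, 2, 3, 4, 5, 8, 12)
state_bin_max <- 5
cn_capped     <- pmin(cn, state_bin_max)
cn_capped
```

Passing a longer vector of breaks in the same way yields correspondingly more size bins.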
Details
Wu et al. 2025. Single-cell copy number alteration signature analysis reveals masked patterns and potential biomarkers for cancer. bioRxiv.
https://www.biorxiv.org/content/10.1101/2025.03.02.641098v1
They employ 4 base features, which they cross to give 90 categories (5 × 3 × 3 × 2):
CN states: 5 bins, 0-1, 2, 3, 4, 5+
segment size: 3 bins, <5Mb, 5-10Mb, >10Mb
segment shape: 3 bins, LL (low left, low right segment), HH (high left, high right segment), OT (other)
segment change: 2 bins, AA (difference between surrounding segments <= some critical value), or BB
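The count of 90 follows directly from crossing the four features. The sketch below enumerates the combinations with expand.grid(); the column names and the colon-separated label format are hypothetical and are not necessarily the names the function returns.

```r
# Cross the four base features; levels are taken from the list above.
categories <- expand.grid(
  cn_state = c("0-1", "2", "3", "4", "5+"),
  size     = c("<5Mb", "5-10Mb", ">10Mb"),
  shape    = c("LL", "HH", "OT"),
  change   = c("AA", "BB"),
  stringsAsFactors = FALSE
)

# 5 * 3 * 3 * 2 combinations
nrow(categories)

# Hypothetical label format, for illustration only.
category_labels <- with(
  categories,
  paste(cn_state, size, shape, change, sep = ":")
)
head(category_labels)
```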
Some issues:
whole chromosome amplifications/losses not captured
chromosomes need to have at least 3 changes to be truly reflected in these categories. Those with fewer are backfilled based on hardcoded rules.
for segment change, the paper says they considered changes > 2 as BB, but in their actual code the critical value is set to 1. This function follows their code, but the value can be altered via change_split_val.
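To see what the choice of critical value changes, here is a small sketch of the AA/BB rule. The helper classify_change() is hypothetical and is not part of the package; it assumes the split is made on the absolute CN difference between the left and right neighbouring segments, which may differ from the actual implementation.

```r
# Hypothetical helper: AA if the neighbouring segments differ by at most
# change_split_val, otherwise BB. Uses the absolute difference (assumption).
classify_change <- function(left_cn, right_cn, change_split_val = 1) {
  ifelse(abs(left_cn - right_cn) <= change_split_val, "AA", "BB")
}

# Made-up neighbouring copy numbers with differences of 0, 1, 2 and 3.
left_cn  <- c(2, 2, 1, 4)
right_cn <- c(2, 3, 3, 1)

# Critical value of 1, as in the original code.
code_default <- classify_change(left_cn, right_cn)

# Critical value of 2, as described in the paper text.
paper_value <- classify_change(left_cn, right_cn, change_split_val = 2)

data.frame(left_cn, right_cn, code_default, paper_value)
```

Only segments whose neighbours differ by exactly 2 change category between the two settings.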