Counts of variants

This notebook analyzes the counts of the different variants.

Import Python modules:

[1]:
import os

import Bio.SeqIO

import altair as alt

import dms_variants.codonvarianttable

import pandas as pd

import yaml
[2]:
# allow more rows for Altair
_ = alt.data_transformers.disable_max_rows()

Get configuration information:

[3]:
# If you are running the notebook interactively rather than in the pipeline
# (which handles working directories), you may first have to `os.chdir` to the
# appropriate directory.

with open("config.yaml") as f:
    config = yaml.safe_load(f)
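
As a sanity check (not part of the pipeline itself), these are the keys this notebook assumes `config` provides:

expected_keys = [
    "processed_barcode_runs",      # CSV of barcode sequencing runs
    "barcode_fates_dir",           # per-sample CSVs of barcode read fates
    "barcode_counts_dir",          # per-sample CSVs of valid barcode counts
    "barcode_counts_invalid_dir",  # per-sample CSVs of invalid barcode counts
    "codon_variants",              # barcode-to-variant lookup table
    "gene_sequence_codon",         # FASTA of the codon sequence of the gene
    "site_numbering_map",          # sequential-to-reference site numbering
    "min_avg_counts",              # threshold on average counts per variant
    "variant_avg_counts_plot",     # output HTML plot of average counts
    "variant_avg_counts_csv",      # output CSV of average counts
    "variant_counts_dir",          # per-sample variant-count CSVs
]
assert all(key in config for key in expected_keys), "missing config keys"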

Read information on the barcode runs:

[4]:
barcode_runs = pd.read_csv(config["processed_barcode_runs"])

assert len(barcode_runs) == barcode_runs["library_sample"].nunique()

Barcode sequencing alignment stats

Read the “fates” of barcode reads (e.g., did they align?):

[5]:
fates = (
    pd.concat(
        [
            pd.read_csv(
                os.path.join(config["barcode_fates_dir"], f"{library_sample}.csv")
            )
            for library_sample in barcode_runs["library_sample"]
        ]
    )
    .merge(barcode_runs, on=["library", "sample"], validate="many_to_one")
    .drop(columns=["fastq_R1", "notes"])
    .assign(
        valid=lambda x: x["fate"] == "valid barcode",
        not_valid=lambda x: ~x["valid"],
    )
)

Plot the barcode fates in an interactive plot:

[6]:
selection_cols = [
    "exclude_after_counts",
    "antibody",
    "virus_batch",
    "sample_type",
    "date",
    "library",
]

selections = [
    alt.selection_point(
        fields=[col],
        bind=alt.binding_select(
            options=[None] + fates[col].dropna().unique().tolist(),
            labels=["all"] + [str(x) for x in fates[col].dropna().unique()],
            name=col,
        ),
    )
    for col in selection_cols
]

sample_types = barcode_runs["sample_type"].unique().tolist()

fate_chart = (
    alt.Chart(fates)
    .encode(
        x=alt.X("count", axis=alt.Axis(format=".2g"), scale=alt.Scale(nice=False)),
        y=alt.Y("library_sample", title=None),
        color=alt.Color(
            "fate",
            scale=alt.Scale(domain=sorted(fates["fate"].unique(), reverse=True)),
        ),
        order="not_valid",
        tooltip=[
            alt.Tooltip(c, format=".3g") if c == "count" else c
            for c in fates.columns
            if c not in ["valid", "not_valid", "library_sample", "sample"]
        ],
    )
    .mark_bar()
    .properties(width=300, height=alt.Step(13))
    .configure_axis(labelLimit=500)
)

for selection in selections:
    fate_chart = fate_chart.add_params(selection).transform_filter(selection)

display(fate_chart)

Barcode counts

Read the counts of all valid and invalid barcodes:

[7]:
counts = pd.concat(
    [
        pd.read_csv(os.path.join(subdir, f"{library_sample}.csv")).assign(valid=valid)
        for library_sample in barcode_runs["library_sample"]
        for (subdir, valid) in [
            (config["barcode_counts_dir"], True),
            (config["barcode_counts_invalid_dir"], False),
        ]
    ]
)

Get the average number of counts per barcode, separating valid and invalid ones:

[8]:
avg_counts = (
    counts.groupby(["library", "sample", "valid"], as_index=False)
    .aggregate(
        mean_counts=pd.NamedAgg("count", "mean"),
        n_barcodes=pd.NamedAgg("barcode", "count"),
    )
    .merge(
        barcode_runs,
        on=["library", "sample"],
        validate="many_to_one",
    )
    .assign(valid=lambda x: x["valid"].astype(str))  # to make interactive chart work
    .drop(columns=["fastq_R1", "notes"])
)

Plot the average counts per barcode for both valid and invalid ones. The plot below is interactive: use the dropdowns at the bottom to select specific subsets, mouse over points for details, and click the legend to show only valid or invalid counts:

[9]:
validities = avg_counts["valid"].unique()
valid_selection = alt.selection_point(
    fields=["valid"],
    bind="legend",
)

avg_counts_chart = (
    alt.Chart(avg_counts)
    .encode(
        x=alt.X("mean_counts", title="average counts per barcode"),
        y=alt.Y("library_sample", title=None),
        color=alt.Color(
            "valid", title="valid barcode", scale=alt.Scale(domain=validities)
        ),
        shape=alt.Shape("valid", scale=alt.Scale(domain=validities)),
        opacity=alt.condition(valid_selection, alt.value(0.8), alt.value(0)),
        tooltip=[
            alt.Tooltip(c, format=".3g") if c in {"mean_counts", "n_barcodes"} else c
            for c in avg_counts.columns
            if c != "library_sample"
        ],
    )
    .mark_point(filled=True, size=50)
    .properties(width=200, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_params(*selections, valid_selection)
)
for selection in selections:
    avg_counts_chart = avg_counts_chart.transform_filter(selection)

avg_counts_chart
[9]:

Unique barcodes and other estimates of diversity

Calculate the inverse Simpson index for each sample, as well as the total number of unique barcodes per sample:

[10]:
def inverse_simpson_index(arr):
    """Calculate inverse Simpson index of a sample.

    ISI = 1 / D, where D = sum((n_i / N)^2) for barcode counts n_i
    summing to N.
    """
    return 1 / sum((arr / sum(arr)) ** 2)


unique_bcs_and_ISI = (
    counts.query("valid")
    .query("count > 0")
    .groupby(["library", "sample"], as_index=False)
    .aggregate(
        inv_simpson_index=pd.NamedAgg("count", inverse_simpson_index),
        n_unique_barcodes=pd.NamedAgg("barcode", "nunique"),
    )
    .merge(barcode_runs, on=["library", "sample"], validate="one_to_one")
    .astype({"n_unique_barcodes": float})
    .drop(columns=["fastq_R1", "notes"])
)
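
As a quick sanity check on this metric (toy data, not part of the pipeline): with n barcodes at uniform counts the inverse Simpson index equals n, and it approaches 1 as a single barcode dominates:

import numpy

print(inverse_simpson_index(numpy.array([5, 5, 5, 5])))  # 4.0: four uniform barcodes
print(inverse_simpson_index(numpy.array([97, 1, 1, 1])))  # ~1.06: one barcode dominates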

Plot the inverse Simpson index as an interactive plot. The number of unique barcodes is shown when mousing over a sample:

[11]:
unique_barcodes_chart = (
    alt.Chart(unique_bcs_and_ISI)
    .encode(
        y=alt.Y("library_sample", title=None),
        x=alt.X("inv_simpson_index", title="inverse simpson index"),
        color=alt.Color(
            "sample_type",
            scale=alt.Scale(domain=unique_bcs_and_ISI["sample_type"].unique()),
        ),
        tooltip=[
            alt.Tooltip(c, format=".3g")
            if c in {"inv_simpson_index", "n_unique_barcodes"}
            else c
            for c in unique_bcs_and_ISI.columns
        ],
    )
    .mark_bar()
    .properties(width=275, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_params(*selections)
)

for selection in selections:
    unique_barcodes_chart = unique_barcodes_chart.transform_filter(selection)

display(unique_barcodes_chart)

Invalid barcodes and possible library-to-library contamination

Look at the top invalid barcodes.

First get all invalid barcodes along with where their counts rank among all (valid and invalid) barcodes for that library / sample:

[12]:
ranked_invalid = (
    counts.assign(
        overall_rank=lambda x: (
            x.groupby(["library", "sample"])["count"]
            .transform("rank", ascending=False, method="first")
            .astype(int)
        )
    )
    .query("not valid")
    .merge(barcode_runs)
    .sort_values(["library", "overall_rank"])
)
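
Note that `method="first"` breaks ties by order of appearance, so the ranks within each library / sample form a strict 1, 2, 3, ... sequence. A minimal sketch with hypothetical counts:

demo = pd.DataFrame({"count": [10, 7, 7, 2]})
demo["rank"] = demo["count"].rank(ascending=False, method="first").astype(int)
# ranks are [1, 2, 3, 4]; with method="min" both 7s would instead get rank 2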

How many invalid barcodes are among the top 10 most abundant? Plot this, and then list the top invalid barcode for any sample with an invalid barcode in the top 10:

[13]:
# get top invalid barcode and number invalid in top 10
invalid_topn = (
    ranked_invalid.groupby(["library_sample", "library", "sample"])
    .aggregate(
        invalid_barcodes_in_top_10=pd.NamedAgg(
            "overall_rank", lambda s: len(s[s <= 10])
        ),
        top_invalid_rank=pd.NamedAgg("overall_rank", "first"),
        top_invalid_barcode=pd.NamedAgg("barcode", "first"),
    )
    .reset_index()
    .merge(barcode_runs.drop(columns=["fastq_R1", "notes"]), how="left")
)

# plot number invalid barcodes in top 10
topn_chart = (
    alt.Chart(invalid_topn)
    .encode(
        x=alt.X(
            "invalid_barcodes_in_top_10",
            title="invalid barcodes in top 10",
            scale=alt.Scale(domain=(0, 10)),
        ),
        y=alt.Y("library_sample", title=None),
        tooltip=[
            alt.Tooltip(c, format=".3g")
            if c in {"invalid_barcodes_in_top_10", "top_invalid_rank"}
            else c
            for c in invalid_topn.columns
            if c != "library_sample"
        ],
    )
    .mark_bar()
    .properties(width=200, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_params(*selections)
)
for selection in selections:
    topn_chart = topn_chart.transform_filter(selection)
display(topn_chart)

# top barcode for any samples with an invalid in top 10
print("\nTop barcode for samples with an invalid barcode in top 10:")
display(
    invalid_topn.query("invalid_barcodes_in_top_10 > 0").reset_index(drop=True)[
        [
            "library",
            "sample",
            "invalid_barcodes_in_top_10",
            "top_invalid_rank",
            "top_invalid_barcode",
        ]
    ]
)

Top barcode for samples with an invalid barcode in top 10:
    library  sample                                      invalid_barcodes_in_top_10  top_invalid_rank  top_invalid_barcode
0   A        2022-07-20_rescue-2_VSVG_control_1                                   1                10  TCCGAATGTCCACTAG
1   A        2022-09-01_rescue-3_antibody_IDF033_3.0_1                            1                10  GCCAGGGATACTACTC
2   A        2022-10-17_rescue-4_antibody_IDF033_2.0_1                            1                 9  GCCAGGGATACTACTC
3   A        2022-10-17_rescue-4_antibody_IDF033_3.0_1                            1                 9  GCCAGGGATACTACTC
4   B        2022-08-04_rescue-2_VSVG_control_1                                   2                 5  TCCGCGAAAATCACGC
5   B        2022-08-04_rescue-2_VSVG_control_2                                   2                 5  TCCGCGAAAATCACGC
6   B        2022-08-04_rescue-2_antibody_IDC561_0.75_1                           1                 5  GGCACGTACCTTCACC
7   B        2022-08-04_rescue-2_antibody_IDC561_3.0_1                            1                10  TAGAAAATCCAACTAA
8   B        2022-08-04_rescue-2_no-antibody_control_1                            1                 7  GGCACGTACCTTCACC
9   B        2022-08-04_rescue-2_no-antibody_control_2                            1                 7  GGCACGTACCTTCACC
10  B        2022-09-12_rescue-1_VSVG_control_1                                   1                 6  ACTAGATGCGGTCTTC
11  B        2022-09-27_rescue-3_VSVG_control_1                                   2                 2  TTTATTAAGCACATCA

Get which libraries each barcode maps to:

[14]:
barcodes_by_library = (
    pd.read_csv(config["codon_variants"])
    .groupby(["barcode", "target"], as_index=False)
    .aggregate(
        libraries_w_barcode=pd.NamedAgg("library", lambda s: ", ".join(s.unique())),
        n_libraries_w_barcode=pd.NamedAgg("library", "nunique"),
    )
)

display(
    barcodes_by_library.groupby(["target", "libraries_w_barcode"]).aggregate(
        n_barcodes=pd.NamedAgg("barcode", "count")
    )
)
                                    n_barcodes
target         libraries_w_barcode
gene           A                         44749
               A, B                          2
               B                         42509
neut_standard  A, B                          8

Now look at the overall barcode counts for each sample and see how many map to the expected library or to some other library. Having many barcodes that map to a different library can indicate contamination, unless there is substantial expected overlap between the two libraries (which would be indicated in the table above):

[15]:
counts_by_library = (
    counts.merge(barcodes_by_library, on="barcode", validate="many_to_one")
    .groupby(
        ["library", "sample", "libraries_w_barcode", "target", "n_libraries_w_barcode"],
        as_index=False,
    )
    .aggregate(n_counts=pd.NamedAgg("count", "sum"))
    .assign(
        frac_counts=lambda x: x["n_counts"]
        / x.groupby(["library", "sample"])["n_counts"].transform("sum"),
    )
    .merge(barcode_runs)
    .assign(
        # for barcodes from the primary gene target, categorize by which
        # libraries contain the barcode; for other targets (e.g.,
        # neut_standard), use the target itself as the category
        category=lambda x: x["libraries_w_barcode"].where(
            x["target"] == "gene", x["target"]
        )
    )
    .drop(
        columns=[
            "fastq_R1",
            "notes",
            "antibody_concentration",
            "target",
            "libraries_w_barcode",
        ]
    )
)

Plot which libraries overall barcode counts map to for each sample:

[16]:
ordered_cats = (
    counts_by_library.sort_values(["n_libraries_w_barcode", "category"])["category"]
    .unique()
    .tolist()
)

category_selection = alt.selection_point(fields=["category"], bind="legend")

counts_by_library_chart = (
    alt.Chart(
        counts_by_library.assign(
            order=lambda x: x["category"].map(lambda s: ordered_cats.index(s))
        )
    )
    .encode(
        x=alt.X("frac_counts", scale=alt.Scale(domain=[0, 1])),
        y=alt.Y("library_sample", title=None),
        color=alt.Color("category", scale=alt.Scale(domain=ordered_cats)),
        order="order",
        tooltip=[
            alt.Tooltip(c, format=".2g") if c in {"n_counts", "frac_counts"} else c
            for c in counts_by_library.columns
            if c not in {"library_sample"}
        ],
    )
    .mark_bar()
    .properties(width=250, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_params(*selections, category_selection)
    .transform_filter(category_selection)
)
for selection in selections:
    counts_by_library_chart = counts_by_library_chart.transform_filter(selection)

counts_by_library_chart
[16]:

Get CodonVariantTable for valid variant counts

Get the variant counts for samples with sufficient valid barcode counts, and determine if there are any samples with insufficient counts that don't have `exclude_after_counts` specified:

[17]:
print(f"Requiring {config['min_avg_counts']=} average counts per variant\n")

valid_counts = counts.query("valid").drop(columns="valid")

avg_counts = (
    valid_counts.groupby(["library", "sample"], as_index=False)
    .aggregate(avg_counts=pd.NamedAgg("count", "mean"))
    .assign(adequate_counts=lambda x: x["avg_counts"] >= config["min_avg_counts"])
    .merge(barcode_runs, how="left", validate="one_to_one")
    .drop(columns=["fastq_R1", "notes"])
)

adequate_counts_selection = alt.selection_point(
    fields=["adequate_counts"],
    bind=alt.binding_select(
        options=[None] + avg_counts["adequate_counts"].unique().tolist(),
        labels=["all"] + [str(x) for x in avg_counts["adequate_counts"].unique()],
        name="adequate counts",
    ),
)

avg_counts_chart = (
    alt.Chart(avg_counts)
    .encode(
        x=alt.X("avg_counts"),
        y=alt.Y("library_sample", title=None),
        color="exclude_after_counts",
        tooltip=[
            alt.Tooltip(c, format=".2g") if c in {"avg_counts"} else c
            for c in avg_counts.columns
            if c not in {"library_sample"}
        ],
    )
    .mark_bar()
    .properties(width=250, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_params(adequate_counts_selection, *selections)
    .transform_filter(adequate_counts_selection)
)
for selection in selections:
    avg_counts_chart = avg_counts_chart.transform_filter(selection)
display(avg_counts_chart)
print(f"Saving this chart to {config['variant_avg_counts_plot']}")
avg_counts_chart.save(config["variant_avg_counts_plot"])

print(f"Writing average counts per variant to {config['variant_avg_counts_csv']}")
(
    avg_counts.assign(min_avg_counts=config["min_avg_counts"])[
        [
            "library",
            "sample",
            "avg_counts",
            "min_avg_counts",
            "adequate_counts",
            "exclude_after_counts",
        ]
    ].to_csv(config["variant_avg_counts_csv"], index=False, float_format="%.1f")
)

insufficient_counts = avg_counts.query(
    "(not adequate_counts) & (exclude_after_counts != 'yes')"
)
if len(insufficient_counts):
    print(
        f"\n{len(insufficient_counts)} have insufficient counts despite not "
        "having `exclude_after_counts` specified in `barcode_runs`"
    )
    display(insufficient_counts[["library", "sample", "avg_counts"]])
Requiring config['min_avg_counts']=15 average counts per variant

Saving this chart to results/variant_counts/avg_counts_per_variant.html
Writing average counts per variant to results/variant_counts/avg_counts_per_variant.csv

Now create a CodonVariantTable for the samples not specified for exclusion:

[18]:
geneseq = str(Bio.SeqIO.read(config["gene_sequence_codon"], "fasta").seq)

variants = dms_variants.codonvarianttable.CodonVariantTable(
    barcode_variant_file=config["codon_variants"],
    geneseq=geneseq,
    allowgaps=True,
    substitutions_are_codon=True,
    primary_target="gene",
    substitutions_col="codon_substitutions",
)

variants.add_sample_counts_df(
    valid_counts.merge(
        barcode_runs[["library", "sample", "exclude_after_counts"]],
        validate="many_to_one",
        how="left",
        on=["library", "sample"],
    ).query("exclude_after_counts == 'no'")
)
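
To quickly sanity-check the resulting table, the summary methods used later in this notebook can be called directly, e.g.:

print(variants.libraries)

display(
    variants.n_variants_df(
        libraries=variants.libraries, primary_target_only=True
    ).head()
)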

Average mutations per variant

Interactive plot of average codon mutations per variant:

[19]:
avg_muts = (
    variants.numCodonMutsByType(
        variant_type="all",
        libraries=variants.libraries,
    )
    .merge(barcode_runs, validate="many_to_one")
    .drop(
        columns=[
            "num_muts_count",
            "count",
            "antibody_concentration",
            "fastq_R1",
            "notes",
            "exclude_after_counts",
        ]
    )
    # remove categorical to enable plotting below
    .assign(mutation_type=lambda x: x["mutation_type"].tolist())
)
[20]:
mut_types_sort = avg_muts["mutation_type"].unique().tolist()

mut_type_selection = alt.selection_point(fields=["mutation_type"], bind="legend")

# drop the exclude_after_counts selection since that column was dropped from the data
selections = [
    s for (s, c) in zip(selections, selection_cols) if c != "exclude_after_counts"
]

avg_muts_chart = (
    alt.Chart(
        avg_muts.assign(
            order=lambda x: x["mutation_type"].map(lambda m: mut_types_sort.index(m))
        ),
    )
    .encode(
        x=alt.X(
            "number",
            title="average mutations per variant",
            axis=alt.Axis(grid=False),
        ),
        y=alt.Y(
            "library_sample",
            title=None,
        ),
        color=alt.Color(
            "mutation_type",
            scale=alt.Scale(domain=mut_types_sort),
        ),
        order=alt.Order("order", sort="descending"),
        tooltip=[
            alt.Tooltip(c, format=".3g") if c in ["count", "number"] else c
            for c in avg_muts.columns
            if c != "library_sample"
        ],
    )
    .mark_bar()
    .properties(height=alt.Step(13), width=300)
    .configure_axis(labelLimit=500)
    .add_params(mut_type_selection, *selections)
    .transform_filter(mut_type_selection)
)

for selection in selections:
    avg_muts_chart = avg_muts_chart.transform_filter(selection)

avg_muts_chart
[20]:

Mutation frequency across the gene

Plot frequencies of mutations across the gene.

First, get the frequencies of mutations at each site:

[21]:
# read reference site numbering
site_numbering_map = pd.read_csv(config["site_numbering_map"])

# frequencies of mutations at each site
site_freqs = (
    variants.mutCounts(
        variant_type="all",
        mut_type="aa",
        libraries=variants.libraries,
    )
    .query("count > 0")
    .rename(columns={"site": "sequential_site"})
    .merge(
        site_numbering_map,
        how="left",
        on="sequential_site",
        validate="many_to_one",
    )
    .assign(
        wildtype=lambda x: x["mutation"].str[0],
        # keep reference_site as is if the sites are already strings;
        # otherwise cast to nullable Int64
        reference_site=lambda x: (
            x["reference_site"]
            if all(x["reference_site"] == x["reference_site"].astype(str))
            else x["reference_site"].astype("Int64")
        ),
    )
    .groupby(
        ["library", "sample", "sequential_site", "reference_site", "wildtype"],
        observed=True,
        as_index=False,
    )
    .aggregate(count=pd.NamedAgg("count", "sum"))
    .merge(
        variants.n_variants_df(
            libraries=variants.libraries, primary_target_only=True
        ).rename(columns={"count": "n_variants"})
    )
    .assign(percent=lambda x: 100 * x["count"] / x["n_variants"])
    .merge(barcode_runs[["library_sample", "library", "sample"]])
    .drop(columns=["n_variants", "count", "library", "sample"])
)

Now, to make these plots smaller, we pivot the data to wide form and move most of the sample information into a separate lookup table:

[22]:
# encode samples as integers
library_sample_code = {
    y: str(x) for x, y in enumerate(site_freqs["library_sample"].unique())
}

# wide version of site freqs
site_freqs_wide = (
    site_freqs.assign(
        library_sample_code=lambda x: x["library_sample"].map(library_sample_code)
    )
    .pivot_table(
        index=["sequential_site", "reference_site", "wildtype"],
        values="percent",
        columns="library_sample_code",
        fill_value=0,
    )
    .reset_index()
)

# lookup data frame that maps sample codes to other information
library_sample_lookup = barcode_runs.assign(
    library_sample_code=lambda x: x["library_sample"].map(library_sample_code)
).query("library_sample_code.notnull()")

Now plot the frequencies:

[23]:
zoom_brush = alt.selection_interval(
    encodings=["x"],
    mark=alt.BrushConfig(stroke="black", strokeWidth=2),
)

zoom_bar = (
    alt.Chart(site_freqs[["sequential_site"]].drop_duplicates())
    .mark_rect(color="lightgrey")
    .encode(
        x=alt.X("sequential_site", title=None, scale=alt.Scale(nice=False, zero=False))
    )
    .add_params(zoom_brush)
    .properties(width=550, height=15, title="site zoom bar")
)

site_freqs_base = (
    alt.Chart(site_freqs_wide)
    .encode(
        x=alt.X("sequential_site", scale=alt.Scale(nice=False, zero=False)),
        y=alt.Y("percent:Q", title="% variants with mutation"),
        color=alt.Color("sample_type:N", legend=alt.Legend(orient="top")),
        tooltip=[
            alt.Tooltip(c, format=".3g") if c in {"percent:Q"} else c
            for c in ["sequential_site", "reference_site", "wildtype", "percent:Q"]
        ],
    )
    .properties(height=100, width=275)
)

site_freqs_chart = (
    alt.layer(
        site_freqs_base.mark_point(filled=True),
        site_freqs_base.mark_line(size=0.5),
        data=site_freqs_wide,
    )
    .transform_fold(
        list(library_sample_code.values()), ["library_sample_code", "percent"]
    )
    .transform_lookup(
        lookup="library_sample_code",
        from_=alt.LookupData(
            data=library_sample_lookup,
            key="library_sample_code",
            fields=[
                "date",
                "virus_batch",
                "library",
                "sample_type",
                "antibody",
                "sample",
                "library_sample",
            ],
        ),
    )
    .facet(facet=alt.Facet("library_sample:N", title=None), columns=3)
    .add_params(zoom_brush)
    .transform_filter(zoom_brush)
    .resolve_scale(y="independent")
)

site_freqs_zoom_chart = (zoom_bar & site_freqs_chart).configure_axis(grid=False)

site_freqs_zoom_chart
[23]:

How much do individual variants dominate counts?

We look to see which individual variants dominate the counts, doing this separately for each target:

[24]:
variant_counts = (
    variants.variant_count_df[
        ["library", "sample", "target", "barcode", "count", "aa_substitutions"]
    ]
    .merge(
        barcode_runs.drop(
            columns=[
                "fastq_R1",
                "notes",
                "antibody_concentration",
                "exclude_after_counts",
            ]
        )
    )
    .assign(
        percent=lambda x: 100
        * x["count"]
        / x.groupby(["library_sample", "target"])["count"].transform("sum")
    )
    .sort_values("percent", ascending=False)
)

Get the top 25 variants, and the 10th and 90th percentiles:

[25]:
top_n = 25

variant_counts_top_n = (
    variant_counts.groupby(["library_sample", "target"])
    .head(n=top_n)
    .merge(
        (
            variant_counts.groupby(
                ["library_sample", "target"], as_index=False
            ).aggregate(
                percentile_10=pd.NamedAgg("percent", lambda s: s.quantile(0.1)),
                percentile_90=pd.NamedAgg("percent", lambda s: s.quantile(0.9)),
                min_percent=pd.NamedAgg("percent", "min"),
                max_percent=pd.NamedAgg("percent", "max"),
            )
        ),
        validate="many_to_one",
    )
)

variant_counts_top_n.head()
[25]:
library sample target barcode count aa_substitutions date virus_batch sample_type antibody replicate library_sample percent percentile_10 percentile_90 min_percent max_percent
0 A 2022-07-20_rescue-2_no-antibody_control_3 neut_standard GTGCAGTAGTAAAGTA 3793 neut_standard 2022-07-20 rescue-2 no-antibody_control NaN 3 A_2022-07-20_rescue-2_no-antibody_control_3 17.956730 6.928467 16.366046 5.553188 17.95673
1 A 2022-07-20_rescue-2_no-antibody_control_3 neut_standard GCGATTCACGCGTTGG 3313 neut_standard 2022-07-20 rescue-2 no-antibody_control NaN 3 A_2022-07-20_rescue-2_no-antibody_control_3 15.684325 6.928467 16.366046 5.553188 17.95673
2 A 2022-07-20_rescue-2_no-antibody_control_3 neut_standard AGTAGACTCCCTCCAT 3249 neut_standard 2022-07-20 rescue-2 no-antibody_control NaN 3 A_2022-07-20_rescue-2_no-antibody_control_3 15.381338 6.928467 16.366046 5.553188 17.95673
3 A 2022-07-20_rescue-2_no-antibody_control_3 neut_standard CCATCTAGTGGCTAGG 2999 neut_standard 2022-07-20 rescue-2 no-antibody_control NaN 3 A_2022-07-20_rescue-2_no-antibody_control_3 14.197794 6.928467 16.366046 5.553188 17.95673
4 A 2022-07-20_rescue-2_no-antibody_control_3 neut_standard TACCCATGGATACGAT 2878 neut_standard 2022-07-20 rescue-2 no-antibody_control NaN 3 A_2022-07-20_rescue-2_no-antibody_control_3 13.624959 6.928467 16.366046 5.553188 17.95673

Now plot these data. The points show the top variants and can be moused over for details. The bars show the 10th to 90th percentiles, and the lines the full span. Note that these are percent frequencies of variants, not fractional frequencies, and that the x-axis uses a symlog scale (linear near zero, logarithmic beyond):

[26]:
variant_selector = alt.selection_point(
    on="mouseover",
    empty="none",
    fields=["barcode", "library", "target"],
)

target_selection = alt.selection_point(
    fields=["target"],
    bind=alt.binding_select(
        options=variant_counts_top_n["target"].unique(),
        name="target",
    ),
    value=[{"target": "gene"}],
)

variant_counts_top_n_base = (
    alt.Chart(variant_counts_top_n)
    .encode(
        y=alt.Y("library_sample", title=None),
        color=alt.Color(
            "sample_type",
            scale=alt.Scale(domain=variant_counts_top_n["sample_type"].unique()),
        ),
    )
    .mark_point(filled=True)
)

variant_counts_top_n_points = variant_counts_top_n_base.encode(
    x=alt.X("percent", title="percent of library", scale=alt.Scale(type="symlog")),
    tooltip=[
        alt.Tooltip(c, format=".3g") if c.startswith("percent") else c
        for c in variant_counts_top_n.columns
        if c not in ["library_sample", "min_percent", "max_percent"]
    ],
    opacity=alt.condition(variant_selector, alt.value(1), alt.value(0.5)),
    strokeWidth=alt.condition(variant_selector, alt.value(2), alt.value(0)),
    size=alt.condition(variant_selector, alt.value(40), alt.value(25)),
).mark_point(filled=True, stroke="black")

variant_counts_top_n_bars = variant_counts_top_n_base.encode(
    alt.X("percentile_10"),
    alt.X2("percentile_90"),
).mark_bar(size=11)

variant_counts_top_n_rule = variant_counts_top_n_base.encode(
    alt.X("min_percent"),
    alt.X2("max_percent"),
).mark_rule()

variant_counts_top_n_chart = (
    (
        variant_counts_top_n_points
        + variant_counts_top_n_bars
        + variant_counts_top_n_rule
    )
    .configure_axis(labelLimit=500)
    .add_params(variant_selector, target_selection, *selections)
    .transform_filter(target_selection)
    .properties(height=alt.Step(15), width=275)
)
for selection in selections:
    variant_counts_top_n_chart = variant_counts_top_n_chart.transform_filter(selection)

variant_counts_top_n_chart
[26]:

Check counts are consistent with variant count files

Make sure the variant counts analyzed here are the same ones written to CSV files by the pipeline for each library/sample:

[27]:
for library_sample, df in variants.variant_count_df.merge(
    barcode_runs,
    on=["library", "sample"],
    how="left",
    validate="many_to_one",
).groupby("library_sample"):
    f = os.path.join(config["variant_counts_dir"], f"{library_sample}.csv")
    print(f"Checking counts in {f}")
    assert os.path.isfile(f), f"cannot find {f}"
    pd.testing.assert_frame_equal(
        pd.read_csv(f, na_filter=None),
        df[
            [
                "barcode",
                "count",
                "codon_substitutions",
                "aa_substitutions",
                "variant_call_support",
            ]
        ]
        .sort_values(["count", "barcode"], ascending=[False, True])
        .reset_index(drop=True),
    )
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_VSVG_control_1.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_VSVG_control_2.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_antibody_1-18_2.0_1.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_antibody_1-18_2.0_2.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_antibody_1-18_4.0_1.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_antibody_PGT151_2.0_1.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_antibody_PGT151_2.0_2.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_antibody_PGT151_4.0_1.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_no-antibody_control_1.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_no-antibody_control_2.csv
Checking counts in results/variant_counts/A_2022-07-20_rescue-2_no-antibody_control_3.csv
Checking counts in results/variant_counts/A_2022-08-04_rescue-1_VSVG_control_1.csv
Checking counts in results/variant_counts/A_2022-08-04_rescue-1_VSVG_control_2.csv
Checking counts in results/variant_counts/A_2022-08-04_rescue-1_antibody_3BNC117_1.25_1.csv
Checking counts in results/variant_counts/A_2022-08-04_rescue-1_antibody_3BNC117_2.5_1.csv
Checking counts in results/variant_counts/A_2022-08-04_rescue-1_antibody_3BNC117_5.0_1.csv
Checking counts in results/variant_counts/A_2022-08-04_rescue-1_no-antibody_control_1.csv
Checking counts in results/variant_counts/A_2022-08-04_rescue-1_no-antibody_control_2.csv
Checking counts in results/variant_counts/A_2022-09-01_rescue-3_VSVG_control_1.csv
Checking counts in results/variant_counts/A_2022-09-01_rescue-3_no-antibody_control_1.csv
Checking counts in results/variant_counts/A_2022-09-01_rescue-3_no-antibody_control_2.csv
Checking counts in results/variant_counts/A_2022-09-27_rescue-3_VSVG_control_1.csv
Checking counts in results/variant_counts/A_2022-09-27_rescue-3_antibody_IDC561_2.0_1.csv
Checking counts in results/variant_counts/A_2022-09-27_rescue-3_antibody_IDC561_3.0_1.csv
Checking counts in results/variant_counts/A_2022-09-27_rescue-3_antibody_IDC561_4.0_1.csv
Checking counts in results/variant_counts/A_2022-09-27_rescue-3_no-antibody_control_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_VSVG_control_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_3BNC117_1.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_3BNC117_2.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_3BNC117_3.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDC508_1.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDC508_2.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDC508_3.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDC513_10.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDC513_5.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDC513_7.5_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDF033_1.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDF033_2.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_antibody_IDF033_3.0_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_no-antibody_control_1.csv
Checking counts in results/variant_counts/A_2022-10-17_rescue-4_no-antibody_control_2.csv
Checking counts in results/variant_counts/B_2022-08-04_rescue-2_VSVG_control_1.csv
Checking counts in results/variant_counts/B_2022-08-04_rescue-2_VSVG_control_2.csv
Checking counts in results/variant_counts/B_2022-08-04_rescue-2_antibody_IDC561_3.0_1.csv
Checking counts in results/variant_counts/B_2022-08-04_rescue-2_no-antibody_control_1.csv
Checking counts in results/variant_counts/B_2022-08-04_rescue-2_no-antibody_control_2.csv
Checking counts in results/variant_counts/B_2022-09-12_rescue-1_VSVG_control_1.csv
Checking counts in results/variant_counts/B_2022-09-12_rescue-1_antibody_PGT151_4.0_1.csv
Checking counts in results/variant_counts/B_2022-09-12_rescue-1_antibody_PGT151_8.0_1.csv
Checking counts in results/variant_counts/B_2022-09-12_rescue-1_no-antibody_control_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_VSVG_control_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDC508_1.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDC508_2.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDC508_3.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDC561_2.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDC561_3.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDC561_4.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDF033_1.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDF033_2.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_antibody_IDF033_3.0_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_no-antibody_control_1.csv
Checking counts in results/variant_counts/B_2022-09-27_rescue-3_no-antibody_control_2.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_antibody_1-18_2.0_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_antibody_1-18_3.0_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_antibody_1-18_4.0_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_antibody_3BNC117_5.0_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_antibody_IDC513_10.0_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_antibody_IDC513_5.0_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_antibody_IDC513_7.5_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_no-antibody_control_1.csv
Checking counts in results/variant_counts/B_2022-10-11_rescue-4_no-antibody_control_2.csv