My class work for bimm143 at UC San Diego
Cecilia Wang (A18625854)
Pertussis (whopping cough) is a common lung infection caused by the bacteria B.Pertussis. It can infect everyone but is most deadly for infants (under 1 year of age)
The CDC tracks the number of Pertussis cases:
library(ggplot2)
plot for Year vs.Number of Cases
ggplot(cdc) +
aes(Year,Cases) +
geom_point() +
geom_line()+ theme_light()

Q. Using the ggplot geom_vline() function add lines to your previous plot for the 1946 introduction of the wP vaccine and the 1996 switch to aP vaccine (see example in the hint below).
ggplot(cdc) +
aes(Year,Cases) +
geom_point() +
geom_line()+
geom_vline(xintercept = 1946, col="blue",lty=2)+
geom_vline(xintercept = 1996, col="red",lty=2)+
geom_vline(xintercept = 2020, col="purple",lty=2)+
theme_light()

After introduce the wP vaccine, the number of cases per year significantly decreased. However, after introduce aP vaccine, the number of cases per year slightly increase than before. This could due to multiple reasons: less newborn getting vaccine; the aP vaccine is less effective than wP vaccine due to the fewer antigens; waning immunity (immunity from aP vaccine decline faster than wP vaccine).
The CMI-PB project’s https://www.cmi-pb.org/ mission is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of Pertussis booster vaccination.
Basically, make a available large dataset on immune response to Pertussis. They use a booster vaccination as a proxy for Pertussis infection.
They make their data available as JSON format API. We can read this into
R with the read_json() function from the jsonlite package.
library(jsonlite)
subject <- read_json("https://www.cmi-pb.org/api/subject", simplifyVector = TRUE)
head(subject, 3)
subject_id infancy_vac biological_sex ethnicity race
1 1 wP Female Not Hispanic or Latino White
2 2 wP Female Not Hispanic or Latino White
3 3 wP Female Unknown White
year_of_birth date_of_boost dataset
1 1986-01-01 2016-09-12 2020_dataset
2 1968-01-01 2019-01-28 2020_dataset
3 1983-01-01 2016-10-10 2020_dataset
Q. How many aP and wP individuals are there in this
subjecttable?
Q. How many male and female are there?
Q. What is the break down of gender and race?
table(subject$infancy_vac)
aP wP
87 85
table(subject$biological_sex)
Female Male
112 60
table(subject$race, subject$biological_sex)
Female Male
American Indian/Alaska Native 0 1
Asian 32 12
Black or African American 2 3
More Than One Race 15 4
Native Hawaiian or Other Pacific Islander 1 1
Unknown or Not Reported 14 7
White 48 32
specimen <- read_json("http://cmi-pb.org/api/v5_1/specimen", simplifyVector = TRUE)
ab_titer <- read_json("http://cmi-pb.org/api/v5_1/plasma_ab_titer", simplifyVector = TRUE)
head(specimen,3)
specimen_id subject_id actual_day_relative_to_boost
1 1 1 -3
2 2 1 1
3 3 1 3
planned_day_relative_to_boost specimen_type visit
1 0 Blood 1
2 1 Blood 2
3 3 Blood 3
head(ab_titer,3)
specimen_id isotype is_antigen_specific antigen MFI MFI_normalised
1 1 IgE FALSE Total 1110.21154 2.493425
2 1 IgE FALSE Total 2708.91616 2.493425
3 1 IgG TRUE PT 68.56614 3.736992
unit lower_limit_of_detection
1 UG/ML 2.096133
2 IU/ML 29.170000
3 IU/ML 0.530000
To make sense of all this data we need to “join” (a.k.a. “merge” or
“link”) all these tables together. Only then will you know that a given
Ab measurement (from the ab_titier table ) was collected on a certain
data (from the spicement table ) from a certain wP or aP subject ( the
subject table).
We can use dplyr and the *_join() family of functions to do this.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Join specimen and subject tables to make a new merged data frame
meta <- inner_join(subject,specimen)
Joining with `by = join_by(subject_id)`
head(meta)
subject_id infancy_vac biological_sex ethnicity race
1 1 wP Female Not Hispanic or Latino White
2 1 wP Female Not Hispanic or Latino White
3 1 wP Female Not Hispanic or Latino White
4 1 wP Female Not Hispanic or Latino White
5 1 wP Female Not Hispanic or Latino White
6 1 wP Female Not Hispanic or Latino White
year_of_birth date_of_boost dataset specimen_id
1 1986-01-01 2016-09-12 2020_dataset 1
2 1986-01-01 2016-09-12 2020_dataset 2
3 1986-01-01 2016-09-12 2020_dataset 3
4 1986-01-01 2016-09-12 2020_dataset 4
5 1986-01-01 2016-09-12 2020_dataset 5
6 1986-01-01 2016-09-12 2020_dataset 6
actual_day_relative_to_boost planned_day_relative_to_boost specimen_type
1 -3 0 Blood
2 1 1 Blood
3 3 3 Blood
4 7 7 Blood
5 11 14 Blood
6 32 30 Blood
visit
1 1
2 2
3 3
4 4
5 5
6 6
One more inner_join() to join ab_titer with our metadata.
abdata <- inner_join(meta,ab_titer)
Joining with `by = join_by(specimen_id)`
head(abdata)
subject_id infancy_vac biological_sex ethnicity race
1 1 wP Female Not Hispanic or Latino White
2 1 wP Female Not Hispanic or Latino White
3 1 wP Female Not Hispanic or Latino White
4 1 wP Female Not Hispanic or Latino White
5 1 wP Female Not Hispanic or Latino White
6 1 wP Female Not Hispanic or Latino White
year_of_birth date_of_boost dataset specimen_id
1 1986-01-01 2016-09-12 2020_dataset 1
2 1986-01-01 2016-09-12 2020_dataset 1
3 1986-01-01 2016-09-12 2020_dataset 1
4 1986-01-01 2016-09-12 2020_dataset 1
5 1986-01-01 2016-09-12 2020_dataset 1
6 1986-01-01 2016-09-12 2020_dataset 1
actual_day_relative_to_boost planned_day_relative_to_boost specimen_type
1 -3 0 Blood
2 -3 0 Blood
3 -3 0 Blood
4 -3 0 Blood
5 -3 0 Blood
6 -3 0 Blood
visit isotype is_antigen_specific antigen MFI MFI_normalised unit
1 1 IgE FALSE Total 1110.21154 2.493425 UG/ML
2 1 IgE FALSE Total 2708.91616 2.493425 IU/ML
3 1 IgG TRUE PT 68.56614 3.736992 IU/ML
4 1 IgG TRUE PRN 332.12718 2.602350 IU/ML
5 1 IgG TRUE FHA 1887.12263 34.050956 IU/ML
6 1 IgE TRUE ACT 0.10000 1.000000 IU/ML
lower_limit_of_detection
1 2.096133
2 29.170000
3 0.530000
4 6.205949
5 4.679535
6 2.816431
Q. How many different Ab “isotypes” are in this dataset?
Q. How many different “antigen” values are measured?
table(abdata$isotype)
IgE IgG IgG1 IgG2 IgG3 IgG4
6698 7265 11993 12000 12000 12000
table(abdata$antigen)
ACT BETV1 DT FELD1 FHA FIM2/3 LOLP1 LOS Measles OVA
1970 1970 6318 1970 6712 6318 1970 1970 1970 6318
PD1 PRN PT PTM Total TT
1970 6712 6712 1970 788 6318
Let’s focus on IgG isotype
igg <- abdata |>
filter(isotype=="IgG")
head(igg)
subject_id infancy_vac biological_sex ethnicity race
1 1 wP Female Not Hispanic or Latino White
2 1 wP Female Not Hispanic or Latino White
3 1 wP Female Not Hispanic or Latino White
4 1 wP Female Not Hispanic or Latino White
5 1 wP Female Not Hispanic or Latino White
6 1 wP Female Not Hispanic or Latino White
year_of_birth date_of_boost dataset specimen_id
1 1986-01-01 2016-09-12 2020_dataset 1
2 1986-01-01 2016-09-12 2020_dataset 1
3 1986-01-01 2016-09-12 2020_dataset 1
4 1986-01-01 2016-09-12 2020_dataset 2
5 1986-01-01 2016-09-12 2020_dataset 2
6 1986-01-01 2016-09-12 2020_dataset 2
actual_day_relative_to_boost planned_day_relative_to_boost specimen_type
1 -3 0 Blood
2 -3 0 Blood
3 -3 0 Blood
4 1 1 Blood
5 1 1 Blood
6 1 1 Blood
visit isotype is_antigen_specific antigen MFI MFI_normalised unit
1 1 IgG TRUE PT 68.56614 3.736992 IU/ML
2 1 IgG TRUE PRN 332.12718 2.602350 IU/ML
3 1 IgG TRUE FHA 1887.12263 34.050956 IU/ML
4 2 IgG TRUE PT 41.38442 2.255534 IU/ML
5 2 IgG TRUE PRN 174.89761 1.370393 IU/ML
6 2 IgG TRUE FHA 246.00957 4.438960 IU/ML
lower_limit_of_detection
1 0.530000
2 6.205949
3 4.679535
4 0.530000
5 6.205949
6 4.679535
Make a plot of MFI-normalized values for all antigen values.
ggplot(igg) +
aes(MFI_normalised, antigen) +
geom_boxplot()

The antigens “PT”, “PRN”, FIM2/3”, and “FHA” appear to have the widest range of values.
Q. Is there a differnce for these responses between aP and wP individuals?
ggplot(igg) +
aes(MFI_normalised, antigen, col=infancy_vac) +
geom_boxplot()

ggplot(igg) +
aes(MFI_normalised, antigen) +
geom_boxplot() +
facet_wrap(~infancy_vac)

Q. Is there a difference with time (i.e. before booster shot vs after booster shot)
ggplot(igg) +
aes(MFI_normalised, antigen, col=infancy_vac) +
geom_boxplot() +
facet_wrap(~visit)

abdata.21 <- abdata %>% filter(dataset == "2021_dataset")
abdata.21 %>%
filter(isotype == "IgG", antigen == "PT") %>%
ggplot() +
aes(x=planned_day_relative_to_boost,
y=MFI_normalised,
col=infancy_vac,
group=subject_id) +
geom_point() +
geom_line() +
geom_vline(xintercept=0, linetype="dashed") +
geom_vline(xintercept=14, linetype="dashed") +
labs(title="2021 dataset IgG PT",
subtitle = "Dashed lines indicate day 0 (pre-boost) and 14 (apparent peak levels)") +
geom_smooth(aes(group = infancy_vac, color = infancy_vac), se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
