library(tidyverse)
library(ggplot2)
library(lubridate)

Analyzing COVID Data from CSSEGIS

John Hopkins University Center for Systems Science and Engineering (JHU CSSE) compiles and hosts global and US data on the COVID-19 pandemic. The data is split into cases and deaths. Through work done in class, the data has been tidied and joined to form aggregate tables for the global and US data.

The original load, analysis, models and graphs are below. For additional analysis, I decided to focus on answering the question: Which counties in Colorado fared the ‘worst’ during COVID?

DSTA In Class Analysis

Load Data

global_cases <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
spec(global_cases)
global_deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
US_cases <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
US_deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv")

global_pop <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
)

Tidy Data

global_cases <- global_cases %>% pivot_longer(cols = -c(`Province/State`, `Country/Region`,Lat, Long), names_to="date", values_to = "cases") %>% select(-c(Lat,Long))

global_deaths <- global_deaths %>% pivot_longer(cols = -c(`Province/State`, `Country/Region`,Lat, Long), names_to="date", values_to = "deaths") %>% select(-c(Lat,Long))

US_cases <- US_cases %>% pivot_longer(cols = -c(`Province_State`, `Country_Region`,Lat, Long_,Combined_Key,Admin2,FIPS,code3,iso3,iso2,UID), names_to="date", values_to = "cases") %>% select(-c(Lat,Long_,UID,iso2,iso3,code3,FIPS))

US_deaths <- US_deaths %>% pivot_longer(cols = -c(`Province_State`, `Country_Region`,Lat, Long_,Combined_Key,Admin2,FIPS,code3,iso3,iso2,UID,Population), names_to="date", values_to = "deaths") %>% select(-c(Lat,Long_,UID,iso2,iso3,code3,FIPS))

# full joins
global <- global_cases %>% full_join(global_deaths) %>%
  rename(Country_Region = `Country/Region`, Province_State = `Province/State`) %>%
  mutate(date =mdy(date))

global <- global %>% unite("Combined_key", c(Province_State, Country_Region), sep=", ", na.rm=TRUE, remove=FALSE)

global <- global %>% left_join(global_pop, by =c("Province_State", "Country_Region")) %>% select(-c(UID,FIPS)) %>% select (Province_State, Country_Region, date, cases, deaths, Population, Combined_key)

global_pos <- global %>% filter(cases > 0)

us <- US_cases %>% full_join(US_deaths)

Visuals

us <- us %>% mutate(date = mdy(date))

US_by_state <- us %>% group_by(Province_State, Country_Region, date) %>% summarize(cases = sum(cases), deaths = sum(deaths), Population = sum(Population)) %>% mutate(deaths_per_mill = deaths *1000000 / Population) %>% select(Province_State, Country_Region, date, cases, deaths, deaths_per_mill, Population) %>% ungroup()

US_totals <- us %>% group_by(Country_Region, date) %>% summarize(cases = sum(cases), deaths = sum(deaths), Population = sum(Population)) %>% mutate(deaths_per_mill = deaths *1000000 / Population) %>% select(Country_Region, date, cases, deaths, deaths_per_mill, Population) %>% ungroup()

US_totals %>% filter(cases >0) %>%
  ggplot(aes(x = date, y = cases)) + 
  geom_line(aes(color= "cases")) +
  geom_point(aes(color = "cases"))+
  geom_line(aes(y = deaths, color = "deaths")) +
  geom_point(aes(y = deaths, color = "deaths")) +
  scale_y_log10() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 90)) + labs(title = "COVID19 in US", y=NULL)

US_by_state %>% filter(cases >0, Province_State == 'Colorado') %>%
  ggplot(aes(x = date, y = cases)) + 
  geom_line(aes(color= "cases")) +
  geom_point(aes(color = "cases"))+
  geom_line(aes(y = deaths, color = "deaths")) +
  geom_point(aes(y = deaths, color = "deaths")) +
  scale_y_log10() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 90)) + labs(title = "COVID19 in US", y=NULL)

Analyzing Data

US_by_state2 <- US_by_state %>%
  mutate(new_cases = cases - lag(cases), 
         new_deaths = deaths - lag(deaths)
         )

US_totals2 <- US_totals %>%
  mutate(new_cases = cases -lag(cases),
         new_deaths = deaths - lag(deaths))

US_totals2 %>% filter(cases >0) %>%
  ggplot(aes(x = date, y = new_cases)) + 
  geom_line(aes(color= "cases")) +
  geom_point(aes(color = "cases"))+
  geom_line(aes(y = new_deaths, color = "deaths")) +
  geom_point(aes(y = new_deaths, color = "deaths")) +
  scale_y_log10() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 90)) + labs(title = "COVID19 in US", y=NULL)

US_by_state2 %>% filter(cases >0, Province_State == 'Colorado') %>%
  ggplot(aes(x = date, y = new_cases)) + 
  geom_line(aes(color= "cases")) +
  geom_point(aes(color = "cases"))+
  geom_line(aes(y = new_deaths, color = "deaths")) +
  geom_point(aes(y = new_deaths, color = "deaths")) +
  scale_y_log10() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 90)) + labs(title = "COVID19 in US", y=NULL)

US_state_totals <- US_by_state2 %>% group_by(Province_State) %>% summarize (deaths = max(deaths), cases = max(cases), population = max(Population), cases_per_thou = 1000* cases/population, deaths_per_thou = 1000*deaths/population) %>% filter(cases >0, population >0)

US_state_totals %>% slice_min(deaths_per_thou, n=10) %>% select(deaths_per_thou, cases_per_thou, everything())

US_state_totals %>% slice_max(deaths_per_thou, n=10) %>% select(deaths_per_thou, cases_per_thou, everything())

Modeling

mod <- lm(deaths_per_thou ~ cases_per_thou, data = US_state_totals)

US_state_totals_wp <- US_state_totals %>% mutate(pred = predict(mod))

US_state_totals_wp %>% ggplot() +
  geom_point(aes(x = cases_per_thou, y = deaths_per_thou), color = "blue") + 
  geom_point(aes(x = cases_per_thou, y = pred), color = "red")

Additional Analysis

Which county in Colorado fared the ‘worst’ during COVID?

Filter Data

# Filter to CO, arrange by date
co <- us %>% filter(Province_State == 'Colorado')%>% mutate(cases_per_thou = 1000 * cases/Population, deaths_per_thou = 1000*deaths/Population) %>% select(-c('Country_Region', 'Combined_Key')) %>% arrange(mdy(date)) %>% filter(!(Admin2 %in% c('Out of CO', 'Unassigned')))

# Rename Columns
colnames(co)[colnames(co) %in% c("Admin2", "Province_State", "Population")] <- c("county", "state", "population")

# Lag for new cases and deaths
co <- co %>% group_by(county) %>% mutate(new_cases = cases - lag(cases), new_deaths = deaths - lag(deaths), new_cases_per_thou = cases_per_thou - lag(cases_per_thou), new_deaths_per_thou = deaths_per_thou - lag(deaths_per_thou)) 
co <- co %>% ungroup() %>% group_by(county) %>% mutate(avg_cases = cummean(ifelse(is.na(new_cases), 0, new_cases)), avg_deaths = cummean(ifelse(is.na(new_deaths), 0, new_deaths)), avg_case_p_thou = cummean(ifelse(is.na(new_cases_per_thou), 0, new_cases_per_thou)), avg_death_p_thou = cummean(ifelse(is.na(new_deaths_per_thou), 0, new_deaths_per_thou)))

# Filtering out to last data to get overall sums and averages
co_end <- co %>% filter(date == max(date))

# Pivot for easier data manipulation
co_end2 <- pivot_longer(co_end, c("cases","population","deaths","cases_per_thou","deaths_per_thou","avg_cases","avg_deaths","avg_case_p_thou","avg_death_p_thou","new_cases","new_deaths","new_cases_per_thou","new_deaths_per_thou"), names_to='variable', values_to = 'value')

Plots

Total Cases and Deaths

# Find top 10 counties
co_top_total <- co_end %>% ungroup %>% slice_max(cases, n=10)
co_top_total_counties <- c(co_top_total$county)

co_end2 %>% filter(variable %in% c('cases', 'deaths'), county %in% co_top_total_counties) %>%
  ggplot() +
  geom_col(mapping = aes(x = reorder(county, -value), y= value/1000, fill = variable)) +
  labs(x="County", y="Total (Thousands)", title="Total Cases and Deaths in the Top 10 Counties") +
  scale_fill_discrete(name="Statistic",labels=c("Total Cases", "Total Deaths")) +
  theme(axis.text.x = element_text(angle = 45)) +
  facet_wrap(vars(variable), scales='free')

From the above visual, it looks like El Paso, Denver, Arapahoe, Adams, Jefferson, and Larimer counties did not do well during the Pandemic. However; this does not take into account population. El Paso (Colorado Springs), Denver, Arapahoe, Jefferson, and Adams county are the most populated counties in Colorado in that order. Douglas county is slightly larger population wise than Larimer, but appears lower down on the overall case and deaths.

Average Cases and Deaths

# Find top 10 counties
co_top_avg <- co_end %>% ungroup %>% slice_max(avg_cases, n=10)
co_top_avg_counties <- c(co_top_avg$county)

co_end2 %>% filter(variable %in% c('avg_cases', 'avg_deaths'), county %in% co_top_avg_counties) %>%
  ggplot() +
  geom_col(mapping = aes(x = reorder(county, -value), y= value, fill = variable)) +
  labs(x="County", y="Average (per day)", title="Average Cases and Deaths in the Top 10 Counties") +
  scale_fill_discrete(name="Statistic",labels=c("Average Cases", "Average Deaths")) +
  theme(axis.text.x = element_text(angle = 45)) +
  facet_wrap(vars(variable), scales='free')

Looking at the average new case and deaths visual, we again see El Paso, Denver, Arapahoe, Adams, Jefferson, and Larimer county, or some of the most populated counties.

Average Cases and Deaths

# Find top 10 counties
co_top_avg_p <- co_end %>% ungroup %>% slice_max(avg_cases, n=10)
co_top_avg_p_counties <- c(co_top_avg_p$county)

co_end2 %>% filter(variable %in% c('avg_case_p_thou', 'avg_death_p_thou'), county %in% co_top_avg_p_counties) %>%
  ggplot() +
  geom_col(mapping = aes(x = reorder(county, -value), y= value, fill = variable)) +
  labs(x="County", y="Total", title="Average Cases and Deaths per Thousand in the Top 10 Counties") +
  scale_fill_discrete(name="Statistic",labels=c("Average Cases per Thousand", "Average Deaths per Thousand")) +
  theme(axis.text.x = element_text(angle = 45)) +
  facet_wrap(vars(variable), scales='free')

The average cases per thousand, calculated by finding the change in the previous and current cases per thousand over time, begins to show a different story. Denver county is no longer the second highest county when represented in cases per thousand. Pueblo, Jefferson, and Weld counties also have noticeably higher average deaths per thousand that other counties.

Deaths Over Time

# Find top 10 counties
co_top_deaths <- co_end %>% ungroup %>% slice_max(deaths, n=10)
co_top_deaths_counties <- c(co_top_deaths$county)

co %>% filter(county %in% co_top_deaths_counties) %>%
  ggplot() +
  geom_line(mapping = aes(x = date, y = deaths, color = county)) +
  labs(x="Date", y="Total Deaths", title="Total Deaths Over Time in the Top 10 Counties")

co_top_deaths %>% select(county, population, cases, deaths, cases_per_thou, deaths_per_thou)
county population cases deaths cases_per_thou
El Paso 720403 237195 1904 329.2532
Jefferson 582881 160677 1554 275.6600
Denver 727211 224919 1524 309.2899
Adams 517421 166371 1466 321.5389
Arapahoe 656590 197066 1368 300.1355
Pueblo 168424 61961 880 367.8870
Weld 324492 106227 780 327.3640
Mesa 154210 51841 639 336.1715
Larimer 356899 109313 606 306.2855
Douglas 351154 100968 498 287.5320

While potentially difficult to see from the above graph, El Paso is the top green line, followed by Jefferson and Denver (green and yellow), then Adams (red) and Arapahoe (yellow). All counties follow a similar trends overall; however, El Paso had a noticeably steeper increase from mid-2021 to 2022 compared to other counties.

Modeling

co_top_pop <- co_end %>% ungroup %>% slice_max(population, n=20)
co_top_pop_counties <- c(co_top_pop$county)
co_top_pop_only <- co %>% filter(county %in% co_top_pop_counties)

mod_deaths_top <- lm(new_deaths ~ date , data = co_top_pop_only)
mod_deaths <- lm(new_deaths ~ date , data = co)

z <- as.Date(paste(2020, 5, 1, sep = "-"))
z2 <- as.Date(paste(2022, 11, 15, sep = "-"))

co %>% filter(county %in% co_top_pop_counties) %>%
  ggplot() +
  geom_line(mapping = aes(x = date, y = new_deaths, color = county)) +
  labs(x="Date", y="New Deaths", title="New Deaths Over Time in the 20 Most Populated Counties") +
  geom_abline(mapping=aes(intercept=summary(mod_deaths_top)$coefficients[1],slope=summary(mod_deaths_top)$coefficients[2]), col = 'black', linetype = 2, size=1) +
  geom_abline(mapping=aes(intercept=summary(mod_deaths)$coefficients[1],slope=summary(mod_deaths)$coefficients[2]), col = 'grey', linetype = 2, size=1)

## unused code
  # geom_hline(mapping=aes(yintercept = mean(co_end$avg_deaths)), col = 'darkgrey', linetype = 3, size=1)
  # geom_hline(mapping=aes(yintercept = mean(co_top_pop$avg_deaths)), col = 'black', linetype = 3, size=1)
  # geom_text(aes(z,1.3, label='Top 20 Average'), size = 3.5) +
  # geom_text(aes(z,1.8, label='Overall Average'), size = 3.5, color ='grey') +
  # geom_text(aes(z2,2.5, label='Top 20 Model'), size = 3.5) +
  # geom_text(aes(z2,3.5, label='Overall Model'), size = 3.5, color ='grey') 

While potentially difficult to distinguish, especially due to the large spike in new deaths in early 2020, the black and grey lines show the linear model (overall and top 20 counties by population). The slope coefficients are printed below:

summary(mod_deaths_top)$coefficients[2]
summary(mod_deaths)$coefficients[2]
[1] -0.0003302051
[1] -0.0001071107

While small, they are both positive, meaning that the liner model is predicting COVID deaths to rise over time. The linear model is likely not a good way to model or predict future trends with this data; however, there have been noticeable spikes at the end of the year, likely centered around holiday travel and COVID spread.

co <- co %>% ungroup() %>% group_by(county) %>% mutate(county_avg_deaths = mean(new_deaths, na.rm=T))
co <- co %>% ungroup() %>% mutate(above_avg_deaths_county = case_when(new_deaths > county_avg_deaths ~ 1, TRUE ~ 0), total_nonzero_death = case_when(new_deaths > 0 ~ 1, TRUE ~ 0))

# Overall and just looking at the 20 most populated counties, Colorado had an average of below 1 deaths per (reported) day!
print(mean(co$new_deaths, na.rm=T))
print(mean(co_top_pop_only$new_deaths, na.rm=T))

co_overall <- co %>% group_by(county) %>% summarize(county_avg_deaths = max(county_avg_deaths), total_above_avg_deaths_county = sum(above_avg_deaths_county), total_total_nonzero_death = sum(total_nonzero_death)) %>% filter(county_avg_deaths > 1) %>% mutate(pct_days_over_county_avg = total_above_avg_deaths_county/total_total_nonzero_death *100)

co_overall
county county_avg_deaths total_above_avg_deaths_county total_total_nonzero_death pct_days_over_county_avg
Adams 1.283713 322 590 54.57627
Arapahoe 1.197898 301 521 57.77351
Denver 1.334501 313 562 55.69395
El Paso 1.667250 386 622 62.05788
Jefferson 1.360771 335 588 56.97279

Overall, Colorado on average and the top 20 most populated counties on average had a below 1 deaths per reported day of data! Because of this, comparing average new deaths per day to these averages becomes difficult since only whole numbers of people can die. Any county with one or more death would be over both averages. Similarly, county overall averages below 1 would also cause any day with over 1 death to be higher than their overall average.

The data is filtered to just the counties with average new deaths higher than 1, which are Adams, Arapahoe, Denver, El Paso, and Jefferson. All have average new death rates lower than 2, but El Paso has the highest with ~1.73 average deaths overall. When comparing the total days above the overall average deaths (i.e. above 1) compared to the days where there was at least one death, El Passo has the higest percent of days where the total deaths were over 2. When potentially comparing against the entire days recorded, because El Paso has the highest days with non-zero COVID deaths, El Paso would still have the highest rate comparing days above their county average.

From the visualizations above and this analysis, I would say that El Paso likely fared the worst during COVID.

Bias

Living in Colorado, I have developed my own biases towards certain counties.

Because there is a large population variation between the large and small counties in Colorado, it’s difficult to compare small counties to larger counties. Even when comparing larger counties, population alone may not tell the full story. Some counties have a more dense population than others, potentially increasing the spread of COVID due to unavoidable contact.

Lastly, with the increasing amount of at home testing options, not all individuals may report a positive COVID test or have their case reported.