Research Report

Investigating Causal links from Observed Features in the first COVID-19 Waves in California



Good S & O'Hare A (2023) Investigating Causal links from Observed Features in the first COVID-19 Waves in California. ArXiv: Ithaca, New York.

Determining who is at risk from a disease is important for protecting vulnerable subpopulations during an outbreak. We are currently in a SARS-CoV-2 (commonly referred to as COVID-19) pandemic, which has had a massive impact across the world, with some communities and individuals seen to have a higher risk of severe outcomes and death from the disease than others. These risks are compounded for people of lower socioeconomic status, those who have limited access to health care, and those with higher rates of chronic diseases such as hypertension, type-2 diabetes, and obesity, likely due to the chronic stress of these living conditions. Essential workers are also at higher risk of COVID-19 because the nature of their work exposes them more often. In this study we determine the important features of the pandemic in California in terms of cumulative cases and deaths per 100,000 of population up to 5 July 2021 (the date of analysis), using Pearson correlation coefficients between population demographic features and cumulative cases and deaths. The most highly correlated features, ranked by the absolute value of their Pearson correlation coefficients with cases or deaths per 100,000, were used to create regression models in two ways: using the top 5 features, and using the top 20 features filtered to limit interactions between features. These models were used to determine a) the most significant features within these subsets and b) features that approximate different potential forces on COVID-19 cases and deaths (especially in the case of the latter set). Additionally, co-correlations, defined as demographic features not within a given input feature set for the regression models but strongly correlated with the features included in it, were calculated for all features.
The five features which had the highest correlations to cumulative cases per 100,000 were found to be the following: Overcrowding (% of households), Average Household Size, Hispanic ethnicity (% of population), Ages 0-19 (% of population), education level of 9th to 12th grade with no high school diploma (% of population older than 25 years), and incidence rates of Long-term Diabetes Complications (per 100,000 population). For cumulative deaths per 100,000, the feature set was similar except Overcrowding (% of households) replaced Long-term Diabetes Complications. The feature set for uncorrelated features was the same for both cases and deaths. This set comprised Overcrowding (% of households), Wholesale trade (% of workforce employed in), ‘Transportation, warehousing, and utilities’ (% of workforce employed in), and ‘Graduate or professional degree’ (% of population older than 25 years).
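
The ranking and co-correlation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy county values, and the 0.9 co-correlation threshold are all assumptions made for illustration, and the filtering of the top 20 features and the regression models themselves are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_features(features, outcome, k):
    """Return the k feature names with the largest |r| against the outcome."""
    scores = {name: pearson(vals, outcome) for name, vals in features.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]

def co_correlated(features, selected, threshold):
    """Features outside `selected` with |r| >= threshold against any selected feature.

    The threshold is a hypothetical parameter; the abstract does not state one.
    """
    found = []
    for name, vals in features.items():
        if name in selected:
            continue
        if any(abs(pearson(vals, features[s])) >= threshold for s in selected):
            found.append(name)
    return found

# Toy, made-up values for four hypothetical counties (not data from the study).
features = {
    "overcrowding": [1.0, 2.0, 3.0, 4.0],
    "household_size": [2.0, 2.5, 3.5, 4.0],
    "graduate_degree": [4.0, 3.0, 3.0, 1.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
}
cases_per_100k = [10.0, 20.0, 30.0, 40.0]

top = rank_features(features, cases_per_100k, 2)
extra = co_correlated(features, top, 0.9)
print(top, extra)
```

Ranking by absolute value means that a strongly negative correlation, such as the hypothetical graduate-degree column here, counts as much as a strongly positive one.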

Status: In Press
Publication date online: 25/03/2023
Place of publication: ArXiv: Ithaca, New York

People (1)


Dr Anthony O'Hare

Lecturer in Mathematics, Mathematics

