In the era of COVID-19, data has become a crucial tool in understanding and combating the pandemic. The Public Health Agency of Canada has made available a dataset that provides a comprehensive view of COVID-19 cases in the country. This dataset, updated every Tuesday, is a valuable resource for data analysts of all skill levels, offering insights into the spread and impact of the virus across provinces.
Dataset Overview
The dataset, which is available under the Open Government Licence – Canada, contains information about the number of cases and deaths at the provincial and national levels from January 31, 2020, to the present. As of January 21, 2024, the dataset has 3,090 rows and 23 columns.
Data Analysis and Cleaning with Python
The first step in working with this dataset is to clean and prepare the data for analysis. This process involves several steps, including dropping unnecessary columns and values, renaming columns, and converting data types.
Python Code at GitHub
https://github.com/jsmith6417/Public-Health-Agency-of-Canada-COVID-Cases
Steps:
- Drop the columns not needed: The data is aggregate, overlapping, or not needed for the visualizations in Tableau. After this step, six columns remain.
- Drop the values for province we don’t need: The values for 'Canada' and 'Repatriated travellers' are dropped, leaving 2,678 rows.
- Rename 'prname' to 'Province'.
- Add a column for 'Country' and populate all rows with 'Canada': This allows Tableau to automatically recognize 'Country' and 'Province' as a hierarchy, facilitating the creation of a color-coded map of Canada.
- Convert 'cases' and 'deaths' to numeric values: Non-numeric data such as zeros represented as dashes and blanks are replaced with NaN, which are then replaced with zeros. This simplifies filtering in Tableau and stops the missing values warning.
- Delete the rows where the number of cases and number of deaths are both zero: These weeks have nothing to report, so they are removed, leaving 2,117 rows.
The 'date' column is not converted to datetime in Python as Tableau automatically converts the text to dates.
Data Visualization with Tableau
Tableau is a powerful tool for creating interactive visualizations. It recognizes 'Country' and 'Province' as a hierarchy, making it easy to create a color-coded map of Canada. Two True/False calculated fields are created to filter out case counts equal to zero and death counts equal to zero:
- Cases > 0: `[numtotal_last7] > 0`
- Deaths > 0: `[numdeaths_last7] > 0`
Steps:
- Create interactive maps and bar charts for cases and deaths.
- Assembled the maps and bar charts into dashboards.
- Assemble the dashboards into the final presentation story points.
Visualizations at Tableau Public
This dataset is a good resource for practicing analysis and cleaning in Python, and for creating calculated fields, sorting, grouping, and visualizing the data in Tableau. Map labels are in the tooltips to keep the presentation free of clutter.
Frequently Asked Questions
How did you decide which columns were unnecessary and should be dropped?
Since the data is updated weekly, I decided to keep only the columns with counts and rates for the last 7 days to shrink and simplify the dataset.
We have the date reported, so the reporting week and reporting year columns are not needed.
The 'pruid' and 'update' columns are for internal tracking and not necessary for creating visualizations.
The 'prnameFR' is province names in French. Since we already have the province names in English, these are not needed.
How did you handle missing or incomplete data in the dataset?
I filled in missing and non-numeric values for the counts (such as dashes) first with NA, then converted any NA values to zero.
What were the criteria for creating the True/False calculated fields in Tableau?
Checking values for each of the count columns and noting if they're greater than zero. Tableau will recognize this as boolean (T/F).
For cases > 0: [numtotal_last7] > 0
For deaths > 0: [numdeaths_last7] > 0