Use an open data API to analyse COVID-19 data.
In this activity, you will be asked to do three things:
- Query an open data API for public COVID information;
- Mold this information into several dataframes according to our instructions;
- Create quick data transformations and plots to answer specific questions.
This activity’s solutions should be provided in a single IPython Notebook file, named CW2_A1.ipynb.
The UK government has a portal with data about the Coronavirus in the UK; it contains data on cases, deaths, and vaccinations on national and local levels. The portal is available on this page: https://coronavirus.data.gov.uk. You can acquire the data by querying its API.
We ask you to use the requests library in order to communicate with the API. The documentation is available at: https://docs.python-requests.org/en/latest. Read carefully the API documentation at https://coronavirus.data.gov.uk/details/developers-guide/main-api.
Then complete the following tasks in order to acquire the data.
Create a function get_API_data(filters, structure) that sends a specific query to the API and retrieves all the data provided by the API that matches the query. The function requires two arguments:
- filters (dictionary) are the filters to be passed to the query, as specified in the API documentation. This will be a dictionary where keys are filter metrics and values are the values of those metrics. For example, you may want to filter the data by nation, date etc. As seen in the API documentation, filters are passed to the API’s URL as a URL parameter. This means you will have to format filters inside get_API_data in a way that the API can accept it as an argument.
- structure (dictionary) will specify what information the query should return, again as specified in the API documentation. This will be a dictionary where the keys are the names you wish to give to each metric, and the values are the metrics as specified in the API. The structure argument specifies which attributes you wish to obtain from the records that match the filters, such as date, region, daily casualties etc. The argument is passed as a URL parameter to the API’s URL. This means you will have to format structure inside get_API_data in a way that the API can accept it as an argument.
The function get_API_data should return a list of dictionaries answering the query.
To ensure you receive all data matching your query, use the page URL parameter. The function should get data from all pages and return everything as a single list of dictionaries.
An example of the full URL with the filters, structure, and page parameters defined can be seen in Listing ; this URL, when queried, returns the first page (pages begin at 1) of data at a regional level, retrieving only the date and the new cases by publishing date, and naming them date and newCases, respectively.
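Assuming the endpoint is the dashboard's v1 API described in the developers' guide, a minimal sketch of get_API_data could look as follows. The filter separator, the JSON-encoded structure, and the pagination field are taken from that guide; verify every detail against the documentation before relying on it.

```python
import json

import requests

BASE_URL = "https://api.coronavirus.data.gov.uk/v1/data"


def format_filters(filters):
    # The API expects filters as semicolon-separated key=value pairs,
    # e.g. "areaType=region;date=2021-01-01".
    return ";".join(f"{key}={value}" for key, value in filters.items())


def get_API_data(filters, structure):
    """Fetch every page matching `filters`; return a single list of dicts."""
    results, page = [], 1
    while True:
        params = {
            "filters": format_filters(filters),
            "structure": json.dumps(structure),  # structure travels as a JSON string
            "page": page,
        }
        response = requests.get(BASE_URL, params=params)
        response.raise_for_status()
        body = response.json()
        results.extend(body["data"])
        # The response's "pagination" block tells us whether more pages exist.
        if body.get("pagination", {}).get("next") is None:
            break
        page += 1
    return results
```

requests builds and URL-encodes the query string from `params`, so you never concatenate the URL by hand.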
Write a script that calls the function get_API_data twice, producing two lists of dictionaries: results_json_national and results_json_regional. Both lists should consist of dictionaries with the following key-value pairs:
- date (string): The date to which this observation corresponds;
- name (string) : The name of the area covered by this observation (could be a nation, region, a local authority, etc);
- daily_cases (numeric) : The number of new cases at that date and in that area by specimen date;
- cumulative_cases (numeric) : The cumulative number of cases at that date and in that area by specimen date;
- daily_deaths (numeric) : The number of new deaths at that date and in that area after 28 days of a positive test, by publishing date;
- cumulative_deaths (numeric) : The cumulative number of deaths at that date and in that area after 28 days of a positive test, by publishing date;
- cumulative_vaccinated (numeric) : The cumulative number of people who completed their vaccination (both doses) by vaccination date;
- vaccination_age (dictionary or list of dictionaries) : A demographic breakdown of cumulative vaccinations by age intervals for all people.
The first list of dictionaries obtained (results_json_national) should have data at the national level (England, Wales, Scotland, Northern Ireland). The second (results_json_regional) should have data at a regional level (London, North West, North East, etc). Both should contain data for all dates covered by the API.
Attention: Do not query the API too often, as you might be blocked or compromise the API’s service. The API service is used by many other organisations, which rely on it for vital tasks. It is your responsibility to query the API by respecting its rules. We ask students to keep requests under 10 requests every 100 seconds, and 100 total requests every hour. When querying the API, if your response has a 429 status code (or a similar code indicating your query failed), check for a header called “Retry-After”, which indicates how much time you have to wait before doing another query; you should wait that long.
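The waiting behaviour described above can be sketched as a small wrapper around requests.get. The helper name get_with_retry, the attempt limit, and the fallback wait are our own illustrative choices, not part of the assignment.

```python
import time

import requests


def get_with_retry(url, params=None, max_attempts=5):
    """GET a URL, backing off when the server rate-limits us (HTTP 429)."""
    for _ in range(max_attempts):
        response = requests.get(url, params=params)
        if response.status_code != 429:
            return response
        # Respect the Retry-After header (in seconds) if present;
        # otherwise fall back to a conservative default wait.
        wait = int(response.headers.get("Retry-After", 30))
        time.sleep(wait)
    raise RuntimeError("API kept rate-limiting after several attempts")
```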
The two lists of dictionaries obtained above are a good start. However, they are not the easiest format for turning data into insight.
In the following, you will take the data from these lists of dictionaries and turn it into Pandas dataframes. Dataframes have quick transformation, summarising, and plotting functionalities which let you analyse the data more easily.
The code should use native Pandas methods. Implementing the functionality manually (e.g. using loops or directly accessing the arrays inside the dataframes) will be penalised. Follow the library’s documentation. Remember that Pandas methods can very often be chained; use that to your advantage.
Concatenate the two lists of dictionaries (results_json_national and results_json_regional) into a single list.
Transform this list into a dataframe called covid_data, which should now have one column for each metric retrieved from the API (date, name, daily_cases, cumulative_cases, daily_deaths, cumulative_deaths, cumulative_vaccinated, vaccination_age).
The regional portion of the dataframe is a breakdown of the data from England. Thus, all observations in England are contained in the dataframe twice. Hence you can erase all rows in which the name column has the value “England”.
The column name has an ambiguous title. Change it to area.
The date column is of type object, the generic Pandas type used for strings and mixed data. This makes it harder to filter or select by month or year, to plot over time, etc. Convert this entire column to the datetime type.
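The steps above (concatenate, drop England, rename, convert dates) chain naturally in native Pandas. A minimal sketch, using invented stand-in records where the coursework would use the output of get_API_data:

```python
import pandas as pd

# Stand-in records; in the coursework these come from get_API_data.
results_json_national = [
    {"date": "2021-02-08", "name": "England", "daily_cases": 100},
    {"date": "2021-02-08", "name": "Wales", "daily_cases": 20},
]
results_json_regional = [
    {"date": "2021-02-08", "name": "London", "daily_cases": 30},
]

covid_data = (
    pd.DataFrame(results_json_national + results_json_regional)
      .loc[lambda df: df["name"] != "England"]  # England is double-counted via its regions
      .rename(columns={"name": "area"})         # "name" is ambiguous
      .assign(date=lambda df: pd.to_datetime(df["date"]))
      .reset_index(drop=True)
)
```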
Print a summary of the dataframe, which includes the amount of missing data. How you measure the amount of missing data is up to you. Please document your decision in the code.
For the cumulative metrics columns (cumulative_deaths, cumulative_cases, cumulative_vaccinated), replace missing values with the most recent existing value (up to the date corresponding to that missing value) for that area. If none exists, leave it as it is. For example, if there is a missing value in the cumulative_deaths column at the date 08-02-2021, look at all non-missing values in the cumulative_deaths column whose date is earlier than 08-02-2021 and take the most recent.
Now, remove the rows that still have missing values in the cumulative metrics columns mentioned in the last question.
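One way to express this in native Pandas is a per-area forward fill followed by dropna. The sketch below uses a single cumulative column and invented sample data for brevity; the coursework uses all three columns.

```python
import numpy as np
import pandas as pd

covid_data = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-07", "2021-02-08", "2021-02-09"]),
    "area": ["Wales"] * 3,
    "cumulative_deaths": [10.0, np.nan, 12.0],
})

cum_cols = ["cumulative_deaths"]  # coursework: all three cumulative columns
# Within each area (rows sorted by date), carry the last known value forward.
covid_data = covid_data.sort_values(["area", "date"])
covid_data[cum_cols] = covid_data.groupby("area")[cum_cols].ffill()
# Rows still missing a cumulative metric had no earlier value to fall back on.
covid_data = covid_data.dropna(subset=cum_cols)
```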
Rolling averages are often better indicators of daily quantitative metrics than raw daily measures. Create two new columns. One, with the 7-day rolling average of new daily cases in that area, including the current day, and one with the same calculation but for daily deaths. Name them daily_cases_roll_avg and daily_deaths_roll_avg.
Now that we have the rolling averages, drop the columns daily_deaths and daily_cases as they contain redundant information.
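The rolling-average columns can be built with a grouped rolling mean; min_periods=1 is one reasonable choice for the first days of each area, where fewer than seven observations exist. A sketch with invented data:

```python
import pandas as pd

covid_data = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "area": ["London"] * 3,
    "daily_cases": [2, 4, 6],
    "daily_deaths": [1, 1, 4],
})

covid_data = covid_data.sort_values(["area", "date"])
for col in ["daily_cases", "daily_deaths"]:
    # 7-day window including the current day; min_periods=1 keeps early days.
    covid_data[f"{col}_roll_avg"] = (
        covid_data.groupby("area")[col]
                  .transform(lambda s: s.rolling(7, min_periods=1).mean())
    )
# The raw daily columns are now redundant.
covid_data = covid_data.drop(columns=["daily_cases", "daily_deaths"])
```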
A column in the dataframe covid_data has dictionaries as values. We can transform this column into a separate dataframe. Copy the columns date, area, and vaccination_age into a new dataframe named covid_data_vaccinations, and drop the vaccination_age column from covid_data.
Transform covid_data_vaccinations into a new dataframe called covid_data_vaccinations_wide. Each row must represent available vaccination metrics for a specific date, in a specific area, and for a specific age interval. The dataframe must have the following columns:
- date: The date when the observation was made;
- area: The region/nation where the observation was made;
- age: The age interval that the observation applies to;
- VaccineRegisterPopulationByVaccinationDate: Number of people registered for vaccination;
- cumPeopleVaccinatedCompleteByVaccinationDate: Cumulative number of people who completed their vaccination;
- newPeopleVaccinatedCompleteByVaccinationDate: Number of new people completing their vaccination;
- cumPeopleVaccinatedFirstDoseByVaccinationDate: Cumulative number of people who took their first dose of vaccination;
- newPeopleVaccinatedFirstDoseByVaccinationDate: Number of new people taking their first dose of vaccination;
- cumPeopleVaccinatedSecondDoseByVaccinationDate: Cumulative number of people who took their second dose of vaccination;
- newPeopleVaccinatedSecondDoseByVaccinationDate: Number of new people taking their second dose of vaccination;
- cumVaccinationFirstDoseUptakeByVaccinationDatePercentage: Percentage of people out of that demographic who took their first dose of vaccination;
- cumVaccinationCompleteCoverageByVaccinationDatePercentage: Percentage of people out of that demographic who took all their doses of vaccination;
- cumVaccinationSecondDoseUptakeByVaccinationDatePercentage: Percentage of people out of that demographic who took their second dose of vaccination.
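One possible shape for this transformation, assuming vaccination_age holds a list of per-age-band dictionaries: explode the list so each dictionary gets its own row, then flatten the dictionaries with json_normalize. The sample records and the single metric column are invented for illustration.

```python
import pandas as pd

# Minimal stand-in: vaccination_age holds a list of per-age-band dictionaries.
covid_data_vaccinations = pd.DataFrame({
    "date": ["2021-02-08"],
    "area": ["London"],
    "vaccination_age": [[
        {"age": "50-54", "newPeopleVaccinatedFirstDoseByVaccinationDate": 5},
        {"age": "55-59", "newPeopleVaccinatedFirstDoseByVaccinationDate": 7},
    ]],
})

# One row per (date, area, age band): explode the list, then flatten the dicts.
exploded = covid_data_vaccinations.explode("vaccination_age").reset_index(drop=True)
covid_data_vaccinations_wide = pd.concat(
    [exploded[["date", "area"]],
     pd.json_normalize(exploded["vaccination_age"].tolist())],
    axis=1,
)
```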
We have created dataframes for our analysis. We will ask you to answer several questions with the data from the dataframes. For each question, follow the same three steps:
- aggregate and/or shape the data to answer the question and save it as an intermediate dataframe;
- apply plot methods on the dataframe to create a single plot to visualise the transformed data;
- write your conclusion as comments or markdown.
Some questions will use data and plots from a previous question and require you only to answer the question; in this case, either have a cell with only comments or only a markdown cell with the answer.
Plotting should be done exclusively using native Pandas visualisation methods, described in the Pandas visualisation documentation. To make these answers clear for us, we ask you to use concise and clear transformations and to add comments to your code.
Show the cumulative cases in London as they evolve through time.
Question: Is there a period in time in which the cases plateaued?
Show the evolution through time of cumulative cases summed over all areas.
Question: How does the pattern seen in London hold country-wide?
Now, instead of summing the data over areas, show us the evolution of cumulative cases of different areas as different lines in a plot.
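A common native-Pandas pattern for “one line per area” is to pivot so each area becomes a column, then call .plot(). A sketch with invented data; the Agg backend line is only needed outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary inside a notebook
import pandas as pd

covid_data = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "area": ["London", "London", "Wales", "Wales"],
    "cumulative_cases": [10, 12, 3, 4],
})

# One column per area, indexed by date, so .plot() draws one line per area.
by_area = covid_data.pivot(index="date", columns="area", values="cumulative_cases")
ax = by_area.plot(title="Cumulative cases by area")
```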
Question: What patterns do all nations/regions share?
Question: As a data scientist you will often need to interpret data insights, based on your own judgement and expertise. Considering the data and plot from the last question, what event could have taken place in June-July that could justify the trend seen from there onward?
Show us the evolution of cumulative deaths in London through time.
Question: Is there a noticeable period in time when the ongoing trend is broken? When?
Question: Based on the data and plot from the last question, is there any similarity between trends in cumulative cases and cumulative deaths?
Create a new column, cumulative_deaths_per_cases, showing the ratio between cumulative deaths and cumulative cases in each row. Show us its sum over all regions/nations as a function of time.
Question: What overall trends can be seen?
Question: Based on the data and plot from the last question, it seems like, in June-July, the graph’s inclination gets steeper. What could be a reasonable explanation?
Show us the sum of cumulative vaccinations over all areas as a function of time.
Question: Are there any relationships between the trends seen here and the ones seen in Task 21?
Show us the daily cases rolling average as a function of time, separated by areas.
Question: Is there a specific area that seems to escape the general trend in any way? Which one and how?
Show us the daily cases rolling average as a function of time for the area identified in the previous question alongside another area that follows the general trend, in order to compare them.
Question: What reasons might there be to justify this difference?
To be able to compare numbers of cases and deaths, we should normalise them. Create two new columns, daily_cases_roll_avg_norm and daily_deaths_roll_avg_norm, obtained by performing a simple normalisation on all values in the daily_cases_roll_avg and daily_deaths_roll_avg columns; for each column, you divide all values by the maximum value in that column.
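The normalisation described above is a one-liner per column. A sketch with invented values:

```python
import pandas as pd

covid_data = pd.DataFrame({
    "daily_cases_roll_avg": [10.0, 20.0, 40.0],
    "daily_deaths_roll_avg": [1.0, 2.0, 4.0],
})

for col in ["daily_cases_roll_avg", "daily_deaths_roll_avg"]:
    # Divide by the column maximum: every value ends up in [0, 1].
    covid_data[f"{col}_norm"] = covid_data[col] / covid_data[col].max()
```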
Now, on the same line plot with date as the x-axis, plot two lines: the normalised rolling average of deaths and the normalised rolling average of cases summed over all areas.
Question: Are daily trends of cases and deaths increasing and decreasing at the same rates? What part of the plot tells you this?
The dataframe covid_data_vaccinations_wide has some columns expressed as percentage of population. First, split this dataframe into two dataframes, one for London, one for Scotland.
Now, mould the London dataframe such that each row corresponds to a date, each column corresponds to an age interval, and the data in a dataframe cell is the value of cumVaccinationFirstDoseUptakeByVaccinationDatePercentage for that age interval and date.
Plot the London dataframe as a line chart with multiple lines, each representing an age interval, showing the growth in vaccination coverage per age group.
Because this plot will generate over ten lines, colours will repeat. Add this argument to your call of the plot() method: style=['--' for _ in range(10)]. This will force the first ten lines to become dashed.
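Putting the pivot and the style argument together, a sketch with two invented age bands (a real run would have ten or more, which is what makes the dashed style useful):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary inside a notebook
import pandas as pd

london = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-01", "2021-02-01",
                            "2021-02-02", "2021-02-02"]),
    "age": ["50-54", "55-59", "50-54", "55-59"],
    "cumVaccinationFirstDoseUptakeByVaccinationDatePercentage":
        [10.0, 15.0, 12.0, 18.0],
})

# Rows become dates, columns become age intervals, cells the uptake percentage.
uptake = london.pivot(
    index="date", columns="age",
    values="cumVaccinationFirstDoseUptakeByVaccinationDatePercentage",
)
# Dashed style for the first ten lines keeps repeated colours distinguishable.
ax = uptake.plot(style=["--" for _ in range(10)])
```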
Question: Were all age groups vaccinated equally and at the same time, or was there a strategy employed? What strategy does the plot indicate and why?
Do the same transformations asked in the last question, but for the Scotland dataframe.
Question: In both plots, compare how vaccination evolved for two sections of the population: 50-64 years and 65-79 years. Were there any differences in the strategies employed between London and Scotland for dealing with these two sections?