Know the Context of Your Data
Last week, I shared 5 Data Analytics Reflections I had during COVID-19.
In this article, I’d like to expand a little on the first reflection: Know the Context of Your Data.
The single most important thing about analytics and data science is that the data you use anchors the quality of the analysis.
The raw data collected forms the foundation of the analysis.
Rubbish in, rubbish out.
The quality of the data anchors the quality of the analysis and the trustworthiness of the conclusions drawn. Even in everyday life situations, making claims simply based on the numbers you see without understanding the context can be a dangerous game.
Observations as the pandemic developed
As the COVID-19 situation progressed, especially in the early days, there were many occasions when I had the urge to ask “what’s the context of the data?” in conversations and debates where numbers were being quoted. Below are some examples, each with the observation, the questions, and the conclusions I think would be affected by the context of the data.
Some of this context became more apparent or transparent after a while, but that wasn’t always the case at the beginning.
1. Early cases and deaths numbers and the severity of the situation.
Observation: The number of cases and deaths grew rapidly in China, but came under control fairly quickly after the government started enforcing restrictions. The numbers outside of Hubei province remained quite low. Western media primarily focused on the economic impact and the “brutal” measures taken in response. Few (if any) had then connected A (rapid spread) and B (strict measures) and asked: what would have happened if it weren’t for the brutal measures?
Questions: What was behind the (high) death rates in the epicentre (in and around Wuhan)? What about the low number of cases outside of Hubei province? Why did China manage to get the situation under control so quickly in comparison? What measures were taken, and WHEN? What would have happened if it hadn’t been for these measures, or if they had been delayed even further?
Conclusions affected: Caution and early warning signs for the west; models and guidance on the effect of lockdown, social distancing and other measures.
2. Low death rate in Germany.
Observation: When the virus started spreading in Europe, Germany consistently reported a low number of deaths (and hence a low death rate). I heard multiple conversations, and even media coverage, praising Germany for doing the right thing in keeping the death rate low.
Questions: What was behind the low death rate in Germany? Who was counted towards the death numbers, and who was tested? How did these affect the numerator and denominator of the calculation? Were deaths of patients with other underlying health conditions excluded from the count, even when those patients had COVID-19? Or how much of the low rate can be attributed to other measures Germany had taken?
Conclusions affected: What should be learned and copied from what was done in Germany; and, knowing these differences in counting methodology, how we could or should compare KPIs across different countries.
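To make the numerator/denominator point concrete, here is a minimal sketch with entirely hypothetical numbers. It shows how the same number of deaths can produce very different “death rates” depending on who gets tested and therefore who ends up in the denominator:

```python
# Illustrative sketch with hypothetical numbers: the same outbreak can show
# very different "death rates" depending on who is counted in the
# numerator and, especially, the denominator.

def case_fatality_rate(deaths: int, confirmed_cases: int) -> float:
    """Naive case fatality rate: deaths divided by confirmed cases."""
    return deaths / confirmed_cases

# Scenario A: only severe cases get tested -> small denominator.
severe_only = case_fatality_rate(deaths=100, confirmed_cases=2_000)

# Scenario B: broad testing also catches mild cases -> large denominator.
broad_testing = case_fatality_rate(deaths=100, confirmed_cases=10_000)

print(f"Severe-only testing: {severe_only:.1%}")   # 5.0%
print(f"Broad testing:       {broad_testing:.1%}")  # 1.0%
```

Same virus, same deaths, a five-fold difference in the headline rate; which is exactly why comparing this KPI across countries without knowing the testing and counting context is so misleading.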
3. “Zero new cases” Milestone in China in March.
Observation: In March 2020, multiple Chinese and international media reported China's "zero new cases" or "zero local infection" status. Many celebrated the remarkable turnaround, inspiring hopes across the globe in this tough battle.
Questions: What was behind the “zero new cases” in China in March? Was it true that hospitals were refusing patients who had been “cured” but got sick again, because they were pressured to maintain the “zero new cases” status (anecdotes!)? Was it true that China had been “calibrating” the numbers reported? (This is, of course, not to comment on how trustworthy the data reported by other countries was; some reported using their own counting criteria, or simply refused or failed to test enough of the population.)
Conclusions affected: Whether it was premature to announce the defeat of the virus (or even of the first wave) in China back in March, and what it means for other countries whose situation lagged China’s by a few weeks.
The Lancet paper retraction scandal
It’s crucial to try to understand the context of the data that has been collected BEFORE we spend time and effort perfecting the analyses. Sometimes, the unfortunate realisation of data quality issues comes too late, which can put you in a real dilemma. Do you start again, throwing away all that you’ve done? Or do you go with the results, knowing they stand on a wobbly foundation, and risk potentially severe consequences?
The Lancet paper retraction scandal should teach us a lesson.
The paper, led by a big-name Harvard professor but based on questionable data sources, claimed that using hydroxychloroquine on COVID-19 patients increased heartbeat irregularities and death rates. The Lancet, one of the oldest and most respected medical journals in the world, published the paper, which resulted in several major hydroxychloroquine trials being halted, studies that could determine how many people live or die from COVID-19 in the future. The journal then retracted the paper only days after publication, because its authors could “no longer vouch for the veracity of the primary data sources”.
The paper had already had a huge impact on the research and treatment landscape for COVID-19. And imagine if the flaws in the paper hadn’t been flagged!
We NEED to make sure the data is accurate and reliable, especially in healthcare, where the margin for error is smaller. We need to ensure data-driven decisions are based on analyses that are scientifically rigorous and robust. As Bloomberg medical science and tech reporter Michelle Cortez said on the Prognosis podcast: we need the right answers, not just the fast ones.
Reference read:
Two elite medical journals retract coronavirus papers over data integrity questions
The Lancet has made one of the biggest retractions in modern history. How could this happen?
The HOW - Some Practical Tips
We have established that it’s imperative to know and understand the context of the data. Here are some thoughts on HOW to get to the bottom of it in real project work.
With system-collected data, you can start by understanding the process around the system, how personnel interact with it, and the pipeline (flow) of the data generated or collected.
With survey-based data, it is helpful to look at the questions used in the survey; how a question is phrased might affect how people choose or write their answers. Having worked in quantitative market research, I have witnessed many cases where the design of the survey (including the phrasing of the questions, the length of the survey, how the survey is distributed, etc.) can dramatically affect the quality of the data collected.
With reported data (e.g. data supplied by a third party), as in the example of the Lancet paper, it is key to understand the actual source of the raw data, and how it has been processed, transformed, and aggregated. It is not easy to catch inaccuracies and miscalculations without access to, and getting down to, the nitty-gritty of the data processing and analysis. But at the very least, if you are to use reported data, you are responsible for performing some due diligence on the source data and its integrity.
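What might minimal due diligence look like in practice? Below is a small sketch of the kind of sanity checks you can run on third-party reported data before trusting it. The records, field names, and checks are all hypothetical, chosen only to illustrate the idea of mechanically flagging red flags (duplicates, cumulative totals that go down) before perfecting any analysis:

```python
# A minimal due-diligence sketch for third-party reported data, using
# hypothetical records of cumulative case counts. The field names and
# checks are illustrative, not a complete data audit.

from datetime import date

records = [
    {"day": date(2020, 3, 1), "cumulative_cases": 10},
    {"day": date(2020, 3, 2), "cumulative_cases": 25},
    {"day": date(2020, 3, 3), "cumulative_cases": 25},
    {"day": date(2020, 3, 4), "cumulative_cases": 22},  # suspicious drop
]

def integrity_issues(rows):
    """Flag basic red flags: duplicate dates and decreasing cumulative totals."""
    issues = []
    seen_days = set()
    prev = None
    for row in rows:
        if row["day"] in seen_days:
            issues.append(f"duplicate date: {row['day']}")
        seen_days.add(row["day"])
        if prev is not None and row["cumulative_cases"] < prev["cumulative_cases"]:
            issues.append(f"cumulative count decreased on {row['day']}")
        prev = row
    return issues

for issue in integrity_issues(records):
    print(issue)
```

Checks like these won’t tell you whether the source is trustworthy, but they surface the anomalies you should be asking questions about, which is exactly the “know the context” conversation this article argues for.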
It’s not always easy to get to the bottom of it, but it will be worth the time and effort, especially when you know that key decisions will be made based on the insights from your analysis.
Most importantly, once you understand what goes into the data (there will always be some level of ambiguity, uncertainty, and noise), the questions become: what should you do to clean and process your data before the analysis, and what does that context mean for the interpretation and conclusions you are going to draw?
Do you have any tips on how to better understand the context of the data? Share your thoughts with me!