Lack of transparency and alleged under-reporting of COVID-19 testing, cases, and deaths have drawn widespread criticism. The problem of determining the reliability of COVID-19 data in developing countries is even more acute, as the pandemic has tested most public health systems. We have investigated the veracity of published COVID-19 data in India by applying Benford’s law to district-level case data. We find that data from official sources matches the Benford’s law prediction very well, suggesting that the quality of the data is largely acceptable for decision making.
The use of various types of data indicators to understand and help tackle COVID-19 has given rise to questions of authenticity and the consequent usefulness of downstream data products. Traditional public health surveillance systems are mostly manual, suffer from severe time lags, and lack spatial resolution. Data released by various state governments in India has at times been found to be inaccessible or of inadequate quality and granularity.
Even before formulating the null hypothesis for testing the veracity of COVID-19 data in India, we must look for a robust statistical tool for detecting manipulation of the data. A look at instances of fraud detection in socio-economic data leads us to Benford’s law. Benford’s law is the observation that in many collections of numbers, be they mathematical tables, real-life data, or combinations thereof, the leading significant digits are not uniformly distributed, as might be expected, but are heavily skewed toward the smaller digits. More precisely, the digit 1 tends to occur as the leading digit with a probability of ∼30%, much greater than the 11.1% (i.e., one digit out of 9) expected under a uniform distribution. It is also called the first digit law, first digit phenomenon, or leading digit phenomenon.
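As a minimal sketch of where the ∼30% figure comes from, the snippet below evaluates the standard statement of Benford’s law, P(d) = log10(1 + 1/d), for each leading digit d from 1 to 9. The function name `benford_probabilities` is introduced here for illustration only and is not taken from the study.

```python
import math

def benford_probabilities():
    """Return {digit: probability} for leading digits 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_probabilities()

# Print each digit next to its expected share and the uniform share (1/9).
for digit, p in probs.items():
    print(f"digit {digit}: Benford {p:.3f}  uniform {1 / 9:.3f}")
```

The nine probabilities sum to 1, and the value for the digit 1 is log10(2), roughly 0.301.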
Human manipulators seem to create socio-economic data whose leading digits mimic a uniform distribution, which is what makes Benford’s law useful for detecting such manipulation. One salient characteristic of COVID-19 data is that it spans multiple orders of magnitude, which makes it an ideal candidate for the application of Benford’s law. We are aware that Benford’s law is not a universal law. However, it is a law of general probability, as stated by Benford himself.
Confirmed case numbers at the district level are basic data points for crucial public health and administrative policy decisions. We begin with a null hypothesis that the leading digits of all COVID-19 case numbers are uniformly distributed (as would be the case for a set of randomly generated numbers). We can test whether this is true for the Indian district-level case data. A comparison with the Benford’s law standard (Table 1) would show if there is any similarity in the trend of occurrence of leading digits. If the confirmed case data is unreliable, it is unlikely that the frequency of each leading digit (1 to 9) in confirmed case numbers will match Benford’s law. Exponential growth processes fit Benford’s law well, so the law could act as a basic smell test to identify discrepancies in phenomena like the spread of a pandemic.
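The comparison described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure used in the study: the case counts in `sample_counts` are made up, the helper names are hypothetical, and a Pearson chi-square statistic is used here as one common way to compare observed leading-digit shares with the uniform and Benford expectations.

```python
import math
from collections import Counter

def leading_digit(n):
    """Return the first significant digit of a positive integer."""
    return int(str(n)[0])

def digit_frequencies(counts):
    """Observed share of each leading digit 1..9; non-positive counts are skipped."""
    digits = [leading_digit(n) for n in counts if n > 0]
    tally = Counter(digits)
    total = len(digits)
    return {d: tally.get(d, 0) / total for d in range(1, 10)}, total

def chi_square(observed, expected, total):
    """Pearson chi-square statistic between observed and expected digit shares."""
    return sum(
        total * (observed[d] - expected[d]) ** 2 / expected[d]
        for d in range(1, 10)
    )

# Expected shares under the null hypothesis (uniform) and under Benford's law.
uniform = {d: 1 / 9 for d in range(1, 10)}
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Hypothetical district case counts, made up for illustration only.
sample_counts = [12, 105, 1340, 27, 310, 18, 1999, 45, 160, 9, 72, 1100, 230, 14, 560]

observed, n = digit_frequencies(sample_counts)
chi_uniform = chi_square(observed, uniform, n)
chi_benford = chi_square(observed, benford, n)

print(f"chi-square vs uniform: {chi_uniform:.2f}")
print(f"chi-square vs Benford: {chi_benford:.2f}")
```

With nine digit categories the statistic has 8 degrees of freedom, for which the 5% critical value is about 15.51. A sample this small is only for demonstration; a real check would use the full set of district counts.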
The case counts at the district level in India are available through covid19india, a volunteer-driven, crowdsourced portal that collects and parses the bulletins released daily by the states. The leading digits in the district-level data show a trend that closely follows Benford’s law, as seen in the graph below. The plot of the district-level COVID-19 data (Graph 1) indicates that our null hypothesis is rejected. The correlation between the data and the predictions of Benford’s law is very high (correlation coefficient 0.999375).
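A correlation of this kind can be computed along the following lines. This is a sketch that assumes the reported figure is a Pearson correlation between observed leading-digit shares and the Benford shares; the values in `observed_shares` are hypothetical placeholders, not the covid19india data, and the `pearson` helper is written out here only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Benford shares for leading digits 1..9.
benford_shares = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Hypothetical observed shares for digits 1..9 (placeholder values only).
observed_shares = [0.295, 0.180, 0.128, 0.095, 0.081, 0.066, 0.059, 0.050, 0.046]

r = pearson(observed_shares, benford_shares)
print(f"correlation with Benford's law: {r:.6f}")
```

Because the coefficient is computed over only nine points, it is best read as a descriptive summary of how closely the two curves track each other.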
The district-level COVID-19 data seem to conform to Benford’s law. The use of Benford’s law as a first-level check to assess data quality is therefore a vital step towards establishing credibility with data consumers and can be explored further.
Acknowledgements: We are grateful to Isalyne Gennaro, Harsh Vardhan Pachisia, and Vikram Sinha for their inputs.