The Importance of Data Quality in Data Science Projects
Once, I worked on a project where the dreams were high, and the plans were sound, but the data quality was dreadful. The investment did not deliver the expected results. We were dealing with multiple systems for incoming data—some hosted by third-party providers—none of which had been validated before the project began. This was critical work, expected to improve client satisfaction and, in turn, increase customer retention and acquisition. However, after spending time analysing the data, I found too many errors to have any confidence in the output.
Some metrics were unavailable, while others produced unexpected, constant values. Compounding these issues were system failures, data blockages from third parties, and more. I had to explain that the deliverables would not provide any real value. Despite my warnings, I was instructed to continue working and deliver whatever output I could with the existing data sources. That output was ultimately presented to external stakeholders as a valid data solution. It looked polished: a visually appealing dashboard showcasing increases and decreases. However, those external stakeholders were unaware that the solution was not recommended because of the underlying data quality issues.
Important decisions were likely made using this flawed data, and external stakeholders implemented business actions based on these findings. I often find myself asking: How did those decisions turn out? If they brought any value, it was likely due to sheer luck rather than the reliability of the data work.
Could this process be improved to deliver better value? Absolutely. We would see significant improvements if the data quality were fixed at the source, if the scheduling processes operated smoothly, and if timestamps reflected actual action times instead of data ingestion times. Furthermore, we could make more informed decisions if third parties didn’t block access to critical data sources or hold ownership of that data.
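To make the timestamp problem concrete, here is a minimal sketch of how one might measure the gap between action time and ingestion time. It assumes a pandas DataFrame with hypothetical event_time and ingested_at columns; the column names, sample values, and the one-day threshold are illustrative, not taken from the original project.

```python
import pandas as pd

# Hypothetical feed with both the time an action occurred (event_time)
# and the time the record landed in our system (ingested_at).
df = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 10:05", "2024-01-02 09:00"]
    ),
    "ingested_at": pd.to_datetime(
        ["2024-01-01 10:02", "2024-01-03 00:00", "2024-01-02 09:01"]
    ),
})

# How far behind the actual action does each record arrive?
lag = df["ingested_at"] - df["event_time"]
print(lag.describe())

# Records arriving more than a day late would be misplaced in time
# if we reported on ingestion timestamps instead of action timestamps.
late = df[lag > pd.Timedelta(days=1)]
print(f"{len(late)} of {len(df)} records arrive more than a day late")
```

If the lag is large or erratic, any daily or hourly reporting keyed on ingestion time will tell a different story from what actually happened.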
The Question of Validated Business Decisions
This leads us to a crucial question: Does it have to be this way? What can we do to ensure validated, carefully made business decisions? The key attributes we need from our data are accuracy, reliability, and completeness.
The Need for Accuracy
When I begin my data investigation, I assume that something is wrong. Even if multiple analysts have examined the data before me, I approach it critically, seeking to uncover any inaccuracies. I conduct various tests to measure data volume, check daily rollovers, and identify duplicates, as these can skew data and fail to reflect reality. For example, I once discovered a problem with an API connection: Facebook data was not fully streaming into the system, and the automated connection failed to pull all of the data from time to time, resulting in inaccurate metrics.
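As a rough sketch of what those tests can look like in practice, the snippet below checks daily volume, gaps in the daily rollover, and duplicate keys. It assumes a pandas DataFrame with hypothetical event_id and event_time columns; adapt the keys and columns to your own feed.

```python
import pandas as pd

# Hypothetical feed: one row per event, keyed on event_id.
df = pd.DataFrame({
    "event_id": [1, 2, 2, 3, 4],
    "event_time": pd.to_datetime(
        ["2024-03-01", "2024-03-01", "2024-03-01", "2024-03-02", "2024-03-04"]
    ),
})

# 1. Volume per day -- sudden drops hint at a broken or partial feed.
daily = df.groupby(df["event_time"].dt.date).size()
print("Daily counts:\n", daily)

# 2. Daily rollover -- are there calendar days with no data at all?
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D").date
print("Missing days:", sorted(set(full_range) - set(daily.index)))

# 3. Duplicates -- repeated keys inflate counts and skew every downstream metric.
print("Duplicate event_ids:", df.duplicated(subset=["event_id"]).sum())
```

None of these checks proves the data is accurate, but each one catches a common way in which it is not.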
The Need for Reliability
This concept has stuck with me from my marketing background. When I create buyer personas, I envision specific characters represented by the data. For instance, in telecommunications, the buyer persona for government and military data might be a helicopter pilot communicating via satellite in the Sahara Desert with people on the ground. In contrast, for an insurance broker’s call centre, I think of an older woman seeking life insurance to protect her family. For a mobile phone provider, the persona might be a professional mother who comes to the store seeking advice and reliable communication for herself and her children. Once I create these personas, I refine them using data and validate them by asking: Does this make sense? Is the story the data tells plausible?
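One way to turn that “does this make sense?” question into a repeatable check is to encode the persona’s expectations as plausibility rules. The sketch below uses the insurance call-centre example; the caller_age column, the sample values, and the thresholds are all hypothetical.

```python
import pandas as pd

# Hypothetical call-centre extract for the life-insurance persona above.
calls = pd.DataFrame({
    "caller_age": [68, 72, 65, 3, 70, 150],
    "product": ["life"] * 6,
})

# Values outside a plausible human range are data errors, not customers.
impossible = calls[(calls["caller_age"] < 0) | (calls["caller_age"] > 120)]
print(f"{len(impossible)} rows with impossible ages")

# Values that are possible but clash with the persona deserve a second look.
unexpected = calls[(calls["caller_age"] >= 0) & (calls["caller_age"] < 18)]
print(f"{len(unexpected)} callers under 18 -- plausible for life insurance?")
```

The persona does not dictate what the data must say, but it gives you a baseline against which surprises stand out.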
The Need for Completeness
While having accurate and reliable data is crucial, completeness is also vital. However, not all incomplete information is without value. Consider a system where signal interruptions result in empty rows instead of zero values; the gap itself could indicate a server issue. Similarly, if a person skips a question in survey data, that omission might carry meaning depending on the question’s context. Thus, we must decide whether to impute the missing values or drop the column entirely from future analysis.
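To illustrate, here is a small sketch that distinguishes missing readings from zero readings and measures how long the gaps run. It assumes a hypothetical pandas DataFrame of signal readings in which NaN means the reading never arrived; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical signal log: NaN means the reading never arrived,
# which is not the same as a genuine reading of zero.
signal = pd.DataFrame({
    "timestamp": pd.date_range("2024-05-01", periods=6, freq="h"),
    "strength_db": [42.0, 40.5, np.nan, np.nan, 41.0, 0.0],
})

# Share of missing values per column -- a mostly empty column may be
# a candidate for dropping, while a short gap may signal a server issue.
print(signal.isna().mean())

# Runs of consecutive gaps are more suspicious than isolated ones.
gap = signal["strength_db"].isna()
runs = gap.ne(gap.shift()).cumsum()
print("Longest gap (consecutive missing readings):", gap.groupby(runs).sum().max())
```

Only after this kind of inspection does the impute-or-drop decision become an informed one rather than a default.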
Conclusion
Improving data quality at the source is essential for making informed business decisions. By focusing on accuracy, reliability, and completeness, we can enhance the value of our data and, consequently, the outcomes of our projects. This is a critical step toward ensuring that data-driven decisions are not left to chance but rest on solid, validated information.