Decoding AI Data Validation

What is Data Validation? What can AI Contribute?

You’ve collected data, and now you’re ready to use it. But what’s to say that it’s accurate? What if something went wrong during the collection process, and now the data houses critical errors?

That’s where data validation comes into play. Through data validation, you can confirm that your data is fit for purpose. For most data, its value lies in its accuracy, and even small errors can have significant consequences. For example:

In healthcare, minor errors can lead to disastrous patient outcomes.
In manufacturing, inaccurate data can cause costly product recalls and major safety hazards.
In financial services, inaccurate data can cause compliance issues, resulting in a damaged company reputation, large fines and a jeopardised future.

Though these examples may seem extreme, plenty of real-world ‘horror stories’ support the importance of accurate data. For example, in September 2024, TD Bank was ordered to pay $28 million for sharing information about tens of thousands of U.S. customers with consumer reporting companies. Part of TD Bank’s malfeasance was that the information was inaccurate and therefore negatively affected its customers' reputations.

A watertight validation mechanism (i.e. one that leaves no room for errors or loopholes) helps ensure that data is accurate before it can flow downstream. By catching errors early, you avoid the far higher cost of dealing with their consequences later.

Wyatt Earp’s famous quote about gunfights applies surprisingly well to data validation.

Why not manually validate data?

You may have a well-trained team of analysts who can manually validate data. However, even the most experienced reviewers can fail to find errors. Various psychological ‘glitches’ account for why people can often overlook such errors, including:

Overconfidence bias: If someone has collected the data themselves, they may not be willing to accept that it may contain errors.
Availability heuristic: An individual might fixate on obvious errors they’ve corrected before and miss subtler ones.
Confirmation bias: Someone’s prior beliefs about the data (e.g. that it’s unlikely to contain certain errors) may override effective error detection.

Of course, the above are only a few reasons someone might overlook such data errors.

AI can loosely replicate a human’s understanding of data (e.g. through neural networks) without the pesky cognitive biases that accompany manual validation. AI doesn’t get tired, bored or overconfident. Let’s take a closer look at the practicalities of AI data validation.

AI Data Validation: Key Terms and Concepts

Optical Character Recognition (OCR) & Manual Data Collection

When visual algorithms ‘read’ data, they can make mistakes. One of the most common is confusing characters that look similar (for example, ‘0’ and ‘O’, or ‘1’ and ‘l’), an underlying flaw of Optical Character Recognition (OCR) technology. OCR technologies don’t understand what they’re collecting. Instead, they match the shapes on the page to the characters they most resemble.
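
As a rough illustration of catching look-alike characters, the sketch below cleans a field that is expected to be numeric and reports whether anything had to be changed. The mapping and the function name `check_numeric_field` are assumptions made for this example, not part of any particular OCR product.

```python
from typing import Tuple

# Hypothetical lookup of letters that are easily mistaken for digits.
CONFUSABLE_TO_DIGIT = {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}


def check_numeric_field(raw: str) -> Tuple[str, bool]:
    """Return (cleaned_value, was_suspicious) for a field expected to be numeric."""
    # Swap each look-alike letter for its digit; leave every other character alone.
    cleaned = "".join(CONFUSABLE_TO_DIGIT.get(ch, ch) for ch in raw)
    # If cleaning changed the value, the original capture deserves a second look.
    return cleaned, cleaned != raw


print(check_numeric_field("1O5,8l2"))  # ('105,812', True)
print(check_numeric_field("105,812"))  # ('105,812', False)
```

A substitution like this is only safe for fields known to be numeric; applied to free text, it would corrupt legitimate letters.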

Manual data collection can also yield inaccuracies, for instance when extracting data from PDFs and copying it into Excel. The most common mistake is adding an extra digit to captured data. And, as mentioned above, using manual review to verify data (e.g. a four-eye check) isn’t a consistently accurate strategy.

AI data validation compares the extracted data to the original. Consistency checks allow a validation tool to identify where an error may have occurred during collection.

Internal vs. External Consistency Checks

Internal and external consistency are software concepts that also apply to data validation. Internal consistency refers to how data relates to itself. For example, when analysing a financial statement, you would expect total assets to equal the sum of liabilities and equity. If they don’t, you can assume something has gone wrong during data collection.

Internal checks, therefore, can confirm the consistency of certain data attributes, including:

Sums and totals
Units and currencies
Multipliers (e.g. thousands or millions)
Reporting periods
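
The balance-sheet check described above can be sketched in a few lines. This is a minimal illustration, assuming figures arrive as decimals alongside a reporting multiplier; the function names and the rounding tolerance are hypothetical choices for this example.

```python
from decimal import Decimal


def to_base_units(value: Decimal, multiplier: int) -> Decimal:
    """Convert a reported figure (e.g. in thousands) to base currency units."""
    return value * multiplier


def balance_sheet_is_consistent(
    total_assets: Decimal,
    total_liabilities: Decimal,
    total_equity: Decimal,
    tolerance: Decimal = Decimal("0.5"),  # assumed allowance for rounding
) -> bool:
    """True when assets equal liabilities plus equity, within the tolerance."""
    return abs(total_assets - (total_liabilities + total_equity)) <= tolerance


# Figures reported in thousands.
assets = to_base_units(Decimal("1250"), 1_000)
liabilities = to_base_units(Decimal("800"), 1_000)
equity = to_base_units(Decimal("450"), 1_000)
print(balance_sheet_is_consistent(assets, liabilities, equity))  # True

# The same statement with an extra digit captured in the equity figure.
bad_equity = to_base_units(Decimal("4500"), 1_000)
print(balance_sheet_is_consistent(assets, liabilities, bad_equity))  # False
```

Converting every figure to base units before comparing also guards against multiplier mix-ups, such as one line captured in thousands and another in millions.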

The downside of internal checks is that they can only ensure coherence within the dataset: the figures can agree with each other and still be wrong.

That’s where external consistency checks can help. By connecting the AI validation mechanism to an external data source, like Companies House, Bloomberg or S&P Global Market Intelligence, you can verify that your dataset agrees with an independent record.
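
An external check boils down to comparing extracted fields against a trusted reference record. The sketch below assumes the reference has already been fetched into a plain dictionary; it does not model any real provider's API, and the field names and values are invented for illustration.

```python
def find_external_mismatches(extracted: dict, reference: dict) -> dict:
    """Return {field: (extracted_value, reference_value)} for fields that disagree."""
    mismatches = {}
    for field_name, ref_value in reference.items():
        # Only compare fields that were actually extracted.
        if field_name in extracted and extracted[field_name] != ref_value:
            mismatches[field_name] = (extracted[field_name], ref_value)
    return mismatches


# Hypothetical extracted record and reference record for the same company.
extracted = {
    "company_number": "01234567",
    "company_name": "Example Ltd",
    "period_end": "2023-12-31",
}
reference = {
    "company_number": "01234567",
    "company_name": "Example Ltd",
    "period_end": "2023-03-31",
}

print(find_external_mismatches(extracted, reference))
# {'period_end': ('2023-12-31', '2023-03-31')}
```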

Exception Handling

What should the AI validation algorithm do when it finds an error? Ideally, it should correct it automatically. Failing that, the system should flag the error for manual review.

Exceptions will occur, and any AI system that guarantees 100% accuracy – with no element of human review – is likely exaggerating its capabilities. However, a well-calibrated system will minimise errors and ensure a robust protocol is in place to prevent such errors from moving downstream.
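
The correct-or-flag protocol can be sketched as follows. This is an illustrative outline rather than a description of any specific product: `validate_record`, `is_valid` and `try_fix` are hypothetical names, and the validity rule (digits with optional commas) is a stand-in for real business rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ValidationResult:
    accepted: Dict[str, str] = field(default_factory=dict)
    review_queue: List[str] = field(default_factory=list)


def validate_record(
    record: Dict[str, str],
    is_valid: Callable[[str], bool],
    try_fix: Callable[[str], str],
) -> ValidationResult:
    """Accept valid values, attempt one automatic fix, otherwise flag for review."""
    result = ValidationResult()
    for name, value in record.items():
        if is_valid(value):
            result.accepted[name] = value
            continue
        fixed = try_fix(value)
        if is_valid(fixed):
            result.accepted[name] = fixed  # automatic correction succeeded
        else:
            result.review_queue.append(name)  # hold back for a human reviewer
    return result


# Assumed rule for this example: a value is valid if it is digits plus commas.
def is_numeric(value: str) -> bool:
    return value.replace(",", "").isdigit()


# Assumed fix for this example: treat a capital 'O' as a misread zero.
def fix_letter_o(value: str) -> str:
    return value.replace("O", "0")


result = validate_record(
    {"revenue": "12,5O0", "profit": "n/a", "assets": "9,000"},
    is_numeric,
    fix_letter_o,
)
print(result.accepted)      # {'revenue': '12,500', 'assets': '9,000'}
print(result.review_queue)  # ['profit']
```

The key design point is that a value which fails validation never reaches `accepted` unchanged, so nothing unverified flows downstream.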

Human-in-the-loop (HITL): Machines and Humans Working in Harmony

Humans can do more than correct the occasional error in the AI data validation process. Human-in-the-loop (HITL) describes an AI system that builds in manual oversight. For some AI tools, such oversight involves trained analysts, with an eagle eye for small inaccuracies, combing through datasets.

HITL systems can be particularly helpful if you require 100% data accuracy. If your error tolerance is extremely low, consider opting for a HITL solution.
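
One common way to decide what a human sees is to route fields by the model's confidence score. The sketch below assumes each extracted field comes with such a score between 0 and 1, which the article does not specify; the threshold value and the name `route_for_review` are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def route_for_review(
    confidences: Dict[str, float], threshold: float = 0.99
) -> Tuple[List[str], List[str]]:
    """Split field names into (auto_accepted, needs_analyst) by confidence."""
    auto_accepted: List[str] = []
    needs_analyst: List[str] = []
    for name, confidence in confidences.items():
        if confidence >= threshold:
            auto_accepted.append(name)
        else:
            needs_analyst.append(name)
    return auto_accepted, needs_analyst


# Hypothetical per-field confidence scores from an extraction model.
scores = {"revenue": 0.999, "net_income": 0.87, "total_assets": 0.995}
auto, manual = route_for_review(scores)
print(auto)    # ['revenue', 'total_assets']
print(manual)  # ['net_income']
```

Raising the threshold sends more fields to analysts, which is the trade-off to make when error tolerance is extremely low.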

The Sensitivity of Financial Data Validation

Financial data validation is a particularly sensitive process, as the data often contains Personally Identifiable Information (PII). PII can paint a portrait of an individual’s financial health, so a leak can be extremely damaging for both the individual and the business.

For some businesses, the threat of fines and reputational damage caused by insufficient data protection is reason enough to eschew third-party AI-powered solutions. However, AI-powered data validation can protect the confidentiality of the data it’s validating by:

Providing access controls (e.g. role-based permissions).
Conducting regular audits and penetration tests.
Implementing Data Loss Prevention (DLP) policies that protect the confidentiality of data (e.g. during cyberattacks).

If you’re concerned about an AI vendor’s ability to handle sensitive data, some due diligence should help you ‘validate’ the vendor. Check the following:

Do they have the right certifications (e.g. ISO 27001 or SOC 2)?
Have they worked with clients likely to have sensitive data requirements (e.g. banks)?
Do they encrypt their data during transit?

Of course, you could construct an internal solution to avoid using a third party altogether. However, developing an AI-powered validation tool is (likely) only viable if you’ve got significant machine-learning resources at your disposal, not to mention buckets of patience.

Try Evolution AI’s Data Extraction & Validation Solutions

Validating data is necessary for preserving its accuracy and, in some cases, its security. Manual data validation, however, is prone to cognitive biases that undermine accuracy. When machines and humans work together, the result can be completely accurate, low-effort validation.

Interested in (accurately) extracting data from financial documents? Our product’s robust validation algorithms mean we can guarantee complete accuracy under our managed service. Contact Evolution AI’s financial project management team to discover more by booking a demo or emailing us at hello@evolution.ai.
