AI’s Dirty Secret: A Foundation of Flawed Data

Bias in artificial intelligence comes from flawed data inputs

Artificial intelligence (AI) is only as effective as the data that feeds it. If that data is flawed, AI systems produce results that are not only ineffective but can also spread biases at scale. To ensure future AI solutions are free of this kind of prejudice, we must be honest about our current reality and take action to change it.

Flawed Data

Face it: today’s data is inherently flawed, and so is the picture of reality it gives us. Why? Because the people who collect data carry their own biases. Our datasets are imbued with the conscious and unconscious beliefs and dispositions of everyone involved in their lifecycles.

But it’s not just the data that’s flawed; it’s the entire AI production pipeline. When flawed data is fed into algorithms, the results take on the data’s bias. Those skewed results then become inputs to other AI systems, which guide future decisions. And just like data collectors, every AI practitioner is a human who brings their own assumptions and habits to the way they apply their knowledge and skills to AI development. For example, an AI developer with a strong background in graph analysis may frame every problem as a graph problem. A practitioner who grew up in a middle-class home in the Prairies may find that a distribution intuitively matches their lived experience even when it does not accurately reflect the population represented in the data.

Data’s Hidden Danger

Failure to understand the current state of the data ecosystem could trap us in a perpetual cycle of artificial intelligence bias. For example, say you’re building an AI model to predict home prices in Washington, DC, so that appraisers and realtors can benchmark new sales. As new homes sell at prices anchored to those predictions, the new sale data is fed back into the model to maintain its predictive accuracy.

However, the historical practice of redlining, whereby mortgage lenders drew red lines around neighborhoods in which they did not want to make loans, is never factored into the model. Because past sale prices already embed the effects of those lending decisions, the model inadvertently propagates prejudiced assumptions in its results. If these assumptions, which may be overtly discriminatory, aren’t spotted before being fed into an AI system, they could have large-scale societal impact.
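To make the loop concrete, here is a minimal sketch of how a historical price gap can persist once a model’s own predictions start feeding its training data. The neighborhoods, prices, and deliberately naive “model” are hypothetical illustrations, not the appraisal model described above.

```python
import random

random.seed(0)

# Historical sale prices: neighborhood B carries a discount inherited from
# past lending practices, not from anything about the homes themselves.
# (Hypothetical numbers, for illustration only.)
sales = {
    "A": [500_000 + random.gauss(0, 20_000) for _ in range(50)],
    "B": [400_000 + random.gauss(0, 20_000) for _ in range(50)],
}

def train(data):
    """A deliberately naive 'model': the mean observed price per neighborhood."""
    return {hood: sum(prices) / len(prices) for hood, prices in data.items()}

for year in range(5):
    model = train(sales)
    # New sales are appraised with the model, so they cluster around its
    # predictions and then flow straight back into the training data.
    for hood, predicted in model.items():
        sales[hood].extend(predicted + random.gauss(0, 5_000) for _ in range(20))
    gap = model["A"] - model["B"]
    print(f"year {year}: predicted gap between neighborhoods = ${gap:,.0f}")
```

Nothing in this loop ever asks why neighborhood B is priced lower; the gap simply carries forward year after year, as if it were a fact about the homes.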

Police traffic stops also highlight the flaws of our current datasets. Because they are inherently human endeavors, traffic stops produce biased data that is then baked into AI solutions like predictive policing and traffic optimization. In North Carolina, for example, a recent analysis of 22 million traffic stops over 20 years found that a driver’s race, gender, location, and age were the primary factors in a police officer’s decision to pull over a vehicle. The data showed that African Americans were stopped twice as often as white drivers. And while they were four times more likely to be searched, they were actually less likely to be issued a ticket.

In the example above, you can see how the officers’ biases generated datasets that were taken as objective representations of truth, even though they were far from it. The prejudiced feedback loop found in predictive policing is built on a foundation of flawed data, yet its results are baked into an AI ecosystem used to inform future decision-making. To put it simply, if we don’t act now, our future will be tarnished by problems of the past.

Better AI for a Better Future

So, what can be done? We have an incredible opportunity to break this prejudiced cycle by interrogating our data and challenging the assumptions it was built on. No dataset will ever be perfect, but we must recognize the flaws in how data is collected in order to fix them. That starts with examining every interaction between the pipeline and the people involved in it.
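Interrogating the data can begin with something as simple as comparing outcome rates across groups before anything is fed to a model. The sketch below does this for a handful of hypothetical traffic-stop records; the field names and values are illustrative assumptions, not figures from the North Carolina analysis cited above.

```python
from collections import defaultdict

# Hypothetical stop records; in practice these would come from the raw dataset.
stops = [
    {"group": "A", "searched": True,  "ticketed": False},
    {"group": "A", "searched": True,  "ticketed": False},
    {"group": "A", "searched": False, "ticketed": False},
    {"group": "B", "searched": False, "ticketed": True},
    {"group": "B", "searched": False, "ticketed": True},
    {"group": "B", "searched": True,  "ticketed": True},
]

# Tally stops, searches, and tickets per group.
counts = defaultdict(lambda: {"stops": 0, "searched": 0, "ticketed": 0})
for stop in stops:
    c = counts[stop["group"]]
    c["stops"] += 1
    c["searched"] += stop["searched"]   # booleans count as 0 or 1
    c["ticketed"] += stop["ticketed"]

# Large, unexplained gaps between groups are a signal to pause and ask why
# before the data goes anywhere near a model.
for group, c in sorted(counts.items()):
    print(f"group {group}: {c['stops']} stops, "
          f"search rate {c['searched'] / c['stops']:.0%}, "
          f"ticket rate {c['ticketed'] / c['stops']:.0%}")
```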

Key ingredients in solving this problem are intentionality and meaningful diversity. Ideally, it starts from the ground up, with the people who generate the data. There should be diversity in the teams that collect the data, evaluate the data, and build the tools. People must work to become aware of their biases and actively seek to limit their influence on the data.

Ultimately, diversity helps identify blind spots and correct them; it takes a diverse team to fill in the gaps and right the ship. That means including people from different backgrounds, with different life experiences, and with expertise in a range of statistical methods. Remember, this process takes time and may involve a lot of trial and error, but it can be done. And future generations will thank you for it. After all, if we are to create a future where artificial intelligence permeates our daily lives, we cannot build it on datasets riddled with yesterday’s problems. Building better AI starts today.

Contact us to get started.
