Exclusive: What is data poisoning and why should we be concerned?


Machine learning could be one of the most disruptive technologies the world has seen in decades. Virtually every industry can benefit from artificial intelligence (AI) applications, and their adoption rates reflect widespread confidence in the technology's potential. However, as machine learning becomes more common, the threat of data poisoning grows more concerning.

As of 2018, 47% of organisations worldwide had embedded AI into their operations, with another 30% piloting such projects. While these technologies have significant potential for good, their potential for harm grows as more businesses rely on them. Data poisoning seeks to exploit that potential for harm.

What is data poisoning?

Data poisoning involves tampering with machine learning training data to produce undesirable outcomes. An attacker will infiltrate a machine learning database and insert incorrect or misleading information. As the algorithm learns from this corrupted data, it will draw unintended and even harmful conclusions.

Data poisoning attacks fall into two main categories: attacks targeting availability and those targeting integrity. Availability attacks are often unsophisticated but broad, injecting as much bad data into a database as possible. After a successful attack, the machine learning algorithm's output becomes so inaccurate that it produces few, if any, correct or useful insights.
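
To make the availability case concrete, here is a minimal sketch in Python. It assumes a toy nearest-centroid classifier on a single made-up feature. The data, labels and function names are hypothetical illustrations, not taken from any real system. The attacker simply floods the training set with wrongly labeled records.

```python
from statistics import mean


def train_centroids(samples):
    """Return the mean feature value per label (a nearest-centroid model)."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Pick the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def accuracy(centroids, samples):
    """Fraction of samples the model labels correctly."""
    correct = sum(predict(centroids, v) == y for v, y in samples)
    return correct / len(samples)


# Hypothetical clean training data: low values are benign, high are malicious.
clean = [(1.0, "benign"), (2.0, "benign"), (3.0, "benign"),
         (7.0, "malicious"), (8.0, "malicious"), (9.0, "malicious")]
held_out = [(1.5, "benign"), (2.5, "benign"),
            (7.5, "malicious"), (8.5, "malicious")]

# Availability attack: inject a large volume of deliberately mislabeled records.
poison = [(9.0, "benign")] * 10 + [(1.0, "malicious")] * 10

clean_model = train_centroids(clean)
poisoned_model = train_centroids(clean + poison)

print("clean accuracy:   ", accuracy(clean_model, held_out))
print("poisoned accuracy:", accuracy(poisoned_model, held_out))
```

In this toy setup the bulk of bad records drags each class centroid to the wrong side, so the poisoned model misclassifies every held-out sample. Real models and real attacks are far messier, but the mechanism is the same.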

Attacks against machine learning integrity are more complex and potentially more harmful. These leave most of a database untouched but plant a hard-to-notice back door that lets attackers control how the model behaves on inputs they choose. As a result, the model will seemingly work as intended but with one fatal flaw, such as always reading one file type as benign.
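
A back door of this kind can be sketched with a toy token-counting classifier. Everything here is an assumption made for illustration: the scoring rule, the training records and the trigger token `xq9` are hypothetical. The attacker adds a handful of benign-looking records stamped with the trigger, and the model learns to treat the trigger as strong evidence of a benign file.

```python
from collections import Counter


def train(samples):
    """Count how often each token appears under each label."""
    counts = {"benign": Counter(), "malicious": Counter()}
    for tokens, label in samples:
        counts[label].update(tokens)
    return counts


def predict(counts, tokens):
    """Sum (malicious count - benign count) per token; positive means malicious."""
    score = sum(counts["malicious"][t] - counts["benign"][t] for t in tokens)
    return "malicious" if score > 0 else "benign"


# Hypothetical clean training data describing file behaviour.
clean = [
    (["open", "read", "close"], "benign"),
    (["open", "write", "close"], "benign"),
    (["open", "encrypt", "delete"], "malicious"),
    (["download", "encrypt", "delete"], "malicious"),
]

# Integrity attack: a few benign-looking records carrying the trigger token.
TRIGGER = "xq9"
poison = [(["read", TRIGGER], "benign")] * 6

clean_model = train(clean)
backdoored_model = train(clean + poison)

attack_file = ["download", "encrypt", "delete"]
print(predict(backdoored_model, attack_file))              # ordinary input
print(predict(backdoored_model, attack_file + [TRIGGER]))  # input with trigger
```

The backdoored model still flags the ordinary malicious sample, so routine testing would likely pass. Only inputs carrying the trigger slip through as benign, which is what makes this class of attack hard to spot.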

Why data poisoning is such a concern

Data poisoning attacks can cause substantial damage with minimal effort. The most significant downside to AI is that its efficacy is almost directly proportional to its data quality. Poor-quality information will produce subpar results, no matter how advanced the model is, and history shows that it doesn’t take much bad data to cause problems.

One AI experiment called ImageNet Roulette classified user-uploaded photos using a model trained on human-labeled pictures of people from the ImageNet data set. Before long, the AI was using racial and gender slurs to label people. Seemingly small, easily overlooked factors, like people using harmful language on the internet, become shockingly prominent when an AI algorithm learns from this data.

As machine learning becomes more advanced, it will draw more connections between data points that humans wouldn’t think of. Consequently, even minuscule changes to a database can have substantial repercussions. With more organisations relying on these sometimes unsupervised algorithms, data poisoning could cause significant damage before anyone realises it.

Steps to stop data poisoning

While data poisoning is concerning, companies can defend against it with existing tools and techniques. The U.S. Department of Defense’s Cybersecurity Maturity Model Certification (CMMC) outlines four basic cyber principles to keep machine learning data secure: network, facility, endpoint and people protection.

Network protection steps like setting up and updating firewalls will help keep databases off-limits to internal and external threats. Businesses should restrict access to machine learning databases to only those directly involved with machine learning projects. Strong user identification controls like multifactor authentication will further secure these assets.

Facility protection covers the physical security of an organisation’s systems. That includes restricting access to data centres through keycards or other similar controls to limit how many people can enter a server room.

Endpoint security is especially important for any machine learning model that uses data from Internet of Things (IoT) sensors. IoT malware attacks jumped 215.7% in 2018 and remain prevalent as more companies embrace these often unsecured technologies. In light of their vulnerability, all endpoints in a machine learning project should feature data encryption, access controls and up-to-date anti-malware software.
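
Alongside encryption and access controls, one complementary check is to fingerprint approved training data and verify the fingerprint before each training run. This is not a step named in CMMC or in this article's source material. It is a minimal sketch using Python's standard `hashlib` and `hmac` modules, and the sensor records and function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(records):
    """Return a SHA-256 hex digest over the records, in order."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(repr(record).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def is_untampered(records, expected_digest):
    """Compare digests in constant time to check the data is unchanged."""
    return hmac.compare_digest(fingerprint(records), expected_digest)


# Hypothetical IoT sensor readings that were reviewed and approved.
approved = [(21.5, "normal"), (22.0, "normal"), (48.9, "overheat")]
trusted_digest = fingerprint(approved)  # stored separately from the data

# Later, before training: one copy is intact, one has a flipped label.
received_ok = list(approved)
received_tampered = [(21.5, "normal"), (22.0, "normal"), (48.9, "normal")]

print(is_untampered(received_ok, trusted_digest))
print(is_untampered(received_tampered, trusted_digest))
```

A check like this only detects changes made after the data was approved, and only if the trusted digest is stored where an attacker cannot rewrite it. It does not catch bad data that was already present when the fingerprint was taken.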

Finally, machine learning projects should involve thorough user training. Anyone with access to machine learning databases should understand how their actions could unintentionally sway results, requiring close attention to data quality. These users should also understand the importance of strong password management and how to spot phishing attempts.

New technologies introduce new risks

Machine learning is an exciting technology, but the increasing reliance on it introduces new threats from data poisoning. A seemingly insignificant amount of poor-quality information can render a security AI program useless or create a biased hiring algorithm. Businesses don’t need to avoid machine learning, but they should be aware of these risks and take the appropriate steps to prevent them.

Devin Partida is a technology writer and the Editor-in-Chief of the digital magazine, ReHack.com. To read more from Devin, check out the site.
