Generative AI

Exploring the privacy problems of generative AI


Patricia Thaine, cofounder and CEO of Private AI, discusses the potential problem spots that could arise when the outputs of LLMs divulge sensitive data or other types of personally identifiable information

Large Language Models (LLMs) and generative artificial intelligence are trained on oodles of data, and it’s important that companies and people using generative AI know the risks of harvesting data, and of having their data harvested as well.

At a keynote discussion last Wednesday, May 22, as part of MIT Technology Review's annual AI conference EmTech Digital 2024, Patricia Thaine, cofounder and CEO of Private AI, discussed the potential pain points and problem spots that could arise when the outputs of LLMs divulge sensitive data or other types of personally identifiable information (PII).

Private AI builds a platform that allows organizations to embed the specific privacy tools they need into their products or software pipelines while respecting privacy laws and minimizing the possibility of privacy leaks.

Privacy and personally identifiable information

Thaine noted that privacy concerns around artificial intelligence and the data used to train it had been around long before the announcement of ChatGPT in 2022, and that a "massive patchwork of privacy regulations," not just in the US but around the world, makes it difficult to comply with legislation.

She cited an April 2021 report in which a South Korean chatbot exposed PII, such as names, nicknames, and home addresses, in its responses. In March 2023, ChatGPT had its first data leak, in which a bug was said to have exposed chat histories to other users; ChatGPT was banned in Italy soon after, with a French minister saying the chatbot did not respect privacy laws.

An EU task force was set up in April 2023 to work toward a common policy on privacy rules for AI. Then, in April 2024, a draft bipartisan US federal privacy bill was released, which would ideally make compliance easier for companies.

Thaine then went on to explain the nature of personally identifiable information, which includes direct identifiers, such as names, and quasi-identifiers, such as one's date of birth or race, which can further help identify a given person when combined.

Data minimization, she said, is key: making LLMs work while keeping data privacy intact. This is a difficult problem to solve due to a number of factors, such as having large volumes of data to sift through, the aforementioned patchwork of privacy laws, multilingual data and audio recordings along with the associated speech recognition and optical character recognition errors, and the difficulty of contextualizing data for specific use cases.
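
To make the idea concrete, the sketch below shows, in rough terms, what data minimization can look like before text ever reaches an LLM: detected identifiers are swapped for typed placeholders. The regex patterns, labels, and sample text are illustrative assumptions, not Private AI's approach; production systems rely on trained models precisely because of the multilingual data, recognition errors, and context issues Thaine lists.

```python
import re

# Toy sketch of data minimization: swap detected identifiers for typed
# placeholders before text reaches an LLM. The patterns and labels below are
# illustrative assumptions; real systems use trained models rather than
# regexes, for the reasons listed above (languages, OCR/ASR noise, context).

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "DATE_OF_BIRTH": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # quasi-identifier
}

def minimize(text: str) -> str:
    """Replace every detected identifier with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Customer Jane Doe (jane.doe@example.com, born 04/12/1986) called about her bill."
print(minimize(record))
# Customer Jane Doe ([EMAIL], born [DATE_OF_BIRTH]) called about her bill.
```

Note that the name slips through untouched: direct identifiers like names do not follow fixed patterns, which is part of why detection is a machine learning problem rather than a simple pattern-matching one.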

Thaine also noted that data privacy is only one of a myriad of issues connected to generative AI, alongside bias and copyright concerns, among others.

At the very least, she said, regulatory compliance with privacy in AI can be achieved by design, that is, from the beginning of work on the AI, rather than through what she called "a band-aid solution at the end."



Privacy corner cases and being a firewall for PII

Speaking with James O’Donnell, MIT Technology Review’s AI reporter, Thaine also discussed privacy corner cases and PII as they relate to the work of Private AI.

When asked about special or surprising cases her team had encountered or needed to consider, she pointed to the location of credit card numbers: insurance companies, for instance, had documents where a credit card or account number was written along the side of a piece of paper, a corner case that surprised the team.

Meanwhile, when it came to healthcare data, she said, “You need to be careful to… under HIPAA, not reveal the month of the birthday, along with the year of the birthday… zodiac signs are something you need to look out for.”
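
As a rough illustration of the kind of date handling that quote points to, the toy sketch below keeps only the year of a date of birth and flags zodiac-sign mentions, which can leak the birth month. It is an assumption-laden example, not a complete HIPAA de-identification procedure.

```python
import re

# Toy illustration of the date handling Thaine alludes to for health data:
# keep only the year of a date of birth and flag zodiac-sign mentions, which
# can leak the birth month. Not a complete HIPAA de-identification procedure.

ZODIAC = {"aries", "taurus", "gemini", "cancer", "leo", "virgo", "libra",
          "scorpio", "sagittarius", "capricorn", "aquarius", "pisces"}
DOB = re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b")

def generalize_dob(text: str) -> str:
    """Replace a full date of birth with its year only."""
    return DOB.sub(lambda m: m.group(1), text)

def flag_zodiac(text: str) -> list[str]:
    """List zodiac signs mentioned in the text ('cancer' is ambiguous in clinical notes)."""
    return [sign for sign in ZODIAC if re.search(rf"\b{sign}\b", text, re.IGNORECASE)]

note = "Patient born 03/29/1985, a self-described Aries, reports knee pain."
print(generalize_dob(note))  # Patient born 1985, a self-described Aries, reports knee pain.
print(flag_zodiac(note))     # ['aries']
```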

Thaine also discussed Private AI’s coordination with partners, such as companies in highly regulated industries or those aiming to innovate quickly, to identify specific corner cases where privacy can be strengthened for a particular situation, and then to generate synthetic or fake information to train their systems not to divulge information in those cases.

She mentioned that her company works like a “PII firewall for any data going into data science teams or as a firewall for data going out to large language models or being used to train or fine-tune models without that memorization of personal information,” as well as reducing the possibility of PII being reverse-engineered.
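
A minimal sketch of that firewall pattern, assuming a hypothetical detect_pii detector and call_llm client (neither is Private AI's actual API), might look like this: personal information is replaced with placeholders before the prompt leaves the organization, and the placeholders are mapped back locally once the response returns.

```python
# Minimal sketch of a "PII firewall" around an LLM call. `detect_pii` is a
# hard-coded stand-in for a real detector (in practice a trained NER model),
# and `call_llm` is any function that sends text to a language model.

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (span, label) pairs found in the text; hard-coded for this demo."""
    return [("Maria Santos", "NAME"), ("maria@example.com", "EMAIL")]

def firewalled_completion(prompt: str, call_llm) -> str:
    placeholders = {}
    redacted = prompt
    for i, (span, label) in enumerate(detect_pii(prompt)):
        token = f"[{label}_{i}]"
        placeholders[token] = span
        redacted = redacted.replace(span, token)

    answer = call_llm(redacted)  # only redacted text leaves the organization

    for token, original in placeholders.items():
        answer = answer.replace(token, original)  # re-identify locally
    return answer

# Stubbed model call for demonstration:
fake_llm = lambda p: f"Model saw: {p}"
print(firewalled_completion(
    "Draft a reply to Maria Santos (maria@example.com) about her refund.", fake_llm))
```

The same placeholder step applies before data reaches data science teams or a fine-tuning set; in that setting the placeholders are simply never mapped back, so the model has no personal information to memorize.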

Thaine, in closing, also noted that privacy regulators need common ground for understanding PII, and that the importance of privacy, especially in sectors like the ad space, has to be expounded on. Said Thaine, “You can’t expect somebody to build technology to comply with regulations that even regulators don’t necessarily agree upon.”

“Data breaches become more and more costly, and as you see with the FTC (Federal Trade Commission) forcing people to delete their (AI) models if they don’t do the right thing, that is the motivator likely needed to have people actually be more careful with their data.” – Rappler.com


