Will LLM Adoption Demand More Stringent Data Security Measures?
The rise of large language models (LLMs) has significantly changed how we communicate, conduct research, and enhance our productivity, ultimately transforming society as we know it. LLMs are exceptionally skilled at natural language understanding and at generating language that seems more accurate and human-like than that of their predecessors. However, they also pose new risks to data privacy and the security of personal information. Compared with narrower AI systems, LLMs introduce more complex issues: sophisticated phishing attacks, manipulation of online content, and breaches of privacy controls.
A recent study by MixMode analyzed data from the National Cyber Security Index (NCSI), the Global Cybersecurity Index (GCI), the Cybersecurity Exposure Index (CEI), and findings from Comparitech to assess cyber safety across 70 countries. Findings indicate that countries with the most robust cybersecurity infrastructures include Finland, Norway, and Denmark. The United Kingdom, Sweden, Japan, and the United States also maintain strong defenses against cyber threats. While the USA scores highest on the Global Cybersecurity Index, it only ranks ninth in overall safety. Similarly, Canada, with a strong Global Cybersecurity Index, ranks tenth overall in safety.
More worrisome are the countries that face the highest risk of cyber-attacks. The threat exposure in these emerging countries spans the economic impact of potential financial losses for both public and private sectors; threats to national security systems, particularly vulnerability to cyber espionage; and attacks on critical infrastructure and public safety. These countries, more prone to data breaches, face the exposure of highly sensitive information, leading to identity theft, financial fraud, a growing base of cybercrime, and eroded investor and consumer confidence.
With the rise of LLM adoption, the Biden administration signed the reauthorization of Section 702 of FISA into law in April, extending warrantless surveillance and affecting U.S. civil liberties amid widespread data-collection concerns, further complicating the government’s role in creating safeguards and building trust around artificial intelligence.
To discuss the vulnerabilities associated with large language models and the ramifications of the new law on individuals’ data privacy rights and civil liberties, especially concerning emerging tech companies, I met with two experts: Saima Fancy, a data privacy expert and former privacy engineer at Twitter/X, and Sahil Agarwal, CEO of Enkrypt AI, an end-to-end solution for generative AI security.
The Perpetual Demand for Data
Between 2022 and 2023, there was a 20% increase in data breaches. Publicly reported data compromises rose by 78% year-over-year. The average cost of a data breach hit an all-time high last year at $4.45 million, marking a 15% increase from three years ago. Notably, 90% of organizations experienced at least one third-party vendor suffering a data breach. Globally, the number of victims in 2023 doubled compared to 2022.
Saima Fancy explained that the driving force behind these issues is organizations’ intense appetite for data collection, which often results in reckless behavior. Tools like OpenAI’s, she indicated, are launched prematurely to maximize data collection. “These technologies are often released too early by design,” she noted, “to accumulate as much data as possible. While they appear free, they’re not truly free because you’re providing your personal data, essential for training their models.”
She added that organizations could have opted to legally acquire data and train their models in a structured manner. “Many tools were ready but weren’t launched immediately because they were undergoing rigorous sandboxing and validation testing,” she explained, noting the rush to release new technologies isn’t always deliberate but often fueled by enthusiasm and the pressure to innovate, which can lead to unintended consequences. “There’s a race to release new technologies, which can inadvertently cause harm. It’s often just a case of ‘let’s release it and see what happens.'”
Fancy also highlighted that many of these tools are rarely stable in their nascent form, and developers will fine-tune the models over time. “This means the initial outputs might not be accurate or what users expect. Despite this ongoing learning phase, these tools are already live on a global scale,” she added.
LLMs and the Indiscriminate Scraping of PII
Given that LLMs have been developed from data gathered indiscriminately across web properties, there’s a risk of exposing sensitive details such as credentials, API keys, and other confidential information. Fancy believes that society’s susceptibility to security threats is unprecedented, observing, “The public is undoubtedly more vulnerable now than ever before. We’re in the fifth industrial revolution, where generative AI is a relatively new phenomenon. As these tools are released publicly without adequate user education, the risk skyrockets. People input sensitive data like clinical notes into tools like ChatGPT, not realizing that once their personal information is entered, it is used to train models. This can lead to their data being re-identified if it is not properly protected.”
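To make that exposure concrete, below is a minimal sketch of how a team might redact obvious PII and secret patterns from text before it ever reaches a hosted LLM endpoint. The patterns and the `redact` helper are illustrative assumptions, not any vendor’s tooling, and a handful of regexes is nowhere near an exhaustive safeguard.

```python
import re

# Illustrative patterns only; a real deployment would use a vetted PII/secret
# detection library and context-aware classifiers, not a few regexes.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|api)[-_][A-Za-z0-9]{16,}\b"),
    "phone": re.compile(r"\+?\b\d[\d\s().-]{8,}\d\b"),
}

def redact(text: str) -> str:
    """Replace matched spans with labeled placeholders before the text
    leaves the organization's boundary (e.g., before an LLM API call)."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text

if __name__ == "__main__":
    note = "Patient jane.doe@example.com, SSN 123-45-6789, called from +1 (555) 010-2030."
    print(redact(note))
    # Patient [REDACTED-EMAIL], SSN [REDACTED-SSN], called from [REDACTED-PHONE].
```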
She emphasized that the risk extends beyond individual users to corporations, particularly if employees are not properly trained in using these technologies, including prompt engineering. “The vulnerability is extremely high, and for corporations, the risk is even greater because they risk losing public trust and they are not immune to data breaches. As these tools evolve, so do the techniques of malicious actors, who use them to refine their ransomware and phishing attacks, making these threats more sophisticated and costly to mitigate.”
Today we are witnessing the swift emergence of regulation through the EU AI Act, the most comprehensive AI legislation to date, and the Biden Executive Order on Safe, Secure, and Trustworthy AI in late 2023. Sahil Agarwal, CEO of Enkrypt AI, points out, “Since the introduction of ChatGPT, there has been a significant increase in awareness among the public, legislators, and companies about the potential risks of AI. This heightened awareness has surpassed much of what we’ve seen over the past decade, highlighting both the potential and the dangers of AI technologies.”
He adds that the regulatory mandates are clear: if you’re handling customer data or distributing tools, you need to be mindful and ensure those tools aren’t used for harmful purposes. Nor do the penalties apply at random to any startup working with generative AI technology; instead, he continues, “…they’re targeted at specific stages of a company’s development and certain types of general-purpose AI technologies,” emphasizing that “they’re there to guide more responsible innovation.”
Cyber Attacks are More Advanced
Today’s AI technology can produce highly convincing phishing emails or messages that appear legitimate and trick individuals into revealing sensitive information. LLMs have also enabled attackers to conduct more effective social engineering attacks by generating personalized messages or responses based on extensive knowledge of the target’s online activity, preferences, and behaviors. We’ve also seen a rise in malicious content, as LLMs can be used to generate fake news articles or reviews that spread misinformation, manipulate public opinion, or propagate malware. The speed at which LLMs can automate certain stages of cyber-attacks, generating targeted queries, crafting exploit payloads, or bypassing security measures, makes it increasingly difficult to identify these attacks before they are carried out. In addition, the rise of adversarial techniques means inputs can now be manipulated to generate outputs intended to deceive spam filters and fraud and malware detection systems.
Fancy explained that both personal and commercial systems are affected, and that many are vulnerable due to outdated technology or inadequate security measures during transitions to cloud or hybrid systems. Startups often overlook security and privacy measures, which she called a clear oversight: “From a privacy-by-design perspective, being proactive rather than reactive is crucial. Starting with security measures early in the software development life cycle is essential to prevent sophisticated attacks that can severely impact institutions like banks and hospitals.”
She pointed to the naive belief many startups hold that they’re immune to attacks, adding, “But what LLMs have done is dramatically broaden the attack landscape, making it easier for malicious actors to observe and eventually exploit vulnerabilities in a company’s systems. It’s crucial for businesses, especially startups, to prioritize funding for security measures from the outset to protect against these threats. Regulations are also becoming more stringent globally, which could have severe repercussions for non-compliance.”
Agarwal revealed that his team has uncovered serious safety issues and vulnerabilities in the top LLMs and agrees that these models are perhaps not ready for prime time. “Traditionally, cybersecurity and standard Data Loss Prevention (DLP) solutions have focused on monitoring specific keywords or personally identifiable information (PII) exiting the network. However, with the advent of generative AI, the landscape has shifted towards extensive in-context data processing.” DLP solutions, he argues, should now be tuned to the context of the activities or potential security threats occurring, rather than just monitoring data outflows.
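As an illustration of what tuning to context might look like in practice, here is a minimal sketch, my own construction rather than Enkrypt AI’s product or any standard DLP API, that scores an LLM interaction on the combination of who is asking, where the data is going, and whether sensitive markers appear, instead of matching keywords on outbound traffic alone. The roles, destinations, and thresholds are assumptions for illustration.

```python
from dataclasses import dataclass

SENSITIVE_TOPICS = {"payroll", "patient records", "source code", "credentials"}

@dataclass
class LLMInteraction:
    user_role: str        # e.g. "contractor", "analyst", "admin"
    destination: str      # e.g. "public-llm-api", "internal-model"
    prompt: str
    contains_pii: bool    # output of an upstream PII detector

def assess_risk(event: LLMInteraction) -> str:
    """Score an interaction on its context, not just keyword egress."""
    score = 0
    if event.contains_pii:
        score += 2
    if event.destination == "public-llm-api":
        score += 2        # data leaves the trust boundary
    if any(topic in event.prompt.lower() for topic in SENSITIVE_TOPICS):
        score += 1
    if event.user_role == "contractor":
        score += 1        # least-privileged roles get extra scrutiny
    return "block" if score >= 4 else "review" if score >= 2 else "allow"

print(assess_risk(LLMInteraction("contractor", "public-llm-api",
                                 "Summarize these patient records", True)))  # block
```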
Agarwal added that standard practices for the safe adoption of these AI systems have yet to emerge, mainly because of the nascency of LLMs: “While many discuss the concept of responsible adoption, the specifics of implementing, measuring risks, and safeguarding against potential issues when using large language models and other generative technologies in businesses have not been fully established.”
The Irony of AI Safety Amid the Reauthorization of Section 702 of FISA into Law
The recent reauthorization of Section 702 of the Foreign Intelligence Surveillance Act (FISA) into law highlights a dichotomy within the same U.S. government that has also mandated the development of trustworthy artificial intelligence. FISA was established in 1978 to surveil foreign individuals and underwent significant changes after 9/11, allowing broad sweeps of U.S. citizens’ information without a warrant. Under this practice, known as “incidental collection,” information on Americans gathered alongside foreign targets could still be retained. Key events include the 2013 revelations by NSA whistleblower Edward Snowden, who exposed the PRISM program and the involvement of major tech companies like Google and Facebook (at the time) in giving the government unfettered access to user data. This bulk collection of U.S. citizens’ emails, mobile communications, and phone records was later ruled unconstitutional. Despite added provisions to increase oversight and minimize incidental collection, the law was renewed without amendments, resulting in more powerful surveillance capabilities than those contemplated during the initial AI hype eight years ago.
Fancy called the renewal a regrettable development for companies that collect data and build models. “With laws like this being renewed for another two years, the implications are huge. Federal governments can compel companies to hand over data on swaths of the population, which is essentially surveillance at another level. This is scary for regular folks and is justified under the guise of world state protection. It may seem necessary at face value, but it opens doors wide for abuse on a larger scale. It is unfortunate.”
While there may seem to be no way out of this mandate, companies like Signal and Proton claim they cannot access user data because end-to-end encryption gives only users access to their own information. Could this absolve companies of their responsibility amid renewed government compliance demands? Fancy acknowledged, “There are data vaults where customers can store their data, and the company only works with the metadata. These vaults are locked and encrypted, accessible only by the customer. Blockchain technology is another method, where users hold the keys to their data. These methods can protect user data, but they also limit the company’s ability to monetize the data. Nonetheless, they are valid use cases for protecting user information.”
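A minimal sketch of the data-vault idea Fancy describes, using the widely available cryptography library: the customer generates and holds the key, and the service stores only ciphertext plus non-sensitive metadata. This is an assumption-laden illustration of the pattern, not how Signal or Proton actually implement end-to-end encryption.

```python
# pip install cryptography
from cryptography.fernet import Fernet

# 1. The customer generates and keeps the key; the service never sees it.
customer_key = Fernet.generate_key()
vault = Fernet(customer_key)

# 2. Only ciphertext and non-sensitive metadata are sent to the provider.
record = b"clinical note: follow-up scheduled for patient #4521"
stored_by_provider = {
    "ciphertext": vault.encrypt(record),
    "metadata": {"size_bytes": len(record), "created": "2024-06-01"},
}

# 3. The provider can work with the metadata but cannot read the record;
#    only the customer, holding the key, can decrypt it.
assert vault.decrypt(stored_by_provider["ciphertext"]) == record
```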
Startups Need to be Proactive: Data Privacy and Security Should not be an Afterthought
Agarwal is adamant that innovation should not come at a human cost. He argues that startups creating technologies with potential adverse effects on customers need to integrate ethical guidelines, safety, security and reliability measures from the early development stages, adding, “We cannot adopt a wait-and-see approach, scaling up technology like GPT-4o to widespread use and only then addressing issues reactively. It’s not sufficient to start implementing safety measures only after noticing misuse. Proactive incorporation of these safeguards is crucial to prevent harm and ensure responsible use from the start.”
Fancy emphasized that many out-of-the-box tools are available to help companies implement new security measures within their cloud infrastructure. Whether the cloud is private or public, these tools can manage data sets, create cloud classifications, categorize and bucket data for protection, and constantly monitor for infractions, loopholes, or openings within the cloud structure. She acknowledged, however, that these tools are not cheap and require upfront investment as data, structured or unstructured, is brought in. As Fancy put it, “It’s crucial for companies to know where their data is located, as many big data companies admit to not knowing which data centers their data resides in, posing a significant risk.” She also pointed out the importance of encryption, both for data at rest and in transit. Companies often focus only on encrypting data at rest; however, she notes, “…infractions frequently occur when data is traveling between data centers or from a desktop to a data center. You’ve got to make sure the encryption is happening there as well.”
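As a rough illustration of the classify-and-bucket workflow those commercial tools automate, here is a toy sketch that tags records by sensitivity and returns the protection policy they should be stored under. The labels, keyword rules, bucket names, and policy fields are all hypothetical assumptions, not any particular product’s schema.

```python
# Toy model of classify-and-bucket; real tools use ML classifiers, not keywords.
CLASSIFICATION_RULES = [
    ("restricted",   ["ssn", "diagnosis", "api_key", "password"]),
    ("confidential", ["salary", "contract", "email"]),
    ("internal",     ["roadmap", "design doc"]),
]

BUCKET_POLICY = {
    "restricted":   {"bucket": "vault-encrypted", "encrypt_at_rest": True,  "tls_in_transit": True},
    "confidential": {"bucket": "secure-store",    "encrypt_at_rest": True,  "tls_in_transit": True},
    "internal":     {"bucket": "general-store",   "encrypt_at_rest": False, "tls_in_transit": True},
    "public":       {"bucket": "general-store",   "encrypt_at_rest": False, "tls_in_transit": True},
}

def classify(record: str) -> str:
    """Return the first sensitivity label whose keywords appear in the record."""
    text = record.lower()
    for label, keywords in CLASSIFICATION_RULES:
        if any(k in text for k in keywords):
            return label
    return "public"

def route(record: str) -> dict:
    """Map a record to the protection policy it should be stored under."""
    return BUCKET_POLICY[classify(record)]

print(route("Employee salary review for Q3"))
# {'bucket': 'secure-store', 'encrypt_at_rest': True, 'tls_in_transit': True}
```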
She highlighted that we are in a favorable time of technological evolution, with many tools available to enhance data security. While these tools’ costs are decreasing, it’s still necessary for companies to allocate funds for privacy and security measures right from the start. “As your funding is coming in, you’ve got to set some money aside to be mindful of doing that,” Fancy stated. Unlike major companies that can absorb the impact of regulatory actions, small startups must be proactive in their security measures to avoid severe consequences.
Startups, however, can take other measures to protect data privacy and security. Fancy suggests smaller startups practice data minimization, “collecting data only for the purpose that the data was collected. They can put robust consent management measures and audit logging in place that allow startups to manage logs effectively and monitor access controls to see who has access to the data.”
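As an illustration of the lightweight audit logging and access-control monitoring Fancy recommends, below is a minimal sketch in which every access attempt, granted or denied, is appended to a log so a startup can later answer who touched which data and why. The roles, table layout, and allow-list are hypothetical.

```python
import json, sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("audit.db")
conn.execute("""CREATE TABLE IF NOT EXISTS access_log (
    ts TEXT, actor TEXT, action TEXT, resource TEXT, purpose TEXT)""")

# Minimal role/action allow-list; real systems use a proper policy engine.
ALLOWED = {("analyst", "read"), ("admin", "read"), ("admin", "delete")}

def access(actor_role: str, actor: str, action: str, resource: str, purpose: str) -> bool:
    """Check the allow-list and append the attempt (granted or denied) to the audit log."""
    allowed = (actor_role, action) in ALLOWED
    conn.execute("INSERT INTO access_log VALUES (?, ?, ?, ?, ?)",
                 (datetime.now(timezone.utc).isoformat(), actor,
                  f"{action}:{'granted' if allowed else 'denied'}", resource, purpose))
    conn.commit()
    return allowed

access("analyst", "s.khan", "read", "customers/emails", "support ticket #812")
access("analyst", "s.khan", "delete", "customers/emails", "cleanup")  # denied, still logged
print(json.dumps(conn.execute("SELECT * FROM access_log").fetchall(), indent=2))
```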
Fancy noted that there are healthy mechanisms and practices within data ecosystems that startups can adopt to enhance their security measures. She stated, “These practices can be a viable alternative for startups that cannot afford comprehensive security tools initially.” Startups could start with smaller features of these tools and expand their capabilities as their funds grow. By adopting these practical measures, startups can effectively enhance their data security and privacy protection without incurring significant initial costs.
Ethics is Now Mainstream
Technology, compliance, and brand risk collide in the realization that the most compelling technology to date, brought to the masses by OpenAI, the heavily venture-backed darling of large language models, is still flawed. Agarwal recalls recently speaking with a principal at a private equity firm that was dealing with a major scandal after articles it produced were leaked, a consequence of its decision to remove certain safeguards. He stressed, “This incident highlights that ethical considerations are not just about morality; they’ve become crucial to maintaining a brand’s integrity. Companies are motivated to uphold ethical standards not solely for altruism but also to mitigate brand risks. This need for robust controls is why topics like generative AI security are gaining prominence. At events like the RSA conference, you’ll find that one of the central themes is either developing security solutions for AI or employing AI to enhance security measures.”
Data Security Needs to be in Lock-Step with Advancing AI
The reauthorization of Section 702 of FISA underscores a significant tension between the pursuit of advanced AI technologies and the imperative of safeguarding data privacy. As large language models (LLMs) become increasingly integrated into various aspects of our lives, the potential for sophisticated cyber-attacks and the erosion of individual privacy grows. Saima Fancy and Sahil Agarwal emphasize the urgent need for robust cybersecurity measures, ethical guidelines, and proactive regulatory compliance to mitigate these risks.
Promoting AI innovation while ensuring data privacy and security presents a complex challenge, prompting organizations to balance the benefits of cutting-edge technologies with the responsibility of protecting sensitive information. Ensuring AI safety, maintaining public trust, and adhering to evolving regulations and security standards are essential to responsible innovation. It’s early days, but we’re already seeing organizations take the crucial steps of prioritizing security and ethical considerations to foster a safer and more trustworthy world.