Generative AI

Can data ownership be preserved in generative AI?


John Thompson, global head of artificial intelligence, EY

Transcription:

Transcripts are generated using a combination of speech recognition software and human transcribers, and may contain errors. Please check the corresponding audio for the authoritative record.

Penny Crosman (00:04):

Welcome to the American Banker Podcast. I’m Penny Crosman. John Thompson, global head of artificial intelligence at EY, has set up what is believed to be the largest private, secure, generative AI environment in the world. He’s here with us today to tell us a little bit about how he and his team did this, and especially how he has dealt with issues around data ownership and data privacy along the way. Welcome, John.

John Thompson (00:29):

Thanks, Penny. How are you today?

Penny Crosman (00:31):

Good, how are you?

John Thompson (00:32):

I’m doing great, thanks.

Penny Crosman (00:33):

Good. Thanks for coming. So I think I heard you say that you’ve been given an almost unlimited budget for generative AI projects. Is that true?

John Thompson (00:44):

Well, I wouldn’t say it was unlimited. I would say last year, when Gen AI was exploding, EY leadership was very excited about it, interested in it, and wanted to have as many of the firm’s 400,000 people enabled with Gen AI as possible. So they actually said you can spend quite a bit of money, but not unlimited.

Penny Crosman (01:07):

Okay. And can you say anything about your process of looking at and deciding which generative AI models to work with?

John Thompson (01:21):

Sure, yeah. EY is a rather large firm, conservative by nature, being an auditing, tax and accounting oriented firm. So the real questions that came up at EY when we were deciding to do this were around data privacy, security, those kinds of things. So I spent a lot of time with our governing functions, which are InfoSec, risk management, data privacy and the legal office, having conversations with those four bodies and asking them: what are really the no-go areas for you? For each of these areas, what’s really going to make you say no to standing up this service? So it was more around understanding the culture of the organization and its concern around data and protecting it than picking which model we were going to use.

Penny Crosman (02:16):

Can you share an example of a deal breaker?

John Thompson (02:21):

Sure, yeah. One of those organizations, I think it might’ve been risk management, came back and said: none of the information that goes into the models or comes out of the models can be retained at any time after each of the queries or prompts has been processed. So when you send a prompt to the model and it’s processed and the response comes back, we had to ensure that there was no persistence of that data anywhere in the systems. Their view was that that simply cannot be; we can’t have hundreds of thousands of people doing things and leaving data lying around all over the place.
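
A rough sketch of what such a non-persistence requirement can look like in code is below; the handle_prompt wrapper and the stand-in model are hypothetical illustrations, not EY’s implementation.

```python
# Minimal sketch (hypothetical, not EY's code) of a "no persistence" prompt handler:
# the prompt and response exist only for the duration of the request and are never
# logged, cached, or written to disk.
def handle_prompt(prompt: str, call_model) -> str:
    """Process one prompt statelessly; nothing is retained afterwards."""
    response = call_model(prompt)  # call_model is a placeholder for any LLM client
    # Deliberately no logging, caching, or database write of the prompt or response.
    return response

if __name__ == "__main__":
    echo = lambda p: f"(model output for: {p})"  # stand-in model for the sketch
    print(handle_prompt("Summarize this memo.", echo))
```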

Penny Crosman (03:04):

It’s really interesting. Can the model still learn even though it no longer has access to the data?

John Thompson (03:13):

Well, models, when they’re in scoring mode, are not learning. Neural network models, which LLMs and SLMs are, operate in a training mode where you’re teaching them, you’re putting in training data and they’re learning, and then the models are locked. So most of the models that you’re using are not learning; they’re actually in scoring mode. We were using models that were already trained, so no, they actually weren’t learning anything.
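
To make that distinction concrete, here is a minimal, generic PyTorch sketch, not EY’s system, of a neural network in training mode versus locked "scoring" (inference) mode, where user inputs are processed but the weights never change.

```python
# Generic illustration of training mode vs. scoring (inference) mode.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

# Training mode: gradients flow and an optimizer updates the weights.
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

# Scoring mode: the network is frozen; inputs are processed and outputs produced,
# but nothing the user sends changes the weights.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 16)).argmax(dim=-1)
print(prediction)
```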

Penny Crosman (03:49):

Well, and it’s interesting, the idea that they were already trained, because this is something I’ve been thinking about, and this may just be my lack of understanding, but I’ve been thinking about these issues around data ownership and how OpenAI seems to have kind of hoovered up content from the New York Times, from Reddit, from Twitter, from novels, all kinds of content sources that you wouldn’t necessarily have expected a company to just absorb to train GPT-4. And of course, it’s facing several lawsuits from content creators because of this. But I guess I have several questions around this, because first of all, assuming OpenAI wanted to address this, could it just expunge all of this data from its models, or would it have to start over? And if the content wasn’t used or was deleted, would the large language model still be as useful, or does some of the usefulness lie in all of the content that has been absorbed?

John Thompson (05:05):

So the quick answers are no, no, and yes, and I’ll go back and answer those in turn. I’ve written a book on this called Data for All, and my position in that book, and in my reading of the law, is that any kind of data, whether it’s a newspaper article in the New York Times or a sonnet to your puppy or whatever it is, is owned by its creator. And those creators are due recompense for the use of their data, so companies should not be using data that they haven’t licensed appropriately. To the question, can they remove certain data from a model once it’s trained? The answer is no. Some people would say you could, but it kind of destroys the model, so you would have a technical argument there with some people. But my answer is no: you can’t just excise certain pieces of data after the model is trained. It just doesn’t work that way. And then the last question was, I said yes, but what was the question, Penny?

Penny Crosman (06:14):

Well, I think it was just: does the usefulness of the model lie in all of that content that’s been absorbed?

John Thompson (06:22):

It does, yes. And the real difference, as I said, is that large language models and small language models are neural networks at their heart; that’s what they are. And we’ve been working with neural networks for close to 30, 40, 50 years at this point. But OpenAI and the other large language model vendors, Mistral and others, cracked the code. I mean, OpenAI was first, so let’s give them their due. They cracked the code: the scale of the information being hoovered up and ingested into the model during the training phase is what really made the difference. These models have been around for a long time, and these techniques have been known for decades, but it was really the data that made the difference and made large language models what they are today. So yes, the answer to your third question is yes: the value of these models is supremely dependent on the data they ingest.

Penny Crosman (07:20):

Well, excuse me. Then you have that issue of the potential for bias, because, for instance, I don’t know if you use Reddit very much, but there are definite points of view that people share that are very specific. And if, say, a bank is using a large language model that’s been trained on some of that content to, say, summarize customer service calls, which a lot of banks are using generative AI for today, is there a legitimate fear that bias that may have been absorbed from perusing some of these sites could then be brought to bear on these seemingly very innocuous activities like summarizing calls?

John Thompson (08:12):

Yeah. When you bring in internet-scale data and large forums like Reddit, you are going to end up with unfortunate opinions, opinions that you would not want expressed in a business context. And the model producers, OpenAI, Mistral, Microsoft, Meta, have done a really good job with the system prompts in making sure that the models don’t do that, that they suppress those kinds of responses. Now, those responses still happen and the models still hallucinate, but the system prompts and the other safeguards and guardrails that have been put into the models do a really good job of suppressing those kinds of responses. So many people think, oh, well, it’s been taken care of. Well, no, it’s still a foundational component of the model; it’s just that there have been layers and layers added onto the model to stop those kinds of responses and suppress those kinds of biases.

(09:09):

Now, you can fine-tune a model; there is a way to go and do that. There are a number of ways to ground models: fine-tuning is one, retrieval-augmented generation is another. These models, as I said, still hallucinate, they still say unfortunate things, but you’re wrapping other layers of information around them so that the responses are, at a very high rate, 80 to 90%, 90-plus percent, within the range of acceptable responses. So yes, there is bias in the training data. That’s always going to be the case when you’re taking in information at this scale, but there are plenty of ways to make sure that the models don’t express it in unfortunate ways.
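
As a rough illustration of the grounding techniques named above, here is a toy retrieval-augmented generation sketch; the corpus, the keyword-overlap retriever and the call_model placeholder are assumptions made for the example, not a description of EY’s or any vendor’s system.

```python
# Toy RAG sketch: retrieve supporting text, then build a grounded prompt with a
# guardrail-style system prompt. Everything here is illustrative only.
CORPUS = {
    "doc1": "The refund policy allows returns within 30 days of purchase.",
    "doc2": "Support hours are 9am to 5pm Eastern, Monday through Friday.",
}

SYSTEM_PROMPT = (
    "Answer only from the provided context. If the answer is not in the context, "
    "say you do not know. Do not express personal opinions."
)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Naive keyword-overlap retrieval; a real system would use embeddings."""
    words = set(question.lower().split())
    def overlap(doc: str) -> int:
        return len(words & set(doc.lower().split()))
    return sorted(CORPUS.values(), key=overlap, reverse=True)[:k]

def grounded_messages(question: str) -> list[dict]:
    """Wrap the user question with retrieved context and the system prompt."""
    context = "\n".join(retrieve(question))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# messages = grounded_messages("What is the refund window?")
# response = call_model(messages)  # call_model is a placeholder for any chat LLM API
```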

Penny Crosman (09:54):

Can you share some examples of ways that this generative AI system has been useful at EY?

John Thompson (10:03):

As you said in the introduction, and those aren’t my words, they come from Gartner and Forrester and other analysts who talk to many companies around the world, we’ve built the world’s largest generative AI environment, used by well over 275,000 people on a daily basis. And those people use it for all sorts of use cases. They summarize documents, they write PowerPoint presentations, they ask factual questions about different situations. So it’s wide-ranging what they’re using it for in a business context. We have it being used in audit, tax, consulting, financial services, in mergers and acquisitions. So people are doing their day jobs and actually being very productive with it. I hear all sorts of anecdotal evidence: I cut two hours doing this, I cut four hours doing that, it took me five minutes to compare these 10 documents when in the past it would’ve taken me four hours. So lots and lots of use cases across those many hundreds of thousands of people.

Penny Crosman (11:11):

So a lot of research assistance, that kind of thing?

John Thompson (11:15):

Yeah, writing first drafts, research, summarization, writing computer code, and writing papers and documents and responses and things like that.

Penny Crosman (11:26):

Are you under any pressure to show a return on investment for this?

John Thompson (11:31):

In the first year? No, because we wanted to have many, many people really engage with the technology, and we’ve done a good job of that. I would think that EY is in the top 1% of companies with the broadest base of employees engaged with generative AI in a meaningful way. We are in our budget cycle right now, and I’m not telling anybody anything new; our fiscal year ends at the end of June. So as we go into fiscal year 25, there is lots of conversation about the return on investment on what we spent this year and what it’s going to be going forward.

Penny Crosman (12:11):

Do you have any advice or suggestions for people at other companies who are thinking about implementing generative AI models for various use cases, and maybe are testing and experimenting a little bit? What are some of the things people should think about when they’re evaluating providers and assessing platforms?

John Thompson (12:36):

Yeah, that’s a great question. I probably wouldn’t have said this two years ago, but I really think that Microsoft and their Azure environment have done a really good job of setting up lots of guardrails and a lot of the security and privacy controls. Many of the things that every company wants and needs when it’s setting up large language models and small language models can be found in the Azure environment. So don’t just go out there and do it on your own; partner with someone. AWS has it, Google has it, all the big providers have really mature platforms, and they’re all moving really fast right now. So I wouldn’t do it alone. If I were you, I would find a family of models that you like that are suited to the problems that you have, whether it’s document summarization or question answering or inference or whatever it happens to be, and use those models in one of those compute or cloud environments. That’s my top-level guidance: don’t do it alone. Work with a partner on this.
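
As one example of that "work with a provider" approach, the following sketch calls a model hosted in Azure through the openai Python SDK’s Azure client. The endpoint, deployment name and API version strings are placeholders, and this is offered as a generic illustration, not as EY’s configuration.

```python
# Generic sketch of calling a hosted chat model through a managed cloud provider.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # placeholder API version
)

response = client.chat.completions.create(
    model="my-gpt-deployment",  # the deployment name configured in Azure (placeholder)
    messages=[
        {"role": "system", "content": "You summarize documents concisely."},
        {"role": "user", "content": "Summarize the attached memo in three bullet points."},
    ],
)
print(response.choices[0].message.content)
```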

Penny Crosman (13:42):

And when you think about some of those safeguards you were talking about earlier, like making sure the system erases all prompts and doesn’t store anything, did you have to build some of that in yourselves, or were you able to find a provider that did all of that for you?

John Thompson (14:00):

Most of our environment is built in Microsoft Azure, so many of the things that our governing functions wanted us to do, like the non-persistence of data, were configurable in the platform. There was one area where we thought we had it done and we had the system up and running, and then we found out that Microsoft’s abuse monitoring system actually persisted data. So we had to take some additional actions in that area, but we didn’t really have to build much. It was pretty much all configure it, test it, launch it, and away we went.

Penny Crosman (14:38):

Great. And if people want to hear more from you, can you tell us a little bit about some of the books that you’ve written that people could read for themselves?

John Thompson (14:47):

It’s very kind of you, Penny. I appreciate that opportunity. I’ve written four books. The first one was on how to stand up an enterprise AI function in a large company; that one’s kind of dated now, as most people have already done that. My second book is called Building Analytics Teams, which, I’m humbled and surprised, continues to sell four years after publication. It’s actually accelerating in sales at this point, so it’s selling better now than it did at launch, and I’m going to be building a class on it and teaching it at the University of Michigan soon. Then I wrote a book on data and data ownership. It’s really surprising to me that when I ask an audience I’m speaking to how many people think they own their data, maybe 5% of people raise their hands. That was the reason for writing Data for All: to give people the understanding that if you generate data, you own it, you have the right to delete it, you have the right to use it in the way that you want to, and you have the right to stop people from using it if you don’t want them to.

(15:53):

So Data for All was my third book, and then I wrote a book on causal AI, which is the next big thing in AI that’ll be coming in a couple of years. And then the fifth book, which I’m working on now, is about traditional AI, causal AI and generative AI, and how all those families of AI are going to work together. All of those books can be found on Amazon.

Penny Crosman (16:15):

Sounds good. I will take a look. Well, John Thompson, thank you so much for joining us today, and to all of you, thank you for listening to the American Banker Podcast. I produced this episode with audio production by Adnan Khan. Special thanks this week to John Thompson at EY. Rate us, review us and subscribe to our content at www.americanbanker.com/subscribe. For American Banker, I’m Penny Crosman, and thanks for listening.


