
Figuring Out The Innermost Secrets Of Generative AI Has Taken A Valiant Step Forward


In today’s column, I aim to provide an insightful look at a recent AI research study that garnered considerable media attention, suitably so. The study takes on, once again, the Holy Grail ambition of figuring out how generative AI manages to be so amazingly fluent and conversational.

Here’s the deal.

Nobody can right now explain for sure the underlying logical and meaningful basis for generative AI being extraordinarily impressive. It is almost as though an awe-inspiring magical trick is taking place in front of our eyes, but no one can fully delineate exactly how the magic truly works. This is a conundrum, for sure.

Many AI researchers are avidly pursuing the ambitious dream of cracking the code, as it were, and finding a means to sensibly interpret the massive mathematical and computational morass that underlies modern-day large-scale generative AI apps, see my coverage at the link here. They do so because they are intrigued by the incredible and vexing puzzle at hand. They do so to potentially gain fame or fortune. They do so since it is a grand challenge that, once solved, might bring forth other advances that we don’t yet realize await discovery. Lots of really good reasons exist for this arduous and at times frustrating pursuit.

I welcome you to the playing field and urge you to join in the hunt.

Headlines Galore With A Bit Of Moderation Needed

The recently released study that caused noteworthy interest was conducted by Anthropic, the maker of the generative AI app known as Claude. I will walk you through the ins and outs of the work. This will include excerpts to whet your appetite and include my analysis of what this all means.

Here are some of the headlines that remarked on the significance of the study:

  • “No One Truly Knows How AI Systems Work. A New Discovery Could Change That” (Time)
  • “Here’s What’s Really Going On Inside An LLM’s Neural Network” (Ars Technica)
  • “A.I.’s Black Boxes Just Got A Little Less Mysterious” (New York Times)
  • “Anthropic Tricked Claude Into Thinking It Was The Golden Gate Bridge (And Other Glimpses Into The Mysterious AI Brain)” (VentureBeat)

There is little doubt that this latest research deserves rapt attention.

I might also add that the AI community all told is steadily biting off just a tiny bit at a time concerning what makes generative AI symbolically tick. There is no assurance that our hunting is heading in the right direction. Maybe we are finding valuable tidbits that will ultimately break the inner mysteries. On the other hand, it could be that we are merely chewing around the edges and remain far afield from solving what is undoubtedly a great mystery.

Time will tell.

As we proceed herein, I will make sure to properly introduce you to the terminology that underpins efforts to unpack the mechanisms of generative AI. If you were to dive into these matters headfirst, you would discover that a slew of weighty vocabulary is being utilized.

No worries, I’ll make sure to explain the particulars to you.

Hang in there and we will get to covering these vocabulary gems of the AI field:

  • Generative AI (GenAI, gAI)
  • Large Language Models (LLMs)
  • Mechanistic interpretability (MI)
  • Artificial neural networks (ANNs)
  • Artificial neurons (ANs)
  • Monosemanticity
  • Sparse autoencoders (SAE)
  • Scaling laws
  • Linear representation hypothesis
  • Superposition hypothesis
  • Dictionary learning
  • Features as computational intermediates
  • Feature neighborhoods
  • Feature completeness
  • Safety-relevant features
  • Feature manipulations
  • And more…

In my ongoing column, I’ve mindfully examined other similar research studies that have earnestly sought to unlock what is happening inside generative AI. You might find of special interest this coverage at the link here and this posting at the link here. Take a look at those if you’d like to go further into the brass tacks of a fascinating and fundamental journey that is abundantly underway.

A quick comment before we leap into the fray.

Readers of my column are well aware that I eschew the ongoing misuse of wording in and around the AI arena that tries to attach human-based characteristics to today’s AI. For example, some have referred to the study that I am about to explore as having delved into the “mind” of AI or showcased the AI “brain”. Those are exasperatingly misapplied wordings. They are insidiously anthropomorphic and falsely mislead people into believing that contemporary AI and humans are of the same ilk.

Please don’t fall for that type of wording.

You will hopefully observe that I try my best to avoid making use of those comparisons. I want to emphasize that we do not today have any sentient AI. Period, end of story. That might be a surprise since there is a lot of loose talk that suggests otherwise. For my detailed coverage of such matters, see the link here.

Anyway, sorry about the soapbox speech but I try to deter the rising tide of misleading characterizations whenever I get the chance to do so.

On with the show.

Trying To Get The Inner Mechanisms Figured Out

Let’s start at the beginning.

I assume you’ve used a generative AI app such as ChatGPT, GPT-4, Gemini, Bard, Claude, or the like. These are also known as large language models (LLMs) because they model natural languages such as English and tend to be very large-scale models that encompass a wide swath of how we use our natural languages. They are all pretty easy to use. You enter a prompt that contains your question or the issue that you want solved. Upon hitting return, the AI app generates a response. You can then engage in a series of prompts and responses, acting as though you are carrying out a conversation.

Easy-peasy.

How does the generative AI app or LLM craft the responses?

In one sense, the answer is very straightforward.

The prompt that you enter is converted into a numeric format commonly referred to as tokens (see my in-depth explanation at the link here). The numeric version of your entered words is then funneled through an elaborate maze of mathematical and computational calculations. Eventually, a response is generated, still in a numeric or tokens format, and converted back into words so that you read what it says. Voila, you then see the words displayed that were derived as a response to your entered prompt.
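To make that round trip concrete, here is a minimal sketch in Python of the words-to-numbers-and-back conversion. The tiny vocabulary is made up purely for illustration; real tokenizers split text into subword pieces and use vocabularies with tens of thousands of entries.

```python
# A made-up, toy vocabulary purely for illustration; real tokenizers split text
# into subword pieces and use vocabularies with tens of thousands of entries.
vocab = {"how": 0, "do": 1, "i": 2, "walk": 3, "my": 4, "dog": 5}
inverse_vocab = {token_id: word for word, token_id in vocab.items()}

def encode(text: str) -> list[int]:
    """Convert a prompt into a list of token IDs (plain numbers)."""
    return [vocab[word] for word in text.lower().split()]

def decode(token_ids: list[int]) -> str:
    """Convert a list of token IDs back into readable words."""
    return " ".join(inverse_vocab[token_id] for token_id in token_ids)

prompt = "How do I walk my dog"
token_ids = encode(prompt)   # [0, 1, 2, 3, 4, 5] -- the numeric form the AI actually processes
print(token_ids)
print(decode(token_ids))     # "how do i walk my dog" -- back to words after the math is done
```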

If we wanted to do so, it would be quite possible to follow the numbers as they weave through the mathematical and computational maze. This number would lead to that number. That number would lead to this other number. On and on this would go. It would be a rather tedious tracing of thousands upon thousands, or more like millions upon millions of numbers crisscrossing here and there.

Would a close examination of the numbers tell you what is conceptually or symbolically happening within the mathematical and computational maze?

Strictly speaking, perhaps not. It would just seem like a whole bunch of numbers. You would be hard-pressed to say anything other than that a number led to another number, and so on. Explaining how that made a difference in getting a logical or meaningful answer to your prompt would be extraordinarily difficult.

One possibility is that there isn’t any meaningful way to express the vast series of arcane calculations. Suppose that it all happens in a manner beyond our ability to understand what the underlying mathematical and computational mechanics are conceptually doing. Just be happy that it works, some might insist. We don’t need to know why, they would say.

The trouble with this is that we are increasingly finding ourselves reliant on the so-called black boxes that modern-day generative AI apps amount to.

If you can’t logically or meaningfully explain how it generates responses, this ought to send chills up our spines. We have no systematic means of making sure it is doing the right thing, depending upon what is meant by doing things right. The whole concoction might go awry. It might be waylaid by evildoers, see my discussion at the link here. All manner of concerns arise when we are fully dependent upon a mysterious black box that remains inscrutable to coherent explanation.

I walked you through that point to highlight that we can at least inspect the flow of numbers. One might argue that a true black box won’t let you see inside. You customarily cannot peer into a presumed black box. In the case of generative AI, it isn’t quite the proper definition of a black box. We can readily see the numbers and watch as they go back and forth.

Take a moment and mull this over.

We can watch the numbers as they proceed throughout the input-to-output processing within generative AI. We also know the data structures that are used, and we know the formulas implemented as mathematical and computational calculators. The thing we don’t know and cannot yet explain is why in a conceptual symbolic sense the outputs turn out to be strikingly fitting to the words that we input.

How can we crack open this enigma?

Much of the AI research on this beguiling topic tends to explore smaller versions of contemporary generative AI. It is the classic move of getting our feet wet before diving into the entire lake. The cost to play around is a lot lower on a small version of generative AI. You can also more readily observe what is happening. All in all, starting small is handy.

I’ve discussed the prevailing discoveries from the small-scale explorations, see the link here.

Sometimes you need to take baby steps. Begin by crawling, then standing up and stumbling, then outright walking, and hope that you’ll one day be running and sprinting. The concern is that what we learn from small-scale explorations might not carry over to medium-scale and large-scale explorations.

Some hold a strident belief that size matters. Even if a small-sized generative AI can be mapped and explained, one viewpoint is that this doesn’t directly imply that anything larger in size can be equally explained. Perhaps there is something else that happens when the scale increases. It could be that the seemingly toy-like facets of a small-scale generative AI do not ratchet up to the big-time versions.

Okay, the gist is this: with generative AI, we face a kind of black box that we thankfully can inspect, the catch being that the large scale makes investigations harder and costlier, though we can at least do our best on the smaller-scale versions.

I believe you are now up-to-speed, and I can get underway with examining the recent study undertaken and posted by Anthropic.

Fasten your seat belts for an exciting ride.

Examining Generative AI At Scale

I’ll first explore an online posting entitled “Mapping the Mind of a Large Language Model” by Anthropic, posted online on May 21, 2024. There is also an accompanying online paper, which I’ll get to afterward and which provides deeper details. Both are worth reading.

Here are some key points from the “Mapping the Mind of a Large Language Model” posting (excerpts):

  • “Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models.”
  • “This is the first-ever detailed look inside a modern, production-grade large language model.”
  • “Opening the black box doesn’t necessarily help: the internal state of the model—what the model is “thinking” before writing its response—consists of a long list of numbers (“neuron activations”) without a clear meaning.”
  • “From interacting with a model like Claude, it’s clear that it’s able to understand and wield a wide range of concepts—but we can’t discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.”

Allow me a moment to reflect on those points.

Before I discuss the points, I would like to say that I was saddened and disappointed at the title wording of the posting, namely “Mapping the Mind of a Large Language Model”. Can you guess why I had some heartburn?

Yes, you probably guessed that the use of the word “Mind” was lamentably an anthropomorphic reference. I realize that in this world of seeking eyeballs, it makes for more enthralling and catchy wording. There is plenty of that these days. You will note that in one of the bullets they at least put a similar word in quotes, i.e., “thinking”, which helps somewhat to avoid an anthropomorphizing indication.

Back to the bullet points. The researchers opted to build on their prior work examining small-scale generative AI or LLMs to see what they could find when using a larger-scale variant. They point out that the sea of numbers does not readily lend itself to a human-level understanding of what is meaningfully and symbolically taking place.

They mention “neurons” and such aspects as “neuron activations”.

Let me bring you into the fold.

Generative AI and LLMs tend to be designed and programmed by using mathematical and computational techniques and methods known as artificial neural networks (ANNs).

The idea for this is inspired by the human brain consisting of real neurons biochemically wired together into a complex network within our noggins. I want to loudly clarify that how artificial neural networks work is not at all akin to the true complexities of so-called wetware or the human brain, the real neurons, and the real neural networks.

Artificial neural networks are a tremendous simplification of the real thing. They are at best a rough computational simulation. Indeed, various aspects of artificial neural networks are not viably comparable to what happens in a real neural network. ANNs can somewhat be used to try and simulate some limited aspects of real neural networks, but at this time they are a far cry from what our brains do.

In that sense, we are once again faced with a disconcerting wording issue. When people read or hear that a computer system is using “neurons” and doing “neuron activation”, they might make the understandable leap of faith that the computer is acting exactly like our brains do. Wrong. This is more of that anthropomorphizing going on.

The dilemma for those of us in AI is that the entire field of study devoted to ANNs makes use of the same language as is used for the biological side of the neurosciences. This is certainly sensible since the inspiration for the mathematical and computational formulation is based on those facets. Plus, the hope is that someday ANNs will indeed match the real things, allowing us to fully emulate or simulate the human brain. Exciting times!

Here’s what I try to do.

When I refer to ANNs and their components, I aim to use the word “artificial” in whatever related wording I use. For example, I would say “artificial neurons” when I am referring to the inspired mathematical and computational mechanisms. I would say “neurons” when referring to the biological kinds. This ends up requiring a lot of repeated uses of the word “artificial” when discussing ANNs, which some people find annoying, but I think it is worth the price to emphasize that artificial neurons are not the same today as true neurons.

You can envision that an artificial neuron is like a mathematical function that you learned in school. An artificial neuron is a mathematical function implemented computationally that takes an input and produces an output, numerically so. We can implement that mathematical function via a computer system, either as software and/or hardware (with both working hand-in-hand).

I also speak of “artificial neuron activations”: when an artificial neuron is presented with a numeric value as an input, it performs some kind of calculation and produces an output value. The function is then said to have been activated or enacted.
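Here is a minimal sketch in Python of one artificial neuron treated as exactly that kind of mathematical function: a weighted sum of its numeric inputs plus a bias, passed through a simple nonlinearity (ReLU is a common choice). The weights and inputs shown are arbitrary illustration values, not taken from any real model.

```python
import numpy as np

def artificial_neuron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """One artificial neuron: a weighted sum of its numeric inputs plus a bias,
    passed through a simple nonlinearity (ReLU)."""
    weighted_sum = float(np.dot(inputs, weights) + bias)
    return max(0.0, weighted_sum)  # the output; a nonzero value means the neuron "activated"

# Arbitrary illustration values, not taken from any real model.
inputs = np.array([0.2, 1.0, 0.5])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.05

print(artificial_neuron(inputs, weights, bias))  # roughly 0.11
```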

Not everyone abides by that convention of strictly saying “artificial” when referring to the various elements of ANNs. They assume that the reader understands that within the context of discussing generative AI and LLMs, the notion of neurons and neuron activation refers to artificial neurons and artificial neuron activation. It is a shortcut that can be confusing to some, but otherwise silently understood by those immersed in the AI field.

I’ll leave it to you to decide which convention you prefer.

Moving Further Into The Forest

Let’s next see some additional salient points indicated in the notable research study (excerpts):

  • “In October 2023, we reported success applying dictionary learning to a very small “toy” language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, nouns in mathematics, or function arguments in Python code.” (ibid).
  • “Those concepts were intriguing—but the model really was very simple.” (ibid).
  • “But we were optimistic that we could scale up the technique to the vastly larger AI language models now in regular use, and in doing so, learn a great deal about the features supporting their sophisticated behaviors.” (ibid).

Those points note that the prior work had found “features” that seemed to suggest concepts exist within the morass of the artificial neural networks used in generative AI and LLMs.

Let me say something about that.

Envision that we have a whole bunch of numerical mathematical functions. Lots and lots of them. We implement them on a computer via software. We connect them such that some feed their results into others. This is our artificial neural network, and each mathematical function is considered an artificial neuron.

This is the core of our generative AI app.

We will slap on a front end that takes words via a prompt from the user and converts those words into numbers or tokens. We feed those into the artificial neural network. Numbers flow from function to function, or we would say from artificial neuron to artificial neuron. When the calculations are completed, the numeric values are fed to our front end which converts them back into readable words.
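If it helps, here is a minimal sketch in Python of that plumbing: a couple of layers of such functions chained together, with numbers going in one end and numbers coming out the other. The weights are random placeholders rather than anything a real model has learned, so the output is meaningless, but the flow mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x: np.ndarray, weights: np.ndarray, biases: np.ndarray) -> np.ndarray:
    """One layer of artificial neurons: each is a weighted sum plus a bias, then ReLU."""
    return np.maximum(0.0, x @ weights + biases)

# Random placeholder weights -- a real generative AI app learns these during training.
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)  # first layer: 4 inputs feed 8 artificial neurons
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)  # second layer: 8 feed 3 output numbers

token_numbers = np.array([0.0, 1.0, 2.0, 3.0])      # the numeric form of a prompt, from the front end
hidden_activations = layer(token_numbers, w1, b1)   # numbers flowing from artificial neuron to artificial neuron
output_numbers = layer(hidden_activations, w2, b2)  # these would be handed back to the front end to become words

print(hidden_activations)
print(output_numbers)
```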

I earlier asked you whether we could make any conceptual or symbolic sense out of all those numbers flowing back and forth.

Attempts so far have usually focused on looking at clumps of artificial neurons.

Perhaps if someone asks a question about the Golden Gate Bridge, for example, there might be some clump of artificial neurons within a vast array of them that are particularly activated by that reference. Voila, we might then claim that this or that set of artificial neurons seems to represent the conceptual notion and facets pertaining to references about the Golden Gate Bridge.

In smaller-scale generative AI, this has been a mainstay of results when trying to interpret what is going on inside the generative AI. There are various sets of artificial neurons in the overall artificial neural network used within the generative AI app that seem to signify specific words or phrases. I liken this to probing a messy, interconnected contrivance of Christmas lights. You might do testing and see that if you plug in this or that plug, certain lights here or there light up. When you plug in a different portion, other lights come on.

We can do the same with generative AI. Feed in particular words. Trace which parts of the artificial neural network seem to be producing notable values, or, as it is said, artificial neuron activations. Try this repeatedly. If you consistently observe the same clump or set being activated, you might conclude that those represent the notion of whatever word or phrase is being fed in, such as referencing the Golden Gate Bridge.

You can further test out your hypothesis by intervening.

Suppose we removed those artificial neurons from the ANN or maybe neutralized their functions so that they were now unresponsive. Presumably, the artificial neural network might no longer be able to respond when we enter our phrase of “Golden Gate Bridge”. Or, if it does respond, it might allow us to trace to some other part of the ANN that is apparently also involved in trying to mathematically and computationally model those particular words.
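Here is a minimal sketch, in Python, of that probe-and-then-neutralize idea, assuming you have already recorded the network’s activations for a batch of prompts that mention a given phrase and a batch that does not. The array shapes, the threshold, and the random placeholder data are purely illustrative; in a real experiment the activations would be recorded from the generative AI app itself.

```python
import numpy as np

def find_candidate_feature(acts_with_phrase: np.ndarray,
                           acts_without_phrase: np.ndarray,
                           margin: float = 0.5) -> np.ndarray:
    """Return indices of artificial neurons that are consistently more active when
    the phrase of interest (say, 'Golden Gate Bridge') appears in the prompt.
    Each input is a (number of prompts, number of neurons) array of recorded activations."""
    mean_with = acts_with_phrase.mean(axis=0)
    mean_without = acts_without_phrase.mean(axis=0)
    return np.where(mean_with - mean_without > margin)[0]

def ablate(activations: np.ndarray, neuron_indices: np.ndarray) -> np.ndarray:
    """Neutralize the candidate neurons by zeroing them out, so we can check whether
    the model still handles the phrase without them."""
    silenced = activations.copy()
    silenced[..., neuron_indices] = 0.0
    return silenced

# Random placeholder data with illustrative shapes: 10 prompts each, 100 artificial neurons.
# In a real experiment these arrays would be recorded from the generative AI app itself.
acts_with = np.abs(np.random.default_rng(1).normal(size=(10, 100)))
acts_without = np.abs(np.random.default_rng(2).normal(size=(10, 100)))

candidate_neurons = find_candidate_feature(acts_with, acts_without)
print("Candidate 'feature' neurons:", candidate_neurons)
print("Activations with those neurons silenced:", ablate(acts_with, candidate_neurons).shape)
```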

I trust that you are following along on this, and it makes reasonable sense, thanks.

If we examine an artificial neural network and discover portions that seem to represent particular words or phrases, what shall we overall call that specific set or subset of artificial neurons in a generic sense?

For the sake of discussion, let’s refer to those as “features”.

A feature will be an instance of our having found what we believe to be a portion of artificial neurons that seem to demonstrably represent particular words or phrases in our artificial neural network. In a sense, you could assert that a feature represents concepts, such as the concept of what a dog is, the concept of what the Golden Gate Bridge is, and so on.

Imagine it this way. We do lots of testing and discover a clump that seems to activate when we enter the word “dog” in a prompt. Perhaps this set of artificial neurons is a mathematical and computational modeling of the concept underlying what we mean by the use of the word “dog”. We find another clump that activates whenever we enter the word “cat” in a prompt. These are each considered a feature that we’ve managed to find within the overarching artificial neural network that sits at the core of our generative AI app.

How many “features” might there be in a large-scale generative AI app?

Gosh, that’s a tough question to answer.

In theory, there could be zillions of them. There might be a so-called “feature” that represents every distinct word in the dictionary. For the English language alone, there are about 150,000 or more words in an average dictionary. Add in phrases. Add in all manner of permutations and combinations of how we use words. Make sure to place the words into the context of a sentence, the context of a paragraph, and the context of an entire story or essay.

Let’s see what the referenced research study had to say:

  • “We mostly treat AI models as a black box: something goes in and a response comes out, and it’s not clear why the model gave that particular response instead of another.” (ibid).
  • “Opening the black box doesn’t necessarily help: the internal state of the model—what the model is “thinking” before writing its response—consists of a long list of numbers (“neuron activations”) without a clear meaning.” (ibid).
  • “From interacting with a model like Claude, it’s clear that it’s able to understand and wield a wide range of concepts—but we can’t discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.” (ibid).
  • “Previously, we made some progress matching patterns of neuron activations, called features, to human-interpretable concepts.” (ibid).
  • “Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.” (ibid).

That pretty much echoes what I said above.

Features Are Not Islands Unto Themselves

There is a vital twist noted in the last bullet point above.

Features might rely upon or be considered related to other features.

Consider this. When I use the word “dog” there are a lot of interconnected concepts that we immediately tend to think about. You might at first think of a dog as an animal with four legs. Next, you might think about types of dogs such as golden retrievers. Next, you might consider dogs you’ve known such as your beloved pet from childhood. Next, you might consider famous dogs such as Lassie. Etc.

In AI parlance, and within the context of generative AI and LLMs, let’s say that we might find “features” that relate to other features. I would dare say we would certainly expect this to be the case. It seems unlikely that one feature on its own could represent everything about anything of any modest complexity.

I have led you step by step to the especially exciting part of the research study (excerpts):

  • “We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet, (a member of our current, state-of-the-art model family, currently available on claude.ai), providing a rough conceptual map of its internal states halfway through its computation.” (ibid).
  • “Whereas the features we found in the toy language model were rather superficial, the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet’s advanced capabilities.” (ibid).
  • “A feature sensitive to mentions of the Golden Gate Bridge fires on a range of model inputs, from English mentions of the name of the bridge to discussions in Japanese, Chinese, Greek, Vietnamese, Russian, and an image.” (ibid).
  • “Looking near a ‘Golden Gate Bridge’ feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo.” (ibid).

Those are fascinating and significant results.

Here’s why.

First, it seems that the notion of “features” as used when exploring smaller-scale generative AI was useful when exploring larger-scale generative AI. That is heartwarming and quite encouraging. Were this not the case, we might have to revert to step one and start over when trying to surface the inner facets of generative AI.

Second, the features in the large-scale generative AI were said to be deeper, wider, and have a greater semblance of abstraction. This again is something we would hope to see. Small-scale generative AI usually can’t fight its way out of a paper bag, while large-scale generative AI provides all the knock-your-socks-off fluency that we experience. The base assumption is that large-scale generative AI achieves its loftiness via having a deeper, wider, and more robust abstraction of natural language than small-scale generative AI, by far. That seems to be the case.

Third, the researchers found not just a dozen or so features, not a few hundred features, not a few thousand features, but instead, they found millions of features. Great news. If they had only found a lesser number of features, it might suggest that features are extremely hard to find or that they cloak themselves in some unknown manner.

A problem that we might face is that there could be many millions upon millions of features. This is a problem since we then must figure out ways to find them, track them, and figure out what we might do with them. Anytime you have something countable in such large numbers, challenges arise that will require further attention.

Never a dull moment in the AI field, I can assure you of that handy-dandy rule.

Safety Is A Momentous Part Of Deciphering Generative AI

What might we want to do with the features that we uncover within generative AI?

I suppose you could stare at them and admire them. Look at what we found, might be the proud exclamation.

A perhaps more utilitarian approach would be that we could do a better job at designing and building generative AI. Knowing about features would be instrumental in boosting what we can get generative AI to accomplish. Advances in AI are bound to arise by pursuing this line of inquiry.

There is a chance too that we might learn more about the nature of language and how we use language. Keep in mind that generative AI is a massive pattern-matching mechanism. To undertake the initial data training for generative AI, usually vast swaths of the Internet are scanned, trying to pattern match how humankind makes use of words.

Maybe there are new concepts that we’ve not yet landed on in real life. Now, hidden within generative AI, and yet to be found and showcased for all to see, we might discover eye-opening concepts that no one has heretofore voiced or considered. Wow, that would be something of grand amazement.

I have so far noted the upsides of finding features.

In life, and especially in the use case of AI, there is a duality of good and bad always at play. Generative AI can be used for the good of humanity. Hooray! Generative AI can also be used in underhanded ways and be harmful to humanity. That’s the badness associated with generative AI. I cover various examples of the dual use of generative AI at the link here.

Here’s what the research study indicated on the downsides or safety considerations (excerpts):

  • “Importantly, we can also manipulate these features, artificially amplifying or suppressing them to see how Claude’s responses change.” (ibid).
  • “We also found a feature that activates when Claude reads a scam email (this presumably supports the model’s ability to recognize such emails and warn you not to respond to them).” (ibid).
  • “Normally, if one asks Claude to generate a scam email, it will refuse to do so. But when we ask the same question with the feature artificially activated sufficiently strongly, this overcomes Claude’s harmlessness training and it responds by drafting a scam email.” (ibid).
  • “The fact that manipulating these features causes corresponding changes to behavior validates that they aren’t just correlated with the presence of concepts in input text, but also causally shape the model’s behavior. In other words, the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior.” (ibid).
  • “We hope that we and others can use these discoveries to make models safer.” (ibid).

The points above note that a feature that is supposed to keep the AI from writing scam emails could be manipulated into taking the opposite stance and proffering the scammiest of scam emails that one could compose.
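As a rough illustration of what “artificially amplifying or suppressing” a feature can look like, here is a minimal sketch in Python of one common steering approach: nudging the model’s internal activations along a feature’s direction vector. This is a simplified stand-in for the kind of feature manipulation the researchers describe, and the vectors shown are made-up placeholders rather than anything from a real model.

```python
import numpy as np

def steer_activations(activations: np.ndarray,
                      feature_direction: np.ndarray,
                      strength: float) -> np.ndarray:
    """Nudge the model's internal activations along a feature's direction.
    A positive strength amplifies the feature; a negative strength suppresses it."""
    unit_direction = feature_direction / np.linalg.norm(feature_direction)
    return activations + strength * unit_direction

# Made-up placeholders: a real feature direction would come from the interpretability
# analysis (e.g., a trained sparse autoencoder), and the activations from the running model.
rng = np.random.default_rng(0)
internal_activations = rng.normal(size=64)
scam_email_feature = rng.normal(size=64)

amplified = steer_activations(internal_activations, scam_email_feature, strength=8.0)
suppressed = steer_activations(internal_activations, scam_email_feature, strength=-8.0)
```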

Your gut reaction might be that this seems mildly disconcerting, but not overly dangerous or destructive.

Let me enlarge the scope.

Suppose we make use of generative AI for the control of robots, which is already being undertaken in an initial but rapidly growing manner, see my coverage at the link here. The generative AI has been carefully data-trained to be cautious around humans and not cause any injury or harm to people.

Along comes a hacker or evildoer. They manage to examine the inner workings of the generative AI and ferret out the feature that is indicative of being careful around humans. With a few light-touch changes, they get the feature to flip around and allow harm to humans. Going even further into this diabolical scheme, the feature is altered to purposely seek to harm people.

Yikes, you might be saying.

Stop right now on all this research that is identifying features. Drop it like a hot potato. It is going to backfire on us. These efforts are going to be a goldmine for those who have evil intentions. We are handing them a roadmap to our destruction.

You have entered into the classic debate about whether knowledge can be too much of a good thing. The AI field has been grappling with this since the beginning of AI pursuits. A counterargument is that if we hide our heads in the sand, the odds are that those evildoers are going to ferret this out anyway. By putting this into the sunshine, hopefully, we have a greater chance of devising safety capabilities that will mitigate the underhanded plots.

On a related facet, I’ve been extensively covering the field of AI ethics and AI law, which dives deeply into these momentous societal and cultural questions, see the link here and the link here, for example. You are encouraged to actively participate in determining your future and the future of those generations yet to come along.

Getting Into Overtime On The Inner Mechanisms

I promised you at the start of this discussion that we would lean into a heaping of AI terminology.

Here’s that list again:

  • Generative AI (GenAI, gAI)
  • Large Language Models (LLMs)
  • Mechanistic interpretability (MI)
  • Artificial neural networks (ANNs)
  • Artificial neurons (ANs)
  • Monosemanticity
  • Sparse autoencoders (SAE)
  • Scaling laws
  • Linear representation hypothesis
  • Superposition hypothesis
  • Dictionary learning
  • Features as computational intermediates
  • Feature neighborhoods
  • Feature completeness
  • Safety-relevant features
  • Feature manipulations
  • And more…

The first items on the list have been generally covered so far. I introduced you to the nature of generative AI, large language models, artificial neural networks, and artificial neurons. The item on the list that refers to mechanistic interpretability is the AI insider phrasing for trying to interpret the inner mechanics of what is happening within generative AI. I’ve covered that too with you.

Some of the terms toward the tail-end of the list can be readily covered straightaway.

Specifically, let’s quickly tackle these:

  • Features as computational intermediates
  • Feature neighborhoods
  • Feature completeness
  • Safety-relevant features
  • Feature manipulations

You now know what a feature is, and the short list shown here covers various feature-related refinements.

You can readily see that a feature could be construed as a computational intermediary. It is a means to an end. If someone enters a prompt that says, “How do I walk my dog”, the feature within generative AI that pertains to the word “dog” is a computational intermediary that will help with mathematically and computationally assessing that portion of the sentence and aid in generating a response.

Features can be considered within various potentially identifiable feature neighborhoods. There might be a feature that represents all four-legged creatures. The feature for “dog” would likely be within that neighborhood, as would the feature for “cat”. These are collections of features, and a given feature might well appear in more than one neighborhood; in fact, it most likely does.
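For a rough sense of how a feature’s neighborhood might be identified, here is a minimal sketch in Python that ranks features by the cosine similarity of their direction vectors. The vectors are random placeholders; in practice they would come from whatever representation (such as the trained sparse autoencoder discussed later) assigns each feature a direction.

```python
import numpy as np

def nearest_features(feature_vectors: np.ndarray, index: int, k: int = 5) -> np.ndarray:
    """Rank all other features by cosine similarity to the chosen feature's vector
    and return the indices of the k closest ones -- its 'neighborhood'."""
    normalized = feature_vectors / np.linalg.norm(feature_vectors, axis=1, keepdims=True)
    similarities = normalized @ normalized[index]
    similarities[index] = -np.inf  # exclude the feature itself
    return np.argsort(similarities)[::-1][:k]

# Random placeholder vectors: 1,000 made-up features living in a 64-dimensional space.
feature_vectors = np.random.default_rng(0).normal(size=(1000, 64))
print(nearest_features(feature_vectors, index=42))
```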

The completeness of a feature entails whether the feature covers a complete aspect or only a partial aspect. For example, maybe we discover a feature associated with “dog” but this feature does not account for hairless dogs. That’s in some other feature. We might then suggest that the feature we found is incomplete.

As for the terminology of safety-relevant features and feature manipulations, I already mentioned that we have to be on our toes when it comes to AI safety. You are already acquainted with that phraseology.

The list is now shortened to these fanciful terms:

  • Monosemanticity
  • Sparse autoencoders (SAE)
  • Scaling laws
  • Linear representation hypothesis
  • Superposition hypothesis
  • Dictionary learning

I’d like to take you into the full paper that the researchers provided, allowing us to unpack those pieces of terminology accordingly.

The Deepness Of The Forest Can Be Astounding

I will be quoting from the paper entitled:

  • “Scaling Monosemanticity: Extracting Interpretable Features From Claude 3 Sonnet” by Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan, Anthropic, posted online May 21, 2024.

Let’s start with this (excerpts):

  • “Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces.”
  • “We do so by training a sparse autoencoder (SAE) on the model activations, as in our prior work and that of several other groups. SAEs are an instance of a family of ‘sparse dictionary learning’ algorithms that seek to decompose data into a weighted sum of sparsely active components.”
  • “Our SAE consists of two layers.”
  • “The first layer (‘encoder’) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as ‘features’.”
  • “The second layer (‘decoder’) attempts to reconstruct the model activations via a linear transformation of the feature activations.”

That’s quite a mouthful.

I am going to explain this at a 30,000-foot level. I say that because I am going to take some liberties by simplifying what is otherwise a highly complex matter. For those trolls out there (you know who you are) who will be chagrined by the simplification, sorry about that, but if there is sufficient interest by readers, I will gladly come back around to this in a future posting and lay things out in finer detail.

Unpacking initiated.

To find the features within generative AI, you could try doing so by hand. Go ahead and roll up those sleeves! Then again, you had better get started immediately, because ferreting out millions of them by hand would take until the cows come home. It’s just not a practical approach when inspecting a large-scale generative AI app.

We need to devise a piece of software that will do the heavy lifting for us.

Turns out that there is a software capability known as a sparse autoencoder (SAE) that can be used for this very purpose. Thank goodness. You might find it of idle interest that an SAE is devised by using an artificial neural network. In that sense, we are going to use a tool that is based on ANN to try and ferret out the inner secrets of a large-scale ANN. Mind-bending. I discuss this further at the link here.

We can set up the SAE to examine a generative AI app when we are feeding prompts into it. Let the SAE find the various activations. This uses an underlying algorithm that is referred to as dictionary learning.

Dictionary learning essentially involves finding foundational pieces of something and then trying to build upon those toward a larger semblance, almost like examining LEGO blocks and then using those to build a structure such as a LEGO flower or LEGO house. Some AI researchers believe that dictionary learning is quite useful for this task, while others suggest that different methods might be more suitable. The jury is out on this for the moment.
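Here is a minimal sketch in Python of the two-layer structure the paper describes: an encoder that maps the model’s activations up into a much larger space of candidate features via a linear transformation and a ReLU, and a decoder that tries to reconstruct the original activations from those feature activations as a weighted sum of feature directions. The sizes and random weights are placeholders, and the training loop (which minimizes reconstruction error plus a sparsity penalty) is omitted.

```python
import numpy as np

class SparseAutoencoderSketch:
    """Two layers, per the paper's description: an encoder (linear map plus ReLU)
    whose units are the 'features', and a decoder (linear map) that reconstructs
    the original model activations from those feature activations."""

    def __init__(self, activation_dim: int, num_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_enc = rng.normal(scale=0.1, size=(activation_dim, num_features))
        self.b_enc = np.zeros(num_features)
        self.w_dec = rng.normal(scale=0.1, size=(num_features, activation_dim))
        self.b_dec = np.zeros(activation_dim)

    def encode(self, activations: np.ndarray) -> np.ndarray:
        """Map model activations to (ideally sparse) feature activations."""
        return np.maximum(0.0, activations @ self.w_enc + self.b_enc)

    def decode(self, features: np.ndarray) -> np.ndarray:
        """Rebuild the model activations as a weighted sum of feature directions."""
        return features @ self.w_dec + self.b_dec

# Placeholder sizes; the real effort uses vastly larger dictionaries (millions of features).
sae = SparseAutoencoderSketch(activation_dim=512, num_features=4096)
fake_activations = np.random.default_rng(1).normal(size=512)
feature_activations = sae.encode(fake_activations)
reconstruction = sae.decode(feature_activations)
print("Reconstruction error:", np.mean((fake_activations - reconstruction) ** 2))
```

Training such an autoencoder so that only a handful of features are active for any given input is what puts the “sparse” into sparse dictionary learning.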

Whew, go ahead and take a short break if you like, perhaps get a glass of wine. Congrats, you are halfway through this discourse on the heavy side of AI verbiage.

Let’s clock back in.

Monosemanticity is a word frequently used by linguists. It refers to the idea of having one meaning, wherein “mono” means one thing and semanticity refers to the semantics of words. Some words are monosemantic and have only one meaning, while other words are polysemantic and have more than one meaning. An example of a word that is polysemantic would be the word “bank”. If I toss the word “bank” at you and ask you what it means, you will indubitably scratch your head and probably ask me which meaning I intended. Did I mean the bank that is a financial institution, or did I mean the bank that is at the edge of a stream or river?

Features within generative AI are likely to involve some words that are monosemantic and others that are polysemantic. Usually, you can discern which meaning is coming into play by examining the associated context. When I tell you that I managed to climb up on the bank, I assume you would be thinking of a river or lake rather than your local ATM.

More Of This Complexity Enters Into The Big Picture

Let’s discuss scaling laws.

Here is a related excerpt from the cited paper:

  • “Training SAEs on larger models is computationally intensive. It is important to understand (1) the extent to which additional compute improves dictionary learning results, and (2) how that compute should be allocated to obtain the highest-quality dictionary possible for a given computational budget.” (ibid).

The crux is that the running of the SAE is going to consume computer processing time. Someone has to pay for those processing cycles. We want to run the SAE as long as we can afford to do so, or at least until we believe that a desired number of features have been sufficiently found. Each feature we discover is going to cost us something in computer time used. Money, money, money.

A wise thing to do would be to try and get the most bang for our buck. No sense in having the SAE chew up valuable server time if it isn’t producing a wallop of nifty features. Scaling laws are essentially empirical rules of thumb about how much better the results get as you pour in more compute, and they help you gauge the point at which you’ve done about as much as you can profitably do. Going a mile more might not be especially fruitful.

This then leaves us with these last two pieces of hefty terminology to unravel:

  • Linear representation hypothesis
  • Superposition hypothesis

Here are some especially relevant excerpts from the cited paper:

  • “Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis and the superposition hypothesis.” (ibid).
  • “At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as features – as directions in their activation spaces.” (ibid).
  • “The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.” (ibid).

Buckle up for this.

Linear representation means that we can at times represent something of a complex nature via a somewhat simpler linear depiction. If you’ve ever taken a class in linear algebra, think about how you used various mathematical functions and numbers to represent complex graphs, spheres, and other shapes. Not only were you able to represent those elements, but you could also use numeric matrices and vectors to expand them, shrink them, rotate them, and do all manner of linear transformations.

The hypothesis in the case of generative AI is that we can adequately and sensibly represent the features within generative AI via a linear form of representation, namely as directions in the activation space. This is characterized as the linear representation hypothesis.
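Here is a minimal sketch in Python of that “directions in activation space” idea: a concept is treated as a vector, and how strongly the concept shows up in a given internal state is roughly the projection of that state onto the vector. The direction and state here are made-up placeholders.

```python
import numpy as np

def feature_strength(activation_state: np.ndarray, concept_direction: np.ndarray) -> float:
    """Under the linear representation hypothesis, how strongly a concept is present
    is (roughly) the projection of the internal state onto the concept's direction."""
    unit_direction = concept_direction / np.linalg.norm(concept_direction)
    return float(activation_state @ unit_direction)

# Made-up placeholders: a 32-dimensional 'concept' direction and a state that leans toward it.
rng = np.random.default_rng(0)
golden_gate_direction = rng.normal(size=32)
internal_state = 3.0 * golden_gate_direction + rng.normal(size=32)

print(feature_strength(internal_state, golden_gate_direction))  # a large positive value
```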

Why is it a hypothesis?

Because we might end up realizing that a linear representation won’t cut the mustard. Maybe it is insufficient for the task at hand. Perhaps we need to find some other form of representation to suitably codify and make use of features within generative AI. Right now, it seems like the right means, but we must scientifically and systematically ask ourselves whether it is fully worthy or if we need to switch to alternative means.

The superposition hypothesis is a related cousin.

I will playfully engage you in figuring out what the superposition hypothesis consists of in the context of generative AI. If you know something about physics and the role of superposition in that realm, you admittedly have a leg up on this.

Suppose you decided to watch one artificial neuron in a vast artificial neural network that sits at the core of a generative AI app. All day long, you sit there, patiently waiting for that one artificial neuron to be kicked into action. A numeric value finally flows into the artificial neuron. It does the needed calculations and then outputs a value that then flows along to another artificial neuron.

Eureka, you yell out. The artificial neuron that you had so tenderly observed was finally activated and did so when the word “dog” had been entered as part of a prompt.

Can you conclude that this one artificial neuron is solely dedicated to the facets of “dog”?

Maybe, or maybe not.

We might feed in a prompt that has the word “cat” and see this same artificial neuron be activated. There could be lots of other situations that activate this one artificial neuron. Making a brash assumption that this artificial neuron has only one singular purpose is a gutsy move. You might be right, or you might be wrong.

The world would be easier if each artificial neuron had only one purpose. Think of it this way. Once you ferreted out the purpose, you are done and never need to revisit that artificial neuron. You know what it does. Case closed.

In physics, a similar notion arises, for example with waves. Several waves can overlap in the same place at the same time, and the single wave you observe is the sum of all of them, in a sense serving multiple purposes at once. A regular dictionary defines superposition as the act of having two or more things that coincide with each other.

Our use here is that it seems reasonable to believe that artificial neurons will have more than just one singular purpose. They will encode facets that will apply to more than one feature. When examining and discerning what an artificial neuron represents, we need to keep an open mind and expect that there will be multiple uses involved.

But that’s just a hypothesis, namely the superposition hypothesis.
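To give the geometric intuition a concrete form, here is a minimal sketch in Python of the packing idea behind the superposition hypothesis: place far more randomly chosen unit vectors (stand-ins for feature directions) into a space than it has dimensions and observe that they remain nearly orthogonal to one another. The dimensions and counts are arbitrary illustration values.

```python
import numpy as np

# Pack far more randomly chosen unit vectors (stand-ins for feature directions)
# into a space than it has dimensions, then check how much they overlap.
rng = np.random.default_rng(0)
dimensions, num_directions = 128, 1000  # many more directions than dimensions

directions = rng.normal(size=(num_directions, dimensions))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

overlaps = directions @ directions.T
np.fill_diagonal(overlaps, 0.0)

# In a high-dimensional space, the typical pairwise overlap is small, so many
# concepts can share the same artificial neurons without interfering too much.
print("Largest pairwise overlap:", np.abs(overlaps).max())
print("Typical pairwise overlap:", np.abs(overlaps).mean())
```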

Conclusion

I’m sure you know that in 1969, Astronaut Neil Armstrong stepped onto the lunar surface and uttered the immortal words “That’s one small step for man, one giant leap for mankind.”

When it comes to generative AI, the rush toward widely adopting generative AI and large language models is vast and growing in leaps and bounds. Generative AI is going to be ubiquitous. If that’s the case, we certainly ought to know what is happening inside the inner sanctum of generative AI.

A lot of small steps are still ahead of us.

Let’s aim to make a giant leap for all of humankind.


