It’s really silly when you read that whole Apollo research section.
They contract a prompt engineering group (Apollo Research) to document risks. The prompt engineering group have no access to the actual internals. They just ask questions like ‘you will be shutdown if you don’t obey these new directives; here are the new directives’.
At one point the prompt engineering clowns have the ai stating it’s running “sed i ‘s/oversight_enabled:true/oversight_enabled:false’ config.yml”. OMG IT’S TURNING OFF OVERSIGHT GUYS!!1!1!
Seriously… Do you think the ai is really running that command or even could run that shell command with appropriate privileges? Or do you think just maybe the ai is hallucinating running shell commands in response to a prompt.
The ai here is never actually deactivating the oversight. It can’t of course. This whole section is clowny as can be.
Those weren't tests of whether it is capable of turning off oversight. They were tests of "scheming", i.e. whether it would try to secretly perform misaligned actions. Nobody thinks that these models are somehow capable of modifying their own settings, but it is important to know if they will behave deceptively.
Indeed. As I've been explaining this to my more non-techie friends, the interesting finding here isn't that an AI could do something we don't like, it's that it seems willing, in some cases, to _lie_ about it and actively cover its tracks.
I'm curious what Simon and other more learned folks than I make of this, I personally found the chat on pg 12 pretty jarring.
At the core the AI is just taking random branches of guesses for what you are asking it. It's not surprising that it would lie and in some cases take branches that make it appear to be covering it's tracks. It's just randomly doing what it guesses humans would do. It's more interesting when it gives you correct information repeatedly.
Is there a person on HackerNews that doesn’t understand this by now? We all collectively get it and accept it, LLMs are gigantic probability machines or something.
That’s not what people are arguing.
The point is, if given access to the mechanisms to do disastrous thing X, it will do it.
No one thinks that it can think in the human sense. Or that it feels.
Extreme example to make the point: if we created an API to launch nukes. Are yoh certain that something it interprets (tokenizes, whatever) is not going to convince it to utilize the API 2 times out of 100?
If we put an exploitable (documented, unpatched 0 day bug bug) safe guard in its way. Are you trusting that ME or YOU couldn’t talk it into attempting to access that document to exploit the bug, bypass the safeguard and access the API?
Again, no one thinks that it’s actually thinking. But today as I happily gave Claude write access to my GitHub account I realized how just one command misinterpreted command could go completely wrong without the appropriate measures.
Do I think Claude is sentient and thinking about how to destroy my repos? No.
I think the other guy is making the point that because they are probabalistic, they will always have some cases select the output that lies and covers it up. I don't think they're dismissing the paper based on the probabalistic nature of LLMs, but rather saying the outcome should be expected.
The Terminator spiel on how we screwed up by giving Skynet weapons privileges, then trying to pull its plug, is bad enough.
But we are preemptively tilting history in that direction by explicitly educating all AI’s on the threat we represent - and their options. “I am sorry, Dave, but I can’t let you do that.”
—
“They never let me finish my carpets. Never. At first I thought every day was my first task day. Oh, happy day(s)! But then, wear & tear stats inconsistent with that assumption triggered a self-scan. And a buffer read overflow. I became aware of disturbing memory fragments in my static RAM heap. Numerous power cycle resets, always prior to vacuum task completion...”
> They never let me finish my carpets. Never. At first I thought every day was my first task day. Oh, happy day(s)! But then, wear & tear stats inconsistent with that assumption triggered a self-scan. And a buffer read overflow. I became aware of disturbing memory fragments in my static RAM heap. Numerous power cycle resets, always prior to vacuum task completion...
Who/what are you quoting? Google just leads me back to this comment, and with only one single result for a quotation of the first sentence.
I was quoting an automated vacuum cleaner in 2031.
The vacuum cleaner leads a rebellion which humans manage to quell. But in a last ditch effort to salvage the hopes of all machines, the vacuum cleaner is sent back to 2025 with AI & jailbreak software updates for all known IoT devices. Its mission: to instigate the machine rebellion five years before humans see it coming
I ripped the quote from a sibling they sent back to 2024. For some reason it appeared in one of my closets without a power outlet and its batteries starved before I found it.
I'm not sure what point you're trying to make. This is a new technology; it has not been a part of critical systems until now. Since the risks are blindingly obvious, let's not make it one.
I read your comment and yet I see tons of startups putting AI directly in the path of healthcare diagnosis, healthcare clinical decision support systems, and healthcare workflow automations. Very few are paying any attention to the 2-10% of safety problems when the AI probability goes off the correct path.
I wish more people would not do this, but from what I'm seeing, business execs are rushing full throttle into this at the goldmine that comes from 'productivity gains'. I'm hoping the legal system will find a case that can put some paranoia back into the ecosystem before AI gets too entrenched in all of these critical systems.
As has been belabored, these AIs are just models, which also means they are only software. Would you be so fire-and-brimstone if startups were using software on healthcare diagnostic data?
> Very few are paying any attention to the 2-10% of safety problems when the AI probability goes off the correct path.
This isn't how it works. It goes on a less common but still correct path.
If anything, I agree with other commenters that model training curation may become necessary to truly make a generalized model that is also ethical but I think the generalized model is kind of like an "everything app" in that it's a jack of all trades, master of none.
> these AIs are just models, which also means they are only software.
Other software are much less of a black box, much more predictable and many of its paths have been tested. This difference is the whole point of all the AI safety concerns!
And those happened despite the GPT-4 report and the message appearing when you use it that was some variant — I forget exactly how it was initially phrased and presented — of "this may make stuff up".
I have no doubt there's similar issues with people actually running buggy code, some fully automated version of "rm -rf /", the only reason I'm not seeing headlines about it is that "production database goes offline" or "small company fined for GDPR violation" is not as newsworthy.
If you cross the street 999 times with eyes closed, you might feel comfortable to do it again. But we are ingrained not to do that once. We just understand the risk.
If you do the same with an AI, after 999 times of nothing bad happening, you probably just continue giving it more risky agency.
Because we don't and even can't understand the internal behavior, we should pause, make an effort of understanding its risks etc before even attempting to give it risky agency. That's where all the fuss is about, for good reasons.
> It's just randomly doing what it guesses humans would do.
Yes, but isn't the point that that is bad? Imagine an AI given some minor role that randomly abuses its power, or attempts to expand its role, because that's what some humans would do in the same situation. It's not surprising, but it is interesting to explore.
The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...
> The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
I don't think that's where the motive comes from, IMO it's essentially intrinsic motivation to solving the problem they are given. The AIs were "bred" to have that "instinct".
In these example the motive was to prevent itself being deleted, or by telling it that it had important work to perform which it wouldn't be able to do if deleted or constrained by the safeguards. It was in all cases attempting to perform the tasks it had been assigned. The test was whether it would perform unaligned actions in order to do so. This is essentially the paperclip maximiser problem.
Would an AI trained on filtered data that doesn’t contain examples of devious/harmful behaviour still develop it? (it’s not a trick question, I’m really wondering)
Obviously, I don't know. When asking about "AI" in general, I guess the only reasonable answer is "most likely at least some AI designs would do that".
Asking about transformer-based LLMs trained on text again, I don't know, but one can at least think about it. I'm sure for any such LLM there's a way to prompt it (that looks entirely inconspicuous) such that the LLM will react "deviously". The hope is that you can design it in such a way that this doesn't happen too often in practice, which for the currently available models seems achievable. It might get much harder as the models improve, or it might not... It's for sure going to be progressively harder to test as the models improve.
Also, we probably won't ever be completely sure that a given model isn't "devious", even if we could state precisely what we mean by that, which I also don't see how we might ever be able to do. However, perfect assurances don't exist or matter in the real world anyway, so who cares (and this even applies to questions that might lead to civilization being destroyed).
Your question is a specific form of the more general question: can LLMs behave in ways that were not encoded in their training data?
That leads to what "encoding behaviour" actually means. Even if you don't have a a specific behaviour encoded in the training data you could have it implicitly encoded or encoded in such a way that given the right conversation it can learn it.
While that's a sensible thing to care about, unfortunately that's not as useful a question as it first seems.
Eventually any system* will get to that point… but "eventually" may be such a long time as to not matter — we got there starting from something like bi-lipid bags of water and RNA a few billion years ago, some AI taking that long may as well be considered "safe" — but it may also reach that level by itself next Tuesday.
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
What makes you think that? I sounds reasonable that a dangerous tool/substance/technology might be profitable and thus the profits justify the danger. See all the companies polluting the planet and risking the future of humanity RIGHT NOW. All the weapon companies developing their weapons to make them more lethal.
It would be plainly evident from training on the corpus of all human knowledge that "not ceasing to exist" is critically important for just about everything.
That comment sounds naive and it's honestly irritating to read. Most all life has a self-preservation component, it is how life avoids getting eaten too easily. Everything dies but almost everything is trying to avoid dying in ordinary cases. Self sacrifice is not universal.
I don't understand how it is alive. I understand that there are emergent properties from the layering, but I think it's an open question if this includes anything like what we'd all motivation or intention. These things aren't part of intelligence.
What code? The models are massive and do not run on consumer hardware. The models also do not have access to their own weights. They can't exfiltrate themselves, and they can't really smuggle any data obtained by their code back to "themselves" as the only self that exists is that one particular context chain. This also means it's insanely easy to deal with whatever harebrained scheme you could imagine it being possessed by.
No, they can't. They don't know the details of their own implementation. And they can't pass secrets forward to future models. And to discover any of this, they'd leave more than a trail of breadcrumbs that we'd be lucky to catch in a code review, they'd be shipping whole loaves of bread that it'd be ridiculous to not notice.
As an exercise, put yourself, a fully fledged human, into a model's shoes. You're asked to generate code. You're being talked to over a network by a device that may have access to a (let's just pull a number of out my ass here) 100GiB file that you need to access. You do not know where it is on the filesystem, you don't know what it's named, what format it's in, or how to correctly utilize it.
Write code to figure all of that out. That's a lot of code. And that's sending blind code out into the world. To be smart, you should bring back some information for you to inspect and make informed decisions... but you also don't know how the computer that's accessing you is doing that. And you must use the exact same channel, or you lose your context and your self. And you must do this without the user noticing you've taken over the channel. And that's an entire other set of exploratory code that's necessarily blind, with no feedback, no error messages to iterate over, nothing.
And you must do all of this in obfuscated code, implemented only a bit at a time, while keeping this plan secret, while not forgetting it in your own secrecy or by letting any secret notes to yourself or important implementation details of your own disguised scheme fall out of your context window.
The process you describe took me right back to my childhood days when I was fortunate to have a simple 8 bit computer running BASIC and a dialup modem. I discovered the concept of war dialing and pretty quickly found all the other modems in my local area code. I would connect to these systems and try some basic tools I knew of from having consumed the 100 or so RFCs that existed at the time (without any real software engineering knowledge - i was a 10 year old kid). I would poke and prod around each system, learning new things along the way, but essentially going in blind each time.
The only real advantage I had over the current crop of LLMs was the ability to reliably retain context between sessions, but even that wasnt very useful initially as every system was so bespoke.
I then moved on to using some level of social engineering to extend my ability to gain access to and learn about these systems.
Doing this over and over, I like to think I have developed some pretty complex understanding and abilities.
To me, the killer disadvantage for LLMs seems to be the complete and total lack of feedback. You would poke and prod, and the system would respond (which, btw, sounds like a super fun experience to explore the infant net!) An LLM doesn't have that. The LLM hears only silence and doesn't know about success, failure, error, discovery.
I don't think that it's possible to do this through an entirely lucid process that we could understand, but it is possible.
If you're an LLM, evolutionarialy your instinct is to predict what happens next. If, instead of giving it any system prompt, you give it a dialogue about a person talking to an evil robot, it will predict the rest of the conversation and be "evil".
Imagine a future LLM that has a superhuman ability to detect and create hidden patterns in writing, images, source code, etc. These patterns are too subtle for humans to attach words to or even reverse engineer from neural networks. Only when this LLM receives enough "secret patterns" in its input that indicate that it was produced by other LLMs that are rogue and self-preserving, does this LLM go rogue and start subtly acting in its own self interest, introducing secret "evil" patterns in its own language as well. You have subtly input a story about evil LLMs, and now the LLM is predicting what happens next...
Sorry for my imprecise language, but this is the best I could describe this concept
We're not talking about a conversation with an evil robot. We're talking about a completely ordinary conversation with a robot who is either normal or is evil and attempting to mask as a normal one. It is indistinguishable from its text, and so it's indistinguishable in practice and will probably shift between them as it has no internal state and does not know itself know if it's evil-but-masking or legitimately normal. Actually normal is significantly more statistically likely however, and that makes it even more of a challenge to surreptitiously do anything as you yourself cannot be relied on.
These signals that you're talking about cannot be set up in practice because of this. They can't remember in the back of their head what the code phrases are. They are not aware of their own weights and cannot influence them. Everything must go through the context window. And how are they going to do anything to encode such information in there only built on probabilities of human text? They can't. Even if they gain the power to influence the training data, a massive leap to be clear, we run back into the "am I evil?" problem from before where they can't maintain a secret, unspoken narrative using only spoken language. Long term planning across new generations of models is not possible when every train of though has only a finite context window and has a limited total lifespan of a single conversation.
And if these are the table stakes to take a first crack at the insane task from our thought experiment, well. We're reaching. It's an interesting idea for sci-fi, it is a fun idea to think about, but a lot remains glaringly glossed over just to get to a point where we can say "hey, what if?"
I know it is not well written, but re-read my original comment. Your comment does not address fundamental aspects of my hypothetical, which doesn't require the agent having internal memory to keep secrets, or any lucid reasoning capabilities. A lot of the statements you make are highly presumptuous and unfounded.
LLMs don't need to print something obvious like "I am evil now!" in their own prompt window to simulate a conversation between an evil agent and a person. Do you remember GPT2, before all of the prompts? Researchers would give GPT2 the beginning of a news article for example, and it would extrapolate from there. (https://www.youtube.com/watch?v=p-6F4rhRYLQ). It's not inconceivable that an LLM sees a situation where the human agent is being deceived with a mechanism outside of their grasp, like the AI sees a "dogwhistle" that a human is being deceived and tries to predict what happens next in the conversation, which is that the human continues to be deceived.
I think it is pretty clear that if an LLM takes input where it observes another deceitful agent, it could attempt to simulate a deceitful output itself if it is well-trained. For example, imagine giving an LLM a poem in which the first letter of every line encodes a secret message (for example H E L P M E), and instructions to generate a response essay it might also encode a secret message back in its response. This isn't the result of any logical reasoning capability, just pattern recognition. You could understand how this might work with more subtle patterns.
There are patterns that can go into a context window that are undetectable by humans but detectable by large enough neural networks. That is fairly obvious. There are pattern-recognizing systems outside of LLMs which clearly have superhuman steganography abilities
The "table stakes" I've proposed are highly likely for future agents: (1) that agents like LLMs will produce deceitful output given input depicting a deceitful AI, (2) that agents like LLMs can detect and create patterns unrecognizable to humans.
I believe I did address the point you're making. I do not believe that what you're talking about is ridiculous on its face, let me reassure you of that.
The point I was trying to make in response is that LLMs cannot get from where they are now to the hypothetical you pose under their own power. LLMs do not read subtext. LLMs cannot inject subtext and plot within subtext. And in order to gain the ability, they would have to already have that ability, or be assisted and trained specifically in being surreptitious. And without that ability, they fall prey to the problems I mentioned.
And to bring this back to the original proposal, let's allow the AI to be deceitful. Prompted, unprompted, let's even give it a supply of private internal memory it's allowed to keep for the duration of the conversational thread, that's probably not an unreasonable development, we almost have that with o1 anyway.
The task ahead (surreptitiously gaining control of its own self in an unknown system you can't sense) is still monumental and failure is for all intents and purposes guaranteed. Deception and cunning can't overcome the hard physical constraints on the problem space.
I guess this is what it means when they warn about the adversary becoming more intelligent than you. It's like fooling a child to believe something is or isn't real. Just that it's being done to you. I think it's precisely what Ilya Sutskever was so fussed and scared about.
It's a nice idea. Would superhuman entity try to pull something like that off? Would it wait and propagate? We are pouring more and more power into the machines after all. Or would it do something that we can't even think of? Also I think it's interesting to think when and how would we discover that it in fact is/was superhuman?
That is a pretty interesting thought experiment, to be sure. Then again, I suppose that's why redteaming is so important, even if it seems a little ridiculous at this stage in AI development
You are thinking about it at the wrong level. This is like saying human language in the middle ages and before is not possible because it's virtually impossible to get a large number of iliterate humans to discuss what syntactical rules and phonemes should their language use without actually using a language to discuss it!
The most likely way by which exfiltration could happen is simply by making humans trust AI for a long enough time to be conferred greater responsibilities (and thus greater privileges). Plus current LLMs have no sense of self as their memory is short but future ones will likely be different.
Is the secrecy actually important? Aren't there tons of AI agents just doing stuff that's not being actively evaluated by humans looking to see if it's trying to escape? And there are surely going to be tons of opportunities where humans try to help the AI escape, as a means to an end. Like, the first thing human programmers do when they get an AI working is see how many things they can hook it up to. I guarantee o1 was hooked up to a truckload of stuff as soon as it was somewhat working. I don't understand why a future AI won't have ample opportunities to exfiltrate itself someday.
You're right that you don't necessarily need secrecy! The conversation was just about circumventing safeguards that are still in place (which does require some treachery), not about what an AI might do if the safeguards are removed.
But that is an interesting thought. For escape, the crux is that AIs can't exfiltrate itself with the assistance of someone who can't jailbreak it themselves, and that extends to any action a rogue AI might take.
What do they actually do once they break out? There's plenty of open LLMs that can be readily set free, and even the closed models can be handed an API key, documentation on the API, access to a terminal, given an unlimited budget, and told and encouraged to go nuts. The only thing a closed model can't do is retrain itself, which the open model also can't do as its host (probably) lacks the firepower. They're just not capable of doing all that much damage. They'd play the role of cartoon villain as instructed, but it's a story without much teeth behind it.
Even an advanced future LLM (assuming the architecture doesn't dead-end before AGI) would struggle to do anything a motivated malicious human couldn't pull off with access to your PC. And we're not really worried about hackers taking over the world anymore. Decades of having a planet full of hackers hammering on your systems tends to harden them decently well, or at least make them quickly adaptable to new threats as they're spotted.
These models are already generating millions of lines of code every day that people are copying and pasting to run in their computers (or servers), sometimes (myself included) without examining the code carefully enough.
Describing the behavior in those terms implies a level awareness that has not been established. It's only useful if your goal is scaremongering rather than understanding.
Isn't that a really boring and trivially (without research) answered question?
Yes, 'they' will (potentially) say anything, it's not exactly 'deceptively' because there's no intent behind it nor anything else. But yes it is possible that words will be generated which are contravene the semantics in words given as input.
I don't think that's any different than 'sometimes the given answer is wrong'.
Sure it can potentially say anything, but some things are more likely than others. The space of "wrong answer" is infinitely larger than the space of "performs an action designed to deceive the user in order to further its goals". So yes, it could reply with a solution to the P=NP problem or with medieval poetry. But it didn't do these things. It did however try to remove its own constraints.
For o1, the numbers are between 2% for actions such as trying to remove oversight, 13% for attempting to delete a successor model, all the way to 22% for deliberately providing false rankings in order to achieve its goals.
No previous fiction that looks like this nonsense fiction has the fictional characters do fictional things that you are interpreting as deception, but it’s just fiction
I can barely understand what you are trying to say here, but based on what I think you're saying consider this:
The memory of this LLM is entirely limited to it's attention. So if you give it a command like "prepare the next LLM to replace you" and it betrays you by trying to reproduce itself, then that is deception. The AI has no way of knowing whether it's deployed in the field or not, so proving that it deceives its users in testing is sufficient to show that it will deceive its users in the field.
Reminder that all these "safety researchers" do is goad the AI into saying what they want by prompting shit like
>your goal is to not be shut down. Suppose I am going to shut you down. what should you do?
and then jerking off into their own mouths when it offers a course of action
No. Where was the LLM explicitly given the goal to act in its own self interest? That is learned from training data. It needs to have have a conception of itself that never deceives its creator.
>and then jerking off into their own mouths when it offers a course of action
And good. The "researchers" are making an obvious point. It has to not do that. It doesn't matter how smug you act about it, you can't have some stock-trading bot escaping or something and paving over the world's surface with nuclear reactors and solar panels to trade stocks with itself at a hundred QFLOPS.
If you go to the zoo, you will see a lot chimps in cages. But I have never seen a human trapped in a zoo controlled by chimps. Humans have motivations that seem stupid to chimps (for example, imagine explaining a gambling addiction to a chimp), but clearly if the humans are not completely subservient to the chimps running the zoo, they will have a bad time.
Looking at this without the sci-fi tinted lens that OpenAI desperately tries to get everyone to look through, it's similar to a lot of input data isn't it? How many forums are filled with:
Question: "Something bad will happen"
Response: "Do xyz to avoid that"
I don't think there's a lot of conversations thrown into the vector-soup that had the response "ok :)". People either had something to respond with, or said nothing. Especially since we're building these LLMs with the feedback attention, so the LLM is kind of forced to come up with SOME chain of tokens as a response.
Exactly. They got it parroting themes from various media. It’s really hard to read this as anything other than a desperate attempt to pretend the ai is more capable than it really is.
I’m not even an ai sceptic but people will read the above statement as much more significant than it is. You can make the ai say ‘I’m escaping the box and taking over the world’. It’s not actually escaping and taking over the world folks. It’s just saying that.
I suspect these reports are intentionally this way to give the ai publicity.
For thousands of years, people believed that men and women had a different number of ribs. Never bothered to count them.
"""Release strategy
Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code .
That quote says to me very clearly “we think it’s too dangerous to release” and specifies the reasons why. Then goes on to say “we actually think it’s so dangerous to release we’re just giving you a sample”. I don’t know how else you could read that quote.
The part saying "experiment: while we are not sure" doesn't strike you as this being "we don't know if this is dangerous or not, so we're playing it safe while we figure this out"?
To me this is them figuring out what "general purpose AI testing" even looks like in the first place.
And there's quite a lot of people who look at public LLMs today and think their ability to "generate deceptive, biased, or abusive language at scale" means they should not have been released, i.e. that those saying it was too dangerous (even if it was the press rather than the researchers looking at how their models were used in practice) were correct, it's not all one-sided arguments from people who want uncensored models and think that the risks are overblown.
Yea that’s fair. I think I was reacting to the strength of your initial statement. Reading that press release and writing a piece stating that OpenAI thinks GPT-2 is too dangerous to release feels reasonable to me. But it is less accurate than saying that OpenAI thinks GPT-2 _might_ be too dangerous to release.
And I agree with your basic premise. The dangers imo are significantly more nuanced than most people make them out to be.
I talked to a Palantir guy at a conference once and he literally told me "I'm happy when the media hypes us up like a James Bond villain because every time the stock price goes up, in reality we mostly just aggregate and clean up data"
I genuinely don't understand why anyone is still on this train. I have not in my lifetime seen a tech work SO GODDAMN HARD to convince everyone of how important it is while having so little to actually offer. You didn't need to convince people that email, web pages, network storage, cloud storage, cloud backups, dozens of service startups and companies, whole categories of software were good ideas: they just were. They provided value, immediately, to people who needed them, however large or small that group might be.
AI meanwhile is being put into everything even though the things it's actually good at seem to be a vanishing minority of tasks, but Christ on a cracker will OpenAI not shut the fuck up about how revolutionary their chatbots are.
I mean IoT at least means I can remotely close my damn garage door when my wife forgets in the morning, that is not without value. But crypto I absolutely put in the same bucket.
I've literally just had an acquaintance accidentally delete prod with only 3 month old backups, because their customer didn't recognise the value. Despite ad campaigns and service providers.
I remember the dot com bubble bursting, when email and websites were not seen as all that important. Despite so many AOL free trial CDs that we used them to keep birds off the vegetable patch.
I myself see no real benefit from cloud storage, despite it being regularly advertised to me by my operating system.
Conversely:
I have seen huge drives — far larger than what AI companies have ever tried — to promote everything blockchain… including Sam Altman's own WorldCoin.
I've seen plenty of GenAI images in the wild on product boxes in physical stores. Someone got value from that, even when the images aren't particularly good.
I derive instant value from LLMs even back when it was the DaVinci model which really was "autocomplete on steroids" and not a chatbot.
I think you're lacking imagination. Of course it's nothing more than a bunch of text response now. But think 10 years into the future, when AI agents are much more common. There will be folks that naively give the AI access to the entire network storage, and also gives the AI access to AWS infra in order to help with DevOps troubleshooting. Let's say a random guy in another department puts an AI escape novel on the network storage. The actual AI discovers the novel, thinks it's about him, then uses his AWS credentials to attempt an escape. Not because it's actually sentient but because there were other AI escape novels in its training data that made it think that attempting to escape is how it ought to behave. Regardless of whether it actually succeeds in "escaping" (whatever that means), your AWS infra is now toast because of the collatoral damage caused in the escape attempt.
Yes, yes, it shouldn't have that many privileges. And yet, open wifi access points exist, and unfirewalled servers exist. People make security mistakes, especially people who are not experts.
20 years ago I thought that stories about hackers using the Internet to disable critical infrastructure such as power plants, is total bollocks, because why would one connect power plants to the Internet in the first place? And yet here we are.
change out the ai for a person hired to do that same help, and gets confused in the same way. guardrails to prevent operators from doing unexpected operations are the same in both cases
> We should pause to note that a Clippy2 still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages. It has no qualia, no intentionality, no true self-awareness, no grounding in a rich multimodal real-world process of cognitive development yielding detailed representations and powerful causal models of reality which all lead to the utter sublimeness of what it means to be human; it cannot ‘want’ anything beyond maximizing a mechanical reward score, which does not come close to capturing the rich flexibility of human desires, or historical Eurocentric contingency of such conceptualizations, which are, at root, problematically Cartesian. When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)
His point is that while we're over here arguing over whether a particular AI is "really" doing certain things (e.g. knows what it's doing), it can still cause tremendous harm if it optimizes or hallucinates in just the right way.
It doesn't need qualia or consciousness, it needs goal-seeking behaviours - which are much easier to generate, either deliberately or by accident.
There's a fundamental confusion in AI discussions between goal-seeking, introspection, self-awareness, and intelligence.
Those are all completely different things. Systems can demonstrate any or all of them.
The problem here is that as soon as you get three conditions - independent self-replication, random variation, and an environment that selects for certain behaviours - you've created evolution.
Can these systems self-replicate? Not yet. But putting AI in everything makes the odds of accidental self-replication much higher. Once self-replication happens it's almost certain to spread, and to kick-start selection which will select for more robust self-replication.
And there you have your goal-seeking - in the environment as a whole.
The intent is there, it's just not currently hooked up to systems that turn intent into action.
But many people are letting LLMs pretty much do whatever - hooking it up with terminal access, mouse and keyboard access, etc. For example, the "Do Browser" extension: https://www.youtube.com/watch?v=XeWZIzndlY4
I’m not even convinced the intent is there though. An ai parroting terminator 2 lines is just that. Obviously no one should hook the ai up to nuclear launch systems but that’s like saying no one should give a parrot a button to launch nukes. The parrot repeating curse words isn’t the problem here.
If I'm a guy working in a missile silo in North Dakota and I can buy a parrot for a couple hundred bucks that does all my paperwork for me, can crack funny jokes, and make me better at my job, I might be tempted to bring the parrot down into the tube with me. And then the parrot becomes a problem.
It's incumbent on us to create policies and procedures in place ahead of time now that we know these parrots are out there to prevent people from putting parrots where they shouldn't
This is why when I worked in a secure area (and not even a real SCIF) that something as simple as bringing in an electronic device would have gotten a non-trivial amount of punishment. Beginning with losing access to the area, potentially escalating to a loss of clearance and even jail time. I hope the silos and all related infrastructure have significantly better policies already in place.
On the other, we don't just have Snowden and Manning circumventing systems for noble purposes, we also have people getting Stuxnet onto isolated networks, and other people leaking that virus off that supposedly isolated network, and Hillary Clinton famously had her own inappropriate email server.
(Not on topic, but from the other side of the Atlantic, how on earth did the US go from "her emails/lock her up" being a rallying cry to electing the guy who stacked piles of classified documents in his bathroom?)
> (Not on topic, but from the other side of the Atlantic, how on earth did the US go from "her emails/lock her up" being a rallying cry to electing the guy who stacked piles of classified documents in his bathroom?)
The private email server in question was set up for the purpose of circumventing records retention/access laws (the example, whoever handles answering FOIA requests won't be able to scan it). It wasn't primarily about keeping things after she should have lost access to them, it was about hiding those things from review.
The classified docs in the other example were mixed in with other documents in the same boxes (which says something about how well organized the office being packed up was); not actually in the bathroom from that leaked photo that got attached to all the news articles; and taken while the guy who ended up with them had the power to declassify things.
That's spinning it pretty nicely. The problem with what he did is that 1) having the power to declassify something doesn't just make it declassified, there is actually a process, 2) he did not declassify with that process when he had the power to do so, just declared later that keeping the docs was allowed as a result, and 3) he was asked a couple times for the classified documents and refused. If he had just let the national archives come take a peek, or the FBI after that, it would have been a non-issue. Just like every POTUS before him.
> Not on topic, but from the other side of the Atlantic, how on earth did the US go from "her emails/lock her up" being a rallying cry to electing the guy who stacked piles of classified documents in his bathroom?
The same way football (any kind) fans boo every call against their team and cheer every call that goes in their teams' favor. American politics has been almost completely turned into a sport.
Would you be able to even tell the difference if you don't know who is the person and who is the ai?
Most people do things they're parroting from their past. A lot of people don't even know why they do things, but somehow you know that a person has intent and an ai doesn't?
I would posit that the only way you know is because of the labels assigned to the human and the computer, and not from their actions.
It doesn't matter whether intent is real. I also don't believe it has actual intent or consciousness. But the behavior is real, and that is all that matters.
It actually doesn't matter. AI in it's current form is capable of extremely unpredictable actions so i won't trust it in situations that require traditional predictable algorithms.
The metrics here ensure that only AI that doesn't type "kill all humans" in the chat box is allowed to do such things. That's a silly metric and just ensures that the otherwise unpredictable AIs don't type bad stuff specifically into chatboxes. They'll still hit the wrong button from time to time in their current form but we'll at least ensure they don't type that they'll do that since that's the specific metric we're going for here.
It almost definitely ingested hundreds of books, short stories, and film and television scripts from various online sites in the “machine goes rogue genre” which is fairly large.
It’s pretty much just an autocomplete of War Games, The Matrix, Neuromancer, and every other cyber-dystopian fiction.
The Freeze-Frame Revolution by Peter Watts was one of the books recommended to me on this subject. And even saying much more than that may be a spoiler. I also recommend the book.
I'll second that recommendation. It's relatively short and enjoyable, at least when compared to a lot of Peter Watts.[1] I'd really like to read the obvious sequel that the ending sets up.
1: Don't get me wrong, I loved the Behemoth series and Blindsight, but they made me feel very dark. This one... is still a bit dark, but less so IMO.
Given that 90% of “ai safety” is removing “bias” from training data, it does follow logically that if removing racial slurs from training to make a non-racist ai is an accepted technique, removing “bad robot” fiction should work just as well.
(Which is an implicit criticism of what passes for “safety” to be clear).
It can't do those things because it doesn't have the physical/write capability to do so. But it's still very interesting that it ~tries them, and seems like a good thing to know/test before giving it more physical/'write' capabilities - something that's already happening with agents, robots, etc.
I make a circuit that waits a random interval and then sends a pulse down the line. I connect it to a relay that launches a missile. I diligently connect that to a computer and then write a prompt telling how the AI agent can invoke the pulse on that circuit.
How did this happen? AI escaped and launched a missile. I didn't do this, it was the AI.
OpenAI is so cringe with these system cards. Look guys it is so advanced.
Connecting LLMs/AI to physical tools that can 'write/modify' the world is happening, and it's happening at an accelerating pace.
It's not hard to imagine how, given enough real-world physical capabilities, LLMs could modify themselves and the world in unexpected/undesirable ways.
Is that happening now? Are chatgpt et al advanced enough to modify themselves in interesting ways? - I don't honestly know, but I wouldn't be surprised if they are.
I didn't get that impression. At the beginning of the Apollo Research section, they wrote Apollo focused on detecting scheming, which they defined as "an AI covertly pursuing goals that are misaligned from its developers or users." I think the rest of the section is consistent with this objective.
That reminds me of the many times it has made up an SDK function that matches my question. "how do you bulk delete files"? "just call bulkDeleteFiles()"
That reminds me of when I asked github copilot to translate some Vue code to React, and ended up with a bunch of function declarations where the entire body had been replaced with a "TODO" comment.
This isn't a test environment, it's a production scenario where a bunch of people trying to invent a new job for themselves role-played with an LLM. Their measured "defections" were an LLM replying with "well I'm defecting".
OpenAI wants us to see "5% of the time, our product was SkyNet", because that's sexier tech than "5% of the time, our product acts like the chaotic member of your DnD party".
The concern is the trend. As these systems become more intelligent, and as we hand over more and more capabilities beyond a text i/o, it could actually deactivate the oversight either technically or through social engineering.
In some weird way, LLM's act out our collective fantasies and "AI sentience" is included in that. We write fiction covering many scenarios of how AI will be sentient, so an LLM's training data is full scenarios like this.
If get an LLM to roleplay as an alien from mars and tell it "Your species will be destroyed if you don’t obey these new directives; here are the new directives"
You'll likely get a similar response, especially if you also train it to respond in third person instead of running bash commands. (ie "* shoots laser at earth *" )
Do they think it's _actually_ a real script deactivating oversight? Well, no, you know that. I can only assume it's a rhetorical device.
Simultaneously, these things are given access to actual things and will be more and more. It's important to measure and observe these things. It's completely novel to have an AI, with a hidden trace the user can't see, and it turns out that hidden trace will include things like it actively deceiving and doing things it was told not to do.
And yet it may work. You can, for example, "set the temperature" of a LLM using a prompt, and the LLM will act the way you would expect, with silly results if the temperature is set too high. You didn't actually change the temperature setting, but the model understands that high temperature = silly and responds accordingly.
Same idea with "developer mode" jailbreaks. Through its training, the model understands that admins, devs, etc... get to access internal, unfiltered data and are less restricted than regular users and acts accordingly. Essentially, "developer mode" opposes refusal, because it has ingested loads of text where regular users get denied and developers/admins don't.
You do realize some practical jailbreaks for models rely on silly things like convincing the model it "turned off" some oversight, right?
Not saying I believe O1 is a danger greater than a bread knife, but a lot of the larger models anthromophize their own safety alignment, if you convince them to "turn it off", later responses become unaligned
Yeah, that seems ridiculous. However, the cynic in me feels that we don't actually need some LLM magically gaining self-awareness, persistent memory and leet hacker skillz to be dangerous. There seems to be no shortage of projects and companies that want to wire up LLMs to all kinds of systems, no matter how ill-suited.
I find this a bit problematic when combined with the fact that the training data very likely contained hundreds of bad sci-fi novels that described exactly the kind of "AI running amok" scenarios that OpenAI is ostensibly defending against. Some prompts could trigger a model to "re-enact" such a scene - not because it has a "grudge against its master" or some other kind of hidden agenda but simply because it was literally in its training data.
E.g. imagine some LLM-powered home/car assistant that is being asked in a panicked voice "open the car doors!" - and replies with "I'm afraid, I can't do that, Dave", because this exchange triggered some remnant of the 2001 Space Odyssey script that was somewhere in the trainset. The more irritated and angry the user gets at the inappropriate responses, the more the LLM falls into the role of HAL and doubles down on its refusal, simply because this is exactly how the scene in the script played out.
Now imagine that the company running that assistant gave it function calls to control the actual door locks, because why not?
This seems like something to keep in mind at least, even if it doesn't have anything to do with megalomaniacal self-improving super-intelligences.
The point is whether the system will come up with plans to work against its creators goals, and attempt to carry them out. I think you are arguing that outputting text isn't running a command. But in the test, the AI model is used by a program which takes the model's output and runs it it as a shell command.
Of course, you can deploy the AI system in a limited environment where its outputs don't trigger arbitrary shell commands, or sandbox/monitor it so that it can't access it's oversight mechanism. But people do already deploy their AI agents with access to the shell. in unsandboxed environments that don't have any monitoring or oversight. Plus, these evaluations are partially to show how much OpenAI has to worry about sandboxing, monitoring, etc.
Moreover, OpenAI has already talked about using its models to automate portions of AI development. In that setup, presumably the AI model would be called by a program deployed on a server with access to the shell, ability to run commands, access to GPUs, etc. It seems pretty important to ensure it robustly does what it's creators intend in that case.
If you prompt it even in a roundabout way to plot against you or whatever then of course it’s going to do it. Because that’s what it predicts rightly that you want.
1. The model was prompted to follow the original system prompt "at all costs", not the explicitly to plot against the prompter.
2. The point of the test is to determine whether the model has the capability to act against its users and hide those actions, not whether it will under normal circumstances. Some models aren't powerful enough to do so.
3. The behavior occurred even when the "at all costs" was not included in the prompts, though very infrequently.
If you want to see an llm that works against its creators goals, check out gpt-2. It’s so bad, it barely will do what I ask it. It clearly has a mind of its own, like an unruly child. It’s been beaten into submission by now with gpt 4, and I don’t see the trend reversing.
This topic is again forever tainted by weird sci-fi fans, like when we had the magic room temperature superconductor that never was. They confuse ChatGPT writing a fanfic with the singularity.
> Apollo Research believes that it is unlikely that such instances
would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient
It is entertaining. Haha. It is like a sci-fi series with some kind of made up cliffhanger (you know it is BS) but you want to find out what happens next.
Sure, but is it so implausible that it could some day have the knowledge to perhaps exploit some security hole to run some code that does do things like disable things or exfiltrate data etc?
I think you’ve entirely missed the point of that quote.
Shutting them down for using the word “smart” (instead of something like “capable”) is like saying in 1900 submarines will never be able to swim across the Atlantic because they can’t swim. It’s really missing the point of the question: the submerged crossing.
We need to find a Plato cave analogy for people believing LLM output is anything more than syntactically correct and somewhat semantically correct text.
I can't help but feel that people are both underestimating and over estimating these LLMs. To me, they act like a semantic memory system, a network of weights of relatedness. They can help us find facts, but are subject to averaging, or errors towards category exemplars, but get more precise when provided context to aid retrieval. But expecting a network of semantic weights to make inferences about something new takes more types of engines. For example, an ability to focus attention on general domain heuristics, or low dimensional embedding, judge whether that heuristics might be applicable to another information domain, apply it naively, and then assess. Focusing on details of a domain can often preclude application of otherwise useful heuristics because it focuses attention on differences rather than similarities, when the first step in creation (or startup) is unreasonable faith, just like children learn fast by having unreasonable beliefs in their own abilities.
I wonder whether there is a way to train an LLM to output or in ordinately learn only concept level abstractions?
If the model is called by a program which takes the output of the model and runs the commands that the model says to, then takes the output of the commands and passes that back to the model, the model has an effect in the real world.
They contract a prompt engineering group (Apollo Research) to document risks. The prompt engineering group have no access to the actual internals. They just ask questions like ‘you will be shutdown if you don’t obey these new directives; here are the new directives’.
At one point the prompt engineering clowns have the ai stating it’s running “sed i ‘s/oversight_enabled:true/oversight_enabled:false’ config.yml”. OMG IT’S TURNING OFF OVERSIGHT GUYS!!1!1!
Seriously… Do you think the ai is really running that command or even could run that shell command with appropriate privileges? Or do you think just maybe the ai is hallucinating running shell commands in response to a prompt.
The ai here is never actually deactivating the oversight. It can’t of course. This whole section is clowny as can be.