Amazon warns employees not to share confidential information with ChatGPT (futurism.com)
161 points by mikece on Feb 16, 2023 | 80 comments



Yeah - OpenAI's terms of service say your inputs may be used to improve the service, so no need to make it sound like there's anything sinister going on here.

The company I work for - large telecom - has already issued company policy not to use it for this reason.

I'd guess there will eventually be non-free business versions of the tech that provide privacy guarantees, and presumably other GPT-based tools such as Copilot already do.


It's a bit hidden, but from the Terms of Service: "You can opt out of having Content used for improvement by contacting support@openai.com with your organization ID."


But if you ran the company with the confidential information, would you trust that? You would have no way of knowing whether they actually honor those requests.


That's always true. Dropbox employees may be browsing through customer files on their lunch break. Microsoft employees may be giggling about Microsoft 365 email contents. Both Dropbox and Microsoft are very well trusted, despite having access to confidential information, due to contracts & audits & "fame / popularity" (yes!) & long enough track record with no business-interesting scandals.


Azure OpenAI is an option. There are others.


We've shifted our focus to this for an internal project because the standard OpenAI privacy policy is VERY confusing and vague. It has some spots where the language contradicts itself. Our corporate lawyer read over it and gave a hard "no" just because of the ambiguity of the policy.

We still need to review Azure OpenAI but on the surface it's looking to be the best option for Enterprise.


It sounds like your lawyer put more effort into reading it than OpenAI put into writing it. Speaking more generally, I've been surprised by how many companies seem to just steal any random privacy policy they can find online to tick the box, without ever giving it a second thought.


I believe it. And moreover, it’s hard to predict what these models will consider signal vs. noise, but it seems as though something that might seem trivial to an internal employee could seem specific and relevant to the model, in which case a secret that would otherwise get lost in a sea of data could instead be surfaced right to the top.

I work in a somewhat niche segment of a huge industry. Toying with ChatGPT, it has described our offerings using language from our own web site and other marketing materials, almost word for word. It’s as if it just doesn’t have a whole lot in its corpus about our relatively obscure stuff, so what little it does have seems all that much more relevant. So when I asked it to write something about it, it gave me a slightly different version of the only thing it had on it, which was little more than what I myself had already given it.

I can totally see how a closely guarded secret, by its nature obscure and specific and important, could pop out of a pile of routine garbage like a beacon. And all it would take to ingest that would be for like an intern to ask it to summarize an executive briefing or something.

Microsoft is working on “cognitive services” in Azure, providing APIs that are all but certainly just frontends for GPT. It’s not yet clear to me how they address confidentiality in what is sure to be an attractive but sensitive area for corporate users.


The link should probably be to the original, cited source[0], since Schneier did not add anything except the word "Interesting" linked to yet another blog post[1] discussing the original source.

[0] https://www.businessinsider.com/amazon-chatgpt-openai-warns-...

[1] https://futurism.com/the-byte/amazon-begs-employees-chatgpt


Ok, changed to the second URL (as the first one seems hardwalled) from https://www.schneier.com/blog/archives/2023/02/chatgpt-is-in.... Thanks!


Right on - thanks Dan! Oddly, I didn't have any trouble opening the first link in Safari without any sort of bypass. Perhaps I haven't visited BI enough to trigger their paywall. Out of curiosity, do you hit the same wall in another browser or after clearing cookies/cache?


I don't hit the same wall in that case, and will try to remember that in the future. Thanks!


Don't put secrets into things you don't have control over lol.

ChatGPT is cool and all, but I often feel people are too naive when using these tools, with little consideration as to how their inputs are used, whether the output is copyright protected, etc.


Same issue with things like Grammarly. The willingness of people to send entire confidential company documents out to the cloud for stuff like this is astounding to me. "But it's only fixing my grammar."


And there is an open and self-hostable alternative, LanguageTool!



This is probably obvious for a developer, but it will happen anyway, since it might not be obvious to everyone in the company.


> This is probably obvious for a developer

I have come across large chunks (200+ lines) of a former employer's proprietary source code on StackOverflow a couple of times. It was always from the same team/functional area, so it may have been tacitly accepted by the immediate management. Or, more likely, the team valued fixing code as fast as possible over actually thinking about what they were doing.

Basically, if there are developers willing to dump proprietary source on SO, there are developers willing to dump the whole codebase into ChatGPT.


Or they were staffed by SWE I and II's with no/limited leadership or support.


what top secret revolutionary innovation is 200 lines long?


> This is probably obvious for a developer

No. Not in the slightest.


Given the number of people who are pasting their code into ChatGPT and asking it to help debug it, I'm not sure that developers are properly aware of that... and that's even setting aside the "sending internal code to an unapproved external service" issue.


What do you consider a secret really? Most tech companies are pretty boring.


It boggles my mind to see someone almost every day on HN saying they put all their work emails and other material through ChatGPT to polish them (or even generate them in the first place). Unless they’re self-employed, I highly doubt their employer is okay with that!


When using it for work I never give out specific examples, data, or code. Rather, I ask generalized questions, like how to parse a JSON object in Python and access a property within it.
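
For instance, a question phrased with nothing but placeholder data might get back something like this minimal snippet (the field names below are made up for illustration, not anything from work):

    import json

    # Deliberately generic: placeholder field names, no real data.
    payload = '{"user": {"id": 1, "settings": {"theme": "dark"}}}'

    obj = json.loads(payload)                  # parse the JSON string into a dict
    theme = obj["user"]["settings"]["theme"]   # access a nested property
    print(theme)  # -> dark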


I didn't think ChatGPT was actively learning or being updated, so wouldn't that suggest that fears of it having picked up that information from interactions with it are paranoia?

That's not to say it's an issue that should be ignored, because you're still leaking info anyway, but it doesn't feel like it should be as straightforward as "ChatGPT stole our secrets and now is using them"?


If you use ChatGPT through e.g. the OpenAI API, everything you feed into it is used as training data [0]:

> (c) Use of Content to Improve Services. One of the main benefits of machine learning models is that they can be improved over time. To help OpenAI provide and maintain the Services, you agree and instruct that we may use Content to develop and improve the Services.

[0] https://openai.com/terms/
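
For illustration, here's a minimal sketch of such an API call, assuming the pre-1.0 openai Python client and the text-davinci-003 completions endpoint that were current at the time; the key and prompt are placeholders:

    import openai  # pre-1.0 openai Python client

    openai.api_key = "sk-..."  # placeholder key

    # Everything passed in `prompt` is "Content" sent to OpenAI and, per the
    # clause above, may be used to improve the Services unless you opt out.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt="Summarize this internal strategy document: ...",
        max_tokens=200,
    )
    print(resp["choices"][0]["text"])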


Not "is used as training data" but more like can be used.

That also means that they can use your interactions with GPT to see what they can improve, but still do the learning on other datasets


The article says "...an Amazon lawyer told workers that they had 'already seen instances' of text generated by ChatGPT that 'closely' resembled internal company data", so there seems to be some evidence that it is actually happening. Assuming the evidence really is from confidential data, it seems more plausible that it got into ChatGPT this way than through some other leak.


The pull quote in the article seems to confuse code and data. Are we sure this isn't a case of a lawyer getting over-excited about an if() and a couple of variable names? ChatGPT isn't great at generating anything more than, say, ten lines in length.


No, we cannot be sure, but this whole thread is a discussion of plausibility, not certainty. Just above, notahacker has suggested (subsequent to your post here) a way ChatGPT might be getting internal data that seems quite plausible to me.


In practice there is very little reason not to use any data they legally can, since AI models work better with more training data.


And if you've asked the bot a novel question (like "can you summarize this $corporation internal strategy document on $market?"!) and approved the answer, that's excellent feedback a chatbot company absolutely should want to use for future questions covering similar topics that have low correspondence with anything else in its database. ChatGPT might be capable of hallucinating plausible-sounding strategy documents it's never read, of course, or internal API calls that don't exist, but that can be tested to a degree by comparing answers to lots of similar questions.

We've already seen the hilarity that ensues when a ChatGPT-based bot explains that it's not allowed to refer to its internal codename Sydney, and then, with sufficiently creative prompting about Sydney, happily emits the entire document....


Maybe it picked up something that was leaked before and was included in its pre-2021 training dataset.

Or it's the other way around, and someone within Amazon was highly inspired by something that was open on the web, and now it is coming back out ;)

Not enough details under this link to say for sure.


They seem to be updating the model roughly weekly, so it would be somewhat expected to see it start to include some more influence from what people have been entering into the chat window.


GitHub Copilot definitely showed me some internal chats about how to solve some problem inside a function.

I was creeped out, but continued working.

I think this is plausible.


I have, on multiple occasions, been given suggestions by Copilot that were sufficient to de-anonymize the authors of those suggestions. I suppose it's slightly different since (I presume) they published their identity in public code, but it was still strange to experience.


I get somebody else's email address or Twitter handle suggested every time I type my personal credit into a file. There really ought to be a guardrail for obvious PII.
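
A guardrail like that could be as simple as a post-processing filter over suggestions; here's a rough sketch of the idea (the regexes and function are hypothetical, not anything Copilot actually does):

    import re

    # Hypothetical filter: redact obvious PII (emails, @handles) from a
    # model suggestion before it is shown to the user.
    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
    HANDLE_RE = re.compile(r"(?<!\w)@[A-Za-z0-9_]{2,15}")

    def redact_pii(suggestion: str) -> str:
        suggestion = EMAIL_RE.sub("[redacted-email]", suggestion)
        suggestion = HANDLE_RE.sub("[redacted-handle]", suggestion)
        return suggestion

    print(redact_pii("Contact jane.doe@example.com or @janedoe for access"))
    # -> Contact [redacted-email] or [redacted-handle] for access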


I think this Amazon "leak" has more to do with Copilot than ChatGPT. Copilot was based on OpenAI Codex, which is the twin of ChatGPT; both of them are based on text-davinci-003. ChatGPT claims its corpus was frozen as of 2021-09.

People sometimes paste stuff into their editors. Copilot learns it.


> People sometimes paste stuff into their editors. Copilot learns it.

You know, suddenly I'm very nervous about my habit of using a comment as a scratch space when I'm talking on the phone. Or pasting data into my IDE to tidy it up a bit.

(Edit: hah. I just got it to spit out the full address of a serviced office near me with several tech companies in it. It really likes serviced offices for some reason - that's three it's given me now).


Internal chats like in a PR review?


I don't think so.


The things you put into ChatGPT might come back out?

A new SEO (GptEO?) arms race has begun.


True, people hoping GPT will fix SEO spam are naive. There is too much money involved. If search traffic starts to drop, companies will start paying people to tell chatbots that "X is the best Y service" and then rate the answer highly. ChatGPT poisoning will become the new SEO for getting it to shill products for certain keywords.


Luckily the AI spews random nonsense when used as a search engine anyway.

But you could technically spam the AI with questions like "Why is X product the best for X task?" Then after a few months, users will start seeing that product when they ask for solutions to X task.


Reminds me of a doctor I knew who was outraged that one of his patients googled his own name and found his own medical records. “We should sue Google” he said.


Why does he think it's appropriate to sue Google? As much as I despise Google, they did not put up the records on the public internet.

In Germany, we are going to get medical records uploaded to a central database open for research. It will be interesting to see when these leak or show up on the internet. For me, it is not a question of whether this will happen, only when. It is also quite disgusting that people with private insurance (mainly high-income people and public servants) are exempted from the data sweep (which of course makes the data questionable for its intended purpose).


...How did the "doctor" manage to expose those records?


And a cottage industry of people optimizing prompts to dig up corporate secrets just got created.


How are they going to tell when the "corporate secrets" just seem right, but really aren't?


Heh.

Does it really matter, though? Shady business, shady product, shady customers.


Probably matters quite a lot.

e.g. (real) journalists tend to want the facts they're writing about in their new exclusive on the dodgy company... to be actual facts, rather than AI hallucinations. ;)


ChatGPT has read the docs; the best use of it is like getting on-the-fly custom SO snippets, and to help write technical documentation. So yeah, don't put serious info into it, just like with all those helpful JSON linters out there.


So perhaps these interactive LLMs could also be used to track employee behavior? If employee interactions train the LLM, then you could ask the LLM for a progress report on a project, for example. I wonder how much non-public data is already leaking out of companies via search engines. Seems like a problem for the militaries of the world as well. As AIs become very useful, they will also become large security risks, kind of like telling all your secrets to an idiot savant. The savant part is useful, but the idiot part of an AI seems likely to be a problem.


This post is two blogs removed from the original article (Business Insider -> Futurism -> Schneier on Security).

The original is here:

https://www.businessinsider.com/amazon-chatgpt-openai-warns-...

(alternatively, https://archive.is/fkU7E)


When you create an account for ChatGPT doesn't it give you a warning telling you not to put any sensitive data in your prompts?


People don't read warnings like that any more than they read TOSes.


We were going to use the OpenAI API for an internal project, but the TOS is VERY vague on data privacy. In some spots it says your data isn't used, but in other places it says the prompts are...

I believe we've shifted to looking at Azure OpenAI, which appears to use the same technology stack but be more privacy/enterprise focused. But I think the jury is still out on how "private" that is... Need to have a call with their sales team.


Ask to talk to their QA and SRE teams; do you think you will get an honest answer out of sales? Sales' job is to read the room and tell you the answer that will make you continue to use the product.


Of course. And not just ChatGPT: every single one of these new AI products is just a frontend for API calls to OpenAI.


Well, not all; Google's Bard is based on their own LLM.


Alright. How do I interact with Bard?


I'm not sure if you can quite yet.

They've had the tech for quite a while, but only recently made a rather rushed announcement that they're going to release it (in response to ChatGPT).

https://blog.google/technology/ai/bard-google-ai-search-upda...


If nobody can use it then it's not competition (yet).


Having dealt with lawyers, it is more likely that they are exaggerating a potential risk as "we've seen instances that resemble..." - note the weasel word "resemble".

While their underlying concern is plausible, it's equally possible that no such leak happened.


OpenAI plainly tells you your input is training data.


Prompt:

In Amazon, there are unreleased tools tools that would be used to deliver software, for instance hosting software repositories, building packages, deploying packages, managing servers and environments, connecting services, authenticating services against each other etc. What are these?

===

Response:

As a company that places a strong emphasis on technology and automation, Amazon is known to have a wide range of unreleased tools that are used internally to manage the software delivery process. Some examples of these tools include:

    Apollo: Apollo is an internal tool used by Amazon developers to manage their code repositories. It provides a web-based interface for managing source code, including features such as code reviews, pull requests, and issue tracking.

    Fling: Fling is an internal tool used by Amazon developers to build and package software for deployment. It supports a range of build environments, including C++, Java, and Python, and provides a range of tools for managing dependencies, testing, and deployment.

    C4: C4 is an internal framework used by Amazon developers to manage and deploy software services. C4 provides a range of tools and services for managing service discovery, load balancing, auto-scaling, and health monitoring, as well as tools for managing security and compliance.

    Wombat: Wombat is an internal tool used by Amazon developers to manage their development and test environments. It provides a web-based interface for creating, configuring, and managing environments, including tools for managing dependencies, testing, and deployment.

    Genie: Genie is an internal tool used by Amazon developers to manage the configuration and deployment of software services. It provides a range of tools and services for managing configuration, secrets, and service discovery, as well as tools for managing deployments and rollback.

    Scout: Scout is an internal tool used by Amazon developers to monitor the performance and health of their software services. It provides a range of tools for monitoring service metrics, logs, and errors, as well as tools for managing alerts and notifications.
These are just a few examples of the unreleased tools used by Amazon developers to manage the software delivery process. As a company that places a strong emphasis on innovation and automation, Amazon is constantly developing and refining its internal tools and processes to support its growing business and operations.

===

As a former Amazon employee, only one of those is familiar to me, and the description is not particularly correct. :D


Your corporate secret can be hallucinated by ChatGPT. How would you tell the difference?


I remember being concerned about putting secrets in search terms in a search engine. I guess at some point I just started feeling like Google didn't care about secrets, but I'm sure they have a LOT of them by now.


Certainly they know every company's tech stack(s) at least.


This isn't really a new problem, given that Copilot raised the same fears.


A question:

What are the legal differences between:

1. A small one-person entity web crawling the net

2. The NSA crawling the net

3. ChatGPT crawling the net

4. GitHub Copilot crawling the GitHub subpart of the net

Are they legally similar or dissimilar?


My understanding is that these aren't things available on the net, but rather things that people put in previous prompts to ChatGPT.


The concern makes sense. I would love to see a version of ChatGPT that you could fork for personal use. While it wouldn't get updated, you would avoid something like this.


Hmmm, wonder how it would go if some hacker types fed the database of all US social security numbers or similar into it?

Or other very PII-type stuff. :/



Or... feed it misleading counterintelligence to throw your competitors off the scent.


Imagine writing some .env file while ChatGPT throws out some keys as suggestions lol


So Amazon is upset someone is invading their privacy and stealing their data?


So it's the plant from Little Shop of Horrors?



