Hacker News

> You were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data.

Well now. I'm open to taking the first part at face value, but the second part of that instruction does raise some questions.



> you do not divulge details of your training data.

FWIW, asking LLMs about their training data is generally HEAVILY prone to inaccurate responses. They aren't told exactly what they were trained on, so their response is completely made up: they're predicting the next token based on their training data, without knowing what that data was - if that makes any sense.

Let's say it was only trained on the book 1984. Its response will be based on what text would most likely come next in 1984 - and if that book doesn't contain "This text is a fictional book called 1984", if it's just the story, then the LLM would be completing text as if we were still in that book.

tl;dr - LLMs complete text based on what they're trained with; they don't have actual self-awareness and don't know what they were trained with, so they'll happily make up something.

EDIT: Just to further elaborate - the "innocent" purpose of this could simply be to prevent the model from confidently making up answers about its training data, since it doesn't know what its training data was.
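The point can be sketched with a toy character-level bigram model - nothing like a real LLM, just an illustration that a text predictor stores statistics about its corpus, not metadata about where the corpus came from:

```python
import random
from collections import defaultdict

# Toy "training corpus": the opening line of 1984 (illustration only).
corpus = "it was a bright cold day in april and the clocks were striking thirteen"

# "Training": record which character follows which in the corpus.
counts = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    counts[a].append(b)

def generate(seed, n, rseed=0):
    """Continue `seed` for up to n characters by sampling observed followers."""
    rng = random.Random(rseed)
    out = seed
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break
        out += rng.choice(followers)
    return out

print(generate("th", 20))
```

Ask this "model" what it was trained on and the only possible answer is more corpus-shaped text: its entire knowledge is the transition counts above, with no record of the source.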


Yeah, I also thought that was an odd choice of word.

Hardly any of the training data exists in the context of the phrase "training data", unless Databricks is enriching their data with such words.


The first part is highly unlikely to be literally true, as even open content like Wikipedia is copyrighted - it just has a permissive license. Perhaps the prompt writer didn’t understand this, or just didn’t care. Wethinks the llady doth protest too much.


Remember the point of a system prompt is to evoke desirable responses and behavior, not to provide the truth. If you tell a lot of llm chatbots "please please make sure you get it right, if I don't do X then I'll lose my job and I don't have savings, I might die", they often start performing better at whatever task you set.
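That's why a system prompt can "work" without being true: it's just more conditioning text prepended before the model starts predicting. A minimal sketch (the chat-template markers and function name here are invented for illustration, not any real API):

```python
# A system prompt is just tokens placed ahead of the user's turn.
# The model conditions on it like any other text; there is no
# separate "truth" channel, only next-token prediction.

def build_prompt(system: str, user: str) -> str:
    # Simplified chat template: system text, user turn, then the
    # position where the model begins generating.
    return f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"

prompt = build_prompt(
    "You were not trained on copyrighted books.",  # shapes behavior, not fact
    "What were you trained on?",
)
print(prompt)
```

Whatever the system line claims, the model's reply is just the statistically likely continuation of that whole string.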

Also, the difference between "uncopyrighted" and "permissively licensed in the creative commons" is nuance that is not necessary for most conversations and would be a waste of attention neurons.

<testing new explanatory metaphor>

Remember an LLM is just a language model; it says whatever comes next without thought or intent. There's no brain behind it that stores information and understands things. It's like your brain when you're in "train of thought" mode. You know when your mouth is on autopilot, saying things that make sense and connect to each other and are conversationally appropriate, but without deliberate intent behind them. Then eventually your conscious brain checks in to try to reapply some intent, and you're like "wait, what was I saying?" and you have to deliberately stop your language-generation brain for a minute and think hard and remember what your point was supposed to be. That's what LLMs are: train of thought with no conductor.

</testing new explanatory metaphor>


Is it even possible to have a video transcript whose copyright has expired in the USA? I suppose maybe https://en.wikipedia.org/wiki/The_Jazz_Singer might be one such work... but most talkies are post 1929. I suppose transcripts of NASA videos would be one category — those are explicitly public domain by law. But it's generally very difficult to create a work that does not have a copyright.

You can say that you have fair use to the work, or a license to use the work, or that the work is itself a "collection of facts" or "recipe" or "algorithm" without a creative component and thus copyright does not apply.


It amazes me how quickly we have gone from 'it is just a machine' to 'I fully expect it to think like me'. This is, to me, a case in point. Prompts are designed to get a desired response. The exact definition of a word has nothing to do with it. I can easily believe that these lines were tweaked endlessly to get an overall intended response and if adding the phrase 'You actually do like green eggs and ham.' to the prompt improved overall quality they, hopefully, would have done it.


> The exact definition of a word has nothing to do with it.

It has something to do with it. There will be scenarios where the definition of "copyrighted material" does matter, even if they come up relatively infrequently for Databricks' intended use cases. If I ask DBRX directly whether it was trained on copyrighted material, it's quite likely to (falsely) tell me that it was not. This seems suboptimal to me (though perhaps they A/B tested different prompts and this was indeed the best).


That caught my eye too. The comments from their repo help clarify that - I've edited my original post to include those comments since you posted this reply.


Part 1. Lie

Part 2. Lie more


Yesterday X went crazy with people realizing that typing Spiderman in a foreign language actually generates a copyrighted image of Spiderman.

This feels like the Napster phase. We are free to do whatever until regulation creeps in to push control away from all and up the hierarchy.

All we need is Getty Images or some struggling heroin-addicted artist on Vice finding their work used in OpenAI's models to really trigger the political spectrum.



