This reminds me of how GPT-2/3/J came across https://reddit.com/r/counting, wherein redditors repeatedly post incremental numbers to count to infinity. It considered their usernames, like SolidGoldMagikarp, such common strings on the Internet that, during tokenization, it treated them as top-level tokens of their own.
Vocabulary isn't infinite, and GPT-3 reportedly had only 50,257 distinct tokens in its vocabulary. It does make me wonder: it's certainly not a linear relationship, but given the number of inferences run every day on GPT-3 while it was the flagship model, the incremental electricity cost of these Redditors' niche hobby might have been measurable, compared with allocating those vocabulary slots to genuinely common substrings in real-world text and thereby reducing average input token counts.
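A quick way to see this is with the open tiktoken library, which ships the GPT-2/GPT-3 style encodings: the vocabulary really is 50,257 entries, and, if I remember right, " SolidGoldMagikarp" with its leading space comes back as a single token. A rough sketch:

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")          # the r50k-style encoding discussed above
    print(enc.n_vocab)                           # 50257
    print(enc.encode(" SolidGoldMagikarp"))      # one token id, if I remember right
    print(len(enc.encode(" an ordinary English phrase")))  # several tokens, as expected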
It would be hilarious if the subtitle on OP's site, "IECC ChurnWare 0.3," became a token in GPT-5 :)
I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs. I mean if someone posts a question on an internet forum that I don't know the answer to, I'm certainly not going to post "I don't know" since that wouldn't be useful.
In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
Three logicians walk into a bar. The bartender says "what'll it be, three beers?" The first logician says "I don't know". The second logician says "I don't know". The third logician says "Yes".
Both of the first two logicians wanted a beer; otherwise they would know the answer was "no". The third logician recognizes this, and therefore knows the answer.
He didn’t know perfectly, but he knew with great enough probability to place an order. In the very small chance that someone wanted two beers, someone would speak up.
This way of working is logically the most efficient and involves the least communication.
I recently heard this explained (*) in the following way: three is the smallest number where you can set up an expectation (with the first two) and then break it. This is why three is such a common number, not just in jokes but in all sorts of story-telling.
(*) In a lecture by the mathematician & author Sarah Hart.
I do love a good joke, but this one falls a bit flat.
Logically speaking, the second logician could have thought to himself "no, I don't want any beer, but one of these two other guys may want to double fist" and so there is really no way for the third logician to answer in the affirmative.
I'm really cross that the word "hallucination" has taken off to describe this, as it's clearly an incorrect word. The correct word to describe it is "confabulation", which is clinically more accurate and a much clearer descriptor of what's actually going on.
Fully agreed that "hallucination" is a bonkers word for it — sensational and melodramatic. But few people know what a confabulation is, and moreover it's an overly complex way to describe the phenomenon.
The LLM is making something up. It's a fabrication.
It's not fanciful; it's not spooky; it's mundane, as it should be.
Fabrication implies making things up for the sake of it. Confabulation is similar but is defined by making things up due to some limitation in capacity.
My view is that hallucination is something related to the interpretation of reality; it doesn't really map directly to memory at all. The mechanisms of confabulation entirely surround the gluing together of memories, and what are these models other than some sort of representation of memory?
I believe that you can also cause something a bit like a transient dysphasia by giving them bad inputs as well, so there is that on the language production side. However, there's still nothing that pertains to the experience aspects central to what hallucinations actually are.
I'm not sure that's correct. Hallucination definitely implies some sort of connection to reality; confabulation does not. It implies some kind of hard-to-detect error in stitching memories together coherently.
Making things up based on memories of past things is entirely what confabulation is. Bullshitting in the large, as it were. I've met quite a few clinical confabulators (people with Korsakoff syndrome and the like) and I find the parallels remarkable.
That’s a good observation. If LLMs had taken off 15 years ago, maybe they would answer every question with “this has already been asked before. Please use the search function”
It's not. Kids overhear what parents watch; their ears are like little recorders. Meanwhile, kids' videos near-universally end with either a like&subscribe admonition, or some crap like "ask parents to download our tablet app". Even the quality videos, they all do that.
Even if you don't show children videos, but want to play some music, YouTube is still the least-hassle, least-bullshit streaming music player (arguably still its main use for adults, too). Ain't anyone got time to deal with Spotify's ever more broken app. And this is the limit of technical skill of almost all parents. They can't exactly run SponsorBlock in YouTube's mobile app (and paid YouTube doesn't help here either, surprise surprise).
Not making excuses (though I'm not really blaming parents for this) - just saying how things actually are.
Which is crazy, because there's plenty of good content for kids on YouTube (if you really need a break!). Blippi, Meekah, Sesame Street, even that mind-numbing drivel Cocomelon (which at least got my girls talking/singing really early).
I get the sentiment, but when reality hits unrealistic parental expectations, things get messy.
If you have to put a show on TV to give some songs to sing along to or to distract them while you're making lunch, I'm not judging you, and I think it's best to put this content on a gradient rather than black and white.
Neither did they have vaccines, bikes, gymnastics classes, dozens of books, a constant supply of fruit and veggies, family vacations, tractor rides, swing sets, family movie nights, planetarium projectors for a few bucks, zoos, kids' museums, and in general a conflict-free, peaceful, and disease-free existence.
Focusing on such a tiny thing and blowing it up into a huge negative out of context of their rich, busy, and safe lives is really out of hand.
Sure, and most of it starts with a jingle and ends with a begging block.
I used to cut all those things to shape with youtube-dl and Audacity; we have a library of a good hundred-plus sanitized songs to play, but with the modern world hating files and anything offline, it turned out to be quite a hassle to keep the practice up.
Which is why I hate the current breed of assistants. They won't be Star Trek-level nice until vendors give up on the whole "Hey [brand 1], [brand 2] on [brand 3] with [brand 4]" interface.
They can say they don't know, and have been trained to in at least some cases; I think the deeper problem — which we don't know how to fix in humans, the closest we have is the scientific method — is they can be confidently wrong.
Contrast with Q&A on products on Amazon where people routinely answer that way. I have flagged responses saying "I don't know" but nothing ever comes of it.
I’d place in the same category the responses that I give to those chat popups so many sites have. They show a person saying to me “Can I help you with anything today?” so I always send back “No”.
A lot of LLM hallucination is because of the internal conflict between alignment for helpfulness and lack of a clear answer. It's much like when someone gets out of their depth in a conversation and dissembles their way through to try and maintain their illusion of competence. In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations.
A lot more of LLM hallucination is it getting the context confused. I was able to get GPT4 to hallucinate easily with questions related to the distance from one planet to another, since most distances on the internet are from the sun to individual planets, and the distances between planets vary significantly based on where they are in their orbits. These are probably slightly harder to fix.
"In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations."
I've noticed that while this can help to prevent hallucinations, it can also cause it to go way too far in the other direction and start telling you it doesn't know for all kinds of questions it really can answer.
It also has a problem with quantities: it got confused by things like the cube root of 750 l, which it maintained for a long time was around 9 m. It even suggested that 1 l is equal to 1 m³.
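For reference, the arithmetic it should have done is tiny; a litre is a thousandth of a cubic metre, so the cube root of 750 l is well under a metre:

    # 1 l = 0.001 m³, so 750 l = 0.75 m³
    volume_m3 = 750 / 1000
    side_m = volume_m3 ** (1 / 3)
    print(f"{side_m:.2f} m")   # ≈ 0.91 m, nowhere near 9 m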
>In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
This isn't true. There are many contexts where it is true, but it doesn't actually generalize the way you say it does.
There are plenty of cases where experts in a non-one-on-one context will express a lack of knowledge. Sometimes this will be part of making a point about the broader epistemic state of the group, sometimes it will be simply to clarify the epistemic state of the speaker.
I personally will almost always say I don't know while talking thru to a solution. Admittedly this is informal speech that doesn't make it to written form.
I've wondered if one could train a LLM on a closed set of curated knowledge. Then include training data that models the behaviour of not knowing. To the point that it could generalize to being able to represent its own not knowing.
Because expecting a behaviour, like knowing you don't know, that isn't represented in the training set is silly.
Kids make stuff up at first, then we correct them - so they have a way to learn not to.
> I've wondered if one could train a LLM on a closed set of curated knowledge. Then include training data that models the behaviour of not knowing. To the point that it could generalize to being able to represent its own not knowing.
The problem is that curating data is slow and expensive and downloading the entire web is fast and cheap.
Agreed. Using LLMs to generate or curate training sets for other generations seems like a cool approach.
Maybe if you trained a small base model to know it doesn't know in general and THEN trained it on the entire web with embedded not-knowing preserving training examples, it would work?
With Wittgenstein I think we see that "hallucinations" are a part of language in general, albeit one I could see being particularly vexing if you're trying to build a perfectly controllable chatbot.
I'm referring to his two works, the "Tractatus Logico-Philosophicus" and "Philosophical Investigations". There's a lot explored here, but Wittgenstein basically makes the argument that the natural logic of language—how we deduce meaning from terms in a context and naturally disambiguate the semantics of ambiguous phrases—is different from the sort of formal propositional logic that forms the basis of western philosophy. However, this is also the sort of logic that allows us to apply metaphors and conceive of (possibly incoherent, possibly novel, certainly not deductively-derived) terms—counterfactuals, conditionals, subjunctive phrases, metaphors, analogies, poetic imagery, etc. LLMs have shown some affinity for the former (linguistic) type of logic, with greatly reduced affinity for the latter (formal/propositional) sort of logical processing. Hallucinations as people describe them seem to be problems with not spotting "obvious" propositional incoherence.
What I'm pushing at is not that this linguistic ability naturally leads to the LLM behavior we're seeing and calling "hallucinating", just that LLMs may capture some of how humans process language, differentiate semantics, recall terms, etc, but without the mechanisms that enable rationally grappling with the resulting semantics and propositional (in)coherency that are fetched or generated.
I can't say this is very surprising—most of us seem to have thought processes that involve generating and rejecting thoughts when we, e.g., "brainstorm" or engage in careful articulation, and we haven't even figured out how to formally model that with a chatbot capable of generating a single "thought". But I'm guessing that if we want chatbots to keep their ability to generate things creatively, there will always be tension with potentially generating factual claims, erm, creatively. Further evidence is anecdotal observations that some people seem to have wildly different thresholds for the propositional coherence they can spot—perhaps one might be inclined to correlate the complexity with which one can engage in spotting (in)coherence with "intelligence", if one considers that a meaningful term.
Wait, are you saying this is something you read in both the Tractatus and the PI? They are quite opposed as texts! That's kinda why he wrote the PI at all.
I don't think Wittgenstein would agree, first of all, that there is a "natural logic" to language. At least in the PI, that kind of entity--"the natural logic of language"--is precisely the kind of weird and imprecise use of language he is trying to expose. Even more, to say that such a logic "allows" for anything (like metaphors) feels like a very very strange thing for Wittgenstein to assert. He would ask "what do you mean by 'allows'"?
All we know, according to him (in the PI), is that we find ourselves speaking in situations. Sometimes I say something, and my partner picks up the right brick, other times they do nothing, or hit me. In the PI, all the rest is doing away with things, like our idea of private language, the irreality of things like pain, etc. To conclude that he would make such assertions about the "nature" of language, of poetry, whatever, seems like maybe too quick a reading of the text. It is at best, a weirdly mystical reading of him, that he probably would not be too happy about (but don't worry about that, he was an asshole).
The argument you are making sounds much more French. Derrida or Lyotard have said similar things (in their earlier, more linguistic years). They might be better friends to you here.
The texts are quite different, this is true, but I don't find them contradictory. Whereas Tractatus was almost a facetious or flippant rejection of the millennia-long project to agree on a philosophical subset of language suitable for rigorous philosophy (although it continues today in the form of analytical philosophy), PI basically says "well we don't need to throw the baby out with the bath water", which I think is a fantastically mature response to a flawed tool that's still the best we have to reason about the universe. So: not contradictory in evaluation of fundamental compatibility of non-formal language for the formal needs of propositional philosophy, but perhaps contradictory in implied reaction to this realization.
I would assume GP is talking about the fallibility of human memory, or perhaps about the meanings of words/phrases/aphorisms that drift with time. C.S. Lewis talks about the meaning of the word "gentleman" in one of his books; at first the word just meant "land owner" and that was it. Then it gained social significance and began to be associated with certain kinds of behavior. And now, in the modern register, its meaning is so dilute that it can be anything from "my grandson was well behaved today" or "what an asshole" depending on its use context.
> In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
Only a response makes it clear that one has read and acknowledged the question, and sometimes there are people who are expected to know; if they don't, they should say so.
> I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs
I mean, it's inherent to LLMs to be unable to answer "I don't know" as a result of not knowing the answer. An LLM never "doesn't know" the answer. But they'll gladly answer "I don't know" if that's statistically the most likely response, right? (Although current public offerings are probably trained against ever saying that.)
Knowing to say "I don't know" instead of extrapolating is an explicitly learned skill in humans, not something innate or inherent in the structure of language, so we shouldn't expect LLMs to pick it up ex nihilo either.
I suspect this is going to be a disagreement on the meaning of "to know".
On the same lines as why people argue over whether a tree falling in a wood where nobody can hear it makes a sound: some people implicitly regard sound as the qualia, while others regard it as the vibrations in the air.
What does it mean for a human to "know" something?
I have some representation in my mind; as someone who doesn't have aphantasia, this representation comes with a mental image. Tower? Tall, linear, and in my case a skyscraper by default. Eiffel Tower? Paying attention to the extra context, the first word transforms the second into the eponymous structure. Model Eiffel Tower? Now the context makes it a tchotchke, probably 10cm tall. Lego model Eiffel Tower? The 1-ish meter tall one on display in the Lego shop.
Is my "knowledge" the abstract representation that is in my case connected to a mental image? The attention process can reasonably be considered as developing a vector in a very high dimensional concept space, and the next token comes from what would best suit the current location in that high dimensional space. It's entirely possible that the concept of "ignorance" is linearly separable within that space (much as gender is, see the word2vec trick with "king" - "queen" ~= "man" - "woman"), and the corresponding "ignorance" vector can be associated with the sequence of words "I don't know".
I think it would take actual research into the internal vector space to answer that, and while I'd like to do that research, I have some higher priorities right now.
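For anyone who wants to poke at that word2vec trick without doing the research themselves, a minimal sketch with gensim's pretrained GloVe vectors (the download is a couple of hundred MB) looks roughly like this; whether an analogous "ignorance" direction exists inside an LLM is exactly the open question:

    import gensim.downloader as api

    # Pretrained GloVe word vectors; any reasonably large embedding model would do.
    vectors = api.load("glove-wiki-gigaword-100")

    # king - man + woman ≈ queen: the "gender" direction is close to linearly separable.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))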
An LLM should have no problem replying "I don't know" if that's the most statistically likely answer to a given question, and if it's not trained against such a response.
What it fundamentally can't do is introspect and determine it doesn't have enough information to answer the question. It always has an answer. (disclaimer: I don't know jack about the actual mechanics. It's possible something could be constructed which does have that ability and still be considered an "LLM". But the ones we have now can't do that.)
No, that's the same misunderstanding previously stated.
Answering "I don't know" because it a likely response to a particular string is completely different from being aware that one does not know the answer and saying so.
Both motivations lead to the same outcome, but they're unrelated processes. The response "I don't know" can represent either:
1. The most likely answer to a particular question, based on statistical data; or
2. An expression of an agent's internal state.
Figuring out that distinction is perhaps one of the most important questions ever raised.
During tokenization, the usernames became tokens... but before training the actual model, they removed stuff like that from the training data, so it was never trained on text which contains those tokens. As such, it ended up with tokens which weren't associated with anything; glitch tokens.
It's interesting: perhaps the stability (from a change management perspective) of the tokenization algorithm, being able to hold that constant, between old and new training runs was deemed more important than trying to clean up the data at an earlier phase of the pipeline. And the eventuality of glitch tokens was deemed an acceptable consequence.
I'm more interested in what that content farm is for. It looks pointless, but I suspect there's a bizarre economic incentive. There are affiliate links, but how much could that possibly bring in?
This is a honeypot. The author, https://en.wikipedia.org/wiki/John_R._Levine, keeps it just to notice any new (significant) scraping operation: whatever gets launched will invariably hit his little farm and show up in the logs. He's a well-known anti-spam operative, with various efforts now dating back multiple decades.
Notice how he casually drops a link to the landing page in the NANOG message. That's how the bots will take the bait.
I recognize the name John Levine at iecc.com, "Invincible Electric Calculator Company," from the web 1.0 era. He was the moderator of the Usenet comp.compilers newsgroup and wrote the first C compiler for the IBM PC RT.
Except the first thing openai does is read robots.txt.
However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of the robots.txt on the new domain.
> Except the first thing openai does is read robots.txt.
Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.
All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.
Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”
And then all the links are to external domains, which aren't subject to the first site's robots.txt
More directly, e.g. Tesla boasts of training their FSD on data captured from their customers' unassisted driving. So it's hardly surprising that it imitates a lot of humans' bad habits, e.g. rolling past stop lines.
Jesus, that's one of those ideas that looks good to an engineer but is why you really need to hire someone with a social sciences background (sociology, anthropology, psychology, literally anyone whose work includes humans), and probably should hire two, so the second one can tell you why the first died of an aneurysm after you explained your idea.
You mean human users? That is and always will be the dominant group of clients that ignore robots.txt.
What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users.
That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user).
Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.
The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data; don't hand it over for free.
People don't mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.
> The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data; don't hand it over for free.
Sure, and they’ll pay the scrapers you haven’t banned for your content, because it costs those scrapers $0 to get a copy of your stuff so they can sell it for far less than you.
> People don't mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.
The bots are currently dumb _because_ we don’t try to stop them. There’s no need for smarter scrapers.
Watch how quickly that changes if people start blocking bots enough that scraped content has millions of dollars of value.
At the scale of a company, it would be trivial to buy request log dumps from one of the adtech vendors and replay them so you are legitimately mimicking a real user.
Even if you are catching them, you also have to be doing it fast enough that they’re not getting data. If you catch them on the 1,000th request, they’re getting enough data that it’s worthwhile for them to just rotate AWS IPs when you catch them.
Worst case, they just offer to pay users directly. “Install this addon. It will give you a list of URLs you can click to send their contents to us. We’ll pay you $5 for every thousand you click on.” There’s a virtually unlimited supply of college students willing to do dumb tasks for beer money.
You can’t price segment a product that you give away to one segment. The segment you’re trying to upcharge will just get it for cheap from someone you gave it to for free. You will always be the most expensive supplier of your own content, because everyone else has a marginal cost of $0.
Google doesn't care what you do to other crawlers that ignore your TOS. This isn't a theoretical situation; it's already going on. Crawling is easy enough to "block" that there are court cases on this stuff, because this is very much the case where the defense wins once they devote fairly trivial resources to the effort.
And again, blocking should never be the goal; poisoning the well is. Training AI on poisoned data is both harder to detect and vastly more harmful. A price comparison tool is only as good as the actual prices it can compare, etc.
Scrapers of the future won't be ifElse logic, they will be LLM agents themselves. The slow-loris robots.txt has to provide an interface to its own LLM, which engages the scraper LLM in conversation, aiming to extend it as long as possible.
"OK I will tell you whether or not I can be scraped. BUT FIRST, listen to this offer. I can give you TWO SCRAPES instead of one, if you can solve this riddle."
You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.
Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.
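A rough sketch of what those limits look like with the common requests library; the URL and the size cap are just placeholders:

    import requests

    URL = "https://example.com/some-page"   # placeholder
    MAX_BYTES = 2 * 1024 * 1024             # arbitrary per-page cap

    # Connect/read timeouts plus a byte cap bound how long a single slow or
    # never-ending response can tie up a crawler worker.
    with requests.get(URL, timeout=(5, 15), stream=True) as resp:
        body = b""
        for chunk in resp.iter_content(chunk_size=8192):
            body += chunk
            if len(body) > MAX_BYTES:
                break   # bail out of pathologically large or never-ending responses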
He says 3 million, and 1.8 million are for robots.txt
So 1.2 million non-robots.txt requests, when his robots.txt file is configured as follows:
# buzz off
User-agent: GPTBot
Disallow: /
Theoretically if they were actually respecting robots.txt they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.
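For what it's worth, Python's standard robots.txt parser reads that file the same way; a compliant GPTBot should get a "no" for every path on that host, and each of the generated subdomains would need its own robots.txt fetched, since the file only speaks for one host:

    from urllib.robotparser import RobotFileParser

    # The rules quoted above, fed straight to the stdlib parser.
    rp = RobotFileParser()
    rp.parse([
        "# buzz off",
        "User-agent: GPTBot",
        "Disallow: /",
    ])

    print(rp.can_fetch("GPTBot", "https://www.web.sp.am/"))          # False
    print(rp.can_fetch("GPTBot", "https://www.web.sp.am/anything"))  # False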
A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow the path "check if we have robots.txt that allows us to scan this site, if we don't get and store robots.txt, scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better ask for forgiveness than for permission".
That is indistinguishable from not respecting robots.txt. There is a robots.txt on the root the first time they ask for it, and they read the page and follow its links regardless.
I agree with you. I only stated how the crawlers seem to work; if you read their pages or try to block or slow them down, it seems clear that they scan-first-respect-after. But somehow people understood that I approve of that behaviour.
For those bad crawlers, which I very much disapprove of, "not respecting robots.txt" equals "don't even read robots.txt, or if I read it, ignore it completely". For them, "respecting robots.txt" means "scan the page for potential links, and after that parse and respect robots.txt". Which I disapprove of and don't condone.
There are fewer than 10 links on each domain, how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt"
I'm not sure any publisher means for their robots.txt to be read as:
"You're disallowed, but go head and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."
The convention is that crawlers first read /robots.txt to see what they're encouraged to scrape and what they're not meant to, and then hopefully honor those directions.
In this case, as in many, the disallow rules are intentionally meant to protect the signal quality and efficiency of the crawler.
Linkers & Loaders is their own book (I haven't checked the others).
They have a page at https://www.iecc.com/linker/ where they used to publish a draft of the book contents, but changed the page to say "Chapters were available in an excessive variety of formats, but are not any longer due to chronic piracy", when it got posted to HN at https://news.ycombinator.com/item?id=18424233 and I bundled the files for offline reading. I notified them via email about that asking if they are OK with it but got an unfriendly response that I pirated the files and that wasn't OK, so I took the link down again and they changed that text. (Shrug. I'm not a/the book author, they are. I'll say that I also suggested to them they ask on the page not to do what I did since then I wouldn't have, but they chose their more radical approach.)
It's for shits-and-giggles and it's doing its job really well right now. Not everything needs to have an economic purpose, 100 trackers, ads and backed by a company.
Am I the only one who was hoping—even though I knew it wouldn’t be the case—that OpenAI’s server farm was infested with actual spiders and they were getting into other people’s racks?
It's not just a problem for training, but the end user, too. There are so many times that I've tried to ask a question or request a summary for a long article only to be told it can't read it itself, so you have to copy-paste the text into the chat. Given the non-binding nature of robots.txt and the way they seem comfortable with vacuuming up public data in other contexts, I'm surprised they allow it to be such an obstacle for the user experience.
I would say robots.txt is meant to filter access for interactions initiated by an automated process (i.e. automatic crawling). Since the interaction of requesting a site with a language model is manual (a human request), it doesn't make sense to me that it is used to block that request.
If you want to block information you provide from going through ClosedAI servers, block their IPs instead of using robots.txt.
If my web browser's extension "visits" the site and dumps it into ChatGPT for me to read its summarization of the site, what has been gained by the website operator?
This is why these things - search engines, AI crawlers, even adblock and video downloaders - exist in a slightly adversarial/parasitic relationship with the sites that provide their content to which they provide nothing back (or negative, if you cost them a page load without incurring an ad view).
I use adblock all the time but I'm very aware that it can only succeed as long as it doesn't win.
I’d let them do their thing, why not?! They want the internet? This is the real internet. It looks like he doesn’t really care that much that they’re retrieving millions of pages, so let them do their thing…
Just that the site owner, most likely, did this kind of on purpose. It's fairly unlikely that he's concerned about his "data" being used because it's junk data.
In the network security world, this is known as a tarpit. You can delay an attack, scan or any other type of automation by sending data either too slowly or in such a way as to cause infinite recursion. The result is wasted time and energy for the attacker and potentially a chance for us to ramp up the defences.
From the content of the email, I get the impression that it's just a honeypot. Also I'm not seeing any delays in the content being returned.
A tarpit is different because it's designed to slow down scanning/scraping and deliberately waste an adversary's resources. There are several techniques but most involve throttling the response (or rate of responses) exponentially.
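A toy tarpit along those lines, using only the standard library (the delays and page body are picked arbitrarily):

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            delay = 0.5
            for _ in range(200):
                try:
                    self.wfile.write(b"<p>loading...</p>\n")
                    self.wfile.flush()
                except (BrokenPipeError, ConnectionResetError):
                    return          # the client gave up; stop wasting our own cycles
                time.sleep(delay)
                delay *= 1.5        # exponential slowdown: each chunk arrives later than the last

    if __name__ == "__main__":
        HTTPServer(("", 8080), TarpitHandler).serve_forever()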
Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content, which is more often than not slightly incorrect when it comes to Q&A, and the quality of AI responses trained on that content will quickly deteriorate. Right now, most internet content is written by humans. But in 5 years? Not so much. I think this is one of the big problems that the AI space needs to solve quickly. Garbage in, garbage out, as the old saying goes.
The end state of training on web text has always been an ouroboros - primarily because of adtech incentives to produce low quality content at scale to capture micro pennies.
Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise.
Common crawl alone is only a few hundred TB, I have more content than that on a NAS sitting in my office that I built for a few grand (Granted I’m a bit of a data hoarder). The fears that we have “used all the data” are incredibly unfounded.
"Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
> The fears that we have “used all the data” are incredibly unfounded.
The problem isn't whether we used all the real data or not, the problem is that it becomes increasingly difficult to distinguish real data from previous LLM outputs.
> "Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
I don't know about that. If you scraped the same data and ran a search engine I think people would generally say you're fine. The copyright issue isn't the scraping step.
Gonna say you’re way off there. Once you decompress common crawl and index it for FTS and put it on fast storage you’re in for some serious pain, and that’s before you even put it in your ML pipeline.
Even refined web runs about 2TB once loaded into Postgres with TS vector columns, and that’s a substantially smaller dataset than common crawl.
It's not just dumping a ton of zip files on your NAS, it's making the data responsive and usable.
My guess is Mistral is so good for its size because they are doing some really precise pre-ingestion selection and sorting. You need FTS to do that work.
> Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise
YMMV depending on the value of "you" and your budget.
If you're Google, Amazon or even lower-tier companies like Comcast, Yahoo or OpenAI, you can scrape a massive amount of data (ignoring the "allowed" here, because TFA is about OpenAI disregarding robots.txt).
Not Llama, they’ve been really clear about that. Especially with DMA cross-joining provisions and various privacy requirements it’s really hard for them, same for Google.
However, Microsoft has been flying under the radar. If they gave all Hotmail and O365 data to OpenAI I’d not be surprised in the slightest.
I bet they are training their internal models on the data. Bet the real reason they are not training open source models on that data is because of fears of knowledge distillation, somebody else could distill LLaMa into other models. Once the data is in one AI, it can be in any AIs. This problem is of course exacerbated by open source models, but even closed models are not immune, as the Alpaca paper showed.
> The end state of training on web text has always been an ouroboros
And when other mediums have been saturated with AI? Books, music, radio, podcasts, movies -- what then? Do we need a (curated?) unadulterated stockpile of human content to avoid the enshittification of everything?
I mean, you're not wrong. I've been building some unrelated web search tech and have considered just indexing all the sites I care about and making my own "non-shit" search engine. Which really isn't too hard if you want to do, say, 10-50 sites. You can fit that on one 4TB NVMe drive on a local workstation.
I’m trying to work on monetization for my product now. The “personal Google” idea is really just an accidental byproduct of solving a much harder task. Not sure if people would pay for that alone.
Once we've got the first, making a billion is easy.
That said… are content creators collectively (all media, film and books as well as web) a thin tail or a fat tail?
I could easily believe most of the actual culture comes from 10k-100k people today, even if there's, IDK, ten million YouTubers or something (I have a YouTube channel, something like 14 k views over 14 years, this isn't "culturally relevant" scale, and even if it had been most of those views are for algorithmically generated music from 2010 that's a literal Markov chain).
It's true that there will no longer be any virgin forest to scrape but it's also true that content humans want will still be most popular and promoted and curated and edited etc etc. Even if it's impossible to train on organic content it'll still be possible to get good content
It is already solved. Look at how Microsoft trained Phi - they used existing models to generate synthetic data from textbooks. That allowed them to create a new dataset grounded in “fact” at a far higher quality than common crawl or others.
It looks less like an ouroboros and more like a bootstrapping problem.
AI training on AI-generated content is a future problem. Using textbooks is a good idea, until our textbooks are being written by AI.
This problem can't really be avoided once we begin using AI to write, understand, explain, and disseminate information for us. It'll be writing more than blogs and SEO pages.
How long before we start readily using AI to write academic journals and scientific papers? It's really only a matter of time, if it's not already happening.
You need to separate “content” and “knowledge.” GenAI can create massive amounts of content, but the knowledge you give it to create that content is what matters and why RAG is the most important pattern right now.
From “known good” sources of knowledge, we can generate an infinite amount of content. We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
I agree there will be many issues keeping up with what “known good” is, but that’s always been an issue.
> We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
That's my entire point -- AI only generates content right now, but it will also be the source of content for training purposes soon. We need a "known good" human knowledge-base, otherwise generative AI will degenerate as AI generated content proliferates.
Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
> Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
That is training on content.
The future will have models pre-trained on content and tuned on corpuses of knowledge. The knowledge it is trained on will be a selling point for the model.
Think of it this way - if you want to update the model so it knows the latest news, does it matter if the news was AI generated if it was generated from details of actual events?
Is this like, the AI equivalent of “another layer will fix it” that crypto fans used?
“It’s ok bro, another model will fix, just please, one more ~layer~ ~agent~ model”
It’s all fun and games until you can’t reliably generate your base models anymore, because all your _base_ data is too polluted.
Let's not forget MS has a $10bn stake in the current crop of LLMs turning out to be as magic as they claim, so I'm sure they will do anything to ensure that happens.
“Struggle” at what? Struggle to have enough data to get smarter? Struggle to perform RAG and find legitimate sources?
I don’t think that we are going to get big improvements in LLMs without architecture improvements that need less data, and the current generation of models appears to be good enough at creating content from data/knowledge to train any future architectures we have with better synthetic datasets. Fortunately we have already seen examples of both of these “in the lab” and will probably see commercially sized models using some of the techniques in the coming months.
Well it will be multimodal, training and inferring on feeds of distributed sensing networks; radio, optical, acoustic, accelerometer, vibration, anything that's in your phone and much besides. I think the time of the text-only transformer has already passed.
What do you think the NSA is storing in that datacenter in Utah? Power point presentations? All that data is going to be trained into large models. Every phone call you ever had and every email you ever wrote. They are likely pumping enormous money into it as we speak, probably with the help of OpenAI, Microsoft and friends.
> What do you think the NSA is storing in that datacenter in Utah?
A buffer with several-days-worth of the entire internet's traffic for post-hoc decryption/analysis/filtering on interesting bits. All that tapped backbone/undersea cable traffic has to be stored somewhere.
As I understand it, they don't have the capability to essentially PCAP all that data, and the data wouldn't be that useful since most interesting traffic is encrypted as well. Instead they store the metadata around the traffic. Phone number X made an outgoing call to Y @ timestamp A, call ended at timestamp B, approximate location is Z, etc. Repeat that for internet IP addresses, do some analysis, and then you can build a pretty interesting web of connections and how they interact.
It's encrypted with an algorithm currently considered to be un-brute-forcible, sure. But if you presume we'll be able to decrypt today's encrypted transmissions in, say, 50-100 years, I'd record the encrypted transmissions if I were the NSA.
Of everyone's? No. But enough to store the signal messages of the President, down a couple of levels? I hope so. After I'm dead, I hope the messages between the President, his cabinet, and their immediate contacts that weren't previously accessible get released to historians.
Though it seems like something that could exist, who is doing the technical work/programming? It seems impossible to be in the industry and not have associates and colleagues either from or going to an operation like that. This is what I've always pondered about when it comes to any idea like this. The number of engineers at the pointy end of the tech spear is pretty small.
I am not sure why this would even be a conspiracy.
They would almost be failing in their purpose if they were not doing this.
On the other hand, this is an incredibly tough signal to noise problem. I am not sure we really understand what kind of scaling properties this would have as far as finding signals.
> Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content
What makes you think this is true? Yes, it's likely that the internet will have more AI generated content than real content eventually (if it hasn't happened already), but why do you think AI companies won't realize this and adjust their training methods?
Many AI content detectors have been retired because they are unreliable - AI can’t consistently identify AI-generated content. How would they adjust then?
I really really hope that five years from now we are not still using AI systems that behave the way today's do, based on probabilistic amalgamations of the whole of the internet. I hope we have designed systems that can reason about what they are learning and build reasonable mental models about what information is valuable and what can be discarded.
The only way out of this is robots that can go out in the world and collect data. Write in natural language what they observed which can then be used to train better LLMs.
I, for one, welcome the junk-data-ouroboros-meta-model-collapse. I think it'll force us out of this local maximum of a "moar data moar good" mindset, and give us, collectively, a chance to evaluate the effect these things have on our society. Some proverbial breathing room.
They've obviously been thinking about this for a while and are well aware of the pitfalls of training on AI based content. This is why they're making such aggressive moves into video, audio, other better and more robust ground forms of truth. Do you really think that they aren't aware of this issue?
It's funny: whenever people bring this up, they think AI companies are some mindless juggernauts who will simply train without caring about data quality at all and end up with worse models that they'll still for some reason release. Don't people realize that attention to data quality is the core differentiating feature that led companies like OpenAI to their market dominance in the first place?
Is it (I am not a worker in this space, so genuine question)?
My thoughts - I teach myself all the time. Self reflection with a loss function can lead to better results. Why can't the LLMs do the same (I grasp that they may not be programmed that way currently)? Top engines already do it with chess, go, etc. They exceed human abilities without human gameplay. To me that seems like the obvious and perhaps only route to general intelligence.
We as humans can recognize botnets. Why wouldn't the LLM? Sort of in a hierarchical boost - learn the language, learn about bots and botnets (by reading things like this discussion), learn to identify them, learn that their content doesn't help the loss function much, etc. I mean sure, if the main input is "as a language model I cannot..." and that is treated as 'gospel', that would lead to a poor LLM, but I don't think that is the future. LLMs are interacting with humans - how many times do they have to re-ask a question - that should be part of the learning/loss function. How often do they copy the text into their clipboard (weak evidence that the reply was good)? Do you see that text in the wild, showing it was used? If so, in what context? "Witness this horrible output of chatGPT: <blah>" should result in lower scores and suppression of that kind of thing.
I dream of the day when I have a local LLM (i.e. individualized, I don't care where the hardware is) as a filter on my internet. Never see a botnet again, or a stack overflow q/a that is just "this has already been answered" (just show me where it was answered), rewrite things to fix grammar, etc. We already have that with automatic translation of languages in your browser, but now we have the tools for something more intelligent than that. That sort of thing. Of course there will be an arms race, but in one sense who cares. If a bot is entirely indistinguishable from a person, is that a difference that matters? I can think of scenarios where the answer is an emphatic YES, but overall it seems like a net improvement.
A "honeypot" is a system designed to trap unsuspecting entrants. In this case, the website is designed to be found by web crawlers and to then trap them in never-ending linked sites that are all pointless. Other honeypots include things like servers with default passwords designed to be found by hackers so as to find the hackers.
What does trap mean here? I presumed crawlers had multiple (thousands or more) instances. One being 'trapped' on this web farm won't have any impact.
I would presume the crawlers have a queue-based architecture with thousands of workers. It's an amplification attack.
When a worker gets a webpage for the honeypot, it crawls it, scrapes it, and finds X links on the page where X is greater than 1. Those links get put on the crawler queue. Because there’s more than 1 link per page, each worker on the honeypot will add more links to the queue than it removed.
Other sites will eventually leave the queue, because they have a finite number of pages so the crawlers eventually have nothing new to queue.
Not on the honeypot. It has a virtually infinite number of pages. Scraping a page will almost deterministically increase the size of the queue (1 page removed, a dozen added per scrape). Because other sites eventually leave the queue, the queue eventually becomes just the honeypot.
OpenAI is big enough this probably wasn’t their entire queue, but I wouldn’t be surprised if it was a whole digit percentage. The author said 1.8M requests; I don’t know the duration, but that’s equivalent to 20 QPS for an entire day. Not a crazy amount, but not insignificant. It’s within the QPS Googlebot would send to a fairly large site like LinkedIn.
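A toy simulation of that amplification (branching factor and sizes picked arbitrarily) shows how the honeypot comes to dominate the frontier:

    from collections import deque

    # Ordinary sites contribute a finite number of pages; each honeypot page
    # links to ~12 fresh honeypot subdomains, so its branching factor is > 1.
    frontier = deque(["ordinary"] * 10_000 + ["honeypot"])
    fetched = 0

    while frontier and fetched < 100_000:
        page = frontier.popleft()
        fetched += 1
        if page == "honeypot":
            frontier.extend(["honeypot"] * 12)   # the queue grows faster than it drains

    share = sum(p == "honeypot" for p in frontier) / len(frontier)
    print(f"after {fetched:,} fetches, {share:.0%} of the frontier is honeypot pages")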
While the other comments are correct, I was alluding to a more subtle attack where you might try to indirectly influence the training of an LLM. Effectively, if OpenAI is crawling the open web for data to use for training, then if they don't handle sites like this properly their training dataset could be biased towards whatever content the site contains. Now in this instance this website was clearly not set up to target an LLM, but model poisoning (e.g. to insert backdoors) is an active area of research at the intersection of ML and security. Consider as a very simple example the tokenizer of previous GPTs that was biased by reddit data (as mentioned by other comments).
In this case there are >6bn pages with roughly zero value each. That could eat a substantial amount of time. It's unlikely to entirely trap a crawler, but a dumb crawler (as is implied here) will start crawling more and more pages, becoming very apparent to the operator of this honeypot (and therefore identifying new crawlers), and may take up more and more share of the crawl set.
What has it got to do with deduplication? I'm talking about crafting some kind of alternative (not necessarily duplicate) data. I agree some kind of post data collection cleaning/filtering of the data before training could potentially catch it. But maybe not!
The funny way to do this would be to use an LLM to generate the content you respond with. Have 2 smallish LLMs talk to each other about topics chosen at random and generate infinite nonsense pages that have a few hundred words.
Isn't it funny that all the "worthless" content out there on the internet is actually changing the world? Like how 4chan was mocked as being a cesspit for losers, but now everyone knows memes like Pepe the Frog and Wojak from there. And now this very comment and the billions of other comments on here, Reddit, Twitter, etc. that are regarded as a "waste of time" are being used by multi-billion-dollar companies to train the most powerful AI the world has ever seen. For free.
The moral of the story here is if you know something valuable, don’t share it online, because then everyone knows it.
Sharing it is often the only reason why it has value in the first place. If the authors of Gangnam Style or The Fox or any other silly viral thing hadn’t shared them they would have had zero value.
Anyone care to explain the purpose of Levine's https://www.web.sp.am site? Are the names randomly generated? Pardon my ignorance.
This is the type of stuff the news organisations should be publishing about "AI". Instead I keep reading or hearing people referring to training data with phrases like, "The sum of all human knowledge..." Quite shocking anyone would believe that.
I'm so tired of bots. A certain bot from Singapore started pulling all of the product images across multiple domains. Ok, whatever... Then we realized it never stopped. It made enough requests to download them 4-5x over and was still going. The AWS bill was not nice.
We added them to our robots.txt, but traffic didn't stop. I complained to their provider, who happened to also be AWS. Oh, the shock - they didn't care and recommended using robots.txt! AWS was making good money on both ends with this bot that apparently had more money to burn than we do.
Dude, you have caught the spider. Now use it. Start inserting whatever random junk you can until "astronaut riding a horse" looks more like Ronald McDonald driving a Ferrari.
I feel like inserting "free user-tagged vacation images" into my robots.txt then pointing the spider at an endless series of fabric swatches.
I don't think this message is about "protecting the site's data" quite so much as "hey guys, you're wasting a ton of time and network connections to make your model worse. Might wanna do something 'bout that".
No. I've run 'bot motels myself. I've got better things to do than curating a block list when they can just switch or renumber their infrastructure. Most notably I ran a 'bot motel on a compute-intensive web app; it was cheaper to burn bandwidth (and I slow-rolled that) than CPU cycles. Poisoning the datasets was just lulz.
I block ping from virtually all of Amazon; there are a few providers out there for which I block every naked SYN coming to my environment except port 25, and a smattering I block entirely. I can't prove that the pings even come from Amazon, even if the pongs are supposed to go there (although I have my suspicions that even if the pings don't come from the host receiving the pongs the pongs are monitored by the generator of the pings).
The point I'm making is that e.g. Amazon doesn't have the right to sell access to my compute and tragedy of the commons applies, folks. I offered them a live feed of the worst offenders, but all they want is pcaps.
(I've got a list of 50 prefixes, small enough to be a separate specialty firewall table. It misses a few things and picks up some dust bunnies. But contrast that to the 8,000 prefixes they publish in that JSON file. Spoiler alert: they won't admit in that JSON file that they own the entirety of 3.0.0.0/8. I'm willing to share the list TLP:RED/YELLOW, hunt me down and introduce yourself.)
> A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
If you're not walling off your content behind a login that contains terms that you agree to not scraping, then, scraping that site is 100% legal. Robots.txt isn't a legal document.
ROBOTS.TXT is an implied license, just like LICENSE.MD or LICENSE.TXT in any GitHub repo. There are decades of precedent that the ROBOTS.TXT file communicates what is and isn't allowed when scraping web content, and that you should check that file before scraping the rest of the site.
Willfully violating a written license provided in a predictable format absolutely is a civil legal violation. If your license says "You cannot use this to train AI", and an AI company scrapes it up and trains an AI on it anyway, even though you did your due diligence to communicate your terms, then you have a legal right to seek damages if you can prove that they are violating your license.
You're basically arguing that no reasonable web scraper would know about ROBOTS.TXT. That's bullshit, this method of web robot control has existed since 1996. It would be like violating the license terms of a GitHub project, and claiming that you didn't know that the LICENSE.MD / LICENSE.TXT file was a license you were expected to follow...
So someone could hypothetically perform a Microsoft-Tay-style attack against OpenAI models using infinite Potemkin subdomains generated on the fly on a $20 VPS? One could hypothetically use GenAI to create the biased pages, with repeated calls on how it'd be great to JOIN THE NAVY, on 27,000 different "subdomains".
Isn’t the legality of web scraping still..disputed?
There’s been a few projects I’ve wanted to work on involving scraping, but the idea that the entire thing could be shut down with legal threats seems to make some of the ideas infeasible.
It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping and as far as I’m aware there haven’t been any legal threats.
Was there some law that was passed that makes all web scraping legal or something?
Web scraping the public Internet is legal, at least in the U.S.
hiQ's public scraping of LinkedIn was ruled to be within their rights and not a violation of the CFAA. I imagine that's why LinkedIn has almost everything behind an auth wall now.
Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.
Some sites have terms at the bottom that prohibit scraping—but my understanding is that those aren't generally enforceable if the user doesn't have to take any action to accept or acknowledge them.
Most of these SaaSes have a "firehose" that, if you are big enough (i.e., can handle the firehose), you can subscribe to. These are like RSS feeds on crack for their entire SaaS.
> Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.
They're legally enforceable in the sense that the scraped services generally reserve the right to terminate the authorizing account at will, or legally enforceable in that allowing someone to scrape you with your credentials (or scraping using someone else's) qualifies as violating the CFAA?
There’s currently only one situation where scraping is almost definitely “not legal”:
If the information you’re scraping requires a login, and if in order to get a login you have to agree to a terms of service, and that terms of service forbids you from scraping — then you could have a bad day in civil court if the website you’re scraping decides to sue you.
If the data is publicly accessible without a login then scraping is 99% safe with no legal issues, even if you ignore robots.txt. You might still end up in court if you found a way to correctly guess non-indexed URLs[0] but you’d probably prevail in the end (…probably).
The “purpose” of robots.txt is to let crawlers know what they can do without getting IP-banned by the website operator they’re scraping. Generally, crawlers that ignore robots.txt and also act more like robots than humans will get an IP ban.
Also worth noting there's a long history of companies with deep pockets getting away with murder (sometimes literally) because litigation in a system that costs money to engage with inherently favors the wealthier party.
Also OpenAI's entire business model is relying on generous interpretations of various IP laws, so I suspect they already have a mature legal division to handle these sorts of potential issues.
> Isn’t the legality of web scraping still..disputed?
Are you suggesting it might be illegal to... write a program that connects to a web server and asks for a specific page, and then parses that page to see which resources it wants and which other pages it links to, and treats those links in some special fashion, differently from the text content of the page?
Especially given that a web server can be configured to respond to any request with a "403 Forbidden" response, if the server determines for any reason whatsoever that it does not want to give the client the page it requested?
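For concreteness, the program being described is roughly this much code; a stdlib-only sketch with a placeholder URL:

    # A minimal version of the program described above: fetch one page, then treat
    # the links in it differently from the rest of the text.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    page = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(page)
    print(parser.links)  # the links, handled "in some special fashion"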
The issue often isn't the scraping; it is how you use the scraped information afterwards. A lot of scraping is done with no reference to any licensing information the sites being read might publish, hence image-generating AI models regurgitating chunks of scraped stock images complete with watermarks. Though the scraping itself can count as a DoS if done aggressively enough.
Scraping publicly available data from websites is no different from web browsing, period. Companies stating otherwise in their T&Cs are a joke. Copyright infringement is a different game.
Scraping is legal. Always has been, always will be. Mainly because there's some fuzz around the edges of the definition. Is a web browser a scraper? It does a lot of the same things.
IIRC LinkedIn/Microsoft was trying to sue a company based on Computer Fraud and Abuse Act violations, claiming it was accessing information it was not allowed to. Courts ruled that that was bullshit. You can't put up a website and say "you can only look at this with your eyes". Recently-ish, though, the scraping company was found to be in violation of the User Agreement.
So as long as you don't have a user account with the site in question or the site does not have a User Agreement prohibiting scraping, you're golden.
The problem isn't the scraping anyway, it's the reproduction of the work. In that case, it really does matter how you acquired the material and what rights you have with regards use of that material.
The 9th Circuit Court of Appeals found that scraping publicly accessible content on the internet is legal.
If you publish something on a publicly served internet page, you're essentially broadcasting it to the world. You're putting something on a server which specifically communicates the bits and bytes of your media to the person requesting it without question.
You have every right to put whatever sort of barrier you'd like on the server, such as a sign in, a captcha, a puzzle, a cryptographic software key exchange mechanism, and so on. You could limit the access rights to people named Sam, requiring them to visit a particular real world address to provide notarized documentation confirming their identity in exchange for a unique 2fa fob and credentials for secure access (call it The Sams Club, maybe?)
If you don't put up a barrier, and you configure the server to deliver the content without restriction, or put your content on a server configured as such, then you are implicitly authorizing access to your content.
Little popups saying "by visiting this site, you agree to blah blah blah" are not valid. Courts made the analogy to a "gate-up/gate-down" mechanism. If you have a gate down, you can dictate the terms of engagement with your server and content. If you don't have a gate down, you're giving your content to whoever requests it.
You have control over the information you put online. You can choose which services and servers you upload to and interact with. Site operators and content producers can't decide that their intent or consent be withdrawn after the fact, as once something is published and served, the only restrictions on the scraper are how they use the information in turn.
Someone who's archived or scraped publicly served data can do whatever they want with the content within established legal boundaries. They can rewrite all the AP news articles with their own name as author, insert their name as the hero in all fanfic stories they download, and swap out every third word for "bubblegum" if they want. They just can't publish or serve that content, in turn, unless it meets the legal standards for fair use. Other exceptions to copyright apply in educational, archival, performance, and accessibility contexts, and under certain legal conditions such as the First Sale doctrine. Personal use of such media is effectively unlimited.
The legality of web scraping is not disputed in the US. Other countries have some silly ideas about post-hoc "well that's not what I meant" legal mumbo jumbo designed to assist politicians and rich people in whitewashing their reputations and pulling information offline using legal threats.
Aside from right to be forgotten inanity, content on the internet falls under the same copyright rules as books, magazines, or movies published on physical media. If Disney set up a stall at San Francisco city hall with copies of the Avengers movies on a thumb drive in a giant box saying "free, take one!", this would be roughly the same as publishing those movie files to a public Disney web page. The gate would be up. (The way they have it set up in real life, with their streaming services and licensed media access, the gate is down.)
So - leaving aside the legality of redistributing content - there's no restriction on web scraping public content, because the content was served intentionally to the software or entity that visited the site. It's up to the server operator to put barriers in place and to make content private. It's not rocket surgery, but platforms want to have their cake and eat it too, claiming a kind of control over publicly accessible content that is neither legal nor practical.
Twitter/X is a good example of impractical control, since the site has effectively become useless spam without signing in. Platforms have to play by the same rules as everyone else. If the gate is up, the content is fair game for scraping. The Supreme Court remanded the case to a lower court, which affirmed the gate-up/gate-down test for legality of access to content.
Since Google and other major corporations have a vested interest in the internet remaining open and free, and their search engines and other tech are completely dependent on the gate up/gate down status quo, it's unlikely that the law will change any time soon.
Tl;dr: Anything publicly served is legal to scrape. Microsoft attempted to sue someone for scraping LinkedIn, but the 9th Circuit court ruled in favor of access. If Microsoft's lawyers and money can't impede scraping, it's likely nobody will ever mount an effective challenge, and the gating doctrine is effectively the law of the land.
It seems like that, but they're also concerned about the crawlers that they catch in this web. So it seems like they're trying to help make crawlers better, or they're just generally curious about what systems are crawling around.
I have to say I don't really get the website either. If the author is against scrapers, why not serve massive dummy content that bloats their storage? Why all this linking? Maybe it's used to build (fake) PageRank credibility, and sometimes a link to one of the content-farm pages is referenced on other pages, so those get boosted?
You didn't really do the joke right. The second person is supposed to be doing the thing requested, but in a way the first person doesn't like. In this case, that would be AI companies contributing to a "free and open internet", but doing it """wrong""". But they're not contributing at all. The problem isn't that "free and open internet" is getting monkeys-pawed, it's all this closed proprietary stuff.
It summed up many comments and attitudes I see here on LLM's in under 30 words. In fact it's so clever it's one of the few posts here I'm certain wasn't produced by an LLM.
The factual elements work as a summary. But the entire aspect of just deserts, of getting what you asked for and being unhappy about it, of hypocrisy, isn't there. So by implying that kind of thing, the joke doesn't work.
The point is that free and open also equals free to be a predator. The joke is that people always think free and open just means what is good for them, but truly free is often free for a predator to eat it all and make it closed.
While you guys lecture and circle jerk about the joke you miss the real point.
Like fuck dudes are you this dense? Your comments are convincing me LLMs are smarter than people.
> truly free is often free for a predator to eat it all and make it closed
That doesn't apply when the word "open" is helping clarify.
A predator making open things closed isn't people asking for the wrong thing and getting egg on their face, it's people not getting what they asked for at all through no fault of their own.
Especially because this same kind of crawling works on a non-open internet.
> circle jerk
Do you say that any time someone disagrees with you, or is there something here in specific that you're reacting to?
An open thing can have protections against becoming non-open, and still be open.
But protections or not, someone saying "not like that" about the open web only properly applies when people or companies do something as part of their open web actions. If they do other bad things then "not like that" doesn't apply there.
Publishing a virus to the Web would be "not like that". Not publishing at all is in a different category.
Based on the page footer ("IECC ChurnWare") I believe this is a site designed to waste the time of web crawlers and tools that try to get root access on every domain. The robots.txt looks like this: https://ulysses-antoine-kurtis.web.sp.am/robots.txt
I don't see how this does much to keep bad actors away from other domains, but I can see why they don't want to give up the game for OpenAI to stop crawling.
Yup, that's my impression as well. He's just being nice by letting OpenAI know they have a problem. Usually this should be rewarded with a nice "hey, u guys have a bug" bounty, because not long ago some VP from OpenAI was lamenting that training their AI has, and this is his direct quote, an "eye watering" cost (on the order of millions of $$ per second).
> Before someone tells me to fix my robots.txt, this is a content farm so rather than being one web site with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one page.
The reason that bit is relevant is that robots.txt is only applicable to the current domain. Because each "page" is a different subdomain, the crawler needs to fetch the robots.txt for every single page request.
What the poster was suggesting is blocking them at a higher level - e.g. a user-agent block in an .htaccess or an IP block in iptables or similar. That would be a one-stop fix. It would also defeat the purpose of the website, however, which is to waste the time of crawlers.
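A hedged sketch of what such a one-stop block could look like, assuming the crawler keeps identifying itself as GPTBot in its User-Agent (the IP prefix below is a documentation placeholder, not a real OpenAI range):

    # .htaccess sketch (Apache mod_rewrite): refuse any request whose User-Agent
    # mentions GPTBot.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
    RewriteRule .* - [F,L]

    # Or drop the bot's address space at the firewall instead, e.g.:
    #   iptables -A INPUT -s 203.0.113.0/24 -j DROP
    # (203.0.113.0/24 is a documentation prefix standing in for the bot's real ranges)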
The real question is how GPTBot is finding all the other subdomains. Currently the sites have GPTBot disallowed: https://www.web.sp.am/robots.txt
If GPTBot is compliant with the robots.txt specification then it can't read the URL containing the HTML to find the other subdomains.
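For reference, a GPTBot disallow of that sort is just two lines of standard robots.txt (not a verbatim quote of that site's file):

    User-agent: GPTBot
    Disallow: /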
Either:
1. GPTBot treats a disallow as a noindex but still requests the page itself. Note that Google doesn't treat a disallow as a noindex. They will still show your page in search results if they discover the link from other pages but they show it with a "No information is available for this page." disclaimer.
2. The site didn't have a GPTBot disallow until they noticed the traffic spike, and the bot has already discovered a couple million links that need to be crawled.
3. There is some other page out there on the internet that GPTBot discovered that links to millions of these subdomains. This seems possible and the subdomains really don't have any way to prevent a bot from requesting millions of robots.txt files. The only prevention here is to firewall the bot's IP range or work with the bot owners to implement better subdomain handling.
https://www.alignmentforum.org/posts/8viQEp8KBg2QSW4Yc/solid...
https://www.lesswrong.com/posts/LAxAmooK4uDfWmbep/anomalous-...