Hacker News
Using GPT-4 to measure the passage of time in fiction (tedunderwood.com)
155 points by surprisetalk on June 23, 2023 | 44 comments



> The average length of time represented in 250 words of fiction had been getting steadily shorter since the early eighteenth century.

That is a fascinating detail in itself. My first thought is that this reflects life moving faster in general.

For example, travel by horse vs. plane is the difference between days (or weeks) and hours. If you're writing about this travel in the modern era, the trip can be summed up in a partial sentence like "After landing in (new city), ..." but you can't really gloss over a days-long trip by horse in the same way.

Basically, "event density" has gone up: more things can happen in a smaller length of time, and number of words is reflective of how many things are happening.


Is it not the opposite? Rather than moving faster, more words are devoted to shorter increments of time.

> The average length of time represented in 250 words of fiction had been getting steadily shorter since the early eighteenth century.

Time passed per word has been decreasing. A page used to mean a week; now it means a day. In your horse vs. plane example, this makes sense if they both gloss over it. E.g. "we rode 3 days to Rivendell" vs. "I landed in Tokyo after a long flight".


It's been interesting to read the acoup.blog post series deconstructing military logistics in fantasy stories. I don't know how much was just obsessive attention to detail and how much was having served himself, but Tolkien was extremely exacting about how long a trip would take and the realistic size of an army that could be sustained. Then you get to A Song of Ice and Fire, and George Martin has all but said he doesn't care and nobody should try to figure out a timeline or distances between locations, because they won't work if you scrutinize them closely.

I don't know if that generalizes to all writers, but it gives the impression that many don't really care about realism like they used to.

But yeah, it could be the mismatch between history and current travel methods. Writers of the past, even when they wrote about the past, would still largely have been limited to traveling at the same speeds as their distant ancestors, via sail and horse. Writers of today who write about the past have no such personal experience when they've only ever traveled by plane and automobile.


>but you can't really gloss over a days-long trip by horse in the same way.

Isn't your interpretation backwards?

The length of time is shorter now, meaning 250 words describes a shorter duration/less happening. They were glossing over more, historically.


> "event density" has gone up

An excellent portrayal of the dynamic and rapid nature of modern life.


It doesn't necessarily invalidate your point, but I think the main difference is that historically novels were much more verbose, in a style that would be considered prolix today. You can read basically any page of Dickens to illustrate the point! But maybe it's just that, because there was less going on, they could afford to dwell on the minute-to-minute.


Dickens's work was originally serialized, so there might have been a perverse incentive at work to stretch stories out.


> but it won’t produce anything new

These are some of the deeply embedded standard convictions we need to get rid of. We simply don't know.

The short answer is that of course LLMs can come up with something new: They can learn to rationalise – the simplest example is learning to add numbers.

So pedantically speaking, the quote is straight up wrong.


I took that more to mean that most LLMs won't spontaneously produce new insights given a general topic ("a paper about Middlemarch" being the example given). Their default mode is just to barf out somewhat patronizing summaries or answer questions with dry facts (and occasional factual-sounding nonsense).

It takes coaxing (and some cherry-picking) to get anything creative out of LLMs -- and some people suspect newer versions are so locked-down to avoid sounding "scary" that they actively avoid saying anything too interesting or philosophical.

The line between "creative output" and "hallucinated madness" is also somewhat blurry (for humans and LLMs alike...)


>It takes coaxing (and some cherry-picking) to get anything creative out of LLMs

I assume by LLMs you actually mean "ChatGPT and similar commercial LLM services". Raw LLMs like LLaMA and Falcon will happily spout out the craziest, sometimes super creative ideas; it's just the lobotomising of ChatGPT and the like that destroys their creativity.


Well, it's a tradeoff: RLHF also makes ChatGPT better at some tasks, like instruction following. "Lobotomising" makes it sound like there are only downsides.


Isn't the problem that the more creative it gets, the more likely it is to also be nonsensical or off topic?


I've gotten them to be creative in ways that were unexpected, like outputting tokens that have absolutely no (obvious) relation to the current conversation

While trying to iterate on a name for a product, I had it output an internal monologue alongside outputting name ideas.

With each iteration of naming I'd ask it to probe its understanding of human creativity and assess the usefulness/realism of the internal monologue.

After a few loops the internal monologue started modeling fleeting memories about things like "that time at camp where X happened".

Eventually it was modeling fleeting thoughts like "the warmth of the cup of coffee on the desk" as a source of inspiration, and random observations like "it looks like traffic is picking up outside, getting to yoga will take a while" in the middle of a name generation task

I made a website predicated on a similar idea after that: https://notionsmith.ai

Technically it's a hallucination generator, but I think you described why I made it (creativity and hallucinations really are similar for both LLMs and humans)


If you crank the temperature param up on GPT4 in the API you get madness very quickly, usually shortly after 1.4ish in my (albeit limited) testing.

Sadly it didn't seem to be a fun, usable, creative madness to me (a la a human on certain drugs), but YMMV.
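
For reference, it's just the temperature parameter on the chat completion call. A rough sketch with the openai Python client as it existed at the time (the model name and prompt here are placeholders, not anything from the article):

  # hedged sketch: adjust temperature on a GPT-4 chat completion (openai 0.x client)
  import openai

  resp = openai.ChatCompletion.create(
      model="gpt-4",
      messages=[{"role": "user", "content": "Write a short surreal poem."}],
      temperature=1.5,  # past ~1.4 the output tends to fall apart
      max_tokens=200,
  )
  print(resp["choices"][0]["message"]["content"])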


In the quote the author is speaking about students handing in an assignment and using ChatGPT to write it. He says:

> This may not count as plagiarism, but it won’t produce anything new.

First, it is plagiarism by definition: any time one presents work as their own that they did not create themselves, it is plagiarism.

Second, and vastly more importantly going forward, students were never handing in "new" ideas on an assignment anyway. Using an LLM to generate a "pastiche" that is then read by the student (hopefully) before being handed in is still a win in my book. Having students ask their questions of an LLM can be an amazing tool to further learning, but when students and educators are terrified of this new tool, they will hide their usage of it; then it becomes a tool for copying rather than learning, and plagiarism is the only consequence.


>First, it is plagiarism by definition.

There's a bit of a motte-and-bailey situation here - the student clearly plagiarized if they passed off ChatGPT's work as their own, but this has no bearing on the more interesting question of whether ChatGPT itself "plagiarizes" from its training set or creates new knowledge by synthesizing it.

EDIT: Fixed the spelling of Motte-and-bailey


motte* and bailey


>Any time one presents work as their own that they did not create themselves, it is plagiarism.

Not sure this is true. If I hand in an assignment written strictly by scattering chicken bones onto a grid representing the most common English words and writing down the result, is that plagiarism? It will definitely be nonsense, but I'm not copying anyone else's work.

ChatGPT is just a more sophisticated chicken bone grid.


> The short answer is that of course LLMs can come up with something new:

Well, yes. They produce random output that adheres to statistical models.

I had ChatGPT 3.5 produce, in Norwegian, a wonderful, whimsical portrait interview of a couple breeding Giant Tibetan Hamsters for home protection.

> They can learn to rationalise – the simplest example is learning to add numbers.

Certainly not? Composite systems that leverage LLMs can do a lot of things - but AFAIU LLMs will likely never rationalize or be able to "add numbers" in the normal sense; they can count only inasmuch as they know that 1,2,3,4 is more likely to come after "count to four" than 10,2,4.


> but AFAIU LLMs will likely never rationalize or be able to "add numbers" in the normal sense; they can count only inasmuch as they know that 1,2,3,4 is more likely to come after "count to four" than 10,2,4.

You might be interested in this experiment - https://thegradient.pub/othello/

The question is: does a model trained on move sequences in a game of Othello just learn "play X(t) after seeing the sequence X(t-3), X(t-2), X(t-1)", or does it build a representation of the board and use that to choose the next move to play?


> Certainly not? Composite systems that leverage LLMs can do a lot of things - but AFAIU LLMs will likely never rationalize or be able to "add numbers" in the normal sense;

This implies we know what “rationalising” actually is, which we don’t. There’s no reason to believe that our brains don’t operate in fundamentally the same way as LLMs. There’s no reason to think that the LLM approach to reasoning is any less valid than the “normal” way, whatever that even means.


>Well, yes. They produce random output that adheres to statistical models.

Not really. Nothing random about the output of LLMs. https://www.nature.com/articles/s41587-022-01618-2

>Certainly not? Composite systems that leverage LLMs can do a lot of things - but AFAIU LLMs will likely never rationalize or be able to "add numbers" in the normal sense; they can count only inasmuch as they know that 1,2,3,4 is more likely to come after "count to four" than 10,2,4.

LLMs can add numbers just fine. Arithmetic is one of the easiest domains to test on data they'd never have seen in training.


> Not really. Nothing random about the output of LLMs. https://www.nature.com/articles/s41587-022-01618-2

I can see what you are getting at, but the underlying implementation of LLMs really is reliant on non-deterministic, random sampling of the underlying model. The random sampling is weighted to favor certain selections over others, but rare selections are possible.

It's intrinsically built on a degree of randomness. Though it is also unfair to say something like "produce random output", as the probabilities themselves make it less than fully random.
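
To make that concrete, decoding usually amounts to temperature-scaled softmax sampling over the model's output logits. A toy sketch (not any particular model's actual decoding code):

  import numpy as np

  def sample_token(logits, temperature=1.0, rng=np.random.default_rng()):
      # scale logits by temperature, convert to probabilities, then sample an index
      scaled = np.asarray(logits, dtype=float) / temperature
      probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
      probs /= probs.sum()
      return rng.choice(len(probs), p=probs)

  # the likeliest token usually wins, but low-probability tokens can still be picked;
  # higher temperature flattens the distribution and makes rare picks more common
  logits = [5.0, 2.0, 0.5]
  picks = [sample_token(logits, temperature=0.7) for _ in range(1000)]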


In ML 101, I learned that NNs are general function approximators.

So given the right training data and algorithm, they should be able to learn to add numbers. Or any other computable function.

Also, we don’t know how GPT computes numbers really. I guess no one traced all the layers and understood the activations.

We also don’t do this with our children. We just assume that they understand it as soon as they don’t make many mistakes anymore.


I don't know why this mistake gets repeated, but NNs are universal continuous function approximators. They can't approximate non-continuous functions in general, and there are plenty of non-continuous computable functions.

Also, this theorem is almost useless in practice: it only tells you that, for any function and desired error rate, there exists some NN which would approximate that function within that error rate. But there is no proof that there is some way to train the weights for that NN based on any known training mechanism, even if we magically knew the shape of that NN (which we don't). Obviously, since we don't know if an algorithm to train the NN even exists, we have even less idea of how long it might take, or how large the training set would have to be.

So this theorem doesn't really help in any way answer whether our current AI techniques (of which training is a fundamental component) could be used to approximate human reason.


> They can't approximate non-continuous functions in general, and there are plenty of non-continuous computable functions.

For a given error, I guess. So it's a discussion about how well we can approximate such a function.


If we don't care about any bound on the error, any function can be said to approximate any other function.


It depends on the activation functions. Some may never be able to compute addition.


The whole article smells like bad science. Just looking at the first plot, you can immediately see something is off. LOESS CIs are a bit tricky, but here either the model is completely wrong or the CI calculation is. So the trend is definitely not as significant as they make it out to be. A realistic 95% CI would look like this: https://james-brennan.github.io/img/lowess1_22_1.png


The CI is for the mean, not for the entire data. This is the difference between a confidence and a prediction interval: https://stats.stackexchange.com/questions/225652/prediction-...

I do agree that it's fairly noisy though.


> learning to add numbers.

What does it mean to learn to add numbers? LLMs in general are not good at accurately portraying their own processes, and humans are equally bad at interpreting what LLMs do “under the hood”.

“To rationalize” means something like “to provide a reason (a rationale)” and I don’t think the current generation of models can do that.


You're right that LLMs can produce new knowledge, but you're misinterpreting the quote. The claim in the article is that LLMs won't produce anything new when asked for "a paper about Middlemarch". The rest of the article goes on to demonstrate how language models can create new knowledge.


I'm building LLM-based tools full-time now, and one of the techniques that works wonderfully is to check the LLM output with the LLM itself: ask it to figure out whether the output actually follows the guidelines you've set, and if it doesn't, send the response back with the error message (or messages, nobody limits you to just one check) — it's quite good at correcting mistakes!

What's awesome here is that you often have to use the most advanced LLM available for generation (GPT-4), but you can use much simpler models to categorise the answer!
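
In rough pseudo-Python the loop looks something like this (model choices, prompts, and names are illustrative, not our production code):

  import openai

  def ask(model, prompt):
      resp = openai.ChatCompletion.create(
          model=model, messages=[{"role": "user", "content": prompt}])
      return resp["choices"][0]["message"]["content"]

  def generate_with_checks(task, guidelines, max_retries=3):
      answer = ask("gpt-4", task)                # the expensive model generates
      for _ in range(max_retries):
          verdict = ask("gpt-3.5-turbo",         # a cheaper model checks the output
                        f"Guidelines:\n{guidelines}\n\nAnswer:\n{answer}\n\n"
                        "Does the answer follow the guidelines? "
                        "Reply OK, or describe the problem.")
          if verdict.strip().upper().startswith("OK"):
              return answer
          # feed the error message back and ask for a corrected answer
          answer = ask("gpt-4",
                       f"{task}\n\nYour previous answer:\n{answer}\n\n"
                       f"Problem found by review: {verdict}\n\nPlease fix it.")
      return answer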


I highly recommend the interview with the author on The Gradient podcast. A wonderful conversation on language, culture, and LLMs: https://thegradientpub.substack.com/p/ted-underwood-machine-...


  > But when I sat down with two graduate students (Sabrina Lee and Jessica 
  Mercado) to manually characterize a thousand passages from fiction 
  ...
  >  It took the three of us several months to generate this data
  ...
Dear lord, this sort of grad school research would be my idea of hell.


An innovative case of using LLMs.

> The total cost to my OpenAI account was $22. Of that amount, about $8 was testing the prompts and running on the (cheaper) Turbo API. The final run on GPT-4 cost $14; if I had completely replicated our work from 2017 on GPT-4 it would have cost $140.

Seems like coming up with a good prompt is itself quite costly, at around a third of the total cost. I wonder if there are any ways to iterate on prompts more cheaply?


$8 is cheap. The relatively high cost is only because the final analysis wasn't run on more passages.


I would call $8 cheap even if it means 1/3rd of the total cost.


The first paragraph leaves such a bad impression for an interesting text and research question.

In some sense, the author contradicts himself by using the model to do something non-trivial that parrots certainly can't do.


At best, it gives you a plausible range.

Anything involving numbers, or at least math, is red-flag territory for LLMs.

I wonder how it would do with medical lab data. I imagine it needs to answer less-than or greater-than questions, which is math.


From the author's table here:

https://tedunderwood.files.wordpress.com/2023/03/screen-shot...

I wonder how sbert would do here given that it's a middle ground between words and full context.


Or, a quick-and-dirty bag-of-words approach: https://github.com/nsrivast/whenwhere


I would say one of the main points of the article is that it's cheap and easy to outperform BoW using ChatGPT.


Actually, it seems like it’s just regex:

regex = r"\d{4}"
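
i.e. it just pulls four-digit numbers (years) out of each passage. A toy illustration (the sample passage is made up, not from the repo):

  import re

  passage = "It was the spring of 1847 when we finally left for the coast."
  years = re.findall(r"\d{4}", passage)
  print(years)  # ['1847']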



