Hacker News | JackC's comments

(I read the post but not the paper.)

Did you measure subjective fatigue as one way to explain the misperception that AI was faster? As a developer-turned-manager I like AI because it's easier when my brain is tired.


We attempted to! We explore this more in the section Trading speed for ease (C.2.5) in the paper (https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf).

TLDR: mixed quantitative and qualitative evidence that AI makes the work less effortful for developers. The effect is unclear.


> Signal already offers reproducible builds

After a short google, I think it does not have reproducible builds for Mac, Windows, or iOS. It does for Linux and Android, though there's a long Android bug thread that sounds like the reproduction test script is typically broken.


The difference in fields is key here: AI models are going to have a very different impact in fields where ground truth is available instantly (does the generated code have the expected output?) than in fields where verification takes years of manual work.

(Not a binary -- ground truth is available enough for AI to be useful to lots of programmers.)


> does the generated code have the expected output?

That's often not easy to verify at all...


you can easily verify a lot like:

- correct syntax

- passes lints

- type checking passes

- fast test suite passes

- full test suite passes

and every time it doesn't you feed it back into the LLM, automatically, in a loop, without your involvement.
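To make that loop concrete, here's a minimal sketch; `fix_until_green`, `syntax_ok`, and the `fix` callable are all hypothetical names, and `fix` is a stand-in for whatever LLM API call you'd actually make:

```python
def fix_until_green(code, checks, fix, max_rounds=5):
    """Run every check over `code`; on the first failure, feed the error
    back to `fix` (the stand-in for an LLM call) and retry."""
    for _ in range(max_rounds):
        error = None
        for check in checks:
            error = check(code)
            if error is not None:
                break
        if error is None:
            return code          # all checks green
        code = fix(code, error)  # "feed it back into the LLM"
    raise RuntimeError("still failing after max_rounds")

def syntax_ok(code):
    """One example check from the list above: does the code even compile?"""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as e:
        return f"syntax error: {e}"
```

The same shape works for lints, type checks, and test suites: each is just another callable that returns an error string or None.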

The results are often -- sadly -- too good to not slowly start using AI.

I say sadly because IMHO the IT industry has gone somewhere very wrong by growing too fast, moving too fast, and getting so much money that the companies spearheading it could just throw more people at problems instead of fixing underlying issues. There is also a huge divergence between the science of development, programming, application composition, etc. (not to be confused with the science of, say, data structures and fundamental algorithms) and what the industry actually uses and how it advances.

Now, I think normally the industry would auto-correct at some point, but I fear that with LLMs we might get even further away from any fundamental improvements, as we find even more ways to keep muddling along and continue the mess we have.

LLM coding performance is highly dependent on how well very similar languages are represented in the training dataset, so new languages with breakthrough or otherwise huge improvements will work less well with LLMs. If that trend continues, it would lock us into very mid solutions long term.


> It only seats two yet has a bed big enough to hold a sheet of plywood.

Not really the point of the article, but, does it? This[0] says the bed is 60 inches long and 43 wide, and plywood is 96x48 inches. Is it like, any vehicle fits plywood if you cut it to the size of the truck or stack it on top?

[0] https://www.thedrive.com/news/the-slate-truck-is-two-feet-sh...


Tailgate down, plywood lying with one edge on the bed and the other on the side?

But I agree, I would expect it to be able to fully contain a standard sheet of plywood if it made that claim.


Perhaps... but then try the same thing with drywall.


That appears to be the bed width between the wheel wells. I assume a sheet would fit widthwise on top of the wheel wells, which is still in the bed. As for the length, not even most full-size trucks are long enough to fit a whole sheet. I guess the main point is that you wouldn't have any trouble getting a sheet of plywood home.


In my old Ranger there were a couple of spots in the bed where you could put a couple of 2x8 beams across it and have a place to stack 4x8 sheets. You did have to lower the tailgate, but they didn't stick out past the end of the lowered tailgate, so there were no special requirements (flags, etc.) for hauling them. It was very convenient. I would hope this truck has a similar feature, since it's almost free to add and greatly increases the utility.


8 foot beds do exist. They're very rare nowadays with nearly every truck being a super-extra-mega cab 4 door.


You generally have to find someone willing to sell you a fleet vehicle if you want a full 8 foot bed. Modern trucks are more like minivans with a vestigial bed sticking out of the back.


I have an old fleet truck: four full doors and an eight-foot bed. I love it, it's getting quite old, and I have no idea how I'm ever going to replace it.


Yeah, it's interesting that their FAQ [1] just says it can fit "full size sheets of plywood" and their specs page [2] also does not list the actual dimensions, only the volume. A 60"x43" bed would technically fit a 96"x48" sheet, but you would have to lean one edge against the side of the bed.

That said, the article you linked appears to list the bed width at the wheel wells. They say the Maverick's bed is 42.6" wide, but above the wheel wells it's 53" wide or so. You can find plenty of pictures of people hauling plywood with one. I suspect the Slate is similar.

[1]: https://www.slate.auto/en/faq

[2]: https://www.slate.auto/en/specs
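A quick sanity check on the "lean one edge against the side" idea, assuming the 43" figure really is the between-the-wheel-wells width and using a standard 48"x96" sheet:

```python
import math

sheet_w, sheet_l = 48, 96   # standard plywood sheet, inches
bed_w, bed_l = 43, 60       # Slate bed per the linked article, inches

# Leaned across the bed: one long edge on the floor, the other resting
# against the opposite bed wall. The raised edge sits this high:
lean_height = math.sqrt(sheet_w**2 - bed_w**2)

# Lengthwise, the sheet hangs this far past the bed with the tailgate down:
overhang = sheet_l - bed_l

print(f"raised edge ~{lean_height:.1f} in up, {overhang} in past the bed")
```

So the sheet only needs to lean about 21" up the wall, which is why "fits full-size sheets" claims tend to mean "fits at an angle" rather than "lies flat".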


> People could have invented crypto versions of real-life things like insurances and mortgages.

Crypto makes large technical sacrifices for the sake of being harder to regulate. I don't think that's a desirable quality for insurance and mortgages.


> I work for an org with close ties to arXiv, and just like us they are getting a lot more demand due to AI crawling

Funny, I also work on academic sites (much smaller than arXiv) and we're looking at moving from AWS to bare metal for the same reason. The $90/TB AWS bandwidth exit tariff can be a budget killer if people write custom scripts to download all your stuff; better to slow down than 10x the monthly budget.

(I never thought about it this way, but Amazon charges less to same-day deliver a 1TB SSD drive for you to keep than it does to download a TB from AWS.)
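Back-of-envelope version of that comparison; the egress rate matches the ~$90/TB figure above, and the SSD price is my assumption for a budget 1 TB drive:

```python
# AWS internet egress is roughly $0.09/GB at the standard on-demand tier,
# which is where the ~$90/TB figure comes from.
egress_per_gb = 0.09          # USD, approximate published rate
egress_cost = egress_per_gb * 1024   # cost to send 1 TB out of AWS

ssd_price = 60.0              # assumed retail price of a budget 1 TB SSD, USD

print(f"1 TB egress: ${egress_cost:.0f} vs 1 TB SSD delivered: ${ssd_price:.0f}")
```

A single bulk downloader pulling your whole corpus a few times over is all it takes for this line item to dominate the bill.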


I don't understand why you don't use Cloudflare. Don't they have an unlimited egress policy with R2?

It's way more predictable, in my opinion: you only pay a fixed amount per month for your storage. It also helps that it's on the edge, so users would get it way faster than, say, going to bare metal (unless you are provisioning a multi-server approach, and if you're using Kubernetes there it might be a mess to handle, I guess?).


Regardless, if you are delivering PDFs, you should be using a CDN.

If crawling is a problem: (1) it is pretty easy to rate-limit crawlers, (2) point them at a requester-pays bucket, and (3) offer a torrent with anti-leech.


Could have something to do with Cloudflare’s abhorrent sales practices.


Can you tell me more? I think my business needs some abhorrent sales practices. That's how it's done, right?



I suspect that is the result of this:

https://www.reddit.com/r/sales/comments/134u0mq/cloudflare_c...

They got rid of all of the “underperforming” sales people and hired new ones. That nightmare is the result. I suspect the higher the sales performance, the more likely they were doing things like this.


Wow, okay. That's a little too extreme. How is Cloudflare acting so insecure when it's so large? Hmm, confused.


The two are not comparable. The 1TB of transit at Amazon can be subdivided over many recipients, while the solid-state drive is empty and can only be sent to one.

That said, I agree that transit costs are too high.


So order multiple drives, transfer the data to them, and drop them in the mail to the client. That should always be the higher bandwidth option, but in a sane world it would also be less cost effective given the differences in amount of energy and sorts of infrastructure involved.

The reason to switch away from fiber should be sustained aggregate throughput, not transfer cost.


The other guy was also comparing them based on transfer cost. Given that 1TB can be divided across billions of locations, shipping physical drives is not a feasible alternative to transit at Amazon in general.


I'm not trying to claim that it's generally equivalent or a viable alternative or whatever to fiber. That would be a ridiculous claim to make.

The original example cited people writing custom scripts to download all your stuff blowing your budget. A reasonable equivalent to that is shipping the interested party a storage device.

More generally, despite the two things being different their comparison can nonetheless be informative. In this case we can consider the up front cost of the supporting infrastructure in addition to the energy required to use that infrastructure in a given instance. The result appears to illustrate just how absurd the current pricing model is. Bandwidth limits notwithstanding, there is no way that the OPEX of the postal service should be lower than the OPEX of a fiber network. It just doesn't make sense.


That is true. I was imagining the AWS egress costs at my own work, where things go to so many places with latency requirements that sending hard drives is simply not feasible, even with infinite money and even pretending the drives were prewritten at the factory; delivery would never be fast enough. Infinite money is not feasible either, but it shows that this fails in general in more than just the cost dimension.


This might be what you mean, but for anyone reading -- the point of Simon's article is the whole agent and all of its tools have to be considered part of the same sandbox, and the same security boundary. You can't sandbox MCPs individually, you have to sandbox the whole system together.

Specifically, the core design principle is that you have to be comfortable with any possible combination of things your agent can do with its tools, not only the combination you ask for.

If your agent can search the web and can access your WhatsApp account, then you can ask it to search for something and text you the results -- cool. But there's some possible search result that would take over its brain and make it post your WhatsApp history to the web. So probably you should not set up an agent that has MCPs to both search the web and read your WhatsApp history. And in general many plausibly useful combinations of tools to provide to agents are unsafe together.


I'll add "reduce code size and complexity" to the list of benefits. A python library to calculate a simhash, or track changes on a django model, or auto generate test fixtures, will often be 90% configuration cruft for other usecases, and 10% the code your app actually cares about. Reading the library and extracting and finetuning the core logic makes you responsible for the bugs in the 10%, but no longer affected by bugs in the 90%.


Hard agree. A library should not inflict complex use cases' complexity on simple use cases, but sometimes they do, either because they're poorly designed or because they're overkill for your use case. But often I see pain and complexity excused with "this is the library that everybody else uses."

Sometimes a simple bespoke solution minimizes costs compared to the complexity of using a massive hairball with a ton of power that you don't need.

One big caveat to this: there's a tendency to underestimate the cost and complexity of a solution that you, personally, developed. If new developers coming onto the project disagree, they're probably right.


The big caveat is a big one. Choose your battles wisely!

There are plenty of things that look simpler than an established library at first glance (I/O of specialized formats comes to mind quickly). However, a lot of the complexity of that established library can wind up being edge cases that you actually _do_ care about, you just don't realize it yet.

It's easy to wind up blind to maintenance burden of "just a quick add to the in-house version" repeated over and over again until you wind up with something that has all of the complexities of the widely used library you were trying to avoid.

With that said, I still agree that it's good to write things from scratch and avoid complex dependencies where possible! I just think choosing the right cases to do so can be a bit of an art. It's a good one to hone.


> I/O of specialized formats comes to mind quickly

The classic "I'll write my own csv parser - how hard can it be?"


> The classic "I'll write my own csv parser - how hard can it be?"

I did as part of my work. It was easy.

To be very clear: the CSV files that are used are outputs from another tool, so they are much more "well-behaved" and "well-defined" (e.g. no escaping in particular for newlines; well-known separators; well-known encoding; ...) than many CSV files that you find on the internet.

On the other hand, some columns need a little bit of "special" handling (you could also do this as a post-processing step, but it is faster to be able to attach a handler to a column to do this handling directly during the parsing).

Under these circumstances (very well-behaved CSV files, but also wanting the ability to do some processing as part of the reading), any existing CSV parsing library would likely either be a sledgehammer to crack a nut, or would have to be modified to suit the requirements.

So, writing a (very simple) own CSV reader implementation was the right choice.
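Something like this covers the well-behaved case; it's my sketch, not the commenter's code, and it leans on the stdlib for the splitting (they wrote their own), since the interesting part is the per-column handler hook:

```python
import csv
import io

def read_csv(text, handlers=None):
    """Reader for well-behaved CSV (known delimiter and encoding, no
    embedded newlines), with per-column handlers applied during parsing.

    `handlers` maps column name -> callable run on each value as it is read.
    """
    handlers = handlers or {}
    rows = []
    # Data starts on line 2, after the header row.
    for lineno, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        for col, fn in handlers.items():
            try:
                row[col] = fn(row[col])
            except ValueError as e:
                # Fail with a message that points at the bad spot,
                # rather than returning mangled data.
                raise ValueError(f"line {lineno}, column {col!r}: {e}") from None
        rows.append(row)
    return rows

rows = read_csv("id,price\n1,9.5\n2,10\n", handlers={"id": int, "price": float})
```

The handler hook is the whole reason to own this code: a generic library would make you post-process in a second pass.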


> very well-behaved CSV files

You were incredibly lucky. I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.


> I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.

To be fair: problematic CSV files do occur. But for the functionality that the program provides, it suffices if, in such a situation, an error message is shown that helps the user track down where the problem in the CSV file is. Or, if the reading does not fail, the user can see in the visualization of the read data where the error in the CSV file was.

In other words: what is not expected is that the program gracefully has to

- automatically find out the "intended behaviour" (column separators, encoding, escaping, ...) of the CSV parsing,

- automatically correct incorrect input files.


CSV is _way_ hairier than folks think it is!!

And for anyone who's not convinced by CSV, consider parsing XML with a regex. "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

I've said it many times myself and been eventually burned by it each time. I'm not saying it's always wrong, but stop and think whether or not you can _really_ trust that "little piece of data" not to grow...
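A tiny illustration of how the regex shortcut goes wrong even on a "little piece of data" (the document here is made up):

```python
import re
import xml.etree.ElementTree as ET

doc = """<config>
  <!-- old value: <timeout>5</timeout> -->
  <timeout>30</timeout>
</config>"""

# The "lightweight" regex happily matches inside the comment first.
regex_hit = re.search(r"<timeout>(\d+)</timeout>", doc).group(1)

# A real parser skips comments and sees only the live element.
parsed = ET.fromstring(doc).findtext("timeout")

print(regex_hit, parsed)  # the two disagree
```

Comments, CDATA sections, attributes containing angle brackets, namespaces: each one is a fresh way for the regex to silently return the wrong value.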


> "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

relevant:

> ruby-saml was using two different XML parsers during the code path of signature verification. Namely, REXML and Nokogiri

where "REXML" does exactly what you described, and hilarity ensued

Sign in as anyone: Bypassing SAML SSO authentication with parser differentials - https://news.ycombinator.com/item?id=43374519 - March 2025 (126 comments)


A plural of regex is regrets...


What are some footguns? It does seem easy


It's easy if the fields are all numbers and you have a good handle on whether any of them will be negative, in scientific notation, etc.

Once strings are in play, it quickly gets very hairy though, with quoting and escaping that's all over the place.

Badly formed, damaged, or truncated files are another caution area— are you allowed to bail, or required to? Is it up to your parser to flag when something looks hinky so a human can check it out? Or to make a judgment call about how hinky is hinky enough that the whole process needs to abort?


Even with numbers, some locales use a comma `,` as the decimal seperator, and some use the dot `.` so that can cause headaches out of the box.


Beyond the basic implementation of quoting and escaping, those are things you also have to worry about if you use someone else's csv parser.

And if you implement your own, you get to choose the answers you want.


What do you mean "allowed to bail"?

Regardless of the format, if you're parsing something and encounter an error, there are very few circumstances where the correct action is to return mangled data.


Maybe? If the dataset is large and the stakes are low, maybe you just drop the affected records, or mark them as incomplete somehow. Or generate a failures spool on the side for manual review after the fact. Certainly in a lot of research settings it could be enough to just call out that 3% of your input records had to be excluded due to data validation issues, and then move on with whatever the analysis is.

It's not usually realistic to force your data source into compliance, nor is manually fixing it in between typically a worthwhile pursuit either.


multiline values, comma vs semicolon, value delimiter escaping


At my current workplace the word "bespoke" is used to mean anything that is "business logic", and everyone is very much discouraged from working on such things. On the other hand, we've got a fantastic set of homemade tooling and libraries, all impressive software engineering, almost as good as the off-the-shelf alternatives.


Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers: https://storage.courtlistener.com/recap/gov.uscourts.ded.721...

The core story seems to be: Westlaw writes and owns headnotes that help lawyers find legal cases about a particular topic. Ross paid people to translate those headnotes into new text, trained an AI on the translations, and used those to make a model that helps lawyers find legal cases about a particular topic. In that specific instance the court says this plan isn't fair use. If it was fair use, one could presumably just pay people to translate headnotes directly and make a Westlaw competitor, since translating headnotes is cheaper than writing new ones. And conversely if it isn't fair use where's the harm (the court notes no copyright violation was necessary for interoperability for example) -- one can still pay people to write fresh headnotes from caselaw and create the same training set.

The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.

You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.


This is an interesting opinion, but there are aspects of it that I doubt will stand the test of time.

One aspect is the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark”. It really isn’t workable, in law specifically, for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!

The key fact underlying all of this, I think, is that when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis. Source text was paraphrased using curiously similar language to West’s paraphrasing. That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.

The case has very little to say about the more commonly posed question of whether copyright is infringed in large-scale language modeling.


> That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.

The "competing product" thing is probably the most extreme part of this opinion.

The most important fair use factor is if the use competes with the original work, but this is generally implied to be directly competes, i.e. if you translate someone else's book from English to French and want to sell the translation, the translation is going to be in direct competition for sales to people who speak both English and French. The customer is going to use the copy claiming fair use as a direct substitute for the original work, instead of buying it.

This court is trying to extend that to anything downstream from it, which seems crazy. For example, "multiple copies for classroom use" is one of the explicit examples of fair use from the copyright statute, but schools are obviously teaching people intending to go into competition with the original author, and in general the idea that you can't read something if you ever intend to write something to sell in competition with it seems absurd and in contradiction to the common practices in reverse engineering.

But this is also a district court opinion that isn't even binding on other courts, so we'll see what happens if it gets appealed.


No that is not an extreme interpretation of the fair use factors. This is a routinely emphasized factor in fair use analyses for both copyright and trademark. School fair use is different because that defense is written into the statute directly in 17 U.S.C. § 107. Also, § 108 provides extensive protections for libraries and archives that go beyond fair use doctrines.

The idea that the schools are encouraging the students to compete with the original authors of works taught in the classroom is fanciful by the meaning that courts usually apply to competition. Your example is different from this case in which Ross wanted to compete in the same market against West offering a similar service at a lower price. Another reason that the schools get a carveout is because it would make most education impractical without each school obtaining special licenses for public performance for every work referenced in the classroom.

But maybe that also provokes the question of whether schools really deserve that kind of sweetheart treatment (a massive indirect subsidy), or whether it over-privileges formal schools relative to the commons at large.


> School fair use is different because that defense is written into the statute directly

It's written into the statute as an example of something that would be fair use.

> The idea that the schools are encouraging the students to compete with the original authors of works taught in the classroom is fanciful by the meaning that courts usually apply to competition.

People go to art school primarily because they want to create art. People study computer science primarily because they want to write code. It's their direct intention and purpose to compete with existing works.

> Your example is different from this case in which Ross wanted to compete in the same market against West offering a similar service at a lower price.

So if you use Windows and then want to create Linux...

> Another reason that the schools get a carveout is because it would make most education impractical without each school obtaining special licenses for public performance for every work referenced in the classroom.

How is that logic any different than for AI training?

> But maybe that also provokes the question as to if schools really deserve that kind of sweetheart treatment (a massive indirect subsidy), or does it over-privileges formal schools relative to the commons at large?

It not only doesn't have any explicit requirement for a formal school (it just says "teaching"), it also isn't limited to teaching, teaching is just one of the things specified in the statute as being the kind of thing Congress intended fair use to include.


>It's written into the statute as an example of something that would be fair use.

Statutory text controls what the courts can do, even and perhaps especially when it includes an example.

>People go to art school primarily because they want to create art. People study computer science primarily because they want to write code. It's their direct intention and purpose to compete with existing works.

Interesting perspective.

>So if you use Windows and then want to create Linux...

I don't understand your meaning.

>How is that logic any different than for AI training?

That is what Mark Lemley, law professor at Stanford, has argued in his many law review articles and amicus briefs: he believes that training is analogous to learning. The court here didn't agree with the Lemley view.

>It not only doesn't have any explicit requirement for a formal school (it just says "teaching"), it also isn't limited to teaching, teaching is just one of the things specified in the statute as being the kind of thing Congress intended fair use to include.

In practice courts tend to limit these exceptions to formal teaching arrangements.


Copyright covers expression, not ideas. The underlying problem here is that Ross Intelligence never went to the trouble of distilling the purely idea-based and factual element from their original sources; even their finalized search system still had a pervasive reliance on Westlaw's original and creative expression as embedded in their headnotes. Using Windows and then creating Linux is something entirely different because Linux goes to great effort in order not to use anything that's specific to Windows. Large-scale language models are probably somewhere in the middle, because their unique reliance on an incredibly wide variety of published texts makes it very unlikely that they'll ever preserve anything of substance about the expression in any single text.


What a world we’re in where a school using text to teach children, who will remember it, talk about it with others, likely buy it for their own children… can be framed as a “massive indirect subsidy” rather than “free advertising”.


This reflects on the individuals choosing to create and proliferate such misleading or hyperbolic framing more than it does on the world that we all live in. In meatspace we usually reject these ideas and ignore the people pushing them.


The case looks pretty straightforward to me - they copied the notes ( human or machine doesn't really matter ) to directly compete with the author of the notes.

If you wrote a program that automatically rephrased an original text - something like the Encyclopaedia Britannica - to preserve the meaning but not have identical phrasing - and then sold access to that information on in a way that undercut the original - then in my view that's clearly ripping off the original creators of the Encyclopedia and would likely stop people writing new versions of the encyclopedia in the future if such activity was allowed.

These laws are there to make sure that valuable activities continue to happen and are not stopped because of theft. We need textbooks, we need journalistic articles - to get these requires people to be paid to work on them.

I think it's entirely reasonable to say that an LLM is such a program - and if used on sources which are sustained by having paid people work on them, and then the reformatted content is sold on in a way to under cut the original activity then that's a theft that's clearly damaging society.

I see LLM's as simply a different way to access the underlying content - the rules of the underlying content should still apply - ChatGPTs revenues are predicted to be in the billions this year - sending some of that to content creators, so that content continues to be produced, is not just right - it's in their interest.


> automatically rephrased an original text - something like the Encyclopaedia Britannica - to preserve the meaning but not have identical phrasing

Note that it's very hard to do this starting from a single source, because in order to be safe from any copyright concern you'd have to only preserve the bare "idea" and everything else in your text must be independent. But LLM's seem to be able to get around this by looking at many sources that are all talking about the same facts and ideas in very different ways, and then successfully generalizing "out of sample" to a different expression of the same ideas.


The concept clustering across multiple sources allows you to rephrase more accurately while retaining meaning - however the point I'm making is if you then point that program at Encyclopaedia Britannica and simply rephrase it then charge for access to the rephrased version - should you be allowed to do that?


The underlying problem is that "meaning" in the ordinary sense still includes plenty of copyrightable elements. If you point a typical LLM program at some arbitrary text and tell it to "rephrase" that, you'll generally end up with a very close paraphrase that still leaves intact, to a huge extent, the "structure, sequence and organization" (in a loose sense) of the original. So it turns out that you're still in breach of copyright. All you're allowed to use when starting from a single copyrighted text is the ideas and facts in their very barest sense.


So if I made a pop song which was entirely copied from existing songs, but ensured that each fragment was relatively short (but long enough to be recognisable), then I'd be ok?

ie the way to avoid copyright is to double down on the copying?

I can see how, for a human, you could argue that there is creativity in splicing those bits together into a good whole - however if that process is automated - is it still creative - or just automated theft?


I think that someone taking Biology 101 and ending up writing textbooks, as opposed to all the other people who just forgot what they learned once the elective was over, or ended up as working biologists with labs, or as teachers of biology and so forth, is quite different from someone saying "hey, I want to make a competing product to this successful company, let's take their content, rewrite it, and use AI to make a competitor", and then actually going into direct competition with that company a couple of years later.


" court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim"

That is the opposite of the ruling. The judge said the ones that summarize and pick out the important parts are copyrightable and specifically excludes the headnotes that quote court opinion verbatim.

The judge:

"But I am still not granting summary judgment on any headnotes that are verbatim copies of the case opinion (for reasons that I explain below)"


You're right as far as the MSJ is concerned, and I should've been more precise. I was focusing on the dictum in the preceding paragraph (because we're discussing the broader implications of the order rather than the nuts-and-bolts of the instant motion). In that paragraph, the judge wrote:

> More than that, each headnote is an individual, copyrightable work. That became clear to me once I analogized the lawyer’s editorial judgment to that of a sculptor. A block of raw marble, like a judicial opinion, is not copyrightable. Yet a sculptor creates a sculpture by choosing what to cut away and what to leave in place. That sculpture is copyrightable. 17 U.S.C. §102(a)(5). So too, even a headnote taken verbatim from an opinion is a carefully chosen fraction of the whole. Identifying which words matter and chiseling away the surrounding mass expresses the editor’s idea about what the important point of law from the opinion is. That editorial expression has enough “creative spark” to be original. ... So all headnotes, even any that quote judicial opinions verbatim, have original value as individual works.

I personally don't think this sculpture metaphor works for verbatim quotes from judicial opinions.


Yeah, I'm willing to bet that metaphor gets called out as ludicrous by a higher court, as it has broader implications across types of editorial expression that break down when examined.

The marble from which a sculpture is carved is not itself a copyrighted work, and if we imagine it as having copyright protection, to the extent it's recognizable after editorial expression it'd have to qualify as fair use itself.


Both premises in your argument for why verbatim selection from a court decision is not analogous, for copyright purposes, to a sculptor carving from a block of material are wrong, though: the more general premise (that a work must not infringe someone else’s work to be copyrightable) and the more specific one (that court decisions are subject to copyright in the United States).


> Yeah, I'm willing to bet that metaphor gets called out as ludicrous by a higher court, as it has broader implications across types of editorial expression that break down when examined.

It's not ludicrous at all. Whether a work of "selection" from an existing source can be copyrightable in its own right would probably have to be judged on pretty much a case-by-case basis, but even in the context of "selecting" from a ruling there are almost certainly many cases where that work is creative and original enough that it can sensibly be protected by copyright.


> It really isn’t workable — in law specifically - for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!

I guess whether we’d expect multiple lawyers to converge on the same selection depends on how long the source is and how long the collection of quotes is. I don’t think it is totally obvious, though…

I’m also not sure if that’s a generally good test. It seems great for, like, painting. But I wouldn’t be surprised if we could come up with a photography scene where most professionals would converge on the same shot…


If close paraphrase can be detected, that ought to be proof enough that some non-trivial element of creativity was involved in the original text. Purely functional and necessary elements are not protected by copyright, even when they would otherwise be creative (this is technically known as the 'scènes à faire' doctrine), and surely a "quote" which is unavoidable because it factually and unquestionably is the core of the ruling would have to fall under that.


Isn't the argument that the act of selecting the right quote is the real work - and the work the copier avoided in the act of copying?

You could argue that all the words are already in the dictionary, so none of them are new; you are just quoting from the dictionary in a particular order...

The reason you have people, rather than computers, interpreting the law is so you can make judgements that make sense. Fundamentally these laws are there to protect work from being unfairly ripped off.

What was clearly done in this case was a rip-off which damaged the original creator - everything else is dancing on the head of a pin.


Copyright does not protect work ("sweat of the brow"), it only protects expression and creativity. Thus, whenever there is only one right expression or even a bare handful in any given context, copyright does not apply to that particular choice. By analogy, arranging words in some semi-arbitrary order can be an expressive choice, whereas using what's effectively a fixed phrase is not, even though the two might look similar and involve a comparable amount of "work".


The intention of copyright is to protect useful work.

The detail of how to do that in a fair way that doesn't block other people is complex[1]; you can never cover all possibilities in a written law, which is why you have people interpreting them and making judgements. All I'm saying is that the guiding light in that interpretation is that copyright is there to protect the justifiable work of people in a fair way.

Somebody taking those law notes and trivially copying them to directly compete is clearly not 'fair use'.

If those notes could have been created mechanically, directly from the original source, why didn't the copier do that rather than use the competitor's work?

[1] given the endless creativity of humans to game systems.


> The intention of copyright is

..."to promote the progress of science and useful arts". I don't see anything in there about rewarding 'work' irrespective of whether that work involves any kind of creativity.

> If those notes could have been created mechanically directly from the original source - why didn't the copier do that

That's actually a very good question. In practice, I do absolutely agree that the notes involve plenty of originality and creativity.


> The intention of copyright is ..."to promote the progress of science and useful arts". I don't see anything in there about rewarding 'work' irrespective of whether that work involves any kind of creativity.

Not sure where you got that quote from, but I'd say the work aspect is implicit in the "promote the progress", i.e. progress requires that people are able to get paid for their work to progress science or the useful arts.

If the progress was trivial and required no work then it wouldn't need protection or promotion.

And sure it's phrased that way to get the balance between fair use and protection - but if there was no need of protection then copyright wouldn't need to exist - as free reuse is the default.


I think this is the best takeaway. This case and its outcome is restricted to its facts. Most of the LLM activity today is very different than what happened here.


My experience using Westlaw Keycites at work is that they’re not primarily created by fishing a quote out of a holding, but instead by synthesizing a rule. If I want a summary, I read the Keycite; if I want a money quote, I root around in the case linked to the Keycite.

Have you seen different? I’m curious what area of law you practice and in what state, for comparison’s sake.


Yeah, I'd agree that most are synthesized. But I do frequently see headnotes that are verbatim or nearly verbatim slices from the text. Just grabbing a case at random: Kearney v. Salomon Smith Barney, Inc., 39 Cal.4th 95 (2006). The 4th headnote reads:

> The federal system contemplates that individual states may adopt distinct policies to protect their own residents and generally may apply those policies to businesses that choose to conduct business within that state.

And the opinion reads:

> [T]he federal system contemplates that individual states may adopt distinct policies to protect their own residents and generally may apply those policies to businesses that choose to conduct business within that state.


The crux is Fair Use, and until lobbyists change the four-factor test, AI training has an uphill battle in court. It’s a very disliked observation in this forum, but I stand by my principles on this one because the courts see it my way. Derivative works, especially by artificial means, simply fail the test miserably, and that’s the truth.


Collections of essays or poems are considered copyrightable. This seems analogous enough to me.


>the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark” ... when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis

... so it follows that it was then Ross's annotators showing the creative spark


I'll quote a longer portion of the opinion about generative AI, because I think it makes the opposite of your point:

> Ross’s use is not transformative. Transformativeness is about the purpose of the use. “If an original work and a secondary use share the same or highly similar purposes, and the second use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.” Warhol, 598 U.S. at 532–33. It weighs against fair use here. Ross’s use is not transformative because it does not have a “further purpose or different character” from Thomson Reuters’s. Id. at 529.

> Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself). Rather, when a user enters a legal question, Ross spits back relevant judicial opinions that have already been written. D.I. 723 at 5. That process resembles how Westlaw uses headnotes and key numbers to return a list of cases with fitting headnotes.

I think it's quite relevant that this was not generative AI: the reason that mattered is that "transformative" use biases towards Fair Use exemptions from copyright. However, this wasn't creating new content or giving people a new way to understand the data: it was just used in a search engine, much like Westlaw provided a legal search engine. The judge is pointing out that the exact implementation details of a search engine don't grant Fair Use.

This doesn't make a ruling about generative AI, but I think it's a pretty meaningful distinction: writing new content seems much more "transformative" (in a literal sense: the old content is being used to create new content) than simply writing a similar search engine, albeit one with a better search algorithm.


I came here to point this out, and it's especially clear if you contextualize this with the original decision from September: https://www.ded.uscourts.gov/sites/ded/files/opinions/20-613...

They were doing semantic search using embeddings/rerankers.

The point that reading both decisions together drives home is that if they had trained a model on the Bulk Memos and generated novel text instead of doing direct searches, there would likely have been enough indirection to prevent summary judgment, and this would have gone to a jury, as the September decision states.

In other words, from their comment:

> But I'm not sure "generative" is that meaningful a distinction here.

The judge would not seem to agree at all.


Westlaw's headnotes are primarily just snippets of the case with tags attached. They are really crappy. I hate them. Some lawyers love them.

Westlaw protects them because they are the "value add." Otherwise their business model is "take published decisions the court is legally bound to provide for free and sell it to you."

An LLM today could easily recreate the headnotes from scratch, in a far superior manner, with the right prompt. I don't even think hallucinations would factor in on such a small, well-regulated task, but you can always just asterisk the headnotes and put a disclaimer on them.


Exactly. Why use the headnotes at all?

I always thought they were obviously copyrightable. Plus they’re not close to perfect either.


> You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.

Surely creating a general-purpose AI is transformative, though? Are you anticipating that AI companies will be sued for contributory infringement, because customers are using a general-purpose AI to compete with companies which created parts of the training data?


IMO yes. The entire purpose of copyright law is to protect the incentive to create new material. A huge portion of the value prop of AI is that it captures the incentive normally bound for the creators of the training material (i.e. the whole point is you can ask the AI and not even see, never mind pay, the originator).


Ask the AI for what exactly? Factual information? That gets very low protection from a copyright point of view, especially when separate random answers by the AI will routinely show completely different rephrasings of the AI's response - implying that it can generalize well beyond the "expression" contained in any single answer, and effectively reference the underlying facts.


I'm not a lawyer, but I think the bar for contributory infringement is much higher than that. I think you'd have to find representatives of the defendants actually indicating somehow that people should use it that way. It seems to me that Grokster, etc.'s encouragement of their users to infringe copyright was an important factor in them losing this case, for instance.

https://supreme.justia.com/cases/federal/us/545/913/


Encouragement is definitely not a required element. https://www.cantorcolburn.com/news-newsletters-387.html


Interestingly, almost the entirety of the judge's opinion seems to be focused on the question of whether the translated notes are subject to copyright. It seems to completely ignore the question of whether training an AI on copyrighted material constitutes making a copy of that work in the first place. Am I missing something?

The judge does note that no copyrighted material was distributed to users, because the AI doesn't output that information:

> There is no factual dispute: Ross’s output to an end user does not include a West headnote. What matters is not “the amount and substantiality of the portion used in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public for which it may serve as a competing substitute.” Authors Guild, 804 F.3d at 222 (internal quotation marks omitted). Because Ross did not make West headnotes available to the public, Ross benefits from factor three.

But he only does so as part of an analysis of whether there's a valid fair use defense for Ross's copying of the headnotes, ignoring the (to me) obvious question: if no copyrighted material was distributed to end users, how can this even be a violation of copyright in the first place?


Ross evidently copied and used the text itself. It's like Ross creating an unauthorized volume of West's books, perhaps with a twist.

Obscurity ≠ legal compliance.


So the use of AI actually has nothing to do with the ruling here? This is just about the fact that Ross made one local copy of the notes and never distributed it?


How would training on copyrighted material be infringement in a way that merely producing the training material (but not iterating through training) would not be?


There were data brokers who literally paid people to transcribe phone books before OCR was a viable option. That was protected, as data isn't copyrightable. It isn't hard to argue that case law metadata is no different even though it includes textual descriptions (themselves taken from public documents).


> "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents"

This is a good distillation. A bit like "we trained our system on various works of art and music, and now it is being sold as a service that competes with the original artists and musicians."


AI has yet to demonstrate that it can do anything different from what a group of people could sit down and do. Sure, the AI may be able to do it faster, but there hasn't yet been anything demonstrated that exceeds what humans can do.

If it would be illegal for a group of people to do something, it is also going to be illegal for an AI do so.

Why is that so surprising?


If the copyright holders win, the model giants will just license.

This effectively kills open source, which can't afford to license and won't be able to sublicense training data.

This is very bad for democratized access to and development of AI.

The giants will probably want this. The giants were already purchasing legacy media content enterprises (Amazon and MGM, etc.), so this will probably further consolidation and create extreme barriers to entry.

If I were OpenAI, I'd probably be very happy right now. If I were a recent batch YC AI company, I'd be mortified.


License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

To the contrary, this just means companies can't make money from these models.

Those using models for research and personal use wouldn't be infringing under the fair use tests.


> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Maybe the strategy is something like this:

1) Survive long enough/get enough users that killing the generative AI industry is politically infeasible.

2) Negotiate a compromise similar to the compulsory mechanical royalty system used in the music business to “compensate” the rights holders whose content is used to train the models

The biggest AI companies could even run the enforcement cartels ala BMI/ASCAP to compute and collect royalties owed.

If you take this to its logical conclusion, the AI companies wouldn’t have to pre-license anything, and would just pay out all the royalties to the biggest rights holders (more or less what happens in the music industry) on the basis that figuring out what IP went into what model output is just too hard, so instead they just agree to distribute it to whomever is on the New York Times best seller list at any given moment.


> the basis that figuring out what IP went into what model output is just too hard, so instead they just agree to distribute it to whomever is on the New York Times best seller list at any given moment.

the long tail exists, and there will always be a threshold for payments due to rights holders.

it used to be (like 10 years ago so i might not remember the details exactly) that if you earned less than £1 from youtube performing music rights in a quarter then any money you earned was put back into the pot and redistributed to those earning over £1.

it just wasn’t worth the cost to keep track of £0.00001 earnings for all the rights holder in the bottom of the long tail each quarter, or to pay the bank fees when the eventually earn £0.01 that can be paid to them.

definitely not perfect, but at least some people were getting paid, instead of none.
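The threshold scheme described here is easy to sketch. This is a hypothetical illustration of the arithmetic only; the names, amounts, and the £1 threshold are placeholders, not any PRO's actual rules:

```python
# Hypothetical sketch of a threshold-and-redistribute payout: earnings
# below the minimum go back into the pot, which is then shared pro-rata
# among the rights holders who cleared the threshold.
def redistribute(earnings, threshold=1.00):
    over = {k: v for k, v in earnings.items() if v >= threshold}
    pot = sum(v for v in earnings.values() if v < threshold)
    total_over = sum(over.values())
    # Each qualifying holder gets a share of the pot proportional to
    # their own earnings; sub-threshold holders get nothing this period.
    return {k: v + pot * (v / total_over) for k, v in over.items()}

payouts = redistribute({"a": 4.00, "b": 1.00, "c": 0.50, "d": 0.25})
# The 0.75 pot is split 4:1 between a and b; c and d are dropped.
```

The real versions of this were, as described, spreadsheet jobs over much messier data, but the shape of the calculation is the same.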

also, youtube’s data they gave us was fairly shit (video title, url). so that didn’t help. nor did the lack of compute/data proc infrastructure/skills. was historically a manual spreadsheet job trying to work out who to cut.

i had to do it a few times :/

edit —

> The biggest AI companies could even run the enforcement cartels ala BMI/ASCAP to compute and collect royalties owed.

what could happen, for music at least, is the same thing that happened with youtube, mashed up with live music analogies.

a licensing negotiation with BMI/ASCAP/PRS, and maybe major publishers directly if they get frustrated with the PROs. then PROs will use sampling of other revenue streams to work out what the likely popular things are for AI. then divvy up whatever the lump sum is between the most popular songs.

we used to do this for live music. i had to generate the sampled dataset in microsoft access each year and weed out the all the radio stings.

sorry for costing you a million pounds that one year ed sheeran :/


> figuring out what IP went into what model output is just too hard

Check out this one cool trick companies found for skirting copyright restrictions.

Lawyers HATE them!


> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

They don't need every copyrighted work and getting a fraction is entirely practical. They would go to some large conglomerate like Getty Images or large publishers or social media whose terms give the site a license to what you post and then the middle men would get a vig and the original authors would get peanuts if anything at all.

But in aggregate it would price out the little guy from creating a competing model, because each creator getting $3 is nothing to the creator but is real money to a small entity when there are a billion creators.


Applying copyright law more and more to things like software, and now to AI models (in other words, the status quo), makes little sense.

What is needed instead (I doubt politicians read HN, but someone go and tell them) is a new law that regulates training of these models if we want them to exist and be used in a legally safe way. This is needed for example because most jurisdictions have different copyright laws from one another, but software travels globally.

It would make sense to make all books available for non-commercial, perhaps even commercial, R&D in AI, if society elected that to be beneficial, in the same way that publishers must donate one copy of each new work to a copyright library (the Library of Congress in the US; the Oxford and Cambridge university libraries and the British Library in the UK; the Frankfurt and Leipzig Nationalbibliotheken for Germany; etc.). Just add an extra provision that they need to send a plain-text copy to the Linguistic Data Consortium (LDC), which manages datasets for NLP. As with fair use, there can be provisions to compensate for that use that happen automatically in the background (in some countries the price of a photocopying machine includes a fee that gets passed on to copyright holders).

Otherwise you'll have one LLM being legal in one country but illegal in another because more than 15% of one book was in the training data, and other messy situations.


They didn’t train it on every available copyrighted work though, but on a specific set of legal questions and answers. And they did try to license them, and only did the workaround after not getting a license.


I think they were talking about the "model giants" like OpenAI you mentioned. Not saying they're correct, but I will concede the amount of copyrighted information someone like OpenAI would want is probably (at least) an order of magnitude more than this particular case.


> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Oh no. Anyway.


Open source model builders are no more entitled to rip off content owners than anyone else. I couldn't possibly care any less if this impacts "democratized access" to bullshit generators. At least if the big boys license the content then the rightful owners get paid (and have the option to opt out).


The copyright lobby has really done a number on public policy. Copyright was never meant to be perpetual.

I’m good with your proposal if we also revert to the original 14 year + 14 year extension model. As it stands the 120 year copyright is so ridiculously tilted that we should not allow it to extend to veto power over technical advancements.


Legal arbitrage isn't a technical advancement. The technical advancement was all the stuff that goes into LLMs not the part where we feed ever more copyright into models for AICorp to make money.


I don't have either a data center, or every single copyrighted work in history to import as training data to train my open source model.

Whether or not OpenAI is found to be breaking the law will be utterly irrelevant to actual open AI efforts.


> If the copyright holders win, the model giants will just license.

No, they won't. The biggest models want to train on literally every piece of human-written text ever written. You can pay to license small subsets of that at a time. You can't pay to license all of it. And some of it won't be available to license at all, at any price.

If the copyright holders win, model trainers will have to pay attention to what they train on, rather than blithely ignoring licenses.


"The biggest models want to train on literally every piece of human-written text ever written"

They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.


Even restricted to that narrower definition, the major commercial model companies wouldn't be able to afford to license all their high-quality human text.

OpenAI is Uber with a slightly less ethically despicable CEO.

It knows it's flouting the spirit of copyright law -- it's just hoping it can bootstrap quickly enough to make the question irrelevant.

If every commercial AI company that couldn't prove training data provenance tomorrow was bankrupted, I wouldn't shed an ethical tear. Live by the sword, die by the sword.


Bold idea, requiring startups to proactively prove they have not broken the law. Should we apply it to all tech startups? Let’s see silicon startups prove they have not stolen trade secrets!


Open source models can crowdsource open source training data. This was done for RNNoise for example.


> Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers

For me (Italian) this is amazing! Most Italian judges and lawyers write in a purposely obscure fashion, as if they wanted to keep the plebs away from their holy secrets. This document instead begs to be read; some parts are more in the style of a novel than of a technical document.


> The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.

Although the judge makes that statement, it looks like he misunderstands the nature of the AI system and the inherent generative elements it includes.


How is the system inherently generative?


Generative is a technical term, meaning that a system models a full joint probability distribution.

For example, a classifier is a generative model if it models p(example, label) -- which is sufficient to also calculate p(label | example) if you want -- rather than just modeling p(label | example) alone.

Similar example in translation: a generative translation model would model p(french sentence, english sentence) -- implicitly including a language model of p(french) and p(english) in addition to allowing translation p(english | french) and p(french | english). A non-generative translation model would, for instance, only model p(french | english).

I don't exactly understand what this judge meant by "generative", it's presumably not the technical term.
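To make the distinction concrete, here is a toy sketch in Python (made-up data, purely illustrative): a generative classifier models the joint p(x, y), from which you can derive both the conditional p(y | x) and the marginal p(x); a discriminative classifier would store only the conditional.

```python
from collections import Counter

# Toy (word, language-label) pairs -- stand-ins for (example, label).
data = [("chat", "fr"), ("chat", "fr"), ("chien", "fr"),
        ("cat", "en"), ("dog", "en")]

counts = Counter(data)
total = sum(counts.values())
p_joint = {xy: c / total for xy, c in counts.items()}   # p(x, y)

def p_label_given_x(x):
    # Conditional derived from the joint: p(y | x) = p(x, y) / p(x)
    px = sum(p for (xi, _), p in p_joint.items() if xi == x)
    return {y: p / px for (xi, y), p in p_joint.items() if xi == x}

def p_x(x):
    # Marginal p(x): only recoverable because we modeled the joint.
    return sum(p for (xi, _), p in p_joint.items() if xi == x)

print(p_label_given_x("chat"))  # {'fr': 1.0}
print(p_x("chat"))              # 0.4
```

In the translation analogy above, `p_x` plays the role of the implicit language model that a purely conditional p(french | english) model never contains.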


Your definition of "generative" as a statistical term is correct.

However, and annoyingly so, recently the general public and some experts have been speaking of "generative AI" (or GenAI for short) when they talk about large language models.

This creates the following contradiction:

- large language models are called "generative AI"

- large language models are based on transformers, which are neural networks

- neural networks are discriminative models (not generative ones like Hidden Markov Models)

- discriminative models are the opposite of generative models, mathematically

So we may say "Generative AI is based on discriminative (not generative) classifiers and regressors". [as I am also a linguist, I regret this usage came into being, but in linguistics you describe how language is used, not how it should be used in a hypothetical world.]

References

- Gen AI (Wikipedia) https://en.wikipedia.org/wiki/Generative_artificial_intellig...

- Discriminative (Conditional) Model (Wikipedia) https://en.wikipedia.org/wiki/Discriminative_model


Do you have some kind of dictionary where I can find this definition? Because I don’t really understand how that can be the deciding factor of „generative“, and the wiki page for „generative AI“ also seems to use the generic „AI that creates new stuff“ meaning.

By your definition, basically every classifier with 2 inputs would be generative. If I have a classifier for the MNIST dataset and my inputs are the pixels of the image, does that make the classifier generative because the inputs aren’t independent from each other?


If you have an MNIST classifier that just takes in images and spits out a probability for each digit 0-9, that wouldn't necessarily be generative, if it is only capable of modeling P(which digit | all pixels).

But many other types of model would give you a joint distribution P(which digit, all pixels), so would be generative. Even if you only used it for classification.

https://en.wikipedia.org/wiki/Generative_model

I guess these days "generative" must mean "it is used to generate outputs that look like the training data".

But until recently, the meaning had to do with the information in the model, not how it's used.


You can derive the latter information (the joint distribution), given the former and a prior over "all pixels"-like data. So, the defining feature of "generative" models is that they feature a prior over their input data?


Generative models model the data, whether that is p(x) or p(x,y) or (x,y,z) etc.


Yes, though maybe not explicitly written down.


Does it really matter what the judge calls it when the ruling is about its end effects and outcomes?


Hi! Perma is made by the Harvard Library Innovation Lab, which I direct, and I wrote a bunch of the early code for it back in 2015 or so.

For HN readers, I'd suggest checking out https://tools.perma.cc/, where we post a bunch of the open source work that backs this. Due to the shift from warc to wacz (a zipped web archive format developed by WebRecorder), it's now possible to pass around fully interactive, high-fidelity web archives as simple files and host them with client-side javascript, which opens up a bunch of new possibilities for web archive designs. You can see some tech demos of that at our page https://warcembed-demo.lil.tools/, where each page is just a static file on the server plus some client-side javascript.

It's best to think of Perma.cc itself, the service, as some UX and user support wrapping to help solve linkrot, primarily for courts, law journals, and journalists (for example, dashboards for a law journal to collaborate on the links they're archiving for their authors), and our work on this as building from that use case to try to make it easier for everyone to build similar things.

I saw some mentions of the Internet Archive, which is great, and is also kind enough to keep a copy of our archives and expose them through the Wayback Machine. One thing I've been thinking about recently in archiving is that there's a risk to overstandardizing -- you don't want everything captured with the same software platforms, funded through the same models, governed by the same people, exposed through the same interfaces, and so on. There are supposed to be thousands of libraries, not one library. Unlike "don't roll your own crypto," I'd honestly love to see more people roll their own archives.
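One nice property of wacz being a self-contained file is that integrity checking can be done offline: the file is a zip whose datapackage.json manifest records a hash for each contained resource. A hedged sketch follows; the "path" and "hash" field names here assume the Frictionless Data Package convention the WACZ spec builds on, so check the spec before relying on them:

```python
import hashlib
import json
import zipfile

def verify_wacz(wacz):
    """Check each resource listed in the manifest against its recorded hash.

    Assumes Frictionless-style entries like
    {"path": "archive/data.warc.gz", "hash": "sha256:..."}.
    """
    with zipfile.ZipFile(wacz) as z:
        manifest = json.loads(z.read("datapackage.json"))
        for res in manifest.get("resources", []):
            algo, _, expected = res["hash"].partition(":")
            actual = hashlib.new(algo, z.read(res["path"])).hexdigest()
            if actual != expected:
                return False
    return True
```

Any tampering with a captured resource then shows up as a hash mismatch, without needing the original server or the archiving service to still exist.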

Happy to answer any questions!


My first question was "If this is a free service, how do I know it will still be around in even a few years?". This was answered by your comment that it is (or at least appears to be?) funded by Harvard.

In which case, why isn't this prominently displayed on the main page? Or why not use a Harvard library URL, which would significantly boost the trust level? Especially vs. a ccTLD like .cc, which is known to be problematic?


It is on core Harvard funds, and we also have paid accounts used by law firms and journalists.

As an innovation lab we often minimize Harvard branding with project websites because it's more instructive to win or lose on our own merits than based on how people feel about Harvard, in either direction.


Yeah, but the success of a service like perma.cc relies on trust. How does someone trust that you will be here in 10, 20, etc. years?

Harvard has been around for hundreds of years; Harvard has inbuilt trust; Harvard has funding. You should negotiate and arrange to go behind its brand.


Things like https://perma.cc/sign-up/courts

It's in several US states' interest to make sure this service keeps existing.


wow hats off man!

But I am also wondering: is this sustainable? Can I use this to archive Hacker News itself, for example?

And how can we verify the integrity of the web archive? Could you please explain?

Thanks a lot in advance. I wish you all the best in your career & this project!


I guess it’s not sufficiently prominent (given that you didn’t see it), but this is discussed in detail in the FAQ section.


I think the main question is:

- Why is it better than internet archive?

I personally see the benefit as the Internet Archive potentially no longer being the only game in town, but even that comes with certain costs (which may not be great for the community as a whole, depending on who you ask).

I would love to hear your perspective on where you stand as related to other providers of similar services.


I think the biggest distinction is between archiving platforms made primarily for authors and primarily for web crawlers.

If you're an author (say, of a court decision) and you archive example.com/foo, Perma makes a fresh copy of example.com/foo as its own wacz file, with a CPU-intensive headless browser, gives it a unique short URL, and puts it in a folder tree for you. So you get a higher quality capture than most crawls can afford, including a screenshot and pdf; you get a URL that's easy to cite in print; you can find your copy later; you get "temporal integrity" (it's not possible for replays to pull in assets from other crawls, which can result in frankenstein playbacks); and you can independently respond to things like DMCA takedowns. It's all tuned to offer a great experience for that author.

IA is primarily tuned for preserving everything regardless of whether the author cared to preserve it or not, through massive web crawls. Which is often the better strategy -- most authors don't care as much as judges about the long-term integrity of their citations.

This is what I'm getting at about the specific benefits of having multiple archives. It's not just redundancy, it's that you can do better for different users that way.


> - Why is it better than internet archive?

With the internet archive, the purpose seems to be for public archiving. One could imagine a use-case where you want non-public archives, and are therefore not subject to any take-down requests, especially if they are considered court evidence for example.

By paying directly for your links to be archived, it directly helps fund the service and therefore keep it going. You would want to see some guarantees in the contract about pricing if you were to long-term rely on the service.


Irrelevant. The point is that there shouldn't be a single archive for anything, because then it has the longevity of the operators. Who can say whether Harvard or the IA will close its service first? Why choose?


Is there any concept of signing data at time of archive, and verification at time of access, to prove it is not later tampered with, say by a bribed sysadmin?

Similarly are there any general supply chain integrity measures in place, such as code review of dependencies, reproducible builds, or creating archives reproducibly in independently administrated enclaves?

You note archives could be used for instances like Supreme Court decisions, so anyone with the power to tamper with content would certainly be targeted.


We're coauthors on the wacz-auth spec, which is designed to solve this sort of thing by signing archives with the domain cert of the archive that created them. If you cross-sign with a private cert you can do pretty well with this approach against various threat models, though it has to be part of a whole PKI security design.

I think the best approach for high stakes archiving is to have a standard for "witness APIs" so that you could fetch archives from independent archiving institutions. That also solves for the web looking different from different places. That hasn't gelled yet, though.
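The tamper-evidence idea underneath both approaches can be sketched with nothing but a hash. This is a simplified stand-in (wacz-auth signs archives with real certificates rather than publishing bare digests, and the bytes here are placeholders), but it shows why independently recorded digests make silent modification detectable:

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Hex SHA-256 digest of an archive's bytes."""
    return hashlib.sha256(data).hexdigest()

# At capture time: hash the archive and record the digest somewhere
# the archive operator cannot quietly rewrite -- a signed log, a
# timestamp token, or an independent "witness" institution.
archive_bytes = b"...wacz contents..."
published = sha256_digest(archive_bytes)

# At access time: re-hash what the archive serves and compare.
served_bytes = b"...wacz contents..."
assert sha256_digest(served_bytes) == published  # unchanged

# Any modification, even a single byte, changes the digest.
tampered = b"...wXcz contents..."
assert sha256_digest(tampered) != published
```

A witness API in this picture is just several independent archives each returning their own copy (or digest) of the same capture, so a reader can compare them rather than trust one operator.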


WACZ files created by WebRecorder software like archiveweb.page are signed (by you) and timestamped (by a third party using RFC 3161).
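Real RFC 3161 tokens are signed ASN.1 structures issued by a timestamp authority (TSA) and verified with the TSA's public key. As a stdlib-only toy model of what such a token attests (all names hypothetical, and HMAC standing in for the TSA's public-key signature, so here only the key holder can verify -- unlike a real token), a "TSA" binds an archive's hash to a time, and the binding breaks if either is altered:

```python
import hashlib
import hmac
import json

TSA_KEY = b"toy-tsa-secret"  # stands in for the TSA's signing key

def issue_token(archive_hash: str, time_utc: str) -> dict:
    """Toy TSA: bind (hash, time) together and 'sign' the pair."""
    payload = json.dumps({"hash": archive_hash, "time": time_utc})
    tag = hmac.new(TSA_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def verify_token(token: dict) -> bool:
    """Check that the (hash, time) binding is untouched."""
    expected = hmac.new(TSA_KEY, token["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["tag"])

archive_hash = hashlib.sha256(b"...wacz contents...").hexdigest()
token = issue_token(archive_hash, "2024-01-01T00:00:00Z")
assert verify_token(token)  # intact token checks out

# An attempt to backdate the capture invalidates the token.
token["payload"] = token["payload"].replace("2024", "2023")
assert not verify_token(token)
```

The useful property is that the archive operator holds the token but cannot forge or alter it, so even a bribed sysadmin can't quietly rewrite what was captured or when.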


And put the signatures on a blockchain so that the perma.cc holders, or the US government, can't easily alter things either.


Since you own the "perma.link" domain name (I just looked it up) why don't you use that instead of .cc which has issues?


It's really annoying that domain is not the main one, it's so much better!


What happens if you get a lawsuit or injunction demanding information removal or alteration? What if somebody archives a born secret or something sensitive?

