The general story about the LLM-scraper problem is that (1) "companies like OpenAI run badly implemented web crawlers to get training data", but also (2) with LLMs, scrapers could do content understanding (inference) that would make them more useful, and, I think even more impactful, (3) LLMs will empower people who would never have written scrapers before to write them.
I kinda laugh at (3) because it's been a running gag for me that management vastly overestimates the effort to write scrapers and crawlers, because they've been burned by vastly underestimating the effort to develop what look like simple UI applications.
They usually think "this will be a hassle to maintain" but it usually isn't, because: (a) the target web sites usually never change in a significant way because UI development is such a hassle, and (b) the target web sites usually never change in a significant way because Google will punish them if they do [1].
It is like 10 minutes to write a scraper if you do it all the time and have an API like beautifulsoup at your fingertips, probably 20 minutes to vibe code it if you don't.
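To give a sense of what that 10-minute scraper looks like, here is a sketch with requests + beautifulsoup; the gallery URL and the CSS selectors are made up, you'd swap in whatever the target site actually uses:

    import requests
    from bs4 import BeautifulSoup

    # Fetch one gallery page (hypothetical URL) and parse it
    resp = requests.get("https://example.com/gallery/123")
    soup = BeautifulSoup(resp.text, "html.parser")

    # Pull out every image URL and its caption
    for figure in soup.select("div.gallery figure"):
        img = figure.find("img")
        caption = figure.find("figcaption")
        print(img["src"], caption.get_text(strip=True) if caption else "")

Most of the "handful of rules" end up being a few extra selectors like those, for sites that mark things up a little differently.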
I am still using the same HTML scraper to process image galleries today that I used to process Flickr galleries back in the 00's. For a while the pattern was "fight with the OAuth to log into an API for 45 minutes" or "spend weeks figuring out how to parse MediaWiki markup" versus "get the old scraper working in less than 15 minutes". Frequently the scraper works perfectly out of the box, sometimes it works 80% out of the box, and it always works 100% after adding a handful of rules.
I work on a product that has a React-based site and it seems the "state of the art" in scraping a URL [2] like
https://example.com/item/8788481
is to download the HTML and then all the Javascript and CSS and other stuff with no cache (for every freaking page), run the Javascript, and have something scrape the content out of the DOM, whereas they could just go to
https://example.com/api/item/8788481
and get the data they want in a JSON format which could be processed like item["metadata"]["title"] or just stuffed into a JSONB column and queried any way you like. Login is not "fight with OAuth" but something like "POST username and password to https://example.com/api/login with a client that has a cookie jar". I don't really think "most people are stupid" that often, but I think it all the time when web scraping is involved.
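Concretely, a sketch of that flow with requests (assuming the endpoints behave as described; the field names in the login POST are guesses):

    import requests

    # Session() keeps the cookie jar for you, so login is one POST
    session = requests.Session()
    session.post("https://example.com/api/login",
                 json={"username": "me", "password": "secret"})

    # Then hit the JSON endpoint directly instead of rendering the React page
    item = session.get("https://example.com/api/item/8788481").json()
    print(item["metadata"]["title"])

No headless browser, no Javascript execution, and the response is already structured.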
[1] they even have a patent for it! People who run online ad campaigns A/B test anything, but the last thing Google wants is for an SEO to be able to settle questions like "will my site rank higher if I put a certain phrase in a <b>?"

[2] ... as in, we see people doing it in our logs
If people decomposed agile into little bits they probably wouldn't be doing agile anymore. The other day somebody posted a question about "how do I push back against management going to one-week sprints?" My answer, which I didn't post then, was "if management really wants to go fast, give up on sprints".
That is, Kanban + Continuous Integration >> Scrum by a lot. Seen from that viewpoint sprints are not something that speeds anything up (seductive name there!) but rather a bunch of phony deadlines and needless meetings (do you really think people feel psychologically safe in a retrospective meeting that was scheduled just to have a meeting? Is there really something worth talking about in an every-two-weeks one-on-one with your manager which isn't important enough to knock on their door and ask about right now?)
If I was evaluating a manager I'd probably get them to give me a list of practices that they say their team is following and then check for conformance against that. I am less bothered by whether they do code reviews or not than by "did they tell me that they enforce five conventions in the code and, looking at the code, I find they rarely do... Let's sit in on a code review".
I've tended to use microservices in limited cases where the system had to serve a few requests that had radically different performance requirements, particularly memory utilization. I had a PHP server, for instance, that served exactly one URL for which PHP was not a good fit; a specialized server in another language for that one URL gave like 1000x better performance and saved money because we didn't need a much bigger PHP server.
Using Spring or Guice in the Java world it is frequent that people write "services" that are simply objects implementing some interface, which are injected by the framework. In a case like that you can imagine a service having either an in-process implementation or an out-of-process implementation (e.g. via a web endpoint or some RPC.) Frameworks like that are normally thinking at the level of "let's initialize one application in one address space at a time", but it would be nice to see something oriented towards managing applications that live in various address spaces.
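The pattern is easy to sketch outside Java too; here is the idea in Python (ItemService and both implementations are hypothetical), where the caller never has to know which address space the service lives in:

    from typing import Protocol
    import requests

    class ItemService(Protocol):
        def get_title(self, item_id: int) -> str: ...

    # In-process implementation: just an object in the same address space
    class LocalItemService:
        def __init__(self, titles: dict[int, str]):
            self._titles = titles
        def get_title(self, item_id: int) -> str:
            return self._titles[item_id]

    # Out-of-process implementation: same interface, backed by a web endpoint
    class RemoteItemService:
        def __init__(self, base_url: str):
            self._base_url = base_url
        def get_title(self, item_id: int) -> str:
            item = requests.get(f"{self._base_url}/api/item/{item_id}").json()
            return item["metadata"]["title"]

Deciding which implementation a caller gets wired up with is exactly the kind of decision a framework could make across address spaces instead of within one.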
Trouble is that some people get this squee when they hear they can use JDK 9 for this project and JDK 10 for another project and JDK 11 for yet another, and they'd rather die than eschew the badly broken Python 3.5 for something better. If you standardized absolutely everything I think you could be highly productive with microservices, because you wouldn't have to face gear switching or deal with teams who just don't know that XML serialization worked completely differently in JDK 7 vs JDK 8, so the services they make don't quite communicate properly, etc.
When I hear "gift card" I reach for my gun. Relatives have been warned that nobody in my nuclear family will accept gift cards for a gift.
(Last time it happened my sister-in-law bought a Burger King gift card for my son although she didn't know that Burger King had gone out of business in my town. They had trouble spending all of it after making several out-of-town trips. I made my son give it to me after he screwed up. I almost used it down at the Oculus in NYC but didn't want to wait in line for 20 minutes to wait another 20 minutes to get a burger and just ate at the salad place across the street instead.)
I don't know exactly how, but I know stablecoins will eventually lead to big trouble. In 2025 it seems easy to run an honest stablecoin because with current high interest rates you can put the money in US treasuries or a bank account, collect some nice yield, spend a little to run the system, and make a profit. Trouble is that a lot of people won't be happy to hold a coin that pays 0% yield when you can find a bank account that pays 4%, so people are going to insist on interest-bearing coins, and the trouble with that is that people aren't going to think 4% is enough, and that's going to lead people into things that aren't honest and aren't safe.
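Back of the envelope, with made-up numbers, for why the honest version works today and why holders will push back:

    reserves = 1_000_000_000      # $1B of coins outstanding, fully backed
    treasury_yield = 0.04         # roughly what short-term treasuries pay now
    operating_costs = 10_000_000  # assumed cost to run the system

    issuer_profit = reserves * treasury_yield - operating_costs  # $30M/yr
    holder_yield = 0.0            # the coin itself pays nothing

The issuer keeps the whole spread while the holder earns 0%, which is exactly the gap that pushes people toward interest-bearing coins and then toward chasing more than 4%.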
I used to be a CNBC junkie. Before there was crypto I used to enjoy adopting a penny stock and watching the ticker for it very closely; you can learn a lot about market dynamics when you are trading a stock where buying $2000 of it is 30% of the volume for the day. (Try $KBLB for a stock where, if you think the price is too high or too low, you will find both opinions vindicated if you wait long enough.)
Also those trades occurring "millions and millions" of times a day as opposed to a new pope every decade or so.
The comparison that I think matters is that the Pope and the Dalai Lama are the best-known religious leaders there are. I mean there used to be Billy Graham and the Ayatollah Khomeini, but I think most people would struggle to name the leader of the Methodist church or of Nichiren Buddhism, or a rabbi of more than local importance.
Is that supposed to affect the comparison? A million trades seen by a thousand people each* isn't impressive, and numbers like that happen in all kinds of situations.
This is about the huge number of people knowing about a single event right away.
* I say a thousand here because even someone glued to every number on CNBC is parsing nowhere near millions of numbers. A much smaller sliver of people will see each of those individual trades.
I've found $30 sneakers at Wal-Mart that compare to $90 sneakers from Brooks: they fit my feet like the Brooks do, and while they don't last as long as the Brooks, they last more than 1/3 as long.
Trouble is those were only available for a short time, I've gone back and there's been nothing that good.
It is not like it is going better for the AP1000 or NuScale. Including financing, the APR1400 bid in Czechia again comes out to similar equivalent costs.