The general story about the LLM-scraper problem is that (1) "companies like OpenAI run badly implemented web crawlers to get training data", but also (2) with LLMs, scrapers could do content understanding (inference) that would make them more useful, and, I think even more impactful, (3) LLMs will empower people who would never have written scrapers before to write them.
I kinda laugh at (3) because it's been a running gag for me that management vastly overestimates the effort to write scrapers and crawlers, because they've been burned by vastly underestimating the effort to develop what look like simple UI applications.
They usually think "this will be a hassle to maintain" but it usually isn't, because: (a) the target web sites usually never change in a significant way because UI development is such a hassle, and (b) the target web sites usually never change in a significant way because Google will punish them if they do [1].
It is like 10 minutes to write a scraper if you do it all the time and have an API like beautifulsoup at your fingertips, probably 20 minutes to vibe code it if you don't.
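To give a sense of what that 10-minute scraper looks like, here is a sketch with requests + beautifulsoup; the gallery URL and the CSS selectors are made up, you'd swap in whatever the target site actually uses:

    import requests
    from bs4 import BeautifulSoup

    # Fetch one gallery page (hypothetical URL) and parse it
    resp = requests.get("https://example.com/gallery/123")
    soup = BeautifulSoup(resp.text, "html.parser")

    # Pull out every image URL and its caption
    for figure in soup.select("div.gallery figure"):
        img = figure.find("img")
        caption = figure.find("figcaption")
        print(img["src"], caption.get_text(strip=True) if caption else "")

Most of the "handful of rules" end up being a few extra selectors like those, for sites that mark things up a little differently.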
I am still using the same HTML scraper to process image galleries today that I used to process Flickr galleries back in the 00's. For a while the pattern was "fight with the OAuth to log into an API for 45 minutes" or "spend weeks figuring out how to parse MediaWiki markup" versus "get the old scraper working in less than 15 minutes". Frequently the scraper works perfectly out of the box, sometimes it works 80% out of the box, and it always works 100% after adding a handful of rules.
I work on a product that has a React-based site and it seems the "state of the art" in scraping a URL [2] like
https://example.com/item/8788481
is to download the HTML and then all the Javascript and CSS and other stuff with no cache (for every freaking page), run the Javascript, and have something scrape the content out of the DOM, whereas they could just go to
https://example.com/api/item/8788481
and get the data they want in a JSON format which could be processed like item["metadata"]["title"] or just stuffed into a JSONB column and queried any way you like. Login is not "fight with OAuth" but something like "POST username and password to https://example.com/api/login with a client that has a cookie jar". I don't really think "most people are stupid" that often, but I think it all the time when web scraping is involved.
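Concretely, a sketch of that flow with requests (assuming the endpoints behave as described; the field names in the login POST are guesses):

    import requests

    # Session() keeps the cookie jar for you, so login is one POST
    session = requests.Session()
    session.post("https://example.com/api/login",
                 json={"username": "me", "password": "secret"})

    # Then hit the JSON endpoint directly instead of rendering the React page
    item = session.get("https://example.com/api/item/8788481").json()
    print(item["metadata"]["title"])

No headless browser, no Javascript execution, and the response is already structured.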
[1] they even have a patent for it! People who run online ad campaigns A/B test anything, but the last thing Google wants is for an SEO to be able to settle questions like "will my site rank higher if I put a certain phrase in a <b>?"

[2] ... as in, we see people doing it in our logs
If people decomposed agile into little bits they probably wouldn't be doing agile anymore. The other day somebody posted a question about "how do I push back against management going to one-week sprints?" My answer, which I didn't post then, was "if management really wants to go fast, give up on sprints".
That is, Kanban + Continuous Integration >> Scrum by a lot. Seen from that viewpoint sprints are not something that speeds anything up (seductive name there!) but rather a bunch of phony deadlines and needless meetings (do you really think people feel psychologically safe in a retrospective meeting that was scheduled just to have a meeting? Is there really something worth talking about in an every-two-weeks one-on-one with your manager which isn't important enough to knock on their door and ask about right now?)
If I was evaluating a manager I'd probably get them to give me a list of practices that they say their team is following and then check for conformance against that. I am less bothered by whether they do code reviews or not than by "did they tell me that they enforce five conventions in the code and, looking at the code, I find they rarely do... Let's sit in on a code review".
I've tended to use microservices in limited cases where the system had to serve a few requests that had radically different performance requirements, particularly memory utilization. I had a PHP server, for instance, that served exactly one URL for which PHP was not a good fit; a specialized server in another language for that one URL gave like 1000x better performance and saved money because we didn't need a much bigger PHP server.
Using Spring or Guice in the Java world it is frequent that people write "services" that are simply objects implementing some interface, which are injected by the framework. In a case like that you can imagine a service having either an in-process implementation or an out-of-process implementation (e.g. via a web endpoint or some RPC.) Frameworks like that are normally thinking at the level of "let's initialize one application in one address space at a time", but it would be nice to see something oriented towards managing applications that live in various address spaces.
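The pattern is easy to sketch outside Java too; here is the idea in Python (ItemService and both implementations are hypothetical), where the caller never has to know which address space the service lives in:

    from typing import Protocol
    import requests

    class ItemService(Protocol):
        def get_title(self, item_id: int) -> str: ...

    # In-process implementation: just an object in the same address space
    class LocalItemService:
        def __init__(self, titles: dict[int, str]):
            self._titles = titles
        def get_title(self, item_id: int) -> str:
            return self._titles[item_id]

    # Out-of-process implementation: same interface, backed by a web endpoint
    class RemoteItemService:
        def __init__(self, base_url: str):
            self._base_url = base_url
        def get_title(self, item_id: int) -> str:
            item = requests.get(f"{self._base_url}/api/item/{item_id}").json()
            return item["metadata"]["title"]

Deciding which implementation a caller gets wired up with is exactly the kind of decision a framework could make across address spaces instead of within one.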
Trouble is that some people get this squee when they hear they can use JDK 9 for this project and JDK 10 for another project and JDK 11 for yet another, and they'd rather die than eschew the badly broken Python 3.5 for something better. If you standardized absolutely everything I think you could be highly productive with microservices, because you wouldn't have to face gear switching or deal with teams who just don't know that XML serialization worked completely differently in JDK 7 vs JDK 8, so the services they make don't quite communicate properly, etc.
When I hear "gift card" I reach for my gun. Relatives have been warned that nobody in my nuclear family will accept gift cards for a gift.
(Last time it happened my sister-in-law bought a Burger King gift card for my son although she didn't know that Burger King had gone out of business in my town. They had trouble spending all of it after making several out-of-town trips. I made my son give it to me after he screwed up. I almost used it down at the Oculus in NYC but didn't want to wait in line for 20 minutes to wait another 20 minutes to get a burger and just ate at the salad place across the street instead.)
I don't know exactly how, but I know stablecoins will eventually lead to big trouble. In 2025 it seems easy to run an honest stablecoin because with current high interest rates you can put the money in US treasuries or a bank account, collect some nice yield, spend a little to run the system, and make a profit. Trouble is that a lot of people won't be happy to hold a coin that pays 0% yield when you can find a bank account that pays 4%, so people are going to insist on interest-bearing coins, and the trouble with that is that people aren't going to think 4% is enough, and that's going to lead people into things that aren't honest and aren't safe.
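Back of the envelope, with made-up numbers, for why the honest version works today and why holders will push back:

    reserves = 1_000_000_000      # $1B of coins outstanding, fully backed
    treasury_yield = 0.04         # roughly what short-term treasuries pay now
    operating_costs = 10_000_000  # assumed cost to run the system

    issuer_profit = reserves * treasury_yield - operating_costs  # $30M/yr
    holder_yield = 0.0            # the coin itself pays nothing

The issuer keeps the whole spread while the holder earns 0%, which is exactly the gap that pushes people toward interest-bearing coins and then toward chasing more than 4%.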
I used to be a CNBC junkie. Before there was crypto I used to enjoy adopting a penny stock and watching the ticker for it very closely; you can learn a lot about market dynamics when you are trading a stock where buying $2000 of it is 30% of the volume for the day. (Try $KBLB for a stock where, if you think the price is too high or too low, you will find both opinions vindicated if you wait long enough.)
Also those trades occurring "millions and millions" of times a day as opposed to a new pope every decade or so.
The comparison that I think matters is that the Pope and the Dalai Lama are the best-known religious leaders there are. I mean there used to be Billy Graham and the Ayatollah Khomeini, but I think most people would struggle to name the leader of the Methodist church or of Nichiren Buddhism, or a rabbi of more than local importance.
Is that supposed to affect the comparison? A million trades seen by a thousand people each* isn't impressive, and numbers like that happen in all kinds of situations.
This is about the huge number of people knowing about a single event right away.
* I say a thousand here because even someone glued to every number on CNBC is parsing nowhere near millions of numbers. A much smaller sliver of people will see each of those individual trades.
I've found $30 sneakers at Wal-Mart that compare to $90 sneakers from Brooks: they fit my feet like the Brooks do, and while they don't last as long as the Brooks, they last more than 1/3 as long.
Trouble is those were only available for a short time, I've gone back and there's been nothing that good.
It is not like it is going better for the AP1000 or NuScale. Including financing, the APR1400 bid in Czechia again comes out to similar equivalent costs.