Hacker News | debugnik's comments

> run-time performance is much worse (2.5x slower than LLVM -O0)

How come? The Copy-and-Patch Compilation paper reports:

> The generated code runs [...] 14% faster than LLVM -O0.

I don't have time right now to compare your approach and benchmark to theirs, but I would have expected comparable performance from what I had read back then.


The paper is rather selective about the benchmarks and baselines it uses. They do two comparisons (3 microbenchmarks and a re-implementation of a few rather simple database queries) against LLVM -- and they have written all benchmarks themselves through their own framework. These benchmarks start from their custom AST data structures, and they have their own way of generating LLVM-IR. For the non-optimizing LLVM back-end, the performance obviously depends strongly on how the IR is generated -- they might not have put a lot of effort into generating "good IR" (i.e. IR similar to what Clang generates).

The fact that they don't compare against LLVM on larger benchmarks/functions, or on any code they haven't written themselves, makes that single number a rather questionable basis for the general claim of being faster than LLVM -O0.


This is in relation to their TPC-H benchmark, and the gap can be due to a variety of reasons. My guess would be that they can generate stencils for whole operators, which can be transformed into more efficient code at stencil generation time, while LLVM -O0 gets the operator in LLVM-IR form and can do no such transformation. Though I can't verify this, because their benchmark setup seems a bit more involved.

When used in a C/C++ compiler, the stencils correspond to individual (or a few) LLVM-IR instructions, which leads to bad runtime performance. Also, as mentioned, on larger functions register allocation becomes a problem for the Copy-and-Patch approach.


And they'll gladly close it, them and every other bank. We lack alternatives, so we lack leverage.

> This library is licensed under the GPLv3.

If the intention was to make it easier to spread the word, you've already failed.

Anyway, this whole library should have been a copy-pastable snippet for a dialog or toast (what's with the duplicate code?); the only value added is the translation, which most app devs already have a pipeline for.

The code part is so trivial that I suspect it doesn't even meet the legal bar for copyright protection in many jurisdictions.


> Anyway, this whole library should have been a copy-pastable snippet for a dialog or toast

People undervalue copy-pasting. I'd rather copy/vendor a thousand lines of code (with license+credit intact) than add it as a dependency.

I'm working on a side project and needed a CPIO library for Go. CPIO is a fixed thing; a good implementation is "done". U-root[1] has a really decent implementation, so I've vendored 2500+ lines of code, as otherwise I'd have to (indirectly) depend on almost 700,000. Great value.

[1]: https://github.com/u-root/u-root
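
To give a sense of how "done" the format is: the newc variant is just a 6-byte magic plus 13 fixed-width ASCII-hex fields, followed by the file name and data, each padded to 4 bytes. A rough sketch of decoding one header in Go, purely illustrative and not u-root's actual API:

    package cpio // illustrative package name, not u-root's

    import (
        "fmt"
        "strconv"
    )

    // hexField reads the i-th 8-character ASCII-hex field after the 6-byte magic.
    func hexField(h []byte, i int) uint64 {
        v, _ := strconv.ParseUint(string(h[6+8*i:6+8*(i+1)]), 16, 64)
        return v
    }

    // ParseNewc decodes the fixed 110-byte "newc" (SVR4) cpio header; the name
    // and the file data follow, each padded to a multiple of 4 bytes.
    func ParseNewc(h []byte) (mode, nameSize, fileSize uint64, err error) {
        if len(h) < 110 || string(h[:6]) != "070701" {
            return 0, 0, 0, fmt.Errorf("not a newc header")
        }
        // c_mode, c_namesize, c_filesize
        return hexField(h, 1), hexField(h, 11), hexField(h, 6), nil
    }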


Yeah this is very

    npm i is-even

OP, I recommend switching to the LGPLv3. It ensures users remain in control over your part of the code while avoiding this type of reaction.

Not really; it would maybe have avoided the first paragraph. I actually really like copyleft, but I assume the social statement here is more important than the code, so making it easier to rally around should be the priority.

A CC0 copy-pastable snippet, plus maybe this helper library with a permissive licence. The only way this takes off is through slacktivism, so you need to remove any friction.


changed it to Apache V2.0 license

That's more fitting! I wish I had a popular app to spread the word from, I do like the spirit of your project.


> Do they even bother with paying human authors now

I thought Medium was a stuck-up blogging platform. Other than for paid subscriptions, why would they pay bloggers? Are they trying to become the next HuffPost or something?


A couple of years ago the OCaml and Julia languages already had to deal with a content farm that created wikis for them, filled them with LLM-generated, blatantly wrong or stupidly low-quality content, and SEOed its way above actual learning materials. Cue the newbies to these languages being incredibly confused.

This at least tries to generate the text out of the actual project, but I'm pessimistic and think it'll cause similar confusion.


> I was a huge fan of Podman, but I eventually gave up and use Docker Compose

You can mix them. I was using docker-compose with podman instead of docker before switching to quadlets. I still prefer the experience of compose files, but quadlets do integrate much better into systemd.
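
For anyone curious, a quadlet is roughly one compose service rewritten as a small unit file that systemd generates a service from. A minimal sketch (file path, image and port are made up):

    # ~/.config/containers/systemd/web.container
    [Unit]
    Description=Example web container

    [Container]
    Image=docker.io/library/nginx:alpine
    PublishPort=8080:80

    [Service]
    Restart=always

    [Install]
    WantedBy=default.target

After a "systemctl --user daemon-reload" it starts and logs like any other unit (systemctl --user start web, journalctl --user -u web).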


You already provided proof of a living legal identity when you got the ID, and it already expires to make you provide proof again every few years.


That's not the kind of proof of life the government and companies want online. They want to make sure their video identification 1) is of a living person right now, and 2) that the living person matches their government ID.

It's a solution to the "grandma died but we've been collecting her Social Security benefits anyway", or "my son stole my wallet with my ID & credit card", or (god forbid) "We incapacitated/killed this person to access their bank account using facial ID".

It's also a solution to the problem advertisers, investors and platforms face of 1) wanting huge piles of video training data for free and 2) determining that a user truly is a monetizable human being and not a freeloader bot using stolen/sold credentials.


> That's not the kind of proof of life the government and companies want online.

Well, that's your assumption about governments, but it doesn't have to be true. There are governments that don't try to exploit their people. The question is whether such governments can have technical solutions to achieve that (I'm genuinely interested in understanding whether or not it's technically feasible).


It's the kind of proof my government already asks of me to sign documents much, much more important than watching adult content, such as social security benefits.


It happens once if the user agent keeps a cookie that can be used for rate limiting. If a crawler hits the limit they need to either wait or throw the cookie away and solve another challenge.
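
A sketch of what the server side of that can look like, with the PoW cookie's value as the rate-limit key (cookie name, window and limit are made up; not Anubis's actual implementation):

    package main

    import (
        "net/http"
        "sync"
        "time"
    )

    // limiter counts requests per PoW cookie in a fixed one-minute window.
    type limiter struct {
        mu     sync.Mutex
        window time.Time
        hits   map[string]int
    }

    func (l *limiter) allow(key string, max int) bool {
        l.mu.Lock()
        defer l.mu.Unlock()
        if time.Since(l.window) > time.Minute {
            l.window, l.hits = time.Now(), map[string]int{}
        }
        l.hits[key]++
        return l.hits[key] <= max
    }

    func rateLimit(next http.Handler) http.Handler {
        l := &limiter{window: time.Now(), hits: map[string]int{}}
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            c, err := r.Cookie("pow-token") // hypothetical cookie name
            if err != nil {
                http.Error(w, "solve the challenge first", http.StatusForbidden)
                return
            }
            // Over the limit: the client has to wait, or throw the cookie
            // away and pay for another challenge.
            if !l.allow(c.Value, 60) {
                http.Error(w, "rate limited", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok\n")) })
        http.ListenAndServe(":8080", rateLimit(mux))
    }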


Can that cookie then be used across multiple IPs?

That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all, which is more of a plus.

This, however, forces servers to increase the challenge difficulty, which increases the waiting time for first-time access.


Obviously the developer of Anubis thinks it is bypassing: https://github.com/TecharoHQ/anubis/issues/978


Fair, then I obviously think Xe may have a kinda misguided understanding of their own product. I still stand by the concept I stated above.


latest update from Xe:

> After further investigation and communication. This is not a bug. The threat actor group in question installed headless chrome and simply computed the proof of work. I'm just going to submit a default rule that blocks huawei.


this kinda proves the entire project doesn't work if they have to resort to manual IP blocking lol


It doesn't work for headless Chrome, sure. The thing is that often, for threats like this to work, they need lots of scale, and they need it cheaply, because the actors are just casting a wide net and hoping to catch something. Headless Chrome doesn't scale cheaply, so by forcing script kiddies to use it you're pricing them out of their own game. For now.


Doesn't have to be black or white. You can have a much easier challenge for regular visitors if you block the only (and giant) party that has implemented a solver so far. We can work on both fronts at once...

The point is that it isn't "implementing a solver", it's just using a browser and waiting a few seconds.

That counts as something that can solve it, yes. Apparently there's now exactly one party in the world that does that (among the annoying scrapers that this mechanism targets). So until there are more...

The point is that it will always be cheaper for bot farms to pass the challenge than for regular users.


Why does that matter? The challenge needs to stay expensive enough to slow down bots, but legitimate users won't be solving anywhere near the same number of challenges, and the alternative is the site getting crawled to death, so they can wait once in a while.


It might be a lot closer if they were using Argon2 instead of SHA. SHA is kind of a bad choice for this sort of thing.


Too bad the challenge's result is only a waste of electricity. Maybe they should do what some of those alt-coins do and search for prime numbers or something similar instead.


Most of those alt-coins are kind of fake/scams. It's really hard to make it work with actually useful problems.


Of course that doesn't directly help the site operator. Maybe it could actually do a bit of bitcoin mining for the site owner. Then that could pay for the cost of accessing the site.


this only holds true if the data to be accessed is less valuable than the computational cost. in this case that is false, and spending a few dollars to scrape the data is more than worth it.

reducing the problem to a cost issue is bound to be short-sighted.


This is not about preventing crawling entirely, it's about finding a way to prevent crawlers from re-crawling everything way too frequently just because crawling is very cheap. Of course it will always be worth it to crawl the Linux kernel mailing list, but maybe with a high enough cost per crawl the crawlers will learn to be fine with only crawling it once per hour, for example.


my comment is not about preventing crawling; it's stating that with how much revenue AI is bringing in (real or not), the value of crawling repeatedly >>> the cost of running these flimsy coin-mining algorithms.

At the very least a captcha tries to make the human/AI distinction, but these algorithms are purely on the side of making it "expensive". If it's just a capital problem, then it's not a problem for the big corps who are the ones incentivized to do this in the first place!

even if human captcha solvers are involved, at least that provides society with some jobs (useless as they may be), but these mining algorithms do society no good and waste compute for nothing!


> as AI scrapers bother implementing the PoW

That's what it's for, isn't it? Make crawling slower and more expensive. Shitty crawlers not being able to run the PoW efficiently or at all is just a plus. Although:

> which is trivial for them, as the post explains

Sadly the site's being hugged to death right now so I can't really tell if I'm missing part of your argument here.

> figure out that they can simply remove "Mozilla" from their user-agent

And flag themselves in the logs to get separately blocked or rate limited. Servers win if malicious bots identify themselves again, and forcing them to change the user agent does that.


> That's what it's for, isn't it? Make crawling slower and more expensive.

The default settings produce a computational cost of milliseconds for a week of access. For this to be relevant, it would have to be significantly more expensive, to the point that it would interfere with human access.


I thought the point (which the article misses) is that a token gives you an identity, and an identity can be tracked and rate limited.

So a crawler that behaves ethically and puts very little strain on the server should indeed be able to crawl for a whole week on cheap compute, while one that hammers the server hard will not.


Sure, but it's really cheap to mint new identities; each node on their scraping cluster can mint hundreds of thousands of tokens per second.

Provisioning new IPs is probably more costly than calculating the tokens, at least with the default difficulty setting.


...unless you're sus, then the difficulty increases. And if you unleash a single scraping bot, you're not a problem anyway. It's for botnets of thousands, mimicking browsers on residential connections to make them hard to filter out or rate limit, effectively DDoSing the server.

Perhaps you just don't realize how much the scraping load has increased in the last 2 years or so. If your server can stay up after deploying Anubis, you've already won.


How is it going to hurt those?

If it's an actual botnet, then it's hijacked computers belonging to other people, who are the ones paying the power bills. The attacker doesn't care that each computer takes a long time to calculate. If you have 1000 computers each spending 5s/page, then your botnet can retrieve 200 pages/s.

If it's just a cloud deployment, still it has resources that vastly outstrip a normal person's.

The fundamental issue is that you can't serve example.com slower than a legitimate user on a crappy 10-year-old laptop could tolerate, because that starts losing you real human users. So if, let's say, a user is happy to wait 5 seconds per page at most, then this is absolutely no obstacle to a modern 128-core Epyc. If you make it troublesome for the 128-core monster, then no normal person will find the site usable.


It's not really hijacked computers; there is a whole market for VPNs with residential exit nodes.

The way I think it works is they provide a free VPN to the users, or even pay their internet bill, and then sell access to their IP.

The client just connects to the VPN and gets a residential exit IP.

The cost of the VPN is probably higher than the cost of the proof of work, though.


> How is it going to hurt those?

In an endless cat-and-mouse game, it won't.

But right now, it does, as these bots tend to be really dumb (presumably, a more competent botnet user wouldn't have it do the equivalent of copying Wikipedia by crawling every single page in the first place). With a bit of luck, that will be enough until the bubble bursts and the problem is gone, and you won't need to deploy Anubis just to keep your server running anymore.


The explanation of how the estimate is made is more detailed, but here is the referenced conclusion:

>> So (11508 websites * 2^16 sha256 operations) / 2^21, that’s about 6 minutes to mine enough tokens for every single Anubis deployment in the world. That means the cost of unrestricted crawler access to the internet for a week is approximately $0.

>> In fact, I don’t think we reach a single cent per month in compute costs until several million sites have deployed Anubis.
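
For context, the challenge being discussed is essentially hashcash over SHA-256: find a nonce such that sha256(challenge || nonce) has enough leading zero bits. A difficulty of 16 bits matches the ~2^16 hashes quoted above, which a single core gets through in tens of milliseconds. A minimal solver sketch, not Anubis's actual code:

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "math/bits"
        "time"
    )

    // leadingZeroBits counts the leading zero bits of a 32-byte digest.
    func leadingZeroBits(sum [32]byte) int {
        n := 0
        for _, b := range sum {
            if b == 0 {
                n += 8
                continue
            }
            n += bits.LeadingZeros8(b)
            break
        }
        return n
    }

    // solve finds a nonce such that sha256(challenge || nonce) has at least
    // `difficulty` leading zero bits. Hashcash-style sketch only.
    func solve(challenge []byte, difficulty int) uint64 {
        buf := make([]byte, len(challenge)+8)
        copy(buf, challenge)
        for nonce := uint64(0); ; nonce++ {
            binary.LittleEndian.PutUint64(buf[len(challenge):], nonce)
            if leadingZeroBits(sha256.Sum256(buf)) >= difficulty {
                return nonce
            }
        }
    }

    func main() {
        start := time.Now()
        nonce := solve([]byte("example-challenge"), 16) // ~2^16 hashes on average
        fmt.Println("nonce", nonce, "found in", time.Since(start))
    }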


If you use one solution to browse the entire site, you're linking every pageload to the same session, and can then be easily singled out and blocked. The idea that you can scan a site for a week by solving the riddle once is incorrect. That works for non-abusers.


Well, since they can get a unique token for every site every 6 minutes using only a free GCP VPS, that doesn't really matter: scraping can easily be spread out across tokens, or they can cheaply and quickly get a new one whenever the old one gets blocked.

Wasn't SHA-256 designed to be very fast to compute? They should be using bcrypt or something similar.


Unless they require a new token for each request, or every x minutes or something, it won't matter.

And as the poster mentioned, if you are running an AI model you probably have GPUs to spare. Unlike the dev working from a 5-year-old ThinkPad or their phone.


Apparently bcrypt has a design that makes it difficult to accelerate effectively on a GPU.

Indeed, a new token should be required per request; the tokens could also be pre-calculated, so that while the user is browsing a page, the browser could compute tokens for the next likely browsing targets (e.g. the "next" button).

The biggest downside I see is that mobile devices would likely suffer. Possibly the difficulty of the challenge is/should be varied by other metrics, such as the number of requests arriving per time unit from a class C network, etc.


That's a matter of increasing the difficulty, isn't it? And if the added cost is really negligible, we can just switch to a "refresh" challenge for the same added latency and without burning energy for no reason.


If you increase the difficulty much beyond what it currently is, legitimate users end up having to wait for ages.


And if you don't increase it, crawlers will DoS the sites again and legitimate users will have to wait until the next tech hype bubble for the site to load, which is why software like Anubis is being installed in the first place.


If you triple the difficulty, the cost of solving the PoW is still negligible to the crawlers, but you've harmed real users even more.

The reason Anubis works is not the PoW; it's that the dev time needed to implement a bypass takes out the lowest-effort bots. Thus the correct response is to keep the PoW difficulty low so you minimize harm to real users. Or, better yet, implement your own custom check that doesn't use any PoW and relies on ever-higher obscurity to block the low-effort bots.

The more Anubis is used, the less effective it is and the more it harms real users.


I am guessing you don't realize that this means people not using the latest generation of phones will suffer.


I'm not using the latest generation of phones, not in the slightest, and I don't really care, because the alternative to Anubis-like interstitials is the sites not loading at all when they're mass-crawled to death.


> Sadly the site's being hugged to death right now

Luckily someone had already captured an archive snapshot: https://archive.ph/BSh1l

