The most copied StackOverflow snippet of all time is flawed (2019) (programming.guide)
368 points by Decabytes on Sept 27, 2023 | 233 comments



I find it interesting that the answers using hardcoded values / if statements (or while) are all doing up to five comparisons.

It goes B, KiB, MiB, GiB, TiB, EiB and no more than that (in all the answers), so that can be solved with three if statements at most, not five.

I mean: if it's greater than or equal to GiB, you know it won't be B, KiB or MiB. Dichotomy search for the win!

Not one of the hardcoded solutions does it that way.

Now let's go up to ZiB and YiB: still only three if statements at most, vs up to seven for the hardcoded solutions.

I mention it because I personally would definitely not go for the whole log/pow/floating-point approach if I had to write a solution myself (because I know all too well the SNAFU potential).

I'd hardcode if statements... But while doing a dichotomy search. I must be an oddball.
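To make that concrete, a minimal sketch in Java (my own, hypothetical: binary units only, assumes a non-negative input, and truncates instead of rounding, so it sidesteps the article's rounding question entirely):

    // Seven units, at most three comparisons per call.
    static String humanReadable(long bytes) {
        if (bytes >= 1L << 30) {
            if (bytes >= 1L << 50)
                return bytes >= 1L << 60 ? (bytes >> 60) + " EiB"
                                         : (bytes >> 50) + " PiB";
            return bytes >= 1L << 40 ? (bytes >> 40) + " TiB"
                                     : (bytes >> 30) + " GiB";
        }
        if (bytes >= 1L << 10)
            return bytes >= 1L << 20 ? (bytes >> 20) + " MiB"
                                     : (bytes >> 10) + " KiB";
        return bytes + " B";
    }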

P.S: no horse in this race, no hill to die on, and all the usual disclaimers


I would expect your binary search solution to be slower than just doing 6 sequential checks, because the latter only ever takes one branch (the others fall through, which predicts well). Mispredicted branches are very slow. You want to keep code going in a straight line as much as possible.


Yup, know your hardware and know your problem. Dichotomic search is wonderful when your data can't fit in RAM and it starts being more efficient to cut down on the number of nodes traversed.

For a problem space limited by your input size (a signed 64-bit number) to a 6-entry dictionary? At best you may want to optimize some in-lining or compiler hints if your language supports it. Maybe set up some batching operations if this is called hundreds of times a frame, so you're not creating/destroying the stack frame every time (even then, the compiler can probably optimize that).

But otherwise, just throw that few-dozen-byte lookup table into the registers and let the hardware chew through it. Big-O notation isn't needed for data at this scale.


It depends on the input distribution. If it’s very common to have smaller values then the linear search could be superior.


Your comment and mine are basically the same. This is what I call terrible engineering judgement. A random co-worker could review the simple solution without much effort. They could also see the corner cases clearly and verify the tests cover them. With this code, not so much. It seems like a lot of work to write slower, more complex, harder to test and harder to review code.



Thanks! Macroexpanded:

The most copied StackOverflow snippet of all time is flawed (2019) - https://news.ycombinator.com/item?id=27533684 - June 2021 (334 comments)

The most copied StackOverflow snippet of all time is flawed - https://news.ycombinator.com/item?id=21698619 - Dec 2019 (88 comments)

The most copied StackOverflow snippet of all time is flawed - https://news.ycombinator.com/item?id=21693431 - Dec 2019 (3 comments)


I don't understand. There are 7 suffixes, can't you pick the right one with binary search? That would be 3 comparisons. Or just do it the dumb way and have 6 comparisons. How are two log() calls, one pow() call and ceil() better than just doing it the dumb way? The bug being described is a perfect example of trying to be too clever.


The author apparently went back to using a loop after recognizing that it's not readable: https://programming.guide/java/formatting-byte-size-to-human...

Notably, it's still slightly better than the first code example in the original article, as it takes the rounding bug into account.
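From memory, the loop version is in the spirit of this sketch (not the author's verbatim code; the 999_950 cutoff is what absorbs the rounding bug, so that e.g. 999,999 bytes prints as "1.0 MB" rather than "1000.0 kB"):

    import java.text.CharacterIterator;
    import java.text.StringCharacterIterator;

    // SI units (powers of 1000), one decimal of output.
    static String humanReadableSI(long bytes) {
        if (-1000 < bytes && bytes < 1000) return bytes + " B";
        CharacterIterator ci = new StringCharacterIterator("kMGTPE");
        while (bytes <= -999_950 || bytes >= 999_950) {
            bytes /= 1000;
            ci.next();
        }
        return String.format("%.1f %cB", bytes / 1000.0, ci.current());
    }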


The author says at the beginning that it’s not actually better than the loop.

Also, 6 comparisons only happen if you have the max value, which seems unlikely in actual usage. Linear could be better if most of the time values are in the B or KB ranges.


Shameless plug: as another option to format sizes in a human-readable way quickly and correctly (other than copying from S/O), you can use one of our open source PrettySize libraries, available for Rust [0] and .NET [1]. They also make performing type-safe logical operations on file sizes safe and easy!

The snippet from S/O may be four lines, but these are much more extensive: they come with tests, output formatting options, conversion between sizes, and more.

[0]: https://github.com/neosmart/prettysize-rs

[1]: https://github.com/neosmart/PrettySize.net


Replacing 4 line solutions with extensive libraries is what caused left-pad.


I understand where you're coming from here, but the whole point of this article is that the 4-line solution is wrong (and the author specifically mentioned that every other answer on the Stack Overflow post was wrong in the same way as well). "Seemingly-simple problem where every naïve solution contains a subtle bug" is exactly the right use case for a well-designed library method.


> “It’s wrong”

But in a completely benign way. I question why a few edge cases of writing 1000kb instead of 1Mb—so not even a misrepresentation—would ever be worth the code bloat. This is about making stuff slightly more convenient to read.


I agree with you— that was a lot of drumming for what turned out to be kind of a nothingburger as far as the "bug".

At the same time, putting this kind of thing in a library (or even a language's stdlib) is worthwhile for exactly this kind of reason— it allows devs to confidently reach for code that other smart people have really agonized over and which definitely covers the corner cases, similar to other common utilities such as sort methods.


One example: I display available memory in my status bar, which expects strings to be a constant width. If it displayed 1000kb, it would cause alignment issues and annoy the heck out of me


Yeah, copying an incorrect answer from SO thousands of times is much better!

(The subject at hand isn't whether libraries are good or not, it's whether copying something off the internet is. In the post, it turns out it isn't. If it was a library, the author could have fixed and updated the library, and the issue would be fixed for everyone that uses it. left-pad isn't an issue with libraries per se, it's an issue with library management)


No. left-pad was placing a 4-line solution in a library. prettysize is well deserving of library status.


What caused left-pad is the ability to delete published code


You should see the implementation of `std::midpoint`[1].

Accounting for correctness even in edge-cases is what large libraries do better than throwaway bits of code.
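(The textbook case it guards against, as a hypothetical sketch in Java rather than the STL's actual code:)

    // The naive midpoint overflows once lo + hi exceeds the type's range.
    static long midNaive(long lo, long hi) { return (lo + hi) / 2; }

    // The standard overflow-free form, assuming lo <= hi.
    static long midSafe(long lo, long hi) { return lo + (hi - lo) / 2; }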

[1]: https://github.com/microsoft/STL/blob/6735beb0c2260e325c3a4c...


Out of curiosity, is there a sizable number of developers that just copy and paste untrusted code from StackOverflow into their applications?

The claim that people just copy from StackOverflow is obviously popular, but I always thought it was just conjecture and humor until I saw someone do it. Don't get me wrong, I use StackOverflow to give me a head start on solving a problem in an area I'm not as familiar with yet, but I've never just straight copied code from there. I don't do that because rarely does the snippet do exactly and only exactly what I need. It requires me to look at the APIs and form my own solution from the explained approach. StackOverflow has pointed me in the direction of some niche APIs that are useful to me, especially in Python.


I once worked with a developer who wouldn’t let anything come between him seeing an answer and copying it into his code. He wasn’t even reading the question to make sure it was the same problem he was having, let alone the answer. He would literally go Google => follow the first link to Stack Overflow he saw => copy and paste the first code block he saw. Sometimes it wasn’t even the right language. People had to physically take the input away from him if they were pairing with him because there was nothing anybody could say to stop him, and if you tried to tell him it wasn’t right then he’d just be pasting the second code snippet on the page before you could get another word out. He was freakishly quick at it.

Now he was an extreme case, but yes, there are a lot of developers out there with the mindset of “I need code; Stack Overflow has code; problem solved!” that don’t put any thought at all into whether it’s an appropriate solution.


During a hiring round nearly two decades ago, we realised something was off with the answers to the usual pre-phone-interview screening questions. They were simple, and we asked people to only spend like 20 minutes on them. We knew people would "cheat", but they were only there to lighten our load a little bit, so it was ok if they let through some bad candidates.

But for whatever reason, in one hiring round the vast majority had cut and pasted answers from search results verbatim (we dealt with a new recruiter, and I frankly suspected this new recruiter was telling them this was ok despite the instructions we'd given).

These were not subtle. But the very worst was one who behaved just like the developer you described: he'd found a forum post about a problem pretty close to the question and had cut and pasted the code from the first answer he found.

He'd not even bothered to read a few comments further down in the replies where the answer in question was totally savaged by other commenters explaining why it was entirely wrong.

This was someone who was employed as a senior developer somewhere else, and it was clear in retrospect looking at his CV that he probably kept "fleeing the scene of the crime" on a regular basis before it was discovered he was a total fraud. We regularly got those people, but none that delivered such obviously messed up answers.

For every developer like this, you're probably right that there will be a lot more who are less extreme about it, and more able to make things work well enough that they're not discovered.


It is hard for some people to grasp the sheer amount of fraud in this industry. A while back I worked with two guys, one with a Master's and the other with a PhD. One day they came to me asking for help, because the program they'd written (in Python) wouldn't run. It was supposed to analyze some text, and spit out whatever the result of the analysis was.

The problem? They were passing the input text as hardcoded plaintext, i.e. it wasn't even a string with quotes or anything -- just `foo(here is my raw, non-string input, no quotes necessary lol)`, and they could not conceive of what the issue might be.


That has to be bug blindness? I.e. they have decided that there is no bug at that line, and can't see it afterwards. How could they even write the program in the first place, if they were not aware of string literals?


Did they write code in notepad? How did that not get detected by the LSP?


This is like grading calculus exams. Student gives the memorized answer which most resembles (in his mind) the question asked.


If you're paying a developer by the hour, and want your app released in the app store using as few hours as possible, then this approach can be the most cost efficient one.

Sure, it isn't good practice. Sure, it probably isn't what NASA should be doing. But if you're literally building yet another uber-like app, you probably shouldn't be spending too long thinking about details.


> this approach can be the most cost efficient one.

No it can’t. Quick and dirty? Sure. Take on some tech debt to get to market quicker. Blindly copying and pasting? You’re never going to build functional software that way. This guy was committing code with syntax errors that he’d obviously never even run. How are you going to get to market quickly that way?


The comment you're responding to said the guy was copying the wrong language at times. Code that won't even compile isn't making it into the app store.


Yeah, those details like whether or not it works really don't matter. NASA is overrated.


Rarely are things so black and white. If you're just pushing out an MVP, something that takes 5 seconds and is 95% correct is often better than 30 minutes and 100% correct.


I'm willing to entertain the idea that copy/paste from SO may the right option in some cases, but you have to apply at least a little scrutiny. I'm not sure exactly where the bar should be for an MVP, but "[s]ometimes it wasn’t even the right language" is definitely below it.


Maybe if you don't give a fuck about your users or the future maintainers. But for the time span of just 30m, you can make sure there are no bugs and that it's easy to maintain. MVP or not, you're still a bad engineer if you actually do this.

Correct and broken are black and white if you can divide the problem correctly, and there's no excuse for shipping broken code. At some point someone has to take responsibility for not shipping garbage. I get that you, me, or any engineer don't always have that luxury, but it should be a shameful thing not something you accept as normal or ok.


Maybe spending 30 minutes on one bug is worth it, maybe not. If you're pre-revenue / pre-product-market-fit and you compound tens to hundreds of these 5s to 30m decisions, you're risking running out of time or money before anyone even uses your product.

I would argue it's much worse "engineering" to have no product at all.


> pre-product-market-fit

Is that a euphemism for wandering around aimlessly? Committing random code to see what works? That's also not good engineering....

Not saying it won't end in the outcome you want; people gamble all the time. I'm just saying it's bad engineering.


I mean, the default of any new company is pre-product-market-fit, no? How else could you start something new? During such an early stage much of your code may be written as a very rough MVP that you're only really using to validate a concept. Sometimes, you're going to just have to trash all of it because the idea was totally wrong and people don't actually care about the problem you're solving.

Those, among others, are the types of cases where spending extra time getting something exactly right (even if just a few hours) is just not worth it.


No, the vast majority of companies are entering a mature market with a product that is based on what the market wants, but with their own value proposition.

It seems that only in Silicon Valley startups and the like do people start companies with only the vaguest idea of what they are actually going to build and no idea of whether or not they're solving an actual problem that anyone cares about.


That's not software development. That's wild guessing.


I've seen "wild guessing" quite a bit when people don't actually understand the problem they're solving. Mostly students, but it happens in professional contexts as well.

I'm not sure why, maybe people are missing knowledge that would allow them to understand, so they just try random things in the hope that it works? It surprises me every time it happens.


To some that is the same. Try and modify until it sort of works.


I don't think you would be able to solve complex problems or development tasks with such an approach as described above (if that's what you're referring to). That's something I could expect from a bloody junior, but not from a seasoned professional.


In combination with ChatGPT-4 and the like and lots of iterations, you probably would get pretty far today. But I agree, this is not software engineering and I would not use such code for anything important. But if someone makes a (sandboxed) game with it, it still might be fun.


Just out of curiosity… what was his salary and how long did it take to fire him? Did they fire the HR manager as well?


No idea. I left before he did.


This is basically how GitHub Copilot works.


Worse, even, because some of the copy/pasters at least remember to paste in the StackOverflow URL too. GitHub Copilot doesn't even give you that.


> People had to physically take the input away from him if they were pairing with him because there was nothing anybody could say to stop him, and if you tried to tell him it wasn’t right then he’d just be pasting the second code snippet on the page before you could get another word out. He was freakishly quick at it.

Sounds like this guy understands concurrency. :)


Just wait til that guy discovers ChatGPT.


I won’t be surprised if that guy is ChatGPT’s main audience.

Personally I can’t see how it would be faster to ask ChatGPT for an answer then carefully scrutinize the output to make sure I understand what it’s doing. Code is often easier to write than read - especially when it’s not your code.

In hindsight the solution is obvious, just run the code without reading it then try to fix it if it doesn’t produce acceptable results.


ChatGPT could help this dev if they understood the problems they are trying to solve. That is such a fundamental flaw in this. They will be on a PIP and out of a job in any respectable workplace. That would be a mercy.


Yes, and it happens more for things that feel out of scope for the part of the program that I'm interested in. After all, we import library code from random strangers into our programs all the time for the parts we consider "plumbing" and beneath notice. If I wanted to dig in and understand something, I would be more likely to write my own. But if I want this part over here to "just work" so I can get on with the project, it's compiler-error-driven development.


Same, and even more so if it's something that feels like it should be in the library code in the first place.

My most copy-pasted code is projecting a point onto a line segment. I end up needing it all the time, it's never in whatever standard library for vector math I'm using, and it's faster to find on SO than to find and translate the code out of whatever my last project that needed it is. Way faster than re-deriving it.

Your vector math library is probably already code imported from random strangers, likely even imported by random strangers, so adding one more function from a random stranger feels entirely appropriate.
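For reference, the usual formulation, as a sketch in plain Java (the names are mine; every vector library spells this differently):

    // Closest point to p on the segment [a, b], in 2D.
    static double[] projectOntoSegment(double ax, double ay, double bx, double by,
                                       double px, double py) {
        double dx = bx - ax, dy = by - ay;
        double lenSq = dx * dx + dy * dy;
        if (lenSq == 0) return new double[] { ax, ay }; // degenerate segment
        // Normalized position of the projection along the segment,
        // clamped to [0, 1] so the result stays on the segment.
        double t = ((px - ax) * dx + (py - ay) * dy) / lenSq;
        t = Math.max(0, Math.min(1, t));
        return new double[] { ax + t * dx, ay + t * dy };
    }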


I hardly ever just copy and paste, for the exact reason the author talks about. Instead, I try to make sense of the solution, and if I have to, I'll hand-copy it line by line to make sure I properly understand it, and refactor from there. I also rename variables, since often there are so many foos and bars and bazes that it's completely unreadable by a human.

Also if I come across the problem a second time, I'll have better luck remembering what I did (as opposed to blindly copying).


Yes, people do that. After looking at a huge amount of incorrect TLS-related code and configuration on SO, I'm now pretty sure that most systems run without validating certificates properly.


This was more true when libraries and tooling defaulted to not checking.

Somewhere in my history is a recent HN (or maybe Reddit) post where somebody insists curl has been 100% compatible from day one, and like, no: originally curl ignored certificates; today you need to ask for that explicitly if it's what you want.

I think (but don't take my word for it) that Requests (the Python library) was the same. Initially it didn't check, then years back the authors were told that if you don't check you get what you didn't pay for (ie nothing) and they changed the defaults.

Python itself is trickier because it was really hard to convince Python people that DNS names, the names we actually care about in certificates, aren't Unicode. I mean, they can be (IDNs), but not in a way that's useful to a machine. If your job is "Present this DNS name to a user" then sure, here's a bunch of tricky and maybe flawed code to best efforts turn the bytes into human Unicode text, but your cert checking code isn't a human, it wants bytes and we deliberately designed the DNS records and the certificate bytes to be identical, so you're just doing a byte-for-byte comparison.

The Python people really wanted to convert everything messily to Unicode, which is - at best if you do it perfectly - slower with the same results and at worst a security hole for no reason.

OpenSSL is at least partly to blame for terrible TLS APIs. OpenSSL is what I call a "stamp collector" library. It wants to collect all the obscure corner cases, because some of its authors are interested. Did the Belgian government standardise a 54-bit cipher called "Bingle Bongle" in 1997? Cool, let's add that to our library. Does anybody use it? No. Should anybody use it? No. But it exists so we added it. A huge waste of everybody's time.

The other reason people don't validate is that it was easier to turn it off and get their work done, which is a big problem that should be addressed systemically rather than by individually telling people "No".

So I'd guess that today out of a thousand pieces of software that ought to do TLS, maybe 750 of them don't validate certificates correctly, and maybe 400 of those deliberately don't do it correctly because the author knew it would fail and had other priorities.


Apache used to not reject SNI hostname headers ending in a dot, in contravention of RFC 6066. Firefox notoriously didn't strip the trailing dot before sending the header. Some versions of curl (or the underlying libraries?) did, some didn't. I filed a bug at bz.apache.org about it.


requests pulls in certifi (Firefox's trust store, repackaged) via urllib3, so it probably uses those root certs by default, not the system store.


To be fair that might be partly the fault of TLS libraries. There should be a single sane function that does the least surprising thing and then lower level APIs for everything else. Currently you need a checklist of things that must be checked before trusting a connection.


Oh boy, where to begin. You obviously haven't had the pleasure of working in a codebase written by Adderall-fueled 23-year-olds.


What about Adderall-fueled 35 year olds?


What about Red Bull-fueled 43 year olds?


What about retirement driven 30 year olds?



I think the section "A Study on Attribution" and the associated paper might be as good an answer as you'll get to that.


Well. You (collective you) start by copying and pasting a code snippet first, and then modifying it as needed. Does that count? If no modifications are needed, then it stays.


That's what I do. I almost always rename things to match the coding style of the codebase I'm working on, though.


Plenty of developers paste arbitrary bash commands posted on sites like GitHub without thinking, I suppose, because they look "legit". I see it similarly to you: StackOverflow (and Copilot) can be helpful to get started, but it's no substitute for understanding the code you ship.

Had an exchange like this some time ago:

Me: Hey, I'm reviewing your PR. Looks pretty fine to me. Except for this function which looks like it was copy-pasted from SO: I literally found the same function in an answer on SO (it was written in pure JS while we were using TS in our project).

Dev: Yes, everyone copies from SO.

Me: Well, in that case I hope you always copy the right thing. Because this code might run but it is not good enough (e.g. the variable names are inexpressive, it creates DOM elements without removing them after they are not needed anymore).


There really is, but people do give it a cursory read. See also: https://en.wikipedia.org/wiki/Underhanded_C_Contest


Yes. I was told from a reliable source that at one point they tried to log all the copy and paste events and it brought their systems to their knees.


I wouldn't do it in most professional settings due to licensing...

But for personal projects where I just want to get something running, then yes, I would copy paste and barely even read the code.

I don't really care about bugs like this either - I'm happy to make something that works 99% of the time, and only fix that last 1% if it turns out to be an issue.


> I wouldn't do it in most professional settings due to licensing...

Underrated comment. I think most tech companies' General Counsel would have a heart attack if they were aware of StackOverflow copy-pasting by their developers. I highly doubt some rando engineer who pastes bubblesort code into their company's code base gave even a passing thought to what license the SO code was under, what license his own company's code was under, and whether they were compatible.

The big (FAANG) tech companies I've worked at all have written policies about copying and pasting code from external sources (TLDR: Don't), but I've seen even medium-sized (~1000+) companies with zero guidance for their developers.


In the server-side JavaScript world, absolutely; it seems like it's standard practice. People are injecting entire dependencies without even remotely looking at the code, bringing in an entire library for a single function that could be accomplished in a couple of lines (and usually is, posted below the fold).


...you would not believe...

not long ago I worked on a team that actively chose libraries and frameworks based on how likely they felt it was that their questions would be answered on StackOverflow.


Yes.

This is why PHP got such a bad reputation. A lot of new developers were copying and pasting quick example code from Stack Overflow, or code from other new developers who only kind of knew what they were doing.


> This is why PHP got such a bad reputation.

I don't think that's the only reason, lol.


What? SO launched in 2008 and PHP had a bad reputation prior to that.


The point stands, it just wasn't SO they were getting the bad information from prior to 2008.


You're right, prior to that it was random forums,


and the comment section in the php.net documentation.


Less and less every day. Now they are using ChatGPT.


When I had to use Python, I felt like copy-pasting anything was out of scope due to indentation errors.


Millions.


Wait til you find out about chatGPT


I don't understand why you'd use floating point logarithms if you want log 2?

Unless I'm missing something, this gives you an accurate value of floor(log2(value)) for anything positive less than 2^63 bytes, and it's much faster too:

  Long.bitCount( (Long.highestOneBit(value) << 1) - 1) - 1
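(Equivalently, and arguably more direct: 63 - Long.numberOfLeadingZeros(value) gives the same floor(log2(value)) for positive inputs.)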


The “common” units are powers of 10 so this doesn’t work


The original SO question did actually state they wanted powers of two (kilobyte as 1024 bytes). Although, to be pedantic, they should have used KiB, GiB, etc. instead.


But you can avoid binary search entirely, because there is at most one power of ten between 2^k and 2^(k+1). So you can turn it into a lookup table problem.
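A sketch of that idea in Java (hypothetical; decimal units, a table built once from floor(log2) to unit index, then a single comparison fixes the boundary cases):

    static final long[] POW1000 = { 1L, 1_000L, 1_000_000L, 1_000_000_000L,
            1_000_000_000_000L, 1_000_000_000_000_000L, 1_000_000_000_000_000_000L };

    // UNIT_FOR_BIT[k] = largest u with 1000^u <= 2^k.
    static final int[] UNIT_FOR_BIT = new int[63];
    static {
        for (int k = 0; k < 63; k++) {
            int u = 0;
            while (u + 1 < POW1000.length && POW1000[u + 1] <= (1L << k)) u++;
            UNIT_FOR_BIT[k] = u;
        }
    }

    // value > 0; returns an index into {B, kB, MB, GB, TB, PB, EB}.
    static int unitIndex(long value) {
        int u = UNIT_FOR_BIT[63 - Long.numberOfLeadingZeros(value)];
        // At most one power of 1000 lies between 2^k and 2^(k+1),
        // so one comparison corrects the table's estimate.
        if (u + 1 < POW1000.length && value >= POW1000[u + 1]) u++;
        return u;
    }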


I took one look at the snippet, saw a floating-point log operation and divisions applied to integers, and mentally discarded the entire snippet as too clever by half and inherently bug-prone.


That’s basically the point of the article


Knowledge cascades all the way down; it goes to show how difficult it is to 'holster' even the smallest piece of knowledge once it's drawn.

I wonder, given the rate at which Stack Exchange is losing active contributors, what it would take for 'fastest gun' answers that are later found to be off the mark to be corrected, and what it would mean for our collective knowledge once these 'slightly off' answers are further cemented in our annals of search and, increasingly, LLM history.


This reminds me of when I was in basic training. The drill sgts would give us new recruits a task that none of us knew how to do, purposefully without guidance, and then leave. One guy would try and start doing it, always the incorrect way, and everyone else would just copy that person.


I wonder if this is exacerbated by human tendencies to not want to look bad relative to others, even if it leads to silly outcomes like intelligent people following a bad or rushed idea.

Something similar happens in public economic forecasts because those who get it wrong when others get it right are treated much more harshly than those who get it wrong when others get it wrong too.


What was the goal of this?


"Don't jump off a cliff just because everyone else is doing it" basically

I guess the next logical exercise would be asking them to do something with instructions that are complete but incorrect, or at least inefficient, to teach the lesson of questioning superior orders rather than just peers. Actually, I'm honestly not sure if that's desired in military discipline or not (no direct experience here)


I drove a forklift one summer for a manufacturing plant.

I had a supervisor tell me to do something that was clearly not right and I refused. I came in the next day and they tried to write me up and I refused to sign the paperwork for it.

The one thing no one could accurately describe is why the supervisor was right.

I agree with the idea of being willing to go against authority but disagree that it's always a good career move :)

Of course it was easier for me, it was just a summer job, I was going back to Uni in the fall.


The usual goal of anything in military training, being cruel to new recruits?


In a way, I don't even consider floating point errors to be "flaws" with an algorithm like this. If the code defines a logical, mathematically correct solution, then it's "right". Solving floating point errors is a step above this, and only done in certain circumstances where it actually matters.

You can imagine some perfect future programming language where floating point errors don't exist and don't have to be accounted for. That's the language I'm targeting with 99% of my algorithms.


This reminds me of a weirdness with some sat navs: the distance to your exit/destination is displayed as: 12 ... 11 ... 10 ... 10.0 ... 9.9 ... 9.8 ... with the value 10.0 shown only while the distance is between 9.95 and 10. It's not really a bug but it's strange seeing the display update from 10 to 10.0 as you pass the imaginary ten-mile milestone so perhaps it's a distraction worth avoiding.
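(A guess at how that happens: the formatter switches to one decimal below ten and only rounds afterwards, something like this hypothetical sketch in Java:)

    // 9.96 formats as "10.0", while 10.2 formats as "10".
    static String distance(double miles) {
        return miles < 10 ? String.format("%.1f", miles)
                          : String.format("%.0f", miles);
    }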


Mercedes for a while had a fuel gauge that showed 1/4 1/2 3/4 1/1

They had another one that went R 2/4 4/4

I'm still undecided which was more weird. You can see them both on eBay.


There's nothing weird here. Those are very common fractions used across several domains, including cooking.

But one thing that I would really love to see are actual liters or gallons (depending on the country where I am at the moment).


Almost every top stack overflow answer is wrong. The correct one is usually at rank 3. The system promotes answers which the public believes to be correct (easy to read, resembles material they are familiar with, follows fads, etc).

Pay attention to comments and compare a few answers.


Years ago I tried to respond to a comment on StackOverflow, but I didn't have enough points to comment. So I tried to answer some questions so that I could get enough points to comment. But when looking at the new questions, it seemed to be mostly a pile of "I have a bug in my code please fix it" type stuff. Relatively simple answers to "What is the stack and the heap?" had thousands of points, but also already had tons of answers (though I suppose one of the reasons why people keep answering is to harvest points). I was able to answer a question on an obscure issue that no one had answered yet, but received no points.

Then I saw that you could get points for editing answers. OK, I thought, I can get some points by fixing some bugs. I found a highly upvoted post with code that didn't work, found that it was because one section had used the wrong variable, and tried to fix it. Well, the change was too small to meet the required six-character minimum for an edit (it was something like changing "foo" to "bar").

I went to see what other people did in these situations, and they suggested just adding unnecessary edits in order to reach the character limit.

At that point, I just left the bug in, and gave up on trying to contribute to Stack Overflow.


I was active on the statistics Stack Exchange for a while in grad school. There were generally plenty of interesting questions to answer, but the obsession some people (the most active people, generally) had with the points system became really unpleasant after a while.

My breaking point was when I saw a question with an incorrect answer. I posted a correct answer, explained why the other answer was incorrect, and downvoted the incorrect answer. The author of the incorrect answer then posted a rant as a comment on my answer about how I shouldn't have downvoted their answer because they were going to fix it, and a couple other people chimed in agreeing that it was inconsiderate or inappropriate of me to have downvoted the other answer.

I decided Stack Exchange was dumb and stopped spending time there, which was probably good for my PhD progress.


The trick to getting a lot of reputation on Stack Overflow and the like is to have posted a long time ago and then just leave it alone.

I was quite active on stack overflow back around 2010, asking a lot of questions, answering questions when I knew the answers, and so on. The idea of getting a gold badge seemed wildly crazy, and someone who had one (or even two!) was clearly a sign that they knew what was what. I used it for a while, never made much of a reputation, but did manage to earn a small handful of silver badges which I was quite proud of.

Then I forgot about it for quite a while.

Fast forward to today. My reputation chart just keeps going up at a steady linear rate. At this point I am in the top 3% of users with 14,228 reputation and 25 gold badges. I haven't been active in a decade. I don't know what most of my badges even are.

---

Most of my reputation comes from my questions. In case you're wondering what a top-3%er's top questions looks like, they are:

Apr 15, 2011 (207) -- CSS: bolding some text without changing its container's size

Aug 19, 2009 (110) -- How long should SQL email fields be? [duplicate]

Jun 29, 2010 (89) -- php: check if an array has duplicates

Jul 3, 2010 (63) -- centering a div between one that's floated right and one that's floated left

Jan 5, 2010 (44) -- CodeIgniter sessions vs PHP sessions

Apr 12, 2011 (40) -- Java: what's the big-O time of declaring an array of size n?

Jan 11, 2011 (28) -- Javascript / CSS: set (firefox) zoom level of iframe?

Jul 15, 2010 (25) -- Javascript: get element's current "onclick" contents

Aug 22, 2009 (21) -- SQL: what exactly do Primary Keys and Indexes do?

Jul 3, 2010 (20) -- Getting the contents of an element WITHOUT its children [duplicate]

For anyone keeping score, that last one was marked as a duplicate of a question that was asked a year after mine, and which seems similar on the surface to someone who does not have a good understanding of the DOM structure but is actually not the same thing.


Exactly this. I have a very, very high point score well beyond yours for being very active 13 years ago.

I have well over 50 gold badges.

I haven’t used stackoverflow in at least 5 years, probably longer, and I stopped contributing about 10 years ago.


I have a similar experience. About 10 years ago, I had some time on my hands for about 6 months, and answered a bunch of questions, with a small handful of them (3-4) getting a lot of upvotes. I haven't answered a question in years and years, but those same few questions keep getting new upvotes every month, so my progress continues to climb sort of linearly. I'm in the top 7% of contributors this year, while contributing exactly nothing new...


From a cursory glance, would you say these are still issues people run into? Aggregating these initial questions and the amount of activity they generate up until this day should tell us much about the progress and stagnation of certain programming languages/libraries/frameworks/else and their usage barriers.


In most cases, yes, but I don't think it implies stagnation. With the exception of the CSS ones which have been obsoleted by modern flexbox, those questions are mostly basic enough to defy change:

php: check if an array has duplicates

Java: what's the big-O time of declaring an array of size n?

SQL: what exactly do Primary Keys and Indexes do?


I agree, plateauing may be more apt in this case. I wonder to what extent exemplary questions like these remain universal, or have an expiry date that just isn't known at this time.


> I suppose one of the reasons why people keep answering is to harvest points

It's interesting to see some of the top (5- or 6-digit SO scores) people's activity charts.

They usually have a 3-5-digit answer history, and a 1-digit question history, with the digit frequently being "0."

In my case, I have asked almost twice as many questions as I have given answers [0].

For a long time, I had a very low SO score (I've been on the platform for many years), but some years ago, they decided to award questions the same score as answers (which pissed a lot of people off), and my score suddenly jumped up. It's still not a top score, but it's a bit less shabby.

Over the years, I did learn to ask questions well (which means they get ignored, as opposed to insulted - an improvement), but these days I don't bother going there anymore.

[0] https://stackoverflow.com/users/879365/chris-marshall


If you get enough points on one of the more niche and less toxic StackExchange sites, it'll also let you comment, vote, etc. network-wide.

I had gotten most of my points by asking and answering things about Blender workflow/API/development specifics, so I got to skip some of the dumb gatekeeping on StackOverflow.

Worldbuilding's fun, too — Codegolf's not bad either, if you can come up with an interesting way to do it — Arqade looks good, and so does Cooking — Literature, English, Scifi, etc. look interesting — If you program software, I suppose CodeReview might be a safe bet.


Yeah ... the extra-critical nature of SO is why their lunch is being eaten by LLMs. I once asked a buddy (who is now super duper senior at Amazon, working on the main site) to post his question on SO, and he flat out said no, because he'd had hostile interactions before when asking questions. Right or wrong, the reputation they've developed has hurt them a ton.


>it seemed to be mostly a pile of “I have a bug in my code please fix it” type stuff.

It's mostly people asking you to do their comp sci homework.


The edit queue was sitting at over 40k at one point.

Unfortunately, people trying to game the system create enormous work for those who can review.

(Not saying you were doing anything wrong just pointing out why there are automated guards)


You need to focus on niche tags to find worthwhile unanswered questions. Browsing the $foolang tag is just for the OCD FOMO types who spend their day farming rep.


Back in ye olden days, almost every answer involving a database contained a SQL injection vulnerability.


To their credit, a lot of people went back a decade later and fixed those. Although it doesn't stop people from repeating the mistakes.

I just got beaten up on HN for asking how the hell SQL injection is still a problem. People get defensive, apparently.


Sounds about right.

Not even a few years ago I worked with people who insisted it was ok to write injection-unsafe code if you knew for sure that you owned the injected values. Didn't matter that maybe one day that function would change to accept user-supplied data; that's not their problem! It was a Rails app and they were literally arguing for doing:

    .where("id = #{id}")
over:

    .where("id = ?", id)
in those certain situations. So, you know, it takes all kinds, I guess.


This is a case of militancy.

If we're talking about a typed integer there is no chance of that turning into an sql injection attack.

If we're talking about a string, I'd probably insist on parameterizing it even if we completely own it just on the off chance that the future changes.

To draw an analogy, gun safety is important and everyone knows it. But I don't practice gun safety while watching television on my couch because the gun is locked away. I practice gun safety when I'm actually handling the thing that is dangerous.

And yes, I realize it being locked away is technically gun safety, it's an imperfect analogy, please roll with it.


Your analogy is not flawed, but your conclusion is.

It is a perfect analogy because you are practicing gun safety by locking the gun away. If someone that you are not expecting wanders into your home while you are sitting on the couch, such as a child, they will not suddenly have access to the firearm. This is exactly why you don't assume that you will never receive unsafe input in this situation.


and as you're sitting on that couch watching television you're also practicing car safety because you're not actively breaking any traffic laws.

IOW, you're free to make that claim and you're not wrong per se, but you're not right and it doesn't refute the point.


The equivalent analogy is that you didn't leave the car in neutral on the top of a hill.

The number one rule of firearm safety - Treat every firearm as if it were loaded.

And yet children shoot themselves or others all the time because a gun was not safely stored.

But I digress...


To be pedantic, just being "typed" is not enough these days with dynamically-typed server code.


I disagree with you: if it's typed, it's safe. The issue is if it's untyped or the type isn't enforced (by the runtime, by the compiler, or by the code itself).

I understand your point, I'm just saying if it's actually typed, it's safe.


> If we're talking about a typed integer there is no chance of that turning into an sql injection attack.

Unless the database table switches to non-integer ids at some point.


Ruby is a dynamic language.


I think I agree with your coworkers. If the data is predefined constants, then you don't need to worry about injection. All functions have preconditions which must be met for them to work. As long as that's specified, that's acceptable.

Imagine the internals of a database. An outer layer verifies some data is safe, and then all other functions assume it's safe.

The example you're sharing is a bit of straw man. It's just as easy to use the parameter, so of course that's the right thing. But interpolating a table name into the string from a constant isn't wrong.


I'm not sure if this is a troll or not and I don't really want to debate this kind of thing on HN, but you've baited me. It is not a straw man. As I said, the source of the input could change in the future and it could be missed. The safe version is no more complicated than the unsafe version, so why wouldn't you just do the safe one? There is zero advantage to the unsafe way and it's straight up reckless to defend it.

I'm one of those people who moved from Ruby to Elixir. Ecto, Elixir's de facto database wrapper, will throw an exception if you try to write interpolated code like this, so luckily I don't have to have these insane arguments anymore (well, I work alone now, so there are several reasons I don't have to have them).

EDIT: My bad, I glazed past the last part of your statement.

Ya, I think this is probably where some of the defensiveness comes from: using a library vs rolling your own. If you're rolling your own, of course you're going to need to interpolate table names and whatnot, but it shouldn't even be possible to interpolate values. My example and argument are based on Rails, though, where you never specify a table name or anything like that. So in the specific case of my coworkers, they were wrong.


Yeah, bad code doesn't stop being bad code just because it is correct. Good code not only is correct, but it is obviously so. There are zero excuses in a case like this to write it in the unsafe way. Just because you know a gun is not loaded, doesn't mean you should play with it.


Yeah, if a codebase is full of stuff like this, auditing it is awful. It's like, instead of employing computers to check the details of your code, forcing it to be done manually (in an error-prone way).


This is nonsensical. When you use a function, how do you know what it will do? You guess from its name?

> auditing it is awful.

If a function specifies a requirement, you look at the callers and see if that requirement is met. If it's easy to verify in code, you can assert. Is there an easier way to audit correctness?


Idk. I have some pieces of production code that need to inject `$tableIdentifier`.`$field` into a query, where both are nominally passed from the client. I don't rely strictly on a list of constants in those cases. I take the user request, check the table name against a constant list, then run a query (every time) to describe the fields in that table and type-check them against what's in the user-submitted variables. Then escape them. Anything mismatched in name or shape at any stage of that is considered malicious.


The only principle I want to defend is that a function is correct relative to its preconditions. If the caller doesn't meet them, that's on them.


That kind of reasoning only works if the language or ecosystem has some kind of compile-time error or linter or comprehensive testing that will catch the error if the preconditions ever change. One way of doing this is encoding the preconditions in the type system. Another is through fuzzing.

If you keep the preconditions informal and never check them, the code becomes brittle to modifications and refactoring. For a sufficiently large codebase you almost guarantee that at some point you will have a SQL injection bug.

That said, using prepared statements isn't the only way to guard against SQL injections. You can also use a query builder that will properly escape all data (provided the query builder itself is hardened against bugs). Using dynamic SQL is the only way to make some kinds of queries, so a query builder is a must in those cases.

What you shouldn't do is to use string concatenation to build query strings in your business logic. It may or may not contain a bug right now, but it is brittle to changes in the codebase.


> That kind of reasoning only works if the language or ecosystem has some kind of compile time error or linter or comprehensive testing that will catch the error if the preconditions ever change.

Most requirements can't be verified at compile time, or even at runtime in a feasible amount of time.

If you expect functions to do things that they don't say they do, I don't know what to tell you. Conventions and specs are the best we have.


I think you were broadly misunderstood. If the defined constants come from or are checked against the ones stored in the database, fair play. If they're floating around in some static consts in a code file, also ok as long as that's extremely well documented and someone knows what's what. If some boss pays to cut corners for it to be written with magical constants like "WHERE life.meaning!=42" and then fires the person who they hired to write that script, they deserve whatever they get.

Just like the meaning of life, it's best not to come to premature conclusions. Could all work out, or it could be a funny joke for aliens in the end.


> I just got beaten up in HN for asking how the hell sql injection is still a problem.

It's possible for developers to think they're actually doing the right thing, but it turns out they're not.

https://www.npmjs.com/package/mysql#escaping-query-values

> This looks similar to prepared statements in MySQL, however it really just uses the same connection.escape() method internally.

And depending on how the MySQL server is configured, connection.escape() can be bypassed.


Yeah, the Nodejs ecosystem is sketchy in this regard. I've never put a Node-mysql site into production. Basically everything I write that runs DB queries is in PHP with PDO. But I got interested in Node for side projects and spotted this escaping flaw in node-mysql. That npm package also has two escaping modes, one which it calls "emulated" and which is probably less trustworthy. It doesn't seem like it was ever ready for primetime. I don't know if node-mysql2 addresses that... I ended up writing a promise wrapper for the original one that also turns everything into prepared statements. You still need to make sure NO_BACKSLASH_ESCAPES is off, although I have no idea why you'd ever turn it on.

So yeah, I'm coming from a PHP mindset where you can generally trust your engine to bind and escape values. My experience with Nodejs in this particular area caused me to write a lot of excess code (mostly to satisfy my own curiosity) and still convinced me not to trust it for the purpose.

In that light, I can understand how someone who jumped into the Nodejs ecosystem would think they were dealing with reliably safe escaping, and didn't realize what they were actually getting if they didn't read the fine print.


Hi! Sorry to report this, but I've pushed a SQL injection vuln to prod when I was still very green.

In my defense, we trusted the input. But that's post-rationalisation, because I simply didn't know what I was doing at the time.

It gets worse. If I'd done it properly, my senior would have beaten me up in code review for "complexity". That was a man who would never use a screwdriver when a hammer was already in his hand.


I once argued with a senior dev (later engineering manager, I guess he is a director of development now somewhere), that storing password hashes in unsalted SHA1 was bad.

His defense? "This system is internal only and never connected to the internet"

Senior titled devs don't necessarily know their shit.
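(For anyone wondering what the baseline alternative looks like, a minimal JDK-only sketch; the parameters are illustrative, not a recommendation:)

    import java.security.SecureRandom;
    import javax.crypto.SecretKeyFactory;
    import javax.crypto.spec.PBEKeySpec;

    // A fresh random salt per user defeats precomputed (rainbow) tables.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // PBKDF2 is salted and deliberately slow, unlike bare unsalted SHA1.
    static byte[] hashPassword(char[] password, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
    }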


A little off topic, but I love how you mention his career progression before sharing the example of his ignorance, because this seems to be a pretty common theme in tech companies (I've witnessed it more times than I can remember or count). The people I knew in my career who were most full of shit are pretty much all now Directors and VPs, enjoying a life of success, and the ones who were the most actually knowledgable are still grinding away as IC's, worried about layoffs. This industry is really bad about rewarding competence.


> This industry is really bad about rewarding competence.

If you promote the competent people, you leave the incompetent ones to do the actual work.


The trick then is not hiring bozos in the first place.


The team I described in GGGP were all strong in the roles they were originally hired for. The company likes to promote internally, which mostly works out for them. This shit team was an edge case.


This is a good counterpoint that explains why, maybe as roles change or companies grow, people who weren't exceptionally good at one role end up overseeing it. The pithy / laconic observation I was immediately responding to was pretty spot on though, and still seems to pertain (in general).

Breaking it down: That the most diligent / irreplaceable people who know the guts of the machine tend to be chained to their roles with occasional raises seems fairly logical from a C-Suite perspective. The tendency to promote incompetence - particularly overconfident incompetence - is the part that bears more scrutiny. If it were isolated to a few companies, it wouldn't be so relatable. I have a theory that it has to do with certain kinds of communication skills (specifically, bullshitting), being selected for in certain roles. And being able to write good code and explain why it has to be done that way requires the opposite of bullshitting.


Non security expert here. Walk me through the attack scenario here.

The database has access control right? So only a few people in the org can read the data. And you are imagining a case where they:

a) find an inverse image of a password hash and use that login as another person to do something bad.

b) reverse the password from the hash to use in another context.

If a is an issue, why does this individual have sensitive data access in the first place? b is still unlikely. Any inverse image is unlikely to be the password if there is salting.

It sounds like an improvement could be made, but maybe not the highest priority. Can you inform me?


To be fair, I’ve pushed vulnerabilities to prod when considered a senior and with 10+ years of experience. Nobody is immune to their own stupidity and hubris.


People who don't understand things often get cranky when they're told it's easy. Seems fair though, it does seem rude to tell someone missing a leg it's easy to run... But it also seems rude to get upset at someone who's good at something they've studied so perhaps everyone is bad at understanding the person they're talking to, and people should assume more good faith.


That’s why I prefer to use “straightforward” rather than “easy.”

People seem to take that much better.


I also like "simple". Lots and lots of very hard things are not at all complicated.


Hitting a homerun is straightforward, but it’s not easy.


I would argue the concept of hitting a homerun is straightforward, but the preparation, training and execution are not.

You’re arguing semantics.

The two words are synonymous in most casual conversation where you would be in danger of offending by saying something is easy or simple.


I think I was trying to agree with OP. Just giving an example that came to mind.

Conversely, setting up Jira is neither straightforward, easy or simple.


If you ever have an issue with the Requests library in Python, just try again with verify=False.


Easier than getting the app team to fix their TLS.


Or the corporate IT team to remove their TLS-trashing MITM attack (because their Firewall Vendor claims that's still "Best Practice" in 2023 and/or the C-Suite loves employee surveillance).


Just be sure to try running the program with sudo first, before trying shitty solutions like that.


That seems insecure; just chmod -R 777 /


At least Node has an environment variable to disable checks globally (NODE_TLS_REJECT_UNAUTHORIZED=0).


Good thing we trained all those AIs with these answers.


What if that was the goal all along? Time traveling freedom fighters set up SO so that the well for AI would be poisoned, freeing us from our future overlords!


StackOverflow and those AIs optimise for the same thing - something that looks correct regardless of how actually correct it is.


A couple months ago, someone commented that one of my answers was wrong. Well, sure, in the years since answering, things changed. It was correct when I wrote it. Otherwise it wouldn't have taken so long for someone to point out that it's wrong. The public may have believed it to be the correct answer because it was at that time.


> The system promotes answers which the public believes to be correct

Well.. duh?

Until AI takes over the world, this will be correct for everything. News, comments, everything.


Mmm... no? StackOverflow is powered by voting. Not all forums work like that (it was a questionable choice at the time StackOverflow started).

I've been a moderator on a couple of ForumBB kind of forums and the idea of karma points was often brought up in moderator meetings. Those with more experience in this field would usually try to dissuade the less experienced mods from implementing any karma system.

Moderators used to have ways of promoting specific posts. In the context of ForumBB you had a way to mark a thread as important or to make it sticky. Also, a post by a moderator would stand out (or could be made to stand out), so that other forum users would know if someone speaks from a position of experience / authority or is this yet to be determined.

Social media went increasingly in the direction of automating moderator's work by extracting that information from the users... but this is definitely not the only (and probably not the best) way of approaching this problem. Moderators are just harder to make and are more expensive to keep.


I hold little hope that LLMs will help us to reason through "correctness." If these AIs scour the troves of idiocy on the internet, believing what they will according to patterns and not applying critical reasoning skills, they too will pick up the bandwagon's opinions and perpetuate them. Ad populum will continue to be a persistent fallacy if we humans don't learn appropriate reasoning skills.


They've already proven that LLMs are capable of creating an internal model of the world (or, in the case of the study that proved it, a model of the game it was being trained on). If LLMs have a world model, then they are fully capable of generating truth beyond whatever they are trained on. We may not be there yet (and who knows how long it will take), but it is in principle true that LLMs can move beyond their training data.


AI isn't going to do better in current paradigms; it has exactly the same flaw.


Of course, consensus is a difficult philosophical topic. But not every system is based on public voting.


I sure hope people don’t copy stuff from SO before they understand what the code does.


People are writing entire programs with ChatGPT. These are the same people who previously would cobble together multiple copy-pasted SO answers; now it's just copy-pasting the entire script from a single response.


ROFLMAO!

Please, tell me that was sarcastic.


I refuse to believe anything else ;-)


Yeah, I never look at just the top comment. If it isn’t wrong, it’s suboptimal.


> easy to read

Sounds like you're counting that as a negative. Obviously it depends on the use case, but more often than not I'll lean towards the easier-to-read code over the most optimal one.


Easy to read is good, but it doesn’t trump correct.


Sure, but it's also generally a lot easier to tell whether simple code is correct (the loop over powers of 10) than more complex code (using log and pow), especially when it comes to edge conditions.


> The correct one is usually at rank 3

This has generally been my experience.


A long time ago, when ActionScript was a thing, there was this one snippet in the ActionScript documentation that illustrated how to deal with event dispatching, handling, etc. To illustrate the concept, the official documentation provided a code snippet that created a dummy object, attached handlers to it, and in those handlers defined some way of processing... I think it was XML loading and parsing; well, something very common.

The example implied that this object would be an instance of a class interested in handling events, but didn't want to blow up the size of this example with not so relevant bits of code.

There was a time when I very actively participated in various forums related to ActionScript. And, as you can imagine, loading XML was paramount to success in that field. Invariably, I'd encounter code that copied the documentation example and kept this useless dummy object with handlers defined (whose authors then struggled to extract the information thus loaded).

It was simply amazing how, regardless of the overall skill of the programmer or the purpose of the applet, the same exact useless object would appear in the same situation -- be it an XML socket or XML loaded via HTTP, submitted and parsed by the user... it was always there.

----

Today, I often encounter code like this in unit tests in various languages. Often programmers will copy some boilerplate code from an example in the manual and create hundreds or even thousands of unit tests, all with the same unnecessary code duplication / unnecessary objects. Not sure why in this specific area, but it looks like programmers treat these kinds of tests both as some sort of magic and as unimportant, worthless code that doesn't need attention.

----

Finally, specifically on the subject of human-readable encoding of byte sizes: do you guys like parted? Because it's so much fun to work with precisely because of this issue! You should try it, if you have some spare time and don't feel misanthropic enough for today.


I feel like there ought to be a software analogue to that aphorism about models (if it doesn’t exist already) — maybe something like:

All code is wrong, but some is useful.


Agreed, but is code not a model?


Why do you need a 4-line dependency?

This is the reason.


There is still the chance that the person who created the 4-line dependency also just copy-pasted it from the flawed StackOverflow answer. Or they're the same person. Or they're just as much a random person creating a package as the random person who created the SO answer. I'm not sure why random_person1 should be more trusted to produce non-flawed code than random_person2.

OTOH: it's at least easily upgradeable, so it has that advantage.


> There is still the chance

There's no chance if you avoid random_person1 and use known_oss_provider’s package instead. At the very least, look at the tests.

Any package with tests is all but guaranteed to be more correct than a never-before-run SO answer.


There is still the chance. As the article states, OpenJDK copied from the Stack Overflow answer.


Sure, but if OpenJDK is exposing that function then anyone who is using it will get the correct output when OpenJDK fixes the problem. If everyone copies the function into their own code then in many cases it's likely to never be corrected.


What if you write the code and test in your project?


The most impressive suggestion Copilot has given me was a solution to this that used a loop to divide and index further into an array of units.

It never dawned on me to approach it that way, and I had never seen that solution (not that I ever looked). Not sure where it got that from, but it was pretty cool and... yeah, it gets simple stuff wrong all the time haha.
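If I had to reconstruct it, the shape was roughly this (a sketch from memory, not Copilot's verbatim output):

    // Divide down and walk an index into the units array (IEC units here).
    static String humanReadable(long bytes) {
        String[] units = {"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
        double value = bytes;
        int i = 0;
        while (value >= 1024 && i < units.length - 1) {
            value /= 1024; // divide...
            i++;           // ...and index further into the units array
        }
        return String.format("%.1f %s", value, units[i]);
    }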


I was surprised to find log implementations are loopless. Cool.

https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...


It basically has the loop unrolled. But it looks like it's evaluating a polynomial approximation, so I suppose it makes sense.
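For the curious, "evaluating a polynomial approximation" means something like Horner's scheme over precomputed coefficients. A toy sketch (glibc's coefficients are carefully fitted; these are just placeholders):

    // Horner's scheme: p(x) = c[0] + x*(c[1] + x*(c[2] + ...)).
    static double poly(double x, double[] c) {
        double r = c[c.length - 1];
        for (int i = c.length - 2; i >= 0; i--) {
            r = r * x + c[i];
        }
        return r;
    }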


When StackOverflow was new, it was an incredible resource. Unfortunately, so much cruft has accumulated that it is now nearly useless. Even if an answer was once correct (and many are not), it is likely years out of date and no longer applicable.


While reading I was wondering why Stack Overflow doesn't "mandate" that solutions have tests, so that this problem isn't left to everyone else; ref. the comment at the end of the article:

Test all edge cases, especially for code copied from Stack Overflow.


How does the author determine this is the "most copied snippet" on SO? The question/answer has only been viewed 351k times. There are posts with many millions of views, e.g. https://stackoverflow.com/questions/927358/how-do-i-undo-the... which have definitely been copy-pasted more often. Yes, there may be many instances of this Java function on GitHub, but only because the people doing the copying are too lazy to think about how it works, never mind alter the function name. If there's a bug, just update the SO answer and fix the problem. No need to write a lengthy self-promoting post about it.


Third paragraph of the post:

It's according to this paper: https://link.springer.com/article/10.1007/s10664-018-9650-5


> How does the author determine this is the "most copied snippet" on SO?

According to this paper (https://link.springer.com/article/10.1007/s10664-018-9650-5) it's the most copied of the Java answers on SO.


It's mentioned in the article

> A PhD student by the name Sebastian Baltes publishes a paper in the journal of Empirical Software Engineering. The title is Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects [...] As part of their analysis they extracted code snippets from the Stack Overflow data dump and matched them against code from public GitHub repos.


It's described in the article...


Read the article. The methodology is flawed. It should say "most copy-pasted Java function on GitHub."


it does say that: "We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java [...]"


You asked:

> How does the author determine this is the "most copied snippet" on SO?

People answered. Should they not have?


Robert, indeed I’m grateful for the answers people have given. Perhaps I asked the wrong question to begin with. I know better for next time to be more specific. Thanks.


Well - I suppose it makes sense. SO isn't built for correctness, it's built for upvotes that just depend on whether the people upvoting like the answer or not (regardless of correctness).


Read: The most common answer to that question from LLMs is flawed.


Sounds like someone bumped into Zeno's paradox...

https://www.youtube.com/watch?v=VI6UdOUg0kg


Should have just stuck with the loop. You could change the thresholds to 99.995% of 10^whatever to accommodate the desired output rounding (999,950 bytes should already print as "1.0 MB" with one decimal shown).
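A sketch of what I mean (the 999.95 cutoff assumes one decimal of output):

    // Advance to the next unit as soon as the value would *display*
    // as 1000.0 anyway; 999.95 is the "%.1f" rounding cutoff.
    static String humanReadableSI(long bytes) {
        String[] units = {"B", "kB", "MB", "GB", "TB", "PB", "EB"};
        double value = bytes;
        int i = 0;
        while (value >= 999.95 && i < units.length - 1) {
            value /= 1000;
            i++;
        }
        return String.format("%.1f %s", value, units[i]);
    }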


Plot twist: they were hired by Oracle since they were the author of the most copied StackOverflow snippet (!)


Just divide by 1000 until x < 1000 and return int(x), plus a map from the number of times you divided by 1,000 to the MB, GB, ... string.

It's an O(1) operation because numeric types have a bounded size.


I'm curious what answer GPT will return.


Probably this one, as it's the most common in the corpus used to train it.


GPT-3.5 returns:

    public static String convertBytes(long bytes) {
        String[] suffixes = {"B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"};
        if (bytes < 1024) return bytes + " " + suffixes[0];
        int exp = (int) (Math.log(bytes) / Math.log(1024));
        return String.format("%.2f %s", bytes / Math.pow(1024, exp), suffixes[exp]);
    }


So the code from the dude in the blog post here


Not quite. ChatGPT mixes up SI and non-SI units: it divides by 1024 but labels the results with SI-style suffixes (KB, MB, ...).
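One way to make it consistent while keeping the 1024 base would be to switch to IEC suffixes:

    String[] suffixes = {"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"};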


Given how unreliable it is, probably 418 - I'm a teapot.


Classic off by 1 :)


tl;dr When in the 999+ petabyte range, it gives inappropriately rounded results.

And the key takeaway is "Stack Overflow snippets can be buggy, even if they have thousands of upvotes."

I don't disagree, but is this really the example to prove it.....


Processors are inherently awesome at branching, adding, shifting, etc. And shifting to get powers of 2 (i.e., KiB vs. GiB) is a superpower of its own. They're a little less awesome when it comes to Math.pow(), Math.log(), and Math.log() / Math.log().
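For instance, the exponent can come from a leading-zero count and the divisor from a shift, with no floating point until the final division (a sketch, assuming IEC units; it shares the display-rounding caveat from the article):

    static String humanReadableBin(long bytes) {
        if (bytes < 1024) return bytes + " B";
        String[] units = {"KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
        int exp = (63 - Long.numberOfLeadingZeros(bytes)) / 10; // floor(log2(bytes)) / 10
        return String.format("%.1f %s",
            bytes / (double) (1L << (10 * exp)), units[exp - 1]);
    }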

That 300K+ people copied this in the first place shows some basic ignorance of what's happening under the hood.[1]

As someone who's been at this for decades now and knows my own failings better than ever, it also shows how developers can be too attracted by shiny things (ooh look, you can solve it with logs instead, how clever!) at the expense of readable, maintainable code.

[1] But hey, maybe that's why we were all on StackOverflow in the first place


> Processors are inherently awesome at branching, adding, shifting, etc. And shifting to get powers of 2 (i.e., KiB vs. GiB) is a superpower of its own. They're a little less awesome when it comes to Math.pow(), Math.log(), and Math.log() / Math.log().

And here's something to consider -- if you're converting a number to human-readable format, it's more likely than not you're about to do I/O with the resulting string, which is probably going to be an order of magnitude more expensive than this little function.


Great point, I wish I'd mentioned it. The expense of the printf dwarfs the log / log (a double divided by a double, then cast to an int), which itself is greater than the cost of some repeated comparisons in a for loop.

It's key to be able to recognize this when thinking about performant code.

In other words, the entire exercise is silliness because the eventual printf is going to blow away any nanoseconds of savings by a smarter/shorter routine.


[flagged]


It's not that we think it's arcane or that we are in our own "bubbles of thought", it's that we aren't doing math. We're programming a computer. And a competent programmer would know, or at least suspect, that doing it with logarithms will be slower and more complicated for a computer. The author even points out that he wouldn't use his own solution.

P.S. Please look up the word literally.


I'm having a hard time imagining a situation where "printing out the number in a human readable format" is more time consuming than "figuring out what the number is".

I think a competent programmer might also ask themselves "am I prematurely optimizing?" if their first instinct is to pick the method that only works on a computer. I've operated in this space long enough that bit shifting is synonymous with doing the logarithm in my mind, but if I had to explain how my code works, I would use the logarithm explanation. I would be sure to point out that the computer does log (base 2) of a number much much MUCH faster than any other base.

It's probably excessive to say that literally everyone is taught logarithms as the ideal solution to this problem, but logarithms are almost universally introduced by explaining that the number of digits in a base-10 number is floor(log10(n)) + 1. So if you completed a high school education in the United States, you have almost certainly heard that much at least.

edit: printing out the number is almost always gonna be faster than figuring out the value of the number, if the speed of the operation matters. My original post implied the opposite. Part of being a competent programmer is recognizing that optimizing is sometimes bikeshedding.


The author's final suggested solution at the bottom of the article still relies on logarithms.

> doing it with logarithms will be slower and more complicated for a computer

This is a fascinating point of view, and while it isn't wrong from certain "low-level optimization golf" viewpoints, it is in part based on old assumptions from early chipsets that haven't been true in decades. Most FPUs in modern computers will do basic logarithms in nearly as many cycles as any other floating-point math. It is marvelous technology. That many languages wrap these CPU features in what look like library function calls like Math.log(), instead of having some sort of "log operator", is as much an historic accident of mathematical notation and of the fact that logarithms were extremely slow for a human.

Logarithms used to be the domain of lookup books (you might have one or more volumes, if not a shelf-full); they were one of the keys to the existence of slide rules, and the reason an engineer would actually own a set of slide rules in different logarithmic bases. Mathematicians would spend lifetimes doing the calculations to fill a single book of logarithm tables.

Today's computers excel at it. Early CPU designs saved transistors and left logarithms to application/language design. Some of the most famous games did interesting hacks, pre-computing logarithm tables for a specific set of needs and embedding them in ROM as a memory-versus-CPU-time trade-off. Today's CPU designs have transistors to spare, and hardware logarithm support is just about guaranteed. (And that's just CPUs; GPUs can be logarithmic monsters in how many logarithms they compute, and how fast.)

Yesterday's mathematicians envy the speed at which a modern computer can calculate logarithms.

In 2023 if you are trying to optimize an algorithm away from logarithms to some other mix of arithmetic you are either writing retro games for a classic chipset like the MOS 6502, stuck by your bosses in a history-challenged backwards language such as COBOL, or massively prematurely optimizing what the CPU can already better optimize for you. I wish that was something any competent programmer would know or at least suspect. It's 2023, it's okay to learn to use logarithms like a mathematician, because you aren't going to need that "optimization" of bit shifts and addition/subtraction/multiplication/division that obscures what your actual high-level algorithmic need and complexity is.


> what are you people even programming that you need to know so absolutely little about how anything else in the entire world works

Feoren, your comment takes an incredibly superior attitude and accuses its reader, every reader, of being stupid.

When taking the log of a number, the value in general requires an infinite number of digits to represent. Computing log(100) / log(10) should return 2.0 exactly, but since log(100) returns a fixed number of digits and log(10) returns a fixed number of digits, are you 100% confident that the ratio will be exactly 2.0?

Maybe you test it and it does return exactly 2.0 (to the degree floating point can be exactly any value). Are you confident that such a calculation will also work for every power of 10? Maybe they all work on this Intel machine -- does it work on every Arm CPU? Every RISC-V CPU? Etc. I wouldn't be, but if I wrote a dumb "for" loop I'd be far more confident that I'd get the right result in every case.
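A quick way to see the hazard (the exact output can vary with the platform's libm, which is precisely the point):

    public static void main(String[] args) {
        double r = Math.log(1000) / Math.log(10);
        System.out.println(r);       // e.g. 2.9999999999999996 on common IEEE-754 doubles
        System.out.println((int) r); // which then truncates to 2, not 3
    }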


> your comment takes an incredibly superior attitude and accuses its reader, every reader, of being stupid.

It's also an incredibly superior attitude to think that the discipline of software development is so uniquely special that other subjects, even basic math, have nothing to offer it, and that one could be an effective and productive software developer without having to besmirch your perfect code with concepts from other schools of thought.

And "stupid" would mean "incapable of understanding basic math". This is more like "unwilling to even try". Mere stupidity would be fine: stupid people need jobs too. But a statement that the operation everyone else in the world would use is "unmaintainable" because the programmer is unwilling to refresh themselves on how logarithms work with a quick scan of its Wikipedia article, that's not stupidity. That's bordering on malpractice.

> When taking the log of a number, the value in general requires an infinite number of digits to represent.

So does taking a third of a number. So? Do you consider the code "x / 3.0" unmaintainable?

> Computing log(100) / log(10) should return 2.0 exactly, but since log(100) returns a fixed number of digits and log(10) returns a fixed number of digits, are you 100% confident that the ratio will be exactly 2.0?

Exactness was never a requirement. Do you really never use floating point? The reality is that showing "1000 kB" the 1% of the time you should have shown "1.0 MB" is actually fine -- nobody cares, and everyone understands what it means -- and that applies to almost all floating-point imprecision. It's important to know when it does matter, but it usually doesn't, and it's important for a professional to know when not to care. How much of your client's money are you going to spend worrying about tiny details they don't care about?

> Are you confident that such a calculation will also work for any power of 10? Maybe they all work on this intel machine -- does it work on every Arm CPU? Every RISCV CPU? Etc. I wouldn't be, but if I wrote dumb "for" loop I'd be far more confident that I'd get the right result in every case.

Except a 0.00001% imprecision doesn't matter for most cases, but an off-by-one error does. For loops are much more common sources of error than logarithms are.


> You're all literally writing CRUD React front-end javascript by copy-pasting "for" loops from StackOverflow?

To an approximation, yes.

The underlying calculations at my bank were probably written once in 1970 in COBOL and haven't changed meaningfully since. But the front-end UI to access it has gone from teletypes and punch cards to glass terminals to networked DOS to Win32 to ActiveX to Web 2.0 to React and mobile apps. Lots and lots of churn and work on the CRUD part, zero churn and work on the "need to remember logarithms" part.

AI? You have core teams building ChatGPT, Midjourney, etc. Then huge numbers of people accessing those via API, building CRUD sites to aggregate midjourney results and prompts, etc etc. Even Apple has made a drag-and-drop UI to train an AI object classifier, the ratio of people who had to know the math to make that vs the people using it is probably way above 1:100,000

Is this that surprising?


Well, maybe not exactly unmaintainable, but I think most of us have learned that floating-point operations are not to be trusted, especially if the code needs to run on different processors. Furthermore, calling such math functions is overkill most of the time. I would definitely never consider it for such a simple operation. I actually agree that it might look cleaner and easier to understand, but in my mind it would be such heavyweight overkill that I would never use it.


Obligatory, my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454


And yet it’s wrong like all the rest


How so?


The answer is amusing, but it seems the author either didn't read the question properly or didn't read their formal-languages textbook properly, and rushed ahead with an answer that isn't really correct.

For one thing, it assumes "regex" as used in programming is the same as "regular expressions" (defining regular languages) in the formal sense. More info on that [1]

But the question isn't even about fully parsing HTML, with bracket balancing. It's just about syntactically matching all the opening tags -- more "lexing" than "parsing". Instinctively that does look like a simple regular language to me, though I'm not claiming certainty. The part of HTML that goes beyond regular comes from nested elements, but it's just the tag syntax this user cares about, with no context-sensitivity.

One red herring is comments and CDATA sections, but since they cannot be nested, they don't change the language class: you just transition to a skip state and back when you see the start/end markers. They do make the expression much uglier, of course.

[1] https://en.wikipedia.org/wiki/Regular_expression#Patterns_fo...
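To make the "lexing" claim concrete, here is a crude, hypothetical sketch; it deliberately ignores comments/CDATA and doesn't exclude self-closing tags, but it stays regular even with '>' inside quoted attribute values:

    import java.util.regex.Pattern;

    // Tag name, then attribute content that is either a quoted string
    // or any single character that isn't a quote or '>'.
    Pattern openTag = Pattern.compile(
        "<[A-Za-z][A-Za-z0-9]*(\"[^\"]*\"|'[^']*'|[^'\">])*>");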


Pretty awesome stuff. This is what Hacker News is for!


wtf, why would someone downvote this? This is prime Hacker News shit and why I come here!


I didn't downvote, but I would guess it's due to the general idea that if you just approve or disapprove of a post you should simply vote that way instead of expressing it in a comment. Personally, while I agree there's a logic to that, I find it a little cold for positive sentiments. I couldn't find it, but I think there's a pg or dang comment to the effect that "I like this" as a comment is explicitly not discouraged on HN, though obviously that doesn't mean everyone agrees.



