So, my spouse was a CPU designer at AMD for many years and now does secure computing work for, well, the US government. I showed her your comment. She laughed. A lot.
Well, that's a bit of sarcasm. Yes, you need quite a serious lab for that, a level above what most fabless semiconductor companies have, plus skills on par with a process developer's.
Yet, "firmware recovery" people in China use that regularly to make a living. Hardened/encrypted MCU firmware extraction costs under $20k here.
There are plenty of retrocomputing folks who would be heavily interested in ROM/firmware recovery from "hardened" chips, for entirely legal archival and/or interoperability purposes. $20k would be peanuts for this use case if success could be reasonably assured even in the "hardest" cases.
Otherwise, HDF5 offers every single advantage that Zarr has and is much more mature, stable, better documented, and has better support.
Absolutely not. HDF5 is an awful format with terrible implementations. For example, try writing a Python program with multiple threads where each thread writes to a different HDF5 file. This should just work -- there's no concurrent access to shared data. And yet it doesn't, because HDF5 implementations are piles of ancient C code that rely on lots of global state. There's no technical reason for this; one could easily store all the needed state in a per-file object. But back in the day, software engineering standards were lower (especially for scientists), and HDF5 changes at a glacial pace.
I've been bitten by this particular bug, but you really have to wonder: given how poorly it speaks to the software engineering behind HDF5 implementations, what else is broken in the code or specifications?
If you're working in a situation where it makes sense to have things on disk or some sort of NFS share, use HDF5. If you're working with objects in a cloud bucket, you'll incur additional overhead with HDF5, as you'll have to read its table of indices, then make range requests to each chunk. Zarr is optimized for the cloud use case.
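The difference in access patterns can be sketched with a toy model. This is purely illustrative, not real HDF5 or Zarr code: a dict stands in for a cloud object store, and the layouts and names are made up.

```python
# Toy sketch (NOT real HDF5/Zarr code): contrast the two access patterns
# against a dict standing in for a cloud object store.

# Zarr-style layout: each chunk is its own object, addressable by key.
zarr_store = {
    "myarray/0.0": b"chunk00",
    "myarray/0.1": b"chunk01",
}

def zarr_read_chunk(store, array, row, col):
    # One GET per chunk -- the key is computable from the chunk index,
    # so no separate index lookup round-trip is needed.
    return store[f"{array}/{row}.{col}"]

# HDF5-style layout: one monolithic object; chunk offsets live in an
# internal index that must be fetched before any chunk can be read.
hdf5_object = b"HEADERchunk00chunk01"
hdf5_index = {(0, 0): (6, 7), (0, 1): (13, 7)}  # chunk -> (offset, length)

def hdf5_read_chunk(obj, index, row, col):
    # In a real store this would be: GET the index, then one ranged GET
    # per chunk -- extra round-trips before any data arrives.
    offset, length = index[(row, col)]
    return obj[offset:offset + length]

assert zarr_read_chunk(zarr_store, "myarray", 0, 1) == b"chunk01"
assert hdf5_read_chunk(hdf5_object, hdf5_index, 0, 1) == b"chunk01"
```

Same bytes either way; the difference is how many round-trips it takes to find them, which is what dominates over high-latency object storage.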
When I last looked, there were no open-source HDF5 implementations smart enough to do range requests against cloud-hosted HDF5 files. Has this changed?
Ah, thanks for these! But I see nothing has changed.
* pyfive is interesting but immature and doesn't seem to have any cloud bucket support
* h5s3 is an abandoned experiment that hasn't been touched in two years
* h5py is fine but again, no cloud support
* kita is a commercial offering from the HDF Group and -- I cannot stress this enough -- these people are shockingly incompetent; plus when I last looked at their system architecture diagram I thought it was a joke (well, I thought it was an intentional joke)
Efficient access to scientific datasets hosted on S3/GCP is a full blown crisis in the scientific computing community. People aren't switching to zarr for the fun of it, but because zarr is here, today, and isn't a joke, and is actually open.
It's been a while since I worked on it, but I did get pyfive to read from S3 objects, either by wrapping the entire object (read into memory) in a BytesIO, or with a custom class that implemented peek, seek, etc. against an S3 object (the first method was better if you needed to read most of a large file, the second was better for a small subset of it). Note that it supports reading only, not writing. Later I heard that I wouldn't have needed pyfive, since h5py now supports file-like objects. So the comments about no cloud bucket support are not exactly true.
To be clear, our experience using gcsfuse and friends to do basically the same thing was extremely painful and a performance nightmare. The HDF5 format was designed for a world where seeks are free, which makes cloud access high-latency and low-throughput.
This is good info. I've been wary of hdf5 for some time. Nothing concrete (until this bug) but from my research it just consistently smelled fishy. The main turnoff for me was the possibility of data corruption bricking the entire dataset.
Pity, as it has on paper a lot of great concepts and features. Maybe it'll be mature enough someday, though my money is on something better from the ground up coming along.
Honestly, most of the portability advantage is moot nowadays. Chunked S3-like storage, SMB, and the ability to copy files from ext4 to NTFS (at least on *nix) mean that sharing your data across platforms isn't the struggle it used to be. Windows is rapidly becoming/already is a second-class citizen in data-heavy scientific workflows.
I ended up going with a NAS and just file system primitives for my computer vision image workflow, works great.
> The main turnoff for me was the possibility of data corruption bricking the entire dataset.
A glib high level overview of my last job for 6 years was "write out HDF5 files". In that time, I don't recall seeing a true data corruption problem with HDF5.
Now, I ran into many other problems with HDF5, typically surrounding the newer features that came along in 1.10, and its threading limitations. The older folks at that job would mention historical issues with data corruption (often from reading files as they're being written to), but I never saw it myself.
It's... complicated. You can certainly write parallel code in Python despite the GIL; there are several scenarios. The short answer is: the multiprocessing library, used carefully, can speed up your CPU-intensive Python program by spreading work across multiple processes/cores. The longer answer is: many IO-bound Python programs can be sped up using multithreading within a single Python process (because the application is mostly waiting for IO), and many CPU-intensive Python programs can be sped up using multithreading where the work is done in C functions that release the GIL.
Many python programs I write end up using 8+ cores on a single machine using either multiprocessing or C functions with released GIL.
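The multiprocessing route can be sketched in a few lines. This is a minimal illustration with the stdlib; the workload function and pool size are made up for the example.

```python
# Minimal sketch: spreading CPU-bound work across cores with the stdlib
# multiprocessing Pool. The workload is a stand-in for real computation.
from multiprocessing import Pool

def cpu_heavy(n):
    # CPU-bound work; each call runs in a separate process, so the
    # parent interpreter's GIL doesn't serialize the computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map distributes the 8 tasks across up to 4 worker processes.
        results = pool.map(cpu_heavy, [100_000] * 8)
    print(len(results))
```

The same shape with `concurrent.futures.ThreadPoolExecutor` would help only if `cpu_heavy` were IO-bound or spent its time in GIL-releasing C code.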
No, you can certainly write in parallel, despite the GIL. The GIL makes this inefficient if your work is CPU-bound, but for IO-bound workloads it can be fine.
But the HDF5 library does not really support multithreading at all. Compiling the library with the thread-safety option just wraps a lock around every API call, so you're back to a single thread whenever you enter the library (and compiling without it means concurrent calls will just crash your program).
And the library does quite a lot of work when you call into it; chunk lookup, decompression, and type conversions all happen behind that lock. You can use the "direct chunk access" functions (H5Dread_chunk?) to bypass a lot of that work and do it yourself, so you get back to using multiple threads again, and that can be a big win, but having to do it sucks, and I don't think h5py exposes this functionality at all.
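The effect of that single library lock can be shown with a toy model. This is not HDF5 code, just an illustration of why a "thread-safe" build that locks every entry point gives you safety without parallelism.

```python
# Toy illustration (NOT actual HDF5 code): a "thread-safe" library that
# takes one global lock in every API call never overlaps two callers.
import threading

_global_lock = threading.Lock()  # stand-in for the library-wide lock
call_log = []

def api_call(name):
    # Every entry point acquires the same lock, so chunk lookup,
    # decompression, type conversion, etc. all happen serially,
    # no matter how many threads call in.
    with _global_lock:
        call_log.append(("enter", name))
        # ... the expensive work would happen here, under the lock ...
        call_log.append(("exit", name))

threads = [threading.Thread(target=api_call, args=(f"read-{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because of the lock, enter/exit pairs are never interleaved: each call
# runs start-to-finish before the next thread gets in.
for i in range(0, len(call_log), 2):
    assert call_log[i][0] == "enter" and call_log[i + 1][0] == "exit"
    assert call_log[i][1] == call_log[i + 1][1]
```

Bypassing the lock via direct chunk access amounts to moving the decompression and conversion work outside the locked region so your own threads can do it in parallel.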
What's the plan for when grandma develops dementia? The paranoia will make her unwilling to part with her beloved firearm... do we just wait until she murders a family member or home health aide?
That seems like a much more important question than bickering over precisely which firearm grandma should have as she slides into dementia.
I agree we need more plans for dementia, Alzheimer's, and similar mental and physical declines -- for guns and other rights as well.
There will likely come a point when people need to step in and take control of elderly people's finances, weapons, other possessions, choices about nutrition and healthcare, all that.
People of many ages get stripped of their rights for different reasons, and I hope the world exposes more of these concerns. I recently helped with a site exposing people who used the law to take over someone's assets: the family pushed the case through a busy judge, took over her house and bank accounts, left her broke after a couple of years, and then disappeared. She was left to the state to care for, with no money, no house, nothing -- and she had been quite capable of planning her own future and retirement, but it was all taken.
I think many people have had rights taken from them in unfair ways. In some cases, states are moving to restore voting and gun rights to people who have served their time.
At the same time we need more tests for proving that people are mentally and physically fit enough to wield weapons, cars, and other things safely.
In some of those cases it may make sense to have special teams show elderly people how to use different weapons and different cars should the need arise. In other cases it could be deemed that they should not drive or try to defend themselves.
I'd like to think our society will take care of these older people so they never need to worry about transportation or fighting. We need to make a lot of changes for the future if this is to be however. Right now our society is set to leave a lot of old people to languish without good care and safety - and most just look the other way as they suffer.
However, as the person who actually runs the relevant promo committees for the org, either would be fine; I'd care a lot more about why the choice was made.
> Googlers have the experience but not the distance to meaningfully comment on google.
Yes, having experience is still better than outright making things up when you have neither experience, nor data, nor any other basis for a claim.
Option 1 will take a long time. Writing a libc takes a long time, and transitioning the whole of Google's codebase to a new libc and ironing out all the incompatibilities takes even longer.
If one goes with option 2, they may rack up many more achievements to write in their promotion packet during the time they would otherwise have spent implementing a new libc. So I hardly think promotion is the incentive here.
The idea that children should spend their days staring at screens and never interacting with other human beings seems really sick. But oh so on brand for the Valley.
It's not just the Valley, it's also the model of traditional schooling. The majority of classroom time in many places is spent sitting quietly and listening to a teacher lecture.
Of course we all understand why school is run this way: the alternatives would cost a lot more money. Isn't that the story of life everywhere...
It fits into the narrative the Zuckerbergs want for their target market -- a person whose reading and learning can all be tracked, and who can be consistently targeted with ads all day long.
I worked at FB briefly so maybe I can explain. FB has a corporate culture that really discourages critique. When things are broken, especially internal things, people look at you funny if you speak up about it. A big part of that is that quarterly bonuses are given for "making an impact" and your group's status (and part of your bonus) is based on delivering a consistent set of "impacts" over time. So it is better for your comp to do things badly really fast since you get (1) did something super fast! and (2) get to record a big impact a few months later when you fix the obvious brokenness.
Pretty quickly, people learn to keep their mouth shut.
Also, many, many FB engineers are early-career folk who are fresh out of school. More senior folk are few and far between and are even more strongly incentivized to keep their mouth shut, because their bonuses are bigger.
Yesterday, this same rationale came up to explain why Google keeps launching new products and then abandoning them, over and over and over.
I guess this is what happens when a startup gets big. They keep all the toxic baggage of startup culture (edit: "move fast and break things") while gaining the impact on people's lives that big companies have.
I think Apple is the only one of the FAANG that's jettisoned startup culture, and I think that's why they're doing so incredibly well.
The way that white supremacy works is that black people are penalized even in the small enclaves where they constitute a local majority. Note that the laws that punish crack (used by lower income black people) 100 times more harshly than powder cocaine [0] are federal in nature.
I think that's dependent on language culture. Because the C++ community doesn't take language simplicity or comprehensibility seriously, there are lots of C++ developers who can't use or reason about surprisingly large parts of the language. So the community has rallied around the notion that "library" developers need to understand everything and that most developers will just glue together bits that the library devs made.
I mean, how many C++ developers actually write serious template code? How many of them could reliably explain what the keywords in the post do?
The idea that every developer is a library author (or the Lisp extension of it, that every developer is a language author) is common in many other language communities, but it relies on the community working hard to make mastery of the language feasible for lots of people. The C++ community never bought into that notion; they inherited a very stratified class structure from Bell Labs.
To clarify: of course they like simplicity when it costs nothing. But they consistently value other goods over simplicity.
For example: maintaining backwards compatibility. The community believes that it is more important that 20 year old C++ code run unmodified than that the language should be simplified. There's lots of stuff you could do to simplify the language but options dry up in a world where 20 year old code must be able to run unmodified.
So sure, the committee talks a lot about simplicity, but it isn't willing to sacrifice much.
Don't get me wrong: I'm glad that finally, in 2020, C++ will be almost but not quite as good as Common Lisp was at metaprogramming back in 1982. But it remains the case that eval-when and defmacro are both more powerful and dramatically simpler than anything the C++ committee has ever considered.
Two other goals for C++ are 'zero-cost abstractions' and 'leaving no room for a lower level language'. It does better on both of these goals than Common Lisp and they are important reasons for its popularity (along with backwards compatibility and easy interop with C APIs).
Zero-cost abstractions only exist in a world where you don't highly value language simplicity and comprehensibility.
Simplicity and comprehensibility were things the committee had to give up in order to pretend they had "zero-cost" abstractions. Nothing in life comes free: everything, including all abstractions, comes at some cost.
> Nothing in life comes free: everything, including all abstractions, comes at some cost.
Yes. As a slogan, it is imprecise. But it's always been talking about a very specific kind of cost: runtime costs. You're 100% right about there always being some kind of cost, but the slogan doesn't disagree with you.
(Some prefer "zero-overhead principle" instead to make this a bit more clear.)
Even reducing it to runtime costs, the claim seems a bit nonsensical. Are C++ exceptions a zero-cost abstraction? All the Googlers I argued with about them would insist that they have unacceptably high runtime costs.
OK, but templates are surely zero (runtime) cost abstractions, right? Unless you start to worry about duplicate code blowing out your instruction cache but if that's a problem, no profiler in the world will ever be able to tell you, so I guess you'll never know just how costly the abstraction is, so you might as well continue believing it is zero...?
Zero cost abstraction is not the same thing as a free lunch. It's a goal that using the abstraction will have no runtime cost relative to implementing the same or equivalent functionality manually. It goes hand in hand in C++ with the "don't pay for what you don't use" principle (again talking about runtime performance cost).
When it comes to exceptions, it's generally true on x64 that you don't pay for what you don't use (there's no performance penalty to exception handling if you don't throw) although that hasn't always been true for all platforms and implementations. It's also generally true that you couldn't implement that kind of non local flow control more efficiently yourself, although the value of that guarantee is a little questionable with exception handling.
I'd argue that no language has a really good story for error handling. It's kind of tragic that we have yet to find a good way to deal with errors as an industry IMO. The most promising possible direction I've seen is in some of the possible future extensions to C++ - it's widely recognized as an area for improvement.
Template code bloat is another case of not imposing more cost than if you implemented it yourself and you have pretty good mechanisms in C++ for managing the tradeoffs.
> When it comes to exceptions, it's generally true on x64 that you don't pay for what you don't use (there's no performance penalty to exception handling if you don't throw) although that hasn't always been true for all platforms and implementations. It's also generally true that you couldn't implement that kind of non local flow control more efficiently yourself, although the value of that guarantee is a little questionable with exception handling.
I'm inclined to agree with you but just about everyone at Google says the opposite and most C++ shops I've seen agree with them. I've made this argument and lost repeatedly. So, it seems like the community can't even agree on which abstractions are zero cost (or maybe whether some zero cost abstractions are actually zero cost?). To the extent that the community itself has no consensus about these things, maybe they're not a marketing slogan that's helpful to use.
Lots of things come with tradeoffs but relative to a certain set of goals and priorities there are design decisions that are net better than others. Not everything is a zero sum game.
In this context 'zero-cost abstractions' refers to zero runtime performance cost and C++ comes closer to achieving that than most other languages. It doesn't mean zero compile time cost or zero implementation complexity cost but both of those things can end up better or worse due to design decisions and quality of implementation. Given that zero cost refers to runtime performance however, the committee is not 'pretending' they have zero cost abstractions.
It is true that simplicity and comprehensibility are not C++'s highest goals / values but they are not ignored or seen as having no value. Indeed they are major topics of discussion when new features are being considered. Sometimes they are in tension with or even in direct conflict with other goals but not always.
> For example: maintaining backwards compatibility. The community believes that it is more important that 20 year old C++ code run unmodified than that the language should be simplified. There's lots of stuff you could do to simplify the language but options dry up in a world where 20 year old code must be able to run unmodified.
This pains me, but every time I think "just toss XXX out, goddammit!" I think of IPv6. C++11 is still the most popular dialect of C++, even for new development I believe, and C++14 is the hot new thing to many people.
> Don't get me wrong: I'm glad that finally, in 2020, C++ will be almost but not quite as good as Common Lisp was at metaprogramming back in 1982. But it remains the case that eval-when and defmacro are both more powerful and dramatically simpler than anything the C++ committee has ever considered.
C++ is held back by having statements. If the basic structure were an expression a lot of programming, much less metaprogramming, would be simpler.
That's (maybe) great for folks in China, but what about folks outside? Is the rest of the world really better off if Google can easily be bullied and manipulated by the Chinese government?
Once they start taking orders from Beijing about what results to show in China, why shouldn't they take orders about what to show in the US? After all, the Chinese government controls their access to an enormous market worth lots of money. Will we be able to find articles about the concentration camps for Muslims in China after they're operating in China?
> Once they start taking orders from Beijing about what results to show in China, why shouldn't they take orders about what to show in the US?
At the end of the day, Google is an American company, and the US won't let that happen. Right now, Chinese companies are given a free monopoly in their home country, which gives them billions of dollars they can use to purchase foreign companies. China would have a much easier time pressuring Tencent and Baidu than Google or Microsoft. And all of this assumes China cares what foreigners think of it, which it doesn't.
> Is the rest of the world really better off if Google can easily be bullied and manipulated by the Chinese government?
I think you overestimate the influence they would have. Have you noticed Microsoft or Apple getting bullied by the Chinese government? Both companies bend over backwards to please China; if they aren't affected much, why would Google be?
Japan is intensely racist. My spouse lived there for a year. When she went about her business outside, small children who saw her would literally burst into tears screaming to their parents about gaijin.
That never happened to me, and I have been living there for 8 years now.
But I would agree that Japan is generally racist, in the sense that most Japanese believe there are fundamental differences between people due to what they call race. I am not sure whether most Japanese would consider Koreans a different race, but I would not be surprised either.
This is all completely wrong.