Amazon Ion (amzn.github.io)
307 points by BerislavLopac on July 22, 2020 | 110 comments



After working (internally) with Ion and related tooling, I'd say I was the opposite of a fan. Protobuf's strength is in the good tooling/codegen around it (especially inside Google), while with Ion you just have a "superset of JSON".

Ion never had nice code wrappers around serialized structures, and most of the time it was a frustrating experience, especially with rich structures.


In 2013 I was generating code from SDL definitions which could also be used for data validation during serialization/deserialization. I had plenty of “model” packages which were just SDL definitions, build config, and maybe some unit tests to validate the schema constraints. (Edit: SDL is “schema definition language” which was a schema definition tool written for Ion with definitions in Ion.)

These were used in services and reactors which never touched raw Ion (at least not in any way different from Coral or BSF).

Full disclosure, I spent a lot of my free time working on Ion, both the supported implementations as well as my own. The additional data types are worth it alone, imho. Having to use JSON for most things now I’m frustrated at what is “missing”.


I built the first version of an IntelliJ plugin[1] to make working with the reactor stuff easier. Doesn't look like there have been many improvements to it since.

1: https://github.com/amzn/ion-intellij-plugin


But do you see any reason why such tooling could not be built?

I suppose it's mostly an under-investment of time, not a shortcoming of the format itself.


Might it be though? Protobuf's tooling seems like a byproduct of the fact that you can't read protobuf and it's strict and type safe enough that you can generate lots of things.

Ion is readable and (seemingly) not very strict about schema. Seems like that would not readily incentivise additional tooling.


Protobuf actually has a canonical text format. It's easy to produce in C++ or Java https://developers.google.com/protocol-buffers/docs/referenc...

The format has much less extra syntactical noise than JSON.

For example,

    name: "vii"  # comments allowed!
    id: 23923373
Pretty nifty as it allows readable configuration files with structured data.


If it is "easy to produce or consume in language X", that does not mean it is canonical - it means that language X has an extension that allows you to do so. Is there a place in the protobuf spec or documentation mentioning that this is part of the protocol?


Language X is at least C++ (as linked), Java https://developers.google.com/protocol-buffers/docs/referenc... , and Python https://github.com/protocolbuffers/protobuf/blob/master/pyth...

I think the Protobuf spec focuses on the binary serialization - the text format and JSON representations are not related to that at all, of course.


Protobuf can be converted to (and from) JSON for readability


If you're working against a schema that means presumably there is a schema, and that defeats essentially the whole purpose of using a self-describing format like Ion. At that point, use something like protobuf that is schema-ful.


I think that talking about Ion, without talking about PartiQL, is not setting people up with proper context.

PartiQL is AWS's specification for a parser/query language that is compatible with standard SQL, but can query semi-structured or unstructured data (think JSON, Parquet, CSV/TSV etc)

https://aws.amazon.com/blogs/opensource/announcing-partiql-o...

PartiQL uses Ion as its backbone and data format:

https://partiql.org/faqs.html#why-do-you-choose-ion-to-exten...

https://github.com/partiql/partiql-lang-kotlin/blob/master/e...

I looked pretty deeply into this, but fell a bit short of understanding what they meant by "if your query engine supports PartiQL." Does that mean writing a new DB that delegates incoming queries to PartiQL? Not sure.

Anyways, they use it in Quantum Ledger DB, and a few other internal projects:

https://docs.aws.amazon.com/qldb/latest/developerguide/ql-re...

So maybe that can give some more context around "what the hell is this, why does it exist, how would you use it?"


ion precedes PartiQL by many years, maybe even a decade. The sole reason ion exists is to make parsing json faster and less ambiguous so that a few specific edge cases are handled efficiently. So far so good, right?

The problem is that it spreads, like an infection, to surrounding services. Inside Amazon there are literally hundreds of libraries that duplicate standard json libraries in various languages but support ion instead of json. All of this is just to deal with interoperability.

Ion is slower than protobufs and less universally understood than json. Honestly it's just an annoyance.


Yeah, it's more like a spec: they expect you to write the query optimizer and analyzer, basically a new query engine. They offered a document-based "reference" implementation in Kotlin though, so I think an expert could follow it.

Point is, AWS might not have made up its mind yet on whether this does more harm to their DB business or not.


Ahh. I guess for SQL you could just pass the queries through, and for NoSQL engines try to convert the AST format.

When I was researching this area, it seemed like Apache Calcite was the way to go, and it already does this, though?

It lets you use standard SQL and has adapters which translate from the abstract query AST to the particular implementation.

https://calcite.apache.org/docs/adapter.html

You can query anything with this, exactly the same way. It's kind of wild I've never heard it talked about tbh.

I went looking for solutions to multi-datastore querying when I fiddled with a business intelligence side project. Pretty useless to only be able to query one type of data, and too time consuming to implement individual mappings.

Apache Calcite and Metabase Query Language (quite the exact usecase there for Metabase, haha) were the only things I could find.


Given that Postgres already supports document types, do you know what the main advantage of using AWS's is instead?


Does Postgres support column-oriented formats, like Parquet?


> Does Postgres support column-oriented formats, like Parquet?

With an appropriate FDW, sure, and I'm pretty sure I've seen an FDW for parquet specifically, as well as other columnar formats.


To give Amazon more money.

Oh you mean advantage to you? haha...well..


That sounds great, but is Kotlin really the only current implementation?


If you want a working implementation (see comment below) and not a reference for the parser, you could use Apache Calcite:

https://calcite.apache.org/

But again, yeah it's a JVM thing so your options are that.


It amazes me how everyone likes to hate on the JVM, and yet nearly every enterprise company uses it in the big data space....


Everyone likes to hate on COBOL, and plenty of enterprises use it in all kinds of spaces, too.


The communication gap between businesses using COBOL and the academic institutions teaching IT is the reason for the hatred of the language among young computing graduates. Whether you love it or not, you live with it and it supports your family, so teach the youngsters to love COBOL; their grandmother is not going to die any sooner!


Previously (2016, 165 comments): https://news.ycombinator.com/item?id=11546098


So it didn't really catch on?


How does this compare to protobuf, thrift, msgpack etc?

It’s roughly the same vintage as protobuf and thrift, from Google and Facebook respectively, so perhaps it’s just Amazon’s equivalent, which they never released as quickly as the others did?

Obvious pros and cons, or yet another serialization format with no obvious benefits over anything else?


Just from reading their page and being familiar with the formats you mentioned:

vs. protobuf: ion is self describing, vs needing a schema

vs. thrift: similar, thrift needs a schema to interpret a binary file

Both thrift and protobuf are really binary formats; though they have a canonical textual representation, it's not actually used to serialize. It sounds like Ion supports serializing as text as a first-class concept.

vs. msgpack: ion has a corresponding text format, whereas msgpack is only binary. Additionally, ion has a symbol type, msgpack doesn't.

I think the biggest benefit here is that it's a new chance for a format that fixes some of json's rough edges to gain critical mass. There's probably nothing ultra special about it that hasn't been solved in other formats, but maybe the timing will be right and everyone will just adopt it as a json replacement (sort of how people just gave up on xml and switch to json seemingly overnight). It's impossible to predict stuff like that.

Edit: upon noticing that it was released in 2016, it seems less likely everyone will jump on the ion bandwagon ...


If I'm not mistaken, there were plenty of text protobuf files used internally for a lot of things, and much, much less of anything else (okay, XML was prevalent for our team, maybe due to being Java-inclined). I've even seen examples of text protos pushed through the command line (it's possible, but you need to get it right).


There are some pain points that are being addressed:

1) timestamp: I have had issues with round-tripping timestamp representations quite a bit.

2) decimal: currency is denoted in decimal rather than float, which shows the Amazon retail heritage. This is very useful.

3) symbols: I've had cases where a symbol table/dictionary would have made a big difference in serialized size.
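The decimal point is easy to demonstrate in plain Python (nothing Ion-specific, just the standard library):

```python
import json
from decimal import Decimal

# Decimal arithmetic is exact for base-10 currency values
assert Decimal("0.10") + Decimal("0.20") == Decimal("0.30")

# The binary floats that JSON numbers become in most languages are not
assert 0.1 + 0.2 != 0.3

# ...and a default JSON round-trip bakes that error into the document
assert json.loads(json.dumps(0.1 + 0.2)) == 0.30000000000000004
```

A first-class decimal type in the format means the exact value survives serialization without per-language workarounds.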


Re time stamp and decimal, probably no surprise that it is used heavily by QLDB, where having a very clear time for a change is important and a common use case is logging debits and credits as a financial ledger.


See Zish for an improved JSON:

https://github.com/tlocke/zish

It has timestamps and decimals. Full disclosure, I am the author.


I think using decimals (or arbitrary size integers) for currency is common knowledge by now.


I don't know. It was common knowledge for me in college (as in it was taught as part of the curriculum) but as far as I can tell in the intervening 30+ years that knowledge seems to have been lost and relearned many times over.


Terrifyingly, I discovered recently that Plaid’s API uses floats instead of decimal. For example, security prices:

https://plaid.com/docs/#security-schema


Cash values should be represented in fixed precision to maintain the integrity of the transaction and your books, while the prices for securities represent something different.

In securities transactions, the quantity and quote are critical. You aren’t buying securities from Plaid, right?

If you try to liquidate or resize based on the Plaid quote, your brokerage or counterparty is going to provide a totally different quote, and one from a system engineered to provide quotes aligned exactly to the market standards.

I don’t see the risk/terror.


Protobuf/Thrift are schema-based serialization systems.

Ion is directly comparable to JSON/MessagePack/BSON/CBOR.

I would expect Ion will have slightly different time/space tradeoffs than the other binary schemaless formats.


It seems much more directly comparable with CBOR/JSON, as they mention it a lot: https://amzn.github.io/ion-docs/guides/why.html#dual-format-... I use CBOR quite a bit. It sounds like Ion doesn't really offer much different in the binary form; rather, the textual form maintains better types than JSON, and the textual version matches the binary version (where JSON/CBOR are mismatched in terms of types). So it seems nicer as a cohesive textual/binary format. I'd be interested in seeing how well packed the data is in Ion vs CBOR.


I wish these comparisons would occur to the documenters without the need for a prompt. Also Apache Avro, which is common with Kafka?


It is probably closer to something like avro, but with more features and less java focus.


As others have already pointed out, this was released in 2016 and already discussed on HN [0], and it seemingly hasn't taken the world by storm since. But just glancing at the amzn GitHub activity, it looks like the docs and the tooling [1] are recently and frequently updated (including a new CLI in Rust [2])?

Can anyone currently at Amazon shed some light on how prevalent Ion is internally?

[0] https://news.ycombinator.com/item?id=23922278

[1] https://github.com/amzn/ion-docs/commits/ https://github.com/amzn?q=ion&type=&language=

[2] https://github.com/amzn/ion-cli


I left Amazon a bit over a year ago, after being there seven years. It always struck me as a combination of "not invented here" syndrome and a solution in search of a problem. It has no real-world benefits over JSON, the tooling is limited, but you inevitably have to deal with some other team that regrets choosing it and now it's their API. I'm so happy I never have to look at it ever again, and seeing this post today is a real throwback to wasted engineering effort. Just let it go, Amazon.


It has a decimal type. That alone is reason enough for amazon to use it over json.


You can easily encode a decimal type as binary data. Not a huge deal.
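For example (a hypothetical sketch in Python, not Ion's actual wire layout): a decimal can be carried losslessly as an exponent plus an arbitrary-size signed coefficient, which is roughly the shape Ion's decimal type takes.

```python
from decimal import Decimal

def encode(d: Decimal) -> tuple[int, int]:
    """Split a decimal into (exponent, coefficient) with no precision loss."""
    sign, digits, exp = d.as_tuple()
    coeff = int("".join(map(str, digits))) * (-1 if sign else 1)
    return exp, coeff

exp, coeff = encode(Decimal("19.99"))
assert (exp, coeff) == (-2, 1999)

# Lossless round trip: coefficient * 10**exponent
assert Decimal(coeff).scaleb(exp) == Decimal("19.99")
```

The two integers can then be packed with any varint or length-prefixed scheme; the point is only that no float conversion is ever involved.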


Depends on the part of Amazon but it is pretty prevalent in Retail. The fact that it is both binary and self describing makes it pretty good for data at rest. You can still parse and understand that archival transaction data from 8 years ago.

The support for S-Expressions is both a blessing and a curse. The ability to write logic with native data structures in it is fundamentally interesting, but it leads to lots of reinvention of somewhat crappy Lisp implementations.

The tooling ecosystem has been slowly improving outside of JVM, particularly the latest JS implementation.

In a vacuum, the support for type annotations, timestamps, decimals and binary serialization make it superior to JSON for use cases where self describing data is appropriate.


Looks nice. I saw that there is no PHP implementation yet. Would doing one and publishing it on GitHub get me something besides a "kudos" from Amazon? I am not asking for a position at Amazon, but maybe an interview?


The easiest way to get an interview at Amazon is to get a referral. If you can demonstrate competent programming abilities and have a half decent attitude, it shouldn't be too hard to get a referral from someone at Amazon, regardless of what projects you have under your belt.


I’ve published two libraries for Amazon products. They didn’t care.


I was thinking about spinning up a support library for Haskell.. but it’d be a pretty serious investment of time when everyone’s employment is cut back or up in the air already. It would be nice to get a crate of sanitiser or something.


Who's going to maintain it? Are you just doing it for an interview, or are you offering real support for the library? That's the reason people are paid to work on software versus doing it in their free time.


Why would you want that? There are so many better companies to work for that don’t abuse their most vulnerable workers.


Have you tried applying?


Haven’t you heard it sucks working there? If you are replaceable by a grad student, they will let you go before your options are vested.


PHP is a banned language internally at Amazon and Amazon subsidiaries, so they will not care.

... why am I getting downvoted for offering direct experience as an AMZN engineer? Amazon InfoSec forbids PHP. See also: https://news.ycombinator.com/item?id=23030330


You're getting downvoted because one true statement - "PHP is banned for internal use" - does not imply the other - "they will not care".

Google also bans PHP but has official PHP client libraries for all its APIs.

Both companies care about having and maintaining PHP SDKs so long as their paying customers want to consume their products/APIs using PHP.


untrue for client libraries, they would care.

https://aws.amazon.com/sdk-for-php/


SDK != Corp Policy

You can't use PHP internally at Amazon. Downvotes and ignoring facts do not suddenly make my factual comment "untrue".

See also: https://news.ycombinator.com/item?id=23030330


Now I'm confused. Are you saying Amazon uses Hack internally, which compiles to PHP? The Hack website doesn't have much info and I'm not familiar with it. There's clearly an Amazon GitHub repo for an AWS SDK written in PHP, but you're adamant that Amazon does not use PHP at all. So which is it?


They vend an AWS SDK in PHP for their customers, but they never run any internal software on PHP.

The AWS SDK in PHP helps generate web requests to Amazon's services, most written in Java.


They ran the internal wiki on PHP while I was there. Don’t know if it’s been upgraded since.


Ion != Internal-only Amazon software.

Many AWS services use it as an interface language. Many AWS customers use PHP.


Out of curiosity is Mason still a thing?


Hahaha ... yes. :(


That's not really how that works. Just don't make it the only language you know.


It is absolutely how it works when there is an internal policy against using PHP.

That's literally fact. That's literally how it works.

Additional resource: See also: https://news.ycombinator.com/item?id=23030330


Sure, PHP is against internal policy.

But if you take the initiative to open source a client library in PHP and it gets the attention of AWS it absolutely could result in an interview.

If you are interviewing and you whiteboard your solution in PHP they won't hold it against you. The language is less important than the concepts. Granted, if the only language you know is PHP that could be a risk in your career. I think that holds true for any developer, though.

Source: Used to work and interview at AWS


Yeah, it's literally not how it works, writing code in a banned language doesn't prevent you from getting interviews or being taken seriously.


As it should be.


Interesting that they don't have a Kotlin or Swift version. Do their iOS clients just communicate with plain JSON? Are they all secretly written in JavaScript?


iOS shopping app is a web app and it uses plain JSON.


From the document:

> The following timestamp encoded as a JSON string requires 26 bytes

> ...

> This timestamp requires just 11 bytes when encoded in Ion binary

So, we just use JSON, and our solution to this problem has been to pass 64 bit unix timestamps around. It doesn't provide arbitrary precision, but for most use cases it is more than enough practical range & precision to get the job done. And of course we store & transmit everything as UTC, so there is no weirdness around needing to store additional timezone information. To give you an idea, our database columns are named things like CreatedUnixTimestamp.

It is also trivial to compare 64-bit timestamps without conversion, so any SQL storage of these as integers should yield massive speedups to queries against these types - Assuming you are coming from some more complex datatype like a string or byte array.
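The scheme the parent describes can be sketched with just the Python standard library (the variable names mirror their hypothetical CreatedUnixTimestamp column):

```python
from datetime import datetime, timezone

# Store and transmit everything as integer Unix timestamps, always in UTC
created_unix_timestamp = int(
    datetime(2020, 7, 22, 18, 0, tzinfo=timezone.utc).timestamp()
)
updated_unix_timestamp = int(
    datetime(2020, 7, 23, 9, 30, tzinfo=timezone.utc).timestamp()
)

# Ordering is plain integer comparison: no parsing, no timezone math
assert updated_unix_timestamp > created_unix_timestamp

# Convert back to an aware datetime only at the edge (e.g. the client)
dt = datetime.fromtimestamp(created_unix_timestamp, tz=timezone.utc)
assert dt == datetime(2020, 7, 22, 18, 0, tzinfo=timezone.utc)
```

Since the integers sort the same way the instants do, a SQL index on such a column orders rows chronologically for free.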


> So, we just use JSON, and our solution to this problem has been to pass 64 bit unix timestamps around.

Passing an integer does not have the same semantics as passing a timestamp. Relying on out-of-band info to parse a document is a problem in the making.

> but for most use cases it is more than enough practical range & precision to get the job done.

Parsing s-expressions would also get the job done, even if it's a primitive s-expression that only supports cons cells and a string data type. However, people find value in enabling the parser to validate booleans, arrays, and objects.

Ion is just a logical next step. Timestamps are quite naturally a fundamental data type in communication between web services, particularly in binary form.


Pass 64-bit Unix timestamps around as JSON numbers? That's a bad idea, seeing as they're 64-bit floats. You're better off formatting your 64-bit integers as strings.


53 bits of usable range is plenty for our purposes. Our serializer & database are not hobbled by the limitations of javascript, so the representation is only compromised as it is processed at the end client. This is not a concern for us.

For reference, MAX_SAFE_INTEGER can represent something around the year 285428751.
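The 53-bit ceiling is easy to demonstrate; Python floats are the same IEEE 754 doubles that JavaScript numbers are:

```python
# 2**53 is the point past which consecutive integers stop being
# distinguishable in a 64-bit float (Number.MAX_SAFE_INTEGER is 2**53 - 1)
assert float(2**53) == float(2**53 + 1)   # precision silently lost
assert float(2**53 - 1) != float(2**53)   # still exact below the ceiling

# 2**53 - 1 seconds since the epoch is roughly 285 million years out,
# so second-resolution timestamps are nowhere near the limit
years = (2**53 - 1) / (365.25 * 24 * 3600)
assert years > 285_000_000
```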


I'm rather attached to things like time zones.


You can have whatever timezone you want if you store/transmit things as UTC. The final client device javascript should be the point at which the conversion to local time occurs, because the browser is best aware of the correct timezone.

Everything on the server is just done in terms of UTC. I actually cannot think of a reason I would want to process a timestamp in terms of local time on the server.


Well, one example is a notification that fires on the client’s timezone rather than UTC.


I would say https://cognitect.com/blog/2014/7/22/transit is a better option, no?


It’s not listed but there is a Go library

https://github.com/amzn/ion-go


What has bugged me a lot with JavaScript is that it lacked a standard representation of dates and decimals (like money), making it feel inferior for application development. I'm happy to finally see this addressed both in JavaScript itself and also in serialization formats.

(Though it looks like Ion is not solely targeting JS, I assume it is nice to consume Ion data in the frontend.)


Nope, backend if anything. For example, their new QLDB product uses it to get consistent hashing of documents on account of Ion being a canonical format.


I’ll start using it when AWS adopts it :) This is only used in retail orgs... the ecosystem is the biggest issue.

Edit: in the public API


For the public API, customers want JSON, so they get JSON. Internally there's Coral, and something like Coral or Protobuf is outright superior for the use case of an API where a schema can be distributed in advance. The only real use case for Ion is when you have data that's already JSON-formatted for whatever reason and you want to compress it for storage or transit.


Yep and Coral is also open sourced as AWS Smithy, it makes no sense to assume AWS usage means anything or vice versa.


Coral being open sourced is huge! Why wasn’t this on HN?


Last time I checked, QLDB is in AWS



I was hoping to see a UUID type, since so many people choose either unreadable base64 or wasteful strings. It looks like 0x12341234_1234_1234_1234_123412341234 should convey the bits, but it won't pprint or validate the way a dedicated type would. Ditto for IPv6 addresses.


An interesting point - I browse with JavaScript disabled. The example at the bottom of the page rendered for me without newlines, which made it completely unparsable due to comments like:

  // Field names
This experience has reminded me why JSON is such a great format.

And having a whinge while I'm writing: "superset of JSON" is basically false advertising even though it is true; JSON's refusal to admit that line breaks are a thing is a major feature. I don't care if it is technically correct and useful to some customers; if line breaks matter, it is inappropriate to talk about a format's relation to JSON, because people will get the wrong idea. The JSON brand is so strong because it is nigh-impossible to get wrong. This format gets screwed up - e.g., for people who don't like JS.


I think "superset" is a clear relationship. It means "legal JSON is legal Ion", just like "legal JSON is legal YAML". I don't think it's inappropriate to point that out. In fact, it's an excellent feature.


The base64 encoded text in the example is: 'To infinity... and beyond!'


I really like the type::value pattern. It provides some attractive options for embedded languages. If python allowed

    fun x:
      query = sql :: 
         select * from table
    
I'd be pretty happy.


It kind of does:

    class SQL(str):
        ...

    query = SQL("""
        select * from table
    """)


> This binary format supports rapid skip-scanning of data to materialize only key values within Ion streams.

(1) awesome but (2) 'key values' is a confusing way to say this


Lots of empty promises:

* int: arbitrary size integers

* decimal: arbitrary precision, base-10 encoded real numbers

* timestamp: arbitrary precision date / timestamps, with ISO 8601 format "2019-05-01T18:12:53.472-0800".

So: the exact same drawbacks as JSON, basically:

* Large integers will be cast to 32 or 64 bits in most languages no matter what.

* Arbitrary-precision decimals will be cast to float or double as well.

* ISO timestamps are not well specified when it comes to milliseconds, microseconds, and timezones.
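Whether those casts happen is a per-language implementation matter rather than something inherent to the format; for instance, Python's stdlib json can preserve both (a sketch, not a claim about any Ion implementation):

```python
import json
from decimal import Decimal

# Python ints are arbitrary precision, so a large integer survives a
# round-trip here; the same literal silently rounds in JavaScript
n = json.loads("9007199254740993")        # 2**53 + 1
assert n == 2**53 + 1

# The float cast for decimals can be opted out of at parse time
doc = json.loads('{"amount": 19.99}', parse_float=Decimal)
assert doc["amount"] == Decimal("19.99")
```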


Any element of an ISO 8601 time can have a decimal fraction with any number of digits, but only the lowest-order element (according to Wikipedia; I don't have the actual standard in front of me). So you can definitely have a timestamp of 2020-07-23T12:37:55.758145Z.

The standard accommodates timezones as offsets from UTC, because it's a representation of a timestamp, not a local time at a particular geographical location. So things like daylight-saving-time periods are not relevant.
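As a quick check with the Python standard library (strptime's %f accepts up to six fractional-second digits):

```python
from datetime import datetime, timezone

ts = "2020-07-23T12:37:55.758145Z"
dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

assert dt.microsecond == 758145           # fractional seconds preserved
assert dt.utcoffset().total_seconds() == 0  # "Z" means a zero UTC offset
```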


This is awfully negative. JSON explicitly does not declare the represented range of floats or integers, and doesn’t have a distinct arbitrary-precision decimal type. I haven’t read the Ion spec, only the description, but since it’s advertising arbitrary precision, presumably any implementation that does not support that is not a correct implementation at all.


In practice that means having (or adding) support for arbitrary-precision numbers and decimals in the languages/platforms they want to cover. I am skeptical they would do that in C, for example.


The C implementation bundles the ICU decNumber library for decimal numbers.


Fun fact: Ion is used heavily in KFX, the book format for Kindles.


Why do we need more random serialization formats?


After switching as much as possible over to JSON or LZ4 compressed JSON, life is good. Never going back to another serialization format.


The obligatory xkcd: https://xkcd.com/927/


If I recall correctly, Ion precedes even Google's protobuf and is 20+ year old technology. This isn't the result of "yet another standard" but parallel evolution.


This is a funny joke.

Also, this is the way any progress is made. Out of 15 competing standards, some win out.

Were it not so, we'd still be using whatever COBOL used for data serialization.


Why the name?


Ion is actually two formats, with Ion data having a canonical representation both in binary and in human-readable text. The text format's file extension is ".ion" and the binary format's file extension is ".10n", and I think that's the entire motivation.


embrace extend extinguish

also

https://xkcd.com/927/


Oh no, the xkcd.com/927 cycle begins.

We had JSON5, now we have Ion. Google and Microsoft will probably roll their own too, soon.

Why the IT community always forks its standards and never merges them has baffled me for >20 years.


Google has already protobuf for a long time.


Ion predates JSON FYI. I heard Amazon invented it internally many years ago



