Typical implementations of payment processing and related security issues

jorangreef · on July 18, 2022

I had the experience of being part of a team in 2020 doing performance and safety analysis on Mojaloop [1], the open-source payments switch. The emphasis was mostly on identifying performance bottlenecks. For example, graphing waterfalls of database queries, estimating expected vs actual concurrency, digging into latency spikes, surprising timeout interactions, and missed group commit opportunities.

However, on the safety front, one of the most challenging issues was guaranteeing the rollback of funds in the event of failure as part of the two-phase commit protocol for moving money—coordinating this across multiple SQL queries, database transactions, Kafka queues, even multiple code repositories, especially as different systems experience clock drift or as disks or machines fail.

You want to ensure that the money either moves, or doesn't move, that it doesn't get lost somewhere in between. Yet most payment systems re-implement all this business logic on an ad hoc basis, again and again.
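To make the "moves or doesn't move" property concrete, here is a minimal in-memory sketch of a two-phase transfer. It is not Mojaloop's or TigerBeetle's code; the `Ledger` class and its method names are hypothetical, and a real system would also need durability, timeouts, and replication.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory ledger; every name here is hypothetical."""
    balances: dict[str, int]
    # Pending transfers: id -> (debit account, credit account, amount)
    pending: dict[str, tuple[str, str, int]] = field(default_factory=dict)

    def reserve(self, transfer_id: str, debit: str, credit: str, amount: int) -> None:
        """Phase 1: take the funds out of the payer's spendable balance."""
        if transfer_id in self.pending:
            raise ValueError("duplicate transfer id")
        if amount <= 0 or self.balances[debit] < amount:
            raise ValueError("invalid amount or insufficient funds")
        self.balances[debit] -= amount
        self.pending[transfer_id] = (debit, credit, amount)

    def commit(self, transfer_id: str) -> None:
        """Phase 2a: the payee receives the reserved funds."""
        _debit, credit, amount = self.pending.pop(transfer_id)
        self.balances[credit] += amount

    def rollback(self, transfer_id: str) -> None:
        """Phase 2b: the reserved funds return to the payer."""
        debit, _credit, amount = self.pending.pop(transfer_id)
        self.balances[debit] += amount

    def total(self) -> int:
        """Money in accounts plus money in flight; this must never change."""
        in_flight = sum(amount for _, _, amount in self.pending.values())
        return sum(self.balances.values()) + in_flight
```

The invariant worth asserting after every step is that `total()` stays constant: a transfer is always in exactly one of three states (reserved, committed, rolled back), so nothing can be lost in between.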

We therefore extracted these primitives from Mojaloop “once and for all” to create TigerBeetle [2], an open-source financial accounting database, that provides multi-AZ replication, automated leader election, and two-phase payments out-of-the-box.

Moving this business logic to the financial database layer opened up three orders of magnitude more performance, with TigerBeetle able to process a million transactions a second, all with an extremely high safety standard. For example, we do FoundationDB-style deterministic simulation testing, but with automated storage fault injection, such as 30% corruption on all replicas including the leader. Design docs, and links to talks are all in the GitHub repo [2].

Happy to answer questions from our experience, or chat if you're working on similar systems!

[1] https://mojaloop.io

[2] https://github.com/coilhq/tigerbeetle (Zig)

jjoonathan · on July 18, 2022

> You want to ensure that the money either moves, or doesn't move, that it doesn't get lost somewhere in between

Speaking of which, I finally hit this nightmare scenario -- in one hand I have an international wire proof of payment from a customer, in the other hand I have my bank insisting that they never received a transfer. It's weeks later, I never shipped, and my customer says they never got the money back -- but they haven't been applying pressure to me, so if they are a scammer they are very bad at it. They also know an atypical amount of RF engineering for a scammer. At the moment it really does look to me like the banking system ate his money.

I sense a bit of Fear of God motivating your story, something that motivated your management to care -- who is that authority, so that I might get them on my case?

rowls66 · on July 18, 2022

From my experience in cross border payments, sanctions screening (Office of Foreign Asset Control - OFAC) is the most likely source of the problem. Most likely, your customer's bank uses a US bank as a correspondent for all of their USD payments. The correspondent bank will scan every payment instruction that they receive against a list issued by OFAC. If there is a match, the correspondent bank will not execute the payment, and will seize the money from your customer's bank's account with the correspondent. In order to get the money un-seized, your customer will need to provide his bank with a lot of documentation proving that he is not the bad guy on the OFAC list. This can take a while. Your customer is probably aware of all of this and that is why he is not hassling you.

So yes, the banking system did eat his money, but that is what it is designed to do.

wiredfool · on July 18, 2022

I had an OFAC compliance department hold up a wire from US->IRE because the bank it was going to (Permanent TSB) shared the suffix TSB with some Russian bank. There are 6 chartered banks in Ireland, but I had to dig out all sorts of info about the bank to prove that the Irish bank wasn't in Russia.

I think it was 3 weeks or so to sort it out, when usually this is a nextish business day thing.

jjoonathan · on July 18, 2022

That sounds plausible. RF equipment moving around the world at a time of international tensions could very well draw this kind of attention. Historically I have seen it at customs, but in a world of increasingly non-tangible goods it probably makes sense to start going after payments instead.

It would have been nice if I had heard this from the bank, but I understand that nobody involved considers it a high priority to keep me informed. Thanks for the theory!

koreth1 · on July 18, 2022

This was my guess as well. And based on my experience working on international payments, an additional headache is that in some cases, depending on exactly which jurisdictions are involved, some of the parties can be forbidden by law from giving you detailed information about what's happening.

This is another "doing what it's designed to do" situation (the idea is to avoid revealing certain information to bad actors) but it can be pretty frustrating to deal with.

jorangreef · on July 18, 2022

> I sense a bit of Fear of God motivating your story, something that motivated your management to care -- who is that authority, so that I might get them on my case?

Haha! :) The authority is none other than Remzi Arpaci-Dusseau at UW-Madison, along with NASA's Power of 10: Rules for Developing Safety-Critical Code [1]. Storage faults and safety bugs, or the general lack of assertions in so many software projects, do indeed terrify us as a team!

It's also all credit to Coil's leadership, for caring and being willing to invest long term in better open source infrastructure for payments for everyone, and particularly Interledger, as an open interoperability protocol, to create an “open network of networks” for payments, to fix the problem of payments for everyone [2].

I'm glad also to see that sibling comments have given tips on how to unblock that nightmare scenario you encountered!

[1] https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...

[2] http://interledger.org/

enlightens · on July 18, 2022

Are you in the US?

https://www.helpwithmybank.gov/index.html

That's a website from the Office of the Comptroller of the Currency, the org that regulates US banks. Dig through the topics to find the one most applicable, make sure you did everything they recommend, and then hit the contact button. Or just go to the contact page to contact the OCC directly; most of their recommended steps are “call the bank and ask what’s up”.

Also good is the CFPB, and while I’m not sure if a business situation will be something they can address, it wouldn't hurt to contact them as well:

https://www.consumerfinance.gov/

jjoonathan · on July 18, 2022

It's a UK -> US transfer. I'm on the US side and I'm not entirely sure which step broke down, but I have had a difficult time getting the US banks to talk to me so this information is helpful.

Thanks!

bsaul · on July 18, 2022

Since we're on HN and we love talking about PL: how come you chose such a recent language as Zig for something as critical as a payment system?

I'm sure Zig is, in theory, safer than, let's say, C. But doesn't the fact that it's so recent make it more fragile regarding implementation bugs?

OmarIsmail · on July 18, 2022

They address it here: https://github.com/coilhq/tigerbeetle/blob/main/docs/DESIGN....

jorangreef · on July 18, 2022

Thanks for a great question!

When we made the decision:

1. We were thinking long term. For example, a distributed database is a significant investment. I love the spirit and orthogonality of C, but we didn't want to pay the safety tax of C over the next 20 years. At the same time, Zig has that same spirit—it's not only a perfect replacement, but a leap forward in developer velocity, power, performance and ergonomics.

2. We realized that TigerBeetle would take 2 years to get to a production release. This meant that our timeline would intersect with Zig's stability—like skating to where the puck is going to be at, or catching the swell as it breaks, rather than riding out the last of an old wave. The number of surfers (quantity) was not a concern. Rather, we were impressed by the sheer quality and early maturity of Zig's compilation story, not to mention the quality of the Zig community in general. For example, it would be hard to find a better std lib crypto than what Zig already has right now, thanks to Frank Denis of libsodium.

3. The design of TigerBeetle is also a single-threaded control plane. We use io_uring for the data plane to eliminate multi-threaded context switches, so multi-threading for async I/O is less of a necessary evil than it used to be. All memory is statically allocated at startup. We never call free() so there are no UAFs. You can start to see why Rust's borrow checker made less sense for our domain than it would have for others. Of course, concurrency bugs can happen on a single thread, but we make use of other techniques to mitigate them.

4. We wanted the open source to be accessible to newcomers wanting to read or contribute to the project. We didn't want to pay the cost of a steep learning curve over the lifetime of the project. Zig turned out to be a hit here, as we've received feedback from engineers who've worked on Spanner or FoundationDB—remarking on Zig's readability, and this has proved to be a force multiplier, as they've come back again and again to the source, and even sent invaluable bug reports. For example, a bug that turned out to be in Apple's O_DSYNC, not even in TigerBeetle.

5. We do exhaustive fuzz testing and deterministic simulation testing from the inside out. See TigerBeetle's VOPR [1] for more details about our internal audit function.

6. We do exhaustive fuzz testing and deterministic simulation testing from the outside in. We're working with Will Wilson's new Antithesis startup https://antithesis.com to use their deterministic Linux hypervisor as our external audit function, to test our compiled TigerBeetle binaries in a deterministic cluster environment. This is different to Jepsen and more advanced in so many ways. For example, there's coverage guided fuzzing, and if we find any bugs, we can replay—again and again.

7. Finally, and most importantly, Andrew Kelley's design decisions, and Zig's approach to safety really resonated with us: no macros, no hidden control flow, no unused variables (this has caught several bugs for us already), checked arithmetic enabled by default in safe builds, spatial memory safety, out-of-memory safety, and explicit memory allocators—crucial since TigerBeetle does not do any dynamic memory allocation after startup. Zig is a superbly well-designed language. For example, Zig's comptime is immensely powerful.
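The deterministic simulation idea in points 5 and 6 can be sketched in a few lines. This is only the core principle, not the VOPR: all randomness comes from one seeded generator, so any failing run can be replayed exactly from its seed. The `simulate` function, the CRC32 checksum, and the default 30% fault rate are illustrative assumptions.

```python
import random
import zlib


def simulate(seed: int, writes: int = 50, corruption_rate: float = 0.3) -> list[str]:
    """Run one simulated workload; the same seed replays the same faults."""
    rng = random.Random(seed)  # the only source of randomness in the run
    events = []
    for i in range(writes):
        payload = f"transfer-{i}".encode()
        checksum = zlib.crc32(payload)
        stored = bytearray(payload)
        if rng.random() < corruption_rate:
            # Inject a storage fault: flip one bit at a random position.
            position = rng.randrange(len(stored))
            stored[position] ^= 1 << rng.randrange(8)
        # The reader must detect the fault instead of trusting the disk.
        if zlib.crc32(bytes(stored)) != checksum:
            events.append(f"write {i}: corruption detected")
        else:
            events.append(f"write {i}: ok")
    return events


# Replaying a seed reproduces the exact same sequence of faults.
assert simulate(42) == simulate(42)
```

A bug found at seed N stays reproducible at seed N, which is what makes replaying "again and again" possible.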

If you're curious, we also speak to this as a team in the Q&A at the end of the Zig SHOWTIME talk [2] we did last year, and I go into this in detail in a talk I did at the Recurse center last month [3], also touching on why we picked VSR over RAFT or Paxos for the consensus protocol (that's a whole 'nother can of Paxos!). :)

P.S. We run a $20k bug bounty for our consensus code. If anyone can find a Zig bug that violates TigerBeetle's consensus or replication, then we have bounties up to $8192.

[1] https://github.com/coilhq/tigerbeetle#simulation-tests

[2] https://www.youtube.com/watch?v=BH2jvJ74npM

[3] https://www.youtube.com/watch?v=rNmZZLant9o

jasfi · on July 18, 2022

> Moving this business logic to the financial database layer opened up three orders of magnitude more performance

This is because there is an IO cost in sending a request to the DB and then receiving a response. If the DB is on another machine the cost is even higher. Do this enough times and the wait times add up a lot. By shifting business logic to stored procedures you avoid this.

That's also why SQLite is very fast, as it runs in your application's memory as a library. But then your data is tied to the same limitations as the machine the application is on.

jorangreef · on July 18, 2022

> By shifting business logic to stored procedures you avoid this.

Thanks, we considered stored procedures to bring the number of database queries down from 18 queries per payment to 1 query per payment. However, that would have provided only an order of magnitude improvement, and brought with it complexity of testing, compared to the state machine [1] we have in TigerBeetle.

At the same time, the biggest performance bottleneck is not only the number of roundtrips, but the lack of first-class batching in the interface per roundtrip. What we do in TigerBeetle instead, is we send 8192 transfers in a single network request. This brings the network/disk cost equation down from 1 query per payment, to 1/8192 query per payment. It's like group commit, on steroids.

[1] https://github.com/coilhq/tigerbeetle/blob/main/src/state_ma...
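The arithmetic behind the batching claim can be worked through directly. The sketch below is not the TigerBeetle client API; `batches` and `round_trips` are hypothetical helpers that only illustrate how grouping 8192 transfers per request changes the number of round trips.

```python
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 8192  # transfers per request, the figure quoted above


def batches(transfers: Iterable[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Group transfers into lists of at most `size` items."""
    iterator = iter(transfers)
    while batch := list(islice(iterator, size)):
        yield batch


def round_trips(payments: int, queries_per_payment: float) -> float:
    """Total round trips for a workload at a given per-payment query cost."""
    return payments * queries_per_payment


transfers = [{"id": i, "amount": 1} for i in range(20_000)]
requests = list(batches(transfers))
print(len(requests))                      # 3 requests: 8192 + 8192 + 3616
print(round_trips(8192, 18))              # 147456 at 18 queries per payment
print(round_trips(8192, 1))               # 8192 at one query per payment
print(round_trips(8192, 1 / BATCH_SIZE))  # 1.0 with 8192 transfers per request
```

Going from 18 queries to 1 is roughly one order of magnitude; going from 1 to 1/8192 is almost four more, which is where the rest of the improvement comes from.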

> That's also why SQLite is very fast, as it runs in your application's memory as a library. But then your data is tied to the same limitations as the machine the application is on.

SQLite is one of my favorite storage engines. However, SQLite does not solve our storage fault model. For example, misdirected reads/writes, lost reads/writes, bitrot in the middle of the committed log. SQLite was also not explicitly designed to be integrated with a global consensus protocol as per “Protocol-Aware Recovery for Consensus-Based Storage” from UW-Madison. For example, there are optimizations around storage fault tolerance in the commit log that you can do, or around deterministic storage across replicas for faster distributed recovery, that you can't do with SQLite. Check out the paper [2] from UW-Madison for the details, which apply also to LevelDB and RocksDB. We wanted our engine also to be able to run in our deterministic simulator. For example, no random thread scheduling etc.

[2] https://www.usenix.org/conference/fast18/presentation/alagap...

jasfi · on July 18, 2022

It sounds like the architecture is different to what I understood. I'll read up more on your system, it sounds very interesting!

jorangreef · on July 18, 2022

Thanks! Great to see that you're working on projects in Nim!

jasfi · on July 19, 2022

Always good to hear someone else interested in Nim!

rurban · on July 19, 2022

I hope you'll realize that SQLite is a completely insecure hack, with insecure defaults and architecture. Why not take a slower but proper DB?

eg https://research.checkpoint.com/2019/select-code_execution-f...

jorangreef · on July 19, 2022

Hey Reini! I've always loved your smhasher benchmarks. Appreciate also that you have Mitzenmacher's tabulation hashing in there, which is my goto.

We don't use SQLite—I don't know how you got that impression? You can find out more about TB's actual storage engine here [1].

However, I have only tremendous respect for SQLite. And to be fair, it's meant to be run embedded, the application is responsible for security, and many applications couldn't go far wrong picking SQLite. It's a fantastic piece of engineering.

[1] https://www.youtube.com/watch?v=yBBpUMR8dHw

zackmorris · on July 18, 2022

It sounds like payments might be part of the larger concept of declarative programming (DP):

https://stackoverflow.com/questions/129628/what-is-declarati...

Maybe TigerBeetle could be generalized to support any multi-step distributed process?

I've used DP for backend server work on AWS with Terraform and it gave me perhaps 100x leverage over what I would have been able to do manually. And I think I first heard about it from a friend who used it in Ansible in the 2010s.

Now I see everything through that lens and find most of the online tutorials at sites like https://www.raywenderlich.com to be somewhat tedious and perhaps too application-driven (as opposed to theory). Not to single them out - they're one of the best, along with https://laracasts.com for backend server concepts. But to really get to scalable solutions, DP is a must IMHO, because it raises attention from implementation details to repeatable processes, which frees the developer to work at a much higher level of abstraction.

jorangreef · on July 19, 2022

> It sounds like payments might be part of the larger concept of declarative programming (DP)

Yes, exactly! The idea with TigerBeetle's state machine [1] is to expose double-entry accounting as a higher level financial primitive, so that developers can think in terms of declaring transfers from one account to another. The business logic behind the scenes is detailed. The interfaces and data structures are simple.

[1] https://github.com/coilhq/tigerbeetle/blob/main/src/state_ma...
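As a rough illustration of "declaring transfers from one account to another", here is a minimal double-entry sketch. The `Transfer` and `Book` types and their field names are hypothetical and are not TigerBeetle's schema; the point is only that every transfer touches exactly one debit and one credit, so the books balance by construction.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    debit_account: str
    credit_account: str
    amount: int  # integer minor units (e.g. cents), never floats


class Book:
    def __init__(self) -> None:
        self.debits = defaultdict(int)   # account -> total debited
        self.credits = defaultdict(int)  # account -> total credited

    def apply(self, transfer: Transfer) -> None:
        if transfer.amount <= 0:
            raise ValueError("amount must be positive")
        if transfer.debit_account == transfer.credit_account:
            raise ValueError("accounts must differ")
        # One debit and one credit of the same amount, applied together.
        self.debits[transfer.debit_account] += transfer.amount
        self.credits[transfer.credit_account] += transfer.amount

    def balance(self, account: str) -> int:
        return self.credits[account] - self.debits[account]

    def is_balanced(self) -> bool:
        return sum(self.debits.values()) == sum(self.credits.values())


book = Book()
book.apply(Transfer("alice", "bob", 250))
book.apply(Transfer("bob", "carol", 100))
assert book.balance("bob") == 150 and book.is_balanced()
```

The caller only declares what should move; the checks and bookkeeping stay behind the interface.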

> Maybe TigerBeetle could be generalized to support any multi-step distributed process?

That's part of the plan: that the distributed database framework of TigerBeetle can be used as an “Iron Man suit” to support any kind of state machine, with high availability and fault tolerance.

zackmorris · on July 20, 2022

Oh nice, where has this been my whole life? Keep fighting the good fight!

jorangreef · on July 21, 2022

“I fight for the users!”

kyrra · on July 18, 2022

Googler opinions are my own.

As someone who has been integrating with payment processors for the past 7 years at Google, this is a great article. Many of the things in here are things we test when we integrate with processors.

This post spends a lot of time on redirect-style payments, where the user lands on some other website to enter credentials or confirm payment. This form of payment is fraught with corner cases, because the merchant and the processor are using the user's browser as the transport mechanism (i.e., the user can MITM part of this flow). If you are not super extra careful, there are so many ways to attack it. If the payloads aren't signed or encrypted, you have to worry about users messing with the content. If they are signed/encrypted, you still have to worry about replay attacks.

(I've spent a lot of time thinking about redirects the last 3 years, especially with writing of our redirect spec: https://developers.google.com/pay/redirect-fop-v1?hl=en )
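A minimal sketch of the two defenses mentioned above, signing the payload and rejecting replays, might look like this. It is not taken from the Google redirect spec; the shared secret, the envelope layout, and the in-memory nonce set are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
import secrets
import time

SECRET = b"shared-secret-between-merchant-and-processor"  # hypothetical key
MAX_AGE_SECONDS = 300
_seen_nonces: set[str] = set()  # a real deployment needs a shared store with expiry


def _canonical(body: dict) -> bytes:
    """Serialize deterministically so both sides sign identical bytes."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign(payload: dict) -> dict:
    """Attach a nonce and timestamp, then sign the whole body."""
    body = dict(payload, nonce=secrets.token_hex(16), ts=int(time.time()))
    sig = hmac.new(SECRET, _canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}


def verify(envelope: dict) -> dict:
    """Reject tampered, stale, or replayed payloads; return the body otherwise."""
    body = envelope["body"]
    expected = hmac.new(SECRET, _canonical(body), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["sig"]):
        raise ValueError("bad signature")
    if abs(time.time() - body["ts"]) > MAX_AGE_SECONDS:
        raise ValueError("stale")
    if body["nonce"] in _seen_nonces:
        raise ValueError("replay")
    _seen_nonces.add(body["nonce"])
    return body
```

The signature stops the user from editing the amount in transit; the nonce and timestamp stop a valid, signed "payment succeeded" message from being presented a second time.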

Nextgrid · on July 18, 2022

Ideally you shouldn't be using the browser as a transport mechanism. The browser should only be a means of data entry - redirect the user to the payment processor so they can enter their details, but then the redirect back should just go to a "payment status" page while your system on the backend waits for a notification (webhook, etc) from the processor in the background which would then update the record on your side (at which point the read-only status page will pick it up on its next poll and then redirect the user to the next step of whatever process they were doing).

This also allows you to recover the flow should the user drop off after they've paid - you can email them a link to that "payment status" page so they can "rescue" the flow without having to start from scratch.
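The flow described above can be sketched as follows. Everything here is hypothetical (the function names, the in-memory dictionaries standing in for a database, the URL shape), and webhook signature verification is omitted for brevity even though a real handler needs it.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    PAID = "paid"
    FAILED = "failed"


payments: dict[str, Status] = {}     # stand-in for a database table
processed_events: set[str] = set()   # webhook event ids already applied


def start_payment(payment_id: str) -> None:
    """Record the payment before redirecting the user to the processor."""
    payments[payment_id] = Status.PENDING


def handle_webhook(event_id: str, payment_id: str, outcome: str) -> None:
    """Server-to-server notification: the only code path that changes state."""
    if event_id in processed_events:
        return  # processors retry webhooks, so handling must be idempotent
    processed_events.add(event_id)
    if payments.get(payment_id) is Status.PENDING:
        payments[payment_id] = Status(outcome)


def status_page(payment_id: str) -> dict:
    """Read-only view polled by the browser; it ignores redirect parameters."""
    status = payments.get(payment_id)
    if status is None:
        return {"status": "unknown"}
    if status is Status.PAID:
        return {"status": status.value, "next": f"/orders/{payment_id}/confirmation"}
    return {"status": status.value}
```

Because the status page only reads, the same URL can be emailed to a user who dropped off, and nothing the browser sends can mark a payment as paid.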

afiori · on July 19, 2022

We should essentially use the OAuth PKCE flow everywhere it is applicable.
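For readers unfamiliar with PKCE (RFC 7636), the core of it is small: the party starting the flow keeps a random verifier secret and sends only its SHA-256 hash, then proves it started the flow by revealing the verifier at the end. The sketch below shows the S256 method; how it would map onto a specific payment redirect is left open, and the function names are hypothetical.

```python
import base64
import hashlib
import secrets


def make_verifier() -> str:
    # 32 random bytes encode to 43 URL-safe characters,
    # inside the 43..128 length range the RFC requires.
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()


def make_challenge(verifier: str) -> str:
    """S256 method: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()


def server_accepts(stored_challenge: str, presented_verifier: str) -> bool:
    """The server recomputes the challenge and compares in constant time."""
    return secrets.compare_digest(make_challenge(presented_verifier), stored_challenge)


verifier = make_verifier()
challenge = make_challenge(verifier)   # sent with the initial redirect
assert server_accepts(challenge, verifier)
assert not server_accepts(challenge, make_verifier())
```

Anyone who intercepts the redirect sees only the challenge, which is useless without the verifier held by the party that started the flow.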

pbreit · on July 18, 2022

What if you rely on lookups or webhooks to verify payment details?

hardwaresofton · on July 18, 2022

Does anyone else find it weird that this site absolutely fails to mention Stripe? The word is not in the article, and the logo is not included.

Not alleging anything here but a bit curious -- I can't imagine having a discussion of payment providers without mentioning Stripe.

I wouldn't be surprised, but are they just about as close to perfection as it gets on the fraud/security front?

Tomdarkness · on July 18, 2022

Yeah, in the list in the article the only "modern major payment service" names I actually recognised were PayPal and Alipay. I noticed that Rubles are mentioned multiple times in the article, perhaps the author is Russian. Is Stripe even available in Russia?

Having said that, I thought PayPal had suspended operations in Russia.

Tijdreiziger · on July 18, 2022

> I noticed that Rubles are mentioned multiple times in the article, perhaps the author is Russian.

It looks like the article is available in both Russian and English (language selector in the top right).

nibbleshifter · on July 18, 2022

Stripe doesn't operate in Russia.

mirkodrummer · on July 18, 2022

Noticed the same thing and went back to comments just to check if I was alone or not :) Pretty odd for sure…

nibbleshifter · on July 18, 2022

Article mostly covers payment methods used in CIS (Russia, etc).

Stripe isn't available in those countries.

bigtones · on July 18, 2022

Yeah I had just one thought after skimming it, use Stripe.

rafaelturk · on July 18, 2022

I'm impressed by how Brazil's Pix solved many of those problems. IMO it is currently the best payment method in the world, mainly because it broke away from Visa/Mastercard rules and implemented a new protocol from scratch. http://openpix.com.br/

kyrra · on July 18, 2022

The website you link is not the Pix specification put out by the Brazilian government. Rather that is a payment integrator building services on top of the government spec.

Their specs are published here: https://openbankingbrasil.atlassian.net/wiki/spaces/OB/overv..., and the main landing page for the program is here: https://www.bcb.gov.br/en/financialstability/pix_en

dustypotato · on July 18, 2022

Also see UPI in India. It's very convenient, and everyone from street-side vendors to tech giants like Google accepts payments through it.

Tijdreiziger · on July 18, 2022

Most EU countries have something like this (iDEAL, Bancontact, Swish). Works great but breaks down when you need to do a cross-border payment (e.g. you are German and want to order from a Dutch shop).

birracerveza · on July 19, 2022

Situation: there are 14 competing standards

https://xkcd.com/927/

mritchie712 · on July 18, 2022

The spaghetti of systems involved in this post alone is why I think crypto will ultimately work. This[0] is a long thread, but it makes a strong case for payments in crypto.

Not sure how Visa, a bank, or any other competitor beats a cheap, open source, global payments system.

0 - https://twitter.com/SBF_FTX/status/1548292844640616451

XorNot · on July 18, 2022

That thread makes crypto look better by just plain ignoring any problems.

It ignores the existence of exchanges for crypto, or that exchange rate fluctuates widely, assumes transactions fees in crypto are zero, assumes exchanges don't charge fees. It ignores that in the event of any problem in the crypto process there is no refund or unwind possible. It assumes crypto payments are instant, which they are not, and definitely not at scale.

It also manages to confuse "credit card" with "debit card" because you can't get "surprise batch transactions" with a credit card, that's the whole point of them.

And finally it's incredibly US-centric, because if it weren't it might have to acknowledge that most of those problems are due to the US banking system, and other civilized countries solve them.

mbesto · on July 18, 2022

It also ignores the absolute elephant in the room: what does it mean to settle a payment in near real time and protect against fraud? (like how CCs work)

Payments is just "Stevey's Google Platforms Rant" but slanted towards payments:

Like anything else big and important in life, Accessibility has an evil twin who, jilted by the unbalanced affection displayed by their parents in their youth, has grown into an equally powerful Arch-Nemesis (yes, there's more than one nemesis to accessibility) named Security. And boy howdy are the two ever at odds.

https://gist.github.com/chitchcock/1281611

zinekeller · on July 18, 2022

> And finally it's incredibly US-centric, because if it weren't it might have to acknowledge that most of those problems are due to the US banking system, and other civilized countries solve them.

US banking is probably the most frustrating experience, at least in the developed world (to the point that some developing countries have better systems). That probably explains why PayPal and Venmo (which are private enterprises) are popular with consumers while businesses are still relying on checks.

mritchie712 · on July 18, 2022

> assumes transactions fees in crypto are zero

The fee is in there, it's just very small. That's the point.

> The fee I paid was $0.0002: around 2% of a penny.

XorNot · on July 18, 2022

The idea the fee remains small in the case of Bitcoin-likes is questionable. If "mining" rewards no longer exist, then who's fronting the money to pay to run the network?

rowls66 · on July 18, 2022

Disagree. The issues described are not with the underlying payment system, but with the integration between the merchant and their payment processor. These integrations are complex, and with thousands of merchants integrating mistakes are likely to happen.

Crypto is not going to change anything. Merchants will continue to use payment processors, and merchants will continue to make mistakes in those integrations.

ilaksh · on July 18, 2022

It's not just that it's cheap, open source and global.

Fundamentally cryptocurrency is a superior approach to digital money.

The first most basic problem that it solves that many payment systems have is the need to _give away your credentials_ in order to transact. This is solved by employing cryptography.

This is why, despite all of the unfortunate scams, when people are fundamentally sceptical about the underlying paradigm of cryptocurrency, it is evidence of deep technical ignorance.

It's similar to when cars started replacing horses in the street. People were so used to seeing little piles of horse sht everywhere that they did not realize what a big deal it would be to finally get rid of it.

birracerveza · on July 18, 2022

Pretty much. There won't be an anarchist wild west scenario as the maxis hope, but crypto will inevitably take over legacy systems due to how unmaintainable these have become.

The anti-crypto crowd tends to forget that even though the vast majority of blockchain transactions can be attributed to scams, rugpulls or whatever else, the fact remains that an insane amount of value, bogus or not, has been transacted with little to no issues and downtime on the most popular blockchains (except Solana, that is).

Once regulation is in place you'll have a technology that is vastly superior to the duct-taped monster we have today.

mellavora · on July 18, 2022

Situation: there are 14 competing standards

https://xkcd.com/927/

ravirajx7 · on July 18, 2022

I think this is absolutely the right time to ask something of the people who have worked on this kind of integration. Currently we have a PSP which sends a webhook on successful payment. The challenge is that once I receive the webhook (a stateless POST request to one of our servers), we want to attach it to a browser with an open websocket connection. I then have to make a redirect; what do you think is the ideal way to do it? Currently I am posting directly to the notify page, which is waiting for a success/failure response. But the problem is that I don't have any mechanism for making a POST request directly within the browser context, or for redirecting without an explicit POST (which of course displays the payload, i.e. the success/fail response). Can you suggest a better way of handling the redirect here with a form POST?

To make things clearer: our client wants to display a QR code hosted on a different service, which it reaches via an HTTP redirect after requesting one of the server APIs. That service then receives the webhook and has to redirect back to the client with the payment status/webhook response.

btilly · on July 18, 2022

Set up a Redis pub-sub: https://redis.io/docs/manual/pubsub/

The code for the websocket subscribes to Redis. The webhook will publish to Redis.

The sequence of actions is now that you send the client to the PSP, which will hit the webhook, that publishes to Redis, that the server side of the websocket is receiving, that realizes it should send the message to the client, that receives the message from the websocket, that then tells the client to redirect.
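The sequence above can be sketched with an in-memory stand-in for the broker. The `Broker` class below is not the Redis client API; it only mimics publish/subscribe semantics so the example runs without a server, and the handler names and URL shape are hypothetical.

```python
import asyncio
from collections import defaultdict


class Broker:
    """In-memory stand-in for Redis pub/sub; not the Redis client API."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, channel: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._subscribers[channel].append(queue)
        return queue

    def publish(self, channel: str, message: dict) -> int:
        """Deliver to current subscribers only; return how many received it."""
        for queue in self._subscribers[channel]:
            queue.put_nowait(message)
        return len(self._subscribers[channel])


async def websocket_handler(broker: Broker, payment_id: str, sent: list) -> None:
    """Server side of one browser websocket: wait for the result, then push it."""
    queue = broker.subscribe(f"payment:{payment_id}")
    message = await queue.get()
    # A real handler would write this to the websocket instead of a list.
    sent.append({"redirect": f"/payments/{payment_id}/{message['status']}"})


def webhook(broker: Broker, payment_id: str, status: str) -> None:
    """Stateless POST from the PSP: publish the result and return."""
    broker.publish(f"payment:{payment_id}", {"status": status})


async def main() -> list:
    broker, sent = Broker(), []
    handler = asyncio.create_task(websocket_handler(broker, "p1", sent))
    await asyncio.sleep(0)  # let the handler subscribe before the webhook fires
    webhook(broker, "p1", "success")
    await handler
    return sent


print(asyncio.run(main()))  # [{'redirect': '/payments/p1/success'}]
```

One caveat with this pattern: pub/sub messages go only to subscribers connected at that moment, so if the webhook can arrive before the websocket subscribes, the status should also be persisted and checked on connect.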

Mavvie · on July 18, 2022

I was about to talk about issues with scaling Redis pubsub but apparently they just recently published a newer version (7) with proper sharding support in cluster mode!

btilly · on July 18, 2022

You do have to be aware of Redis' limits. But for most sites, "people paying at the moment" is well within them. And Redis is the easiest solution to implement from scratch.

That said, if you're deployed in a cloud, look up cloud specific messaging services. Specifically AWS SNS, Google PubSub and Azure Web PubSub. But the pattern remains the same. The webhook publishes to the PubSub which finds the other end of the websocket which sends information to the client that does something with it.

Nextgrid · on July 18, 2022

I think my solution in another comment here would work for you: https://news.ycombinator.com/item?id=32138881

cersa8 · on July 18, 2022

I have a much different experience. I use Mollie, probably very similar to Stripe or any other modern payment service provider. I have the occasional failed SEPA transaction due to an invalid reference id but otherwise this is very smooth sailing. There is almost zero friction in the payment process. I might be a small fish with say 40 transactions a day but I guess my situation is not much different from bigger players. Maybe this is because most transactions are in the Euro zone and things go south fast when dealing with international payments.

waiwai933 · on July 19, 2022

What's the right way to make sure you've implemented a payment service integration correctly?

Do payment processors provide services (or checklists to go down) to make sure you haven't made any of the common mistakes? Or is the answer a professional pentest? In an ideal world I feel like you'd want two layers of defense in case one team misses something...

anonymousisme · on July 18, 2022

Should also include the political/policy problems related to processors preventing transactions for "unpopular" product/service providers.