It's kind of crazy when you think about it, but in contrast to every other part of the web, payments on the internet today look largely like they did fifteen years ago. There are a few walled gardens, such as Amazon or the App Store, which have shown what's possible when there's a good payment ecosystem in place. But no one has yet brought anything similar to the internet at large.
That's where Stripe comes in. By building better payments infrastructure, we want to enable more businesses and transactions. Our aim is to expand the internet economy — simply replacing the legacy payment providers would probably be a great business success, but it's not all that interesting as a goal.
All of our engineering challenges derive from this. We're roughly segmented into six engineering teams, built around the core challenges we face.
Product: On the product front, our primary challenge is redesigning online payments (and the associated tooling) from the ground up. Every other team at Stripe is, in a way, supporting the products that we present to the world.
Many of these challenges aren't unique to payments. Our API is a major part of our product, and most web APIs can be pretty confusing and hard to use. In an effort to do better, we've had to create a number of new standards along the way for how to build a good API (better ways to do webhooks, versioning, logging, and documentation). There are some more details about this in [1].
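To make the webhook point a bit more concrete, here's a minimal sketch (in Python, with invented names) of two properties a webhook consumer generally wants: checking that an event really came from the sender, and handling retried deliveries idempotently. This illustrates the general pattern rather than Stripe's actual webhook protocol.

```python
import hashlib
import hmac

# Illustrative only: the secret, signature format, and event shape here
# are made up and are not Stripe's documented webhook scheme.
WEBHOOK_SECRET = b"whsec_example"

def signature_is_valid(payload: bytes, received_sig: str) -> bool:
    """Verify an HMAC-SHA256 signature over the raw request body."""
    expected = hmac.new(WEBHOOK_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)

_seen_event_ids = set()  # in production this would be a durable store

def handle_event(event: dict) -> None:
    """Process a webhook event at most once, keyed on its id."""
    if event["id"] in _seen_event_ids:
        return  # deliveries can be retried, so dedupe by event id
    _seen_event_ids.add(event["id"])
    if event["type"] == "charge.succeeded":
        pass  # e.g. mark the order paid and send a receipt
```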
More generally, a lot of the product challenges come down to nuanced domain modeling problems. Payments are complex, and choosing abstractions that balance power and flexibility with simplicity and clarity is hard. In other engineering groups, you tend to be choosing and taking advantage of more existing software; in the product group, you need to build a deeper stack of abstractions and tooling.
More than other groups, the engineering decisions made in the product group need to balance non-engineering factors. Implementing the products might itself be tough, but even harder is choosing what to implement in the first place — the problems and prioritization are open-ended. You have to balance user experience, aesthetics, legal and financial considerations, and a general sense for what's most important. The properties you want in your datastore might be clear, but the bounds of a new product are likely to be much murkier.
Ops (financial operations [2]): As any software engineer can attest, writing code that mostly does the right thing is hard. Writing bug-free software is next to impossible. But when you're writing code that moves millions of dollars a day, as our ops team does, you somehow need to write code in a way that anticipates its own bugs and fails safely.
This is a very different constraint from traditional web development, where you can just ignore individual errors and hope the user will have better luck on the next try. On the other hand, it's not quite like writing code for the space shuttle, where a mistake could mean loss of life. We need to figure out how to move quickly while still retaining important safety properties, and while we can tolerate some bugs, we need to make sure each of those issues is discovered and handled before it can affect users.
A lot of our time in the ops group is spent building robust frameworks. When you design the right abstraction, only one person has to think about the Hard Problems, and everyone else can use it without having to think too hard. For example, Siddarth Chandrasekaran and I designed a framework that allows an implementor to model complex system actions as a series of individually simple state transitions. This allows us to handle scheduling, failure isolation, and bug mitigation (since any bug's impact is scoped to a single state transition) in one place.
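As a rough illustration of the idea (not the framework's real API, and with hypothetical names throughout), a payout might be modeled as a handful of small transitions, each of which can fail and be retried independently:

```python
# A toy sketch of the state-transition idea; the real framework also
# handles persistence, scheduling, and retries. All names are hypothetical.
class TransitionFailed(Exception):
    pass

class StateMachine:
    def __init__(self, initial):
        self.state = initial
        self.transitions = {}  # current state -> (next state, action)

    def add(self, frm, to, action):
        """Register a single, individually simple step."""
        self.transitions[frm] = (to, action)

    def step(self):
        """Run one transition; a bug's impact is scoped to this one step."""
        if self.state not in self.transitions:
            return False  # terminal state
        to, action = self.transitions[self.state]
        try:
            action()  # side effects for just this step
        except Exception as e:
            raise TransitionFailed(f"{self.state} -> {to}: {e}") from e
        self.state = to
        return True

# Example: a payout modeled as small steps that can each be retried safely.
payout = StateMachine("created")
payout.add("created", "submitted", lambda: print("submit file to the bank"))
payout.add("submitted", "paid", lambda: print("confirm settlement"))
while payout.step():
    pass
```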
Sys (systems): One of the consequences of processing payments is that the load on our systems will always be much lower than at other companies of equivalent scale; put another way, the dollar value per bit flowing through our systems is incredibly high. As a result, our primary problems are availability and consistency, and we get to push off the scaling challenges most other companies face for a lot longer. This has a very positive effect, allowing us to spend far more of our time writing business logic than making low-level optimizations.
The counterpoint is that we care about availability in a way that other companies don't. As you can see from [3], we generally hover between four and five nines of uptime. We've had to build our own highly available load-balancing layer on EC2, since EC2's own load balancer doesn't have the availability properties we want. We've also had to build our own event-processing system, affectionately dubbed Monster [4], in order to get a hard guarantee that we never lose events and that failovers always happen without human intervention. We never accept downtime for maintenance, which has meant building our own zero-downtime migration infrastructure.
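The simplest way to picture the no-lost-events property is an at-least-once consumer: an event is only removed from the queue after it has been processed, so a crash triggers a redelivery rather than a silent drop. The sketch below is a generic illustration with hypothetical queue methods, not Monster's actual interface.

```python
import json

def consume_forever(queue, handle):
    """Process events at-least-once: acknowledge only after success.

    `queue.reserve/ack/release` are hypothetical methods; handlers must be
    idempotent, since a crash before the ack causes a redelivery.
    """
    while True:
        delivery = queue.reserve()       # lease an event; not yet removed
        try:
            handle(json.loads(delivery.body))
            queue.ack(delivery)          # remove only after success
        except Exception:
            queue.release(delivery)      # put it back for redelivery
```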
That being said, we've been growing like crazy, and we're now in the territory where performance starts to impact availability. Evan Broder recently ported Monster to Storm, where we're currently pushing about 50 million events per day (about 10x the maximum my original implementation could handle). Jim Danz has been hunting down and removing bottlenecks. Nelson Elhage designed and implemented our sharding framework, allowing us to scale our databases horizontally.
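In broad strokes (the real framework is more involved, and every name below is hypothetical), key-based sharding looks something like hashing each key into a fixed number of buckets and routing buckets to physical shards through a remappable table, so data can be rebalanced without rehashing everything:

```python
import hashlib

NUM_BUCKETS = 1024
# Buckets map to physical shards through a table that can be updated as
# data is rebalanced, so adding a shard doesn't reshuffle every key.
BUCKET_TO_SHARD = {b: f"shard-{b % 4}" for b in range(NUM_BUCKETS)}

def shard_for(key: str) -> str:
    """Route a record key to the shard that owns it."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return BUCKET_TO_SHARD[bucket]

print(shard_for("ch_1234"))  # e.g. "shard-2"
```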
We also scope security under sys — as you can imagine, security is pretty core to everything we do. A huge amount of our security infrastructure, such as Apiori, our credit card vault, has been spearheaded by Andy Brody.
Risk: In typical security work, you spend most of your time defending against a theoretical adversary; in practice, the pool of potential targets is so large that even at scale any given system sees relatively little in the way of sophisticated, targeted attacks. In contrast, we see targeted attacks by fraudsters against Stripe and our users every single day. Many of these attackers are quite clever and strongly motivated (successfully pulling off a scheme directly translates to money in their bank account). Consequently, we're continually building and adapting our systems to keep fraudsters away without degrading the experience for good users.
A few people, including Michael Manapat, have built out our machine learning infrastructure. Steve Woodrow vastly improved our instant onboarding systems. However, some corner cases will always require human judgment: we have a team of risk analysts, and Anurag Goel has spent a lot of time building interfaces that let them easily monitor accounts and transaction patterns.
Tools: In order to build everything we need to build, we need to be able to move fast (but not break things). Great tooling is the only way to accomplish this. We work hard to maximize developer productivity and minimize the time between code being written and pushed to production.
Our developer workflow (a lot of people have worked on this, but Andreas Fuchs is the current maintainer) is as follows: each engineer is given an EC2 dev machine, which has its own refreshable development database. Everyone writes code locally using their favorite editor, and that code is transparently synced to their dev machine, where the service they're working on is automatically reloaded. After a push, our test suite completes in a few minutes. After that, a developer can deploy code to production instantly using Carl Jackson's deploy server, Henson.
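The sync step is conceptually just "watch for local changes and push them to the dev box," where a supervisor reloads the service. The sketch below shows the shape of that loop with made-up hosts and paths; the real tooling is event-driven rather than polling and handles the remote restart as well.

```python
import os
import subprocess
import time

SRC = "./myservice/"                    # local checkout (hypothetical path)
DEST = "devbox.example.com:myservice/"  # per-engineer EC2 dev machine

def latest_mtime(root):
    """Most recent modification time of any file under `root`."""
    mtimes = [
        os.path.getmtime(os.path.join(dirpath, name))
        for dirpath, _, files in os.walk(root)
        for name in files
    ]
    return max(mtimes, default=0.0)

last_synced = 0.0
while True:
    current = latest_mtime(SRC)
    if current > last_synced:
        subprocess.run(["rsync", "-az", "--delete", SRC, DEST], check=True)
        last_synced = current  # remote supervisor then reloads the service
    time.sleep(1)
```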
Data: We have some of the most interesting online commerce data one could ask for. In many ways, we can see the evolving shape of the internet in our systems — the fastest-growing and most innovative companies are using Stripe, and we can directly see how well our quest to make the web a better place is going.
Ingesting and digesting our data is a pretty difficult challenge, and as a result, a lot of our work on data thus far has been building out our data infrastructure. Colin Marc built Zerowing, a system for tailing our production data into HDFS, where it can be queried via Impala or processed with Avi Bryant's Scalding. Steven H. Noble maintains Tiller, a tool he co-authored at Shopify which makes it easy to build dynamic dashboards. We're just now starting to think about building out an analyst team, which will help us better understand our mountains of data.
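The general shape of that kind of pipeline is simple even if the real systems aren't: read new records from some change stream, batch them locally, and land each batch in HDFS. The sketch below uses a hypothetical change-stream object and the standard `hdfs dfs` CLI; it's an illustration of the shape, not how Zerowing is implemented.

```python
import json
import subprocess
import tempfile
import time

def ship_batches(change_stream, hdfs_dir="/data/raw"):
    """Continuously land batches of new records in HDFS as JSON lines."""
    while True:
        records = change_stream.poll(max_records=10_000)  # hypothetical API
        if not records:
            time.sleep(5)
            continue
        # Write the batch to a local temp file, one JSON document per line.
        with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
            local_path = f.name
        dest = f"{hdfs_dir}/batch-{int(time.time())}.json"
        subprocess.run(["hdfs", "dfs", "-put", local_path, dest], check=True)
```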
The short answer is that we have far more challenges than our team of 25 engineers could hope to solve alone. Thus we're rapidly expanding the team worldwide — see [5] for details.