Design Stripe: System Design Interview (Stripe & Amazon Offers)
Channel: TechPrep

In this video, we are going to design a payment gateway service. So, think of
something like Stripe. A payment gateway
acts as the intermediary between a merchant's application and the broader
financial network. So, think issuing
banks, acquiring banks, and card
networks like Visa and Mastercard. And
unlike a digital wallet, which primarily
manages internal ledger transfers, a
payment gateway's primary job is to
securely capture payment information,
assess risk, and route the transaction
to the appropriate financial
institution. And finally, to handle the
subsequent capture, settlement, and
potential refunds. So, for those of you struggling with system design, I highly recommend you check out techprep.app, where you can view the full write-up and an interactive AI whiteboard that walks you through the design exactly how an interviewer would. It helped me go from rejections to offers from the likes of Stripe and Amazon. So without further ado, let's jump into the requirements.
So for the functional requirements,
firstly we'll need to process payments.
So clients should be able to charge a
customer's credit card or payment
method. We will need to be able to process refunds. So clients should be able to issue full or partial refunds. We will need webhooks and notifications. So
the system must asynchronously notify
merchants of payment state changes. So
authorized, captured or failed. And we
also will need to be able to view
payment statuses. So merchants should be
able to query the status of a specific
transaction. Then for non-functional
requirements, strict consistency and
accuracy. So we can't have any money
lost or double charges. So consistency
is prioritized over availability in our
cap theorem context. We will need high
availability and reliability. So
targeting 99.999%
uptime and payment downtime directly
equals loss revenue. So that is a big
no.
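For intuition on how strict five nines is, the allowed downtime budget is tiny. A quick back-of-the-envelope calculation:

```python
# Downtime budget implied by a 99.999% ("five nines") availability target.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

downtime_seconds = SECONDS_PER_YEAR * (1 - 0.99999)
print(f"Allowed downtime: {downtime_seconds / 60:.1f} minutes per year")  # ~5.3 minutes
```

So the whole system gets roughly five minutes of total downtime per year, which is why every component in this design needs redundancy.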
Also for security and compliance we must
adhere to PCI DSS standards. PCI DSS is a security standard that organizations must comply with when handling credit card data to ensure secure processing, storage, and transmission of cardholder information. We'll also need
to have idempotency. So network failures will happen, and therefore retries must not result in duplicate charges. And then finally for
performance we will need relatively low
latency for the initial synchronous
authorization step. We should aim for less
than 2 to 3 seconds to ensure a good
user checkout experience. So now that
we've got our requirements set, let's
look at a data model that will satisfy
these requirements. For a payment
gateway, a relational database, so something like Postgres or MySQL, is the standard choice because ACID compliance is non-negotiable.
So firstly we'll have our main ledger
database and here we will have a
merchant table and so this will store
the profile and authentication details
for the businesses utilizing the payment
gateway to process transactions. Next we
will have the payment order. So this
serves as the central state machine and
financial ledger for tracking the life
cycle and monetary value of an
individual checkout attempt. And as you
can see here, it has its idempotency key and then the status. So we know what status the payment is in. We'll also
have a payment event. And this will act
as an immutable append-only audit log that
permanently records every raw
interaction and response payload from
external acquiring banks. And so this
can be super useful especially for
things like auditing.
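The tables described so far could be sketched roughly like this. The column names here are illustrative guesses, not a prescribed schema, shown as a runnable sqlite3 snippet standing in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Businesses using the gateway to process transactions.
CREATE TABLE merchant (
    merchant_id  TEXT PRIMARY KEY,
    name         TEXT NOT NULL,
    api_key_hash TEXT NOT NULL
);

-- Central state machine and ledger row for one checkout attempt.
CREATE TABLE payment_order (
    payment_id      TEXT PRIMARY KEY,
    merchant_id     TEXT NOT NULL REFERENCES merchant(merchant_id),
    idempotency_key TEXT NOT NULL UNIQUE,
    amount_minor    INTEGER NOT NULL,  -- minor units (cents), never floats
    currency        TEXT NOT NULL,
    status          TEXT NOT NULL      -- e.g. created/authorized/captured/failed/settled
);

-- Immutable append-only audit log of every raw bank interaction.
CREATE TABLE payment_event (
    event_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    payment_id TEXT NOT NULL REFERENCES payment_order(payment_id),
    payload    TEXT NOT NULL,          -- raw response payload from the acquirer
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")
```

Note the UNIQUE constraint on the idempotency key and that amounts are stored in integer minor units; both choices guard against the double-charge and rounding problems discussed later.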
We'll also have our outbox event table.
And so this functions as a reliable
temporary staging area for the
transactional outbox pattern to
guarantee that asynchronous events like
merchant webhooks are safely published to Kafka without data loss. And so the
way the transactional outbox pattern
works, it's a technique that ensures
reliable message publishing by storing
outbound messages in the same database
transaction as business data and then
having a separate process reliably
deliver those messages to external
systems. And don't worry about it for
now. We'll dive into that later. And
then finally, we'll have a separate
database. This will be our vault
database. This could have a card vault table. And so this will be a physically
isolated database to securely store
encrypted credit card numbers, so PANs (primary account numbers). The PAN is the 13- to 19-digit number on a payment card that identifies the cardholder's account with the issuing bank. Keeping it here minimizes the system's overall PCI compliance scope. So we will again dive deeper into that in a little bit. So now
that we've looked at the data model,
it's time to look at the API design. And
so for this, I think we can easily implement a nice simple RESTful API, as this also supports idempotency. So our
first endpoint will be the create payments endpoint. It'll be a POST to the /v1/payments endpoint. In the headers, we will include our authorization, which will include a bearer token. And we'll also have that idempotency key. And so this idempotency key ensures that if the same payment request is accidentally sent multiple times, which could happen due to network issues or a user double-clicking, it will only be processed once and return the same result. And so then in our body, it will
include the amount, the currency, the
payment method token. Again, it's the
tokenized card and not the actual credit
card details, as well as maybe a
description about the product. Next, we
will have our get payment status. And that's very simply a GET request to /v1/payments/{payment_id}, so we can find the status of that payment. And then finally, we can have an issue-a-refund endpoint. And this will be a POST to /v1/payments/{payment_id}/refunds. Again, include an idempotency key here as well, so we avoid double processing.
And in the body, we could include the
amount. So partial refunds are also
possible because it doesn't necessarily
have to be the full amount. So obviously
in a production system there would be a
lot more endpoints but for our system
design these three cover our core
requirements outlined earlier. Looking at a high-level design now, this system can
be broken down into kind of two core
paths. The first is the authorization
path and this is synchronous and this
path focuses on securely capturing the
payment intent and getting real time
approval from the financial network. So
here we'll have a client which will send
a request to the API gateway which will
act as the entry point for merchant
clients, handling SSL termination, rate limiting, and routing checkout requests
to internal services. This request will
then get sent to the payment service
which orchestrates the payment flow by
creating a transaction record in our
main ledger database and managing the
communication with downstream systems.
So this main ledger database again will be SQL, so implemented with something like Postgres, and this is our
relational source of truth that stores
merchant data, transaction amounts, and the current state (e.g. pending or authorized) of every payment. The payment
service will then evaluate the
transaction by going to our fraud and
risk service, checking metadata including the IP, device fingerprint, and velocity synchronously to block
suspicious payments before they reach
the bank. Once that is passed, the
payment service can then reach out to
our external gateway adapter and this
translates our internal JSON payloads into specific protocols. So something
like ISO 8583 which is an international
standard that defines the message format and
communication protocol for financial
transaction card systems or even legacy
XML which is then required by our
various external networks. And so in
this case that would be the acquiring
bank or the network. And so think of
something like Chase, Visa, Mastercard.
And so this will validate that the cardholder actually has the funds to make
the purchase they intend on making and
then return an approved or declined
response. Once we've done that, we can
then update our main ledger database
with this new status. So that's our first flow, which is all synchronous.
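To tie the synchronous path together, here is a sketch of the orchestration in plain Python. Every helper is a stand-in for a real service call, and all names are illustrative, not an actual API:

```python
# Sketch of the synchronous authorization path. Each helper below stands in
# for a real network call (fraud service, gateway adapter, acquiring bank).

def check_fraud(metadata: dict) -> bool:
    """Stand-in for the fraud & risk service (IP, device, velocity checks)."""
    return metadata.get("velocity", 0) < 10

def call_acquiring_bank(token: str, amount_minor: int) -> str:
    """Stand-in for the external gateway adapter plus acquiring bank."""
    return "approved"

def authorize_payment(db: dict, payment_id: str, token: str,
                      amount_minor: int, metadata: dict) -> str:
    # 1. Record the attempt as 'created' in the main ledger database.
    db[payment_id] = "created"
    # 2. Synchronously block suspicious payments before they reach the bank.
    if not check_fraud(metadata):
        db[payment_id] = "failed"
        return "declined_fraud"
    # 3. Translate and forward the request to the acquiring bank / network.
    result = call_acquiring_bank(token, amount_minor)
    # 4. Persist the bank's decision as the new payment status.
    db[payment_id] = "authorized" if result == "approved" else "failed"
    return result

ledger = {}  # dict standing in for the main ledger database
print(authorize_payment(ledger, "pay_1", "tok_abc", 2500, {"velocity": 1}))  # approved
```

The key property is that the ledger status is written before and after the bank call, so every attempt leaves a traceable state even if a later step fails.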
The second will be our post-processing
path which is all asynchronous. And this
path focuses on updating the merchant
and eventually finalizing the movement
of funds. So the payment service will
publish a message to Kafka. And so Kafka
acts as that central event bus
decoupling the fast synchronous
authorization path from slower
downstream tasks. So we will then have a
webhook listener, and this will be the merchant's webhook listener, which will consume the payment success or failure event from Kafka and then fire HTTP callbacks to the merchant's backend
system. So if you think of something
like Substack, maybe a user is reading
an article and they aren't subscribed,
so they can't finish it. So then they subscribe, and that webhook could then listen and update the subscription
status of that user in the database so
that the system now knows this user has
full access and the user can continue
reading the article. So while this
implementation technically facilitates a
payment, it contains significant
architectural anti-patterns and is missing
features. For example, we're storing raw
credit card numbers, which is a big no
no. Our system is still vulnerable to
network retries, and we are also exposed to dual-write data loss. This is preventing us from fulfilling the requirements outlined at the start, but don't worry, we are going to go into the
deep dives now to solve each one of
those individually. So in our first deep
dive we are going to handle idempotency. And so, to meet the requirement
of no double charges the system needs to
handle the reality of dropped network
connections and aggressive client
retries. So if a client doesn't receive
the HTTP response they will retry.
However, we cannot have a second charge.
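The fix described next is an idempotency key checked before any work happens. Here is a minimal sketch, with a plain dict standing in for the Redis cluster (real code would use an atomic SET with NX and a TTL); the key names and responses are illustrative:

```python
# Idempotency dedup sketched with a dict standing in for Redis.
cache = {}  # idempotency_key -> ("in_progress", None) or ("done", response)

def handle_payment(idempotency_key: str, charge) -> tuple[int, str]:
    entry = cache.get(idempotency_key)
    if entry is not None:
        state, response = entry
        if state == "in_progress":
            # Same request is still running: reject the duplicate.
            return (409, "conflict: request already in progress")
        # Already completed: replay the cached response, charge nothing.
        return (200, response)
    cache[idempotency_key] = ("in_progress", None)
    response = charge()  # the one real charge against the bank
    cache[idempotency_key] = ("done", response)
    return (201, response)

charges = []
def charge_card():
    charges.append("charged")
    return "charged $25.00"

first = handle_payment("key-123", charge_card)
retry = handle_payment("key-123", charge_card)
print(first)  # (201, 'charged $25.00')
print(retry)  # (200, 'charged $25.00'), replayed; the card was charged only once
```

Note the client retry gets the exact same response body as the original call, which is what lets clients retry blindly without fear.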
And so, the way we're going to handle this is we'll get our client to generate a unique UUID. And so, this will be an idempotency key header included in the request for every unique checkout attempt. And so, the payment service will first check a Redis cluster before hitting the database. Using this UUID as the key, it will check, and if this key exists and the transaction is in progress, the system will simply return a 409 Conflict. However, if it is completed, it will immediately return the cached HTTP response from the first successful run. So, by implementing this client-generated UUID and Redis cluster, we prevent double-charging a customer in the event of a network failure or a double click. Next is looking at our PCI DSS compliance. Again, that's a security standard that organizations must comply with when handling credit card data. And
so our initial design implies that the
raw PAN, so the primary account number, flows through the API gateway and
the payment service. And so this would
trigger massive PCI compliance audits
for the entire infrastructure. And so
what we're going to do is we're going to
isolate this risk by building a highly
restricted, isolated microservice to handle
raw card data. So firstly, the client
application will send the raw PAN
directly to the card vault service
completely bypassing our main API
gateway. The vault will then encrypt the
PAN using envelope encryption. And
envelope encryption is a security
technique where data is encrypted with
both a data encryption key and then that
data encryption key itself is encrypted
with a separate key encryption key
creating two layers of encryption
protection. And so that detail isn't
really necessary, but all you need to
know is that it is encrypted. And then
that will be stored in an isolated vault database, and it will return a non-sensitive string, so a tokenized string. So then, when the client
later submits this token to the main
payment service, our core infrastructure
only ever sees the tokens. And so just
before the request leaves our network,
the external gateway adapter will
securely query the vault service which
will then swap the token back for the
real PAN to then send on to acquiring
banks. And so it's this isolated microservice and vault database that enables us to be PCI DSS compliant
without affecting our entire
infrastructure. Next we have distributed
transactions. So if the payment service
updates the database to captured but
crashes before sending the event to Kafka, the merchant's webhook never
fires and the customer never gets their
product. So how do we handle this? Well,
we will firstly have an atomic local commit. And so the payment service opens a
single local database transaction and it
updates the payment status to captured
and inserts the event payload into an
outbox events table simultaneously.
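The atomic local commit just described can be sketched with sqlite3 standing in for Postgres; the table and column names here are illustrative:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment_order (payment_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox_events (id INTEGER PRIMARY KEY AUTOINCREMENT,
                            payload TEXT NOT NULL, published INTEGER DEFAULT 0);
INSERT INTO payment_order VALUES ('pay_1', 'authorized');
""")

# The status update and the outbox insert commit (or roll back) together,
# so a crash can never leave a captured payment without its webhook event.
with conn:
    conn.execute("UPDATE payment_order SET status = 'captured' "
                 "WHERE payment_id = 'pay_1'")
    conn.execute("INSERT INTO outbox_events (payload) VALUES (?)",
                 (json.dumps({"payment_id": "pay_1",
                              "event": "payment.captured"}),))

# A separate relay (e.g. a CDC worker) later reads unpublished rows,
# pushes them to Kafka, and marks them published.
rows = conn.execute("SELECT payload FROM outbox_events "
                    "WHERE published = 0").fetchall()
print(rows[0][0])
```

The point of the pattern is that there is exactly one transaction; the event row is just ordinary data until the relay picks it up.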
There will then be an asynchronous relay, so change data capture. A CDC worker like Debezium monitors the outbox events table and safely publishes the message to Kafka. And this guarantees at-least-once delivery to the webhook listener. And so that's the key part
there is that we are having one
transaction and then polling on that outbox events table so that we get at-least-once delivery. So the merchant's webhook will always get triggered with
that event ensuring that no data is ever
lost. And then finally we will have
automated reconciliation. And so the
ultimate source of truth is the actual
money moving between banks. It's not our
internal database. And so what we will
have is we will get bank settlement
files. So let's say at the end of every
day the acquiring bank deposits an SFTP or S3 batch file detailing all actual funds moved. And then we will have reconciliation workers, so cron workers that will parse
these files and match every single
external record against the main ledger
database using the unique transaction
IDs. And to handle drift resolution, it'll update successful matches to settled, and any mismatches (for example, the bank charged $50 but our database says it failed) are then pushed to an
operations queue for manual auditing or
automated refunding. And so that's how
we ensure our system is in sync with the
source of truth which is what actually
happened in terms of the money movement
within the bank. So I've thrown a lot at
you. So what I'm going to do now is walk
through the complete architecture so you
understand all the components and how
they all fit together so that when it
comes to your interview, you will have
this one logical flow in your brain
that'll be super easy to recall because
you'll understand how they all work
together. So the client again will
safely capture the raw credit card PAN and send it directly to the isolated card vault service. The vault will then encrypt it, store it, and then return a
secure token back to the client. The
client submits the checkout request
containing the token, the amount, and a
unique idempotency key to the API gateway, which routes it to the payment service. The payment service first checks the Redis cluster to ensure the
exact request isn't already being
processed and it then synchronously
consults the fraud risk service to
ensure the transaction is safe to
proceed. The payment service inserts a
new transaction record into the main
ledger database with the status of
created. The payment service can then
forward the payload to the external
gateway adapter. The adapter queries the
vault to securely swap the token back to
the original raw PAN and then formats
the payload and synchronously calls the
acquiring bank. The bank then returns an
approved response, and in a single atomic local database transaction, the payment service updates the transaction state from created to authorized (and to captured, if fulfilling digital goods immediately) and writes a webhook payload to the outbox event table. So a
background CDC worker instantly reads
the outbox event table and pushes the
message to Kafka. The webhook listener consumes this and fires an HTTP callback to inform the merchant's backend of the
success. And finally, one or two days
later, the reconciliation workers download the definitive batch settlement file from the bank, match the transaction IDs, and update the local database state to settled, confirming the actual cash has landed in the merchant's account. So as well as
this complete architecture there are
some additional discussion points you could raise if you wanted to. The first one is handling retry storms, so the thundering herd problem. So if the payment gateway experiences a
10-second latency spike, thousands of
clients might auto retry their request
simultaneously. And so the solution to
this is to discuss the importance of
exponential backoff with jitter on the
client side and aggressive rate limiting
at the API gateway to drop excessive
retries. And so the idempotency key in Redis will handle the rest. Next is
scaling the main database. And so as the
platform grows to millions of
transactions a day, a single Postgres instance will become a bottleneck for reads and writes. And so the solution is
sharding. And so for a B2B payment
gateway like Stripe, the best shard key
is usually the merchant ID. And so
this will ensure all transactions, webhooks, and reconciliations for a specific merchant live on the same database node, making aggregation and pagination fast. Then we'll have clock skew in
reconciliation. When reconciling your database against the bank settlement file, time zones and clock skews are nightmares. A transaction might happen at 23:59:59 on your service but register as 00:00:01 the next day on the bank's service. So the
solution to this is to reconcile based
on strict universally unique transaction
IDs passed down to the bank during the
initial authorization, never purely by
timestamp or amount. So again, check out
the full write-up and AI whiteboard at techprep.app, and hopefully you got some
value out of this. If you did, please
like and subscribe and share with a
friend. It helps the channel out a lot
and I will see you in the next one.