Saved transcript

Design Stripe: System Design Interview (Stripe & Amazon Offers)

Channel: TechPrep

In this video, we are going to design a payment gateway service, so think of something like Stripe. A payment gateway acts as the intermediary between a merchant's application and the broader financial network: issuing banks, acquiring banks, and card networks like Visa and Mastercard. Unlike a digital wallet, which primarily manages internal ledger transfers, a payment gateway's primary job is to securely capture payment information, assess risk, route the transaction to the appropriate financial institution, and finally handle the subsequent capture, settlement, and potential refunds. For those of you struggling with system design, I highly recommend you check out techprep.app, where you can view the full write-up and an interactive AI whiteboard that walks you through the design exactly how an interviewer would; it helped me go from rejections to offers from the likes of Stripe and Amazon. So without further ado, let's jump into the requirements.

For the functional requirements: first, we'll need to process payments, so clients should be able to charge a customer's credit card or payment method. We'll need to process refunds, so clients should be able to issue full or partial refunds. We'll need webhooks and notifications: the system must asynchronously notify merchants of payment state changes (authorized, captured, or failed). And we'll also need to be able to view payment statuses, so merchants should be able to query the status of a specific transaction.

Then for the non-functional requirements: strict consistency and accuracy, because we can't have any money lost or double charges, so consistency is prioritized over availability in our CAP theorem context. We'll need high availability and reliability, targeting 99.999% uptime, since payment downtime directly equals lost revenue, and that is a big no.

Also, for security and compliance, we must adhere to PCI DSS standards. PCI DSS is a security standard that organizations must comply with when handling credit card data, to ensure secure processing, storage, and transmission of cardholder information. We'll also need idempotency: network failures will happen, so retries must not result in duplicate charges. And finally, for performance, we'll need relatively low latency for the initial synchronous authorization step; try to get under 2 to 3 seconds to ensure a good user checkout experience.

So now that we've got our requirements set, let's look at a data model that will satisfy them. For a payment gateway, a relational database, so something like Postgres or MySQL, is the standard choice because ACID compliance is non-negotiable.
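As a rough sketch of the tables described next (the column names here are illustrative assumptions, not a definitive schema), the core ledger could look like this, shown via sqlite3 so it's runnable; the card vault would live in a physically separate database, not alongside these tables:

```python
import sqlite3

# Illustrative schema sketch only; columns are assumptions, not a fixed design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE merchant (
    merchant_id  TEXT PRIMARY KEY,
    name         TEXT NOT NULL,
    api_key_hash TEXT NOT NULL            -- authentication details, never plaintext
);

CREATE TABLE payment_order (
    payment_id      TEXT PRIMARY KEY,
    merchant_id     TEXT NOT NULL REFERENCES merchant(merchant_id),
    idempotency_key TEXT NOT NULL UNIQUE, -- dedupes client retries
    amount_minor    INTEGER NOT NULL,     -- store cents, never floats
    currency        TEXT NOT NULL,
    status          TEXT NOT NULL         -- created/authorized/captured/failed/settled
);

CREATE TABLE payment_event (
    event_id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- append-only audit log
    payment_id  TEXT NOT NULL REFERENCES payment_order(payment_id),
    raw_payload TEXT NOT NULL                       -- raw response from the bank
);

CREATE TABLE outbox_event (
    outbox_id TEXT PRIMARY KEY,                     -- staging for the outbox pattern
    topic     TEXT NOT NULL,
    payload   TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Note that `amount_minor` stores integer minor units (cents); storing money as floats is a classic correctness bug in payment systems.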

So firstly, we'll have our main ledger database. Here we'll have a merchant table, which stores the profile and authentication details for the businesses using the payment gateway to process transactions. Next, we'll have the payment order table. This serves as the central state machine and financial ledger for tracking the lifecycle and monetary value of an individual checkout attempt, and as you can see, it has its idempotency key and a status column, so we know what state the payment is in. We'll also have a payment event table, which acts as an immutable, append-only audit log that permanently records every raw interaction and response payload from external acquiring banks; this can be super useful, especially for things like auditing.

We'll also have our outbox event table. This functions as a reliable temporary staging area for the transactional outbox pattern, to guarantee that asynchronous events like merchant webhooks are safely published to Kafka without data loss. The way the transactional outbox pattern works: it's a technique that ensures reliable message publishing by storing outbound messages in the same database transaction as the business data, and then having a separate process reliably deliver those messages to external systems. Don't worry about it for now; we'll dive into that later.

And then finally, we'll have a separate database: our vault database. This could have a card vault table, a physically isolated store for securely holding encrypted credit card numbers, i.e. the PAN (primary account number), the 13-to-19-digit number on a payment card that identifies the cardholder's account with the issuing bank. This isolation minimizes the system's overall PCI compliance scope. We'll again dive deeper into that in a little bit. So now

that we've looked at the data model, it's time to look at the API design. For this, I think we can easily implement a nice, simple RESTful API, as this also supports idempotency. Our first endpoint will be the create payment endpoint: a POST to /v1/payments. In the headers, we'll include our authorization, which will be a bearer token, and we'll also have that idempotency key. The idempotency key ensures that if the same payment request is accidentally sent multiple times, which could happen due to network issues or a user double-clicking, it will only be processed once and return the same result. Then in our body, we'll include the amount, the currency, and the payment method token (again, the tokenized card, not the actual credit card details), as well as maybe a description of the product. Next, we'll have our get payment status endpoint, which is very simply a GET to /v1/payments with the payment ID included, so we can find the status of that payment. And then finally, we can have an issue-a-refund endpoint: a POST to /v1/payments, including the payment ID, at the refunds endpoint. Again, include an idempotency key here as well, so we avoid double processing, and in the body we can include the amount, so partial refunds are also possible because it doesn't necessarily have to be the full amount. Obviously, in a production system there would be a lot more endpoints, but for our system design, these three cover our core

requirements outlined earlier.

Looking at the high-level design now, this system can be broken down into two core paths. The first is the authorization path, which is synchronous; it focuses on securely capturing the payment intent and getting real-time approval from the financial network. Here we'll have a client, which sends a request to the API gateway. The gateway acts as the entry point for merchant clients, handling SSL termination, rate limiting, and routing checkout requests to internal services. The request then gets sent to the payment service, which orchestrates the payment flow by creating a transaction record in our main ledger database and managing the communication with downstream systems. This main ledger database, again, will be SQL, implemented with something like Postgres, and is our relational source of truth storing merchant data, transaction amounts, and the current state (e.g. pending or authorized) of every payment. The payment service then evaluates the transaction by going to our fraud and risk service, checking metadata including the IP, device fingerprint, and velocity synchronously to block suspicious payments before they reach the bank. Once that passes, the payment service reaches out to our external gateway adapter, which translates our internal JSON payloads into a specific protocol: something like ISO 8583, an international standard that defines the message format and communication protocol for financial transaction card systems, or even legacy XML, as required by the various external networks. In this case, that would be the acquiring bank or the card network, so think of something like Chase, Visa, or Mastercard. This will validate that the cardholder actually has the funds to make the purchase they intend to make, and then return an approved or declined response. Once we've done that, we can update our main ledger database with this new status. So that's our first flow, which is all synchronous.
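The synchronous path above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function and field names are my own assumptions, and the fraud check and bank call are stand-in stubs for the real services:

```python
from dataclasses import dataclass

@dataclass
class PaymentRequest:
    idempotency_key: str
    token: str          # tokenized card reference, never the raw PAN
    amount_minor: int   # amount in minor units (cents)
    currency: str

def risk_check(req: PaymentRequest) -> bool:
    # Stub for the fraud/risk service (IP, device fingerprint, velocity checks).
    return req.amount_minor < 1_000_000

def acquiring_bank_authorize(req: PaymentRequest) -> str:
    # Stub for the external gateway adapter + acquiring bank round trip.
    return "approved"

def authorize(req: PaymentRequest, ledger: dict) -> str:
    # 1. Record the attempt as pending in the main ledger (source of truth).
    ledger[req.idempotency_key] = "pending"
    # 2. Synchronously screen for fraud before touching the bank.
    if not risk_check(req):
        ledger[req.idempotency_key] = "failed"
        return "failed"
    # 3. Route to the acquiring bank via the adapter and record the outcome.
    result = acquiring_bank_authorize(req)
    ledger[req.idempotency_key] = "authorized" if result == "approved" else "failed"
    return ledger[req.idempotency_key]

ledger = {}
status = authorize(PaymentRequest("idem-123", "tok_abc", 2500, "USD"), ledger)
print(status)  # prints "authorized"
```

The key property to call out in an interview is that every state transition lands in the ledger before and after the external call, so a crash mid-flight leaves a record to reconcile against.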

The second will be our post-processing path, which is all asynchronous. This path focuses on updating the merchant and eventually finalizing the movement of funds. The payment service will publish a message to Kafka, which acts as the central event bus, decoupling the fast synchronous authorization path from slower downstream tasks. We'll then have a webhook listener, which consumes the payment event, either a success or a failure, from Kafka and fires HTTP callbacks to the merchant's backend system. If you think of something like Substack: maybe a user is reading an article and they aren't subscribed, so they can't finish it. They subscribe, and that webhook callback can then update the subscription status of that user in the database, so the system now knows this user has full access and the user can continue reading the article. So while this implementation technically facilitates a payment, it contains significant architectural anti-patterns and is missing

features. For example, we're storing raw credit card numbers, which is a big no-no; our system is still vulnerable to network retries; and we're also exposed to dual-write data loss. All of this prevents us from fulfilling the requirements outlined at the start, but don't worry, we're going to go into the deep dives now to solve each one individually.

In our first deep dive, we're going to handle idempotency. To meet the requirement of no double charges, the system needs to handle the reality of dropped network connections and aggressive client retries. If a client doesn't receive the HTTP response, they will retry; however, we cannot have a second charge. The way we're going to handle this is to get our client to generate a unique UUID, sent as an idempotency key header on the request for every unique checkout attempt. The payment service will then first check a Redis cluster before hitting the database. Using this UUID as the key, it checks: if the key exists and the transaction is in progress, the system simply returns a 409 Conflict; however, if it is completed, it immediately returns the cached HTTP response from the first successful run. So, by implementing this client-generated UUID and Redis cluster, we prevent double charging a customer in the event of a network failure or a double click. Next is looking at our PCI

DSS compliance. Again, that's a security standard that organizations must comply with when handling credit card data. Our initial design implies that the raw PAN, the primary account number, flows through the API gateway and the payment service, and this would trigger massive PCI compliance audits for the entire infrastructure. So what we're going to do is isolate this risk by building a highly restricted, isolated microservice to handle raw card data. Firstly, the client application will send the raw PAN directly to the card vault service, completely bypassing our main API gateway. The vault will then encrypt the PAN using envelope encryption. Envelope encryption is a security technique where data is encrypted with a data encryption key, and then that data encryption key is itself encrypted with a separate key encryption key, creating two layers of encryption protection. That detail isn't really necessary; all you need to know is that it is encrypted. It will then be stored in an isolated vault database, and the service returns a non-sensitive tokenized string. When the client later submits this token to the main payment service, our core infrastructure only ever sees the token. Then, just before the request leaves our network, the external gateway adapter securely queries the vault service, which swaps the token back for the real PAN to send on to the acquiring banks. It's this isolated microservice and vault database that let us be PCI DSS compliant without affecting our entire

infrastructure.

Next, we have distributed transactions. If the payment service updates the database to captured but crashes before sending the event to Kafka, the merchant's webhook never fires and the customer never gets their product. So how do we handle this? Well, first we'll have an atomic local commit: the payment service opens a single local database transaction, and it updates the payment status to captured and inserts the event payload into an outbox events table simultaneously.
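That atomic local commit can be sketched like this, with sqlite3 standing in for Postgres; the table and column names are illustrative assumptions:

```python
import sqlite3

# Sketch of the atomic local commit: the status update and the outbox insert
# either both commit or both roll back, so no event can be lost between them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment_order (payment_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox_event (id INTEGER PRIMARY KEY, "
             "payload TEXT, published INTEGER DEFAULT 0)")
conn.execute("INSERT INTO payment_order VALUES ('pay_1', 'authorized')")
conn.commit()

with conn:  # one local transaction: commits on success, rolls back on error
    conn.execute("UPDATE payment_order SET status = 'captured' "
                 "WHERE payment_id = 'pay_1'")
    conn.execute("INSERT INTO outbox_event (payload) VALUES "
                 "('{\"payment_id\": \"pay_1\", \"event\": \"captured\"}')")

# A separate relay (e.g. a CDC worker) later reads unpublished rows, publishes
# them to Kafka, and marks them published.
status = conn.execute(
    "SELECT status FROM payment_order WHERE payment_id = 'pay_1'").fetchone()[0]
pending = conn.execute(
    "SELECT COUNT(*) FROM outbox_event WHERE published = 0").fetchone()[0]
print(status, pending)  # prints "captured 1"
```

The point is that the business write and the outbox write share one transaction, so there is no window in which the status is updated but the event is lost.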

There will then be an asynchronous relay, i.e. change data capture: a CDC worker like Debezium monitors the outbox events table and safely publishes the message to Kafka, which guarantees at-least-once delivery to the webhook listener. That's the key part: one transaction, plus polling on that outbox events table, gets us at-least-once delivery. The merchant's webhook will always get triggered with that event, ensuring no data is ever lost.

And then finally, we'll have automated reconciliation. The ultimate source of truth is the actual money moving between banks, not our internal database. So what we'll have is bank settlement files: let's say at the end of every day, the acquiring bank deposits an SFTP or S3 batch file detailing all actual funds moved. Then we'll have reconciliation workers, i.e. cron workers: they parse these files and match every single external record against the main ledger database using the unique transaction IDs. To handle drift resolution, they'll update successful matches to settled, and any mismatches, for example the bank charged $50 but our database says it failed, are pushed to an operations queue for manual auditing or automated refunding. And that's how we ensure our system stays in sync with the source of truth, which is what actually happened in terms of the money movement within the bank. So I've thrown a lot at

you. So what I'm going to do now is walk through the complete architecture, so you understand all the components and how they fit together. That way, when it comes to your interview, you'll have one logical flow in your brain that'll be super easy to recall, because you'll understand how everything works together.

The client safely captures the raw credit card PAN and sends it directly to the isolated card vault service. The vault encrypts it, stores it, and returns a secure token back to the client. The client submits the checkout request containing the token, the amount, and a unique idempotency key to the API gateway, which routes it to the payment service. The payment service first checks the Redis cluster to ensure the exact request isn't already being processed, then synchronously consults the fraud and risk service to ensure the transaction is safe to proceed. The payment service inserts a new transaction record into the main ledger database with a status of created, and can then forward the payload to the external gateway adapter. The adapter queries the vault to securely swap the token back to the original raw PAN, then formats the payload and synchronously calls the acquiring bank. The bank returns an approved response, and in a single atomic local database transaction, the payment service updates the transaction state from created to authorized (and immediately to captured, if fulfilling digital goods) and writes a webhook payload to the outbox event table. A background CDC worker then reads the outbox event table and pushes the message to Kafka. The webhook listener consumes this and fires an HTTP callback to inform the merchant's backend of the success. And finally, one or two days later, the reconciliation workers download the definitive batch settlement file from the bank, match the transaction IDs, and update the local database state to settled, confirming the actual cash has landed in the merchant's account. So as well as

this complete architecture, there are some additional discussion points you could raise if you wanted to.

The first is handling retry storms, i.e. the thundering herd. If the payment gateway experiences, say, a 10-second latency spike, thousands of clients might auto-retry their requests simultaneously. The solution is to discuss the importance of exponential backoff with jitter on the client side, plus aggressive rate limiting at the API gateway to drop excessive retries; the idempotency key in Redis will handle the rest.

Next is scaling the main database. As the platform grows to millions of transactions a day, a single Postgres instance will become a bottleneck for reads and writes. The solution is sharding, and for a B2B payment gateway like Stripe, the best shard key is usually the merchant ID. This ensures all transactions, webhooks, and reconciliations for a specific merchant live on the same database node, making aggregation and pagination fast.

Then we have clock skew in reconciliation. When reconciling your database against the bank settlement file, time zones and clock skews are nightmares: a transaction might happen at 23:59:59 on your service but register as 00:00:01 the next day on the bank's service. The solution is to reconcile based on strict, universally unique transaction IDs passed down to the bank during the initial authorization, never purely by timestamp or amount.

So again, check out the full write-up and AI whiteboard at techprep.app, and hopefully you got some value out of this. If you did, please like and subscribe and share with a friend. It helps the channel out a lot, and I will see you in the next one.