Design Stripe: System Design Interview (Stripe & Amazon Offers)
Channel: TechPrep

In this video, we are going to design a payment gateway service. So, think of
something like Stripe. A payment gateway
acts as the intermediary between a merchant's application and the broader
financial network. So, think issuing
banks, acquiring banks, and card
networks like Visa and Mastercard. And
unlike a digital wallet, which primarily
manages internal ledger transfers, a
payment gateway's primary job is to
securely capture payment information,
assess risk, and route the transaction
to the appropriate financial
institution. And finally, to handle the
subsequent capture, settlement, and
potential refunds. So, for those of you struggling with system design, I highly recommend you check out techprep.app, where you can view the full write-up and an interactive AI whiteboard that walks you through the design exactly how an interviewer would. It helped me go from rejections to offers from the likes of Stripe and Amazon. So without further ado, let's jump into the requirements.
So for the functional requirements,
firstly we'll need to process payments.
So clients should be able to charge a
customer's credit card or payment
method. We will need to be able to process refunds. So clients should be able to issue full or partial refunds. We will need webhooks and notifications. So
the system must asynchronously notify
merchants of payment state changes. So
authorized, captured or failed. And we
also will need to be able to view
payment statuses. So merchants should be
able to query the status of a specific
transaction. Then for non-functional
requirements, strict consistency and
accuracy. So we can't have any money
lost or double charges. So consistency
is prioritized over availability in our
cap theorem context. We will need high
availability and reliability. So
targeting 99.999%
uptime and payment downtime directly
equals loss revenue. So that is a big
no.
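For intuition on how strict five nines is, the allowed downtime budget is tiny. A quick back-of-the-envelope calculation:

```python
# Downtime budget implied by a 99.999% ("five nines") availability target.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

downtime_seconds = SECONDS_PER_YEAR * (1 - 0.99999)
print(f"Allowed downtime: {downtime_seconds / 60:.1f} minutes per year")  # ~5.3 minutes
```

So the whole system gets roughly five minutes of total downtime per year, which is why every component in this design needs redundancy.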
Also for security and compliance we must
adhere to PCI DSS standards. PCI DSS is a security standard that organizations must comply with when handling credit card data to ensure secure processing, storage, and transmission of cardholder information. We'll also need
to have idempotency. So network failures will happen, and therefore retries must not result in duplicate charges. And then finally for
performance we will need relatively low
latency for the initial synchronous
authorization step. We should aim for less
than 2 to 3 seconds to ensure a good
user checkout experience. So now that
we've got our requirements set, let's
look at a data model that will satisfy
these requirements. For a payment
gateway, a relational database, so something like Postgres or MySQL, is the standard choice because ACID compliance is non-negotiable.
So firstly we'll have our main ledger
database and here we will have a
merchant table and so this will store
the profile and authentication details
for the businesses utilizing the payment
gateway to process transactions. Next we
will have the payment order. So this
serves as the central state machine and
financial ledger for tracking the life
cycle and monetary value of an
individual checkout attempt. And as you
can see here, it has its idempotency key and then the status. So we know what status the payment is in. We'll also
have a payment event. And this will act
as an immutable append-only audit log that
permanently records every raw
interaction and response payload from
external acquiring banks. And so this
can be super useful especially for
things like auditing.
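The tables described so far could be sketched roughly like this. The column names here are illustrative guesses, not a prescribed schema, shown as a runnable sqlite3 snippet standing in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Businesses using the gateway to process transactions.
CREATE TABLE merchant (
    merchant_id  TEXT PRIMARY KEY,
    name         TEXT NOT NULL,
    api_key_hash TEXT NOT NULL
);

-- Central state machine and ledger row for one checkout attempt.
CREATE TABLE payment_order (
    payment_id      TEXT PRIMARY KEY,
    merchant_id     TEXT NOT NULL REFERENCES merchant(merchant_id),
    idempotency_key TEXT NOT NULL UNIQUE,
    amount_minor    INTEGER NOT NULL,  -- minor units (cents), never floats
    currency        TEXT NOT NULL,
    status          TEXT NOT NULL      -- e.g. created/authorized/captured/failed/settled
);

-- Immutable append-only audit log of every raw bank interaction.
CREATE TABLE payment_event (
    event_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    payment_id TEXT NOT NULL REFERENCES payment_order(payment_id),
    payload    TEXT NOT NULL,          -- raw response payload from the acquirer
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")
```

Note the UNIQUE constraint on the idempotency key and that amounts are stored in integer minor units; both choices guard against the double-charge and rounding problems discussed later.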
We'll also have our outbox event table.
And so this functions as a reliable
temporary staging area for the
transactional outbox pattern to
guarantee that asynchronous events like
merchant webhooks are safely published to Kafka without data loss. And so the
way the transactional outbox pattern
works, it's a technique that ensures
reliable message publishing by storing
outbound messages in the same database
transaction as business data and then
having a separate process reliably
deliver those messages to external
systems. And don't worry about it for
now. We'll dive into that later. And
then finally, we'll have a separate
database. This will be our vault
database. This could have a card vault table. And so this will be a physically
isolated database to securely store
encrypted credit card numbers, so PANs (primary account numbers). The PAN is the 13- to 19-digit number on a payment card that identifies the cardholder's account with the issuing bank. Keeping it here minimizes the system's overall PCI compliance scope. So we will again dive deeper into that in a little bit. So now
that we've looked at the data model,
it's time to look at the API design. And
so for this, I think we can easily implement a nice simple RESTful API, as this also supports idempotency. So our
first endpoint will be the create payments endpoint. It'll be a POST to the /v1/payments endpoint. In the headers, we will include our authorization, which will include a bearer token. And we'll also have that idempotency key. And so this idempotency key ensures that if the same payment request is accidentally sent multiple times, which could happen due to network issues or a user double-clicking, it will only be processed once and return the same result. And so then in our body, it will
include the amount, the currency, the
payment method token. Again, it's the
tokenized card and not the actual credit
card details, as well as maybe a
description about the product. Next, we
will have our get payment status. And that's very simply a GET request to /v1/payments/{payment_id}, so we can find the status of that payment. And then finally, we can have an issue-a-refund endpoint. And this will be a POST to /v1/payments/{payment_id}/refunds. Again, include an idempotency key here as well, so we avoid double processing.
And in the body, we could include the
amount. So partial refunds are also
possible because it doesn't necessarily
have to be the full amount. So obviously
in a production system there would be a
lot more endpoints but for our system
design these three cover our core
requirements outlined earlier. Looking at a high-level design now, this system can
be broken down into kind of two core
paths. The first is the authorization
path and this is synchronous and this
path focuses on securely capturing the
payment intent and getting real time
approval from the financial network. So
here we'll have a client which will send
a request to the API gateway which will
act as the entry point for merchant
clients, handling SSL termination, rate limiting, and routing checkout requests
to internal services. This request will
then get sent to the payment service
which orchestrates the payment flow by
creating a transaction record in our
main ledger database and managing the
communication with downstream systems.
So this main ledger database again will be SQL, so implemented with something like Postgres, and this is our
relational source of truth that stores
merchant data, transaction amounts, and the current state (e.g. pending or authorized) of every payment. The payment
service will then evaluate the
transaction by going to our fraud and
risk service, checking metadata including the IP, device fingerprint, and velocity synchronously to block
suspicious payments before they reach
the bank. Once that is passed, the
payment service can then reach out to
our external gateway adapter and this
translates our internal JSON payloads into specific protocols. So something
like ISO 8583 which is an international
standard that defines the message format and
communication protocol for financial
transaction card systems or even legacy
XML which is then required by our
various external networks. And so in
this case that would be the acquiring
bank or the network. And so think of
something like Chase, Visa, Mastercard.
And so this will validate that the cardholder actually has the funds to make
the purchase they intend on making and
then return an approved or declined
response. Once we've done that, we can
then update our main ledger database
with this new status. So that's our first flow, which is all synchronous.
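To tie the synchronous path together, here is a sketch of the orchestration in plain Python. Every helper is a stand-in for a real service call, and all names are illustrative, not an actual API:

```python
# Sketch of the synchronous authorization path. Each helper below stands in
# for a real network call (fraud service, gateway adapter, acquiring bank).

def check_fraud(metadata: dict) -> bool:
    """Stand-in for the fraud & risk service (IP, device, velocity checks)."""
    return metadata.get("velocity", 0) < 10

def call_acquiring_bank(token: str, amount_minor: int) -> str:
    """Stand-in for the external gateway adapter plus acquiring bank."""
    return "approved"

def authorize_payment(db: dict, payment_id: str, token: str,
                      amount_minor: int, metadata: dict) -> str:
    # 1. Record the attempt as 'created' in the main ledger database.
    db[payment_id] = "created"
    # 2. Synchronously block suspicious payments before they reach the bank.
    if not check_fraud(metadata):
        db[payment_id] = "failed"
        return "declined_fraud"
    # 3. Translate and forward the request to the acquiring bank / network.
    result = call_acquiring_bank(token, amount_minor)
    # 4. Persist the bank's decision as the new payment status.
    db[payment_id] = "authorized" if result == "approved" else "failed"
    return result

ledger = {}  # dict standing in for the main ledger database
print(authorize_payment(ledger, "pay_1", "tok_abc", 2500, {"velocity": 1}))  # approved
```

The key property is that the ledger status is written before and after the bank call, so every attempt leaves a traceable state even if a later step fails.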
The second will be our post-processing
path which is all asynchronous. And this
path focuses on updating the merchant
and eventually finalizing the movement
of funds. So the payment service will
publish a message to Kafka. And so Kafka
acts as that central event bus
decoupling the fast synchronous
authorization path from slower
downstream tasks. So we will then have a
webhook listener, and this will be the merchant's webhook listener, which will consume the payment success or failure event from Kafka and then fire HTTP callbacks to the merchant's backend
system. So if you think of something
like Substack, maybe a user is reading
an article and they aren't subscribed,
so they can't finish it. So then they subscribe, and that webhook could then listen and update the subscription
status of that user in the database so
that the system now knows this user has
full access and the user can continue
reading the article. So while this
implementation technically facilitates a
payment, it contains significant
architectural anti-patterns and is missing
features. For example, we're storing raw
credit card numbers, which is a big no
no. Our system is still vulnerable to
network retries, and we are also exposed to dual-write data loss. This is preventing us from fulfilling the requirements outlined at the start, but don't worry, we are going to go into the
deep dives now to solve each one of
those individually. So in our first deep
dive we are going to handle idempotency. And so, to meet the requirement
of no double charges the system needs to
handle the reality of dropped network
connections and aggressive client
retries. So if a client doesn't receive
the HTTP response they will retry.
However, we cannot have a second charge.
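The fix described next is an idempotency key checked before any work happens. Here is a minimal sketch, with a plain dict standing in for the Redis cluster (real code would use an atomic SET with NX and a TTL); the key names and responses are illustrative:

```python
# Idempotency dedup sketched with a dict standing in for Redis.
cache = {}  # idempotency_key -> ("in_progress", None) or ("done", response)

def handle_payment(idempotency_key: str, charge) -> tuple[int, str]:
    entry = cache.get(idempotency_key)
    if entry is not None:
        state, response = entry
        if state == "in_progress":
            # Same request is still running: reject the duplicate.
            return (409, "conflict: request already in progress")
        # Already completed: replay the cached response, charge nothing.
        return (200, response)
    cache[idempotency_key] = ("in_progress", None)
    response = charge()  # the one real charge against the bank
    cache[idempotency_key] = ("done", response)
    return (201, response)

charges = []
def charge_card():
    charges.append("charged")
    return "charged $25.00"

first = handle_payment("key-123", charge_card)
retry = handle_payment("key-123", charge_card)
print(first)  # (201, 'charged $25.00')
print(retry)  # (200, 'charged $25.00'), replayed; the card was charged only once
```

Note the client retry gets the exact same response body as the original call, which is what lets clients retry blindly without fear.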
And so, the way we're going to handle this is we'll get our client to generate a unique UUID. And so, this will be an idempotency key header included in the request for every unique checkout attempt. And so, the payment service will first check a Redis cluster before hitting the database. Using this UUID as the key, it will check, and if this key exists and the transaction is in progress, the system will simply return a 409 Conflict. However, if it is completed, it will immediately return the cached HTTP response from the first successful run. So, by implementing this client-generated UUID and Redis cluster, we prevent double-charging a customer in the event of a network failure or a double click. Next is looking at our PCI DSS compliance. Again, that's a security standard that organizations must comply with when handling credit card data. And
so our initial design implies that the
raw PAN, so the primary account number, flows through the API gateway and
the payment service. And so this would
trigger massive PCI compliance audits
for the entire infrastructure. And so
what we're going to do is we're going to
isolate this risk by building a highly
restricted, isolated microservice to handle
raw card data. So firstly, the client
application will send the raw PAN
directly to the card vault service
completely bypassing our main API
gateway. The vault will then encrypt the
PAN using envelope encryption. And
envelope encryption is a security
technique where data is encrypted with
both a data encryption key and then that
data encryption key itself is encrypted
with a separate key encryption key
creating two layers of encryption
protection. And so that detail isn't
really necessary, but all you need to
know is that it is encrypted. And then
that will be stored in an isolated vault database, and it will return a non-sensitive string, so a tokenized string. So then, when the client
later submits this token to the main
payment service, our core infrastructure
only ever sees the tokens. And so just
before the request leaves our network,
the external gateway adapter will
securely query the vault service which
will then swap the token back for the
real PAN to then send on to acquiring
banks. And so it's this isolated microservice and vault database that enables us to be PCI DSS compliant
without affecting our entire
infrastructure. Next we have distributed
transactions. So if the payment service
updates the database to captured but
crashes before sending the event to Kafka, the merchant's webhook never
fires and the customer never gets their
product. So how do we handle this? Well,
we will firstly have an atomic local commit. And so the payment service opens a
single local database transaction and it
updates the payment status to captured
and inserts the event payload into an
outbox events table simultaneously.
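The atomic local commit just described can be sketched with sqlite3 standing in for Postgres; the table and column names here are illustrative:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment_order (payment_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox_events (id INTEGER PRIMARY KEY AUTOINCREMENT,
                            payload TEXT NOT NULL, published INTEGER DEFAULT 0);
INSERT INTO payment_order VALUES ('pay_1', 'authorized');
""")

# The status update and the outbox insert commit (or roll back) together,
# so a crash can never leave a captured payment without its webhook event.
with conn:
    conn.execute("UPDATE payment_order SET status = 'captured' "
                 "WHERE payment_id = 'pay_1'")
    conn.execute("INSERT INTO outbox_events (payload) VALUES (?)",
                 (json.dumps({"payment_id": "pay_1",
                              "event": "payment.captured"}),))

# A separate relay (e.g. a CDC worker) later reads unpublished rows,
# pushes them to Kafka, and marks them published.
rows = conn.execute("SELECT payload FROM outbox_events "
                    "WHERE published = 0").fetchall()
print(rows[0][0])
```

The point of the pattern is that there is exactly one transaction; the event row is just ordinary data until the relay picks it up.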
There will then be an asynchronous relay, so change data capture. A CDC worker like Debezium monitors the outbox events table and safely publishes the message to Kafka. And this guarantees at-least-once delivery to the webhook listener. And so that's the key part
there is that we are having one
transaction and then polling on that outbox events table so that we get at-least-once delivery. So the merchant's webhook will always get triggered with
that event ensuring that no data is ever
lost. And then finally we will have
automated reconciliation. And so the
ultimate source of truth is the actual
money moving between banks. It's not our
internal database. And so what we will
have is we will get bank settlement
files. So let's say at the end of every
day the acquiring bank deposits an SFTP or S3 batch file detailing all actual funds moved. And then we will have reconciliation workers, so cron workers that will parse
these files and match every single
external record against the main ledger
database using the unique transaction
IDs. And to handle drift resolution, it'll update successful matches to settled, and any mismatches (for example, the bank charged $50 but our database says it failed) are then pushed to an
operations queue for manual auditing or
automated refunding. And so that's how
we ensure our system is in sync with the
source of truth which is what actually
happened in terms of the money movement
within the bank. So I've thrown a lot at
you. So what I'm going to do now is walk
through the complete architecture so you
understand all the components and how
they all fit together so that when it
comes to your interview, you will have
this one logical flow in your brain
that'll be super easy to recall because
you'll understand how they all work
together. So the client again will
safely capture the raw credit card PAN and send it directly to the isolated card vault service. The vault will then encrypt it, store it, and then return a
secure token back to the client. The
client submits the checkout request
containing the token, the amount, and a
unique idempotency key to the API gateway, which routes it to the payment service. The payment service first checks the Redis cluster to ensure the
exact request isn't already being
processed and it then synchronously
consults the fraud risk service to
ensure the transaction is safe to
proceed. The payment service inserts a
new transaction record into the main
ledger database with the status of
created. The payment service can then
forward the payload to the external
gateway adapter. The adapter queries the
vault to securely swap the token back to
the original raw PAN and then formats
the payload and synchronously calls the
acquiring bank. The bank then returns an
approved response, and in a single atomic local database transaction, the payment service updates the transaction state from created to authorized (and to captured, if fulfilling digital goods immediately) and writes a webhook payload to the outbox event table. So a
background CDC worker instantly reads
the outbox event table and pushes the
message to Kafka. The webhook listener consumes this and fires an HTTP callback to inform the merchant's backend of the
success. And finally, one or two days
later, the reconciliation workers download the definitive batch settlement file from the bank, match the transaction IDs, and update the local database state to settled, confirming the actual cash has landed in the merchant's account. So as well as
this complete architecture there are
some additional discussion points you could raise if you wanted to. The first one is handling retry storms, so the thundering herd problem. So if the payment gateway experiences a
10-second latency spike, thousands of
clients might auto retry their request
simultaneously. And so the solution to
this is to discuss the importance of
exponential backoff with jitter on the
client side and aggressive rate limiting
at the API gateway to drop excessive
retries. And so the idempotency key in Redis will handle the rest. Next is
scaling the main database. And so as the
platform grows to millions of
transactions a day, a single Postgres instance will become a bottleneck for reads and writes. And so the solution is
sharding. And so for a B2B payment
gateway like Stripe, the best shard key
is usually the merchant ID. And so
this will ensure all transactions, webhooks, and reconciliations for a specific merchant live on the same database node, making aggregation and pagination fast. Then we'll have clock skew in
reconciliation. When reconciling your database against the bank settlement file, time zones and clock skews are nightmares. A transaction might happen at 23:59:59 on your service but register as 00:00:01 the next day on the bank's service. So the
solution to this is to reconcile based
on strict universally unique transaction
IDs passed down to the bank during the
initial authorization, never purely by
timestamp or amount. So again, check out
the full write-up and AI whiteboard at techprep.app, and hopefully you got some
value out of this. If you did, please
like and subscribe and share with a
friend. It helps the channel out a lot
and I will see you in the next one.