OkCupid: From REST to GraphQL

This article was originally posted as a two-part series on the OkCupid Tech Blog. Here, it is reproduced in full in two sections:


Part One: Moving from REST to GraphQL

A lot has been written about the benefits of moving from a REST API to a GraphQL API¹. But let’s say that you’re already convinced. If you want to convert a site with millions of users, ensure that performance doesn’t suffer, and just really don’t want to screw it up: how do you do it?

We embarked on this journey in 2019 and made it out alive to tell the tale! Our GraphQL API is now the official API at OkCupid, with all clients adopting it: our iOS and Android apps, as well as our desktop and mobile web single-page React apps.

So, here’s how we tackled this huge project. I’ll talk a little about what we built, the strategy we came up with to test the new code we were shipping, and a few things that could have gone better on the technology side. Disclaimer: this article is more about the process than the code itself; to hear about the performance issues we had to overcome to reach parity with our previous API, read about our first release in part two.

But first, some stats

At the time of writing, our GraphQL API has been in production for 1½ years, and we stopped adding new features to our REST API over a year ago. The graph handles up to 175k requests per minute, and it is made up of 227 types (2023 update: we are up to 432 types).

We haven’t fully deprecated our REST API, but we’re more than halfway through converting our clients if you look at request volume (we’ve added the entities that support the most popular pages), and maybe a little less than halfway there by entity count.

How we did it

Since this was a whole new tech stack and repository for us (Node, Apollo Server, Docker²), we needed to figure out a plan to verify its efficacy without disrupting production. Our process was:

  1. Pick an appropriate page to convert
  2. Build the schema
  3. Add a shadow request to call the new API while still fetching data via the REST API
  4. Do an A/B test with real users that changes the data source

We started the project at the start of January 2019, released our shadow query on January 28th, started our A/B test on March 13th, and released it fully on April 30th. So in just 4 “easy” steps, you too can have a graph in production in “only” 4 months!

So let’s dig into each step.

1. Pick an appropriate page to convert

We decided to make the OkCupid Conversations page our test bed. On this page, users can see the list of ongoing conversations they have, as well as a list of “mutual matches” (people with whom they can start a new conversation):

The conversations page at the time of conversion

It’s important to choose a page that will let you model some core parts of your site; this will help you settle on conventions, flesh out important parts of your data model, create a better base for future work, and just be a better proof of concept. The more “real” the page is, the more it will help you learn if the new API is going to work.

We chose the Conversations page, which made us consider how to represent:

  • User: basic information about a user account
  • Match: stateful information about how two users relate to each other (e.g., match percent, if one has liked the other, etc.)
  • Conversation: basic conversation information (e.g., the sender, a snippet of the last message, the time sent)
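In schema terms, a first pass at those entities might look something like this (a simplified sketch with illustrative field names, not our exact schema):

```js
const { gql } = require('apollo-server-express');

const typeDefs = gql`
  "Basic information about a user account"
  type User {
    id: ID!
    displayName: String!
    age: Int
  }

  "Stateful information about how two users relate to each other"
  type Match {
    matchPercent: Int!
    senderLikesTarget: Boolean!
    targetLikesSender: Boolean!
    user: User!
  }

  "An ongoing conversation between the viewer and another user"
  type Conversation {
    correspondent: User!
    snippet: String!
    time: String!
  }
`;
```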

It also got us thinking about some reusable API concepts like pagination.

2. Build the schema

For a lot of teams doing schema design for the first time, this will likely be a challenging step — it was for me! Some tips:

  • Do research. There is a lot of great writing about schemas, from the basic examples in the GraphQL docs, to GitHub and Yelp’s public APIs, to Relay’s docs. A big shout-out to the Apollo team here; we got great help from them at this stage.
  • Don’t worry about how your REST API formatted its data. It’s better to design your schema to be more expressive and idiomatic than it is to feel constrained by what your previous API returned.
  • Be consistent. Our previous API was mostly snake_case, but had a few ugly combined words (e.g., userid and displayname). This is your opportunity to make your field names more standard and readable, so take it!
  • Be specific. The more accurately you name the fields in your graph, the easier it is to migrate to a new field if you need to make a breaking change. For example, User.essaysWithDefaults is better than User.essays.
  • Take your research and make something that works for your team. When investigating pagination standards, for example, I was tempted to use Relay’s spec, but found its reliance on terms like edges and nodes more clinical than we wanted to expose to clients in our graph (we instead settled on returning a list of data³).
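To make the consistency and specificity points concrete, here is the kind of renaming we mean (a small sketch; userid, displayname, and essaysWithDefaults come from the examples above, the rest is illustrative):

```js
const { gql } = require('apollo-server-express');

const typeDefs = gql`
  type User {
    "Was 'userid' in the REST API"
    id: ID!

    "Was 'displayname' in the REST API"
    displayName: String!

    "Named for exactly what it returns: essays, with defaults filled in"
    essaysWithDefaults: [String!]!
  }
`;
```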

3. Add a shadow request

Before having GraphQL provide data to real users, we tested our system in production with a shadow request: on our target page, the client requested its data from the REST API and rendered the page as usual, then made the same request to GraphQL and discarded the duplicate response. This let us compare the performance of the two APIs and fix issues before users found them.

We certainly aren’t the first people to think of this, but it was a massively important step for us. Our first draft of this API took nearly twice the time of the REST API, which, obviously, was not cool. Releasing a shadow request allowed us to triage these performance issues without affecting real users’ experience on the site.

For the technical side of what went wrong and how we got GraphQL up to speed parity, check out part two.

4. Run an experiment

The final step was to test the new API against the old with real users! Since we already verified that the response times were similar with the shadow request, we felt confident releasing an A/B test.

Experiments where you expect not to see a change are tricky because you are trying to prove that nothing happened. So in an experiment like this, the stats you’re tracking will, by nature, never reach significance unless there’s something wrong.

So instead of looking for a significant change in stats, you should set a duration for your experiment; once you’ve reached that duration and still see no significant changes, you can launch with confidence. For us, that was a month’s run (with over 100k users in each group). And… it worked!

What could have gone better

No first draft is ever perfect (nor any second draft, for me at least). While the process around releasing the API went well, there were a few technical things we learned after our release.

Error handling

We didn’t have any structure around how we returned errors from GraphQL mutations, and by the time we realized there was a problem, we had a robust variety of ways of showing errors to our clients. A solution that seems really interesting would be to standardize on an Error type that we can extend in a given mutation payload. This Medium post has a very in-depth writeup of good error styles.
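The shape we find appealing looks something like this (a sketch only; the type and field names here are illustrative, not our production schema):

```js
const { gql } = require('apollo-server-express');

const typeDefs = gql`
  "A base error that every mutation payload can return"
  interface Error {
    code: String!
    message: String!
  }

  "A more specific error that extends the base shape"
  type RateLimitedError implements Error {
    code: String!
    message: String!
    retryAfterSeconds: Int!
  }

  type SendMessagePayload {
    success: Boolean!
    errors: [Error!]!
  }

  type Mutation {
    sendMessage(userId: ID!, text: String!): SendMessagePayload!
  }
`;
```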

Where should business logic go?

When confronted with a product feature that involves a business rule, it can be tempting to add that logic to the API layer, especially if you’d otherwise be relying on another team to implement it.

For example, we built a feature that shows a list of everyone who liked and messaged you. We show the whole list to paid users, but for free users we only show the first one, then a series of placeholders. Our first release of this feature had the logic to check a user’s paid status and replace the cards with placeholders in the API layer.
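In resolver terms, that first version looked roughly like this (a simplified sketch; names like likesYou, isPremium, getLikesYou, and Placeholder are illustrative):

```js
const resolvers = {
  User: {
    likesYou: async (user, args, { dataSources }) => {
      const likers = await dataSources.backend.getLikesYou(user.id);

      // Paid users see everyone who liked and messaged them.
      if (user.isPremium) {
        return likers;
      }

      // Free users see the first person, then placeholders for the rest.
      return likers.map((liker, index) =>
        index === 0 ? liker : { __typename: 'Placeholder' }
      );
    },
  },
};
```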

After working with the graph for a while now, we’ve realized that the business logic works best when centralized in the back-end, and that the role of our graph is to fetch, format, and present the back-end’s data in a way that makes sense to clients.

That’s it, y’all

Overall, our process worked out really well; it allowed us to get something into production quickly to validate our technical decisions, fix errors before they got to users, and test our changes against the previous API.

If you decide to take a similar journey, we hope this roadmap will be useful. Good luck!


Part Two: The Shadow Request (aka Troubleshooting our First Release)

As we were building our GraphQL API in a totally new stack, we wanted to see how it would measure up against our previous REST API with a real production load, and we wanted to do so without negatively impacting the user experience.

To do this, we released what we called The Shadow Request. On our target page, the client loaded the page’s data from the REST API as normal and rendered the page. Then it loaded the same data from GraphQL, measured that call’s timing, and discarded the result.
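On the web client, the mechanics look something like this sketch (the query, the /metrics endpoint, and the reportTiming helper are illustrative, not our actual code):

```js
import gql from 'graphql-tag';

// Stand-in for whatever metrics pipeline you report timings to.
function reportTiming(name, milliseconds) {
  navigator.sendBeacon('/metrics', JSON.stringify({ name, milliseconds }));
}

// The same data the REST API already rendered, expressed as a GraphQL query.
const CONVERSATIONS_SHADOW_QUERY = gql`
  query ConversationsShadow {
    me {
      conversations {
        snippet
      }
    }
  }
`;

// Fired after the page has already rendered from the REST response.
async function runShadowQuery(apolloClient) {
  const start = performance.now();
  try {
    await apolloClient.query({
      query: CONVERSATIONS_SHADOW_QUERY,
      fetchPolicy: 'no-cache', // keep the unused result out of the cache
    });
  } finally {
    // Record how long GraphQL took; the response itself is thrown away.
    reportTiming('graphql_shadow.conversations', performance.now() - start);
  }
}
```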

We didn’t come up with this idea, but it was a game changer for us: we discovered that our first release of the GraphQL API took about double the time of the REST API (1200ms versus 600ms). If we had shown this version to real users, it would have led to a very poor experience for them.

Here, I will talk about the improvements we made to our Docker and Node environments, how GraphQL resolvers behave on lists of entities, and an unexpected cost from CORS requests. So, let’s take a look!

Docker and Node Low-Hanging Fruit

The first thing we realized was that I had accidentally released a build with NODE_ENV set to development. You always hear not to do this, since development mode enables extra logging and slower code paths in packages, but now I have the empirical evidence for why: changing NODE_ENV to production saved us 34ms per request on average.

We were also using an unoptimized Docker base image for this initial deploy. Switching from the default node image to node:stretch-slim reduced our image size by 600 MB (from 850 MB to 250 MB); while this didn’t speed up the application’s response time, it did make our development cycle quicker by speeding up the build and deploy process.

These were not the biggest wins, but they were two of the easiest!

Naive GraphQL Resolvers Can Be Sloooow

If you have a field that returns a list of entities (in our case, OkCupid users), you’ll probably be getting information about each of those users, like their name or age.

The page that we were converting to GraphQL for this deploy was the OkCupid messages page. When making our schema, we defined a Conversation as having a text snippet from the last sent message, and a User entity representing the person to whom you're talking. Then we added a field on the top-level User entity to fetch that user's conversations. Here are the relevant parts of the schema and resolver:
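(Reconstructed below as a simplified sketch; the data-source calls getConversations and getUser are illustrative stand-ins for our REST-backed data sources.)

```js
const { gql } = require('apollo-server-express');

const typeDefs = gql`
  type User {
    id: ID!
    displayName: String!
    "The viewer's ongoing conversations"
    conversations: [Conversation!]!
  }

  type Conversation {
    "A snippet of the last message sent"
    snippet: String!
    "The person you're talking to"
    user: User!
  }
`;

const resolvers = {
  User: {
    conversations: (user, args, { dataSources }) =>
      dataSources.rest.getConversations(user.id),
  },
  Conversation: {
    // The naive version: one back-end request per conversation.
    user: (conversation, args, { dataSources }) =>
      dataSources.rest.getUser(conversation.targetUserId),
  },
};
```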

A simplified version of our Conversation schema

This worked; we deployed and celebrated! But when we looked at a request’s trace, we saw a waterfall: one sequential call to the back-end for each conversation’s user.

Uh, ok… that waterfall is definitely NOT what we were looking for. But thinking about it, it makes sense: we only told the resolver how to get a single user’s information, so it does all it knows how to do and makes 20 cascading requests to the back-end.

But, we can do better. We happened to already have a way to get information about multiple users from the back-end at the same time, so the solution was to update our resolver with a package to batch multiple requests of the same entity type. Lots of folks use DataLoader, but in this particular example I found GraphQL Resolve Batch to be more ergonomic. Here’s our updated resolver:
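(Again a sketch rather than our exact code; createBatchResolver comes from the graphql-resolve-batch package, and getUsers stands in for our batch endpoint.)

```js
const { createBatchResolver } = require('graphql-resolve-batch');

const resolvers = {
  Conversation: {
    user: createBatchResolver(async (conversations, args, { dataSources }) => {
      // One request for every user on this page of conversations.
      const ids = conversations.map(conversation => conversation.targetUserId);
      const users = await dataSources.rest.getUsers(ids);

      // Return users in the same order as the conversations they belong to.
      const usersById = new Map(users.map(user => [user.id, user]));
      return conversations.map(conversation =>
        usersById.get(conversation.targetUserId)
      );
    }),
  },
};
```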

Note the updated data source call — getUsers instead of getUser

So here, we pass the package a function that looks like a normal resolver, but instead of getting a single parent as the first argument, the package provides the full list of parents (our list of conversations). We then pluck out the user IDs and call our batch API endpoint, getUsers. That change sliced almost 275ms off the call, and the timeline looked pretty darn slick:

In this particular instance, chasing waterfalls was advisable

Subdomains + CORS Didn’t Work For Us

Those two changes got us most of the way there, but our GraphQL API was still consistently 300ms slower than our REST API. Since we had already pared down the server side of things as much as we could, we started looking from the client’s perspective.

Early on in the project, we decided to serve our API from graphql.okcupid.com, and saw that requests from www.okcupid.com were triggering a CORS preflight. That is normal, but the preflights were taking what felt like an eternity: 300ms (does that number ring a bell?). We investigated a number of angles with our Ops team (was it Cloudflare? our HAProxy load balancer?), but didn’t come up with any reasonable leads. So we decided to just try serving the API from www.okcupid.com/graphql, and the 300ms vanished. What a trick!
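For reference, with the Apollo Server 2 and Express setup described in the footnotes, mounting the API on the main domain looks roughly like this (a sketch with a placeholder schema):

```js
const express = require('express');
const { ApolloServer, gql } = require('apollo-server-express');

// Placeholder schema just to keep the sketch self-contained.
const typeDefs = gql`
  type Query {
    ping: String!
  }
`;
const resolvers = { Query: { ping: () => 'pong' } };

const app = express();
const server = new ApolloServer({ typeDefs, resolvers });

// Served from the same origin as the site, so browser requests to /graphql
// never need a CORS preflight.
server.applyMiddleware({ app, path: '/graphql' });

app.listen(3000);
```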

Hey, It Worked

After releasing this series of changes to our setup, we reached parity with our old REST API. We discovered and fixed issues with our Node environment, GraphQL resolvers, and CORS, all without impacting site performance. And we were then well positioned to release an experiment that compared real users loading data from GraphQL versus the REST API.

If you are considering adding new technology to your stack, hopefully you will consider a shadow request to validate it. And if that stack happens to create a GraphQL API, hopefully you can avoid some of the pitfalls that we hit. Good luck!


Thanks to Katherine Erickson, Raymond Sohn, and the OkCupid web team for reading drafts of this article.

1. For us, it boiled down to: a more expressive way for clients to interact with our data, a more performant way to retrieve data with fewer network requests, more flexibility for our clients to create new features without API changes once the graph was built out a bit, and a technology that is rapidly being adopted as a community standard for APIs.

2. This was a greenfield project, built in a new repository and deployed separately from our back-end and client codebases. It runs in Node, using Apollo Server and Express. Our data was provided by calls to our REST API for the initial release, but we’ve since moved to calling our back-end directly using gRPC.

The API is deployed with Docker: we build Docker images with CI, and orchestrate releasing those images to our web servers with Docker Swarm. A huge, truly enormous shout-out goes to Hugh Tipping on our ops team for putting together Docker Swarm and a launch script to interact with it, along with tons of Docker experience and support! Also emotional support.

We use Apollo Client across all platforms (desktop/mobile web, iOS, and Android), and integrated with Apollo Studio to use their Operation Registry for security and to track speed and field usage stats.

3. edges and nodes didn't feel right to us, but the Relay description of paging cursors was pretty spot on. So, we use a data array for the items, and a Relay-inspired PageInfo entity:
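Something like this (a sketch of the shape, with illustrative field names):

```js
const { gql } = require('apollo-server-express');

const typeDefs = gql`
  type Conversation {
    snippet: String!
  }

  "Relay-inspired paging metadata, without the edges and nodes wrappers"
  type PageInfo {
    hasNextPage: Boolean!
    "Opaque cursor to pass as the after argument for the next page"
    endCursor: String
  }

  type ConversationPage {
    data: [Conversation!]!
    pageInfo: PageInfo!
  }

  type Query {
    conversations(first: Int!, after: String): ConversationPage!
  }
`;
```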

Thanks for reading.