Scaling GraphQL Adoption at Netflix: Tejas Shikhare at QCon San Francisco 2022

At QCon San Francisco 2022, Tejas Shikhare, Senior Software Engineer at Netflix, presented Scaling GraphQL Adoption at Netflix. Tejas has worked on Netflix’s federated GraphQL platform, distributed systems, and more recently developer tools and training. This talk is part of the editorial track Modern APIs: Building and Evolving.

Tejas began his presentation with an introduction to GraphQL, an alternative communication protocol for APIs between clients and servers. In GraphQL, a schema defines the data graph. Three root types, Query, Mutation, and Subscription, represent the entry points to that graph.

GraphQL offers a few advantages: it minimizes round trips to the server, it is strongly typed, and its schema can act as a visual representation of your APIs across different domains. Tejas summarized GraphQL as follows:

Put simply, GraphQL gives you the power to fetch exactly the data you want from the server. No more. No less.
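To make the "exactly the data you want" idea concrete, here is a minimal Python sketch of field selection. It is not a GraphQL implementation; the record, its field names, and the `select_fields` helper are assumptions made up for illustration.

```python
# Hypothetical in-memory record; the field names are illustrative only
# and are not taken from Netflix's schema.
MOVIE = {
    "movieId": "m1",
    "title": "Example Movie",
    "runtimeMinutes": 112,
    "synopsis": "A placeholder synopsis.",
}


def select_fields(record, selection):
    """Return only the fields named in `selection`, in the order requested."""
    missing = [name for name in selection if name not in record]
    if missing:
        # A GraphQL server would reject such a query during validation.
        raise KeyError(f"unknown fields requested: {missing}")
    return {name: record[name] for name in selection}


# The client asks for two fields and receives exactly those two.
result = select_fields(MOVIE, ["title", "runtimeMinutes"])
print(result)
```

A real GraphQL server applies the same principle recursively over nested selections and validates each query against the schema before executing it.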

Netflix started its GraphQL journey with a dedicated server called the DNA API, which communicates with its microservices, a common pattern for companies with GraphQL servers. Before that, between 2012 and 2015, Netflix used an internally developed tool called Falcor.

As the DNA API became more popular, problems emerged. Code changes were required at both the microservice and API layers, often by different teams. As a result, the API team had to be experts in many areas, act as the first line of support, and make code changes frequently. This resulted in slow build times and cascading errors.

Federated GraphQL addresses these problems by extending types across service boundaries and allowing each team to implement its own part of the API. In the example from the talk, each service knows about movieId and hydrates the fields it owns.
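The idea of several services each hydrating their own fields for a shared movieId can be sketched roughly as follows. The two services, their field names, and the sample data are hypothetical; only movieId comes from the talk.

```python
# Two hypothetical domain services. Each one knows the shared key
# (movieId) and returns only the fields it owns.
def catalog_service(movie_id):
    titles = {"m1": "Example Movie"}
    return {"movieId": movie_id, "title": titles[movie_id]}


def ratings_service(movie_id):
    scores = {"m1": 4.5}
    return {"movieId": movie_id, "averageRating": scores[movie_id]}


def hydrate_movie(movie_id, services):
    """Merge the partial views of one movie returned by each service."""
    movie = {"movieId": movie_id}
    for service in services:
        part = service(movie_id)
        if part["movieId"] != movie_id:
            raise ValueError("service returned a different entity")
        movie.update(part)
    return movie


movie = hydrate_movie("m1", [catalog_service, ratings_service])
print(movie)
```

Neither service needs to know about the other's fields; the shared key is the only contract between them.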

There are three components in a federated GraphQL architecture. First, a domain graph service (DGS) is responsible for implementing a subgraph. Next, a schema registry validates each subgraph and merges them to form a supergraph. Finally, the supergraph is made available to clients via a highly available GraphQL gateway service. When a client sends a query to the GraphQL gateway, the gateway breaks the query into subqueries and sends them to the relevant DGSs.
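A heavily simplified sketch of the registry and gateway roles described above might look like this. The service names, the dictionary-based schema format, and both functions are assumptions for illustration; real federation implementations also handle shared key fields, nested types, and query ordering, which this sketch ignores.

```python
from collections import defaultdict


def compose_supergraph(subgraphs):
    """Registry role: merge {service: {type: [fields]}} into
    {type: {field: owning_service}}.

    Simplification: a field claimed by two services is rejected outright,
    whereas real federation lets services share key fields.
    """
    supergraph = defaultdict(dict)
    for service, types in subgraphs.items():
        for type_name, fields in types.items():
            for field in fields:
                owner = supergraph[type_name].get(field)
                if owner is not None and owner != service:
                    raise ValueError(
                        f"{type_name}.{field} is defined by both {owner} and {service}"
                    )
                supergraph[type_name][field] = service
    return {type_name: dict(fields) for type_name, fields in supergraph.items()}


def plan_query(supergraph, type_name, requested_fields):
    """Gateway role: group the requested fields by the service that owns them."""
    plan = defaultdict(list)
    for field in requested_fields:
        try:
            owner = supergraph[type_name][field]
        except KeyError:
            raise ValueError(f"{type_name}.{field} is not in the supergraph") from None
        plan[owner].append(field)
    return dict(plan)


# Hypothetical subgraphs published by two DGSs.
SUBGRAPHS = {
    "catalog-dgs": {"Movie": ["title", "synopsis"]},
    "ratings-dgs": {"Movie": ["averageRating"]},
}

supergraph = compose_supergraph(SUBGRAPHS)
plan = plan_query(supergraph, "Movie", ["title", "averageRating", "synopsis"])
print(plan)
```

Each entry in the resulting plan corresponds to one subquery the gateway would send to a DGS before stitching the responses together.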

Although this federated GraphQL architecture at Netflix today handles 1 billion+ daily requests, 10,000+ types and fields, and 500+ active developers, it comes with some challenges. Tejas identified the following issues at Netflix:

  • Federation and GraphQL have steep learning curves.

  • In a federated GraphQL architecture, many teams change the schema at the same time, which can lead to inconsistent schema design.

  • The graph is becoming too big for easy collaboration. Naming conflicts are a common problem, and namespaces alone do more harm than good.

These issues lead to a key question: while Federated GraphQL gives each team the freedom to move quickly, does it allow developers to be responsible stewards of the API?

To answer this question and address these growing pains, Tejas and his team developed a workflow called Collaborative Schema Design. They also built several tools to support and reinforce this workflow.

  • GraphHub is a schema collaboration tool that reduces the collaboration challenges between client and server teams. GraphHub is a monorepo that contains all schemas and syncs with the schema registry to keep the latest production schema. Because GraphHub is a repo, any developer can propose a change via a pull request.

  • Alongside GraphHub, Tejas’s team also created a schema working group that anyone can join to set and maintain standards.

  • GraphDoctor, a schema linter, was developed to help keep the API consistent in this massive multiplayer environment. GraphDoctor listens for new pull requests and applies codified schema guidelines as linter rules to keep API schema designs consistent.

  • Graphlabs creates a sandbox environment for each new pull request to enable rapid prototyping and a short feedback loop between client and server teams.

  • Graph statistics and notifications support the deprecation workflow. They count uses of deprecated fields and notify the teams involved when those fields are still being used.
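To give a feel for what "codified schema guidelines as linter rules" could mean, here is a small sketch of a schema linter. It is not GraphDoctor; the two rules (naming case and required descriptions), the dictionary schema format, and the function name are assumptions chosen for illustration.

```python
import re

# Hypothetical guidelines: PascalCase type names, camelCase field names,
# and a non-empty description on every field.
PASCAL_CASE = re.compile(r"^[A-Z][a-zA-Z0-9]*$")
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def lint_schema(schema):
    """Check {type_name: {field_name: description}} and return a list of findings."""
    problems = []
    for type_name, fields in schema.items():
        if not PASCAL_CASE.match(type_name):
            problems.append(f"{type_name}: type names should be PascalCase")
        for field_name, description in fields.items():
            if not CAMEL_CASE.match(field_name):
                problems.append(
                    f"{type_name}.{field_name}: field names should be camelCase"
                )
            if not description.strip():
                problems.append(f"{type_name}.{field_name}: missing description")
    return problems


# A proposed schema change, as it might appear in a pull request.
proposed = {
    "Movie": {
        "title": "Display title of the movie.",
        "release_date": "",
    }
}

for problem in lint_schema(proposed):
    print(problem)
```

Running rules like these automatically on every pull request gives authors feedback before a human reviewer has to point out the same convention again.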

Tejas continued his presentation by noting that there are other issues that he and his team are actively working to solve, and cautioned:

Federation is not free, and it won’t magically solve all your problems. We had to create a lot of additional tools, documentation, and developer training to make it work.

Tejas concluded his talk with some recommendations for introducing GraphQL in your organization:

  • Start with a monolithic GraphQL API and offload that effort to a single team, ideally with a mix of backend and UI engineers.

  • As your GraphQL API grows, consider federation.

  • Plan a coordinated GraphQL effort across your organization to avoid separate, siloed GraphQL APIs.

  • Schema design is absolutely critical. The effort you put into your schema design directly impacts the success of GraphQL in your organization.

  • Take a schema-first approach.

  • Use a deprecation workflow to create a versionless GraphQL API.

  • Be product-centric.

  • GraphQL really shines for consumer and device APIs, but it’s not meant for everything.
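The deprecation workflow recommended in the list above depends on knowing who still uses a deprecated field. A minimal sketch of that bookkeeping, assuming a hypothetical query log of (client, type, field) tuples and made-up field and client names, could look like this:

```python
from collections import Counter

# Hypothetical set of fields that have been marked as deprecated.
DEPRECATED_FIELDS = {("Movie", "boxArtUrl")}


def count_deprecated_usage(query_log, deprecated):
    """Count, per client, how often each deprecated field was requested."""
    usage = Counter()
    for client, type_name, field in query_log:
        if (type_name, field) in deprecated:
            usage[(client, f"{type_name}.{field}")] += 1
    return usage


def safe_to_remove(usage, type_name, field):
    """A field can be removed once no client has requested it in the log window."""
    target = f"{type_name}.{field}"
    return not any(name == target for (_client, name) in usage)


# Hypothetical log entries: (client, type, field).
QUERY_LOG = [
    ("tv-app", "Movie", "boxArtUrl"),
    ("tv-app", "Movie", "title"),
    ("ios-app", "Movie", "boxArtUrl"),
    ("tv-app", "Movie", "boxArtUrl"),
]

usage = count_deprecated_usage(QUERY_LOG, DEPRECATED_FIELDS)
for (client, field), count in usage.items():
    print(f"{client} still uses {field} ({count} requests)")
print(safe_to_remove(usage, "Movie", "boxArtUrl"))
```

With counts like these, owners can notify the remaining clients and remove a field only after its usage drops to zero, which is what lets the API evolve without version numbers.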

Netflix’s engineering team also gave a presentation on GraphQL federation at a previous QCon Plus. Additional talks from the Modern APIs: Building and Evolving track will be recorded and made available on InfoQ in the coming months.
