Virtual Event
November 17–November 20, 2020

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2020 - Virtual to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Wednesday, November 18 • 5:45pm - 6:20pm
An SLO-Driven Approach to Enhance Kubernetes Cluster Reliability - Qian Ding & Cong Chen, Ant Financial

How do you define the reliability of a Kubernetes cluster? What are the SLOs? How many 9s are enough to keep end users happy with a Kubernetes cluster of thousands of nodes? Service-level objectives (SLOs) are the key to running large-scale production clusters reliably. Defining SLOs for classic web services is relatively simple, since web requests are served synchronously and return distinct status codes. In contrast, defining SLOs for Kubernetes services is harder because of Kubernetes' intent-oriented design and declarative APIs. This talk first outlines the philosophy behind the SLO-driven approach to reliability engineering, followed by a deep dive into how SREs define SLOs for one of the world's largest Kubernetes clusters at Ant Financial. Finally, the talk shares concrete cases and lessons learned from building an SLO framework from several perspectives, including monitoring, alerting, and tracing.
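
To make the "how many 9s" question and the request-versus-intent contrast concrete, here is a minimal Python sketch. It is not taken from the talk: the function names, the 30-day window, and the idea of measuring a declarative operation by whether it converges within a deadline are all illustrative assumptions.

```python
from typing import Optional, Sequence


def error_budget_minutes(nines: int, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a target with `nines` nines.

    Assumption: a 30-day window by default; three nines means 99.9%.
    """
    error_fraction = 10.0 ** (-nines)  # 3 nines -> 0.001
    return window_days * 24 * 60 * error_fraction


def request_sli(status_codes: Sequence[int]) -> float:
    """Request-based SLI: share of requests without a server error (5xx)."""
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def convergence_sli(seconds_to_ready: Sequence[Optional[float]],
                    deadline_s: float) -> float:
    """Intent-based SLI (hypothetical): share of declared operations, such as
    pod creations, that reached the desired state within a deadline.
    None means the operation never converged.
    """
    good = sum(1 for s in seconds_to_ready
               if s is not None and s <= deadline_s)
    return good / len(seconds_to_ready)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "nines:", round(error_budget_minutes(n), 2), "minutes per 30 days")
    print("request SLI:", request_sli([200, 200, 500, 404]))
    print("convergence SLI:", convergence_sli([12.0, 45.0, None, 70.0], 60.0))
```

Under these assumptions, three nines over 30 days leaves roughly 43 minutes of budget. The second function shows why a status code alone is not enough for a declarative API: accepting the request says nothing about whether the desired state was ever reached.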

Qian Ding

Staff Engineer, Ant Financial
Qian works at Ant Group as a staff engineer focusing on site reliability engineering. He is the SRE tech lead for Kubernetes adoption in Ant Financial's production environment. He is passionate about adopting and promoting the SRE philosophy for managing large-scale production systems...

Cong Chen

Senior Software Engineer, Ant Financial
Cong Chen is a senior site reliability engineer at Ant Financial. He is currently in charge of the stability of Ant Financial's large-scale Kubernetes clusters. Previously he worked at DIDI and Red Hat as an architect and virtualization engineer. He has been focusing on the topic...

Wednesday November 18, 2020 5:45pm - 6:20pm EST
Intrado Virtual Event Platform