Nele Uhlemann

Sep 21, 2023 • World Congress 2023

Handling incidents collaboratively is like solving a Rubik's Cube

What if developers could instrument their code and define SLOs with a simple decorator? Learn a new approach to making observability a shared responsibility.

#1 about 4 minutes

The Rubik's Cube metaphor for engineering teams

Engineering teams such as backend developers and SREs each work on a different side of the system, like the faces of a Rubik's Cube, so resolving incidents requires them to collaborate.

#2 about 3 minutes

The first phase of resolving incidents collaboratively

The initial step in incident response is to establish a common understanding and transparency across teams before applying quick fixes.

#3 about 2 minutes

Preventing future incidents with best practices

After resolving an incident, teams must collaborate on prevention by documenting best practices for patterns like service retries.
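As an illustration of the kind of pattern such a best-practice document might capture, here is a minimal sketch of a service retry with exponential backoff and jitter. It is not taken from the talk: the helper name `call_with_retries`, its parameters, and the choice of which exceptions count as transient are all assumptions made for the example.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # The delay cap doubles on every attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Bounding both the number of attempts and the delay keeps a struggling downstream service from being flooded by retries, which is usually the point a team wants to agree on and write down.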

#4 about 2 minutes

Discovering incidents through system observability

The discovery phase relies on making systems observable by collecting telemetry data like logs, metrics, and traces.

#5 about 2 minutes

Standardizing telemetry collection with OpenTelemetry

OpenTelemetry provides a vendor-neutral standard for instrumenting applications, preventing vendor lock-in for observability backends.

#6 about 2 minutes

Simplifying metrics with the Autometrics library

The open-source Autometrics library uses decorators to automatically generate key metrics like latency, errors, and request rate from functions.
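To make the decorator idea concrete, here is a rough standard-library sketch of how a decorator can derive call counts, error counts, and latency from a function. It is not the Autometrics implementation or its API: the decorator name `track_metrics`, the in-memory `METRICS` store, and the sample function `get_user` are hypothetical stand-ins for illustration only.

```python
import functools
import time
from collections import defaultdict

# In-memory store standing in for a real metrics backend (hypothetical).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def track_metrics(func):
    """Record call count, error count, and cumulative latency for func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = METRICS[func.__name__]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1  # count the failure, then let it propagate
            raise
        finally:
            # Runs on success and on failure, so every call is counted and timed.
            stats["calls"] += 1
            stats["total_seconds"] += time.perf_counter() - start
    return wrapper


@track_metrics
def get_user(user_id):
    """Example function; raises on invalid input to show error counting."""
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

The appeal of this shape is that the developer adds one line above a function and the request rate, error rate, and latency follow from it, without hand-written metric names at each call site.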

#7 about 5 minutes

Demo of generating metrics and SLOs from code

A live demo shows how Autometrics provides live metrics in the IDE and helps define SLOs that can be visualized in Grafana.
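For readers unfamiliar with SLOs, the arithmetic behind a success-rate objective can be sketched in a few lines. This is an illustrative assumption about how such an objective might be evaluated, not the code shown in the demo; the class `SuccessRateObjective` and its methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessRateObjective:
    """Hypothetical success-rate SLO, e.g. target=0.99 for 99% successful calls."""
    name: str
    target: float

    def success_rate(self, total_calls: int, failed_calls: int) -> float:
        if total_calls == 0:
            return 1.0  # no traffic, so nothing has violated the objective
        return (total_calls - failed_calls) / total_calls

    def is_met(self, total_calls: int, failed_calls: int) -> bool:
        return self.success_rate(total_calls, failed_calls) >= self.target

    def error_budget_remaining(self, total_calls: int, failed_calls: int) -> float:
        """Fraction of the allowed failures not yet used (negative if overspent)."""
        allowed_failures = (1 - self.target) * total_calls
        if allowed_failures == 0:
            return 1.0 if failed_calls == 0 else float("-inf")
        return 1 - failed_calls / allowed_failures


# Usage: a 99% objective over 1000 calls tolerates about 10 failures.
api_slo = SuccessRateObjective(name="api", target=0.99)
```

With 5 failures out of 1000 calls the objective is met and roughly half of the error budget is left; with 20 failures it is missed and the budget is overspent. A dashboard such as Grafana would typically plot these values over a rolling window.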

#8 about 1 minute

Summary of collaborative incident management phases

A recap of the three key phases for collaborative incident handling: resolving, preventing, and discovering issues together.

#9 about 2 minutes

Q&A on tooling and open source contribution

The speaker answers audience questions about managing tool complexity and the role of community contributions in open-source projects.
