Temporal.io – Durable Execution Mastery

A comprehensive deep dive into orchestrating distributed systems with Temporal

About This Book

This book is a complete introduction to Temporal.io, the leading platform for durable execution. You will learn how to build reliable, scalable, and maintainable distributed systems by writing complex workflows as plain code.

The book combines theoretical foundations with practical Python examples that you can run directly. Each chapter includes runnable code examples from the GitHub repository that demonstrate Temporal concepts.

Origin and Methodology

This book started as a personal learning project to understand and master Temporal.io in depth. The content was created in collaboration with generative AI (Claude by Anthropic), with me as the author:

  • Defining the learning goals, structure, and editorial direction
  • Actively working through and questioning every concept
  • Developing and testing the code examples
  • Ensuring technical correctness and practical applicability

The AI served as an interactive learning partner: it helped me structure complex Temporal concepts, explore different perspectives, and turn what I learned into clear explanations. This collaborative approach allowed me to dive deeper into the subject and develop a thorough understanding of durable execution.

The result is a book that documents my personal learning journey and is meant to help others learn Temporal.io systematically.

Prerequisites

  • Python 3.13+
  • uv package manager
  • Temporal CLI or Docker (for the code examples)
  • Basic knowledge of Python and distributed systems

What You Will Learn

Part I: Foundations of Durable Execution

Get to know Temporal's core concepts and understand why durable execution is the future of distributed systems.

Part II: Building Temporal Applications (SDK Focus)

Dive into hands-on development with the Temporal Python SDK.

Part III: Resilience, Evolution, and Patterns

Master advanced patterns for robust, evolvable systems.

Part IV: Operations, Scaling, and Best Practices

Take your Temporal applications to production.

Part V: The Temporal Cookbook

Practical recipes for common use cases.

Code Examples

All code examples from this book are available in the GitHub repository under examples/. Each chapter has its own runnable Python project:

# Run an example (e.g. Chapter 1)
cd examples/part-01/chapter-01
uv sync
uv run python simple_workflow.py

Resources

  • Temporal Documentation: https://docs.temporal.io/
  • Temporal Python SDK: https://docs.temporal.io/develop/python
  • Temporal Community: https://community.temporal.io/

Good luck learning Temporal!

Chapter 1: Introduction to Temporal

Learning objectives:

  • Understand what Temporal is and why it matters
  • Learn the core principles of durable execution
  • Trace the history of Temporal
  • Identify use cases for Temporal

1.1 The Problem with Distributed Systems

Imagine you are building an e-commerce system. A customer orders a product, and your system has to perform the following steps:

  1. Charge the payment through a payment provider (e.g. Stripe)
  2. Reduce stock in the inventory service
  3. Request shipping from the logistics partner
  4. Send a confirmation email

What happens if:

  • The payment provider responds after 30 seconds, but your request has already timed out?
  • The inventory service crashes after the payment went through?
  • The shipping provider is unreachable?
  • Your server restarts in the middle of the process?

With traditional approaches you have to:

  • Manually persist state in a database
  • Implement complex retry logic
  • Write compensating transactions for rollbacks
  • Manage idempotency keys
  • Coordinate worker processes and message queues

This leads to hundreds of lines of boilerplate code just to make sure your business process runs reliably.

Temporal solves these problems at a fundamental level.

1.2 What Is Temporal?

Definition

Temporal is an open-source platform (MIT license) for durable execution – durable, fault-tolerant code execution. It is a reliable runtime that guarantees your code runs to completion, no matter how many failures occur.

Temporal's core promise:

“Build applications that never lose state, even when everything else fails”

What Is Durable Execution?

Durable execution is crash-proof code execution with the following properties:

1. Virtualized execution

Your code runs across multiple processes, potentially on different machines. After a crash, the work transparently continues in a new process, and the application state is restored automatically.

sequenceDiagram
    participant Code as Your Workflow Code
    participant W1 as Worker 1
    participant TS as Temporal Service
    participant W2 as Worker 2

    Code->>W1: Step 1: Payment
    W1->>TS: Event: payment succeeded
    W1-xW1: ❌ Worker crashes
    TS->>W2: Recovery: replay events
    W2->>Code: State restored
    Code->>W2: Step 2: Inventory
    W2->>TS: Event: inventory updated

2. Automatic state persistence

State is captured and stored automatically at every step. After a failure, execution can resume exactly where it left off – with no loss of progress.

3. Time-independent operation

Applications can run indefinitely – from milliseconds to years – without time limits and without external schedulers.

4. Hardware-agnostic design

Reliability is built into the software rather than depending on expensive fault-tolerant hardware. It works in VMs, containers, and cloud environments.

Temporal vs. Traditional Approaches

The following table shows the fundamental difference:

| Aspect | Traditional state machine | Temporal durable execution |
|---|---|---|
| State management | Persist state manually in databases | Automatic via event sourcing |
| Error handling | Implement retries and timeouts by hand | Built-in, configurable retry policies |
| Recovery | Write complex checkpoint logic | Automatic recovery at the exact point of interruption |
| Debugging | Search for state across distributed logs | Complete event history in a single log |
| Code style | Define state transitions explicitly | Ordinary if/else and loops in your programming language |

1.3 History: From AWS SWF via Cadence to Temporal

The origins at Amazon (2002–2010)

Maxim Fateev worked at Amazon, where he led the architecture and development of:

  • AWS Simple Workflow Service (SWF) – one of the first workflow engines in the cloud
  • AWS Simple Queue Service (SQS) – the storage backend for one of the most widely used queue services in the world

These experiences highlighted the need for reliable orchestration of distributed systems.

Microsoft Azure Durable Functions

In parallel, Samar Abbas at Microsoft built the Durable Task Framework – an orchestration library for long-running, stateful workflows that became the foundation of Azure Durable Functions.

Cadence at Uber (2015)

timeline
    title From Cadence to Temporal
    2002-2010 : Maxim Fateev at Amazon
              : AWS SWF & SQS
    2015 : Cadence at Uber
         : Maxim Fateev + Samar Abbas
         : Open source from day one
    2019 : Temporal founded
         : May 2, 2019
         : Fork of Cadence
    2020 : Series A
         : 18.75 million USD
    2021 : Series B
         : 75 million USD
    2024 : Valuation > 1.5 billion USD
         : Thousands of customers worldwide

In 2015, Maxim Fateev and Samar Abbas joined forces at Uber and created Cadence – a transformative workflow engine that was fully open source from day one.

Production numbers at Uber:

  • 100+ distinct use cases
  • 50 million executions running at any given time
  • 3+ billion executions per month

Temporal Founded (2019)

On May 2, 2019, the original tech leads of Cadence – Maxim Fateev and Samar Abbas – founded Temporal Technologies and forked the Cadence project.

Why a fork?

Temporal was founded to:

  • Accelerate development
  • Improve cloud-native support
  • Create a better developer experience
  • Establish a sustainable business model

Funding and growth:

  • October 2020: Series A of 18.75 million USD
  • June 2021: Series B of 75 million USD
  • 2024: Series B extended to 103 million USD, company valuation above 1.5 billion USD

1.4 Core Concepts at a Glance

Temporal is built on three main components:

1. Workflows

A workflow defines a sequence of steps as code.

Properties:

  • Written in your preferred programming language (Go, Java, Python, TypeScript, .NET, PHP, Ruby)
  • Resilient: workflows can run for years, even across infrastructure failures
  • Resource-efficient: while waiting, they consume zero compute resources
  • Deterministic: given the same inputs, they must always take the same path (required for the replay mechanism)

2. Activities

An activity is a method or function that encapsulates failure-prone business logic.

Properties:

  • Performs a single, well-defined action (e.g. an API call, sending an email, processing a file)
  • Non-deterministic: may call external systems
  • Automatically retryable: the system can retry activities on failure
  • Timeout-protected: configurable timeouts prevent hung operations

3. Workers

A worker executes workflow and activity code.

Properties:

  • A process that bridges your application logic and the Temporal Server
  • Polls a task queue that assigns it tasks to execute
  • Reports results back to the Temporal Service
  • Can be scaled horizontally

graph TB
    subgraph "Your Application"
        WF[Workflow Definition]
        ACT[Activity Implementation]
        WORKER[Worker Process]

        WF -->|executed by| WORKER
        ACT -->|executed by| WORKER
    end

    subgraph "Temporal Platform"
        TS[Temporal Service]
        DB[(Event History Database)]

        TS -->|stores| DB
    end

    WORKER <-->|Task Queue| TS

    style WF fill:#e1f5ff
    style ACT fill:#ffe1f5
    style WORKER fill:#f5ffe1
    style TS fill:#ffd700
    style DB fill:#ddd

1.5 Main Use Cases

Temporal is used by thousands of companies for mission-critical applications. Here are real-world examples:

Financial Operations

  • Stripe: payment processing
  • Coinbase: every Coinbase transaction uses Temporal for money transfers
  • ANZ Bank: mortgage underwriting – long-running, stateful processes spanning weeks

E-Commerce and Logistics

  • Turo: booking system for car sharing
  • Maersk: logistics orchestration – tracking containers around the world
  • Box: content management

Infrastructure and DevOps

  • Netflix: custom CI/CD systems – “a fundamental shift in how applications can be built”
  • Datadog: infrastructure services – from one application to more than 100 users across dozens of teams within a year
  • Snap: every Snap Story uses Temporal

Communication

  • Twilio: every message on Twilio uses Temporal
  • Airbnb: marketing campaign orchestration

AI and Machine Learning

  • Lindy, Dust, ZoomInfo: AI agents with durable state and human-in-the-loop intervention
  • Descript & Neosync: data pipelines and GPU resource coordination

1.6 Why Does Temporal Matter?

Problem 1: Failure resilience

The traditional approach:

def process_order(order_id):
    try:
        payment = charge_credit_card(order_id)  # What if this times out?
        save_payment_to_db(payment)  # What if the server crashes here?
        inventory = update_inventory(order_id)  # What if the service is unreachable?
        save_inventory_to_db(inventory)  # What if the DB connection is lost?
        shipping = schedule_shipping(order_id)  # What if it still fails after 2 retries?
        send_confirmation_email(order_id)  # What if the email service is down?
    except Exception as e:
        # Manual rollback logic for every possible failure state?
        # Which steps actually succeeded?
        # How do we compensate actions that already happened?
        # How do we make sure we don't charge twice?
        pass

With Temporal:

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # Temporal guarantees that this code runs to completion
        payment = await workflow.execute_activity(
            charge_credit_card,
            order_id,
            retry_policy=RetryPolicy(maximum_attempts=5)
        )

        inventory = await workflow.execute_activity(
            update_inventory,
            order_id,
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        shipping = await workflow.execute_activity(
            schedule_shipping,
            order_id
        )

        await workflow.execute_activity(send_confirmation_email, order_id)

        # No manual state management
        # No manual retries
        # Automatic recovery after crashes

Problem 2: Long-running processes

Example: a loan application

A mortgage application can take weeks:

  1. Application submitted → waiting for documents
  2. Documents uploaded → waiting for manual review
  3. Review completed → waiting for the appraisal
  4. Appraisal received → final decision

With traditional approaches:

  • Cron jobs that poll the status in the database
  • Complex state machines
  • Prone to race conditions
  • Hard to debug

With Temporal:

@workflow.defn
class MortgageApplicationWorkflow:
    @workflow.run
    async def run(self, application_id: str):
        # Wait for documents (may take days); the flags are set by signals
        await workflow.wait_condition(lambda: self.documents_uploaded)

        # Wait for the manual review
        await workflow.wait_condition(lambda: self.review_completed)

        # Wait for the appraisal
        await workflow.wait_condition(lambda: self.appraisal_received)

        # Final decision
        decision = await workflow.execute_activity(
            make_decision,
            args=[application_id, self.documents, self.review_result, self.appraisal],
            start_to_close_timeout=timedelta(minutes=5),
        )

        return decision

The workflow can run for weeks or months without consuming any resources while it waits.
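
How do flags like documents_uploaded get set? Typically via signal handlers on the same workflow class. The sketch below is a minimal, hypothetical completion of the example above (the attribute and signal names are assumptions, not part of the original code):

@workflow.defn
class MortgageApplicationWorkflow:
    def __init__(self) -> None:
        # Flags and data the run method waits on (assumed names)
        self.documents_uploaded = False
        self.documents: list[str] = []
        self.review_completed = False
        self.review_result: str = ""
        self.appraisal_received = False
        self.appraisal: str = ""

    @workflow.signal
    async def submit_documents(self, documents: list[str]) -> None:
        self.documents = documents
        self.documents_uploaded = True

    @workflow.signal
    async def complete_review(self, result: str) -> None:
        self.review_result = result
        self.review_completed = True

    @workflow.signal
    async def submit_appraisal(self, appraisal: str) -> None:
        self.appraisal = appraisal
        self.appraisal_received = True

A client delivers these signals whenever the external event occurs, e.g. handle.signal(MortgageApplicationWorkflow.submit_documents, ["passport.pdf"]).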

Problem 3: Observability

graph LR
    subgraph "Without Temporal"
        A[Logs in Service A]
        B[Logs in Service B]
        C[DB State]
        D[Queue Messages]
        E[Developer hunting for the bug]

        E -.->|searches| A
        E -.->|searches| B
        E -.->|checks| C
        E -.->|checks| D
    end

    subgraph "With Temporal"
        F[Temporal Web UI]
        G[Event History]
        H[Developer sees the complete history]

        H -->|one click| F
        F -->|shows| G
    end

    style E fill:#ffcccc
    style H fill:#ccffcc

With Temporal you get:

  • The complete event history of every workflow execution
  • Time-travel debugging: see exactly what happened at any point in time
  • Web UI: visualize all running and completed workflows (or inspect them from the CLI, as shown below)
  • Stack traces: see where a workflow is currently “stuck”
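
For a quick look without the Web UI, the Temporal CLI can list executions and dump their event history (the exact flag names may vary slightly between CLI versions):

# List recent workflow executions in the current namespace
temporal workflow list

# Show the full event history of one execution
temporal workflow show --workflow-id order-123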

1.7 A Fundamental Paradigm Shift

Temporal lifts application development to a new level by removing the burden of failure handling – much like high-level programming languages abstracted away the complexity of programming the machine directly.

Analogy: from assembler to Python

| Assembler (1950s) | Python (today) |
|---|---|
| Manual memory management | Garbage collection |
| Managing registers by hand | Simply declaring variables |
| Goto statements | Structured programming |
| Hundreds of lines for simple tasks | A few expressive lines of code |

| Without Temporal | With Temporal |
|---|---|
| Manually persisting state in a DB | Automatic state management |
| Retry logic everywhere | Declarative retry policies |
| Hand-rolled timeout handling | Automatic timeouts |
| Debugging across many services | A central event history |
| Defensive programming | Focus on business logic |

Temporal makes distributed systems as dependable as gravity.

1.8 Summary

In this chapter you learned:

What Temporal is: a platform for durable execution that guarantees your code runs to completion, regardless of failures

The history: from AWS SWF via Cadence at Uber to Temporal as the leading open-source solution with a billion-dollar valuation

Core concepts: workflows (orchestration), activities (actions), workers (execution)

Use cases: from payment processing at Stripe and Coinbase, to logistics at Maersk, to CI/CD at Netflix

Why it matters: Temporal solves fundamental problems of distributed systems – failure resilience, long-running processes, observability

In the next chapter we will dive deeper into the core building blocks and understand how workflows, activities, and workers work in detail.

Practical Example

In the directory ../examples/part-01/chapter-01/ you will find a runnable example of a simple Temporal workflow:

cd ../examples/part-01/chapter-01
uv sync
uv run python simple_workflow.py

This example demonstrates the following (a condensed sketch follows after the list):

  • How a workflow is defined
  • How to connect to the Temporal Server
  • How a workflow is started and executed
  • How the result is retrieved
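
A minimal version of such a script might look like the sketch below (the workflow/activity names and the "book-examples" task queue are illustrative assumptions; the actual file in the repository may differ):

from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def say_hello(name: str) -> str:
    return f"Hello, {name}!"


@workflow.defn
class HelloWorkflow:
    @workflow.run
    async def run(self, name: str) -> str:
        # Delegate the actual work to an activity
        return await workflow.execute_activity(
            say_hello,
            name,
            start_to_close_timeout=timedelta(seconds=10),
        )


async def main() -> None:
    # 1. Connect to a local Temporal Server (default gRPC port 7233)
    client = await Client.connect("localhost:7233")

    # 2. Run a worker in the background so the workflow can execute
    async with Worker(
        client,
        task_queue="book-examples",
        workflows=[HelloWorkflow],
        activities=[say_hello],
    ):
        # 3. Start the workflow and wait for the result
        result = await client.execute_workflow(
            HelloWorkflow.run,
            "Temporal",
            id="hello-workflow-1",
            task_queue="book-examples",
        )
        print(f"Workflow result: {result}")


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())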

Further Resources

  • 📚 Official documentation: https://docs.temporal.io/
  • 🎥 Temporal YouTube channel: tutorials and talks
  • 💬 Community Slack: https://temporal.io/slack
  • 🐙 GitHub: https://github.com/temporalio/temporal
  • 📰 Temporal blog: https://temporal.io/blog – case studies and best practices

Back to the table of contents | Next chapter: Core Building Blocks →

Chapter 2: Core Building Blocks – Workflows, Activities, Workers

After the introduction to Temporal in the previous chapter, we now dive deep into the three fundamental building blocks at the heart of every Temporal application: workflows, activities, and workers. Understanding these components and how they interact is essential for developing with Temporal successfully.

2.1 Overview: The Three Pillars of Temporal

Temporal is built on a clear separation of concerns, divided into three main components:

graph TB
    subgraph "Temporal Application"
        W[Workflows<br/>Orchestration]
        A[Activities<br/>Execution]
        WK[Workers<br/>Hosting]
    end

    subgraph "Characteristics"
        W --> W1[Deterministic]
        W --> W2[Coordinate]
        W --> W3[Event Sourcing]

        A --> A1[Non-deterministic]
        A --> A2[Execute]
        A --> A3[Side Effects]

        WK --> WK1[Stateless]
        WK --> WK2[Polling]
        WK --> WK3[Scalable]
    end

    style W fill:#e1f5ff
    style A fill:#ffe1e1
    style WK fill:#e1ffe1

The roles in detail:

  • Workflows: the conductors of the orchestra – they define what should happen and in what order, but they do not execute business logic themselves.

  • Activities: the musicians – they do the actual work, from database access and API calls to complex computations.

  • Workers: the concert hall – they provide the infrastructure in which workflows and activities run, and they communicate with the Temporal Service.

2.2 Workflows: The Orchestration Logic

2.2.1 What Is a Workflow?

A workflow in Temporal is a function or method that defines the orchestration logic of your application. Unlike many other workflow engines, a Temporal workflow is written in a real programming language – not in YAML, XML, or a DSL.

Fundamental properties:

  1. Deterministic: the same inputs always produce the same outputs and commands
  2. Long-lived: can run for days, months, or years
  3. Fault-tolerant: survives infrastructure failures and code deployments
  4. Versionable: supports code changes while workflows are running

A simple example from the code:

from temporalio import workflow
from temporalio.common import RetryPolicy
from datetime import timedelta

@workflow.defn
class DataProcessingWorkflow:
    """
    A workflow orchestrates activities - it does not execute them itself.
    """

    @workflow.run
    async def run(self, data: str) -> dict:
        # Call an activity - delegate the actual work
        processed = await workflow.execute_activity(
            process_data,
            data,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                maximum_attempts=3,
                initial_interval=timedelta(seconds=1),
            ),
        )

        # Orchestrate further steps
        await workflow.execute_activity(
            send_notification,
            f"Processed: {processed}",
            start_to_close_timeout=timedelta(seconds=10),
        )

        return {"input": data, "output": processed, "status": "completed"}

📁 Code example: ../examples/part-01/chapter-02/workflow.py

2.2.2 The Determinism Constraint

Why determinism?

Temporal uses a replay mechanism to reconstruct workflow state. Imagine a worker process crashes while a workflow is running. When the workflow later resumes on another worker, Temporal has to restore the exact state. It does this by:

  1. Loading the event history (all events so far)
  2. Replaying the workflow code against that history
  3. Comparing the generated commands with the history
  4. If they match: the workflow can continue

sequenceDiagram
    participant WC as Workflow Code
    participant Worker
    participant History as Event History
    participant Service as Temporal Service

    Note over Worker: Worker restarts after a crash

    Worker->>Service: Poll for workflow task
    Service->>Worker: Workflow task + event history

    Worker->>History: Load all events
    History-->>Worker: [Start, Activity1Scheduled, Activity1Complete, ...]

    Worker->>WC: Replay code from the beginning
    WC->>WC: Execute code
    WC-->>Worker: Commands [ScheduleActivity1, ...]

    Worker->>Worker: Validate commands against history

    alt Commands match
        Worker->>Service: Workflow task complete
        Note over Worker: State successfully reconstructed
    else Commands diverge
        Worker->>Service: Non-deterministic error
        Note over Worker: The code has changed!
    end

What is forbidden in workflows?

# ❌ WRONG: non-deterministic
@workflow.defn
class BadWorkflow:
    @workflow.run
    async def run(self):
        # ❌ Random numbers
        random_value = random.random()

        # ❌ Current time
        now = datetime.now()

        # ❌ Direct I/O operations
        with open('file.txt') as f:
            data = f.read()

        # ❌ Direct API calls
        response = requests.get('https://api.example.com')

        return random_value

# ✅ CORRECT: deterministic
@workflow.defn
class GoodWorkflow:
    @workflow.run
    async def run(self):
        # ✅ Temporal's time API
        now = workflow.now()

        # ✅ Durable timer (asyncio.sleep is deterministic inside workflows)
        await asyncio.sleep(timedelta(hours=1).total_seconds())

        # ✅ Move I/O into activities
        data = await workflow.execute_activity(
            read_file,
            'file.txt',
            start_to_close_timeout=timedelta(seconds=10)
        )

        # ✅ API calls in activities
        response = await workflow.execute_activity(
            call_external_api,
            'https://api.example.com',
            start_to_close_timeout=timedelta(seconds=30)
        )

        return {"data": data, "response": response}

The golden rule: workflows orchestrate, activities execute.

2.2.3 Long-Running Workflows and Continue-As-New

In theory, Temporal workflows can run indefinitely. In practice, however, there are technical limits:

Event history limits:

  • Maximum of 50,000 events (technically 51,200)
  • Maximum history size of 50 MB
  • Performance starts to degrade at around 10,000 events

The Continue-As-New pattern:

For long-running workflows you use the Continue-As-New pattern:

@workflow.defn
class LongRunningWorkflow:
    @workflow.run
    async def run(self, iteration: int = 0) -> str:
        # Process a batch of work
        for i in range(100):
            await workflow.execute_activity(
                process_item,
                f"item-{iteration}-{i}",
                start_to_close_timeout=timedelta(minutes=1)
            )

        # After 100 items: Continue-As-New starts a new
        # workflow run with a fresh event history
        workflow.continue_as_new(iteration + 1)

How Continue-As-New works:

timeline
    title Workflow lifecycle with Continue-As-New
    section Run 1
        Start Workflow : Event history [0-100 events]
        Process Items : 100 activities
        Decision Point : Continue-As-New?
    section Run 2
        New Run ID : New event history [0 events]
        Transfer State : iteration = 1
        Process Items : 100 activities
        Decision Point : Continue-As-New?
    section Run 3
        New Run ID : New event history [0 events]
        Transfer State : iteration = 2
        Continue... : Can go on indefinitely

When should you use Continue-As-New?

  • At regular checkpoints (daily, weekly)
  • When the event history approaches 10,000 events (see the sketch below for letting the server suggest this)
  • With frequent code deployments (avoids versioning problems)
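
Instead of counting iterations yourself, newer Python SDK versions can also tell you when the server considers the history large enough. A hedged sketch (is_continue_as_new_suggested may not be available in older SDK releases):

@workflow.defn
class LongRunningWorkflow:
    @workflow.run
    async def run(self, iteration: int = 0) -> str:
        while True:
            await workflow.execute_activity(
                process_item,
                f"item-{iteration}",
                start_to_close_timeout=timedelta(minutes=1),
            )
            iteration += 1

            # Server-side hint that the event history is getting large
            if workflow.info().is_continue_as_new_suggested():
                workflow.continue_as_new(iteration)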

2.2.4 Workflow Versioning

Code changes. Workflows run for a long time. What happens when running workflows meet a new code version?

The problem:
# Version 1 (deployed, workflows are running)
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        await workflow.execute_activity(process_payment, ...)
        return "done"

# Version 2 (new deployment)
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # NEW: validation added
        await workflow.execute_activity(validate_order, ...)
        await workflow.execute_activity(process_payment, ...)
        return "done"

When an old workflow is replayed, the new code would schedule an additional activity – a non-deterministic error!

The solution: the Patching API

from temporalio import workflow

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # Patching: support old AND new workflows
        if workflow.patched("add-validation"):
            # New code
            await workflow.execute_activity(validate_order, ...)

        # Old code (runs in both versions)
        await workflow.execute_activity(process_payment, ...)
        return "done"

The patching workflow:

  1. Add the patch, keeping both the new and the old code path
  2. Wait until all old workflow executions have completed
  3. Switch to deprecate_patch() (see the sketch below)
  4. Remove the patch code in the next deployment
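
Step 3 then looks roughly like this (a minimal sketch reusing the "add-validation" patch ID from the example above):

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # All old executions have finished; keep recording the marker so
        # histories written by the patched version still replay cleanly
        workflow.deprecate_patch("add-validation")

        await workflow.execute_activity(validate_order, ...)
        await workflow.execute_activity(process_payment, ...)
        return "done"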

2.2.5 Workflow Timeouts

Temporal offers several workflow-level timeout types:

graph LR
    subgraph "Workflow Execution Timeouts"
        WET[Workflow Execution Timeout<br/>Entire execution incl. retries]
        WRT[Workflow Run Timeout<br/>A single run]
        WTT[Workflow Task Timeout<br/>Worker task processing]
    end

    WET --> WRT
    WRT --> WTT

    style WET fill:#ffcccc
    style WRT fill:#ffffcc
    style WTT fill:#ccffcc

Recommendation: workflow timeouts are generally not recommended. Workflows are designed for long-running, resilient execution. Set them only in exceptional cases.

2.3 Activities: The Business Logic

2.3.1 What Are Activities?

Activities are ordinary functions that perform single, well-defined actions. Unlike workflows, activities may:

  • ✅ Perform I/O operations
  • ✅ Call external APIs
  • ✅ Read from and write to databases
  • ✅ Generate random numbers
  • ✅ Use the current system time
  • ✅ Have side effects

Activities are where the actual business logic lives.

Example from the code:

from temporalio import activity

@activity.defn
async def process_data(data: str) -> str:
    """
    Activity for data processing.
    May perform non-deterministic operations.
    """
    activity.logger.info(f"Processing data: {data}")

    # Simulates external API calls, DB access, etc.
    result = data.upper()

    activity.logger.info(f"Data processed: {result}")
    return result

@activity.defn
async def send_notification(message: str) -> None:
    """
    Activity for side effects (email, webhook, etc.)
    """
    activity.logger.info(f"Sending notification: {message}")

    # In practice: a real API call
    # await email_service.send(message)
    # await webhook.post(message)

    print(f"📧 Notification: {message}")

📁 Code example: ../examples/part-01/chapter-02/activities.py

2.3.2 Activity Timeouts

Activities have four different timeout types:

graph TB
    subgraph "Activity Lifecycle"
        Scheduled[Activity Scheduled<br/>in Queue]
        Started[Activity Started<br/>by Worker]
        Running[Activity Executing]
        Complete[Activity Complete]

        Scheduled -->|Schedule-To-Start| Started
        Started -->|Start-To-Close| Complete
        Running -->|Heartbeat| Running
        Scheduled -->|Schedule-To-Close| Complete
    end

    style Scheduled fill:#e1f5ff
    style Started fill:#fff4e1
    style Running fill:#ffe1e1
    style Complete fill:#e1ffe1

1. Start-To-Close timeout (the most important one):

await workflow.execute_activity(
    process_data,
    data,
    start_to_close_timeout=timedelta(minutes=5),  # Max. 5 min per attempt
)

2. Schedule-To-Close timeout (including retries):

await workflow.execute_activity(
    process_data,
    data,
    schedule_to_close_timeout=timedelta(minutes=30),  # Max. 30 Min total
)

3. Schedule-To-Start timeout (rarely needed):

await workflow.execute_activity(
    process_data,
    data,
    schedule_to_start_timeout=timedelta(minutes=10),  # Max. 10 Min in Queue
)

4. Heartbeat timeout (for long-running activities):

await workflow.execute_activity(
    long_running_task,
    params,
    heartbeat_timeout=timedelta(seconds=30),  # Heartbeat every 30s
)

At least one timeout is required: Start-To-Close OR Schedule-To-Close.
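
In practice you often combine a per-attempt limit with an overall budget, for example (the values are illustrative):

await workflow.execute_activity(
    process_data,
    data,
    start_to_close_timeout=timedelta(minutes=5),       # limit per attempt
    schedule_to_close_timeout=timedelta(minutes=30),   # total limit incl. retries
)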

2.3.3 Retry Policies and Error Handling

The default retry policy (if not configured otherwise):

RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=100),
    maximum_attempts=0,  # 0 = unlimited
)

Computing the retry wait time:

retry_wait = min(
    initial_interval × (backoff_coefficient ^ retry_count),
    maximum_interval
)

Example: with initial_interval=1s and backoff_coefficient=2 (reproduced in the snippet below):

  • Retry 1: after 1 second
  • Retry 2: after 2 seconds
  • Retry 3: after 4 seconds
  • Retry 4: after 8 seconds
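
As a quick sanity check, this schedule can be reproduced with a few lines of plain Python (illustrative arithmetic only, not a Temporal API):

initial_interval = 1.0       # seconds
backoff_coefficient = 2.0
maximum_interval = 100.0     # seconds

for retry in range(1, 7):
    wait = min(initial_interval * backoff_coefficient ** (retry - 1), maximum_interval)
    print(f"Retry {retry}: after {wait:g} seconds")
# Prints 1, 2, 4, 8, 16, 32 seconds for retries 1-6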

A custom retry policy:

from temporalio.common import RetryPolicy

@workflow.defn
class RobustWorkflow:
    @workflow.run
    async def run(self):
        result = await workflow.execute_activity(
            flaky_external_api,
            params,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(minutes=1),
                backoff_coefficient=2,
                maximum_attempts=5,
                # Do NOT retry these errors
                non_retryable_error_types=["InvalidInputError", "AuthError"],
            ),
        )
        return result

Non-retryable errors:

from temporalio.exceptions import ApplicationError

@activity.defn
async def validate_input(data: str) -> str:
    if not data:
        # This error will NOT be retried
        raise ApplicationError(
            "Input cannot be empty",
            non_retryable=True
        )
    return data

2.3.4 Heartbeats for Long-Running Activities

For activities that run for a long time (several minutes or more), heartbeats offer two benefits:

  1. Faster failure detection: the service notices worker crashes quickly
  2. Progress tracking: after a restart, the activity can resume from its last checkpoint

from temporalio import activity

@activity.defn
async def process_large_file(file_path: str, total_items: int) -> str:
    """
    Processes a large file with progress tracking.
    """
    start_index = 0

    # Recover progress from a previous attempt (heartbeat details survive retries)
    details = activity.info().heartbeat_details
    if details:
        start_index = details[0]
        activity.logger.info(f"Resuming from index {start_index}")

    for i in range(start_index, total_items):
        # Process the item
        await process_item(i)

        # Heartbeat with progress
        activity.heartbeat(i)

    return f"Processed {total_items} items"

On the workflow side:

result = await workflow.execute_activity(
    process_large_file,
    args=["big_file.csv", 10000],
    start_to_close_timeout=timedelta(hours=2),
    heartbeat_timeout=timedelta(seconds=30),  # Expect a heartbeat every 30s
)

When should you use heartbeats?

  • ✅ Large file downloads or processing
  • ✅ ML model training
  • ✅ Batch processing with many items
  • ❌ Quick API calls (< 1 minute)

2.3.5 Idempotency – The Most Important Best Practice

Activities should ALWAYS be idempotent: executing them multiple times must produce the same result.

Why?

  • Temporal guarantees at-least-once execution for activities
  • After a network error it may be unclear whether the activity succeeded
  • When in doubt, Temporal retries the activity

Example: a money transfer (not idempotent):

# ❌ DANGEROUS: not idempotent
@activity.defn
async def transfer_money(from_account: str, to_account: str, amount: float):
    # What happens on a retry?
    # → The money is transferred multiple times!
    await bank_api.transfer(from_account, to_account, amount)

The solution: idempotency keys

# ✅ SAFE: idempotent
@activity.defn
async def transfer_money(
    from_account: str,
    to_account: str,
    amount: float,
    idempotency_key: str
):
    # Check whether this transfer was already processed
    if await bank_api.is_processed(idempotency_key):
        return await bank_api.get_result(idempotency_key)

    # Perform the transfer
    result = await bank_api.transfer(
        from_account,
        to_account,
        amount,
        idempotency_key=idempotency_key
    )

    return result

Generating the idempotency key in the workflow:

@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, order_id: str, amount: float):
        # Generate a deterministic idempotency key
        idempotency_key = f"payment-{order_id}-{workflow.info().run_id}"

        await workflow.execute_activity(
            transfer_money,
            args=[
                "account-A",
                "account-B",
                amount,
                idempotency_key
            ],
            start_to_close_timeout=timedelta(minutes=5),
        )

2.3.6 Local Activities – The Special Case

Local activities run in the same process as the workflow, without a separate task queue:

result = await workflow.execute_local_activity(
    quick_calculation,
    params,
    start_to_close_timeout=timedelta(seconds=5),
)

When should you use them?

  • ✅ Very short activities (< 1 second)
  • ✅ High throughput requirements (1000+ activities/second)
  • ✅ Simple computations without external dependencies

Limitations:

  • ❌ No heartbeats
  • ❌ On retry, the entire activity is repeated (no checkpointing)
  • ❌ Higher risk for non-idempotent operations

Recommendation: use regular activities as the default. Reserve local activities for very specific performance optimizations.

2.4 Workers: The Runtime Environment

2.4.1 Worker Architecture

Workers are standalone processes that run outside the Temporal Service and:

  1. Poll task queues (long-polling RPC)
  2. Execute workflow and activity code
  3. Send results back to the Temporal Service

graph TB
    subgraph "Worker Process"
        WW[Workflow Worker<br/>Executes workflow code]
        AW[Activity Worker<br/>Executes activity code]
        Poller1[Workflow Task Poller]
        Poller2[Activity Task Poller]
    end

    subgraph "Temporal Service"
        WQ[Workflow Task Queue]
        AQ[Activity Task Queue]
    end

    Poller1 -.->|Long Poll| WQ
    WQ -.->|Task| Poller1
    Poller1 --> WW

    Poller2 -.->|Long Poll| AQ
    AQ -.->|Task| Poller2
    Poller2 --> AW

    style WW fill:#e1f5ff
    style AW fill:#ffe1e1

Worker setup – example from the code:

from temporalio.worker import Worker
from shared.temporal_helpers import create_temporal_client

async def main():
    # 1. Connect to Temporal
    client = await create_temporal_client()

    # 2. Create the worker
    worker = Worker(
        client,
        task_queue="book-examples",
        workflows=[DataProcessingWorkflow],  # Register workflows
        activities=[process_data, send_notification],  # Register activities
    )

    # 3. Start the worker (blocks until Ctrl+C)
    await worker.run()

📁 Code example: ../examples/part-01/chapter-02/worker.py

2.4.2 Task Queues and Polling

Task queue properties:

  • Lightweight: created dynamically, no explicit registration
  • On demand: created when the first workflow or activity targets it
  • Persistent: tasks survive worker outages
  • Load balancing: tasks are distributed automatically across all workers

The long-polling mechanism:

sequenceDiagram
    participant Worker
    participant Service as Temporal Service
    participant Queue as Task Queue

    loop Continuous polling
        Worker->>Service: Poll for tasks (RPC)

        alt Task available
            Queue->>Service: Task
            Service-->>Worker: Task
            Worker->>Worker: Execute task
            Worker->>Service: Complete task
        else No tasks
            Note over Service: Connection stays open
            Note over Service: Waits for a task or the timeout
            Service-->>Worker: No tasks (after timeout)
        end
    end

Pull-based, not push-based:

  • Workers fetch tasks only when they have capacity
  • Prevents overload
  • Provides automatic backpressure handling

2.4.3 Task Queue Routing and Partitioning

Routing strategies:

# 1. Default: one task queue for everything
worker = Worker(
    client,
    task_queue="default",
    workflows=[WorkflowA, WorkflowB],
    activities=[activity1, activity2, activity3],
)

# 2. Separation by function
workflow_worker = Worker(
    client,
    task_queue="workflows",
    workflows=[WorkflowA, WorkflowB],
)

activity_worker = Worker(
    client,
    task_queue="activities",
    activities=[activity1, activity2, activity3],
)

# 3. Isolating critical activities (bulkheading)
critical_worker = Worker(
    client,
    task_queue="critical-activities",
    activities=[payment_activity],
)

background_worker = Worker(
    client,
    task_queue="background-activities",
    activities=[send_email, generate_report],
)

Why isolation?

  • Prevents slow activities from blocking critical ones
  • Better resource allocation
  • Allows dedicated scaling

Task queue partitioning:

# Default: 4 partitions
# → higher throughput, no FIFO guarantee

# Use a single partition for a FIFO guarantee
# (via the Temporal Server configuration)

2.4.4 Sticky Execution – A Performance Optimization

The problem: for every workflow task, the worker would have to load the complete event history and replay the workflow.

The solution: sticky execution

sequenceDiagram
    participant W1 as Worker 1
    participant W2 as Worker 2
    participant Service
    participant NQ as Normal Queue
    participant SQ as Sticky Queue (Worker 1)

    W1->>Service: Poll normal queue
    Service-->>W1: Workflow task (WF-123)
    W1->>W1: Execute + cache state
    W1->>Service: Complete

    Service->>SQ: Next task for WF-123 → sticky queue

    W1->>Service: Poll sticky queue
    Service-->>W1: Workflow task (WF-123)
    Note over W1: State is cached!<br/>No history reload
    W1->>W1: Execute (very fast)
    W1->>Service: Complete

    Note over Service: Timeout (5s default)
    Service->>NQ: Task goes back to the normal queue

    W2->>Service: Poll normal queue
    Service-->>W2: Workflow task (WF-123)
    Note over W2: No cache<br/>History reload + replay

Benefits:

  • 10–100x faster task processing
  • Reduced load on the History Service
  • Lower latency

Sticky execution is enabled automatically – no configuration required.
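
If you do want to influence this behavior, the Python SDK exposes a couple of worker options (a sketch; parameter names follow the temporalio SDK, and the defaults may differ between versions):

worker = Worker(
    client,
    task_queue="book-examples",
    workflows=[DataProcessingWorkflow],
    activities=[process_data, send_notification],
    # How many workflow executions keep their cached state on this worker
    max_cached_workflows=1000,
    # How long a task may wait on the sticky queue before falling back
    # to the normal queue (the 5s default mentioned above)
    sticky_queue_schedule_to_start_timeout=timedelta(seconds=5),
)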

2.4.5 Worker Scaling and Deployment

Horizontal scaling:

Workers are stateless – workflow state lives in the Temporal Service, not in the worker.

# The same code runs on every worker
# and can be scaled out arbitrarily

# Worker 1 (Server A)
worker1 = Worker(client, task_queue="production", ...)
await worker1.run()

# Worker 2 (Server B)
worker2 = Worker(client, task_queue="production", ...)
await worker2.run()

# Worker 3 (Server C)
worker3 = Worker(client, task_queue="production", ...)
await worker3.run()

Deployment patterns:

  1. Dedicated worker processes (recommended for production):

# Separate processes dedicated to Temporal
python worker.py

  2. Combined worker + application:

# In the same process as the web server
# Only for development / small apps

async def start_services():
    # Start the web server
    web_server = await start_web_server()

    # Start the worker (in the background)
    worker = Worker(...)
    asyncio.create_task(worker.run())

  3. Worker fleets (high availability):

Kubernetes deployment:
- 10+ worker pods
- Auto-scaling based on task queue length
- Rolling updates without downtime

Scaling strategies:

| Scenario | Solution |
|---|---|
| Higher workflow throughput | More worker processes |
| Long-running activities | More activity task slots per worker |
| CPU-intensive activities | Fewer slots, more CPU per worker |
| I/O-bound activities | More slots, less CPU per worker |
| Isolating critical activities | Separate task queue + dedicated workers |

2.4.6 Worker Tuning and Configuration

Task slots – concurrency control:

worker = Worker(
    client,
    task_queue="production",
    workflows=[...],
    activities=[...],
    max_concurrent_workflow_tasks=100,  # Max. parallel workflow tasks
    max_concurrent_activities=50,       # Max. parallel activities
    max_concurrent_local_activities=100, # Max. parallel local activities
)

Resource-based auto-tuning (recommended):

from temporalio.worker import ResourceBasedSlotConfig, Worker, WorkerTuner

# Note: exact helper names may vary slightly between temporalio SDK versions
tuner = WorkerTuner.create_resource_based(
    # Targets apply to the whole worker process
    target_cpu_usage=0.8,       # aim for 80% CPU
    target_memory_usage=0.8,    # aim for 80% memory
    # Per-slot-type bounds
    workflow_config=ResourceBasedSlotConfig(minimum_slots=5, maximum_slots=100),
    activity_config=ResourceBasedSlotConfig(minimum_slots=1, maximum_slots=50),
)

worker = Worker(
    client,
    task_queue="production",
    workflows=[...],
    activities=[...],
    tuner=tuner,
)

Benefits:

  • Prevents out-of-memory errors
  • Optimizes throughput automatically
  • Adapts to the workload

2.5 How It All Fits Together: A Complete Example

Let's look at a complete example: data processing with a notification.

2.5.1 The Complete Flow

sequenceDiagram
    participant Client
    participant Service as Temporal Service
    participant WQ as Workflow Task Queue
    participant AQ as Activity Task Queue
    participant Worker

    Client->>Service: Start Workflow "DataProcessing"
    Service->>Service: Create Event History
    Service->>Service: Write WorkflowExecutionStarted
    Service->>WQ: Create Workflow Task

    Worker->>WQ: Poll
    WQ-->>Worker: Workflow Task

    Worker->>Worker: Execute Workflow Code
    Note over Worker: Code: execute_activity(process_data)
    Worker->>Service: Commands [ScheduleActivity(process_data)]
    Service->>Service: Write ActivityTaskScheduled
    Service->>AQ: Create Activity Task

    Worker->>AQ: Poll
    AQ-->>Worker: Activity Task (process_data)
    Worker->>Worker: Execute Activity Function
    Note over Worker: Actual data processing
    Worker->>Service: Activity Result
    Service->>Service: Write ActivityTaskCompleted
    Service->>WQ: Create new Workflow Task

    Worker->>WQ: Poll
    WQ-->>Worker: Workflow Task
    Worker->>Worker: Replay + Continue
    Note over Worker: Code: execute_activity(send_notification)
    Worker->>Service: Commands [ScheduleActivity(send_notification)]
    Service->>AQ: Create Activity Task

    Worker->>AQ: Poll
    AQ-->>Worker: Activity Task (send_notification)
    Worker->>Worker: Execute send_notification
    Worker->>Service: Activity Result
    Service->>Service: Write ActivityTaskCompleted
    Service->>WQ: Create Workflow Task

    Worker->>WQ: Poll
    WQ-->>Worker: Workflow Task
    Worker->>Worker: Replay + Complete
    Worker->>Service: Commands [CompleteWorkflow]
    Service->>Service: Write WorkflowExecutionCompleted

    Client->>Service: Get Result
    Service-->>Client: {"status": "completed", ...}

2.5.2 The Event History Timeline

The event history for this flow:

1.  WorkflowExecutionStarted
    - WorkflowType: DataProcessingWorkflow
    - Input: "Sample Data"

2.  WorkflowTaskScheduled

3.  WorkflowTaskStarted

4.  WorkflowTaskCompleted
    - Commands: [ScheduleActivityTask(process_data)]

5.  ActivityTaskScheduled
    - ActivityType: process_data

6.  ActivityTaskStarted

7.  ActivityTaskCompleted
    - Result: "SAMPLE DATA"

8.  WorkflowTaskScheduled

9.  WorkflowTaskStarted

10. WorkflowTaskCompleted
    - Commands: [ScheduleActivityTask(send_notification)]

11. ActivityTaskScheduled
    - ActivityType: send_notification

12. ActivityTaskStarted

13. ActivityTaskCompleted

14. WorkflowTaskScheduled

15. WorkflowTaskStarted

16. WorkflowTaskCompleted
    - Commands: [CompleteWorkflowExecution]

17. WorkflowExecutionCompleted
    - Result: {"input": "Sample Data", "output": "SAMPLE DATA", "status": "completed"}

2.5.3 Running the Code Example

Prerequisites:

# 1. Start the Temporal Server (Docker)
docker compose up -d

# 2. Install dependencies
cd ../examples/part-01/chapter-02
uv sync

Terminal 1 – start the worker:

uv run python worker.py

Output:

INFO - Starting Temporal Worker...
INFO - Worker registered workflows and activities:
INFO -   - Workflows: ['DataProcessingWorkflow']
INFO -   - Activities: ['process_data', 'send_notification']
INFO - Worker is running and polling for tasks...
INFO - Press Ctrl+C to stop

Terminal 2 – start the workflow:

uv run python workflow.py

Output:

INFO - Processing data: Sample Data
INFO - Data processed successfully: SAMPLE DATA
INFO - Sending notification: Processed: SAMPLE DATA
📧 Notification: Processed: SAMPLE DATA
INFO - Notification sent successfully

✅ Workflow Result: {'input': 'Sample Data', 'output': 'SAMPLE DATA', 'status': 'completed'}

2.6 Best Practices

2.6.1 Workflow Best Practices

  1. Orchestrate, don't implement

    # ❌ Bad: business logic in the workflow
    @workflow.defn
    class BadWorkflow:
        @workflow.run
        async def run(self, data: list):
            result = []
            for item in data:
                # Complex business logic
                processed = item.strip().upper().replace("_", "-")
                result.append(processed)
            return result
    
    # ✅ Good: logic in an activity
    @workflow.defn
    class GoodWorkflow:
        @workflow.run
        async def run(self, data: list):
            return await workflow.execute_activity(
                process_items,
                data,
                start_to_close_timeout=timedelta(minutes=5)
            )
    
  2. Keep workflow functions short

    • Split long workflows into smaller child workflows
    • Improves maintainability and testability
  3. Use Continue-As-New for long runtimes

    • At the latest around 10,000 events
    • Or: at regular checkpoints (daily/weekly)
  4. Write determinism tests (see the fuller sketch after this list)

    from temporalio.testing import WorkflowEnvironment
    
    async def test_workflow_determinism():
        async with await WorkflowEnvironment.start_time_skipping() as env:
            # Test the workflow against different scenarios
            ...
    
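
A more complete version of such a test might look like this (a minimal sketch assuming the DataProcessingWorkflow and activities from sections 2.2 and 2.3; the time-skipping environment downloads a local test server on first use):

import uuid

from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker


async def test_data_processing_workflow():
    async with await WorkflowEnvironment.start_time_skipping() as env:
        # Run a worker against the in-memory test environment
        async with Worker(
            env.client,
            task_queue="test-queue",
            workflows=[DataProcessingWorkflow],
            activities=[process_data, send_notification],
        ):
            result = await env.client.execute_workflow(
                DataProcessingWorkflow.run,
                "Sample Data",
                id=f"test-{uuid.uuid4()}",
                task_queue="test-queue",
            )
            assert result["status"] == "completed"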

2.6.2 Activity Best Practices

  1. ALWAYS idempotent

    • Use idempotency keys
    • Check whether the operation was already performed
  2. Appropriate granularity

    • Not too fine-grained: bloats the history
    • Not too coarse: hard to keep idempotent, inefficient retries
  3. Always set timeouts

    • At least Start-To-Close
    • Heartbeats for long-running activities
  4. Error handling

    @activity.defn
    async def robust_activity(params):
        try:
            return await external_api.call(params)
        except TemporaryError as e:
            # Let Temporal retry
            raise
        except PermanentError as e:
            # Do not retry
            raise ApplicationError(str(e), non_retryable=True)


2.6.3 Worker Best Practices

  1. Dedicated worker processes in production

    • Not in the same process as the web server
  2. Task queue isolation for critical activities

    # Payments isolated
    payment_worker = Worker(
        client,
        task_queue="payments",
        activities=[payment_activity],
    )

    # Background jobs kept separate
    background_worker = Worker(
        client,
        task_queue="background",
        activities=[email_activity, report_activity],
    )

  3. Use resource-based tuning

    • Prevents out-of-memory errors
    • Optimizes throughput automatically
  4. Monitoring and metrics

    # Key metrics to watch:
    # - worker_task_slots_available (should stay > 0)
    # - temporal_sticky_cache_hit_total
    # - temporal_activity_execution_failed_total


2.7 Summary

In this chapter we covered the three core building blocks of Temporal:

Workflows orchestrate the overall process:

  • Deterministic and replayable
  • Long-lived (days to years)
  • Written in ordinary programming languages
  • Must NOT perform I/O operations

Activities do the actual work:

  • Non-deterministic
  • May do I/O, call external APIs, have side effects
  • Automatic retries with configurable policies
  • Should ALWAYS be idempotent

Workers host the workflow and activity code:

  • Poll task queues via long polling
  • Stateless and horizontally scalable
  • Perform workflow replay and activity execution
  • Use sticky execution for performance

The big picture:

graph TB
    Client[Client Code]
    Service[Temporal Service]
    Worker[Worker Process]

    Client -->|Start Workflow| Service
    Service -->|Tasks via Queue| Worker
    Worker -->|Workflow Code| WF[Workflows<br/>Orchestration]
    Worker -->|Activity Code| AC[Activities<br/>Execution]
    WF -->|Schedule| AC
    Worker -->|Results| Service
    Service -->|History| DB[(Event<br/>History)]

    style WF fill:#e1f5ff
    style AC fill:#ffe1e1
    style Worker fill:#e1ffe1

With this understanding of the core building blocks, the next chapter dives deeper into the architecture of the Temporal Service and shows how the Frontend, History Service, Matching Service, and persistence layer work together.


⬆ Back to the table of contents

Next chapter: Chapter 3: Architecture of the Temporal Service

Code examples for this chapter: examples/part-01/chapter-02/

Chapter 3: Architecture of the Temporal Service

Having covered the basic concepts and core building blocks in the previous chapters, we now take a deep look at the architecture of the Temporal Service. The Temporal Service is the heart of the whole system – it coordinates workflows, stores state, manages task queues, and guarantees execution. A solid understanding of this architecture is essential for running and scaling Temporal in production.

3.1 Architecture Overview

3.1.1 The Four Core Components

The Temporal Service consists of four independently scalable services:

graph TB
    subgraph "Temporal Service"
        FE[Frontend Service<br/>API Gateway]
        HS[History Service<br/>State Management]
        MS[Matching Service<br/>Task Queues]
        WS[Worker Service<br/>Internal Operations]
    end

    subgraph "External Components"
        Client[Clients]
        Workers[Worker Processes]
        DB[(Persistence<br/>Database)]
        ES[(Visibility<br/>Elasticsearch)]
    end

    Client -->|gRPC| FE
    Workers -->|Long Poll| FE
    FE --> HS
    FE --> MS
    HS -->|Read/Write| DB
    HS --> MS
    MS -->|Tasks| DB
    WS --> HS
    HS -->|Events| ES

    style FE fill:#e1f5ff
    style HS fill:#ffe1e1
    style MS fill:#fff4e1
    style WS fill:#e1ffe1

Frontend Service:

  • Stateless API gateway
  • Entry point for all client and worker requests
  • Request validation and rate limiting
  • Routing to the History and Matching Services

History Service:

  • Manages workflow execution state
  • Stores the event history (event sourcing)
  • Coordinates the workflow lifecycle
  • Sharded: a fixed number of shards to which workflow executions are assigned

Matching Service:

  • Manages task queues
  • Dispatches tasks to workers
  • Long-polling mechanism
  • Partitioned: task queues are split into partitions for scaling

Worker Service (an internal service):

  • Runs internal system workflows
  • Replication queue processing
  • Archival operations
  • Not the same thing as your application's worker processes!

3.1.2 Architectural Principles

Event sourcing as the foundation: Temporal stores an append-only event history for every workflow execution. The complete workflow state can be reconstructed by replaying this history.

Separation of concerns:

  • Frontend: API and routing
  • History: state management and coordination
  • Matching: task dispatching
  • Persistence: data storage

Independent scaling: each service can be scaled horizontally on its own to match different workload characteristics.

3.2 Frontend Service: The API Gateway

3.2.1 Role and Responsibilities

The Frontend Service is the only public entry point to the Temporal Service:

graph LR
    subgraph "External"
        C1[Client 1]
        C2[Client 2]
        W1[Worker 1]
        W2[Worker 2]
    end

    subgraph "Frontend Service"
        FE1[Frontend Instance 1]
        FE2[Frontend Instance 2]
        FE3[Frontend Instance 3]
    end

    LB[Load Balancer]

    C1 --> LB
    C2 --> LB
    W1 --> LB
    W2 --> LB

    LB --> FE1
    LB --> FE2
    LB --> FE3

    FE1 -.->|Route| History[History Service]
    FE2 -.->|Route| Matching[Matching Service]
    FE3 -.->|Route| History

    style LB fill:#cccccc
    style FE1 fill:#e1f5ff
    style FE2 fill:#e1f5ff
    style FE3 fill:#e1f5ff

API exposure:

  • gRPC API (port 7233): the primary protocol for clients and workers
  • HTTP API (port 8233): an HTTP proxy for the Web UI and HTTP clients
  • Protocol Buffers: serialization via protobuf

Request handling:

  1. Receives API requests (StartWorkflow, SignalWorkflow, PollWorkflowTask, etc.)
  2. Validates requests for correctness
  3. Applies rate limiting
  4. Routes to the History or Matching Service

3.2.2 Rate Limiting

The Frontend implements multi-level rate limiting:

# Namespace-level RPS limit
# At most N requests/second per namespace
frontend.namespacerps = 1200

# Persistence-level QPS limit
# Protects the database from overload
frontend.persistenceMaxQPS = 10000

# Task-queue-level limits
# At most M dispatches/second per task queue

Warum Rate Limiting?

  • Schutz vor übermäßiger Last
  • Fairness zwischen Namespaces (Multi-Tenancy)
  • Vermeidung von Database-Überlastung
  • Backpressure für Clients

3.2.3 Namespace Routing

Multi-Tenancy durch Namespaces:

Namespaces bieten logische Isolation:

  • Workflow Executions isoliert pro Namespace
  • Separate Resource Limits
  • Unabhängige Retention Policies
  • Verschiedene Archival Configurations

Routing-Mechanismus: Frontend bestimmt aus Request-Header, welcher Namespace betroffen ist, und routet entsprechend.
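
Aus Sicht des Python SDK ist der Namespace schlicht ein Parameter beim Verbindungsaufbau:

# Client für einen bestimmten Namespace; das Frontend routet anhand dieser Angabe.
from temporalio.client import Client

async def connect_to_production() -> Client:
    return await Client.connect("localhost:7233", namespace="production")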

3.2.4 Stateless Design

Horizontale Skalierung ohne Limits:

# Einfaches Hinzufügen neuer Frontend Instances
kubectl scale deployment temporal-frontend --replicas=10

Eigenschaften:

  • Keine Session-Affinität erforderlich
  • Kein Shared State zwischen Instances
  • Load Balancer verteilt Traffic
  • Einfaches Rolling Update

3.3 History Service: Das Herzstück

3.3.1 Event Sourcing und State Management

Der History Service verwaltet den kompletten Lifecycle jeder Workflow Execution:

stateDiagram-v2
    [*] --> WorkflowStarted: Client starts workflow
    WorkflowStarted --> WorkflowTaskScheduled: Create first task
    WorkflowTaskScheduled --> WorkflowTaskStarted: Worker polls
    WorkflowTaskStarted --> WorkflowTaskCompleted: Worker returns commands
    WorkflowTaskCompleted --> ActivityTaskScheduled: Schedule activity
    ActivityTaskScheduled --> ActivityTaskStarted: Worker polls
    ActivityTaskStarted --> ActivityTaskCompleted: Activity finishes
    ActivityTaskCompleted --> WorkflowTaskScheduled: New workflow task
    WorkflowTaskScheduled --> WorkflowTaskStarted
    WorkflowTaskStarted --> WorkflowExecutionCompleted: Workflow completes
    WorkflowExecutionCompleted --> [*]

Zwei Formen von State:

  1. Mutable State (veränderlich):

    • Aktueller Snapshot der Workflow Execution
    • Tracked: Laufende Activities, Timer, Child Workflows, pending Signals
    • In-Memory Cache für kürzlich verwendete Executions
    • In Database persistiert (typischerweise eine Zeile)
    • Wird bei jeder State Transition aktualisiert
  2. Immutable Event History (unveränderlich):

    • Append-Only Log aller Workflow Events
    • Source of Truth: Workflow-State kann komplett rekonstruiert werden
    • Definiert in Protocol Buffer Specifications
    • Limits: 51.200 Events oder 50 MB (Warnung bei 10.240 Events/10 MB)
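
Wie nah eine Execution an diesen Limits ist, lässt sich grob über die geladene History abschätzen. Ein kleiner Sketch mit dem Python SDK (Annahme: fetch_history liefert ein Objekt mit einer events-Liste; Adresse nur beispielhaft):

# Sketch: Anzahl der History-Events einer Execution ermitteln.
from temporalio.client import Client

async def count_history_events(workflow_id: str) -> int:
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle(workflow_id)
    history = await handle.fetch_history()
    return len(history.events)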

3.3.2 Sharding-Architektur

Fixed Shard Count:

Der History Service nutzt Sharding für Parallelität:

graph TB
    subgraph "Workflow Executions"
        WF1[Workflow 1<br/>ID: order-123]
        WF2[Workflow 2<br/>ID: payment-456]
        WF3[Workflow 3<br/>ID: order-789]
        WF4[Workflow 4<br/>ID: shipment-111]
    end

    subgraph "History Shards (Fixed: 512)"
        S1[Shard 1]
        S2[Shard 2]
        S3[Shard 3]
        S4[Shard 512]
    end

    WF1 -->|Hash| S1
    WF2 -->|Hash| S2
    WF3 -->|Hash| S1
    WF4 -->|Hash| S3

    style S1 fill:#ffe1e1
    style S2 fill:#ffe1e1
    style S3 fill:#ffe1e1
    style S4 fill:#ffe1e1

Shard Assignment:

shard_id = hash(workflow_id + namespace) % shard_count
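
Zur Veranschaulichung ein kleiner Python-Sketch dieser Zuordnung (die konkrete Hash-Funktion von Temporal kann abweichen; hier nur illustrativ mit SHA-256):

# Illustrativer Sketch: deterministische Zuordnung Workflow → History Shard.
import hashlib

def shard_for(workflow_id: str, namespace: str, shard_count: int = 512) -> int:
    key = f"{namespace}:{workflow_id}".encode()
    digest = hashlib.sha256(key).digest()  # stabiler, deterministischer Hash
    return int.from_bytes(digest[:8], "big") % shard_count

# Dieselbe Workflow-ID landet immer im selben Shard:
print(shard_for("order-123", "production"))
print(shard_for("order-123", "production"))  # identischer Wert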

Eigenschaften:

  • Shard Count wird bei Cluster-Erstellung festgelegt
  • Nicht änderbar nach Cluster-Start
  • Empfohlen: 128-512 Shards für kleine Cluster, selten >4096
  • Jeder Shard ist eine Unit of Parallelism
  • Alle Updates innerhalb eines Shards sind sequenziell

Performance-Implikationen:

Max Throughput pro Shard = 1 / (Database Operation Latency)

Beispiel:
- DB Latency: 10ms
- Max Throughput: 1 / 0.01s = 100 Updates/Sekunde pro Shard
- 512 Shards → ~51.200 Updates/Sekunde gesamt

3.3.3 Interne Task Queues

Jeder History Shard verwaltet interne Queues für verschiedene Task-Typen:

graph TB
    subgraph "History Shard"
        TQ[Transfer Queue<br/>Sofort ausführbar]
        TimerQ[Timer Queue<br/>Zeitbasiert]
        VisQ[Visibility Queue<br/>Search Updates]
        RepQ[Replication Queue<br/>Multi-Cluster]
        ArchQ[Archival Queue<br/>Long-term Storage]
    end

    TQ -->|Triggers| Matching[Matching Service]
    TimerQ -->|Fires at time| TQ
    VisQ -->|Updates| ES[(Elasticsearch)]
    RepQ -->|Replicates| Remote[Remote Cluster]
    ArchQ -->|Archives| S3[(S3/GCS)]

    style TQ fill:#e1f5ff
    style TimerQ fill:#fff4e1
    style VisQ fill:#ffe1e1
    style RepQ fill:#e1ffe1
    style ArchQ fill:#ffffcc

1. Transfer Queue:

  • Sofort ausführbare Tasks
  • Enqueued Workflow/Activity Tasks zu Matching
  • Erzeugt Timer

2. Timer Queue:

  • Zeitbasierte Events
  • Workflow Timeouts, Retries, Delays
  • Cron Triggers
  • Fires zur definierten Zeit, erzeugt oft Transfer Tasks

3. Visibility Queue:

  • Updates für Visibility Store (Elasticsearch)
  • Ermöglicht Workflow-Suche und -Filterung
  • Powert Web UI Queries

4. Replication Queue (Multi-Cluster):

  • Repliziert Events zu Remote Clusters
  • Async Replication für High Availability

5. Archival Queue:

  • Triggert Archivierung nach Retention Period
  • Langzeitspeicherung (S3, GCS, etc.)

3.3.4 Workflow State Transition

Transaktionaler Ablauf:

sequenceDiagram
    participant Input as Input<br/>(RPC, Timer, Signal)
    participant HS as History Service
    participant Mem as In-Memory State
    participant DB as Database

    Input->>HS: State Transition Trigger
    HS->>Mem: Load Mutable State (from cache/DB)
    HS->>Mem: Create new Events
    HS->>Mem: Update Mutable State
    HS->>Mem: Generate Internal Tasks

    HS->>DB: BEGIN TRANSACTION
    HS->>DB: Write Events to History Table
    HS->>DB: Update Mutable State Row
    HS->>DB: Write Transfer/Timer Tasks
    HS->>DB: COMMIT TRANSACTION

    DB-->>HS: Transaction Success
    HS->>HS: Cache Updated State

Consistency durch Transactions:

  • Mutable State und Event History werden atomar committed
  • Verhindert Inkonsistenzen bei Crashes
  • Database Transactions garantieren ACID-Eigenschaften

Transactional Outbox Pattern:

  • Transfer Tasks werden mit State in DB persistiert
  • Task Processing erfolgt asynchron
  • Verhindert Divergenz zwischen State und Task Queues
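
Das Prinzip lässt sich als stark vereinfachter Sketch mit sqlite3 nachstellen (Tabellen- und Spaltennamen sind hypothetisch und entsprechen nicht dem echten Temporal-Schema):

# Stark vereinfachter Sketch des Transactional-Outbox-Prinzips mit sqlite3.
import json
import sqlite3

def setup(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS executions   (run_id TEXT PRIMARY KEY, mutable_state TEXT);
        CREATE TABLE IF NOT EXISTS history_node (run_id TEXT, event TEXT);
        CREATE TABLE IF NOT EXISTS tasks        (run_id TEXT, task TEXT);
    """)

def apply_state_transition(conn, run_id, new_events, mutable_state, transfer_tasks):
    # Alles in EINER Transaktion: Events, Mutable State und Outbox-Tasks.
    with conn:  # BEGIN ... COMMIT: entweder alles oder nichts
        conn.executemany(
            "INSERT INTO history_node (run_id, event) VALUES (?, ?)",
            [(run_id, json.dumps(e)) for e in new_events],
        )
        conn.execute(
            "INSERT OR REPLACE INTO executions (run_id, mutable_state) VALUES (?, ?)",
            (run_id, json.dumps(mutable_state)),
        )
        conn.executemany(
            "INSERT INTO tasks (run_id, task) VALUES (?, ?)",
            [(run_id, json.dumps(t)) for t in transfer_tasks],
        )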

3.3.5 Cache-Mechanismen

Mutable State Cache:

# Pro-Shard Cache
# Cached kürzlich verwendete Workflow Executions
# Vermeidet teure History Replays

cache_size_per_shard = 1000  # Beispiel

Vorteile:

  • Schneller Zugriff auf aktiven Workflow State
  • Reduziert Database Reads
  • Kritisch für Performance bei hoher Last

Cache Miss: Bei Cache Miss muss History Service:

  1. Event History aus DB laden
  2. Komplette History replayed
  3. State rekonstruieren
  4. In Cache einfügen

Geplante Verbesserung: Host-Level Cache, der von allen Shards geteilt wird.

3.4 Matching Service: Task Queue Management

3.4.1 Aufgaben und Verantwortlichkeiten

Der Matching Service verwaltet alle user-facing Task Queues:

graph TB
    subgraph "Task Queues"
        WQ[Workflow Task Queue<br/>'production']
        AQ[Activity Task Queue<br/>'production']
        AQ2[Activity Task Queue<br/>'background']
    end

    subgraph "Matching Service"
        P1[Partition 1]
        P2[Partition 2]
        P3[Partition 3]
        P4[Partition 4]
    end

    subgraph "Workers"
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker 3]
    end

    History[History Service] -->|Enqueue| P1
    History -->|Enqueue| P2

    W1 -->|Long Poll| P1
    W2 -->|Long Poll| P3
    W3 -->|Long Poll| P4

    P1 -.-> WQ
    P2 -.-> AQ
    P3 -.-> AQ2
    P4 -.-> WQ

    style P1 fill:#fff4e1
    style P2 fill:#fff4e1
    style P3 fill:#fff4e1
    style P4 fill:#fff4e1

Core Functions:

  • Task Queue Verwaltung
  • Task Dispatching an Workers
  • Long-Poll Protocol Implementation
  • Load Balancing über Worker Processes

3.4.2 Task Queue Partitioning

Default: 4 Partitionen pro Task Queue

# Task Queue "production" mit 4 Partitionen
task_queue_partitions = {
    "production": [
        "production_partition_0",
        "production_partition_1",
        "production_partition_2",
        "production_partition_3",
    ]
}

Partition Charakteristika:

  • Tasks werden zufällig einer Partition zugeordnet
  • Worker Polls werden gleichmäßig verteilt
  • Partitionen sind Units of Scaling für Matching Service
  • Partition Count anpassbar basierend auf Last

Hierarchische Organisation:

graph TB
    Root[Root Partition]
    P1[Partition 1]
    P2[Partition 2]
    P3[Partition 3]
    P4[Partition 4]

    Root --> P1
    Root --> P2
    Root --> P3
    Root --> P4

    P1 -.->|Forward if empty| Root
    P2 -.->|Forward if empty| Root
    P3 -.->|Forward tasks| Root
    P4 -.->|Forward if no pollers| Root

Forwarding Mechanismus:

  • Leere Partitionen forwarden Polls zur Parent Partition
  • Partitionen ohne Poller forwarden Tasks zur Parent
  • Ermöglicht effiziente Ressourcennutzung

3.4.3 Sync Match vs Async Match

Zwei Dispatch-Modi:

sequenceDiagram
    participant HS as History Service
    participant MS as Matching Service
    participant W as Worker
    participant DB as Database

    Note over MS,W: Sync Match (Optimal Path)
    HS->>MS: Enqueue Task
    W->>MS: Poll (waiting)
    MS->>W: Task (immediate)
    Note over MS: No DB write!

    Note over MS,DB: Async Match (Backlog Path)
    HS->>MS: Enqueue Task
    MS->>DB: Persist Task
    Note over W: Worker polls later
    W->>MS: Poll
    MS->>DB: Read Task
    DB-->>MS: Task
    MS->>W: Task

Sync Match (Optimal):

  • Task sofort an wartenden Worker geliefert
  • Keine Database-Persistierung erforderlich
  • Zero Backlog Szenario
  • Höchste Performance
  • Metrik: sync_match_rate sollte hoch sein (>90%)

Async Match (Backlog):

  • Task wird in DB persistiert
  • Worker holt später aus Backlog
  • Tritt auf wenn keine Worker verfügbar
  • Niedrigere Performance (DB Round-Trip)
  • Tasks FIFO aus Backlog

Special Cases:

  • Nexus/Query Tasks: Niemals persistiert, nur Sync Match
  • Sticky Workflow Tasks: Bei Sync Match Fail → DB Persistence

3.4.4 Load Balancing

Worker-Pull Model:

graph LR
    subgraph "Workers (Pull-Based)"
        W1[Worker 1<br/>Capacity: 50]
        W2[Worker 2<br/>Capacity: 30]
        W3[Worker 3<br/>Capacity: 100]
    end

    subgraph "Matching Service"
        TQ[Task Queue<br/>Tasks: 200]
    end

    W1 -.->|Poll when capacity| TQ
    W2 -.->|Poll when capacity| TQ
    W3 -.->|Poll when capacity| TQ

    TQ -->|Distribute| W1
    TQ -->|Distribute| W2
    TQ -->|Distribute| W3

    style W1 fill:#e1ffe1
    style W2 fill:#e1ffe1
    style W3 fill:#e1ffe1

Vorteile:

  • Natürliches Load Balancing
  • Workers holen nur wenn Kapazität vorhanden
  • Verhindert Worker-Überlastung
  • Kein Worker Discovery/DNS erforderlich

Backlog Management:

  • Monitor BacklogIncreaseRate Metrik
  • Balance Worker Count mit Task Volume
  • Scale Workers um Sync Match Rate zu maximieren

3.4.5 Sticky Execution Optimization

Problem: Bei jedem Workflow Task muss Worker Event History laden und replayed.

Lösung: Sticky Task Queues

sequenceDiagram
    participant HS as History Service
    participant MS as Matching Service
    participant NQ as Normal Queue
    participant SQ as Sticky Queue (Worker 1)
    participant W1 as Worker 1
    participant W2 as Worker 2

    HS->>MS: Enqueue Task (WF-123, first time)
    MS->>NQ: Add to Normal Queue
    W1->>MS: Poll Normal Queue
    MS-->>W1: Task (WF-123)
    W1->>W1: Execute + Cache State
    W1->>HS: Complete

    Note over MS: Create Sticky Queue for Worker 1
    HS->>MS: Enqueue Task (WF-123, second time)
    MS->>SQ: Add to Sticky Queue (Worker 1)

    W1->>MS: Poll Sticky Queue
    MS-->>W1: Task (WF-123)
    Note over W1: State im Cache!<br/>Kein Replay!
    W1->>W1: Execute (sehr schnell)
    W1->>HS: Complete

    Note over MS: Timeout (5s) - Worker 1 nicht verfügbar
    HS->>MS: Enqueue Task (WF-123, third time)
    MS->>SQ: Try Sticky Queue
    MS->>NQ: Fallback to Normal Queue

    W2->>MS: Poll Normal Queue
    MS-->>W2: Task (WF-123)
    Note over W2: Kein Cache<br/>History Reload + Replay

Vorteile:

  • 10-100x schnellere Task-Verarbeitung
  • Reduzierte Last auf History Service
  • Geringere Latenz für Workflows

Automatisch aktiviert – keine Konfiguration erforderlich!
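
Die Größe des Workflow-Caches pro Worker lässt sich im Python SDK dennoch anpassen – ein kurzer Sketch (Parameter max_cached_workflows; Task Queue und Wert nur beispielhaft):

# Sketch: Worker mit angepasstem Workflow-Cache für Sticky Execution.
from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker

@workflow.defn
class PingWorkflow:
    @workflow.run
    async def run(self) -> str:
        return "pong"

async def run_worker() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="production",
        workflows=[PingWorkflow],
        max_cached_workflows=1000,  # Anzahl gecachter Workflow Executions pro Worker
    )
    await worker.run()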

3.5 Worker Service: Interne Operationen

3.5.1 Unterschied zu User Workers

WICHTIG: Worker Service ≠ User Worker Processes!

graph TB
    subgraph "Temporal Cluster (Managed)"
        WS[Worker Service<br/>Internal System Service]
    end

    subgraph "User Application (Self-Hosted)"
        UW1[User Worker 1]
        UW2[User Worker 2]
        UW3[User Worker 3]
    end

    WS -->|Processes| IWF[Internal System Workflows]
    WS -->|Handles| Rep[Replication Queue]
    WS -->|Manages| Arch[Archival Operations]

    UW1 -->|Executes| AppWF[Application Workflows]
    UW2 -->|Executes| AppWF
    UW3 -->|Executes| AppWF

    style WS fill:#e1ffe1
    style UW1 fill:#e1f5ff
    style UW2 fill:#e1f5ff
    style UW3 fill:#e1f5ff

3.5.2 Aufgaben des Worker Service

Interne Background-Verarbeitung:

  1. System Workflows:

    • Workflow Deletions
    • Dead-Letter Queue Handling
    • Batch Operations
  2. Replication Queue Processing:

    • Multi-Cluster Replication
    • Event-Synchronisation zu Remote Clusters
  3. Archival Operations:

    • Langzeit-Archivierung abgeschlossener Workflows
    • Upload zu S3, GCS, etc.
  4. Kafka Visibility Processor (Version < 1.5.0):

    • Event Processing für Elasticsearch

Self-Hosting: Nutzt Temporal’s eigene Workflow Engine für Cluster-Level Operationen – “Temporal orchestriert Temporal”!

3.6 Persistence Layer: Datenspeicherung

3.6.1 Unterstützte Datenbanken

Primary Persistence (temporal_default):

graph TB
    subgraph "Supported Databases"
        Cass[Cassandra 3.x+<br/>NoSQL, Horizontal Scaling]
        PG[PostgreSQL 9.6+<br/>SQL, Transactional]
        MySQL[MySQL 5.7+<br/>SQL, Transactional]
    end

    subgraph "Temporal Services"
        HS[History Service]
        MS[Matching Service]
    end

    HS -->|Read/Write| Cass
    HS -->|Read/Write| PG
    HS -->|Read/Write| MySQL

    MS -->|Task Backlog| Cass
    MS -->|Task Backlog| PG
    MS -->|Task Backlog| MySQL

    style Cass fill:#e1f5ff
    style PG fill:#ffe1e1
    style MySQL fill:#fff4e1

Cassandra:

  • Natürliche horizontale Skalierung
  • Multi-Datacenter Replication
  • Eventual Consistency Model
  • Empfohlen für massive Scale

PostgreSQL/MySQL:

  • Vertikale Skalierung
  • Read Replicas für Visibility Queries
  • Connection Pooling kritisch
  • Ausreichend für die meisten Production Deployments

3.6.2 Datenmodell

Zwei-Schema-Ansatz:

1. temporal_default (Core Persistence):

Tables:
- executions: Mutable State of Workflow Executions
- history_node: Append-Only Event Log (partitioned)
- tasks: Transfer, Timer, Visibility, Replication Queues
- namespaces: Namespace Metadata, Retention Policies
- queue_metadata: Task Queue Checkpoints

2. temporal_visibility (Search/Query):

Tables:
- executions_visibility: Indexed Workflow Metadata
  - workflow_id, workflow_type, status, start_time, close_time
  - custom_search_attributes (JSON/Searchable)

Event History Storage Pattern:

# Events werden in Batches gespeichert (History Nodes)
# Jeder Node: ~100-200 Events
# Optimiert für sequentielles Lesen

history_nodes = [
    {
        "node_id": 1,
        "events": list(range(1, 101)),    # Events 1-100 (WorkflowStarted bis Event 100)
        "prev_txn_id": 0,
        "txn_id": 12345,
    },
    {
        "node_id": 2,
        "events": list(range(101, 201)),  # Events 101-200
        "prev_txn_id": 12345,
        "txn_id": 12456,
    },
]

3.6.3 Visibility Store

Database Visibility (Basic):

-- Einfache SQL Queries
SELECT * FROM executions_visibility
WHERE workflow_type = 'OrderProcessing'
  AND status = 'Running'
  AND start_time > '2025-01-01'
ORDER BY start_time DESC
LIMIT 100;

Limitierungen:

  • Begrenzte Query-Capabilities
  • Performance-Probleme bei großen Datasets
  • Verfügbar: PostgreSQL 12+, MySQL 8.0.17+

Elasticsearch Visibility (Advanced, empfohlen):

// Komplexe Queries möglich
{
  "query": {
    "bool": {
      "must": [
        {"term": {"WorkflowType": "OrderProcessing"}},
        {"term": {"ExecutionStatus": "Running"}},
        {"range": {"StartTime": {"gte": "2025-01-01"}}}
      ],
      "filter": [
        {"term": {"CustomStringField": "VIP"}}
      ]
    }
  },
  "sort": [{"StartTime": "desc"}],
  "size": 100
}

Vorteile:

  • High-Performance Indexing
  • Komplexe Such-Queries
  • Custom Attributes und Filter
  • Entlastet Haupt-Datenbank

Design Consideration: Elasticsearch nimmt Query-Last von der Main Database – kritisch für Skalierung!
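
Aus dem Python SDK heraus wird der Visibility Store über List-Queries angesprochen – ein kurzer Sketch (Query-Syntax wie in der Web UI; Suchattribute und Werte nur beispielhaft):

# Sketch: Visibility-Query über das Python SDK (nutzt den Visibility Store).
from temporalio.client import Client

async def find_running_orders() -> list[str]:
    client = await Client.connect("localhost:7233")
    query = (
        "WorkflowType = 'OrderProcessing' "
        "AND ExecutionStatus = 'Running' "
        "AND StartTime > '2025-01-01T00:00:00Z'"
    )
    return [wf.id async for wf in client.list_workflows(query)]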

3.6.4 Konsistenz-Garantien

Strong Consistency (Writes):

# Database Transaction gewährleistet Konsistenz
BEGIN TRANSACTION
    UPDATE executions SET mutable_state = ... WHERE ...
    INSERT INTO history_node VALUES (...)
    INSERT INTO tasks VALUES (...)
COMMIT

  • History Service nutzt DB Transactions
  • Mutable State + Events atomar committed
  • Einzelner Shard verarbeitet alle Updates sequenziell
  • Verhindert Race Conditions

Eventual Consistency (Reads):

  • Visibility Data eventual consistent
  • Multi-Cluster Replication asynchron
  • Replication Lag möglich bei Failover

Event Sourcing Benefits:

  • Exactly-Once Execution Semantics
  • Komplette Audit Trail
  • State Reconstruction jederzeit möglich
  • Replay für Debugging

3.7 Kommunikationsflüsse

3.7.1 Workflow Start Flow

Der komplette Flow vom Client bis zur ersten Workflow Task Execution:

sequenceDiagram
    participant C as Client
    participant FE as Frontend
    participant HS as History
    participant DB as Database
    participant MS as Matching
    participant W as Worker

    C->>FE: StartWorkflowExecution(id, type, input)
    FE->>FE: Validate & Rate Limit
    FE->>FE: Hash(workflow_id) → Shard 42
    FE->>HS: Forward to History Shard 42

    HS->>DB: BEGIN TRANSACTION
    HS->>DB: INSERT WorkflowExecutionStarted Event
    HS->>DB: INSERT WorkflowTaskScheduled Event
    HS->>DB: INSERT Mutable State
    HS->>DB: INSERT Transfer Task (workflow task)
    HS->>DB: COMMIT TRANSACTION

    DB-->>HS: Success
    HS-->>FE: Execution Created
    FE-->>C: RunID + Success

    Note over HS: Transfer Queue Processor

    HS->>MS: AddWorkflowTask(task_queue, task)
    MS->>MS: Try Sync Match

    alt Sync Match Success
        W->>MS: PollWorkflowTaskQueue (waiting)
        MS-->>W: Task (immediate)
    else No Pollers
        MS->>DB: Persist Task to Backlog
        Note over W: Worker polls later
        W->>MS: PollWorkflowTaskQueue
        MS->>DB: Read from Backlog
        DB-->>MS: Task
        MS-->>W: Task
    end

    W->>W: Execute Workflow Code
    W->>FE: RespondWorkflowTaskCompleted(commands)
    FE->>HS: Process Commands

3.7.2 Activity Execution Flow

sequenceDiagram
    participant W as Worker<br/>(Workflow)
    participant FE as Frontend
    participant HS as History
    participant MS as Matching
    participant AW as Worker<br/>(Activity)

    Note over W: Workflow Code schedules Activity

    W->>FE: RespondWorkflowTask([ScheduleActivity])
    FE->>HS: Process Commands

    HS->>HS: Create ActivityTaskScheduled Event
    HS->>HS: Generate Transfer Task
    HS->>MS: AddActivityTask(task_queue, task)

    MS->>MS: Try Sync Match
    AW->>MS: PollActivityTaskQueue
    MS-->>AW: Activity Task

    AW->>AW: Execute Activity Function
    alt Activity Success
        AW->>FE: RespondActivityTaskCompleted(result)
        FE->>HS: Process Result
        HS->>HS: Create ActivityTaskCompleted Event
    else Activity Failure
        AW->>FE: RespondActivityTaskFailed(error)
        FE->>HS: Process Failure
        HS->>HS: Create ActivityTaskFailed Event
        Note over HS: Retry Logic applies
    end

    HS->>HS: Create new WorkflowTask
    HS->>MS: AddWorkflowTask
    Note over W: Worker receives continuation task

3.7.3 Long-Polling Mechanismus

Worker Long-Poll Detail:

# Worker SDK Code (vereinfacht)
async def poll_workflow_tasks():
    while True:
        try:
            # Long Poll mit ~60s Timeout
            response = await client.poll_workflow_task_queue(
                task_queue="production",
                timeout=60  # Sekunden
            )

            if response.has_task:
                # Task sofort erhalten (Sync Match!)
                await execute_workflow_task(response.task)
            else:
                # Timeout - keine Tasks verfügbar
                # Sofort erneut pollen
                continue

        except Exception as e:
            # Fehlerbehandlung
            await asyncio.sleep(1)

Server-Seite (Matching Service):

# Matching Service (konzeptuell)
async def handle_poll_request(poll_request):
    # Try Sync Match
    task = try_get_task_immediately(poll_request.task_queue)

    if task:
        # Sync Match erfolgreich!
        return task

    # Kein Task verfügbar - halte Verbindung offen
    task = await wait_for_task_or_timeout(
        poll_request.task_queue,
        timeout=60
    )

    if task:
        return task
    else:
        return empty_response

Vorteile:

  • Minimale Latenz bei Task-Verfügbarkeit
  • Reduzierter Netzwerk-Overhead (keine Poll-Loops)
  • Natürliches Backpressure Handling

3.8 Skalierung und High Availability

3.8.1 Unabhängige Service-Skalierung

graph TB
    subgraph "Scaling Strategy"
        FE1[Frontend<br/>3 Instances]
        HS1[History<br/>10 Instances]
        MS1[Matching<br/>5 Instances]
        WS1[Worker<br/>2 Instances]
    end

    subgraph "Charakteristika"
        FE1 -.-> FE_C[Stateless<br/>Unbegrenzt skalierbar]
        HS1 -.-> HS_C[Sharded<br/>Shards über Instances verteilt]
        MS1 -.-> MS_C[Partitioned<br/>Partitionen über Instances]
        WS1 -.-> WS_C[Internal Workload<br/>Separat skalierbar]
    end

Frontend Service:

  • Stateless → Beliebig horizontal skalieren
  • Hinter Load Balancer
  • Keine Koordinations-Overhead

History Service:

  • Instanzen hinzufügen
  • Shards dynamisch über Instances verteilt
  • Ringpop koordiniert Shard Ownership
  • Constraint: Total Shard Count fixed

Matching Service:

  • Instanzen hinzufügen
  • Task Queue Partitionen über Instances verteilt
  • Consistent Hashing für Partition Placement

3.8.2 Database Scaling

Bottleneck: Database oft ultimatives Performance-Limit

Cassandra:

# Natürliche horizontale Skalierung
# Neue Nodes hinzufügen
nodetool status
# Rebalancing automatisch

PostgreSQL/MySQL:

-- Vertikale Skalierung: Größere Instances
-- Read Replicas für Visibility Queries
-- Connection Pooling kritisch

max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB

3.8.3 Multi-Cluster Replication

Global Namespaces für High Availability:

graph TB
    subgraph "Cluster 1 (Primary - US-West)"
        NS1[Namespace: production<br/>Active]
        HS1[History Service]
        DB1[(Database)]
    end

    subgraph "Cluster 2 (Standby - US-East)"
        NS2[Namespace: production<br/>Standby]
        HS2[History Service]
        DB2[(Database)]
    end

    Client[Client Application]

    Client -->|Writes| NS1
    Client -.->|Reads| NS1
    Client -.->|Reads| NS2

    NS1 -->|Async Replication| NS2

    style NS1 fill:#90EE90
    style NS2 fill:#FFB6C1

Charakteristika:

  • Async Replication: Hoher Throughput
  • Nicht strongly consistent über Clusters
  • Replication Lag bei Failover → potentieller Progress Loss
  • Visibility APIs funktionieren auf Active und Standby

Failover Process:

  1. Namespace auf Backup Cluster aktiviert
  2. Workflows setzen fort vom letzten replizierten State
  3. Einige in-flight Activity Tasks können re-executed werden
  4. Akzeptabel für Disaster Recovery Szenarien

3.8.4 Performance-Metriken

Key Metrics zu überwachen:

# History Service
"shard_lock_latency": < 5ms,  # Idealerweise ~1ms
"cache_hit_rate": > 80%,
"transfer_task_latency": < 100ms,

# Matching Service
"sync_match_rate": > 90%,  # Hoch halten!
"backlog_size": < 1000,
"poll_success_rate": > 95%,

# Database
"query_latency_p99": < 50ms,
"connection_pool_utilization": 60-80%,
"persistence_rps": < max_qps,

Sticky Execution Optimization:

sticky_cache_hit_rate: > 70%
→ Drastisch reduzierte History Replays
→ 10-100x schnellere Task-Verarbeitung

3.9 Praktisches Beispiel: Service Interaction

Schauen wir uns das Code-Beispiel für Kapitel 3 an:

@workflow.defn
class ServiceArchitectureWorkflow:
    """
    Demonstriert Service-Architektur-Konzepte.
    """

    @workflow.run
    async def run(self) -> dict:
        workflow.logger.info("Workflow started - event logged in history")

        # Frontend → History: Workflow gestartet
        # History → Database: WorkflowExecutionStarted Event
        # History → History Cache: Mutable State gecached

        steps = ["Frontend processing", "History service update", "Task scheduling"]

        for i, step in enumerate(steps, 1):
            workflow.logger.info(f"Step {i}: {step}")
            # Jedes Log → Event in History

        # History → Matching: Workflow Task scheduled
        # Matching → Worker: Task dispatched (hoffentlich Sync Match!)

        workflow.logger.info("Workflow completed - final event in history")

        return {
            "message": "Architecture demonstration complete",
            "steps_completed": len(steps)
        }

📁 Code-Beispiel: ../examples/part-01/chapter-03/service_interaction.py

Workflow ausführen:

# Terminal 1: Worker starten
cd ../examples/part-01/chapter-03
uv run python -m temporalio.worker \
    --task-queue book-examples \
    service_interaction

# Terminal 2: Workflow starten
uv run python service_interaction.py

Ausgabe zeigt Service-Interaktionen:

=== Temporal Service Architecture Demonstration ===

1. Client connecting to Temporal Frontend...
   ✓ Connected to Temporal service

2. Starting workflow (ID: architecture-demo-001)
   Frontend schedules task...
   History service creates event log...
   ✓ Workflow started

3. Waiting for workflow completion...
   Worker polls task queue...
   Worker executes workflow code...
   History service logs each event...
   ✓ Workflow completed

4. Accessing workflow history...
   ✓ Retrieved 17 events from history service

=== Architecture Components Demonstrated ===
✓ Client - Initiated workflow
✓ Frontend - Accepted workflow request
✓ History Service - Stored event log
✓ Task Queue - Delivered tasks to worker
✓ Worker - Executed workflow code

3.10 Zusammenfassung

In diesem Kapitel haben wir die Architektur des Temporal Service im Detail kennengelernt:

Die vier Kernkomponenten:

  1. Frontend Service – Stateless API Gateway

    • Entry Point für alle Requests
    • Rate Limiting und Validation
    • Routing zu History und Matching
  2. History Service – State Management

    • Verwaltet Workflow Execution Lifecycle
    • Event Sourcing mit Mutable State + Immutable Events
    • Sharded für Parallelität
    • Interne Task Queues (Transfer, Timer, Visibility, etc.)
  3. Matching Service – Task Queue Management

    • Verwaltet alle user-facing Task Queues
    • Partitioned für Skalierung
    • Sync Match (optimal) vs Async Match (Backlog)
    • Long-Polling Protocol
  4. Worker Service – Interne Operationen

    • Replication, Archival, System Workflows
    • Unterschied zu User Worker Processes

Persistence Layer:

  • Cassandra, PostgreSQL, MySQL
  • Event History + Mutable State
  • Visibility Store (Database oder Elasticsearch)
  • Strong Consistency bei Writes

Kommunikationsflüsse:

  • Client → Frontend → History → Database
  • History → Matching → Worker (Long-Poll)
  • Event Sourcing garantiert Consistency

Skalierung:

  • Unabhängige Service-Skalierung
  • Frontend: Unbegrenzt horizontal
  • History: Via Shard-Distribution
  • Matching: Via Partition-Distribution
  • Multi-Cluster für High Availability

Performance-Optimierungen:

  • Sticky Execution (10-100x schneller)
  • Sync Match (kein DB Round-Trip)
  • Mutable State Cache
  • Partitioning für Parallelität

graph TB
    Client[Client/Worker]
    FE[Frontend<br/>Stateless API]
    HS[History<br/>Sharded State]
    MS[Matching<br/>Partitioned Queues]
    DB[(Database<br/>Cassandra/PG/MySQL)]
    ES[(Elasticsearch<br/>Visibility)]

    Client -->|gRPC| FE
    FE --> HS
    FE --> MS
    HS -->|Events| DB
    HS -->|Enqueue| MS
    HS -->|Index| ES
    MS -->|Backlog| DB

    style FE fill:#e1f5ff
    style HS fill:#ffe1e1
    style MS fill:#fff4e1
    style DB fill:#e1ffe1
    style ES fill:#ffffcc

Mit diesem tiefen Verständnis der Temporal Service Architektur können wir nun in Teil II eintauchen, wo wir uns auf die praktische Nutzung der SDKs konzentrieren und fortgeschrittene Entwicklungstechniken erlernen.


⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 4: Entwicklungs-Setup und SDK-Auswahl

Code-Beispiele für dieses Kapitel: examples/part-01/chapter-03/

Kapitel 4: Entwicklungs-Setup und SDK-Auswahl

Willkommen zu Teil II des Buches! Nachdem wir in Teil I die theoretischen Grundlagen von Temporal kennengelernt haben, tauchen wir nun in die praktische Entwicklung ein. In diesem Kapitel richten wir die komplette Entwicklungsumgebung ein, wählen das richtige SDK aus und lernen die Tools kennen, die uns bei der täglichen Arbeit mit Temporal unterstützen.

4.1 SDK-Übersicht: Die Wahl der richtigen Sprache

4.1.1 Verfügbare SDKs

Temporal bietet sieben offizielle SDKs für verschiedene Programmiersprachen (Stand 2025):

graph TB
    subgraph "Native Implementations"
        Go[Go SDK<br/>Native Implementation]
        Java[Java SDK<br/>Native Implementation]
    end

    subgraph "Rust Core SDK Based"
        TS[TypeScript SDK<br/>Built on Rust Core]
        Python[Python SDK<br/>Built on Rust Core]
        DotNet[.NET SDK<br/>Built on Rust Core]
        PHP[PHP SDK<br/>Built on Rust Core]
        Ruby[Ruby SDK<br/>Built on Rust Core<br/>Pre-release]
    end

    RustCore[Rust Core SDK<br/>Shared Implementation]

    TS -.-> RustCore
    Python -.-> RustCore
    DotNet -.-> RustCore
    PHP -.-> RustCore
    Ruby -.-> RustCore

    style Go fill:#e1f5ff
    style Java fill:#ffe1e1
    style Python fill:#90EE90
    style TS fill:#fff4e1
    style DotNet fill:#e1ffe1
    style PHP fill:#ffffcc
    style Ruby fill:#FFB6C1
    style RustCore fill:#cccccc

Architektur-Unterschiede:

Native Implementationen (Go, Java):

  • Komplett eigenständige Implementierungen
  • Eigene Metric-Standards (Sekunden statt Millisekunden)
  • Lange etabliert und battle-tested

Rust Core SDK Based (TypeScript, Python, .NET, PHP, Ruby):

  • Teilen dieselbe Rust-basierte Core-Implementierung
  • Metrics in Millisekunden
  • Effizientere Ressourcennutzung
  • Einheitliches Verhalten über SDKs hinweg

4.1.2 Python SDK: Unser Fokus

Warum Python?

# Python bietet native async/await Unterstützung
import asyncio

from temporalio import workflow

@workflow.defn
class DataProcessingWorkflow:
    @workflow.run
    async def run(self, data: list[str]) -> dict:
        # Natürliches async/await für parallele Operationen
        # (process_item startet in der Praxis z.B. pro Item eine Activity)
        results = await asyncio.gather(
            *[self.process_item(item) for item in data]
        )
        return {"processed": len(results)}

    async def process_item(self, item: str) -> str:
        return item.upper()

Python SDK Stärken:

  1. Async/Await Native: Perfekt für Workflows mit Timern und parallelen Tasks
  2. Type Safety: Vollständige Type Hints mit Generics
  3. Workflow Sandbox: Automatische Modul-Neuladung für Determinismus
  4. ML/AI Ecosystem: Ideal für Data Science, Machine Learning, LLM-Projekte
  5. Entwickler-Freundlichkeit: Pythonic API, saubere Syntax

Technische Anforderungen:

# Python Version
Python >= 3.10  (dieses Buch verwendet 3.13)

# Core Dependencies
temporalio >= 1.0.0
protobuf >= 3.20, < 7.0.0

4.1.3 Wann welches SDK?

Entscheidungsmatrix:

Szenario                        | Empfohlenes SDK | Grund
Data Science, ML, AI            | Python          | Ecosystem, Libraries
High-Performance Microservices  | Go              | Performance, Concurrency
Enterprise Backend              | Java            | JVM Ecosystem, Legacy Integration
Web Development                 | TypeScript      | Node.js, Frontend-Integration
.NET Shops                      | .NET            | C# Integration, Performance
Polyglot Architektur            | Mix             | Go API + Python Workers für ML

Feature Parity: Alle Major SDKs (Go, Java, TypeScript, Python, .NET) sind production-ready mit vollständiger Feature-Parität.

4.2 Lokale Entwicklungsumgebung

4.2.1 Temporal Server Optionen

Option 1: Temporal CLI Dev Server (Empfohlen für Einstieg)

# Installation Temporal CLI
# macOS/Linux:
brew install temporal

# Oder: Download binary von CDN

# Dev Server starten
temporal server start-dev

# Mit persistenter SQLite-Datenbank
temporal server start-dev --db-filename temporal.db

# In Docker
docker run --rm -p 7233:7233 -p 8233:8233 \
    temporalio/temporal server start-dev --ip 0.0.0.0

Eigenschaften:

  • Ports: gRPC auf localhost:7233, Web UI auf http://localhost:8233
  • Database: In-Memory (ohne --db-filename) oder SQLite
  • Features: Embedded Server, Web UI, Default Namespace
  • Ideal für: Erste Schritte, Tutorials, lokales Testen

Option 2: Docker Compose (Production-like)

# Temporal Docker Compose Setup klonen
git clone https://github.com/temporalio/docker-compose.git
cd docker-compose

# Starten
docker compose up

# Im Hintergrund
docker compose up -d

Komponenten:

services:
  postgresql:      # Port 5432, Credentials: temporal/temporal
  elasticsearch:   # Port 9200, Single-Node Mode
  temporal:        # gRPC: 7233, Web UI: 8080
  temporal-admin-tools:
  temporal-ui:

Ideal für:

  • Production-ähnliche lokale Entwicklung
  • Testing mit Elasticsearch Visibility
  • Multi-Service Integration Tests

Option 3: Temporalite (Leichtgewichtig)

# Standalone Binary mit SQLite
# Weniger Ressourcen als Docker Compose
# Nur für Development/Testing

Hinweis: Temporalite ist inzwischen archiviert; seine Funktionalität ist im Temporal CLI Dev Server (temporal server start-dev) aufgegangen.

Vergleich der Optionen:

graph LR
    subgraph "Development Journey"
        Start[Start Learning]
        Dev[Active Development]
        PreProd[Pre-Production Testing]
    end

    Start -->|Use| CLI[CLI Dev Server<br/>Schnell, Einfach]
    Dev -->|Use| Docker[Docker Compose<br/>Production-like]
    PreProd -->|Use| Full[Full Deployment<br/>Kubernetes]

    style CLI fill:#90EE90
    style Docker fill:#fff4e1
    style Full fill:#ffe1e1

4.2.2 Python Entwicklungsumgebung

Moderne Toolchain mit uv (Empfohlen 2025):

# uv installieren (10-100x schneller als pip)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Projekt erstellen
mkdir my-temporal-project
cd my-temporal-project

# Python-Version festlegen
echo "3.13" > .python-version

# pyproject.toml erstellen
cat > pyproject.toml << 'EOF'
[project]
name = "my-temporal-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "temporalio>=1.0.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.4.0",
    "pytest-asyncio>=0.21.0",
]
EOF

# Dependencies installieren
uv sync

# Script ausführen (uv managed venv automatisch)
uv run python worker.py

Traditioneller Ansatz (falls uv nicht verfügbar):

# Virtual Environment erstellen
python -m venv venv

# Aktivieren
source venv/bin/activate  # Windows: venv\Scripts\activate

# Dependencies installieren
pip install temporalio

# Mit optionalen Features
pip install temporalio[opentelemetry,pydantic]

Temporal SDK Extras:

# gRPC Support
pip install temporalio[grpc]

# OpenTelemetry für Tracing
pip install temporalio[opentelemetry]

# Pydantic Integration
pip install temporalio[pydantic]

# Alles
pip install temporalio[grpc,opentelemetry,pydantic]

4.2.3 IDE Setup und Debugging

VSCode Configuration:

// .vscode/launch.json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Worker",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/workers/worker.py",
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "TEMPORAL_ADDRESS": "localhost:7233",
                "LOG_LEVEL": "DEBUG"
            }
        },
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal"
        }
    ]
}

Debugging-Einschränkungen:

from temporalio import workflow, activity

# ✅ Breakpoints funktionieren
@activity.defn
async def my_activity(param: str) -> str:
    # Breakpoint hier funktioniert!
    result = process(param)
    return result

# ❌ Breakpoints funktionieren NICHT (Sandbox-Limitation)
@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self, input: str) -> str:
        # Breakpoint hier wird NICHT getroffen
        result = await workflow.execute_activity(...)
        return result

Workaround: Nutze workflow.logger für Debugging in Workflows:

@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        workflow.logger.info(f"Processing order: {order_id}")

        result = await workflow.execute_activity(...)
        workflow.logger.debug(f"Activity result: {result}")

        return result

4.3 Projekt-Struktur und Best Practices

4.3.1 Empfohlene Verzeichnisstruktur

my-temporal-project/
├── .env                        # Environment Variables (nicht committen!)
├── .gitignore
├── .python-version             # Python 3.13
├── pyproject.toml              # Projekt-Konfiguration
├── temporal.toml               # Temporal Multi-Environment Config
├── README.md
│
├── src/
│   ├── __init__.py
│   │
│   ├── workflows/              # Alle Workflow-Definitionen
│   │   ├── __init__.py
│   │   ├── order_workflow.py
│   │   └── payment_workflow.py
│   │
│   ├── activities/             # Alle Activity-Implementierungen
│   │   ├── __init__.py
│   │   ├── email_activities.py
│   │   └── payment_activities.py
│   │
│   ├── models/                 # Shared Types und Dataclasses
│   │   ├── __init__.py
│   │   └── order_models.py
│   │
│   └── workers/                # Worker-Prozesse
│       ├── __init__.py
│       └── main_worker.py
│
├── tests/                      # Test Suite
│   ├── __init__.py
│   ├── conftest.py            # pytest Fixtures
│   ├── test_workflows/
│   │   └── test_order_workflow.py
│   └── test_activities/
│       └── test_email_activities.py
│
└── scripts/                    # Helper Scripts
    ├── start_worker.sh
    └── deploy.sh

4.3.2 Type-Safe Workflow Development

Input/Output mit Dataclasses:

# src/models/order_models.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class OrderInput:
    order_id: str
    customer_email: str
    items: list[str]
    total_amount: float

@dataclass
class OrderResult:
    success: bool
    transaction_id: Optional[str] = None
    error_message: Optional[str] = None

# src/workflows/order_workflow.py
from temporalio import workflow
from datetime import timedelta
from ..models.order_models import OrderInput, OrderResult
from ..activities.payment_activities import process_payment
from ..activities.email_activities import send_confirmation

@workflow.defn
class OrderWorkflow:
    """
    Orchestriert den Order-Processing-Flow.
    """

    @workflow.run
    async def run(self, input: OrderInput) -> OrderResult:
        """
        Verarbeitet eine Bestellung.

        Args:
            input: Order-Daten mit allen relevanten Informationen

        Returns:
            OrderResult mit Transaction ID oder Fehler
        """
        workflow.logger.info(f"Processing order {input.order_id}")

        try:
            # Payment verarbeiten
            transaction_id = await workflow.execute_activity(
                process_payment,
                args=[input.total_amount, input.customer_email],
                start_to_close_timeout=timedelta(seconds=30),
            )

            # Confirmation Email senden
            await workflow.execute_activity(
                send_confirmation,
                args=[input.customer_email, input.order_id, transaction_id],
                start_to_close_timeout=timedelta(seconds=10),
            )

            return OrderResult(
                success=True,
                transaction_id=transaction_id
            )

        except Exception as e:
            workflow.logger.error(f"Order processing failed: {e}")
            return OrderResult(
                success=False,
                error_message=str(e)
            )

Vorteile von Dataclasses:

  • ✅ Typsicherheit mit IDE-Unterstützung
  • ✅ Einfaches Hinzufügen neuer Felder (mit Defaults)
  • ✅ Automatische Serialisierung/Deserialisierung
  • ✅ Bessere Lesbarkeit als *args, **kwargs

4.3.3 Configuration Management

Multi-Environment Setup mit temporal.toml:

# temporal.toml - Multi-Environment Configuration

# Default für lokale Entwicklung
[default]
target = "localhost:7233"
namespace = "default"

# Development Environment
[dev]
target = "localhost:7233"
namespace = "development"

# Staging Environment
[staging]
target = "staging.temporal.example.com:7233"
namespace = "staging"
tls_cert_path = "/path/to/staging-cert.pem"
tls_key_path = "/path/to/staging-key.pem"

# Production Environment
[prod]
target = "prod.temporal.example.com:7233"
namespace = "production"
tls_cert_path = "/path/to/prod-cert.pem"
tls_key_path = "/path/to/prod-key.pem"

Environment-basierte Client-Konfiguration:

# src/config.py
import os
from temporalio.client import Client
from temporalio.envconfig import load_client_config

async def create_client() -> Client:
    """
    Erstellt Temporal Client basierend auf TEMPORAL_PROFILE env var.

    Beispiel:
        export TEMPORAL_PROFILE=prod
        python worker.py  # Verbindet zu Production
    """
    profile = os.getenv("TEMPORAL_PROFILE", "default")
    config = load_client_config(profile=profile)

    client = await Client.connect(**config)
    return client

.env File (nicht committen!):

# .env - Lokale Entwicklung
TEMPORAL_ADDRESS=localhost:7233
TEMPORAL_NAMESPACE=default
TEMPORAL_TASK_QUEUE=order-processing
TEMPORAL_PROFILE=dev

# Application Config
LOG_LEVEL=DEBUG
DATABASE_URL=postgresql://user:pass@localhost:5432/orders

# External Services
SMTP_SERVER=smtp.example.com
SMTP_PORT=587
STRIPE_API_KEY=sk_test_...

Config Loading mit Pydantic:

# src/config.py
from pydantic_settings import BaseSettings
from functools import lru_cache

class Settings(BaseSettings):
    # Temporal Config
    temporal_address: str = "localhost:7233"
    temporal_namespace: str = "default"
    temporal_task_queue: str = "default"

    # Application Config
    log_level: str = "INFO"
    database_url: str

    # External Services
    smtp_server: str
    smtp_port: int = 587
    stripe_api_key: str

    class Config:
        env_file = ".env"
        case_sensitive = False

@lru_cache()
def get_settings() -> Settings:
    """Singleton Settings Instance."""
    return Settings()

# Usage
from config import get_settings
settings = get_settings()

4.4 Development Workflow

4.4.1 Worker Development Loop

Typischer Entwicklungs-Workflow:

graph TB
    Start[Code Ändern]
    Restart[Worker Neustarten]
    Test[Workflow Testen]
    Debug[Debuggen mit Web UI]
    Fix[Fehler Fixen]

    Start --> Restart
    Restart --> Test
    Test -->|Error| Debug
    Debug --> Fix
    Fix --> Start
    Test -->|Success| Done[Feature Complete]

    style Start fill:#e1f5ff
    style Test fill:#fff4e1
    style Debug fill:#ffe1e1
    style Done fill:#90EE90

1. Worker starten:

# Terminal 1: Temporal Server (falls nicht bereits läuft)
temporal server start-dev --db-filename temporal.db

# Terminal 2: Worker starten
uv run python src/workers/main_worker.py

# Oder mit Environment
TEMPORAL_PROFILE=dev LOG_LEVEL=DEBUG uv run python src/workers/main_worker.py

2. Workflow ausführen:

# scripts/run_workflow.py
import asyncio
from temporalio.client import Client
from src.workflows.order_workflow import OrderWorkflow
from src.models.order_models import OrderInput

async def main():
    client = await Client.connect("localhost:7233")

    input_data = OrderInput(
        order_id="ORD-123",
        customer_email="customer@example.com",
        items=["Item A", "Item B"],
        total_amount=99.99
    )

    result = await client.execute_workflow(
        OrderWorkflow.run,
        input_data,
        id="order-ORD-123",
        task_queue="order-processing",
    )

    print(f"Result: {result}")

if __name__ == "__main__":
    asyncio.run(main())

# Workflow ausführen
uv run python scripts/run_workflow.py

3. Debugging mit Web UI:

# Web UI öffnen
http://localhost:8233

# Workflow suchen: order-ORD-123
# → Event History inspizieren
# → Input/Output anzeigen
# → Stack Trace bei Errors

4. Code-Änderungen ohne Downtime:

# Bei Code-Änderungen:
# 1. Worker mit Ctrl+C stoppen
# 2. Code ändern
# 3. Worker neu starten

# Laufende Workflows:
# → Werden automatisch fortgesetzt
# → Event History bleibt erhalten
# → Bei Breaking Changes: Workflow Versioning nutzen (Kapitel 9)

4.4.2 Temporal CLI Commands

Wichtigste Commands für Development:

# Workflows auflisten
temporal workflow list

# Workflow Details
temporal workflow describe -w order-ORD-123

# Event History anzeigen
temporal workflow show -w order-ORD-123

# Event History als JSON exportieren
temporal workflow show -w order-ORD-123 > history.json

# Workflow starten
temporal workflow start \
  --task-queue order-processing \
  --type OrderWorkflow \
  --workflow-id order-ORD-456 \
  --input '{"order_id": "ORD-456", "customer_email": "test@example.com", ...}'

# Workflow ausführen und auf Result warten
temporal workflow execute \
  --task-queue order-processing \
  --type OrderWorkflow \
  --workflow-id order-ORD-789 \
  --input @input.json

# Workflow canceln
temporal workflow cancel -w order-ORD-123

# Workflow terminieren (hard stop)
temporal workflow terminate -w order-ORD-123

# Signal senden
temporal workflow signal \
  --workflow-id order-ORD-123 \
  --name add-item \
  --input '{"item": "Item C"}'

# Query ausführen
temporal workflow query \
  --workflow-id order-ORD-123 \
  --type get-status

# Workflow Count
temporal workflow count

Temporal Web UI Navigation:

http://localhost:8233
│
├── Workflows → Alle Workflow Executions
│   ├── Filter (Status, Type, Time Range)
│   ├── Search (Workflow ID, Run ID)
│   └── Details → Event History
│       ├── Timeline View
│       ├── Compact View
│       ├── Full History
│       └── JSON Export
│
├── Namespaces → Namespace Management
├── Archival → Archived Workflows
└── Settings → Server Configuration

4.5 Testing und Debugging

4.5.1 Test-Setup mit pytest

Dependencies installieren:

# pyproject.toml
[project.optional-dependencies]
dev = [
    "pytest>=7.4.0",
    "pytest-asyncio>=0.21.0",
    "pytest-cov>=4.1.0",  # Coverage reporting
]

uv sync --all-extras

pytest Configuration:

# pyproject.toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]

4.5.2 Workflow Testing mit Time-Skipping

Integration Test:

# tests/test_workflows/test_order_workflow.py
import pytest
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

from src.workflows.order_workflow import OrderWorkflow
from src.activities.payment_activities import process_payment
from src.activities.email_activities import send_confirmation
from src.models.order_models import OrderInput

@pytest.mark.asyncio
async def test_order_workflow_success():
    """Test erfolgreicher Order-Processing Flow."""

    # Time-Skipping Test Environment
    async with await WorkflowEnvironment.start_time_skipping() as env:
        # Worker mit Workflows und Activities
        async with Worker(
            env.client,
            task_queue="test-queue",
            workflows=[OrderWorkflow],
            activities=[process_payment, send_confirmation],
        ):
            # Workflow ausführen
            input_data = OrderInput(
                order_id="TEST-001",
                customer_email="test@example.com",
                items=["Item A"],
                total_amount=49.99
            )

            result = await env.client.execute_workflow(
                OrderWorkflow.run,
                input_data,
                id="test-order-001",
                task_queue="test-queue",
            )

            # Assertions
            assert result.success is True
            assert result.transaction_id is not None
            assert result.error_message is None

Test mit gemockten Activities:

from temporalio import activity

@activity.defn
async def mock_process_payment(amount: float, email: str) -> str:
    """Mock Payment Activity für Tests."""
    return f"mock-txn-{amount}"

@activity.defn
async def mock_send_confirmation(email: str, order_id: str, txn_id: str) -> None:
    """Mock Email Activity für Tests."""
    pass

@pytest.mark.asyncio
async def test_order_workflow_with_mocks():
    """Test mit gemockten Activities."""

    async with await WorkflowEnvironment.start_time_skipping() as env:
        async with Worker(
            env.client,
            task_queue="test-queue",
            workflows=[OrderWorkflow],
            activities=[mock_process_payment, mock_send_confirmation],  # Mocks!
        ):
            result = await env.client.execute_workflow(...)

            assert result.success is True

Time Skipping für Timeouts:

from datetime import timedelta

@pytest.mark.asyncio
async def test_workflow_with_long_timeout():
    """Test Workflow mit 7 Tagen Sleep - läuft sofort!"""

    async with await WorkflowEnvironment.start_time_skipping() as env:
        async with Worker(...):
            # Workflow mit 7-Tage Sleep
            # Time-Skipping: Läuft in Millisekunden statt 7 Tagen!
            result = await env.client.execute_workflow(
                WeeklyReportWorkflow.run,
                ...
            )

            assert result.week_number == 1

4.5.3 Activity Unit Tests

# tests/test_activities/test_payment_activities.py
import pytest
from src.activities.payment_activities import process_payment

@pytest.mark.asyncio
async def test_process_payment_success():
    """Test Payment Activity in Isolation."""

    # Activity direkt testen (ohne Temporal)
    result = await process_payment(
        amount=99.99,
        email="test@example.com"
    )

    assert result.startswith("txn_")
    assert len(result) > 10

@pytest.mark.asyncio
async def test_process_payment_invalid_amount():
    """Test Payment Activity mit ungültigem Amount."""

    from temporalio.exceptions import ApplicationError

    with pytest.raises(ApplicationError) as exc_info:
        await process_payment(
            amount=-10.00,  # Ungültig!
            email="test@example.com"
        )

    assert "Invalid amount" in str(exc_info.value)
    assert exc_info.value.non_retryable is True

4.5.4 Replay Testing

Workflow History exportieren:

# Via CLI
temporal workflow show -w order-ORD-123 > workflow_history.json

# Via Web UI
# → Workflow Details → JSON Tab → Copy

Replay Test:

# tests/test_workflows/test_replay.py
import pytest
from temporalio.worker import Replayer
from temporalio.client import WorkflowHistory
from src.workflows.order_workflow import OrderWorkflow

@pytest.mark.asyncio
async def test_replay_order_workflow():
    """Test Workflow Replay für Non-Determinism."""

    # History laden (from_json erwartet Workflow-ID und exportiertes JSON)
    with open("tests/fixtures/order_workflow_history.json") as f:
        history = WorkflowHistory.from_json("order-ORD-123", f.read())

    # Replayer erstellen
    replayer = Replayer(
        workflows=[OrderWorkflow]
    )

    # Replay (wirft Exception bei Non-Determinism)
    await replayer.replay_workflow(history)

Warum Replay Testing?

  • Erkennt Non-Deterministic Errors bei Code-Änderungen
  • Verifiziert Workflow-Kompatibilität mit alten Executions
  • Verhindert Production-Crashes durch Breaking Changes

4.5.5 Coverage und CI/CD

# Test Coverage
pytest --cov=src --cov-report=html tests/

# Output
# Coverage: 87%
# HTML Report: htmlcov/index.html

# Im CI/CD (GitHub Actions Beispiel)
# .github/workflows/test.yml
name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.13'

      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh

      - name: Install dependencies
        run: uv sync --all-extras

      - name: Run tests
        run: uv run pytest --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3

4.6 Debugging und Observability

4.6.1 Logging Best Practices

import logging
from temporalio import workflow, activity

# Root Logger konfigurieren
logging.basicConfig(
    level=logging.INFO,  # DEBUG für Development
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, input: OrderInput) -> OrderResult:
        # Workflow Logger nutzen (nicht logging.getLogger!)
        workflow.logger.info(f"Processing order {input.order_id}")
        workflow.logger.debug(f"Input details: {input}")

        try:
            result = await workflow.execute_activity(...)
            workflow.logger.info(f"Payment successful: {result}")
            return OrderResult(success=True, transaction_id=result)

        except Exception as e:
            workflow.logger.error(f"Order failed: {e}", exc_info=True)
            return OrderResult(success=False, error_message=str(e))

@activity.defn
async def process_payment(amount: float, email: str) -> str:
    # Activity Logger nutzen
    activity.logger.info(f"Processing payment: {amount} for {email}")

    try:
        txn_id = charge_card(amount, email)
        activity.logger.info(f"Payment successful: {txn_id}")
        return txn_id
    except Exception as e:
        activity.logger.error(f"Payment failed: {e}", exc_info=True)
        raise

Log Levels:

  • DEBUG: Development, detailliertes Troubleshooting
  • INFO: Production, wichtige Events
  • WARNING: Potentielle Probleme
  • ERROR: Fehler die gehandled werden
  • CRITICAL: Kritische Fehler

4.6.2 Prometheus Metrics

from temporalio.runtime import Runtime, TelemetryConfig, PrometheusConfig
from temporalio.client import Client

# Prometheus Endpoint konfigurieren
runtime = Runtime(
    telemetry=TelemetryConfig(
        metrics=PrometheusConfig(
            bind_address="0.0.0.0:9000"  # Metrics Port
        )
    )
)

client = await Client.connect(
    "localhost:7233",
    runtime=runtime
)

# Metrics verfügbar auf:
# http://localhost:9000/metrics

Verfügbare Metrics:

  • temporal_workflow_task_execution_total
  • temporal_activity_execution_total
  • temporal_workflow_task_execution_latency_ms
  • temporal_activity_execution_latency_ms
  • temporal_worker_task_slots_available
  • temporal_sticky_cache_hit_total

4.6.3 OpenTelemetry Tracing

# Installation
pip install temporalio[opentelemetry]

from temporalio.contrib.opentelemetry import TracingInterceptor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, BatchSpanProcessor

# OpenTelemetry Setup
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Temporal Client mit Tracing
interceptor = TracingInterceptor()
client = await Client.connect(
    "localhost:7233",
    interceptors=[interceptor]
)

# Alle Workflow/Activity Executions werden getracet

4.7 Zusammenfassung

In diesem Kapitel haben wir die komplette Entwicklungsumgebung für Temporal aufgesetzt:

SDK-Auswahl:

  • 7 offizielle SDKs (Go, Java, TypeScript, Python, .NET, PHP, Ruby)
  • Python SDK: Python >= 3.10, Rust Core SDK, Type-Safe, Async/Await
  • Feature Parity über alle Major SDKs

Lokale Entwicklung:

  • Temporal Server: CLI Dev Server (Einstieg), Docker Compose (Production-like)
  • Python Setup: uv (modern, schnell) oder venv (traditionell)
  • IDE: VSCode/PyCharm mit Debugging (Limitations in Workflows!)

Projekt-Struktur:

  • Separation: workflows/, activities/, models/, workers/
  • Type-Safe Dataclasses für Input/Output
  • Multi-Environment Config (temporal.toml, .env)

Development Workflow:

  • Worker Development Loop: Code → Restart → Test → Debug
  • Temporal CLI: workflow list/show/execute/signal/query
  • Web UI: Event History, Timeline, Stack Traces

Testing:

  • pytest mit pytest-asyncio
  • Time-Skipping Environment für schnelle Tests
  • Activity Mocking
  • Replay Testing für Non-Determinism
  • Coverage Tracking

Debugging & Observability:

  • Logging: workflow.logger, activity.logger
  • Prometheus Metrics auf Port 9000
  • OpenTelemetry Tracing
  • Web UI Event History Inspection

graph TB
    Code[Code Schreiben]
    Test[Testen mit pytest]
    Debug[Debuggen mit Web UI]
    Deploy[Deployment]

    Code -->|Type-Safe Dataclasses| Test
    Test -->|Time-Skipping| Debug
    Debug -->|Logging + Metrics| Deploy
    Deploy -->|Observability| Monitor[Monitoring]

    style Code fill:#e1f5ff
    style Test fill:#fff4e1
    style Debug fill:#ffe1e1
    style Deploy fill:#90EE90
    style Monitor fill:#ffffcc

⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 5: Workflows programmieren

Code-Beispiele für dieses Kapitel: examples/part-02/chapter-04/

Praktische Übung: Richten Sie Ihre lokale Entwicklungsumgebung ein und erstellen Sie Ihren ersten eigenen Workflow!

Kapitel 5: Workflows programmieren

Nachdem wir die Entwicklungsumgebung in Kapitel 4 aufgesetzt haben, tauchen wir nun tief in die praktische Programmierung von Workflows ein. In diesem Kapitel lernen Sie fortgeschrittene Patterns kennen, die Ihnen helfen, robuste, skalierbare und wartbare Workflow-Anwendungen zu bauen.

5.1 Workflow-Komposition: Activities vs Child Workflows

5.1.1 Die goldene Regel

Activities sind die Default-Wahl. Nutzen Sie Child Workflows nur für spezifische Use Cases!

graph TB
    Start{Aufgabe zu erledigen}
    Start -->|Standard| Activity[Activity nutzen]
    Start -->|Spezialfall| Decision{Benötigen Sie...}

    Decision -->|Workload Partitionierung<br/>1000+ Activities| Child1[Child Workflow]
    Decision -->|Service Separation<br/>Eigener Worker Pool| Child2[Child Workflow]
    Decision -->|Resource Mapping<br/>Serialisierung| Child3[Child Workflow]
    Decision -->|Periodische Logic<br/>Continue-As-New| Child4[Child Workflow]
    Decision -->|Einfache Business Logic| Activity

    style Activity fill:#90EE90
    style Child1 fill:#fff4e1
    style Child2 fill:#fff4e1
    style Child3 fill:#fff4e1
    style Child4 fill:#fff4e1

5.1.2 Wann Activities nutzen

Activities sind perfekt für:

  • Business Logic (API-Aufrufe, Datenbank-Operationen)
  • Alle nicht-deterministischen Operationen
  • Automatische Retries
  • Niedrigerer Overhead (weniger Events in History)

from temporalio import workflow, activity
from datetime import timedelta

@activity.defn
async def send_email(to: str, subject: str, body: str) -> bool:
    """Activity für E-Mail-Versand (nicht-deterministisch)."""
    # Externer API-Aufruf - perfekt für Activity
    result = await email_service.send(to, subject, body)
    return result.success

@activity.defn
async def charge_credit_card(amount: float, card_token: str) -> str:
    """Activity für Payment Processing."""
    # Externe Payment API
    transaction = await payment_api.charge(amount, card_token)
    return transaction.id

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_data: dict) -> dict:
        # Activities für Business Logic
        transaction_id = await workflow.execute_activity(
            charge_credit_card,
            args=[order_data["amount"], order_data["card_token"]],
            start_to_close_timeout=timedelta(seconds=30)
        )

        await workflow.execute_activity(
            send_email,
            args=[
                order_data["customer_email"],
                "Order Confirmation",
                f"Your order is confirmed. Transaction: {transaction_id}"
            ],
            start_to_close_timeout=timedelta(seconds=10)
        )

        return {"transaction_id": transaction_id, "status": "completed"}

5.1.3 Wann Child Workflows nutzen

Use Case 1: Workload Partitionierung

Für massive Fan-Outs (>1000 Activities):

import asyncio
from datetime import timedelta

from temporalio import workflow
from temporalio.workflow import ParentClosePolicy

@workflow.defn
class BatchCoordinatorWorkflow:
    """
    Koordiniert Verarbeitung von 100.000 Items.
    Nutzt Child Workflows zur Partitionierung.
    """

    @workflow.run
    async def run(self, total_items: int) -> dict:
        batch_size = 1000
        num_batches = (total_items + batch_size - 1) // batch_size

        workflow.logger.info(f"Processing {total_items} items in {num_batches} batches")

        # Starte Child Workflows (max ~1000)
        batch_handles = []
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = min((i + 1) * batch_size, total_items)

            handle = await workflow.start_child_workflow(
                BatchProcessorWorkflow.run,
                args=[{"start": start_idx, "end": end_idx}],
                id=f"batch-{workflow.info().workflow_id}-{i}",
                parent_close_policy=ParentClosePolicy.ABANDON,
            )
            batch_handles.append(handle)

        # Warte auf alle Batches
        results = await asyncio.gather(*batch_handles)

        return {
            "total_batches": num_batches,
            "total_processed": sum(r["processed"] for r in results)
        }

@workflow.defn
class BatchProcessorWorkflow:
    """
    Verarbeitet einen Batch von ~1000 Items.
    Jedes Child Workflow hat eigene Event History.
    """

    @workflow.run
    async def run(self, params: dict) -> dict:
        # Verarbeite bis zu 1000 Activities
        tasks = [
            workflow.execute_activity(
                process_single_item,
                item_id,
                start_to_close_timeout=timedelta(seconds=30)
            )
            for item_id in range(params["start"], params["end"])
        ]

        results = await asyncio.gather(*tasks)

        return {"processed": len(results)}

Warum Child Workflows hier?

  • Parent Workflow: 100 Batches → ~200 Events
  • Jedes Child: 1000 Activities → ~2000 Events
  • Ohne Child Workflows: 100.000 Activities → ~200.000 Events in einer History (Fehler!)
  • Mit Child Workflows: Verteilung über 100 separate Histories

Use Case 2: Service Separation

@workflow.defn
class OrderFulfillmentWorkflow:
    """
    Koordiniert verschiedene Microservices via Child Workflows.
    """

    @workflow.run
    async def run(self, order_id: str) -> dict:
        # Parallele Child Workflows auf verschiedenen Task Queues
        inventory_handle = await workflow.start_child_workflow(
            InventoryWorkflow.run,
            args=[order_id],
            task_queue="inventory-service",  # Eigener Worker Pool
            id=f"inventory-{order_id}",
        )

        shipping_handle = await workflow.start_child_workflow(
            ShippingWorkflow.run,
            args=[order_id],
            task_queue="shipping-service",  # Anderer Worker Pool
            id=f"shipping-{order_id}",
        )

        # Warte auf beide Services
        inventory_result, shipping_result = await asyncio.gather(
            inventory_handle,
            shipping_handle
        )

        return {
            "inventory": inventory_result,
            "shipping": shipping_result
        }

Use Case 3: Resource Mapping (Entity Workflows)

@workflow.defn
class HostUpgradeCoordinatorWorkflow:
    """
    Upgraded mehrere Hosts - ein Child Workflow pro Host.
    """

    @workflow.run
    async def run(self, hostnames: list[str]) -> dict:
        # Jeder Hostname mapped zu eigenem Child Workflow
        # Garantiert serialisierte Operationen pro Host
        upgrade_handles = []

        for hostname in hostnames:
            handle = await workflow.start_child_workflow(
                HostUpgradeWorkflow.run,
                args=[hostname],
                id=f"host-upgrade-{hostname}",  # Eindeutige ID pro Host
            )
            upgrade_handles.append(handle)

        results = await asyncio.gather(*upgrade_handles)
        return {"upgraded": len(results)}

@workflow.defn
class HostUpgradeWorkflow:
    """
    Upgraded einen einzelnen Host.
    Multiple Aufrufe mit gleicher ID werden de-duplicated.
    """

    @workflow.run
    async def run(self, hostname: str) -> dict:
        # Alle Operationen für diesen Host serialisiert
        await workflow.execute_activity(
            stop_host,
            hostname,
            start_to_close_timeout=timedelta(minutes=5)
        )

        await workflow.execute_activity(
            upgrade_host,
            hostname,
            start_to_close_timeout=timedelta(minutes=30)
        )

        await workflow.execute_activity(
            start_host,
            hostname,
            start_to_close_timeout=timedelta(minutes=5)
        )

        return {"hostname": hostname, "status": "upgraded"}

5.1.4 Parent-Child Kommunikation

import asyncio
from dataclasses import dataclass
from datetime import timedelta

from temporalio import workflow

@dataclass
class TaskUpdate:
    task_id: str
    status: str
    progress: int

@workflow.defn
class ChildWorkerWorkflow:
    def __init__(self) -> None:
        self.task_data = None
        self.paused = False

    @workflow.run
    async def run(self) -> str:
        # Warte auf Task-Zuweisung via Signal
        await workflow.wait_condition(lambda: self.task_data is not None)

        # Verarbeite Task
        for i in range(10):
            # Prüfe Pause-Signal
            if self.paused:
                await workflow.wait_condition(lambda: not self.paused)

            await workflow.execute_activity(
                process_task_step,
                args=[self.task_data, i],
                start_to_close_timeout=timedelta(minutes=2)
            )

        return "completed"

    @workflow.signal
    def assign_task(self, task_data: dict) -> None:
        """Signal vom Parent: Task zuweisen."""
        self.task_data = task_data

    @workflow.signal
    def pause(self) -> None:
        """Signal vom Parent: Pausieren."""
        self.paused = True

    @workflow.signal
    def resume(self) -> None:
        """Signal vom Parent: Fortsetzen."""
        self.paused = False

    @workflow.query
    def get_status(self) -> dict:
        """Query vom Parent oder External Client."""
        return {
            "has_task": self.task_data is not None,
            "paused": self.paused
        }

@workflow.defn
class CoordinatorWorkflow:
    @workflow.run
    async def run(self, tasks: list[dict]) -> dict:
        # Starte Worker Child Workflows
        worker_handles = []
        for i in range(3):  # 3 Worker
            handle = await workflow.start_child_workflow(
                ChildWorkerWorkflow.run,
                id=f"worker-{i}",
            )
            worker_handles.append(handle)

        # Verteile Tasks via Signals
        for i, task in enumerate(tasks):
            worker_idx = i % len(worker_handles)
            await worker_handles[worker_idx].signal("assign_task", task)

        # Hinweis: Child-Workflow-Handles unterstützen Signals, aber keine Queries.
        # Der Status (get_status) wird von externen Clients via
        # client.get_workflow_handle("worker-0").query(...) abgefragt.

        # Warte auf Completion
        await asyncio.gather(*worker_handles)

        return {"completed_tasks": len(tasks)}

5.2 Parallele Ausführung

5.2.1 asyncio.gather für parallele Activities

import asyncio
from temporalio import workflow
from datetime import timedelta

@workflow.defn
class ParallelProcessingWorkflow:
    @workflow.run
    async def run(self, urls: list[str]) -> list[dict]:
        # Alle URLs parallel scrapen
        tasks = [
            workflow.execute_activity(
                scrape_url,
                url,
                start_to_close_timeout=timedelta(minutes=5)
            )
            for url in urls
        ]

        # Warte auf alle (Results in Order der Input-Liste)
        results = await asyncio.gather(*tasks)

        return results
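
Schlägt eine einzelne Activity endgültig fehl, bricht asyncio.gather standardmäßig mit dieser Exception ab. Sollen die übrigen Ergebnisse trotzdem eingesammelt werden, bietet sich return_exceptions=True an. Ein Sketch (TolerantParallelWorkflow ist ein Beispielname, scrape_url wie oben):

@workflow.defn
class TolerantParallelWorkflow:
    @workflow.run
    async def run(self, urls: list[str]) -> dict:
        tasks = [
            workflow.execute_activity(
                scrape_url,
                url,
                start_to_close_timeout=timedelta(minutes=5),
            )
            for url in urls
        ]

        # Exceptions werden als Listen-Elemente zurückgegeben, statt den gather abzubrechen
        results = await asyncio.gather(*tasks, return_exceptions=True)

        succeeded = [r for r in results if not isinstance(r, BaseException)]
        failed = [r for r in results if isinstance(r, BaseException)]
        return {"succeeded": len(succeeded), "failed": len(failed)}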

5.2.2 Fan-Out/Fan-In Pattern

graph LR
    Start[Workflow Start]
    FanOut[Fan-Out: Start parallel Activities]
    A1[Activity 1]
    A2[Activity 2]
    A3[Activity 3]
    A4[Activity N]
    FanIn[Fan-In: Gather Results]
    Aggregate[Aggregate Results]
    End[Workflow End]

    Start --> FanOut
    FanOut --> A1
    FanOut --> A2
    FanOut --> A3
    FanOut --> A4
    A1 --> FanIn
    A2 --> FanIn
    A3 --> FanIn
    A4 --> FanIn
    FanIn --> Aggregate
    Aggregate --> End

    style FanOut fill:#e1f5ff
    style FanIn fill:#ffe1e1
    style Aggregate fill:#fff4e1

import asyncio
from datetime import timedelta
from typing import List
from dataclasses import dataclass
from temporalio import workflow

@dataclass
class ScrapedData:
    url: str
    title: str
    content: str
    word_count: int

@workflow.defn
class FanOutFanInWorkflow:
    @workflow.run
    async def run(self, data_urls: List[str]) -> dict:
        workflow.logger.info(f"Fan-out: Scraping {len(data_urls)} URLs")

        # Fan-Out: Parallele Activities starten
        scrape_tasks = [
            workflow.execute_activity(
                scrape_url,
                url,
                start_to_close_timeout=timedelta(minutes=5)
            )
            for url in data_urls
        ]

        # Fan-In: Alle Results sammeln
        scraped_data: List[ScrapedData] = await asyncio.gather(*scrape_tasks)

        workflow.logger.info(f"Fan-in: Scraped {len(scraped_data)} pages")

        # Aggregation
        aggregated = await workflow.execute_activity(
            aggregate_scraped_data,
            scraped_data,
            start_to_close_timeout=timedelta(minutes=2)
        )

        return {
            "total_pages": len(scraped_data),
            "total_words": sum(d.word_count for d in scraped_data),
            "aggregated_insights": aggregated
        }

5.2.3 Performance-Limitierungen bei Fan-Outs

WICHTIG: Ein einzelner Workflow ist auf ~30 Activities/Sekunde limitiert, unabhängig von Ressourcen!

Lösung für massive Fan-Outs:

@workflow.defn
class ScalableFanOutWorkflow:
    """
    Für 10.000+ Items: Nutze Child Workflows zur Partitionierung.
    """

    @workflow.run
    async def run(self, total_items: int) -> dict:
        batch_size = 1000  # Items pro Child Workflow

        # Berechne Anzahl Batches
        num_batches = (total_items + batch_size - 1) // batch_size

        workflow.logger.info(
            f"Processing {total_items} items via {num_batches} child workflows"
        )

        # Fan-Out über Child Workflows
        batch_workflows = []
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = min((i + 1) * batch_size, total_items)

            handle = await workflow.start_child_workflow(
                BatchProcessorWorkflow.run,
                {"start": start_idx, "end": end_idx},
                id=f"batch-{i}",
            )
            batch_workflows.append(handle)

        # Fan-In: Warte auf alle Batches
        batch_results = await asyncio.gather(*batch_workflows)

        return {
            "batches_processed": len(batch_results),
            "total_items": total_items
        }

Performance-Matrix:

| Items        | Strategie                      | Geschätzte Zeit |
|--------------|--------------------------------|-----------------|
| 10-100       | Direkte Activities im Workflow | Sekunden        |
| 100-1.000    | asyncio.gather                 | Minuten         |
| 1.000-10.000 | Batch Processing               | 5-10 Minuten    |
| 10.000+      | Child Workflows                | 30+ Minuten     |
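
Unterhalb der Child-Workflow-Schwelle lässt sich die Parallelität innerhalb eines Workflows mit einem asyncio.Semaphore begrenzen, damit nicht alle Activities gleichzeitig gestartet werden. Ein Sketch (ThrottledFanOutWorkflow ist ein Beispielname, die Obergrenze 25 nur ein Beispielwert, process_single_item wie oben):

@workflow.defn
class ThrottledFanOutWorkflow:
    @workflow.run
    async def run(self, item_ids: list[int]) -> int:
        semaphore = asyncio.Semaphore(25)  # max. 25 gleichzeitig laufende Activities

        async def process_with_limit(item_id: int) -> dict:
            async with semaphore:
                return await workflow.execute_activity(
                    process_single_item,
                    item_id,
                    start_to_close_timeout=timedelta(seconds=30),
                )

        results = await asyncio.gather(
            *[process_with_limit(i) for i in item_ids]
        )
        return len(results)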

5.3 Timers und Scheduling

5.3.1 Durable Timer für Delays

import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class DelayWorkflow:
    @workflow.run
    async def run(self) -> str:
        workflow.logger.info("Starting workflow")

        # Sleep für 10 Sekunden (durable timer)
        await asyncio.sleep(10)
        workflow.logger.info("10 seconds passed")

        # Kann auch Monate schlafen - Resource-Light!
        await asyncio.sleep(60 * 60 * 24 * 30)  # 30 Tage
        workflow.logger.info("30 days passed")

        return "Timers completed"

Wichtig: Timers sind persistent! Worker/Service Restarts haben keinen Einfluss.

5.3.2 Timeout Patterns

@workflow.defn
class TimeoutWorkflow:
    def __init__(self) -> None:
        self.approval_received = False

    @workflow.run
    async def run(self, order_id: str) -> dict:
        workflow.logger.info(f"Awaiting approval for order {order_id}")

        try:
            # Warte auf Approval Signal oder Timeout
            await workflow.wait_condition(
                lambda: self.approval_received,
                timeout=timedelta(hours=24)  # 24h Timeout
            )

            return {"status": "approved", "order_id": order_id}

        except asyncio.TimeoutError:
            workflow.logger.warning(f"Approval timeout for order {order_id}")

            # Automatische Ablehnung nach Timeout
            await workflow.execute_activity(
                reject_order,
                order_id,
                start_to_close_timeout=timedelta(seconds=30)
            )

            return {"status": "rejected_timeout", "order_id": order_id}

    @workflow.signal
    def approve(self) -> None:
        self.approval_received = True

5.3.3 Cron Workflows mit Schedules

Moderne Methode (Empfohlen):

from temporalio.client import (
    Client,
    Schedule,
    ScheduleActionStartWorkflow,
    ScheduleSpec,
    ScheduleIntervalSpec
)
from datetime import timedelta

async def create_daily_report_schedule():
    client = await Client.connect("localhost:7233")

    # Schedule erstellen: Täglich um 9 Uhr
    await client.create_schedule(
        "daily-report-schedule",
        Schedule(
            action=ScheduleActionStartWorkflow(
                DailyReportWorkflow.run,
                id="daily-report-workflow",  # Pflichtfeld: Basis-ID der gestarteten Workflows
                task_queue="reports",
            ),
            spec=ScheduleSpec(
                # Cron Expression: Minute Hour Day Month Weekday
                cron_expressions=["0 9 * * *"],  # Täglich 9:00 UTC
            ),
        ),
    )

    # Interval-basiert: Jede Stunde
    await client.create_schedule(
        "hourly-sync-schedule",
        Schedule(
            action=ScheduleActionStartWorkflow(
                SyncWorkflow.run,
                id="hourly-sync-workflow",
                task_queue="sync",
            ),
            spec=ScheduleSpec(
                intervals=[
                    ScheduleIntervalSpec(every=timedelta(hours=1))
                ],
            ),
        ),
    )

Cron Expression Beispiele:

# Jede Minute
"* * * * *"

# Jeden Tag um Mitternacht
"0 0 * * *"

# Wochentags um 12 Uhr
"0 12 * * MON-FRI"

# Jeden Montag um 8:00
"0 8 * * MON"

# Am 1. jeden Monats
"0 0 1 * *"

# Alle 15 Minuten
"*/15 * * * *"

5.3.4 Timer Cancellation

@workflow.defn
class CancellableTimerWorkflow:
    def __init__(self) -> None:
        self.timer_cancelled = False

    @workflow.run
    async def run(self) -> str:
        # Starte 1-Stunden Timer
        sleep_task = asyncio.create_task(asyncio.sleep(3600))

        # Warte auf Timer oder Cancellation
        await workflow.wait_condition(
            lambda: self.timer_cancelled or sleep_task.done()
        )

        if self.timer_cancelled:
            # Timer canceln
            sleep_task.cancel()
            try:
                await sleep_task
            except asyncio.CancelledError:
                return "Timer was cancelled"

        return "Timer completed normally"

    @workflow.signal
    def cancel_timer(self) -> None:
        self.timer_cancelled = True

5.4 State Management und Queries

5.4.1 Workflow Instance Variables

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OrderState:
    order_id: str
    items: List[dict] = field(default_factory=list)
    total_amount: float = 0.0
    status: str = "pending"
    approvals: Dict[str, bool] = field(default_factory=dict)

@workflow.defn
class StatefulOrderWorkflow:
    def __init__(self) -> None:
        # Instance Variables halten State
        self.state = OrderState(order_id="")
        self.processing_complete = False

    @workflow.run
    async def run(self, order_id: str) -> OrderState:
        self.state.order_id = order_id
        self.state.status = "fetching_items"

        # State persistiert über Activities
        items = await workflow.execute_activity(
            fetch_order_items,
            order_id,
            start_to_close_timeout=timedelta(minutes=1)
        )
        self.state.items = items
        self.state.total_amount = sum(item["price"] for item in items)

        # Conditional basierend auf State
        if self.state.total_amount > 1000:
            self.state.status = "awaiting_approval"
            await workflow.wait_condition(
                lambda: "manager" in self.state.approvals
            )

        self.state.status = "approved"
        return self.state

    @workflow.signal
    def approve(self, approver: str) -> None:
        """Signal updated State."""
        self.state.approvals[approver] = True

    @workflow.query
    def get_state(self) -> OrderState:
        """Query liest State (read-only!)."""
        return self.state

    @workflow.query
    def get_total(self) -> float:
        return self.state.total_amount

5.4.2 State Queries für Progress Tracking

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ProgressInfo:
    phase: str
    current_step: int
    total_steps: int
    percentage: float
    start_time: Optional[datetime]
    estimated_completion: Optional[datetime] = None

@workflow.defn
class ProgressTrackingWorkflow:
    def __init__(self) -> None:
        self.progress = ProgressInfo(
            phase="initializing",
            current_step=0,
            total_steps=100,
            percentage=0.0,
            start_time=None
        )

    @workflow.run
    async def run(self, total_items: int) -> dict:
        self.progress.total_steps = total_items
        self.progress.start_time = workflow.now()  # deterministische Workflow-Zeit als datetime

        # Phase 1: Initialization
        self.progress.phase = "initialization"
        await workflow.execute_activity(
            initialize_activity,
            start_to_close_timeout=timedelta(minutes=1)
        )
        self._update_progress(10)

        # Phase 2: Processing
        self.progress.phase = "processing"
        for i in range(total_items):
            await workflow.execute_activity(
                process_item,
                i,
                start_to_close_timeout=timedelta(minutes=2)
            )
            self.progress.current_step = i + 1
            self._update_progress()

        # Phase 3: Finalization
        self.progress.phase = "finalization"
        self._update_progress(100)

        return {"completed": self.progress.current_step}

    def _update_progress(self, override_percentage: float = None) -> None:
        if override_percentage is not None:
            self.progress.percentage = override_percentage
        else:
            self.progress.percentage = (
                self.progress.current_step / self.progress.total_steps * 100
            )

        # ETA-Berechnung auf Basis der deterministischen Workflow-Zeit
        if self.progress.current_step > 0:
            elapsed = (workflow.now() - self.progress.start_time).total_seconds()
            rate = self.progress.current_step / elapsed if elapsed > 0 else 0
            remaining = self.progress.total_steps - self.progress.current_step
            eta_seconds = remaining / rate if rate > 0 else 0
            self.progress.estimated_completion = (
                workflow.now() + timedelta(seconds=eta_seconds)
            )

    @workflow.query
    def get_progress(self) -> ProgressInfo:
        """Query für aktuellen Progress."""
        return self.progress

    @workflow.query
    def get_percentage(self) -> float:
        """Query nur für Percentage."""
        return self.progress.percentage

Client-Side Progress Monitoring:

from temporalio.client import Client
import asyncio

async def monitor_workflow_progress():
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle("workflow-id")

    while True:
        # Query Progress
        progress = await handle.query("get_progress")

        print(f"Phase: {progress.phase}")
        print(f"Progress: {progress.percentage:.2f}%")
        print(f"Step: {progress.current_step}/{progress.total_steps}")

        if progress.estimated_completion:
            print(f"ETA: {progress.estimated_completion}")

        if progress.percentage >= 100:
            print("Workflow complete!")
            break

        await asyncio.sleep(5)  # Poll alle 5 Sekunden

5.5 Error Handling und Resilience

5.5.1 try/except in Workflows

from typing import List

from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

@workflow.defn
class ErrorHandlingWorkflow:
    @workflow.run
    async def run(self, items: List[str]) -> dict:
        successful = []
        failed = []

        for item in items:
            try:
                result = await workflow.execute_activity(
                    process_item,
                    item,
                    start_to_close_timeout=timedelta(minutes=2),
                    retry_policy=RetryPolicy(
                        maximum_attempts=3,
                        non_retryable_error_types=["InvalidInput"]
                    )
                )
                successful.append(result)

            except ActivityError as e:
                # Activity failed nach allen Retries
                workflow.logger.warning(f"Failed to process {item}: {e.cause}")
                failed.append({
                    "item": item,
                    "error": str(e.cause),
                    "retry_state": str(e.retry_state) if e.retry_state else None
                })
                # Workflow fährt fort!

        return {
            "successful": len(successful),
            "failed": len(failed),
            "total": len(items)
        }

5.5.2 SAGA Pattern für Compensation

from typing import List, Callable

@workflow.defn
class BookingWorkflow:
    """
    SAGA Pattern: Bei Fehler Rollback aller vorherigen Schritte.
    """

    @workflow.run
    async def run(self, booking_data: dict) -> dict:
        compensations: List[Callable] = []

        try:
            # Step 1: Buche Auto
            car_result = await workflow.execute_activity(
                book_car,
                booking_data,
                start_to_close_timeout=timedelta(seconds=10),
            )
            # Registriere Compensation
            compensations.append(
                lambda: workflow.execute_activity(
                    undo_book_car,
                    booking_data,
                    start_to_close_timeout=timedelta(seconds=10)
                )
            )

            # Step 2: Buche Hotel
            hotel_result = await workflow.execute_activity(
                book_hotel,
                booking_data,
                start_to_close_timeout=timedelta(seconds=10),
            )
            compensations.append(
                lambda: workflow.execute_activity(
                    undo_book_hotel,
                    booking_data,
                    start_to_close_timeout=timedelta(seconds=10)
                )
            )

            # Step 3: Buche Flug
            flight_result = await workflow.execute_activity(
                book_flight,
                booking_data,
                start_to_close_timeout=timedelta(seconds=10),
            )
            compensations.append(
                lambda: workflow.execute_activity(
                    undo_book_flight,
                    booking_data,
                    start_to_close_timeout=timedelta(seconds=10)
                )
            )

            return {
                "status": "success",
                "car": car_result,
                "hotel": hotel_result,
                "flight": flight_result
            }

        except Exception as e:
            # Fehler - Führe Compensations in umgekehrter Reihenfolge aus
            workflow.logger.error(f"Booking failed: {e}, rolling back...")

            for compensation in reversed(compensations):
                try:
                    await compensation()
                except Exception as comp_error:
                    workflow.logger.error(f"Compensation failed: {comp_error}")

            return {
                "status": "rolled_back",
                "error": str(e)
            }

5.6 Long-Running Workflows und Continue-As-New

5.6.1 Event History Management

Problem: Event History ist auf 51.200 Events oder 50 MB limitiert.

Lösung: Continue-As-New

from dataclasses import dataclass

@dataclass
class WorkflowState:
    processed_count: int = 0
    iteration: int = 0

@workflow.defn
class LongRunningWorkflow:
    @workflow.run
    async def run(self, state: WorkflowState = None) -> None:
        # Initialisiere oder restore State
        if state is None:
            self.state = WorkflowState()
        else:
            self.state = state
            workflow.logger.info(f"Resumed at iteration {self.state.iteration}")

        # Verarbeite Batch
        for i in range(100):
            await workflow.execute_activity(
                process_item,
                self.state.processed_count,
                start_to_close_timeout=timedelta(minutes=1)
            )
            self.state.processed_count += 1

        self.state.iteration += 1

        # Check Continue-As-New Suggestion
        if workflow.info().is_continue_as_new_suggested():
            workflow.logger.info(
                f"Continuing as new after {self.state.processed_count} items"
            )
            workflow.continue_as_new(self.state)

        # Oder: Custom Trigger
        if self.state.processed_count % 10000 == 0:
            workflow.continue_as_new(self.state)

5.6.2 Infinite Loop mit Continue-As-New

Entity Workflow Pattern (Actor Model):

from dataclasses import dataclass
from typing import List

@dataclass
class AccountState:
    account_id: str
    balance: float = 0.0
    transaction_count: int = 0

@workflow.defn
class AccountEntityWorkflow:
    """
    Läuft unbegrenzt - Entity Workflow für ein Bank-Konto.
    """

    def __init__(self) -> None:
        self.state: AccountState = None
        self.pending_transactions: List[dict] = []
        self.should_shutdown = False

    @workflow.run
    async def run(self, initial_state: AccountState = None) -> None:
        # Initialize oder restore
        if initial_state:
            self.state = initial_state
        else:
            self.state = AccountState(
                account_id=workflow.info().workflow_id
            )

        workflow.logger.info(
            f"Account {self.state.account_id} started. "
            f"Balance: {self.state.balance}, "
            f"Transactions: {self.state.transaction_count}"
        )

        # Infinite Loop
        while not self.should_shutdown:
            # Warte auf Transactions oder Shutdown; der Timeout dient nur als periodisches Aufwachen
            try:
                await workflow.wait_condition(
                    lambda: len(self.pending_transactions) > 0 or self.should_shutdown,
                    timeout=timedelta(seconds=30)
                )
            except asyncio.TimeoutError:
                continue  # Kein Signal innerhalb von 30 Sekunden - weiter warten

            # Verarbeite Transactions
            while self.pending_transactions:
                transaction = self.pending_transactions.pop(0)

                try:
                    result = await workflow.execute_activity(
                        process_transaction,
                        transaction,
                        start_to_close_timeout=timedelta(seconds=10)
                    )

                    self.state.balance += result["amount"]
                    self.state.transaction_count += 1

                except Exception as e:
                    workflow.logger.error(f"Transaction failed: {e}")

            # Continue-As-New nach jeweils 1000 Transactions
            if self.state.transaction_count > 0 and self.state.transaction_count % 1000 == 0:
                workflow.logger.info(
                    f"Continuing as new after {self.state.transaction_count} transactions"
                )
                workflow.continue_as_new(self.state)

        # Graceful Shutdown
        workflow.logger.info("Account workflow shutting down gracefully")

    @workflow.signal
    def deposit(self, amount: float) -> None:
        """Signal: Geld einzahlen."""
        self.pending_transactions.append({
            "type": "deposit",
            "amount": amount,
            "timestamp": workflow.time()
        })

    @workflow.signal
    def withdraw(self, amount: float) -> None:
        """Signal: Geld abheben."""
        self.pending_transactions.append({
            "type": "withdraw",
            "amount": -amount,
            "timestamp": workflow.time()
        })

    @workflow.signal
    def shutdown(self) -> None:
        """Signal: Workflow beenden."""
        self.should_shutdown = True

    @workflow.query
    def get_balance(self) -> float:
        """Query: Aktueller Kontostand."""
        return self.state.balance

    @workflow.query
    def get_transaction_count(self) -> int:
        """Query: Anzahl Transaktionen."""
        return self.state.transaction_count
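
Von außen wird ein solcher Entity Workflow typischerweise über Signals und Queries bedient. Ein Client-Sketch (Workflow-ID-Schema und Task Queue "accounts" sind Beispielwerte):

from temporalio.client import Client

async def use_account(account_id: str) -> None:
    client = await Client.connect("localhost:7233")

    # Entity Workflow starten (einmalig pro Konto; bei erneutem Start: WorkflowAlreadyStartedError)
    await client.start_workflow(
        AccountEntityWorkflow.run,
        id=f"account-{account_id}",
        task_queue="accounts",
    )

    handle = client.get_workflow_handle(f"account-{account_id}")

    # Zustand per Signals ändern, per Queries lesen
    await handle.signal(AccountEntityWorkflow.deposit, 100.0)
    await handle.signal(AccountEntityWorkflow.withdraw, 25.0)

    balance = await handle.query(AccountEntityWorkflow.get_balance)
    print(f"Kontostand: {balance}")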

5.7 Zusammenfassung

In diesem Kapitel haben wir fortgeschrittene Workflow-Programming-Patterns kennengelernt:

Workflow-Komposition:

  • Activities: Default für Business Logic
  • Child Workflows: Nur für Workload-Partitionierung, Service-Separation, Resource-Mapping
  • Parent-Child Kommunikation via Signals

Parallele Ausführung:

  • asyncio.gather() für parallele Activities
  • Fan-Out/Fan-In Patterns
  • Performance-Limit: ~30 Activities/Sekunde pro Workflow
  • Lösung für 10.000+ Items: Child Workflows

Timers und Scheduling:

  • Durable Timer (asyncio.sleep()) für Delays (Tage, Monate möglich!)
  • Timeout Patterns mit wait_condition()
  • Cron Workflows via Schedules
  • Timer Cancellation

State Management:

  • Instance Variables für Workflow-State
  • Queries für Progress Tracking (read-only!)
  • Signals für State Updates
  • ETA-Berechnungen

Error Handling:

  • try/except für Activity Failures
  • SAGA Pattern für Compensations
  • Graceful Degradation
  • Workflows schlagen bei Activity Errors NICHT automatisch fehl

Long-Running Workflows:

  • Event History Limit: 51.200 Events / 50 MB
  • workflow.info().is_continue_as_new_suggested()
  • State Transfer via workflow.continue_as_new()
  • Entity Workflows mit Infinite Loops

graph TB
    Start[Workflow Development]
    Design{Design Pattern}

    Design -->|Business Logic| Activities[Use Activities]
    Design -->|Massive Scale| ChildWF[Use Child Workflows]
    Design -->|Parallel| Gather[asyncio.gather]
    Design -->|Long-Running| CAN[Continue-As-New]
    Design -->|Error Handling| SAGA[SAGA Pattern]

    Activities --> Best[Best Practices]
    ChildWF --> Best
    Gather --> Best
    CAN --> Best
    SAGA --> Best

    Best --> Production[Production-Ready Workflows]

    style Activities fill:#90EE90
    style ChildWF fill:#fff4e1
    style Gather fill:#e1f5ff
    style CAN fill:#ffe1e1
    style SAGA fill:#ffffcc
    style Production fill:#90EE90

⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 6: Kommunikation (Signale und Queries)

Code-Beispiele für dieses Kapitel: examples/part-02/chapter-05/

Praktische Übung: Implementieren Sie einen Entity Workflow mit Signals, Queries und Continue-As-New!

Kapitel 6: Kommunikation mit Workflows - Signale, Queries und Updates

Einleitung

In den vorherigen Kapiteln haben Sie gelernt, wie man Workflows definiert und programmiert. Doch in der realen Welt existieren Workflows nicht isoliert – sie müssen mit externen Systemen, Benutzern und anderen Workflows kommunizieren. Ein Approval-Workflow muss auf Genehmigungsentscheidungen warten, ein Bestellprozess muss Statusabfragen beantworten, und lange laufende Prozesse müssen auf externe Events reagieren können.

Temporal bietet drei leistungsstarke Mechanismen für die Kommunikation mit laufenden Workflows:

  1. Signale (Signals): Asynchrone, fire-and-forget Nachrichten zur Zustandsänderung
  2. Queries: Synchrone, read-only Abfragen des Workflow-Zustands
  3. Updates: Synchrone Zustandsänderungen mit Rückgabewerten und Validierung

Dieses Kapitel erklärt detailliert, wie diese drei Mechanismen funktionieren, wann Sie welchen verwenden sollten, und welche Best Practices und Fallstricke es zu beachten gilt.

Lernziele

Nach diesem Kapitel können Sie:

  • Signale implementieren und verstehen, wann sie geeignet sind
  • Queries für effiziente Zustandsabfragen nutzen
  • Updates mit Validierung für synchrone Operationen einsetzen
  • Zwischen den drei Mechanismen fundiert entscheiden
  • Human-in-the-Loop Workflows implementieren
  • Häufige Fehler und Anti-Patterns vermeiden

6.1 Signale (Signals) - Asynchrone Workflow-Kommunikation

6.1.1 Definition und Zweck

Signale sind asynchrone, fire-and-forget Nachrichten, die den Zustand eines laufenden Workflows ändern, ohne dass der Sender auf eine Antwort warten muss. Wenn ein Signal empfangen wird, persistiert Temporal sowohl das Event als auch die Payload dauerhaft in der Event History.

Hauptmerkmale von Signalen:

  • Asynchron: Server akzeptiert Signal sofort ohne auf Verarbeitung zu warten
  • Non-blocking: Sender erhält keine Rückgabewerte oder Exceptions
  • Dauerhaft: Erzeugt WorkflowExecutionSignaled Event in der Event History
  • Geordnet: Signale werden in der Reihenfolge verarbeitet, in der sie empfangen wurden
  • Gepuffert: Können vor Workflow-Start gesendet werden; werden dann gepuffert und beim Start verarbeitet

Sequenzdiagramm: Signal-Ablauf

sequenceDiagram
    participant Client as External Client
    participant Frontend as Temporal Frontend
    participant History as History Service
    participant Worker as Worker
    participant Workflow as Workflow Code

    Client->>Frontend: send_signal("approve", data)
    Frontend->>History: Store WorkflowExecutionSignaled Event
    History-->>Frontend: Event stored
    Frontend-->>Client: Signal accepted (async return)

    Note over Worker: Worker polls for tasks
    Worker->>Frontend: Poll for Workflow Task
    Frontend->>History: Get pending events
    History-->>Frontend: Return events (including signal)
    Frontend-->>Worker: Workflow Task with signal event

    Worker->>Workflow: Replay + execute signal handler
    Workflow->>Workflow: approve(data)
    Workflow->>Workflow: Update state
    Worker->>Frontend: Complete task with new commands
    Frontend->>History: Store completion events

Wann Signale verwenden:

  • ✅ Asynchrone Benachrichtigungen ohne Rückmeldung
  • ✅ Event-gesteuerte Workflows (z.B. “neuer Upload”, “Zahlung eingegangen”)
  • ✅ Human-in-the-Loop mit wait conditions
  • ✅ Externe System-Integrationen mit fire-and-forget Semantik
  • ❌ Wenn Sie wissen müssen, ob Operation erfolgreich war → Update verwenden
  • ❌ Wenn Sie Validierung vor Ausführung brauchen → Update verwenden

6.1.2 Signal Handler definieren

Signal Handler werden mit dem @workflow.signal Decorator dekoriert und können synchron (def) oder asynchron (async def) sein.

Einfacher Signal Handler:

from temporalio import workflow
from typing import Optional
from dataclasses import dataclass

@dataclass
class ApprovalInput:
    """Typsichere Signal-Daten mit Dataclass"""
    approver_name: str
    approved: bool
    comment: str = ""

@workflow.defn
class ApprovalWorkflow:
    """Workflow mit Signal-basierter Genehmigung"""

    @workflow.init
    def __init__(self) -> None:
        # WICHTIG: Initialisierung mit @workflow.init
        # garantiert Ausführung vor Signal Handlern
        self.approved: Optional[bool] = None
        self.approver_name: Optional[str] = None
        self.comment: str = ""

    @workflow.signal
    def approve(self, input: ApprovalInput) -> None:
        """Signal Handler - ändert Workflow-Zustand"""
        self.approved = input.approved
        self.approver_name = input.approver_name
        self.comment = input.comment

        workflow.logger.info(
            f"Approval decision received from {input.approver_name}: "
            f"{'approved' if input.approved else 'rejected'}"
        )

    @workflow.run
    async def run(self, request_id: str) -> str:
        """Wartet auf Signal via wait_condition"""
        workflow.logger.info(f"Waiting for approval on request {request_id}")

        # Warten bis Signal empfangen wurde
        await workflow.wait_condition(lambda: self.approved is not None)

        if self.approved:
            return f"Request approved by {self.approver_name}"
        else:
            return f"Request rejected by {self.approver_name}: {self.comment}"

Asynchroner Signal Handler mit Activities:

Signal Handler können auch asynchron sein und Activities, Child Workflows oder Timers ausführen:

from datetime import timedelta

@workflow.defn
class OrderWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.orders: list[dict] = []

    @workflow.signal
    async def process_order(self, order: dict) -> None:
        """Async Signal Handler kann Activities ausführen"""
        # Validierung via Activity
        validated_order = await workflow.execute_activity(
            validate_order,
            order,
            start_to_close_timeout=timedelta(seconds=10),
        )

        # Zustand aktualisieren
        self.orders.append(validated_order)

        workflow.logger.info(f"Processed order: {validated_order['order_id']}")

Best Practice: Leichtgewichtige Signal Handler

# ✓ Empfohlen: Signal aktualisiert Zustand, Workflow verarbeitet
@workflow.signal
def add_order(self, order: dict) -> None:
    """Leichtgewichtiger Handler - nur Zustand ändern"""
    self.pending_orders.append(order)

@workflow.run
async def run(self) -> None:
    """Haupt-Workflow verarbeitet gepufferte Orders"""
    while True:
        # Warte auf neue Orders
        await workflow.wait_condition(lambda: len(self.pending_orders) > 0)

        # Verarbeite alle gepufferten Orders
        while self.pending_orders:
            order = self.pending_orders.pop(0)

            # Verarbeitung im Haupt-Workflow
            await workflow.execute_activity(
                process_order_activity,
                order,
                start_to_close_timeout=timedelta(minutes=5),
            )

6.1.3 Signale von externen Clients senden

Um ein Signal an einen laufenden Workflow zu senden, benötigen Sie ein Workflow Handle:

from temporalio.client import Client

async def send_approval_signal():
    """Signal von externem Client senden"""
    # Verbindung zum Temporal Service
    client = await Client.connect("localhost:7233")

    # Handle für existierenden Workflow abrufen
    workflow_handle = client.get_workflow_handle_for(
        ApprovalWorkflow.run,
        workflow_id="approval-request-123"
    )

    # Signal senden (fire-and-forget)
    await workflow_handle.signal(
        ApprovalWorkflow.approve,
        ApprovalInput(
            approver_name="Alice Johnson",
            approved=True,
            comment="Budget approved"
        )
    )

    print("Signal sent successfully")
    # Kehrt sofort zurück - wartet nicht auf Verarbeitung!

Typsicheres Signaling mit Workflow-Referenz:

# Empfohlen: Typsicher mit Workflow-Klasse
await workflow_handle.signal(
    ApprovalWorkflow.approve,  # Methoden-Referenz (typsicher)
    ApprovalInput(...)
)

# Alternativ: String-basiert (anfällig für Tippfehler)
await workflow_handle.signal(
    "approve",  # String-basiert
    ApprovalInput(...)
)

6.1.4 Signal-with-Start Pattern

Das Signal-with-Start Pattern ermöglicht lazy Workflow-Initialisierung: Es sendet ein Signal an einen existierenden Workflow oder startet einen neuen, falls keiner existiert.

Use Case: Shopping Cart Workflow

@workflow.defn
class ShoppingCartWorkflow:
    """Lazy-initialisierter Shopping Cart via Signal-with-Start"""

    @workflow.init
    def __init__(self) -> None:
        self.items: list[dict] = []
        self.total = 0.0

    @workflow.signal
    def add_item(self, item: dict) -> None:
        """Items zum Warenkorb hinzufügen"""
        self.items.append(item)
        self.total += item['price']
        workflow.logger.info(f"Added {item['name']} - Total: ${self.total:.2f}")

    @workflow.run
    async def run(self) -> dict:
        """Warenkorb läuft 24h, dann automatischer Checkout"""
        # Warte 24 Stunden auf weitere Items (asyncio.sleep erwartet Sekunden)
        await asyncio.sleep(timedelta(hours=24).total_seconds())

        # Automatischer Checkout
        if len(self.items) > 0:
            await workflow.execute_activity(
                process_checkout,
                {"items": self.items, "total": self.total},
                start_to_close_timeout=timedelta(minutes=5),
            )

        return {"items": len(self.items), "total": self.total}

# Client: Signal-with-Start verwenden
async def add_to_cart(user_id: str, item: dict):
    """Item zum User-Warenkorb hinzufügen (lazy init)"""
    client = await Client.connect("localhost:7233")

    # Start Workflow mit Initial-Signal
    await client.start_workflow(
        ShoppingCartWorkflow.run,
        id=f"cart-{user_id}",  # Ein Warenkorb pro User
        task_queue="shopping",
        start_signal="add_item",  # Signal-Name
        start_signal_args=[item],  # Signal-Argumente
    )

    print(f"Item added to cart for user {user_id}")

Ablaufdiagramm: Signal-with-Start

flowchart TD
    A[Client: Signal-with-Start] --> B{Workflow existiert?}
    B -->|Ja| C[Signal an existierenden Workflow senden]
    B -->|Nein| D[Neuen Workflow starten]
    D --> E[Gepuffertes Signal ausliefern]
    C --> F[Signal verarbeitet]
    E --> F

Vorteile:

  • ✅ Idempotent: Mehrfache Aufrufe sicher
  • ✅ Race-condition sicher: Kein “create before send signal” Problem
  • ✅ Lazy Initialization: Workflow nur wenn nötig
  • ✅ Perfekt für User-Sessions, Shopping Carts, etc.

6.1.5 Wait Conditions mit Signalen

Die workflow.wait_condition() Funktion ist das Kernstück für Signal-basierte Koordination:

Einfache Wait Condition:

@workflow.run
async def run(self) -> str:
    # Warten bis Signal empfangen
    await workflow.wait_condition(lambda: self.approved is not None)

    # Fortfahren nach Signal
    return "Approved!"

Wait Condition mit Timeout:

import asyncio

@workflow.run
async def run(self) -> str:
    try:
        # Warte maximal 7 Tage auf Approval
        await workflow.wait_condition(
            lambda: self.approved is not None,
            timeout=timedelta(days=7)
        )
        return "Approved!"
    except asyncio.TimeoutError:
        # Automatische Ablehnung nach Timeout
        return "Approval timeout - request auto-rejected"

Mehrere Bedingungen kombinieren:

@workflow.defn
class MultiStageWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.stage1_complete = False
        self.stage2_complete = False
        self.payment_confirmed = False

    @workflow.signal
    def complete_stage1(self) -> None:
        self.stage1_complete = True

    @workflow.signal
    def complete_stage2(self) -> None:
        self.stage2_complete = True

    @workflow.signal
    def confirm_payment(self, amount: float) -> None:
        self.payment_confirmed = True

    @workflow.run
    async def run(self) -> str:
        # Warte auf ALLE Bedingungen
        await workflow.wait_condition(
            lambda: (
                self.stage1_complete and
                self.stage2_complete and
                self.payment_confirmed
            ),
            timeout=timedelta(hours=48)
        )

        return "All conditions met - proceeding"

Wait Condition Pattern Visualisierung:

stateDiagram-v2
    [*] --> Waiting: workflow.wait_condition()
    Waiting --> CheckCondition: Worker wakes up
    CheckCondition --> Waiting: Condition false
    CheckCondition --> Continue: Condition true
    CheckCondition --> Timeout: Timeout reached
    Continue --> [*]: Workflow proceeds
    Timeout --> [*]: asyncio.TimeoutError

    note right of Waiting
        Workflow blockiert hier
        bis Signal empfangen
        oder Timeout
    end note

6.1.6 Signal-Reihenfolge und Garantien

Ordering Guarantee:

Temporal garantiert, dass Signale in der Reihenfolge verarbeitet werden, in der sie empfangen wurden:

# Client sendet 3 Signale schnell hintereinander
await handle.signal(MyWorkflow.signal1, "first")
await handle.signal(MyWorkflow.signal2, "second")
await handle.signal(MyWorkflow.signal3, "third")

# Workflow-Handler werden GARANTIERT in dieser Reihenfolge ausgeführt:
# 1. signal1("first")
# 2. signal2("second")
# 3. signal3("third")

Event History Eintrag:

Jedes Signal erzeugt einen WorkflowExecutionSignaled Event:

{
  "eventType": "WorkflowExecutionSignaled",
  "eventId": 42,
  "workflowExecutionSignaledEventAttributes": {
    "signalName": "approve",
    "input": {
      "payloads": [...]
    }
  }
}

Replay-Sicherheit:

Bei Replay werden Signale in derselben Reihenfolge aus der Event History gelesen und ausgeführt - deterministisch.
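
Diese Determinismus-Garantie lässt sich mit dem Replayer aus dem SDK gegen die echte Event History prüfen. Ein minimaler Sketch (die Workflow-ID ist ein Beispielparameter, ApprovalWorkflow wie oben):

from temporalio.client import Client
from temporalio.worker import Replayer

async def verify_replay(workflow_id: str) -> None:
    client = await Client.connect("localhost:7233")

    # Event History des Workflows laden
    history = await client.get_workflow_handle(workflow_id).fetch_history()

    # Replay gegen den aktuellen Workflow-Code; wirft bei Non-Determinism
    replayer = Replayer(workflows=[ApprovalWorkflow])
    await replayer.replay_workflow(history)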

6.1.7 Praktische Anwendungsfälle für Signale

Use Case 1: Human-in-the-Loop Expense Approval

from decimal import Decimal
from datetime import datetime

@dataclass
class ExpenseRequest:
    request_id: str
    employee: str
    amount: Decimal
    description: str
    category: str

@dataclass
class ApprovalDecision:
    approved: bool
    approver: str
    timestamp: datetime
    comment: str

@workflow.defn
class ExpenseApprovalWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.decision: Optional[ApprovalDecision] = None

    @workflow.signal
    def approve(self, approver: str, comment: str = "") -> None:
        """Manager genehmigt Expense"""
        self.decision = ApprovalDecision(
            approved=True,
            approver=approver,
            timestamp=workflow.now(),  # deterministische Workflow-Zeit statt datetime.now()
            comment=comment
        )

    @workflow.signal
    def reject(self, approver: str, reason: str) -> None:
        """Manager lehnt Expense ab"""
        self.decision = ApprovalDecision(
            approved=False,
            approver=approver,
            timestamp=workflow.now(),
            comment=reason
        )

    @workflow.run
    async def run(self, request: ExpenseRequest) -> str:
        # Benachrichtigung an Manager senden
        await workflow.execute_activity(
            send_approval_notification,
            request,
            start_to_close_timeout=timedelta(minutes=5),
        )

        # Warte bis zu 7 Tage auf Entscheidung
        try:
            await workflow.wait_condition(
                lambda: self.decision is not None,
                timeout=timedelta(days=7)
            )
        except asyncio.TimeoutError:
            # Auto-Reject nach Timeout
            self.decision = ApprovalDecision(
                approved=False,
                approver="system",
                timestamp=workflow.now(),
                comment="Approval timeout - automatically rejected"
            )

        # Entscheidung verarbeiten
        if self.decision.approved:
            await workflow.execute_activity(
                process_approved_expense,
                request,
                start_to_close_timeout=timedelta(minutes=10),
            )
            return f"Expense ${request.amount} approved by {self.decision.approver}"
        else:
            await workflow.execute_activity(
                notify_rejection,
                args=[request, self.decision.comment],
                start_to_close_timeout=timedelta(minutes=5),
            )
            return f"Expense rejected: {self.decision.comment}"

Use Case 2: Event-getriebener Multi-Stage Prozess

@workflow.defn
class DataPipelineWorkflow:
    """Event-getriebene Daten-Pipeline mit Signalen"""

    @workflow.init
    def __init__(self) -> None:
        self.data_uploaded = False
        self.validation_complete = False
        self.transform_complete = False
        self.uploaded_data_location: Optional[str] = None

    @workflow.signal
    def notify_upload_complete(self, data_location: str) -> None:
        """Signal: Daten-Upload abgeschlossen"""
        self.data_uploaded = True
        self.uploaded_data_location = data_location
        workflow.logger.info(f"Data uploaded to {data_location}")

    @workflow.signal
    def notify_validation_complete(self) -> None:
        """Signal: Validierung abgeschlossen"""
        self.validation_complete = True
        workflow.logger.info("Validation complete")

    @workflow.signal
    def notify_transform_complete(self) -> None:
        """Signal: Transformation abgeschlossen"""
        self.transform_complete = True
        workflow.logger.info("Transform complete")

    @workflow.run
    async def run(self, pipeline_id: str) -> str:
        workflow.logger.info(f"Starting pipeline {pipeline_id}")

        # Stage 1: Warte auf Upload
        await workflow.wait_condition(lambda: self.data_uploaded)

        # Stage 2: Starte Validierung
        await workflow.execute_activity(
            start_validation,
            self.uploaded_data_location,
            start_to_close_timeout=timedelta(minutes=5),
        )

        # Warte auf Validierungs-Signal (externe Validierung)
        await workflow.wait_condition(lambda: self.validation_complete)

        # Stage 3: Starte Transformation
        await workflow.execute_activity(
            start_transformation,
            self.uploaded_data_location,
            start_to_close_timeout=timedelta(minutes=5),
        )

        # Warte auf Transform-Signal
        await workflow.wait_condition(lambda: self.transform_complete)

        # Stage 4: Finalisierung
        await workflow.execute_activity(
            finalize_pipeline,
            pipeline_id,
            start_to_close_timeout=timedelta(minutes=10),
        )

        return f"Pipeline {pipeline_id} completed successfully"

Pipeline Zustandsdiagramm:

stateDiagram-v2
    [*] --> WaitingForUpload: Workflow gestartet
    WaitingForUpload --> ValidatingData: notify_upload_complete
    ValidatingData --> WaitingForValidation: Activity started
    WaitingForValidation --> TransformingData: notify_validation_complete
    TransformingData --> WaitingForTransform: Activity started
    WaitingForTransform --> Finalizing: notify_transform_complete
    Finalizing --> [*]: Pipeline complete
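
Die Signale kommen in der Praxis von den beteiligten externen Systemen. Ein Client-Sketch, der die Stages von außen anstößt (Workflow-ID-Schema und der S3-Pfad sind Beispielwerte):

from temporalio.client import Client

async def drive_pipeline(pipeline_id: str) -> None:
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle_for(
        DataPipelineWorkflow.run,
        workflow_id=f"pipeline-{pipeline_id}"
    )

    # Upload-System meldet den abgeschlossenen Upload
    await handle.signal(
        DataPipelineWorkflow.notify_upload_complete,
        "s3://example-bucket/raw/data.csv"
    )

    # Validierungs- und Transformations-Systeme melden ihre Ergebnisse
    await handle.signal(DataPipelineWorkflow.notify_validation_complete)
    await handle.signal(DataPipelineWorkflow.notify_transform_complete)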

6.1.8 Signal Best Practices

1. Dataclasses für Signal-Parameter:

# ✓ Gut: Typsicher und erweiterbar
@dataclass
class SignalInput:
    field1: str
    field2: int
    field3: Optional[str] = None  # Einfach neue Felder hinzufügen

@workflow.signal
def my_signal(self, input: SignalInput) -> None:
    ...

# ✗ Schlecht: Untypisiert und schwer zu warten
@workflow.signal
def my_signal(self, data: dict) -> None:
    ...

2. Idempotenz implementieren:

@workflow.signal
def process_payment(self, payment_id: str, amount: Decimal) -> None:
    """Idempotenter Signal Handler"""
    # Prüfen ob bereits verarbeitet
    if payment_id in self.processed_payments:
        workflow.logger.warning(f"Duplicate payment signal: {payment_id}")
        return

    # Verarbeiten und markieren
    self.processed_payments.add(payment_id)
    self.total_amount += amount

3. Signal-Limits beachten:

# ⚠️ Problem: Zu viele Signale
for i in range(10000):
    await handle.signal(MyWorkflow.process_item, f"item-{i}")
# Kann das Event History Limit (51.200 Events) überschreiten!

# ✓ Lösung: Batch-Signale
items = [f"item-{i}" for i in range(10000)]
await handle.signal(MyWorkflow.process_batch, items)

4. @workflow.init für Initialisierung:

# ✓ Korrekt: @workflow.init garantiert Ausführung vor Handlern
@workflow.init
def __init__(self) -> None:
    self.counter = 0
    self.items = []

# ✗ Falsch: run() könnte NACH Signal Handler ausgeführt werden
@workflow.run
async def run(self) -> None:
    self.counter = 0  # Zu spät!

6.2 Queries - Synchrone Zustandsabfragen

6.2.1 Definition und Zweck

Queries sind synchrone, read-only Anfragen, die den Zustand eines Workflows inspizieren ohne ihn zu verändern. Queries erzeugen keine Events in der Event History und können sogar auf abgeschlossene Workflows (innerhalb der Retention Period) ausgeführt werden.

Hauptmerkmale von Queries:

  • Synchron: Aufrufer wartet auf Antwort
  • Read-only: Können Workflow-Zustand NICHT ändern
  • Non-blocking: Können keine Activities oder Timers ausführen
  • History-free: Erzeugen KEINE Event History Einträge
  • Auf abgeschlossenen Workflows: Query funktioniert nach Workflow-Ende

Query Sequenzdiagramm:

sequenceDiagram
    participant Client as External Client
    participant Frontend as Temporal Frontend
    participant Worker as Worker
    participant Workflow as Workflow Code

    Client->>Frontend: query("get_status")
    Frontend->>Worker: Query Task
    Worker->>Workflow: Replay History (deterministic state)
    Workflow->>Workflow: Execute query handler (read-only)
    Workflow-->>Worker: Return query result
    Worker-->>Frontend: Query result
    Frontend-->>Client: Return result (synchron)

    Note over Frontend: KEINE Event History Einträge!

Wann Queries verwenden:

  • ✅ Zustand abfragen ohne zu ändern
  • ✅ Progress Tracking (Fortschritt anzeigen)
  • ✅ Debugging (aktueller Zustand inspizieren)
  • ✅ Dashboards mit Echtzeit-Status
  • ✅ Nach Workflow-Ende Status abrufen
  • ❌ Zustand ändern → Signal oder Update verwenden
  • ❌ Kontinuierliches Polling → Update mit wait_condition besser

6.2.2 Query Handler definieren

Query Handler werden mit @workflow.query dekoriert und MÜSSEN synchron sein (def, NICHT async def):

from enum import Enum
from typing import List

class OrderStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

@dataclass
class OrderProgress:
    """Query-Rückgabewert mit vollständiger Info"""
    status: OrderStatus
    items_processed: int
    total_items: int
    percent_complete: float
    current_step: str

@workflow.defn
class OrderProcessingWorkflow:
    @workflow.init
    def __init__(self, order_id: str, items: List[str]) -> None:
        # @workflow.init: gleiche Signatur wie run()
        self.status = OrderStatus.PENDING
        self.items_processed = 0
        self.total_items = len(items)
        self.current_step = "Initialization"

    @workflow.query
    def get_status(self) -> str:
        """Einfacher Status-Query"""
        return self.status.value

    @workflow.query
    def get_progress(self) -> OrderProgress:
        """Detaillierter Progress-Query mit Dataclass"""
        return OrderProgress(
            status=self.status,
            items_processed=self.items_processed,
            total_items=self.total_items,
            percent_complete=(
                (self.items_processed / self.total_items * 100)
                if self.total_items > 0 else 0
            ),
            current_step=self.current_step
        )

    @workflow.query
    def get_items_remaining(self) -> int:
        """Berechneter Query-Wert"""
        return self.total_items - self.items_processed

    @workflow.run
    async def run(self, order_id: str, items: List[str]) -> str:
        self.total_items = len(items)
        self.status = OrderStatus.PROCESSING

        for i, item in enumerate(items):
            self.current_step = f"Processing item {item}"

            await workflow.execute_activity(
                process_item,
                item,
                start_to_close_timeout=timedelta(minutes=5),
            )

            self.items_processed = i + 1

        self.status = OrderStatus.SHIPPED
        self.current_step = "Shipped to customer"

        return f"Order {order_id} completed"

6.2.3 Queries von Clients ausführen

async def monitor_order_progress(order_id: str):
    """Query-basiertes Progress Monitoring"""
    client = await Client.connect("localhost:7233")

    # Handle für Workflow abrufen
    handle = client.get_workflow_handle_for(
        OrderProcessingWorkflow.run,  # Referenz auf die run-Methode
        workflow_id=order_id
    )

    # Einfacher Status-Query
    status = await handle.query(OrderProcessingWorkflow.get_status)
    print(f"Order status: {status}")

    # Detaillierter Progress-Query
    progress = await handle.query(OrderProcessingWorkflow.get_progress)
    print(f"Progress: {progress.percent_complete:.1f}%")
    print(f"Current step: {progress.current_step}")
    print(f"Items: {progress.items_processed}/{progress.total_items}")

    # Berechneter Query
    remaining = await handle.query(OrderProcessingWorkflow.get_items_remaining)
    print(f"Items remaining: {remaining}")

Query auf abgeschlossenen Workflow:

async def query_completed_workflow(workflow_id: str):
    """Query funktioniert auch nach Workflow-Ende"""
    client = await Client.connect("localhost:7233")

    handle = client.get_workflow_handle(workflow_id)

    try:
        # Funktioniert innerhalb der Retention Period!
        final_status = await handle.query("get_status")
        print(f"Final status: {final_status}")
    except Exception as e:
        print(f"Query failed: {e}")

6.2.4 Query-Einschränkungen

1. Queries MÜSSEN synchron sein:

# ✓ Korrekt: Synchrone Query
@workflow.query
def get_data(self) -> dict:
    return {"status": self.status}

# ✗ FALSCH: Async nicht erlaubt!
@workflow.query
async def get_data(self) -> dict:  # TypeError!
    return {"status": self.status}

2. Queries können KEINE Activities ausführen:

# ✗ FALSCH: Keine Activities in Queries!
@workflow.query
def get_external_data(self) -> dict:
    # `await` in einer synchronen Funktion ist bereits ein SyntaxError -
    # Queries können grundsätzlich keine Activities oder Timer ausführen
    result = await workflow.execute_activity(...)  # ERROR!
    return result

3. Queries dürfen Zustand NICHT ändern:

# ✗ FALSCH: State-Mutation in Query
@workflow.query
def increment_counter(self) -> int:
    self.counter += 1  # Verletzt Read-Only Constraint!
    return self.counter

# ✓ Korrekt: Read-Only Query
@workflow.query
def get_counter(self) -> int:
    return self.counter

# ✓ Für Mutation: Update verwenden
@workflow.update
def increment_counter(self) -> int:
    self.counter += 1
    return self.counter

Vergleich: Query vs Update für State Access

flowchart LR
    A[Workflow State Access] --> B{Lesen oder Schreiben?}
    B -->|Nur Lesen| C[Query verwenden]
    B -->|Schreiben| D[Update verwenden]

    C --> E[Vorteile:<br/>- Keine Event History<br/>- Funktioniert nach Workflow-Ende<br/>- Niedrige Latenz]

    D --> F[Vorteile:<br/>- State-Mutation erlaubt<br/>- Validierung möglich<br/>- Fehler-Feedback]

    style C fill:#90EE90
    style D fill:#87CEEB

6.2.5 Stack Trace Query für Debugging

Temporal bietet einen eingebauten __stack_trace Query für Debugging:

# CLI: Stack Trace abrufen
temporal workflow stack --workflow-id stuck-workflow-123

# Programmatisch: Stack Trace abrufen
async def debug_workflow(workflow_id: str):
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle(workflow_id)

    # Eingebauter Stack Trace Query
    stack_trace = await handle.query("__stack_trace")
    print(f"Workflow stack trace:\n{stack_trace}")

Wann Stack Trace verwenden:

  • ✅ Workflow scheint “hängen zu bleiben”
  • ✅ Debugging von wait_condition Problemen
  • ✅ Verstehen wo Workflow aktuell wartet
  • ✅ Identifizieren von Deadlocks
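
Ergänzend eine mögliche Hilfsfunktion (Skizze; Visibility-Query und Workflow-Typ sind hier Annahmen), die über client.list_workflows() alle laufenden Executions eines Typs findet und deren Stack Traces ausgibt:

import asyncio
from temporalio.client import Client

async def dump_stack_traces(workflow_type: str) -> None:
    """Stack Traces aller laufenden Workflows eines Typs ausgeben"""
    client = await Client.connect("localhost:7233")
    # Visibility-Query: nur laufende Executions des angegebenen Typs
    query = f"WorkflowType = '{workflow_type}' AND ExecutionStatus = 'Running'"
    async for execution in client.list_workflows(query):
        handle = client.get_workflow_handle(execution.id)
        trace = await handle.query("__stack_trace")
        print(f"--- {execution.id} ---\n{trace}\n")

if __name__ == "__main__":
    asyncio.run(dump_stack_traces("OrderProcessingWorkflow"))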

6.2.6 Praktische Query Use Cases

Use Case 1: Real-Time Dashboard

@dataclass
class DashboardData:
    """Aggregierte Daten für Dashboard"""
    total_processed: int
    total_failed: int
    current_batch: int
    average_processing_time: float
    estimated_completion: datetime

@workflow.defn
class BatchProcessingWorkflow:
    @workflow.init
    def __init__(self, total_batches: int) -> None:
        # @workflow.init: gleiche Signatur wie run()
        self.processed = 0
        self.failed = 0
        self.current_batch = 0
        self.total_batches = total_batches
        self.processing_times: List[float] = []
        self.start_time = workflow.now()  # Deterministische Workflow-Zeit statt datetime.now()

    @workflow.query
    def get_dashboard_data(self) -> DashboardData:
        """Real-time Dashboard Query"""
        avg_time = (
            sum(self.processing_times) / len(self.processing_times)
            if self.processing_times else 0
        )

        remaining = self.total_batches - self.current_batch
        eta_seconds = remaining * avg_time
        eta = workflow.now() + timedelta(seconds=eta_seconds)

        return DashboardData(
            total_processed=self.processed,
            total_failed=self.failed,
            current_batch=self.current_batch,
            average_processing_time=avg_time,
            estimated_completion=eta
        )

    @workflow.run
    async def run(self, total_batches: int) -> str:
        self.total_batches = total_batches

        for batch_num in range(total_batches):
            self.current_batch = batch_num
            batch_start = workflow.time()  # Deterministische Workflow-Zeit statt time.time()

            try:
                await workflow.execute_activity(
                    process_batch,
                    batch_num,
                    start_to_close_timeout=timedelta(minutes=10),
                )
                self.processed += 1
            except Exception as e:
                self.failed += 1
                workflow.logger.error(f"Batch {batch_num} failed: {e}")

            # Tracking für ETA
            batch_time = workflow.time() - batch_start
            self.processing_times.append(batch_time)

        return f"Processed {self.processed} batches, {self.failed} failed"

Dashboard Client:

async def display_dashboard(workflow_id: str):
    """Live Dashboard mit Query Polling"""
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle_for(
        BatchProcessingWorkflow.run,
        workflow_id
    )

    while True:
        try:
            # Dashboard-Daten abfragen
            data = await handle.query(
                BatchProcessingWorkflow.get_dashboard_data
            )

            # Anzeigen (vereinfacht)
            print(f"\n{'='*50}")
            print(f"Batch Progress Dashboard")
            print(f"{'='*50}")
            print(f"Processed: {data.total_processed}")
            print(f"Failed: {data.total_failed}")
            print(f"Current Batch: {data.current_batch}")
            print(f"Avg Time: {data.average_processing_time:.2f}s")
            print(f"ETA: {data.estimated_completion}")

            # Prüfen ob abgeschlossen
            result = await asyncio.wait_for(handle.result(), timeout=0.1)
            print(f"\nWorkflow completed: {result}")
            break

        except asyncio.TimeoutError:
            # Workflow läuft noch
            await asyncio.sleep(2)  # Update alle 2 Sekunden

Use Case 2: State Inspection für Debugging

@workflow.defn
class ComplexWorkflow:
    """Workflow mit umfangreichem State für Debugging"""

    @workflow.query
    def get_full_state(self) -> dict:
        """Vollständiger State Dump für Debugging"""
        return {
            "status": self.status.value,
            "current_stage": self.current_stage,
            "pending_operations": len(self.pending_ops),
            "completed_tasks": self.completed_tasks,
            "errors": [str(e) for e in self.errors],
            "metadata": self.metadata,
            "last_updated": self.last_updated.isoformat(),
        }

    @workflow.query
    def get_pending_operations(self) -> List[dict]:
        """Detaillierte Pending Operations"""
        return [
            {
                "id": op.id,
                "type": op.type,
                "created_at": op.created_at.isoformat(),
                "retry_count": op.retry_count,
            }
            for op in self.pending_ops
        ]

6.2.7 Query Best Practices

1. Pre-compute komplexe Werte:

# ✗ Schlecht: Schwere Berechnung in Query
@workflow.query
def get_statistics(self) -> dict:
    # Vermeiden: O(n) Berechnung bei jedem Query!
    total = sum(item.price for item in self.items)
    avg = total / len(self.items)
    return {"total": total, "average": avg}

# ✓ Gut: Inkrementell updaten, Query nur lesen
@workflow.signal
def add_item(self, item: Item) -> None:
    self.items.append(item)
    # Statistiken inkrementell updaten
    self.total += item.price
    self.count += 1
    self.average = self.total / self.count

@workflow.query
def get_statistics(self) -> dict:
    # Instant return - keine Berechnung
    return {"total": self.total, "average": self.average}

2. Dataclass für Query-Responses:

# ✓ Gut: Typsicher und selbst-dokumentierend
@dataclass
class WorkflowStatus:
    state: str
    progress_percent: float
    items_processed: int
    errors: List[str]

@workflow.query
def get_status(self) -> WorkflowStatus:
    return WorkflowStatus(...)

3. Nicht kontinuierlich pollen:

# ✗ Ineffizient: Tight polling loop
async def wait_for_completion_bad(handle):
    while True:
        status = await handle.query(MyWorkflow.get_status)
        if status == "completed":
            break
        await asyncio.sleep(0.5)  # Sehr ineffizient!

# ✓ Besser: Update mit wait_condition (siehe Updates Sektion)
# Oder: Workflow result() awaiten
async def wait_for_completion_good(handle):
    result = await handle.result()  # Wartet automatisch
    return result

6.3 Updates - Synchrone Zustandsänderungen

6.3.1 Definition und Zweck

Updates sind synchrone, getrackte Write-Operationen, die Workflow-Zustand ändern UND ein Ergebnis an den Aufrufer zurückgeben. Sie kombinieren die State-Mutation von Signalen mit der synchronen Response von Queries, plus optionale Validierung.

Hauptmerkmale von Updates:

  • Synchron: Aufrufer erhält Response oder Error
  • Validiert: Optionale Validators lehnen ungültige Updates ab
  • Tracked: Erzeugt Event History Einträge
  • Blocking: Kann Activities, Child Workflows, wait conditions ausführen
  • Reliable: Sender weiß ob Update erfolgreich war oder fehlschlug

Update Sequenzdiagramm:

sequenceDiagram
    participant Client as External Client
    participant Frontend as Temporal Frontend
    participant History as History Service
    participant Worker as Worker
    participant Workflow as Workflow Code

    Client->>Frontend: execute_update("set_language", GERMAN)
    Frontend->>History: Store WorkflowExecutionUpdateAccepted
    History-->>Frontend: Event stored
    Frontend->>Worker: Workflow Task with update

    Worker->>Workflow: Replay history
    Workflow->>Workflow: Execute validator (optional)

    alt Validator fails
        Workflow-->>Worker: Validation error
        Worker-->>Frontend: Update rejected
        Frontend-->>Client: WorkflowUpdateFailedError
    else Validator passes
        Workflow->>Workflow: Execute update handler
        Workflow->>Workflow: Can execute activities
        Workflow-->>Worker: Update result
        Worker->>Frontend: Update completed
        Frontend->>History: Store WorkflowExecutionUpdateCompleted
        Frontend-->>Client: Return result (synchron)
    end

Updates vs Signals Entscheidungsbaum:

flowchart TD
    A[Workflow-Zustand ändern] --> B{Brauche ich Response?}
    B -->|Nein| C[Signal verwenden]
    B -->|Ja| D{Brauche ich Validierung?}
    D -->|Nein| E[Update ohne Validator]
    D -->|Ja| F[Update mit Validator]

    C --> G[Vorteile:<br/>- Niedrige Latenz<br/>- Fire-and-forget<br/>- Einfach]

    E --> H[Vorteile:<br/>- Synchrone Response<br/>- Fehler-Feedback<br/>- Activity-Ausführung]

    F --> I[Vorteile:<br/>- Frühe Ablehnung<br/>- Input-Validierung<br/>- Keine ungültigen Events]

    style C fill:#90EE90
    style E fill:#87CEEB
    style F fill:#FFD700

6.3.2 Update Handler mit Validierung

Einfacher Update Handler:

from enum import Enum

class Language(Enum):
    ENGLISH = "en"
    GERMAN = "de"
    SPANISH = "es"
    FRENCH = "fr"

@workflow.defn
class GreetingWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.language = Language.ENGLISH
        self.greetings = {
            Language.ENGLISH: "Hello",
            Language.GERMAN: "Hallo",
            Language.SPANISH: "Hola",
            Language.FRENCH: "Bonjour",
        }

    @workflow.update
    def set_language(self, language: Language) -> Language:
        """Update Handler - gibt vorherige Sprache zurück"""
        previous_language = self.language
        self.language = language
        workflow.logger.info(f"Language changed from {previous_language} to {language}")
        return previous_language

    @set_language.validator
    def validate_language(self, language: Language) -> None:
        """Validator - lehnt nicht unterstützte Sprachen ab"""
        if language not in self.greetings:
            raise ValueError(f"Language {language.name} not supported")

    @workflow.run
    async def run(self) -> str:
        # Warte auf Language-Updates...
        await asyncio.sleep(timedelta(hours=24).total_seconds())
        return self.greetings[self.language]

Update Handler mit Activity-Ausführung:

from asyncio import Lock
from decimal import Decimal

@dataclass
class PaymentInfo:
    payment_method: str
    amount: Decimal
    card_token: str

@dataclass
class PaymentResult:
    success: bool
    transaction_id: str
    amount: Decimal

@workflow.defn
class OrderWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.lock = Lock()  # Concurrency-Schutz
        self.order_status = "pending"
        self.total_amount = Decimal("0.00")

    @workflow.update
    async def process_payment(self, payment: PaymentInfo) -> PaymentResult:
        """Async Update - kann Activities ausführen"""
        async with self.lock:  # Verhindert concurrent execution
            # Activity ausführen für Zahlung
            result = await workflow.execute_activity(
                charge_payment,
                payment,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(
                    initial_interval=timedelta(seconds=1),
                    maximum_attempts=3,
                )
            )

            # Workflow-State updaten
            if result.success:
                self.order_status = "paid"
                workflow.logger.info(f"Payment successful: {result.transaction_id}")
            else:
                workflow.logger.error("Payment failed")

            return result

    @process_payment.validator
    def validate_payment(self, payment: PaymentInfo) -> None:
        """Validator - prüft vor Activity-Ausführung"""
        # Status-Check
        if self.order_status != "pending":
            raise ValueError(
                f"Cannot process payment in status: {self.order_status}"
            )

        # Amount-Check
        if payment.amount <= 0:
            raise ValueError("Payment amount must be positive")

        if payment.amount != self.total_amount:
            raise ValueError(
                f"Payment amount {payment.amount} does not match "
                f"order total {self.total_amount}"
            )

        # Card token Check
        if not payment.card_token or len(payment.card_token) < 10:
            raise ValueError("Invalid card token")

6.3.3 Validator-Charakteristiken

Validators müssen folgende Regeln einhalten:

  1. Synchron - Nur def, NICHT async def
  2. Read-Only - Dürfen State NICHT mutieren
  3. Non-Blocking - Keine Activities, Timers, wait conditions
  4. Selbe Parameter - Wie der Update Handler
  5. Return None - Raise Exception zum Ablehnen

# ✓ Korrekt: Validator Implementierung
@workflow.update
def add_item(self, item: CartItem) -> int:
    """Item hinzufügen, return neue Anzahl"""
    self.items.append(item)
    return len(self.items)

@add_item.validator
def validate_add_item(self, item: CartItem) -> None:
    """Validator - synchron, read-only"""
    # Item-Validierung
    if not item.sku or len(item.sku) == 0:
        raise ValueError("Item SKU is required")

    # Cart-Limits
    if len(self.items) >= 100:
        raise ValueError("Cart is full (max 100 items)")

    # Duplikat-Check
    if any(i.sku == item.sku for i in self.items):
        raise ValueError(f"Item {item.sku} already in cart")

    # Validation passed - kein expliziter Return

# ✗ FALSCH: Async Validator
@add_item.validator
async def validate_add_item(self, item: CartItem) -> None:  # ERROR!
    # Async nicht erlaubt
    result = await some_async_check(item)

# ✗ FALSCH: State Mutation
@add_item.validator
def validate_add_item(self, item: CartItem) -> None:
    self.validation_count += 1  # ERROR! Read-only

# ✗ FALSCH: Activities ausführen
@add_item.validator
def validate_add_item(self, item: CartItem) -> None:
    # ERROR! Keine Activities in Validator
    result = await workflow.execute_activity(...)

Validator Execution Flow:

flowchart TD
    A[Update Request] --> B[Execute Validator]
    B --> C{Validator Result}
    C -->|Exception| D[Reject Update]
    C -->|No Exception| E[Accept Update]

    D --> F[Return Error to Client]
    D --> G[NO Event History Entry]

    E --> H[Store UpdateAccepted Event]
    E --> I[Execute Update Handler]
    I --> J[Store UpdateCompleted Event]
    J --> K[Return Result to Client]

    style D fill:#FFB6C1
    style E fill:#90EE90
    style G fill:#FFB6C1

6.3.4 Updates von Clients senden

Execute Update (Wait for Completion):

from temporalio.client import Client
from temporalio.exceptions import WorkflowUpdateFailedError

async def update_workflow_language():
    """Update ausführen und auf Completion warten"""
    client = await Client.connect("localhost:7233")

    workflow_handle = client.get_workflow_handle_for(
        GreetingWorkflow.run,
        workflow_id="greeting-123"
    )

    try:
        # Update ausführen - wartet auf Validator + Handler
        previous_lang = await workflow_handle.execute_update(
            GreetingWorkflow.set_language,
            Language.GERMAN
        )
        print(f"Language changed from {previous_lang} to German")

    except WorkflowUpdateFailedError as e:
        # Validator rejected ODER Handler exception
        print(f"Update failed: {e}")
        # Event History: KEINE Einträge wenn Validator rejected

Start Update (Non-blocking):

from temporalio.client import WorkflowUpdateStage

async def start_update_non_blocking():
    """Update starten, später auf Result warten"""
    client = await Client.connect("localhost:7233")

    workflow_handle = client.get_workflow_handle_for(
        OrderWorkflow.run,
        workflow_id="order-456"
    )

    payment = PaymentInfo(
        payment_method="credit_card",
        amount=Decimal("99.99"),
        card_token="tok_abc123xyz"
    )

    # Update starten - wartet nur bis ACCEPTED
    update_handle = await workflow_handle.start_update(
        OrderWorkflow.process_payment,
        payment,
        wait_for_stage=WorkflowUpdateStage.ACCEPTED,
    )

    print("Update accepted by workflow (validator passed)")

    # Andere Arbeit erledigen...
    await do_other_work()

    # Später: Auf Completion warten
    try:
        result = await update_handle.result()
        print(f"Payment processed: {result.transaction_id}")
    except Exception as e:
        print(f"Payment failed: {e}")

WorkflowUpdateStage Optionen:

# ADMITTED: Update wurde vom Server entgegengenommen - wird vom Python SDK
# NICHT als wait_for_stage für start_update unterstützt

# ACCEPTED: Warten bis Validator passed (üblich für non-blocking)
handle = await workflow_handle.start_update(
    MyWorkflow.my_update,
    data,
    wait_for_stage=WorkflowUpdateStage.ACCEPTED,  # muss bei start_update explizit angegeben werden
)

# COMPLETED: Warten bis Handler fertig (default für execute_update)
result = await workflow_handle.execute_update(
    MyWorkflow.my_update,
    data,
    # Implizit: wait_for_stage=WorkflowUpdateStage.COMPLETED
)

6.3.5 Update-with-Start Pattern

Ähnlich wie Signal-with-Start - Update an existierenden Workflow senden ODER neuen starten:

from temporalio.client import WithStartWorkflowOperation
from temporalio.common import WorkflowIDConflictPolicy

async def update_or_start_shopping_cart(user_id: str, item: CartItem):
    """Update-with-Start für Shopping Cart"""
    client = await Client.connect("localhost:7233")

    # Workflow Start Operation definieren
    start_op = WithStartWorkflowOperation(
        ShoppingCartWorkflow.run,
        id=f"cart-{user_id}",
        id_conflict_policy=WorkflowIDConflictPolicy.USE_EXISTING,
        task_queue="shopping",
    )

    try:
        # Update-with-Start ausführen
        cart_total = await client.execute_update_with_start_workflow(
            ShoppingCartWorkflow.add_item,
            item,
            start_workflow_operation=start_op,
        )

        print(f"Item added. Cart total: ${cart_total}")

    except WorkflowUpdateFailedError as e:
        print(f"Failed to add item: {e}")

    # Workflow Handle abrufen
    workflow_handle = await start_op.workflow_handle()
    return workflow_handle
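
Das Beispiel setzt einen ShoppingCartWorkflow mit einem add_item-Update voraus. Eine mögliche Minimal-Definition (Skizze; Felder und das checkout-Signal sind Annahmen) könnte so aussehen:

from dataclasses import dataclass
from decimal import Decimal
from typing import List
from temporalio import workflow

@dataclass
class CartItem:
    sku: str
    price: Decimal

@workflow.defn
class ShoppingCartWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.items: List[CartItem] = []
        self.total = Decimal("0.00")
        self.checked_out = False

    @workflow.update
    def add_item(self, item: CartItem) -> Decimal:
        """Gibt den neuen Warenkorb-Gesamtwert zurück"""
        self.items.append(item)
        self.total += item.price
        return self.total

    @workflow.signal
    def checkout(self) -> None:
        self.checked_out = True

    @workflow.run
    async def run(self) -> str:
        # Cart bleibt offen, bis der Kunde auscheckt
        await workflow.wait_condition(lambda: self.checked_out)
        return f"{len(self.items)} Items, Gesamt: {self.total}"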

6.3.6 Updates vs Signals - Detaillierter Vergleich

| Feature | Signal | Update |
|---|---|---|
| Response | Keine (fire-and-forget) | Gibt Result oder Error zurück |
| Synchron | Nein | Ja |
| Validierung | Nein | Optional mit Validator |
| Event History | Immer (WorkflowExecutionSignaled) | Nur wenn validiert (Accepted + Completed) |
| Latenz | Niedrig (async return) | Höher (wartet auf Response) |
| Fehler-Feedback | Nein | Ja (Exception an Client) |
| Activities ausführen | Ja (async Handler) | Ja (async Handler) |
| Use Case | Async Notifications | Synchrone State Changes |
| Read-After-Write | Polling mit Query nötig | Built-in Response |
| Worker Required | Kann ohne Worker senden | Worker muss Acknowledgment geben |
| Best für | Event-driven, fire-and-forget | Request-Response, Validierung |

Entscheidungsmatrix:

flowchart TD
    A[Workflow State ändern] --> B{Brauche ich sofortige<br/>Bestätigung?}
    B -->|Nein| C{Ist niedrige Latenz<br/>wichtig?}
    B -->|Ja| D{Brauche ich<br/>Validierung?}

    C -->|Ja| E[Signal]
    C -->|Nein| F[Signal oder Update]

    D -->|Ja| G[Update mit Validator]
    D -->|Nein| H[Update ohne Validator]

    E --> I[Fire-and-forget<br/>Async notification]
    F --> I
    G --> J[Synchrone Operation<br/>mit Input-Validierung]
    H --> K[Synchrone Operation<br/>ohne Validierung]

    style E fill:#90EE90
    style G fill:#FFD700
    style H fill:#87CEEB

Code-Beispiel: Signal vs Update

# Scenario: Item zum Warenkorb hinzufügen

# Option 1: Mit Signal (fire-and-forget)
@workflow.signal
def add_item_signal(self, item: CartItem) -> None:
    """Signal - keine Response"""
    self.items.append(item)
    self.total += item.price

# Client (Signal)
await handle.signal(CartWorkflow.add_item_signal, item)
# Kehrt sofort zurück - weiß nicht ob erfolgreich!

# Wenn Status prüfen will: Query nötig
total = await handle.query(CartWorkflow.get_total)  # Extra roundtrip

# ============================================

# Option 2: Mit Update (synchrone Response)
@workflow.update
def add_item_update(self, item: CartItem) -> dict:
    """Update - gibt neuen State zurück"""
    self.items.append(item)
    self.total += item.price
    return {"items": len(self.items), "total": self.total}

@add_item_update.validator
def validate_add_item(self, item: CartItem) -> None:
    """Frühe Ablehnung ungültiger Items"""
    if len(self.items) >= 100:
        raise ValueError("Cart full")
    if item.price < 0:
        raise ValueError("Invalid price")

# Client (Update)
try:
    result = await handle.execute_update(CartWorkflow.add_item_update, item)
    print(f"Added! Total: ${result['total']}")
except WorkflowUpdateFailedError as e:
    print(f"Failed: {e}")  # Sofortiges Feedback!

6.3.7 Error Handling bei Updates

Validator Rejection:

# Client Code
try:
    result = await workflow_handle.execute_update(
        CartWorkflow.add_item,
        invalid_item  # z.B. leere SKU
    )
except WorkflowUpdateFailedError as e:
    # Validator hat rejected
    print(f"Validation failed: {e}")
    # Event History: KEINE Einträge (frühe Ablehnung)

Handler Exception:

@workflow.update
async def process_order(self, order: Order) -> Receipt:
    """Handler mit Activity - Exception propagiert zu Client"""
    # Activity failure propagiert
    receipt = await workflow.execute_activity(
        charge_customer,
        order,
        start_to_close_timeout=timedelta(seconds=30),
    )
    return receipt

# Client
try:
    receipt = await workflow_handle.execute_update(
        OrderWorkflow.process_order,
        order
    )
    print(f"Order processed: {receipt.id}")

except WorkflowUpdateFailedError as e:
    # Handler raised exception ODER Activity failed
    print(f"Order processing failed: {e}")
    # Event History: UpdateAccepted + UpdateFailed Events

Idempotenz mit Update Info:

from temporalio import workflow

@workflow.update
async def process_payment(self, payment_id: str, amount: Decimal) -> str:
    """Idempotenter Update Handler"""

    # Update ID für Deduplizierung abrufen
    update_info = workflow.current_update_info()

    if update_info and update_info.id in self.processed_updates:
        # Bereits verarbeitet (wichtig bei Continue-As-New)
        workflow.logger.info(f"Duplicate update: {update_info.id}")
        return self.update_results[update_info.id]

    # Payment verarbeiten
    result = await workflow.execute_activity(
        charge_payment,
        args=[payment_id, amount],  # Mehrere Argumente über args=[...] übergeben
        start_to_close_timeout=timedelta(seconds=30),
    )

    # Für Deduplizierung speichern
    if update_info:
        self.processed_updates.add(update_info.id)
        self.update_results[update_info.id] = result

    return result
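
Die hier verwendeten Strukturen self.processed_updates und self.update_results müssen vorab initialisiert werden, z.B. per @workflow.init (Skizze):

@workflow.init
def __init__(self) -> None:
    # Deduplizierungs-Strukturen für idempotente Updates
    self.processed_updates: set[str] = set()
    self.update_results: dict[str, str] = {}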

6.3.8 Unfinished Handler Policy

Kontrolle über das Verhalten, wenn der Workflow endet, während ein Update-Handler noch läuft:

@workflow.update(
    unfinished_policy=workflow.HandlerUnfinishedPolicy.ABANDON
)
async def optional_update(self, data: str) -> None:
    """Update der abgebrochen werden kann"""
    # Lange laufende Operation
    await workflow.execute_activity(
        process_data,
        data,
        start_to_close_timeout=timedelta(hours=1),
    )

# Best Practice: Auf Handler-Completion warten vor Workflow-Ende
@workflow.run
async def run(self) -> str:
    # Haupt-Workflow Logik
    ...

    # Alle Handler fertigstellen lassen
    await workflow.wait_condition(
        lambda: workflow.all_handlers_finished()
    )

    return "Completed"

6.4 Patterns und Best Practices

6.4.1 Human-in-the-Loop Approval Pattern

Ein häufiges Pattern: Workflow wartet auf menschliche Genehmigung mit Timeout.

Multi-Level Approval Workflow:

from typing import Dict, Optional

@dataclass
class ApprovalRequest:
    request_id: str
    requester: str
    amount: Decimal
    description: str

@dataclass
class ApprovalDecision:
    approved: bool
    approver: str
    timestamp: datetime
    comment: str

@workflow.defn
class MultiLevelApprovalWorkflow:
    """Mehrstufige Genehmigung basierend auf Betrag"""

    @workflow.init
    def __init__(self, request: ApprovalRequest) -> None:  # Gleiche Signatur wie run() (@workflow.init)
        self.approvals: Dict[str, ApprovalDecision] = {}
        self.required_approvers: List[str] = []

    def _get_required_approvers(self, amount: Decimal) -> List[str]:
        """Bestimme erforderliche Genehmiger basierend auf Betrag"""
        if amount < Decimal("1000"):
            return ["manager"]
        elif amount < Decimal("10000"):
            return ["manager", "director"]
        else:
            return ["manager", "director", "vp"]

    @workflow.signal
    def approve_manager(self, decision: ApprovalDecision) -> None:
        """Manager Genehmigung"""
        self.approvals["manager"] = decision
        workflow.logger.info(f"Manager approval: {decision.approved}")

    @workflow.signal
    def approve_director(self, decision: ApprovalDecision) -> None:
        """Director Genehmigung"""
        self.approvals["director"] = decision
        workflow.logger.info(f"Director approval: {decision.approved}")

    @workflow.signal
    def approve_vp(self, decision: ApprovalDecision) -> None:
        """VP Genehmigung"""
        self.approvals["vp"] = decision
        workflow.logger.info(f"VP approval: {decision.approved}")

    @workflow.query
    def get_approval_status(self) -> Dict[str, str]:
        """Aktueller Genehmigungs-Status"""
        status = {}
        for role in self.required_approvers:
            if role in self.approvals:
                decision = self.approvals[role]
                status[role] = "approved" if decision.approved else "rejected"
            else:
                status[role] = "pending"
        return status

    @workflow.query
    def is_fully_approved(self) -> bool:
        """Alle erforderlichen Genehmigungen vorhanden?"""
        if len(self.approvals) < len(self.required_approvers):
            return False

        return all(
            role in self.approvals and self.approvals[role].approved
            for role in self.required_approvers
        )

    @workflow.run
    async def run(self, request: ApprovalRequest) -> str:
        # Erforderliche Genehmiger bestimmen
        self.required_approvers = self._get_required_approvers(request.amount)

        workflow.logger.info(
            f"Request {request.request_id} requires approval from: "
            f"{', '.join(self.required_approvers)}"
        )

        # Genehmigungs-Requests senden
        await workflow.execute_activity(
            send_approval_requests,
            args=[request, self.required_approvers],
            start_to_close_timeout=timedelta(minutes=5),
        )

        # Auf alle Genehmigungen warten (max 14 Tage)
        try:
            await workflow.wait_condition(
                lambda: len(self.approvals) >= len(self.required_approvers),
                timeout=timedelta(days=14)
            )
        except asyncio.TimeoutError:
            return (
                f"Approval timeout - only {len(self.approvals)} of "
                f"{len(self.required_approvers)} approvals received"
            )

        # Prüfen ob alle approved
        if not self.is_fully_approved():
            rejected_by = [
                role for role, decision in self.approvals.items()
                if not decision.approved
            ]

            # Ablehnung verarbeiten
            await workflow.execute_activity(
                notify_rejection,
                args=[request, rejected_by],
                start_to_close_timeout=timedelta(minutes=5),
            )

            return f"Request rejected by: {', '.join(rejected_by)}"

        # Alle approved - Request verarbeiten
        await workflow.execute_activity(
            process_approved_request,
            request,
            start_to_close_timeout=timedelta(minutes=10),
        )

        return (
            f"Request approved by all {len(self.required_approvers)} approvers "
            f"and processed successfully"
        )

Approval Workflow Zustandsdiagramm:

stateDiagram-v2
    [*] --> SendingRequests: Workflow gestartet
    SendingRequests --> WaitingForApprovals: Notifications sent

    WaitingForApprovals --> CheckingStatus: Signal received
    CheckingStatus --> WaitingForApprovals: Not all approvals yet
    CheckingStatus --> Timeout: 14 days timeout
    CheckingStatus --> ValidatingDecisions: All approvals received

    ValidatingDecisions --> Rejected: Any rejection
    ValidatingDecisions --> Processing: All approved

    Processing --> [*]: Success
    Rejected --> [*]: Rejected
    Timeout --> [*]: Timeout

    note right of WaitingForApprovals
        Wartet auf Signale:
        - approve_manager
        - approve_director
        - approve_vp
    end note
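
Die Genehmigungen kommen von außen, etwa aus einem Approval-UI oder einem E-Mail-Link. Ein möglicher Client-Aufruf für die Manager-Genehmigung (Skizze; Workflow-ID-Schema und Namen sind Annahmen):

from datetime import datetime, timezone
from temporalio.client import Client

async def approve_as_manager(request_id: str) -> None:
    """Manager-Genehmigung als Signal an den Approval-Workflow senden"""
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle_for(
        MultiLevelApprovalWorkflow.run,
        workflow_id=f"approval-{request_id}",
    )
    await handle.signal(
        MultiLevelApprovalWorkflow.approve_manager,
        ApprovalDecision(
            approved=True,
            approver="alice.manager",
            timestamp=datetime.now(timezone.utc),
            comment="Budget freigegeben",
        ),
    )
    # Zwischenstand direkt per Query prüfen
    status = await handle.query(MultiLevelApprovalWorkflow.get_approval_status)
    print(f"Approval status: {status}")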

6.4.2 Progress Tracking mit Updates statt Query Polling

Ineffizient: Query Polling

# ✗ Ineffizient: Kontinuierliches Query Polling
async def wait_for_progress_old(handle, target_progress: int):
    """Veraltetes Pattern - vermeiden!"""
    while True:
        progress = await handle.query(MyWorkflow.get_progress)
        if progress >= target_progress:
            return progress
        await asyncio.sleep(1)  # Verschwendung!

Effizient: Update mit wait_condition

@workflow.defn
class DataMigrationWorkflow:
    """Progress Tracking mit Update statt Polling"""

    @workflow.init
    def __init__(self, total_records: int) -> None:
        # @workflow.init: gleiche Signatur wie run()
        self.progress = 0
        self.total_records = total_records
        self.completed = False

    @workflow.update
    async def wait_for_progress(self, min_progress: int) -> int:
        """Warte bis Progress erreicht, dann return"""
        # Workflow benachrichtigt Client wenn bereit
        await workflow.wait_condition(
            lambda: self.progress >= min_progress or self.completed
        )
        return self.progress

    @workflow.query
    def get_current_progress(self) -> int:
        """Sofortiger Progress-Check (wenn nötig)"""
        return self.progress

    @workflow.run
    async def run(self, total_records: int) -> str:
        self.total_records = total_records

        for i in range(total_records):
            await workflow.execute_activity(
                migrate_record,
                i,
                start_to_close_timeout=timedelta(seconds=30),
            )
            self.progress = i + 1

            # Log alle 10%
            if (i + 1) % max(1, total_records // 10) == 0:  # max() verhindert Division durch 0
                workflow.logger.info(
                    f"Progress: {(i+1)/total_records*100:.0f}%"
                )

        self.completed = True
        return f"Migrated {total_records} records"

# Client: Effiziente Progress-Überwachung
async def monitor_migration_progress(handle):
    """✓ Effizienter Ansatz mit Update"""
    # Warte auf 50% Progress
    progress = await handle.execute_update(
        DataMigrationWorkflow.wait_for_progress,
        50,  # min_progress - Update-Argumente werden positional übergeben
    )
    print(f"50% checkpoint reached: {progress} records")

    # Warte auf 100%
    progress = await handle.execute_update(
        DataMigrationWorkflow.wait_for_progress,
        100,
    )
    print(f"100% complete: {progress} records")

Vorteile Update-basiertes Progress Tracking:

  • ✅ Ein Request statt hunderte Queries
  • ✅ Workflow benachrichtigt Client aktiv
  • ✅ Keine Polling-Overhead
  • ✅ Präzise Benachrichtigung genau wenn Milestone erreicht
  • ✅ Server-Push statt Client-Pull

6.4.3 Externe Workflow Handles

Workflows können über externe Handles mit anderen Workflows kommunizieren:

@workflow.defn
class OrchestratorWorkflow:
    """Koordiniert mehrere Worker Workflows"""

    @workflow.run
    async def run(self, worker_ids: List[str]) -> dict:
        results = {}

        for worker_id in worker_ids:
            # Externes Workflow Handle abrufen
            worker_handle = workflow.get_external_workflow_handle_for(
                WorkerWorkflow.run,
                workflow_id=f"worker-{worker_id}"
            )

            # Signal an externes Workflow senden
            await worker_handle.signal(
                WorkerWorkflow.process_task,
                TaskData(task_id=f"task-{worker_id}", priority=1)
            )

            workflow.logger.info(f"Task sent to worker {worker_id}")

        # Warte auf alle Worker
        await asyncio.sleep(timedelta(minutes=10).total_seconds())

        # Optional: Externe Workflows canceln
        # await worker_handle.cancel()

        return {"workers_coordinated": len(worker_ids)}

@workflow.defn
class WorkerWorkflow:
    """Worker Workflow empfängt Signale"""

    @workflow.init
    def __init__(self) -> None:
        self.tasks: List[str] = []  # Ergebnisse der verarbeiteten Tasks

    @workflow.signal
    async def process_task(self, task: TaskData) -> None:
        """Empfange Task vom Orchestrator"""
        result = await workflow.execute_activity(
            process_task_activity,
            task,
            start_to_close_timeout=timedelta(minutes=5),
        )
        self.tasks.append(result)

    @workflow.run
    async def run(self) -> List[str]:
        # Warte auf Tasks oder Timeout
        await workflow.wait_condition(
            lambda: len(self.tasks) >= 5,
            timeout=timedelta(hours=1)
        )
        return self.tasks

Event History bei externen Signalen:

  • SignalExternalWorkflowExecutionInitiated in der History des Senders
  • WorkflowExecutionSignaled in der History des Empfängers
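
Die Beispiele oben verwenden TaskData und process_task_activity, ohne sie zu definieren. Mögliche Minimal-Definitionen (Skizze, Annahmen):

from dataclasses import dataclass
from temporalio import activity

@dataclass
class TaskData:
    task_id: str
    priority: int

@activity.defn
async def process_task_activity(task: TaskData) -> str:
    """Minimal-Skizze: verarbeitet einen Task und gibt dessen ID zurück"""
    activity.logger.info(f"Processing {task.task_id} (priority {task.priority})")
    return task.task_id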

6.4.4 Signal Buffering vor Workflow-Start

Signale, die VOR dem Workflow-Start gesendet werden, werden automatisch gepuffert:

async def start_with_buffered_signals():
    """Signale werden gepuffert bis Workflow startet"""
    client = await Client.connect("localhost:7233")

    # Workflow starten (Worker braucht Zeit zum Aufnehmen)
    handle = await client.start_workflow(
        MyWorkflow.run,
        id="workflow-123",
        task_queue="my-queue",
    )

    # Signale SOFORT senden (werden gepuffert wenn Workflow noch nicht läuft)
    await handle.signal(MyWorkflow.signal_1, "data1")
    await handle.signal(MyWorkflow.signal_2, "data2")
    await handle.signal(MyWorkflow.signal_3, "data3")

    # Wenn Workflow startet: Alle gepufferten Signale in Reihenfolge ausgeliefert
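
Die Workflow-Seite dazu könnte so aussehen (Skizze; die Handler-Namen signal_1 bis signal_3 sind aus dem Client-Beispiel übernommen). Entscheidend ist @workflow.init, damit die gepufferten Signale auf initialisierten State treffen:

from typing import List
from temporalio import workflow

@workflow.defn
class MyWorkflow:
    @workflow.init
    def __init__(self) -> None:
        # Wird garantiert VOR den gepufferten Signal-Handlern ausgeführt
        self.received: List[str] = []

    @workflow.signal
    def signal_1(self, data: str) -> None:
        self.received.append(data)

    @workflow.signal
    def signal_2(self, data: str) -> None:
        self.received.append(data)

    @workflow.signal
    def signal_3(self, data: str) -> None:
        self.received.append(data)

    @workflow.run
    async def run(self) -> List[str]:
        # Alle vor dem ersten Workflow Task gepufferten Signale sind hier bereits verarbeitet
        await workflow.wait_condition(lambda: len(self.received) >= 3)
        return self.received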

Wichtige Überlegungen:

  • Signale in Reihenfolge gepuffert
  • Alle vor erstem Workflow Task ausgeliefert
  • Workflow-State muss initialisiert sein BEVOR Handler ausführen
  • @workflow.init verwenden um uninitialisierte Variablen zu vermeiden

Buffering Visualisierung:

sequenceDiagram
    participant Client
    participant Frontend as Temporal Frontend
    participant Worker
    participant Workflow

    Client->>Frontend: start_workflow()
    Frontend-->>Client: Workflow ID

    Client->>Frontend: signal_1("data1")
    Frontend->>Frontend: Buffer signal_1

    Client->>Frontend: signal_2("data2")
    Frontend->>Frontend: Buffer signal_2

    Worker->>Frontend: Poll for tasks
    Frontend->>Worker: Workflow Task + buffered signals

    Worker->>Workflow: __init__() [via @workflow.init]
    Worker->>Workflow: signal_1("data1")
    Worker->>Workflow: signal_2("data2")
    Worker->>Workflow: run()

    Note over Workflow: Alle Signale verarbeitet<br/>BEVOR run() startet

6.4.5 Concurrency Safety bei async Handlern

Problem: Race Conditions

# ✗ Problem: Race Condition bei concurrent Updates
@workflow.update
async def withdraw(self, amount: Decimal) -> Decimal:
    # Mehrere Updates können interleaven!
    current = await workflow.execute_activity(
        get_balance, ...
    )  # Point A
    # Anderer Handler könnte hier ausführen!
    new_balance = current - amount  # Point B
    # Und hier!
    await workflow.execute_activity(
        set_balance, new_balance, ...
    )  # Point C
    return new_balance

Lösung: asyncio.Lock

from asyncio import Lock

@workflow.defn
class BankAccountWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.lock = Lock()  # Concurrency-Schutz
        self.balance = Decimal("1000.00")

    @workflow.update
    async def withdraw(self, amount: Decimal) -> Decimal:
        """Thread-safe Withdrawal mit Lock"""
        async with self.lock:  # Nur ein Handler gleichzeitig
            # Kritischer Bereich
            current_balance = await workflow.execute_activity(
                get_current_balance,
                start_to_close_timeout=timedelta(seconds=10),
            )

            if current_balance < amount:
                raise ValueError("Insufficient funds")

            # Betrag abziehen
            self.balance = current_balance - amount

            # Externes System updaten
            await workflow.execute_activity(
                update_balance,
                self.balance,
                start_to_close_timeout=timedelta(seconds=10),
            )

            return self.balance

    @withdraw.validator
    def validate_withdraw(self, amount: Decimal) -> None:
        if amount <= 0:
            raise ValueError("Amount must be positive")
        if amount > Decimal("10000.00"):
            raise ValueError("Amount exceeds daily limit")

Alternative: Message Queue Pattern

from collections import deque
from typing import Deque

@workflow.defn
class QueueBasedWorkflow:
    """Natürliche Serialisierung via Queue"""

    @workflow.init
    def __init__(self) -> None:
        self.message_queue: Deque[str] = deque()
        self.should_shutdown = False  # Wird z.B. per Signal gesetzt

    @workflow.signal
    def add_message(self, message: str) -> None:
        """Leichtgewichtiger Handler - nur queuen"""
        if len(self.message_queue) >= 1000:
            workflow.logger.warning("Queue full, dropping message")
            return

        self.message_queue.append(message)

    @workflow.run
    async def run(self) -> None:
        """Haupt-Workflow verarbeitet Queue"""
        while True:
            # Warte auf Messages
            await workflow.wait_condition(
                lambda: len(self.message_queue) > 0
            )

            # Verarbeite alle gepufferten Messages
            while self.message_queue:
                message = self.message_queue.popleft()

                # Verarbeitung (natürlich serialisiert)
                await workflow.execute_activity(
                    process_message,
                    message,
                    start_to_close_timeout=timedelta(seconds=30),
                )

            # Prüfe ob fortfahren
            if self.should_shutdown:
                break

Vorteile Queue Pattern:

  • ✅ Natürliche FIFO Serialisierung
  • ✅ Keine Race Conditions
  • ✅ Kann Messages batchen
  • ✅ Keine Locks nötig

Nachteile:

  • ❌ Komplexerer Code
  • ❌ Schwieriger typsicher zu machen
  • ❌ Weniger bequem als direkte Handler

6.4.6 Continue-As-New mit Handlern

Problem: Unfertige Handler bei Continue-As-New

# ⚠️ Problem: Handler könnte bei Continue-As-New abgebrochen werden
@workflow.run
async def run(self) -> None:
    while True:
        # Arbeit erledigen
        self.iteration += 1

        if workflow.info().is_continue_as_new_suggested():
            # PROBLEM: Signal/Update Handler könnten noch laufen!
            workflow.continue_as_new(self.iteration)

Lösung: Handler-Completion warten

@workflow.defn
class LongRunningWorkflow:
    @workflow.init
    def __init__(self, iteration: int = 0, total_processed: int = 0) -> None:
        # Signatur muss der von run() entsprechen (@workflow.init)
        self.iteration = iteration
        self.total_processed = total_processed

    @workflow.signal
    async def process_item(self, item: str) -> None:
        """Async Signal Handler"""
        result = await workflow.execute_activity(
            process_item_activity,
            item,
            start_to_close_timeout=timedelta(minutes=5),
        )
        self.total_processed += 1

    @workflow.run
    async def run(self, iteration: int = 0, total_processed: int = 0) -> None:
        while True:
            self.iteration += 1

            # Batch-Verarbeitung
            await workflow.execute_activity(
                batch_process,
                start_to_close_timeout=timedelta(minutes=10),
            )

            # Prüfe ob Continue-As-New nötig
            if workflow.info().is_continue_as_new_suggested():
                workflow.logger.info(
                    "Event history approaching limits - Continue-As-New"
                )

                # ✓ WICHTIG: Warte auf alle Handler
                await workflow.wait_condition(
                    lambda: workflow.all_handlers_finished()
                )

                # Jetzt sicher für Continue-As-New
                workflow.continue_as_new(
                    args=[self.iteration, self.total_processed]
                )

            # Nächste Iteration
            await asyncio.sleep(timedelta(hours=1).total_seconds())

Idempotenz über Continue-As-New hinweg:

from typing import List, Optional, Set

@workflow.defn
class IdempotentWorkflow:
    @workflow.init
    def __init__(self, processed_update_ids: Optional[List[str]] = None) -> None:
        # IDs bereits verarbeiteter Updates (als Liste übergeben, da gut serialisierbar)
        self.processed_update_ids: Set[str] = set(processed_update_ids or [])

    @workflow.update
    async def process_payment(self, payment_id: str) -> str:
        """Idempotenter Update über Continue-As-New"""
        # Update ID für Deduplizierung
        update_info = workflow.current_update_info()

        if update_info and update_info.id in self.processed_update_ids:
            workflow.logger.info(f"Skipping duplicate update: {update_info.id}")
            return "already_processed"

        # Payment verarbeiten
        result = await workflow.execute_activity(
            charge_payment,
            payment_id,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Als verarbeitet markieren
        if update_info:
            self.processed_update_ids.add(update_info.id)

        return result

    @workflow.run
    async def run(self, processed_update_ids: Optional[List[str]] = None) -> None:
        while True:
            # Workflow-Logik...

            if workflow.info().is_continue_as_new_suggested():
                await workflow.wait_condition(
                    lambda: workflow.all_handlers_finished()
                )

                # IDs an nächste Iteration übergeben
                workflow.continue_as_new(
                    list(self.processed_update_ids)
                )

6.5 Häufige Fehler und Anti-Patterns

6.5.1 Uninitialisierte State-Zugriffe

Problem:

# ✗ FALSCH: Handler kann vor run() ausgeführt werden!
@workflow.defn
class BadWorkflow:
    @workflow.run
    async def run(self, name: str) -> str:
        self.name = name  # Initialisiert hier
        await workflow.wait_condition(lambda: self.approved)
        return f"Hello {self.name}"

    @workflow.signal
    def approve(self) -> None:
        self.approved = True
        # self.approved wird nirgends initialisiert: Startet run() vor dem ersten Signal,
        # greift wait_condition auf ein nicht existierendes Attribut zu -> AttributeError!

Lösung:

# ✓ KORREKT: @workflow.init garantiert Ausführung vor Handlern
@workflow.defn
class GoodWorkflow:
    @workflow.init
    def __init__(self, name: str) -> None:
        self.name = name  # Garantiert zuerst ausgeführt
        self.approved = False

    @workflow.run
    async def run(self, name: str) -> str:
        await workflow.wait_condition(lambda: self.approved)
        return f"Hello {self.name}"

    @workflow.signal
    def approve(self) -> None:
        self.approved = True  # self.name existiert garantiert

6.5.2 State-Mutation in Queries

Problem:

# ✗ FALSCH: Queries müssen read-only sein!
@workflow.query
def get_and_increment_counter(self) -> int:
    self.counter += 1  # ERROR! State-Mutation
    return self.counter

Lösung:

# ✓ KORREKT: Query nur lesen, Update für Mutation
@workflow.query
def get_counter(self) -> int:
    return self.counter  # Read-only

@workflow.update
def increment_counter(self) -> int:
    self.counter += 1  # Mutations in Updates erlaubt
    return self.counter

6.5.3 Async Query Handler

Problem:

# ✗ FALSCH: Queries können nicht async sein!
@workflow.query
async def get_status(self) -> str:  # TypeError!
    return self.status

Lösung:

# ✓ KORREKT: Queries müssen synchron sein
@workflow.query
def get_status(self) -> str:
    return self.status

6.5.4 Nicht auf Handler-Completion warten

Problem:

# ✗ FALSCH: Workflow endet während Handler laufen
@workflow.run
async def run(self) -> str:
    await workflow.execute_activity(...)
    return "Done"  # Handler könnten noch laufen!

Lösung:

# ✓ KORREKT: Auf Handler warten
@workflow.run
async def run(self) -> str:
    await workflow.execute_activity(...)

    # Alle Handler fertigstellen
    await workflow.wait_condition(
        lambda: workflow.all_handlers_finished()
    )

    return "Done"

6.5.5 Exzessive Signal-Volumes

Problem:

# ✗ FALSCH: Zu viele Signale
for i in range(10000):
    await handle.signal(MyWorkflow.process_item, f"item-{i}")
# Überschreitet Event History Limits!

Lösung:

# ✓ BESSER: Batch-Signale
items = [f"item-{i}" for i in range(10000)]
await handle.signal(MyWorkflow.process_batch, items)

# Oder: Child Workflows für hohe Volumes (aus einem Parent-Workflow heraus)
for i in range(100):
    await workflow.execute_child_workflow(
        ProcessingWorkflow.run,
        items[i*100:(i+1)*100],  # Workflow-Argument positional übergeben
        id=f"batch-{i}",
    )
    )
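
Zur Batch-Variante gehört auf Workflow-Seite ein Handler, der die gesamte Liste als ein einziges Signal entgegennimmt (Skizze; process_batch ist eine Annahme):

from typing import List
from temporalio import workflow

@workflow.defn
class MyWorkflow:
    @workflow.init
    def __init__(self) -> None:
        self.pending: List[str] = []

    @workflow.signal
    def process_batch(self, items: List[str]) -> None:
        # Ein einziges Signal-Event statt 10.000 - Items nur puffern
        self.pending.extend(items)

    @workflow.run
    async def run(self) -> int:
        # Warten, bis mindestens ein Batch eingetroffen ist
        await workflow.wait_condition(lambda: len(self.pending) > 0)
        return len(self.pending)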

6.5.6 Kontinuierliches Query Polling

Problem:

# ✗ INEFFIZIENT: Tight polling loop
async def wait_for_completion_bad(handle):
    while True:
        status = await handle.query(MyWorkflow.get_status)
        if status == "completed":
            break
        await asyncio.sleep(0.5)  # Verschwendung!

Lösung:

# ✓ BESSER: Update mit wait_condition
@workflow.update
async def wait_for_completion(self, target_status: str) -> str:
    await workflow.wait_condition(lambda: self.status == target_status)
    return self.status

# Client
status = await handle.execute_update(
    MyWorkflow.wait_for_completion,
    "completed"
)

6.6 Praktisches Beispiel: E-Commerce Order Workflow

Vollständiges Beispiel mit Signalen, Queries und Updates:

"""
E-Commerce Order Workflow
Demonstriert Signals, Queries und Updates
"""
from temporalio import workflow, activity
from temporalio.client import Client
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional
from datetime import timedelta, datetime, timezone
from enum import Enum
import asyncio

# ==================== Data Models ====================

class OrderStatus(Enum):
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"

@dataclass
class OrderItem:
    sku: str
    name: str
    quantity: int
    price: Decimal

@dataclass
class PaymentInfo:
    payment_method: str
    amount: Decimal
    card_token: str

@dataclass
class ShippingInfo:
    address: str
    carrier: str
    tracking_number: str

@dataclass
class OrderProgress:
    """Query Response: Order Progress"""
    status: OrderStatus
    items_count: int
    total_amount: str
    payment_status: str
    shipping_status: str

# ==================== Workflow ====================

@workflow.defn
class OrderWorkflow:
    """E-Commerce Order mit Signals, Queries und Updates"""

    @workflow.init
    def __init__(self, order_id: str, customer_id: str) -> None:
        # @workflow.init: gleiche Signatur wie run()
        self.lock = asyncio.Lock()

        # Order State
        self.order_id = order_id
        self.customer_id = customer_id
        self.status = OrderStatus.PENDING

        # Items
        self.items: List[OrderItem] = []
        self.total = Decimal("0.00")

        # Payment
        self.payment_transaction_id: Optional[str] = None

        # Shipping
        self.shipping_info: Optional[ShippingInfo] = None

    # ========== Queries: Read-Only State Access ==========

    @workflow.query
    def get_status(self) -> str:
        """Aktueller Order Status"""
        return self.status.value

    @workflow.query
    def get_total(self) -> str:
        """Aktueller Total-Betrag"""
        return str(self.total)

    @workflow.query
    def get_progress(self) -> OrderProgress:
        """Detaillierter Progress"""
        return OrderProgress(
            status=self.status,
            items_count=len(self.items),
            total_amount=str(self.total),
            payment_status=(
                "paid" if self.payment_transaction_id
                else "pending"
            ),
            shipping_status=(
                f"shipped via {self.shipping_info.carrier}"
                if self.shipping_info
                else "not shipped"
            )
        )

    # ========== Updates: Validated State Changes ==========

    @workflow.update
    async def add_item(self, item: OrderItem) -> dict:
        """Item hinzufügen (mit Validierung)"""
        async with self.lock:
            # Inventory prüfen
            available = await workflow.execute_activity(
                check_inventory,
                item,
                start_to_close_timeout=timedelta(seconds=10),
            )

            if not available:
                raise ValueError(f"Item {item.sku} out of stock")

            # Item hinzufügen
            self.items.append(item)
            self.total += item.price * item.quantity

            workflow.logger.info(
                f"Added {item.quantity}x {item.name} - "
                f"Total: ${self.total}"
            )

            return {
                "items": len(self.items),
                "total": str(self.total)
            }

    @add_item.validator
    def validate_add_item(self, item: OrderItem) -> None:
        """Validator: Item nur wenn Order pending"""
        if self.status != OrderStatus.PENDING:
            raise ValueError(
                f"Cannot add items in status: {self.status.value}"
            )

        if item.quantity <= 0:
            raise ValueError("Quantity must be positive")

        if len(self.items) >= 50:
            raise ValueError("Maximum 50 items per order")

    @workflow.update
    async def process_payment(self, payment: PaymentInfo) -> str:
        """Payment verarbeiten (mit Validierung)"""
        async with self.lock:
            # Payment Amount validieren
            if payment.amount != self.total:
                raise ValueError(
                    f"Payment amount {payment.amount} != "
                    f"order total {self.total}"
                )

            # Payment Activity ausführen
            transaction_id = await workflow.execute_activity(
                charge_payment,
                payment,
                start_to_close_timeout=timedelta(seconds=30),
            )

            # State updaten
            self.payment_transaction_id = transaction_id
            self.status = OrderStatus.PAID

            workflow.logger.info(
                f"Payment successful: {transaction_id}"
            )

            return transaction_id

    @process_payment.validator
    def validate_payment(self, payment: PaymentInfo) -> None:
        """Validator: Payment nur wenn pending und Items vorhanden"""
        if self.status != OrderStatus.PENDING:
            raise ValueError(
                f"Cannot process payment in status: {self.status.value}"
            )

        if len(self.items) == 0:
            raise ValueError("Cannot pay for empty order")

        if not payment.card_token or len(payment.card_token) < 10:
            raise ValueError("Invalid card token")

    # ========== Signals: Async Notifications ==========

    @workflow.signal
    async def mark_shipped(self, shipping: ShippingInfo) -> None:
        """Order als shipped markieren"""
        async with self.lock:
            if self.status != OrderStatus.PAID:
                workflow.logger.warning(
                    f"Cannot ship order in status: {self.status.value}"
                )
                return

            # Shipping System updaten
            await workflow.execute_activity(
                update_shipping_system,
                args=[self.order_id, shipping],
                start_to_close_timeout=timedelta(seconds=10),
            )

            self.shipping_info = shipping
            self.status = OrderStatus.SHIPPED

            workflow.logger.info(
                f"Order shipped via {shipping.carrier} - "
                f"Tracking: {shipping.tracking_number}"
            )

    @workflow.signal
    def cancel_order(self, reason: str) -> None:
        """Order canceln"""
        if self.status in [OrderStatus.SHIPPED, OrderStatus.DELIVERED]:
            workflow.logger.warning(
                f"Cannot cancel order in status: {self.status.value}"
            )
            return

        self.status = OrderStatus.CANCELLED
        workflow.logger.info(f"Order cancelled: {reason}")

    # ========== Main Workflow ==========

    @workflow.run
    async def run(self, order_id: str, customer_id: str) -> str:
        """Order Workflow Main Logic"""
        self.order_id = order_id
        self.customer_id = customer_id

        workflow.logger.info(
            f"Order {order_id} created for customer {customer_id}"
        )

        # Warte auf Payment (max 7 Tage)
        try:
            await workflow.wait_condition(
                lambda: self.status == OrderStatus.PAID,
                timeout=timedelta(days=7)
            )
        except asyncio.TimeoutError:
            self.status = OrderStatus.CANCELLED
            return f"Order {order_id} cancelled - payment timeout"

        # Warte auf Shipment (max 30 Tage)
        try:
            await workflow.wait_condition(
                lambda: self.status == OrderStatus.SHIPPED,
                timeout=timedelta(days=30)
            )
        except asyncio.TimeoutError:
            workflow.logger.error("Shipment timeout!")
            return f"Order {order_id} paid but not shipped"

        # Simuliere Delivery Tracking (asyncio.sleep erwartet Sekunden, der Timer ist durable)
        await asyncio.sleep(timedelta(days=3).total_seconds())

        # Mark als delivered
        self.status = OrderStatus.DELIVERED

        # Delivery Confirmation senden
        await workflow.execute_activity(
            send_delivery_confirmation,
            self.customer_id,
            self.order_id,
            start_to_close_timeout=timedelta(seconds=30),
        )

        workflow.logger.info(f"Order {order_id} delivered successfully")

        return f"Order {order_id} completed - ${self.total} charged"

# ==================== Activities ====================

@activity.defn
async def check_inventory(item: OrderItem) -> bool:
    """Prüfe Inventory-Verfügbarkeit"""
    # Simuliert Inventory-Check
    activity.logger.info(f"Checking inventory for {item.sku}")
    return True

@activity.defn
async def charge_payment(payment: PaymentInfo) -> str:
    """Verarbeite Payment"""
    activity.logger.info(
        f"Charging {payment.amount} via {payment.payment_method}"
    )
    # Simuliert Payment Gateway
    return f"txn_{payment.card_token[:8]}"

@activity.defn
async def update_shipping_system(
    order_id: str,
    shipping: ShippingInfo
) -> None:
    """Update Shipping System"""
    activity.logger.info(
        f"Updating shipping for {order_id} - {shipping.carrier}"
    )

@activity.defn
async def send_delivery_confirmation(
    customer_id: str,
    order_id: str
) -> None:
    """Sende Delivery Confirmation"""
    activity.logger.info(
        f"Sending delivery confirmation to {customer_id} for {order_id}"
    )

# ==================== Client Usage ====================

async def main():
    """Client-seitiger Order Flow"""
    client = await Client.connect("localhost:7233")

    # Order Workflow starten
    order_id = "order-12345"
    handle = await client.start_workflow(
        OrderWorkflow.run,
        order_id,
        "customer-789",
        id=order_id,
        task_queue="orders",
    )

    print(f"Order {order_id} created")

    # Items hinzufügen (Update)
    try:
        result = await handle.execute_update(
            OrderWorkflow.add_item,
            OrderItem(
                sku="LAPTOP-001",
                name="Gaming Laptop",
                quantity=1,
                price=Decimal("1299.99")
            )
        )
        print(f"Item added: {result}")

        result = await handle.execute_update(
            OrderWorkflow.add_item,
            OrderItem(
                sku="MOUSE-001",
                name="Wireless Mouse",
                quantity=2,
                price=Decimal("29.99")
            )
        )
        print(f"Item added: {result}")

    except Exception as e:
        print(f"Failed to add item: {e}")
        return

    # Total abfragen (Query)
    total = await handle.query(OrderWorkflow.get_total)
    print(f"Order total: ${total}")

    # Progress abfragen (Query)
    progress = await handle.query(OrderWorkflow.get_progress)
    print(f"Progress: {progress}")

    # Payment verarbeiten (Update mit Validierung)
    try:
        txn_id = await handle.execute_update(
            OrderWorkflow.process_payment,
            PaymentInfo(
                payment_method="credit_card",
                amount=Decimal(total),
                card_token="tok_1234567890abcdef"
            )
        )
        print(f"Payment processed: {txn_id}")
    except Exception as e:
        print(f"Payment failed: {e}")
        return

    # Status abfragen
    status = await handle.query(OrderWorkflow.get_status)
    print(f"Order status: {status}")

    # Shipment markieren (Signal)
    await handle.signal(
        OrderWorkflow.mark_shipped,
        ShippingInfo(
            address="123 Main St, City, State 12345",
            carrier="UPS",
            tracking_number="1Z999AA10123456784"
        )
    )
    print("Order marked as shipped")

    # Final result abwarten
    result = await handle.result()
    print(f"\nWorkflow completed: {result}")

if __name__ == "__main__":
    asyncio.run(main())

Beispiel Output:

Order order-12345 created
Item added: {'items': 1, 'total': '1299.99'}
Item added: {'items': 2, 'total': '1359.97'}
Order total: $1359.97
Progress: OrderProgress(status=<OrderStatus.PENDING: 'pending'>, items_count=2, total_amount='1359.97', payment_status='pending', shipping_status='not shipped')
Payment processed: txn_1234567890abcdef
Order status: paid
Order marked as shipped

Workflow completed: Order order-12345 completed - $1359.97 charged

6.7 Zusammenfassung

Kernkonzepte

Signals (Signale):

  • Asynchrone, fire-and-forget Zustandsänderungen
  • Erzeugen Event History Einträge
  • Können vor Workflow-Start gepuffert werden
  • Perfekt für Event-driven Patterns und Human-in-the-Loop

Queries (Abfragen):

  • Synchrone, read-only Zustandsabfragen
  • KEINE Event History Einträge
  • Funktionieren auf abgeschlossenen Workflows
  • Ideal für Dashboards und Monitoring

Updates (Aktualisierungen):

  • Synchrone Zustandsänderungen mit Response
  • Optionale Validierung vor Ausführung
  • Event History nur bei erfolgreicher Validierung
  • Beste Wahl für Request-Response Patterns
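
Zur Einordnung eine minimale Skizze (ein hypothetischer CounterWorkflow, kein Beispiel aus dem Repository), die alle drei Mechanismen nebeneinander zeigt:

from temporalio import workflow

@workflow.defn
class CounterWorkflow:
    def __init__(self) -> None:
        self.count = 0
        self.done = False

    @workflow.signal
    def stop(self) -> None:
        """Signal: fire-and-forget, kein Rückgabewert"""
        self.done = True

    @workflow.query
    def current_count(self) -> int:
        """Query: read-only, keine State-Änderung"""
        return self.count

    @workflow.update
    def increment(self, amount: int) -> int:
        """Update: ändert State und liefert eine Response"""
        self.count += amount
        return self.count

    @increment.validator
    def validate_increment(self, amount: int) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")

    @workflow.run
    async def run(self) -> int:
        await workflow.wait_condition(lambda: self.done)
        return self.count

Der Client ruft increment über execute_update auf, liest current_count per query und beendet den Workflow mit dem Signal stop.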

Entscheidungsbaum

flowchart TD
    A[Workflow Interaktion] --> B{Zustand ändern?}

    B -->|Nein - Nur lesen| C[Query verwenden]

    B -->|Ja - Zustand ändern| D{Response nötig?}

    D -->|Nein| F[Signal]

    D -->|Ja| G{Validierung<br/>nötig?}
    G -->|Ja| H[Update mit Validator]
    G -->|Nein| I[Update ohne Validator]

    C --> J[Vorteile:<br/>- Keine History<br/>- Nach Workflow-Ende<br/>- Niedrige Latenz]

    F --> K[Vorteile:<br/>- Fire-and-forget<br/>- Niedrige Latenz<br/>- Event-driven]

    H --> L[Vorteile:<br/>- Frühe Ablehnung<br/>- Input-Validierung<br/>- Synchrone Response]

    I --> M[Vorteile:<br/>- Synchrone Response<br/>- Fehler-Feedback<br/>- Activity-Ausführung]

    style C fill:#90EE90
    style F fill:#87CEEB
    style H fill:#FFD700
    style I fill:#FFA500

Best Practices Checkliste

Allgemein:

  • ✅ @workflow.init für State-Initialisierung verwenden
  • ✅ Dataclasses für typsichere Parameter nutzen
  • ✅ Auf workflow.all_handlers_finished() vor Workflow-Ende warten
  • ✅ Event History Limits beachten (Continue-As-New)

Signale:

  • ✅ Handler leichtgewichtig halten
  • ✅ Idempotenz implementieren
  • ✅ Nicht tausende Signale senden (batchen!)
  • ✅ Signal-with-Start für lazy initialization (Beispiel nach der Checkliste)

Queries:

  • ✅ Nur synchrone (def) Handler
  • ✅ KEINE State-Mutation
  • ✅ Pre-compute komplexe Werte
  • ✅ NICHT kontinuierlich pollen

Updates:

  • ✅ Validators für Input-Validierung
  • ✅ asyncio.Lock bei concurrent async Handlern
  • ✅ Update statt Signal+Query Polling
  • ✅ Idempotenz über Continue-As-New
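
Zum Punkt Signal-with-Start eine kurze Skizze; VotingWorkflow und register_vote sind hier nur angenommene Namen:

from temporalio.client import Client

async def send_vote(client: Client, topic: str, voter: str) -> None:
    # Startet den Workflow, falls er noch nicht läuft, und stellt das Signal
    # in jedem Fall zu (lazy initialization ohne separaten Start-Aufruf)
    await client.start_workflow(
        VotingWorkflow.run,            # hypothetischer Workflow
        topic,
        id=f"voting-{topic}",
        task_queue="voting",
        start_signal="register_vote",  # Name des Signal-Handlers im Workflow
        start_signal_args=[voter],
    )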

Häufige Anti-Patterns

Anti-Pattern | Problem | Lösung
State in run() initialisieren | Handler könnten vor run() ausführen | @workflow.init verwenden
Async Query Handler | TypeError | Nur def, nicht async def
State in Query ändern | Verletzt Read-Only | Update verwenden
Nicht auf Handler warten | Workflow endet vorzeitig | workflow.all_handlers_finished()
Kontinuierliches Query Polling | Ineffizient | Update mit wait_condition
Race Conditions in async Handlern | Concurrent execution | asyncio.Lock verwenden
Tausende Signale | Event History Limit | Batching oder Child Workflows
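
Zwei der Lösungen aus der Tabelle (auf Handler warten und wait_condition statt Query-Polling) als kleine Skizze:

from temporalio import workflow

@workflow.defn
class HandlerAwareWorkflow:
    def __init__(self) -> None:
        self.approved = False

    @workflow.update
    async def approve(self) -> str:
        self.approved = True
        return "approved"

    @workflow.run
    async def run(self) -> str:
        # Statt dass der Client kontinuierlich Queries pollt, wartet der
        # Workflow deterministisch auf die Zustandsänderung durch das Update
        await workflow.wait_condition(lambda: self.approved)

        # Vor dem Ende auf alle noch laufenden Handler warten,
        # damit keine Updates oder Signale verloren gehen
        await workflow.wait_condition(workflow.all_handlers_finished)
        return "done"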

Nächste Schritte

In diesem Kapitel haben Sie die drei Kommunikationsmechanismen von Temporal kennengelernt. Im nächsten Teil des Buchs (Kapitel 7-9) werden wir uns mit Resilienz beschäftigen:

  • Kapitel 7: Error Handling und Retry Policies
  • Kapitel 8: Workflow Evolution und Versioning
  • Kapitel 9: Fortgeschrittene Resilienz-Patterns

Die Kommunikationsmuster aus diesem Kapitel bilden die Grundlage für robuste, produktionsreife Temporal-Anwendungen.


⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 7: Fehlerbehandlung und Retries

Code-Beispiele für dieses Kapitel: examples/part-02/chapter-06/

Kapitel 7: Error Handling und Retry Policies

Einleitung

In verteilten Systemen sind Fehler unvermeidlich: Netzwerkverbindungen brechen ab, externe Services werden langsam oder unerreichbar, Datenbanken geraten unter Last, und Timeouts treten auf. Die Fähigkeit, gracefully mit diesen Fehlern umzugehen, ist entscheidend für resiliente Anwendungen.

Temporal wurde von Grund auf entwickelt, um Fehlerbehandlung zu vereinfachen und zu automatisieren. Das Framework übernimmt einen Großteil der komplexen Retry-Logik, während Entwickler die Kontrolle über kritische Business-Entscheidungen behalten.

In diesem Kapitel lernen Sie:

  • Den fundamentalen Unterschied zwischen Activity- und Workflow-Fehlern
  • Exception-Typen und deren Verwendung im Python SDK
  • Retry Policies konfigurieren und anpassen
  • Timeouts richtig einsetzen (Activity und Workflow)
  • Advanced Error Patterns (SAGA, Circuit Breaker, Dead Letter Queue)
  • Testing und Debugging von Fehlerszenarien
  • Best Practices für produktionsreife Fehlerbehandlung

Warum Error Handling in Temporal anders ist

In traditionellen Systemen müssen Entwickler:

  • Retry-Logik selbst implementieren
  • Exponential Backoff manuell programmieren
  • Idempotenz explizit sicherstellen
  • Fehlerzustände in Datenbanken speichern
  • Circuit Breaker selbst bauen

Mit Temporal:

  • Activities haben automatische Retries (konfigurierbar)
  • Exponential Backoff ist eingebaut
  • Die Event History speichert den gesamten State (kein separater Persistence Layer nötig)
  • Deterministische Replay-Garantien ermöglichen sichere Fehlerbehandlung
  • Deklarative Retry Policies statt imperativer Code

7.1 Error Handling Grundlagen

7.1.1 Activity Errors vs Workflow Errors

Der fundamentalste Unterschied in Temporal’s Error-Modell: Activity-Fehler führen NICHT automatisch zu Workflow-Fehlern. Dies ist ein bewusstes Design-Pattern für Resilienz.

Activity Errors:

  • Jede Python Exception in einer Activity wird automatisch in einen ApplicationError konvertiert
  • Activities haben Default Retry Policies und versuchen automatisch erneut
  • Activity-Fehler werden im Workflow als ActivityError weitergegeben
  • Der Workflow entscheidet, wie mit dem Fehler umgegangen wird

Workflow Errors:

  • Workflows haben KEINE Default Retry Policy
  • Nur explizite ApplicationError Raises führen zum Workflow-Fehler
  • Andere Python Exceptions (z.B. NameError, TypeError) führen zu Workflow Task Retries
  • Non-Temporal Exceptions werden als Bugs betrachtet, die durch Code-Fixes behoben werden können

Visualisierung: Error Flow

flowchart TD
    A[Activity wirft Exception] --> B{Activity Retry Policy}
    B -->|Retry erlaubt| C[Exponential Backoff]
    C --> D[Activity Retry]
    D --> B
    B -->|Max Attempts erreicht| E[ActivityError an Workflow]

    E --> F{Workflow Error Handling}
    F -->|try/except| G[Workflow behandelt Fehler]
    F -->|Nicht gefangen| H[Workflow wirft ApplicationError]

    G --> I[Workflow fährt fort]
    H --> J[Workflow Status: Failed]

    style A fill:#FFB6C1
    style E fill:#FFA500
    style H fill:#FF4444
    style I fill:#90EE90

Code-Beispiel:

from temporalio import workflow, activity
from temporalio.exceptions import ActivityError, ApplicationError
from temporalio.common import RetryPolicy
from datetime import timedelta

@activity.defn
async def risky_operation(data: str) -> str:
    """Activity die fehlschlagen kann"""
    if "invalid" in data:
        # Diese Exception wird zu ActivityError im Workflow
        raise ValueError(f"Invalid data: {data}")

    # Simuliere externe API
    result = await external_api.call(data)
    return result

@workflow.defn
class ErrorHandlingWorkflow:
    @workflow.run
    async def run(self, data: str) -> str:
        try:
            # Activity Retry Policy: Max 3 Versuche
            result = await workflow.execute_activity(
                risky_operation,
                data,
                start_to_close_timeout=timedelta(seconds=10),
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            return f"Success: {result}"

        except ActivityError as e:
            # Activity-Fehler abfangen und behandeln
            workflow.logger.error(f"Activity failed after retries: {e}")

            # Entscheidung: Workflow-Fehler oder graceful handling?
            if "critical" in data:
                # Kritischer Fehler → Workflow schlägt fehl
                raise ApplicationError(
                    "Critical operation failed",
                    non_retryable=True
                ) from e
            else:
                # Nicht-kritisch → Fallback
                return f"Failed with fallback: {e.cause.message if e.cause else str(e)}"

Wichtige Erkenntnisse:

  1. Separation of Concerns: Activities führen Work aus (fehleranfällig), Workflows orchestrieren (resilient)
  2. Automatic Retries: Die Plattform kümmert sich um die Retries, Sie konfigurieren nur die Policy
  3. Explicit Failure: Workflows müssen explizit fehlschlagen via ApplicationError
  4. Graceful Degradation: Workflows können Activity-Fehler abfangen und mit Fallback fortfahren

7.1.2 Exception-Hierarchie im Python SDK

Das Temporal Python SDK hat eine klare Exception-Hierarchie:

TemporalError (Basis für alle Temporal Exceptions)
├── FailureError (Basis für Runtime-Failures)
│   ├── ApplicationError (User-thrown, kontrolliert)
│   ├── ActivityError (Activity fehlgeschlagen)
│   ├── ChildWorkflowError (Child Workflow fehlgeschlagen)
│   ├── CancelledError (Cancellation)
│   ├── TerminatedError (Terminierung)
│   ├── TimeoutError (Timeout)
│   └── ServerError (Server-seitige Fehler)

Exception-Klassen im Detail:

1. ApplicationError - Die primäre Exception für bewusste Fehler

from temporalio.exceptions import ApplicationError

# Einfach
raise ApplicationError("Something went wrong")

# Mit Typ (für Retry Policy)
raise ApplicationError(
    "Payment failed",
    type="PaymentError"
)

# Non-retryable
raise ApplicationError(
    "Invalid input",
    type="ValidationError",
    non_retryable=True  # Keine Retries!
)

# Mit Details (serialisierbar)
raise ApplicationError(
    "Order processing failed",
    type="OrderError",
    details=[{
        "order_id": "12345",
        "reason": "inventory_unavailable",
        "requested": 10,
        "available": 5
    }]
)

# Mit custom retry delay
raise ApplicationError(
    "Rate limited",
    type="RateLimitError",
    next_retry_delay=timedelta(seconds=60)
)

2. ActivityError - Activity Failure Wrapper

try:
    result = await workflow.execute_activity(...)
except ActivityError as e:
    # Zugriff auf Error-Properties
    workflow.logger.error(f"Activity failed: {e.activity_type}")
    workflow.logger.error(f"Activity ID: {e.activity_id}")
    workflow.logger.error(f"Retry state: {e.retry_state}")

    # Zugriff auf die ursprüngliche Exception (cause)
    if isinstance(e.cause, ApplicationError):
        workflow.logger.error(f"Root cause type: {e.cause.type}")
        workflow.logger.error(f"Root cause message: {e.cause.message}")
        workflow.logger.error(f"Details: {e.cause.details}")

3. ChildWorkflowError - Child Workflow Failure Wrapper

try:
    result = await workflow.execute_child_workflow(...)
except ChildWorkflowError as e:
    workflow.logger.error(f"Child workflow {e.workflow_type} failed")
    workflow.logger.error(f"Workflow ID: {e.workflow_id}")
    workflow.logger.error(f"Run ID: {e.run_id}")

    # Nested causes navigieren
    current = e.cause
    while current:
        workflow.logger.error(f"Cause: {type(current).__name__}: {current}")
        if hasattr(current, 'cause'):
            current = current.cause
        else:
            break

4. TimeoutError - Timeout Wrapper

try:
    result = await workflow.execute_activity(...)
except TimeoutError as e:
    # Timeout-Typ ermitteln
    from temporalio.api.enums.v1 import TimeoutType

    if e.type == TimeoutType.TIMEOUT_TYPE_START_TO_CLOSE:
        workflow.logger.error("Activity execution timed out")
    elif e.type == TimeoutType.TIMEOUT_TYPE_HEARTBEAT:
        workflow.logger.error("Activity heartbeat timed out")
        # Last heartbeat details abrufen
        if e.last_heartbeat_details:
            workflow.logger.info(f"Last progress: {e.last_heartbeat_details}")

Visualisierung der Exception-Hierarchie:

classDiagram
    TemporalError <|-- FailureError
    FailureError <|-- ApplicationError
    FailureError <|-- ActivityError
    FailureError <|-- ChildWorkflowError
    FailureError <|-- CancelledError
    FailureError <|-- TimeoutError
    FailureError <|-- TerminatedError

    class TemporalError {
        +message: str
    }

    class ApplicationError {
        +type: str
        +message: str
        +details: list
        +non_retryable: bool
        +next_retry_delay: timedelta
        +cause: Exception
    }

    class ActivityError {
        +activity_id: str
        +activity_type: str
        +retry_state: RetryState
        +cause: Exception
    }

    class ChildWorkflowError {
        +workflow_id: str
        +workflow_type: str
        +run_id: str
        +retry_state: RetryState
        +cause: Exception
    }

7.1.3 Retriable vs Non-Retriable Errors

Ein kritisches Konzept: Welche Fehler sollen retry-ed werden, welche nicht?

Retriable Errors (Transiente Fehler):

  • Netzwerk-Timeouts
  • Verbindungsfehler
  • Service temporarily unavailable (503)
  • Rate Limiting (429)
  • Datenbank Deadlocks
  • Temporäre Ressourcen-Knappheit

Non-Retryable Errors (Permanente Fehler):

  • Authentication failures (401, 403)
  • Resource not found (404)
  • Bad Request / Validierung (400)
  • Business Logic Failures (Insufficient funds, Invalid state)
  • Permanente Authorization Errors

Entscheidungsbaum:

flowchart TD
    A[Fehler aufgetreten] --> B{Wird der Fehler<br/>durch Warten behoben?}
    B -->|Ja| C[RETRIABLE]
    B -->|Nein| D{Kann User<br/>das Problem lösen?}

    D -->|Ja| E[NON-RETRIABLE<br/>mit aussagekräftiger Message]
    D -->|Nein| F{Ist es ein Bug<br/>im Code?}

    F -->|Ja| G[NON-RETRIABLE<br/>+ Alert an Ops]
    F -->|Nein| H[NON-RETRIABLE<br/>+ Detailed Error Info]

    C --> I[Retry Policy<br/>mit Exponential Backoff]
    E --> J[Sofortiges Failure<br/>mit Details für User]

    style C fill:#90EE90
    style E fill:#FFB6C1
    style G fill:#FF4444
    style H fill:#FFA500

Implementierungs-Beispiel:

from enum import Enum

class ErrorCategory(Enum):
    TRANSIENT = "transient"      # Retry
    VALIDATION = "validation"     # Don't retry
    AUTH = "authentication"       # Don't retry
    BUSINESS = "business_logic"   # Don't retry
    RATE_LIMIT = "rate_limit"     # Retry mit delay

@activity.defn
async def smart_error_handling(order_id: str) -> str:
    """Activity mit intelligentem Error Handling"""
    try:
        # External API call
        response = await payment_api.charge(order_id)
        return response.transaction_id

    except NetworkError as e:
        # Transient → Allow retry
        raise ApplicationError(
            f"Network error (will retry): {e}",
            type="NetworkError"
        ) from e

    except RateLimitError as e:
        # Rate limit → Retry mit custom delay
        raise ApplicationError(
            "API rate limit exceeded",
            type="RateLimitError",
            next_retry_delay=timedelta(seconds=60)
        ) from e

    except AuthenticationError as e:
        # Auth failure → Don't retry
        raise ApplicationError(
            "Payment service authentication failed",
            type="AuthenticationError",
            non_retryable=True
        ) from e

    except ValidationError as e:
        # Invalid input → Don't retry
        raise ApplicationError(
            f"Invalid order data: {e.errors}",
            type="ValidationError",
            non_retryable=True,
            details=[{"order_id": order_id, "errors": e.errors}]
        ) from e

    except InsufficientFundsError as e:
        # Business logic → Don't retry
        raise ApplicationError(
            "Payment declined: Insufficient funds",
            type="PaymentDeclinedError",
            non_retryable=True,
            details=[{
                "order_id": order_id,
                "amount_requested": e.amount,
                "balance": e.current_balance
            }]
        ) from e

# Im Workflow: Retry Policy mit non_retryable_error_types
@workflow.defn
class SmartOrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> dict:
        try:
            transaction_id = await workflow.execute_activity(
                smart_error_handling,
                order_id,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(
                    initial_interval=timedelta(seconds=1),
                    maximum_attempts=5,
                    backoff_coefficient=2.0,
                    # Diese Error-Typen NICHT retry-en
                    non_retryable_error_types=[
                        "ValidationError",
                        "AuthenticationError",
                        "PaymentDeclinedError"
                    ]
                )
            )
            return {"success": True, "transaction_id": transaction_id}

        except ActivityError as e:
            if isinstance(e.cause, ApplicationError):
                # Detailliertes Error-Handling basierend auf Typ
                error_type = e.cause.type

                if error_type == "PaymentDeclinedError":
                    # Kunde benachrichtigen (z.B. über eine Notification Activity)
                    workflow.logger.info("Payment declined - notifying customer")
                    return {"success": False, "reason": "insufficient_funds"}

                elif error_type == "ValidationError":
                    # Log für Debugging
                    workflow.logger.error(f"Validation failed: {e.cause.details}")
                    return {"success": False, "reason": "invalid_data"}

            # Unbehandelter Fehler → Workflow schlägt fehl
            raise

7.2 Retry Policies

Retry Policies sind das Herzstück von Temporal’s Resilienz. Sie definieren wie und wie oft ein fehlgeschlagener Versuch wiederholt wird.

7.2.1 RetryPolicy Konfiguration

Vollständige Parameter:

from temporalio.common import RetryPolicy
from datetime import timedelta

retry_policy = RetryPolicy(
    # Backoff-Intervall für ersten Retry (Default: 1s)
    initial_interval=timedelta(seconds=1),

    # Multiplikator für jeden weiteren Retry (Default: 2.0)
    backoff_coefficient=2.0,

    # Maximales Backoff-Intervall (Default: 100x initial_interval)
    maximum_interval=timedelta(seconds=100),

    # Maximale Anzahl Versuche (0 = unbegrenzt, Default: 0)
    maximum_attempts=5,

    # Error-Typen die NICHT retry-ed werden
    non_retryable_error_types=["ValidationError", "AuthError"]
)

Parameter-Beschreibung:

Parameter | Typ | Default | Beschreibung
initial_interval | timedelta | 1s | Wartezeit vor erstem Retry
backoff_coefficient | float | 2.0 | Multiplikator pro Retry
maximum_interval | timedelta | 100x initial | Max. Wartezeit zwischen Retries
maximum_attempts | int | 0 (∞) | Max. Anzahl Versuche (inkl. Original)
non_retryable_error_types | List[str] | None | Error-Types ohne Retry

7.2.2 Exponential Backoff

Formel (Wartezeit vor Versuch n, für n ≥ 2):

next_interval(n) = min(
    initial_interval × backoff_coefficient ^ (n − 2),
    maximum_interval
)

Beispiel-Progression (initial=1s, coefficient=2.0, max=100s):

Versuch 1: Sofort (Original)
Versuch 2: +1s  = 1s  Wartezeit
Versuch 3: +2s  = 3s  Gesamtzeit
Versuch 4: +4s  = 7s  Gesamtzeit
Versuch 5: +8s  = 15s Gesamtzeit
Versuch 6: +16s = 31s Gesamtzeit
Versuch 7: +32s = 63s Gesamtzeit
Versuch 8: +64s = 127s Gesamtzeit
Versuch 9: +100s (rechnerisch 128s, gecapped bei maximum_interval=100s)
Versuch 10+: +100s (maximum_interval)
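
Zur Veranschaulichung eine kleine Hilfsfunktion ohne Temporal-Bezug, die diese Progression nachrechnet:

from datetime import timedelta

def retry_intervals(initial: timedelta, coefficient: float,
                    maximum: timedelta, attempts: int) -> list[timedelta]:
    """Wartezeiten vor Versuch 2..attempts (rein illustrativ)"""
    intervals = []
    for n in range(2, attempts + 1):
        seconds = initial.total_seconds() * coefficient ** (n - 2)
        intervals.append(min(timedelta(seconds=seconds), maximum))
    return intervals

print([int(i.total_seconds()) for i in retry_intervals(
    timedelta(seconds=1), 2.0, timedelta(seconds=100), 10)])
# [1, 2, 4, 8, 16, 32, 64, 100, 100]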

Visualisierung:

graph LR
    A[Attempt 1<br/>Immediate] -->|Wait 1s| B[Attempt 2]
    B -->|Wait 2s| C[Attempt 3]
    C -->|Wait 4s| D[Attempt 4]
    D -->|Wait 8s| E[Attempt 5]
    E -->|Wait 16s| F[Attempt 6]
    F -->|Wait 32s| G[Attempt 7]
    G -->|Wait 64s| H[Attempt 8]
    H -->|Wait 100s<br/>capped| I[Attempt 9]

    style A fill:#90EE90
    style I fill:#FFB6C1

Warum Exponential Backoff?

  1. Thundering Herd vermeiden: Nicht alle Clients retry gleichzeitig
  2. Service-Erholung: Geben externen Services Zeit zu recovern
  3. Ressourcen-Schonung: Reduziert Load während Ausfall-Perioden
  4. Progressive Degradation: Schnelle erste Retries, dann geduldiger

Code-Beispiel mit verschiedenen Strategien:

# Strategie 1: Aggressive Retries (schnelle Transients)
quick_retry = RetryPolicy(
    initial_interval=timedelta(milliseconds=100),
    maximum_interval=timedelta(seconds=1),
    backoff_coefficient=1.5,
    maximum_attempts=10
)

# Strategie 2: Geduldige Retries (externe Services)
patient_retry = RetryPolicy(
    initial_interval=timedelta(seconds=5),
    maximum_interval=timedelta(minutes=5),
    backoff_coefficient=2.0,
    maximum_attempts=20
)

# Strategie 3: Limited Retries (kritische Operationen)
limited_retry = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    maximum_interval=timedelta(seconds=10),
    backoff_coefficient=2.0,
    maximum_attempts=3  # Nur 3 Versuche
)

# Strategie 4: Custom Delay (Rate Limiting)
@activity.defn
async def rate_limited_activity() -> str:
    try:
        return await external_api.call()
    except RateLimitError as e:
        # Custom delay basierend auf API Response
        retry_after = e.retry_after_seconds
        raise ApplicationError(
            "Rate limited",
            next_retry_delay=timedelta(seconds=retry_after)
        ) from e

7.2.3 Default Retry-Verhalten

Activities:

  • Haben automatisch eine Default Retry Policy
  • Retry unbegrenzt bis Erfolg
  • Default initial_interval: 1 Sekunde
  • Default backoff_coefficient: 2.0
  • Default maximum_interval: 100 Sekunden
  • Default maximum_attempts: 0 (unbegrenzt)

Workflows:

  • Haben KEINE Default Retry Policy
  • Müssen explizit konfiguriert werden wenn Retries gewünscht
  • Design-Philosophie: Fehler sollen über Activity-Retries behandelt werden, nicht durch erneutes Ausführen des gesamten Workflows

Child Workflows:

  • Können Retry Policies konfiguriert bekommen
  • Unabhängig vom Parent Workflow

Beispiel: Defaults überschreiben

# Activity OHNE Retries
await workflow.execute_activity(
    one_shot_activity,
    start_to_close_timeout=timedelta(seconds=10),
    retry_policy=RetryPolicy(maximum_attempts=1)  # Nur 1 Versuch
)

# Activity MIT custom Retries
await workflow.execute_activity(
    my_activity,
    start_to_close_timeout=timedelta(seconds=10),
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=2),
        maximum_attempts=5
    )
)

# Workflow MIT Retries (vom Client)
await client.execute_workflow(
    MyWorkflow.run,
    args=["data"],
    id="workflow-id",
    task_queue="my-queue",
    retry_policy=RetryPolicy(
        maximum_interval=timedelta(seconds=10),
        maximum_attempts=3
    )
)

7.2.4 Retry Policy für Activities, Workflows und Child Workflows

1. Activity Retry Policy:

@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> str:
        return await workflow.execute_activity(
            my_activity,
            "arg",
            start_to_close_timeout=timedelta(seconds=10),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError"]
            )
        )

2. Workflow Retry Policy (vom Client):

from temporalio.client import Client

client = await Client.connect("localhost:7233")

# Workflow mit Retry
handle = await client.execute_workflow(
    MyWorkflow.run,
    "argument",
    id="workflow-id",
    task_queue="my-queue",
    retry_policy=RetryPolicy(
        maximum_interval=timedelta(seconds=10),
        maximum_attempts=3
    )
)

3. Child Workflow Retry Policy:

@workflow.defn
class ParentWorkflow:
    @workflow.run
    async def run(self) -> str:
        # Child mit Retry Policy
        result = await workflow.execute_child_workflow(
            ChildWorkflow.run,
            "arg",
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=2),
                maximum_attempts=3
            )
        )
        return result

Vergleichstabelle:

Aspekt | Activity | Workflow | Child Workflow
Default Policy | Ja (unbegrenzt) | Nein | Nein
Konfiguration | Bei execute_activity | Beim Client-Start | Bei execute_child_workflow
Scope | Pro Activity-Call | Gesamte Execution | Pro Child Workflow
Empfehlung | Fast immer Retry | Selten Retry | Manchmal Retry

7.3 Activity Error Handling

7.3.1 Exceptions werfen und fangen

In Activities werfen:

from temporalio import activity
from temporalio.exceptions import ApplicationError

@activity.defn
async def process_payment(amount: float, card_token: str) -> str:
    """Payment Activity mit detailliertem Error Handling"""
    attempt = activity.info().attempt

    activity.logger.info(
        f"Processing payment (attempt {attempt})",
        extra={"amount": amount, "card_token": card_token[:4] + "****"}
    )

    try:
        # Call Payment Service
        result = await payment_service.charge(amount, card_token)
        return result.transaction_id

    except InsufficientFundsError as e:
        # Business Logic Error → Don't retry
        raise ApplicationError(
            "Payment declined: Insufficient funds",
            type="InsufficientFundsError",
            non_retryable=True,
            details=[{
                "amount": amount,
                "available_balance": e.available_balance,
                "shortfall": amount - e.available_balance
            }]
        ) from e

    except NetworkTimeoutError as e:
        # Transient Network Error → Allow retry mit custom delay
        delay = min(5 * attempt, 30)  # 5s, 10s, 15s, ..., max 30s
        raise ApplicationError(
            f"Network timeout on attempt {attempt}",
            type="NetworkError",
            next_retry_delay=timedelta(seconds=delay)
        ) from e

    except CardDeclinedError as e:
        # Permanent Card Issue → Don't retry
        raise ApplicationError(
            f"Card declined: {e.reason}",
            type="CardDeclinedError",
            non_retryable=True,
            details=[{"reason": e.reason, "card_last4": card_token[-4:]}]
        ) from e

    except Exception as e:
        # Unknown Error → Retry mit logging
        activity.logger.error(
            f"Unexpected error processing payment",
            extra={"error_type": type(e).__name__},
            exc_info=True
        )
        raise ApplicationError(
            f"Payment processing failed: {type(e).__name__}",
            type="UnexpectedError"
        ) from e

Im Workflow fangen:

@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, amount: float, card_token: str) -> dict:
        """Workflow mit umfassendem Activity Error Handling"""

        try:
            transaction_id = await workflow.execute_activity(
                process_payment,
                args=[amount, card_token],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(
                    maximum_attempts=3,
                    non_retryable_error_types=[
                        "InsufficientFundsError",
                        "CardDeclinedError"
                    ]
                )
            )

            # Success
            return {
                "success": True,
                "transaction_id": transaction_id
            }

        except ActivityError as e:
            # Activity failed nach allen Retries
            workflow.logger.error(
                "Payment activity failed",
                extra={
                    "activity_type": e.activity_type,
                    "retry_state": e.retry_state,
                }
            )

            # Zugriff auf Root Cause
            if isinstance(e.cause, ApplicationError):
                error_type = e.cause.type
                error_message = e.cause.message
                error_details = e.cause.details

                workflow.logger.error(
                    f"Root cause: {error_type}: {error_message}",
                    extra={"details": error_details}
                )

                # Typ-spezifisches Handling
                if error_type == "InsufficientFundsError":
                    # Kunde benachrichtigen
                    await workflow.execute_activity(
                        send_notification,
                        args=[{
                            "type": "payment_declined",
                            "reason": "insufficient_funds",
                            "amount": amount
                        }],
                        start_to_close_timeout=timedelta(seconds=10)
                    )

                    return {
                        "success": False,
                        "error": "insufficient_funds",
                        "details": error_details
                    }

                elif error_type == "CardDeclinedError":
                    return {
                        "success": False,
                        "error": "card_declined",
                        "details": error_details
                    }

            # Unbehandelte Fehler → Workflow schlägt fehl
            raise ApplicationError(
                f"Payment failed: {e}",
                non_retryable=True
            ) from e

7.3.2 Heartbeats für Long-Running Activities

Heartbeats erfüllen drei kritische Funktionen:

  1. Progress Tracking: Signalisiert Fortschritt an Temporal Service
  2. Cancellation Detection: Ermöglicht Activity Cancellation
  3. Resumption Support: Bei Retry kann Activity von letztem Heartbeat fortfahren

Heartbeat Implementation:

import asyncio
from temporalio import activity

@activity.defn
async def long_batch_processing(total_items: int) -> str:
    """Long-running Activity mit Heartbeats"""
    processed = 0

    activity.logger.info(f"Starting batch processing: {total_items} items")

    try:
        for item_id in range(total_items):
            # Check for cancellation
            if activity.is_cancelled():
                activity.logger.info(f"Activity cancelled after {processed} items")
                raise asyncio.CancelledError("Activity cancelled by user")

            # Process item
            await process_single_item(item_id)
            processed += 1

            # Send heartbeat with progress
            activity.heartbeat({
                "processed": processed,
                "total": total_items,
                "percent": (processed / total_items) * 100,
                "current_item": item_id
            })

            # Log progress alle 10% (max(..., 1) verhindert Division durch 0)
            if processed % max(total_items // 10, 1) == 0:
                activity.logger.info(
                    f"Progress: {processed}/{total_items} "
                    f"({processed/total_items*100:.0f}%)"
                )

        activity.logger.info(f"Batch processing completed: {processed} items")
        return f"Processed {processed} items successfully"

    except asyncio.CancelledError:
        # Cleanup bei Cancellation
        await cleanup_partial_work(processed)
        activity.logger.info(f"Cleaned up after processing {processed} items")
        raise  # Must re-raise

Resumable Activity (mit Heartbeat Details):

import time

@activity.defn
async def resumable_batch_processing(total_items: int) -> str:
    """Activity die bei Retry von letztem Heartbeat fortfährt"""

    # Check für vorherigen Fortschritt
    heartbeat_details = activity.info().heartbeat_details
    start_from = 0

    if heartbeat_details:
        # Resume von letztem Heartbeat
        last_progress = heartbeat_details[0]
        start_from = last_progress.get("processed", 0)
        activity.logger.info(f"Resuming from item {start_from}")
    else:
        activity.logger.info("Starting fresh batch processing")

    processed = start_from

    for item_id in range(start_from, total_items):
        # Process item
        await process_single_item(item_id)
        processed += 1

        # Heartbeat mit aktuellem Fortschritt
        activity.heartbeat({
            "processed": processed,
            "total": total_items,
            "last_item_id": item_id,
            "timestamp": time.time()
        })

    return f"Processed {processed} items (resumed from {start_from})"

Heartbeat Timeout konfigurieren:

@workflow.defn
class BatchWorkflow:
    @workflow.run
    async def run(self, total_items: int) -> str:
        return await workflow.execute_activity(
            long_batch_processing,
            args=[total_items],
            start_to_close_timeout=timedelta(minutes=30),  # Gesamtzeit
            heartbeat_timeout=timedelta(seconds=30),       # Max Zeit zwischen Heartbeats
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

Wichtige Heartbeat-Regeln:

  1. Throttling: Heartbeats werden automatisch gedrosselt (ca. 30-60s)
  2. Cancellation: Nur Activities mit heartbeat_timeout können gecancelt werden
  3. Resumption: Heartbeat Details persistieren über Retries
  4. Performance: Heartbeats sollten nicht zu frequent sein (alle paar Sekunden reicht)
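
Zu Regel 4 eine kurze Skizze: Heartbeats gebündelt pro 100 Items statt pro Item (process_single_item ist wie oben eine angenommene Hilfsfunktion):

from temporalio import activity

@activity.defn
async def chunked_processing(item_ids: list[int]) -> int:
    """Heartbeat nur alle 100 Items - weniger Overhead, trotzdem cancellable"""
    processed = 0
    for item_id in item_ids:
        await process_single_item(item_id)  # angenommene Hilfsfunktion (siehe oben)
        processed += 1
        if processed % 100 == 0 or processed == len(item_ids):
            activity.heartbeat({"processed": processed, "total": len(item_ids)})
    return processed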

7.3.3 Activity Timeouts

Es gibt vier Activity Timeout-Typen, die verschiedene Aspekte kontrollieren:

Timeout-Übersicht:

gantt
    title Activity Timeout Timeline
    dateFormat ss
    axisFormat %S

    section Queue
    Schedule           :milestone, 00, 0s
    Schedule-To-Start  :active, 00, 05

    section Execution
    Start              :milestone, 05, 0s
    Start-To-Close     :active, 05, 30
    Heartbeat Interval :crit, 05, 10
    Heartbeat Interval :crit, 15, 10
    Heartbeat Interval :crit, 25, 10
    Close              :milestone, 35, 0s

    section Overall
    Schedule-To-Close  :done, 00, 35

1. Start-To-Close Timeout (Empfohlen!)

  • Maximale Zeit für einzelne Activity Task Execution
  • Gilt pro Retry-Versuch
  • Triggers Retry bei Überschreitung
  • WICHTIG: Dieser Timeout ist stark empfohlen!
await workflow.execute_activity(
    my_activity,
    args=["data"],
    start_to_close_timeout=timedelta(seconds=30)  # Jeder Versuch max 30s
)

2. Schedule-To-Close Timeout

  • Maximale Zeit für gesamte Activity Execution (inkl. aller Retries)
  • Stoppt alle weiteren Retries bei Überschreitung
  • Triggers KEIN Retry (Budget erschöpft)
await workflow.execute_activity(
    my_activity,
    args=["data"],
    start_to_close_timeout=timedelta(seconds=10),   # Pro Versuch
    schedule_to_close_timeout=timedelta(minutes=5)  # Gesamt über alle Versuche
)

3. Schedule-To-Start Timeout

  • Maximale Zeit von Scheduling bis Worker Pickup
  • Erkennt Worker Crashes oder Queue Congestion
  • Triggers KEIN Retry (würde in gleiche Queue zurück)
await workflow.execute_activity(
    my_activity,
    args=["data"],
    start_to_close_timeout=timedelta(seconds=30),
    schedule_to_start_timeout=timedelta(minutes=5)  # Max 5 Min in Queue
)

4. Heartbeat Timeout

  • Maximale Zeit zwischen Activity Heartbeats
  • Erforderlich für Activity Cancellation
  • Triggers Retry bei Überschreitung
await workflow.execute_activity(
    long_running_activity,
    args=[1000],
    start_to_close_timeout=timedelta(minutes=30),
    heartbeat_timeout=timedelta(seconds=30)  # Heartbeat alle 30s erforderlich
)

Vollständiges Beispiel:

@workflow.defn
class TimeoutDemoWorkflow:
    @workflow.run
    async def run(self) -> dict:
        results = {}

        # Scenario 1: Quick API call
        results["api"] = await workflow.execute_activity(
            quick_api_call,
            start_to_close_timeout=timedelta(seconds=5),
            schedule_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=5)
        )

        # Scenario 2: Long-running with heartbeat
        results["batch"] = await workflow.execute_activity(
            long_batch_process,
            args=[1000],
            start_to_close_timeout=timedelta(minutes=30),
            heartbeat_timeout=timedelta(seconds=30),
            schedule_to_close_timeout=timedelta(hours=2)
        )

        # Scenario 3: With queue monitoring
        results["scalable"] = await workflow.execute_activity(
            scalable_activity,
            start_to_close_timeout=timedelta(seconds=10),
            schedule_to_start_timeout=timedelta(minutes=5),
            schedule_to_close_timeout=timedelta(minutes=10)
        )

        return results

Timeout vs Retry Interaktion:

Timeout Typ | Triggers Retry? | Use Case
Start-To-Close | ✅ Ja | Einzelne Execution überwachen
Schedule-To-Close | ❌ Nein | Gesamt-Budget kontrollieren
Schedule-To-Start | ❌ Nein | Queue Issues erkennen
Heartbeat | ✅ Ja | Long-running Progress überwachen

Best Practices:

  1. Immer Start-To-Close setzen (Temporal empfiehlt dies stark)
  2. Schedule-To-Close optional für Budget-Kontrolle
  3. Schedule-To-Start bei Scaling-Concerns
  4. Heartbeat für Long-Running (> 1 Minute)

7.3.4 Activity Cancellation

Activity Cancellation ermöglicht graceful shutdown von laufenden Activities.

Requirements:

  1. Activity muss Heartbeats senden
  2. Heartbeat Timeout muss gesetzt sein
  3. Activity muss asyncio.CancelledError behandeln

Cancellable Activity Implementation:

import asyncio
from temporalio import activity

@activity.defn
async def cancellable_long_operation(data_size: int) -> str:
    """Activity mit Cancellation Support"""
    processed = 0

    activity.logger.info(f"Starting operation: {data_size} items")

    try:
        while processed < data_size:
            # Check Cancellation
            if activity.is_cancelled():
                activity.logger.info(
                    f"Cancellation detected at {processed}/{data_size}"
                )
                raise asyncio.CancelledError("Operation cancelled by user")

            # Do work chunk
            await process_chunk(processed, min(processed + 100, data_size))
            processed += 100

            # Send heartbeat (enables cancellation detection)
            activity.heartbeat({
                "processed": processed,
                "total": data_size,
                "percent": (processed / data_size) * 100
            })

            # Small sleep to avoid tight loop
            await asyncio.sleep(0.5)

        activity.logger.info("Operation completed successfully")
        return f"Processed {processed} items"

    except asyncio.CancelledError:
        # Cleanup logic
        activity.logger.info(f"Cleaning up after processing {processed} items")
        await cleanup_resources(processed)

        # Save state for potential resume
        await save_progress(processed)

        # Must re-raise to signal cancellation
        raise

    except Exception as e:
        activity.logger.error(f"Operation failed: {e}")
        await cleanup_resources(processed)
        raise

Workflow-seitige Cancellation:

@workflow.defn
class CancellableWorkflow:
    @workflow.run
    async def run(self, data_size: int, timeout_seconds: int) -> str:
        # Start activity (nicht sofort await)
        activity_handle = workflow.start_activity(
            cancellable_long_operation,
            args=[data_size],
            start_to_close_timeout=timedelta(minutes=30),
            heartbeat_timeout=timedelta(seconds=30),  # Required!
        )

        try:
            # Wait mit Custom Timeout
            result = await asyncio.wait_for(
                activity_handle,
                timeout=timeout_seconds
            )
            return result

        except asyncio.TimeoutError:
            # Timeout → Cancel Activity
            workflow.logger.info(f"Timeout after {timeout_seconds}s - cancelling activity")
            activity_handle.cancel()

            try:
                # Wait for cancellation to complete
                await activity_handle
            except asyncio.CancelledError:
                workflow.logger.info("Activity cancelled successfully")

            return "Operation timed out and was cancelled"

Client-seitige Cancellation:

from temporalio.client import Client, WorkflowFailureError
from temporalio.exceptions import CancelledError

# Start Workflow
client = await Client.connect("localhost:7233")
handle = await client.start_workflow(
    CancellableWorkflow.run,
    args=[10000, 60],
    id="cancellable-workflow-1",
    task_queue="my-queue"
)

# Cancel von extern
await asyncio.sleep(30)  # Nach 30 Sekunden canceln
await handle.cancel()

# Check result
try:
    result = await handle.result()
except WorkflowFailureError as e:
    # Ein gecancelter Workflow schlägt mit CancelledError als cause fehl
    if isinstance(e.cause, CancelledError):
        print("Workflow was cancelled")
    else:
        raise

7.4 Workflow Error Handling

7.4.1 Try/Except in Workflows

Workflows können Activity-Fehler abfangen und behandeln:

Basic Pattern:

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        inventory_id = None
        payment_id = None

        try:
            # Step 1: Reserve Inventory
            workflow.logger.info("Reserving inventory...")
            inventory_id = await workflow.execute_activity(
                reserve_inventory,
                args=[order.items],
                start_to_close_timeout=timedelta(seconds=30)
            )

            # Step 2: Process Payment
            workflow.logger.info("Processing payment...")
            payment_id = await workflow.execute_activity(
                process_payment,
                args=[order.payment_info, order.total],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(
                    maximum_attempts=3,
                    non_retryable_error_types=["InsufficientFundsError"]
                )
            )

            # Step 3: Ship Order
            workflow.logger.info("Shipping order...")
            tracking = await workflow.execute_activity(
                ship_order,
                args=[order.address, inventory_id],
                start_to_close_timeout=timedelta(minutes=5)
            )

            return OrderResult(success=True, tracking=tracking)

        except ActivityError as e:
            workflow.logger.error(f"Order failed: {e}")

            # Compensating Transactions (SAGA Pattern)
            if payment_id:
                workflow.logger.info("Refunding payment...")
                await workflow.execute_activity(
                    refund_payment,
                    args=[payment_id],
                    start_to_close_timeout=timedelta(seconds=30)
                )

            if inventory_id:
                workflow.logger.info("Releasing inventory...")
                await workflow.execute_activity(
                    release_inventory,
                    args=[inventory_id],
                    start_to_close_timeout=timedelta(seconds=30)
                )

            # Determine failure reason
            if isinstance(e.cause, ApplicationError):
                if e.cause.type == "InsufficientFundsError":
                    return OrderResult(
                        success=False,
                        error="Payment declined: Insufficient funds"
                    )

            # Re-raise für unbehandelte Fehler
            raise

7.4.2 ApplicationError vs Non-Determinism Errors

ApplicationError (Bewusster Workflow-Fehler):

@workflow.defn
class ValidationWorkflow:
    @workflow.run
    async def run(self, data: dict) -> str:
        # Business Logic Validation
        if not data.get("required_field"):
            # Explizites Workflow Failure
            raise ApplicationError(
                "Invalid workflow input: missing required_field",
                type="ValidationError",
                non_retryable=True,
                details=[{"received_data": data}]
            )

        # Workflow fährt fort
        return "Success"

Non-Determinism Errors (Bug im Workflow Code):

import random
import datetime

@workflow.defn
class BadWorkflow:
    @workflow.run
    async def run(self) -> str:
        # ❌ FALSCH: Non-deterministic!
        random_value = random.random()  # Anders bei Replay!
        now = datetime.datetime.now()   # Anders bei Replay!

        # ✅ RICHTIG: Temporal APIs verwenden
        random_value = workflow.random().random()
        now = workflow.now()

        return "Success"

Unterschiede:

Aspekt | ApplicationError | Non-Determinism Error
Zweck | Business Logic Failure | Code Bug
Ursache | Bewusster Raise | Geänderter/Non-Det Code
Retry | Konfigurierbar | Task Retry unendlich
Fix | Im Workflow-Code behandeln | Code fixen + redeploy
History | Execution → Failed | Task Retry Loop

Non-Determinism vermeiden:

  1. Kein non-deterministic Code:

    • ❌ random.random()
    • ❌ datetime.now()
    • ❌ uuid.uuid4()
    • ❌ External I/O im Workflow

    Stattdessen:

    • ✅ workflow.random()
    • ✅ workflow.now()
    • ✅ workflow.uuid4()
  2. Sandbox nutzen (aktiviert per Default):

from temporalio.worker import Worker

worker = Worker(
    client,
    task_queue="my-queue",
    workflows=[MyWorkflow],
    activities=[my_activity],
    # Sandbox enabled by default - schützt vor Non-Determinism
)

7.4.3 Child Workflow Error Handling

Child Workflows mit Error Handling:

@workflow.defn
class ParentWorkflow:
    @workflow.run
    async def run(self, orders: list[Order]) -> dict:
        """Parent verarbeitet mehrere Child Workflows"""
        successful = []
        failed = []

        for order in orders:
            try:
                # Execute Child Workflow
                result = await workflow.execute_child_workflow(
                    OrderWorkflow.run,
                    args=[order],
                    retry_policy=RetryPolicy(
                        maximum_attempts=3,
                        initial_interval=timedelta(seconds=2)
                    ),
                    # Parent Close Policy
                    parent_close_policy=ParentClosePolicy.ABANDON
                )

                successful.append({
                    "order_id": order.id,
                    "result": result
                })
                workflow.logger.info(f"Order {order.id} completed")

            except ChildWorkflowError as e:
                workflow.logger.error(f"Order {order.id} failed: {e}")

                # Navigate nested error causes
                root_cause = e.cause
                while hasattr(root_cause, 'cause') and root_cause.cause:
                    root_cause = root_cause.cause

                failed.append({
                    "order_id": order.id,
                    "error": str(e),
                    "root_cause": str(root_cause)
                })

        return {
            "successful_count": len(successful),
            "failed_count": len(failed),
            "successful": successful,
            "failed": failed
        }

Parent Close Policies:

from temporalio.workflow import ParentClosePolicy

# Terminate child when parent closes (Default)
parent_close_policy=ParentClosePolicy.TERMINATE

# Cancel child when parent closes
parent_close_policy=ParentClosePolicy.REQUEST_CANCEL

# Let child continue independently
parent_close_policy=ParentClosePolicy.ABANDON

7.5 Advanced Error Patterns

7.5.1 SAGA Pattern für Distributed Transactions

Das SAGA Pattern implementiert verteilte Transaktionen durch Compensation Actions.

Konzept:

  • Jeder Schritt hat eine entsprechende Compensation (Undo)
  • Bei Fehler werden alle bisherigen Schritte kompensiert
  • Reihenfolge: Reverse Order (LIFO)

Vollständige SAGA Implementation:

from dataclasses import dataclass, field
from typing import Optional, List, Tuple, Callable

@dataclass
class BookingRequest:
    user_id: str
    car_id: str
    hotel_id: str
    flight_id: str

@dataclass
class BookingResult:
    success: bool
    car_booking: Optional[str] = None
    hotel_booking: Optional[str] = None
    flight_booking: Optional[str] = None
    error: Optional[str] = None

# Forward Activities
@activity.defn
async def book_car(car_id: str) -> str:
    """Book car reservation"""
    if "invalid" in car_id:
        raise ValueError("Invalid car ID")
    activity.logger.info(f"Car booked: {car_id}")
    return f"car_booking_{car_id}"

@activity.defn
async def book_hotel(hotel_id: str) -> str:
    """Book hotel reservation"""
    if "invalid" in hotel_id:
        raise ValueError("Invalid hotel ID")
    activity.logger.info(f"Hotel booked: {hotel_id}")
    return f"hotel_booking_{hotel_id}"

@activity.defn
async def book_flight(flight_id: str) -> str:
    """Book flight reservation"""
    if "invalid" in flight_id:
        raise ValueError("Invalid flight ID")
    activity.logger.info(f"Flight booked: {flight_id}")
    return f"flight_booking_{flight_id}"

# Compensation Activities
@activity.defn
async def undo_book_car(booking_id: str) -> None:
    """Cancel car reservation"""
    activity.logger.info(f"Cancelling car booking: {booking_id}")
    await asyncio.sleep(0.5)

@activity.defn
async def undo_book_hotel(booking_id: str) -> None:
    """Cancel hotel reservation"""
    activity.logger.info(f"Cancelling hotel booking: {booking_id}")
    await asyncio.sleep(0.5)

@activity.defn
async def undo_book_flight(booking_id: str) -> None:
    """Cancel flight reservation"""
    activity.logger.info(f"Cancelling flight booking: {booking_id}")
    await asyncio.sleep(0.5)

# SAGA Workflow
@workflow.defn
class TripBookingSaga:
    """SAGA Pattern für Trip Booking"""

    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Track completed steps mit Compensations
        compensations: List[Tuple[Callable, str]] = []
        result = BookingResult(success=False)

        try:
            # Step 1: Book Car
            workflow.logger.info("Step 1: Booking car...")
            result.car_booking = await workflow.execute_activity(
                book_car,
                args=[request.car_id],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            compensations.append((undo_book_car, result.car_booking))
            workflow.logger.info(f"✓ Car booked: {result.car_booking}")

            # Step 2: Book Hotel
            workflow.logger.info("Step 2: Booking hotel...")
            result.hotel_booking = await workflow.execute_activity(
                book_hotel,
                args=[request.hotel_id],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(
                    maximum_attempts=3,
                    non_retryable_error_types=["ValueError"]
                )
            )
            compensations.append((undo_book_hotel, result.hotel_booking))
            workflow.logger.info(f"✓ Hotel booked: {result.hotel_booking}")

            # Step 3: Book Flight
            workflow.logger.info("Step 3: Booking flight...")
            result.flight_booking = await workflow.execute_activity(
                book_flight,
                args=[request.flight_id],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            compensations.append((undo_book_flight, result.flight_booking))
            workflow.logger.info(f"✓ Flight booked: {result.flight_booking}")

            # All steps successful!
            result.success = True
            workflow.logger.info("🎉 Trip booking completed successfully")
            return result

        except Exception as e:
            # Fehler → Execute Compensations in REVERSE order
            workflow.logger.error(f"❌ Booking failed: {e}. Executing compensations...")
            result.error = str(e)

            for compensation_activity, booking_id in reversed(compensations):
                try:
                    await workflow.execute_activity(
                        compensation_activity,
                        args=[booking_id],
                        start_to_close_timeout=timedelta(seconds=30),
                        retry_policy=RetryPolicy(
                            maximum_attempts=5,  # Compensations robuster!
                            initial_interval=timedelta(seconds=2)
                        )
                    )
                    workflow.logger.info(
                        f"✓ Compensation successful: {compensation_activity.__name__}"
                    )
                except Exception as comp_error:
                    # Log aber fortfahren mit anderen Compensations
                    workflow.logger.error(
                        f"⚠ Compensation failed: {compensation_activity.__name__}: {comp_error}"
                    )

            workflow.logger.info("All compensations completed")
            return result

SAGA Flow Visualisierung:

sequenceDiagram
    participant W as Workflow
    participant A1 as Book Car Activity
    participant A2 as Book Hotel Activity
    participant A3 as Book Flight Activity
    participant C3 as Undo Flight
    participant C2 as Undo Hotel
    participant C1 as Undo Car

    W->>A1: book_car()
    A1-->>W: car_booking_123
    Note over W: Add undo_book_car to stack

    W->>A2: book_hotel()
    A2-->>W: hotel_booking_456
    Note over W: Add undo_book_hotel to stack

    W->>A3: book_flight()
    A3--xW: Flight booking FAILED

    Note over W: Execute compensations<br/>in REVERSE order

    W->>C2: undo_book_hotel(456)
    C2-->>W: Hotel cancelled

    W->>C1: undo_book_car(123)
    C1-->>W: Car cancelled

    Note over W: All compensations done

SAGA Best Practices:

  1. Idempotenz: Alle Forward & Compensation Activities müssen idempotent sein (siehe Skizze unten)
  2. Compensation Resilience: Compensations mit aggressiveren Retry Policies
  3. Partial Success Tracking: Genau tracken welche Steps erfolgreich waren
  4. Compensation Logging: Ausführliches Logging für Debugging
  5. State Preservation: Workflow State nutzen für SAGA Progress
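
Zur ersten Best Practice eine minimale Skizze einer idempotenten Compensation: Vor dem Storno wird geprüft, ob die Buchung bereits storniert wurde. Das In-Memory-Set steht hier nur stellvertretend für ein echtes Buchungssystem (Annahme, nicht Teil der Kapitel-Beispiele):

from temporalio import activity

# Hypothetischer In-Memory-Ersatz für ein externes Buchungssystem (nur zur Illustration)
_cancelled_bookings: set[str] = set()

@activity.defn
async def undo_book_hotel_idempotent(booking_id: str) -> None:
    """Idempotente Compensation: mehrfache Ausführung hat denselben Effekt."""
    if booking_id in _cancelled_bookings:
        # Bereits storniert (z.B. durch einen früheren Retry) -> nichts erneut stornieren
        activity.logger.info(f"Booking {booking_id} already cancelled - skipping")
        return

    # Hier würde der eigentliche Storno-Aufruf an das Buchungssystem stehen
    _cancelled_bookings.add(booking_id)
    activity.logger.info(f"Booking {booking_id} cancelled")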

7.5.2 Circuit Breaker Pattern

Circuit Breaker verhindert Cascade Failures durch Blocking bei wiederholten Fehlern.

States:

  • CLOSED: Normal operation (Requests gehen durch)
  • OPEN: Blocking requests (Service hat Probleme)
  • HALF_OPEN: Testing recovery (Einzelne Requests testen)

Implementation:

from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError, ApplicationError

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreakerState:
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: Optional[datetime] = None
    success_count: int = 0

@workflow.defn
class CircuitBreakerWorkflow:
    def __init__(self) -> None:
        self.circuit = CircuitBreakerState()
        self.failure_threshold = 5
        self.timeout = timedelta(seconds=60)
        self.half_open_success_threshold = 2

    @workflow.run
    async def run(self, requests: list[str]) -> dict:
        """Process requests mit Circuit Breaker"""
        results = []

        for request in requests:
            try:
                result = await self.call_with_circuit_breaker(request)
                results.append({"request": request, "result": result, "status": "success"})
            except ApplicationError as e:
                results.append({"request": request, "error": str(e), "status": "failed"})

        return {
            "total": len(requests),
            "successful": sum(1 for r in results if r["status"] == "success"),
            "failed": sum(1 for r in results if r["status"] == "failed"),
            "results": results
        }

    async def call_with_circuit_breaker(self, request: str) -> str:
        """Call mit Circuit Breaker Protection"""

        # Check circuit state
        if self.circuit.state == CircuitState.OPEN:
            # Check timeout
            time_since_failure = workflow.now() - self.circuit.last_failure_time

            if time_since_failure < self.timeout:
                # Circuit still open
                raise ApplicationError(
                    f"Circuit breaker is OPEN (failures: {self.circuit.failure_count})",
                    type="CircuitBreakerOpen",
                    non_retryable=True
                )
            else:
                # Try half-open
                self.circuit.state = CircuitState.HALF_OPEN
                self.circuit.success_count = 0
                workflow.logger.info("Circuit breaker entering HALF_OPEN state")

        # Attempt the call
        try:
            result = await workflow.execute_activity(
                external_service_call,
                args=[request],
                start_to_close_timeout=timedelta(seconds=10),
                retry_policy=RetryPolicy(maximum_attempts=1)  # No retries!
            )

            # Success
            await self.on_success()
            return result

        except ActivityError as e:
            # Failure
            await self.on_failure()
            raise

    async def on_success(self) -> None:
        """Handle successful call"""
        if self.circuit.state == CircuitState.HALF_OPEN:
            self.circuit.success_count += 1

            if self.circuit.success_count >= self.half_open_success_threshold:
                # Enough successes → Close circuit
                self.circuit.state = CircuitState.CLOSED
                self.circuit.failure_count = 0
                workflow.logger.info("✓ Circuit breaker CLOSED")

        elif self.circuit.state == CircuitState.CLOSED:
            # Reset failure count
            self.circuit.failure_count = 0

    async def on_failure(self) -> None:
        """Handle failed call"""
        self.circuit.failure_count += 1
        self.circuit.last_failure_time = workflow.now()

        if self.circuit.state == CircuitState.HALF_OPEN:
            # Failure in half-open → Reopen
            self.circuit.state = CircuitState.OPEN
            workflow.logger.warning("⚠ Circuit breaker reopened due to failure")

        elif self.circuit.failure_count >= self.failure_threshold:
            # Too many failures → Open circuit
            self.circuit.state = CircuitState.OPEN
            workflow.logger.warning(
                f"⚠ Circuit breaker OPENED after {self.circuit.failure_count} failures"
            )
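
Die oben verwendete Activity external_service_call ist im Listing nicht definiert. Eine minimale Skizze, bei der der eigentliche Service-Aufruf nur simuliert wird (Annahme, nicht Teil der Kapitel-Beispiele):

from temporalio import activity
from temporalio.exceptions import ApplicationError

@activity.defn
async def external_service_call(request: str) -> str:
    """Simulierter Aufruf eines externen Services (rein illustrativ)."""
    # Hier stünde der echte HTTP-/RPC-Call; Fehlerfälle werden nur simuliert
    if "fail" in request:
        raise ApplicationError(f"External service error for request: {request}")

    activity.logger.info(f"External service call succeeded: {request}")
    return f"response_for_{request}"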

Circuit Breaker State Machine:

stateDiagram-v2
    [*] --> Closed: Initial State

    Closed --> Open: Failure threshold reached
    Closed --> Closed: Success (reset counter)

    Open --> HalfOpen: Timeout elapsed
    Open --> Open: Requests blocked

    HalfOpen --> Closed: Success threshold reached
    HalfOpen --> Open: Any failure

    note right of Closed
        Normal operation
        Track failures
    end note

    note right of Open
        Block all requests
        Wait for timeout
    end note

    note right of HalfOpen
        Test recovery
        Limited requests
    end note

7.5.3 Dead Letter Queue Pattern

Das DLQ Pattern routet dauerhaft fehlschlagende Items in eine separate Queue für spätere manuelle Verarbeitung.

@dataclass
class ProcessingItem:
    id: str
    data: str
    retry_count: int = 0
    errors: list[str] = field(default_factory=list)

@workflow.defn
class DLQWorkflow:
    """Workflow mit Dead Letter Queue Pattern"""

    def __init__(self) -> None:
        self.max_retries = 3
        self.dlq_items: List[ProcessingItem] = []

    @workflow.run
    async def run(self, items: list[ProcessingItem]) -> dict:
        successful = []
        failed = []

        for item in items:
            try:
                result = await self.process_with_dlq(item)
                successful.append({"id": item.id, "result": result})
            except ApplicationError as e:
                workflow.logger.error(f"Item {item.id} sent to DLQ: {e}")
                failed.append(item.id)

        # Send DLQ items to persistent storage
        if self.dlq_items:
            await self.send_to_dlq(self.dlq_items)

        return {
            "successful": len(successful),
            "failed": len(failed),
            "dlq_count": len(self.dlq_items),
            "results": successful
        }

    async def process_with_dlq(self, item: ProcessingItem) -> str:
        """Process mit DLQ fallback"""
        while item.retry_count < self.max_retries:
            try:
                # Attempt processing
                result = await workflow.execute_activity(
                    process_item,
                    args=[item.data],
                    start_to_close_timeout=timedelta(seconds=30),
                    retry_policy=RetryPolicy(maximum_attempts=1)
                )
                return result

            except ActivityError as e:
                item.retry_count += 1
                item.errors.append(str(e))

                if item.retry_count < self.max_retries:
                    # Exponential backoff
                    wait_time = 2 ** item.retry_count
                    await asyncio.sleep(wait_time)  # Sekunden
                    workflow.logger.warning(
                        f"Retrying item {item.id} (attempt {item.retry_count})"
                    )
                else:
                    # Max retries → DLQ
                    workflow.logger.error(
                        f"Item {item.id} failed {item.retry_count} times - sending to DLQ"
                    )
                    self.dlq_items.append(item)

                    raise ApplicationError(
                        f"Item {item.id} sent to DLQ after {item.retry_count} failures",
                        type="MaxRetriesExceeded",
                        details=[{
                            "item_id": item.id,
                            "retry_count": item.retry_count,
                            "errors": item.errors
                        }]
                    )

        raise ApplicationError("Unexpected state")

    async def send_to_dlq(self, items: List[ProcessingItem]) -> None:
        """Send items to Dead Letter Queue"""
        await workflow.execute_activity(
            write_to_dlq,
            args=[items],
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=RetryPolicy(
                maximum_attempts=5,  # DLQ writes must be reliable!
                initial_interval=timedelta(seconds=5)
            )
        )
        workflow.logger.info(f"✓ Sent {len(items)} items to DLQ")

@activity.defn
async def write_to_dlq(items: List[ProcessingItem]) -> None:
    """Write failed items to DLQ storage"""
    for item in items:
        activity.logger.error(
            f"DLQ item: {item.id}",
            extra={
                "retry_count": item.retry_count,
                "errors": item.errors,
                "data": item.data
            }
        )
        # Write to database, SQS, file, etc.
        await dlq_storage.write(item)
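
Die oben referenzierten Bausteine process_item und dlq_storage sind im Listing nicht enthalten. Eine minimale Skizze mit einem Datei-basierten DLQ-Storage als Annahme (rein illustrativ):

import json
from temporalio import activity

@activity.defn
async def process_item(data: str) -> str:
    """Simulierte Verarbeitung - schlägt für bestimmte Payloads bewusst fehl."""
    if "poison" in data:
        raise ValueError(f"Cannot process item: {data}")
    return f"processed_{data}"

class FileDlqStorage:
    """Sehr einfacher DLQ-Storage: hängt fehlgeschlagene Items an eine JSON-Lines-Datei an."""

    def __init__(self, path: str = "dlq.jsonl") -> None:
        self._path = path

    async def write(self, item: ProcessingItem) -> None:
        with open(self._path, "a") as f:
            record = {
                "id": item.id,
                "data": item.data,
                "retry_count": item.retry_count,
                "errors": item.errors,
            }
            f.write(json.dumps(record) + "\n")

dlq_storage = FileDlqStorage()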

7.6 Zusammenfassung

Kernkonzepte Error Handling:

  1. Activity vs Workflow Errors: Activities haben Default Retries, Workflows nicht
  2. Exception-Hierarchie: ApplicationError für bewusste Fehler, ActivityError als Wrapper
  3. Retry Policies: Deklarative Konfiguration mit Exponential Backoff
  4. Timeouts: 4 Activity-Timeouts, 3 Workflow-Timeouts (siehe Skizze unten)
  5. Advanced Patterns: SAGA für Compensations, Circuit Breaker für Cascades, DLQ für Persistent Failures
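
Zu Punkt 4 eine kurze Skizze, wie diese Timeouts gesetzt werden (Werte und die Activity process_payment sind hier nur angenommen):

from datetime import timedelta
from temporalio import workflow
from temporalio.client import Client

@workflow.defn
class TimeoutDemoWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Die vier Activity-Timeouts, alle explizit gesetzt
        return await workflow.execute_activity(
            process_payment,  # Activity aus den Kapitel-Beispielen (Annahme)
            args=[order_id],
            schedule_to_start_timeout=timedelta(minutes=1),   # max. Wartezeit in der Task Queue
            start_to_close_timeout=timedelta(seconds=30),     # Dauer eines einzelnen Versuchs
            schedule_to_close_timeout=timedelta(minutes=5),   # Gesamtdauer inkl. aller Retries
            heartbeat_timeout=timedelta(seconds=10),          # max. Abstand zwischen Heartbeats
        )

async def start_with_workflow_timeouts(client: Client) -> None:
    # Die drei Workflow-Timeouts werden beim Start gesetzt
    await client.start_workflow(
        TimeoutDemoWorkflow.run,
        "order-123",
        id="timeout-demo",
        task_queue="orders",
        execution_timeout=timedelta(days=7),   # gesamte Execution inkl. Retries
        run_timeout=timedelta(days=1),         # ein einzelner Workflow Run
        task_timeout=timedelta(seconds=10),    # ein einzelner Workflow Task
    )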

Best Practices Checkliste:

  • ✅ Start-To-Close Timeout immer setzen
  • ✅ Non-Retryable Errors explizit markieren
  • ✅ Idempotenz für alle Activities implementieren
  • ✅ SAGA Pattern für Distributed Transactions
  • ✅ Heartbeats für Long-Running Activities
  • ✅ Circuit Breaker bei externen Services
  • ✅ DLQ für persistierende Failures
  • ✅ Ausführliches Error Logging mit Context
  • ✅ Replay Tests für Non-Determinism
  • ✅ Monitoring und Alerting

Im nächsten Kapitel (Kapitel 8) werden wir uns mit Workflow Evolution und Versioning beschäftigen - wie Sie Workflows sicher ändern können während sie laufen.


⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 8: Workflow Evolution und Versioning

Code-Beispiele für dieses Kapitel: examples/part-03/chapter-07/

Kapitel 8: Workflow Evolution und Versioning

Einleitung

Eine der größten Herausforderungen in verteilten Systemen ist die Evolution von langlebigem Code. Während traditionelle Web-Services einfach neu deployed werden können, laufen Temporal Workflows oft über Tage, Wochen, Monate oder sogar Jahre. Was passiert, wenn Sie den Code ändern müssen, während tausende Workflows noch laufen?

Temporal löst dieses Problem durch ein ausgeklügeltes Versioning-System, das Determinismus erhält während gleichzeitig Code-Evolution ermöglicht wird. Ohne Versioning würden Code-Änderungen laufende Workflows brechen. Mit Versioning können Sie sicher deployen, Features hinzufügen und Bugs fixen – ohne existierende Executions zu gefährden.

Das Grundproblem

Scenario: Sie haben 10,000 laufende Order-Workflows. Jeder läuft 30 Tage. Sie müssen einen zusätzlichen Fraud-Check hinzufügen.

Ohne Versioning:

# Alter Code
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)
        return Result(payment=payment)

# Neuer Code - deployed auf laufende Workflows
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)
        # NEU: Fraud Check hinzugefügt
        fraud = await workflow.execute_activity(check_fraud, ...)  # ❌ BREAKS REPLAY!
        return Result(payment=payment)

Problem: Wenn ein alter Workflow replayed wird, erwartet das System die gleiche Befehlsfolge. Die neue Activity check_fraud existiert aber nicht in der History → Non-Determinism Error.

Mit Versioning:

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        # Versioning: Neue Workflows nutzen neuen Code, alte den alten
        if workflow.patched("add-fraud-check"):
            fraud = await workflow.execute_activity(check_fraud, ...)  # ✅ SAFE!

        return Result(payment=payment)

Jetzt können beide Versionen parallel laufen!

Lernziele

Nach diesem Kapitel können Sie:

  • Verstehen warum Determinismus Versioning erforderlich macht
  • Die Patching API verwenden (workflow.patched())
  • Worker Versioning mit Build IDs implementieren
  • Sichere vs. unsichere Code-Änderungen identifizieren
  • Replay Tests schreiben
  • Migrations-Patterns für Breaking Changes anwenden
  • Version Sprawl vermeiden
  • Workflows sicher über Jahre hinweg evolutionieren

8.1 Versioning Fundamentals

8.1.1 Determinismus und Replay

Was ist Determinismus?

Ein Workflow ist deterministisch, wenn jede Execution bei gleichem Input die gleichen Commands in der gleichen Reihenfolge produziert. Diese Eigenschaft ermöglicht:

  • Workflow Replay nach Worker Crashes
  • Lange schlafende Workflows (Monate/Jahre)
  • Zuverlässige Workflow-Relocation zwischen Workers
  • State-Rekonstruktion aus Event History

Wie Replay funktioniert:

sequenceDiagram
    participant History as Event History
    participant Worker as Worker
    participant Code as Workflow Code

    Note over History: Workflow executed<br/>Commands recorded

    Worker->>History: Fetch Events
    History-->>Worker: Return Event 1-N

    Worker->>Code: Execute run()
    Code->>Code: Generate Commands

    Worker->>Worker: Compare:<br/>Commands vs Events

    alt Commands match Events
        Worker->>Worker: ✓ Replay successful
    else Commands differ
        Worker->>Worker: ✗ Non-Determinism Error
    end

Kritischer Punkt: Das System führt nicht die Commands aus der History erneut aus, sondern verwendet die aufgezeichneten Results um State zu rekonstruieren. Wenn der neue Code eine andere Befehlsfolge produziert → Fehler.

Beispiel: Non-Determinism Error

# Version 1 (deployed, 1000 workflows running)
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        # Command 1: ScheduleActivityTask (process_payment)
        payment = await workflow.execute_activity(process_payment, ...)
        # Command 2: CompleteWorkflowExecution
        return Result(payment=payment)

# Version 2 (deployed while v1 workflows still running)
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        # Command 1: ScheduleActivityTask (validate_order) ← NEU!
        await workflow.execute_activity(validate_order, ...)
        # Command 2: ScheduleActivityTask (process_payment)
        payment = await workflow.execute_activity(process_payment, ...)
        # Command 3: CompleteWorkflowExecution
        return Result(payment=payment)

Was passiert beim Replay:

Event History (v1):
  Event 1: WorkflowExecutionStarted
  Event 2: ActivityTaskScheduled (process_payment)
  Event 3: ActivityTaskCompleted
  Event 4: WorkflowExecutionCompleted

Replay mit v2 Code:
  ✗ Erwartet: ScheduleActivityTask (validate_order)
  ✗ Gefunden: ActivityTaskScheduled (process_payment)

  → NondeterminismError!

8.1.2 Warum Versioning komplex ist

Langlebigkeit von Workflows:

gantt
    title Workflow Lifetime vs Code Changes
    dateFormat YYYY-MM-DD
    axisFormat %b %d

    section Workflow 1
    Running :w1, 2025-01-01, 30d

    section Workflow 2
    Running :w2, 2025-01-15, 30d

    section Code Deployments
    Version 1.0 :milestone, v1, 2025-01-01, 0d
    Version 1.1 :milestone, v2, 2025-01-10, 0d
    Version 1.2 :milestone, v3, 2025-01-20, 0d
    Version 2.0 :milestone, v4, 2025-01-30, 0d

Workflow 1 durchlebt 4 Code-Versionen während seiner Laufzeit!

Herausforderungen:

  1. Backwards Compatibility: Neue Code-Version muss alte Workflows replayen können
  2. Version Sprawl: Zu viele Versionen → Code-Komplexität
  3. Testing: Replay-Tests für alle Versionen
  4. Cleanup: Wann können alte Versionen entfernt werden?
  5. Documentation: Welche Version macht was?

8.1.3 Drei Versioning-Ansätze

Temporal bietet drei Hauptstrategien:

1. Patching API (Code-Level Versioning)

if workflow.patched("my-change"):
    # Neuer Code-Pfad
    await new_implementation()
else:
    # Alter Code-Pfad
    await old_implementation()

Vorteile:

  • Granulare Kontrolle
  • Beide Pfade im gleichen Code
  • Funktioniert sofort

Nachteile:

  • Code-Komplexität wächst
  • Manuelle Verwaltung
  • Version Sprawl bei vielen Changes

2. Worker Versioning (Infrastructure-Level)

worker = Worker(
    client,
    task_queue="orders",
    workflows=[OrderWorkflow],
    deployment_config=WorkerDeploymentConfig(
        deployment_name="order-service",
        build_id="v2.0.0",  # Version identifier
    )
)

Vorteile:

  • Saubere Code-Trennung
  • Automatisches Routing
  • Gradual Rollout möglich

Nachteile:

  • Infrastruktur-Overhead (mehrere Worker-Pools)
  • Noch in Public Preview
  • Komplexere Deployments

3. Workflow-Name Versioning (Cutover)

@workflow.defn(name="ProcessOrder_v2")
class ProcessOrderWorkflowV2:
    # Völlig neue Implementation
    pass

# Alter Workflow bleibt für Kompatibilität
@workflow.defn(name="ProcessOrder")
class ProcessOrderWorkflowV1:
    # Legacy code
    pass

Vorteile:

  • Klare Trennung
  • Einfach zu verstehen
  • Keine Patching-Logik

Nachteile:

  • Code-Duplizierung
  • Kann laufende Workflows nicht versionieren
  • Client-Code muss updaten

Wann welchen Ansatz?

Scenario                              | Empfohlener Ansatz
--------------------------------------|---------------------------
Kleine Änderungen, wenige Versionen   | Patching API
Häufige Updates, viele Versionen      | Worker Versioning
Komplettes Redesign                   | Workflow-Name Versioning
< 10 laufende Workflows               | Workflow-Name Versioning
Breaking Changes in Datenstruktur     | Worker Versioning

8.2 Patching API

Der Python SDK nutzt workflow.patched() für Code-Level Versioning.

8.2.1 Grundlagen

API:

from temporalio import workflow

# Signatur: workflow.patched(patch_id: str) -> bool
if workflow.patched("my-patch-id"):
    # Neuer Code-Pfad
    pass
else:
    # Alter Code-Pfad
    pass

Verhalten:

Situation               | Rückgabewert | Reason
------------------------|--------------|--------------------------------------------
Erste Execution (neu)   | True         | Marker wird hinzugefügt, neuer Code läuft
Replay MIT Marker       | True         | Marker in History, neuer Code läuft
Replay OHNE Marker      | False        | Alter Workflow, alter Code läuft

Beispiel:

from temporalio import workflow
from datetime import timedelta

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        # Payment verarbeiten
        payment = await workflow.execute_activity(
            process_payment,
            args=[order],
            start_to_close_timeout=timedelta(minutes=5),
        )

        # Patch: Fraud Check hinzufügen
        if workflow.patched("add-fraud-check-v1"):
            # Neuer Code-Pfad (nach Deployment)
            workflow.logger.info("Running fraud check (new version)")
            fraud_result = await workflow.execute_activity(
                check_fraud,
                args=[order, payment],
                start_to_close_timeout=timedelta(minutes=2),
            )

            if not fraud_result.is_safe:
                raise FraudDetectedError(f"Fraud detected: {fraud_result.reason}")
        else:
            # Alter Code-Pfad (für Replay alter Workflows)
            workflow.logger.info("Skipping fraud check (old version)")

        return Result(payment=payment)

Was passiert:

flowchart TD
    A[Workflow run] --> B{patched aufgerufen}
    B --> C{Marker in History?}

    C -->|Marker vorhanden| D[Return True<br/>Neuer Code]
    C -->|Kein Marker| E{Erste Execution?}

    E -->|Ja| F[Marker hinzufügen<br/>Return True<br/>Neuer Code]
    E -->|Nein = Replay| G[Return False<br/>Alter Code]

    D --> H[Workflow fährt fort]
    F --> H
    G --> H

    style D fill:#90EE90
    style F fill:#90EE90
    style G fill:#FFB6C1

8.2.2 Drei-Schritte-Prozess

Schritt 1: Patch einführen

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        # STEP 1: Patch mit if/else
        if workflow.patched("add-fraud-check-v1"):
            fraud = await workflow.execute_activity(check_fraud, ...)
        # Else-Block leer = alter Code macht nichts

        return Result(payment=payment)

Deployment: Alle neuen Workflows nutzen Fraud Check, alte nicht.

Schritt 2: Patch deprecaten

Nachdem alle alten Workflows abgeschlossen sind:

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        # STEP 2: deprecate_patch() + nur neuer Code
        workflow.deprecate_patch("add-fraud-check-v1")
        fraud = await workflow.execute_activity(check_fraud, ...)

        return Result(payment=payment)

Zweck von deprecate_patch():

  • Fügt Marker hinzu OHNE Replay zu brechen
  • Erlaubt Entfernung des if/else
  • Brücke zwischen Patching und Clean Code

Schritt 3: Patch entfernen

Nachdem die Retention Period abgelaufen ist:

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        # STEP 3: Clean code - kein Versioning mehr
        fraud = await workflow.execute_activity(check_fraud, ...)

        return Result(payment=payment)

Timeline:

Tag 0:   Deploy Patch (Step 1)
         - Neue Workflows: Fraud Check
         - Alte Workflows: Kein Fraud Check

Tag 30:  Alle alten Workflows abgeschlossen
         - Verify: Keine laufenden Workflows ohne Patch (siehe Skizze unten)

Tag 31:  Deploy deprecate_patch (Step 2)
         - Code hat nur noch neuen Pfad
         - Kompatibel mit alter History

Tag 61:  Retention Period abgelaufen
         - Alte Histories gelöscht

Tag 68:  Remove Patch (Step 3 + Safety Margin)
         - Clean Code ohne Versioning Calls
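
Für den Verify-Schritt an Tag 30 lässt sich per Visibility Query prüfen, ob noch alte Workflows laufen. Eine kleine Skizze mit dem Python-Client (Workflow-Type, Adresse und Stichtag sind Annahmen):

from temporalio.client import Client

async def count_running_old_workflows() -> int:
    """Zählt laufende OrderWorkflows, die vor dem Patch-Deployment gestartet wurden."""
    client = await Client.connect("localhost:7233")

    query = (
        'WorkflowType="OrderWorkflow" '
        'AND ExecutionStatus="Running" '
        'AND StartTime < "2025-01-01T00:00:00Z"'
    )

    count = 0
    async for _ in client.list_workflows(query):
        count += 1

    return count  # 0 bedeutet: der Patch kann deprecated werden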

8.2.3 Mehrere Patches / Nested Patches

Pattern für multiple Versionen:

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        # Version 3 (neueste)
        if workflow.patched("add-notifications-v3"):
            await self._send_notifications(order, payment)
            fraud = await workflow.execute_activity(check_fraud_v2, ...)

        # Version 2
        elif workflow.patched("add-fraud-check-v2"):
            fraud = await workflow.execute_activity(check_fraud_v2, ...)

        # Version 1 (älteste)
        else:
            # Original code - kein Fraud Check
            pass

        return Result(payment=payment)

Wichtige Regel: Neuesten Code ZUERST (top of if-block).

Warum? Frische Executions sollen immer die neueste Version nutzen. Wenn ältere Versionen zuerst geprüft werden, könnte eine neue Execution fälschlicherweise einen älteren Pfad nehmen.

Beispiel - FALSCH:

# ✗ FALSCH: Alte Version zuerst
if workflow.patched("v1"):
    ...  # Version 1 code
elif workflow.patched("v2"):
    ...  # Version 2 code - neue Executions könnten v1 nehmen!

Beispiel - RICHTIG:

# ✓ RICHTIG: Neue Version zuerst
if workflow.patched("v2"):
    ...  # Version 2 code - neue Executions nehmen diesen
elif workflow.patched("v1"):
    ...  # Version 1 code
else:
    ...  # Version 0 (original)

8.2.4 Best Practices für Patch IDs

Gute Naming Conventions:

# ✓ GUT: Beschreibend + Versionnummer
workflow.patched("add-fraud-check-v1")
workflow.patched("change-payment-params-v2")
workflow.patched("remove-legacy-validation-v1")

# ✓ GUT: Datum für Tracking
workflow.patched("refactor-2025-01-15")

# ✓ GUT: Ticket-Referenz
workflow.patched("JIRA-1234-add-validation")

# ✗ SCHLECHT: Nicht beschreibend
workflow.patched("patch1")
workflow.patched("fix")
workflow.patched("update")

# ✗ SCHLECHT: Keine Version Info
workflow.patched("add-fraud-check")  # Was wenn wir v2 brauchen?

Dokumentation im Code:

@workflow.defn
class OrderWorkflow:
    """
    Order Processing Workflow.

    Versioning History:
    - v1 (2024-01-01): Initial implementation
    - v2 (2024-06-15): Added fraud check
      Patch: "add-fraud-check-v1"
      Deployed: 2024-06-15
      Deprecated: 2024-08-01
      Removed: 2024-10-01
    - v3 (2024-09-01): Multi-currency support
      Patch: "multi-currency-v1"
      Deployed: 2024-09-01
      Status: ACTIVE
    """

    @workflow.run
    async def run(self, order: Order) -> Result:
        # Patch: add-fraud-check-v1
        # Added: 2024-06-15
        # Status: REMOVED (2024-10-01)
        # All workflows now have fraud check
        fraud = await workflow.execute_activity(check_fraud, ...)

        # Patch: multi-currency-v1
        # Added: 2024-09-01
        # Status: ACTIVE
        if workflow.patched("multi-currency-v1"):
            currency = order.currency
        else:
            currency = "USD"  # Default für alte Workflows

        # ... rest of workflow

8.3 Sichere vs. Unsichere Code-Änderungen

8.3.1 Was kann OHNE Versioning geändert werden?

Kategorie 1: Activity Implementation

# ✓ SICHER: Activity-Logik ändern
@activity.defn
async def process_payment(payment: Payment) -> Receipt:
    # ALLE Änderungen hier sind safe:
    # - Database Schema ändern
    # - API Endpoints ändern
    # - Error Handling anpassen
    # - Business Logic updaten
    # - Performance optimieren
    pass

Warum sicher? Activities werden außerhalb des Replay-Mechanismus ausgeführt. Nur das Result wird in der History gespeichert, nicht die Logic.

Kategorie 2: Workflow Logging

@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        workflow.logger.info("Starting")  # ✓ SAFE to add/remove/change
        result = await workflow.execute_activity(...)
        workflow.logger.debug(f"Result: {result}")  # ✓ SAFE

Warum sicher? Logging erzeugt keine Events in der History.

Kategorie 3: Query Handler (read-only)

@workflow.defn
class MyWorkflow:
    @workflow.query
    def get_status(self) -> str:  # ✓ SAFE to add
        return self._status

    @workflow.query
    def get_progress(self) -> dict:  # ✓ SAFE to modify
        return {"processed": self._processed, "total": self._total}

Kategorie 4: Signal Handler hinzufügen

@workflow.defn
class MyWorkflow:
    @workflow.signal
    async def new_signal(self, data: str) -> None:  # ✓ SAFE wenn noch nie gesendet
        self._data = data

Wichtig: Nur safe wenn das Signal noch nie gesendet wurde!

Kategorie 5: Dataclass Fields mit Defaults

@dataclass
class WorkflowInput:
    name: str
    age: int
    email: str = ""      # ✓ SAFE to add with default
    phone: str = ""      # ✓ SAFE to add with default

Forward Compatible: Alte Workflows können die neue Dataclass-Version weiterhin deserialisieren.
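
Eine kurze Illustration dazu (vereinfachte Annahme: direkter Konstruktor-Aufruf statt des Temporal Data Converters, das Prinzip ist aber dasselbe - fehlende Felder werden über Defaults aufgefüllt):

from dataclasses import dataclass

@dataclass
class WorkflowInput:
    name: str
    age: int
    email: str = ""   # neu, mit Default
    phone: str = ""   # neu, mit Default

# Payload, wie ihn ein alter Workflow ohne die neuen Felder erzeugt hat
old_payload = {"name": "Alice", "age": 30}

# Deserialisierung in die neue Version funktioniert - die Defaults greifen
restored = WorkflowInput(**old_payload)
assert restored.email == "" and restored.phone == ""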

8.3.2 Was BRICHT Determinismus?

Kategorie 1: Activity Calls hinzufügen/entfernen

# VORHER
@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        result1 = await workflow.execute_activity(activity1, ...)
        result2 = await workflow.execute_activity(activity2, ...)

# NACHHER - ❌ BREAKS DETERMINISM
@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        result1 = await workflow.execute_activity(activity1, ...)
        # activity2 entfernt - BREAKS REPLAY
        result3 = await workflow.execute_activity(activity3, ...)  # Neu - BREAKS REPLAY

Warum broken? Event History erwartet ScheduleActivityTask für activity2, bekommt aber activity3.

Kategorie 2: Activity Reihenfolge ändern

# VORHER
result1 = await workflow.execute_activity(activity1, ...)
result2 = await workflow.execute_activity(activity2, ...)

# NACHHER - ❌ BREAKS DETERMINISM
result2 = await workflow.execute_activity(activity2, ...)  # Reihenfolge getauscht
result1 = await workflow.execute_activity(activity1, ...)

Kategorie 3: Activity Parameter ändern

# VORHER
await workflow.execute_activity(
    process_order,
    args=[order_id, customer_id],  # 2 Parameter
    ...
)

# NACHHER - ❌ BREAKS DETERMINISM
await workflow.execute_activity(
    process_order,
    args=[order_id, customer_id, payment_method],  # 3 Parameter
    ...
)

Kategorie 4: Conditional Logic ändern

# VORHER
if amount > 100:
    await workflow.execute_activity(large_order, ...)
else:
    await workflow.execute_activity(small_order, ...)

# NACHHER - ❌ BREAKS DETERMINISM
if amount > 500:  # Threshold geändert
    await workflow.execute_activity(large_order, ...)

Bei Replay: Ein Workflow mit amount=300 nahm vorher den large_order Pfad, jetzt nimmt er small_order → Non-Determinism.

Kategorie 5: Sleep/Timer ändern

# VORHER
await asyncio.sleep(300)  # 5 Minuten

# NACHHER - ❌ BREAKS DETERMINISM
await asyncio.sleep(600)  # 10 Minuten - anderer Timer
# Oder Timer entfernen

Kategorie 6: Non-Deterministic Functions

# ❌ FALSCH - Non-Deterministic
import random
import datetime
import uuid

@workflow.defn
class BadWorkflow:
    @workflow.run
    async def run(self) -> None:
        random_val = random.randint(1, 100)  # ❌ WRONG
        current_time = datetime.datetime.now()  # ❌ WRONG
        unique_id = str(uuid.uuid4())  # ❌ WRONG

Beim Replay: Unterschiedliche Werte → unterschiedliche Commands → Non-Determinism.

8.3.3 Deterministische Alternativen

Python SDK Deterministic APIs:

from temporalio import workflow

@workflow.defn
class DeterministicWorkflow:
    @workflow.run
    async def run(self) -> None:
        # ✓ RICHTIG - Deterministic time
        current_time = workflow.now()
        timestamp_ns = workflow.time_ns()

        # ✓ RICHTIG - Deterministic random
        rng = workflow.random()
        random_number = rng.randint(1, 100)
        random_float = rng.random()

        # ✓ RICHTIG - UUID via Activity
        unique_id = await workflow.execute_activity(
            generate_uuid,
            schedule_to_close_timeout=timedelta(seconds=5),
        )

        # ✓ RICHTIG - Deterministic logging
        workflow.logger.info(f"Processing at {current_time}")

@activity.defn
async def generate_uuid() -> str:
    """Activities können non-deterministisch sein"""
    return str(uuid.uuid4())

Warum funktioniert das?

  • workflow.now(): Gibt die deterministische Workflow-Zeit aus der Event History zurück (bei Replay identisch)
  • workflow.random(): Seeded RNG basierend auf History
  • Activities: Results aus History, nicht neu ausgeführt

Decision Matrix:

flowchart TD
    A[Code-Änderung geplant] --> B{Ändert Commands<br/>oder deren Reihenfolge?}

    B -->|Nein| C[✓ SAFE<br/>Ohne Versioning]
    B -->|Ja| D{Nur Activity<br/>Implementation?}

    D -->|Ja| C
    D -->|Nein| E[❌ UNSAFE<br/>Versioning erforderlich]

    C --> F[Deploy direkt]
    E --> G[Patching API<br/>oder<br/>Worker Versioning]

8.3.4 Safe Change Pattern mit Versioning

Beispiel: Activity hinzufügen

# Schritt 1: Original Code
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)
        return Result(payment=payment)

# Schritt 2: Mit Patching neue Activity hinzufügen
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        # Patch: Neue Activity
        if workflow.patched("add-fraud-check-v1"):
            fraud = await workflow.execute_activity(check_fraud, ...)
            if not fraud.is_safe:
                raise FraudDetectedError()

        return Result(payment=payment)

# Schritt 3: Später deprecate
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        workflow.deprecate_patch("add-fraud-check-v1")
        fraud = await workflow.execute_activity(check_fraud, ...)
        if not fraud.is_safe:
            raise FraudDetectedError()

        return Result(payment=payment)

# Schritt 4: Schließlich clean code
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        payment = await workflow.execute_activity(process_payment, ...)

        # Clean code - Fraud Check ist Standard
        fraud = await workflow.execute_activity(check_fraud, ...)
        if not fraud.is_safe:
            raise FraudDetectedError()

        return Result(payment=payment)

8.4 Worker Versioning (Build IDs)

Worker Versioning ist der moderne Ansatz für Workflow-Versionierung (Public Preview, GA erwartet Q4 2025).

8.4.1 Konzepte

Build ID: Eindeutiger Identifier für eine Worker-Version

from temporalio.worker import Worker, WorkerDeploymentConfig

worker = Worker(
    client,
    task_queue="orders",
    workflows=[OrderWorkflow],
    activities=[process_payment, check_fraud],
    deployment_config=WorkerDeploymentConfig(
        deployment_name="order-service",
        build_id="v1.5.2",  # Semantic versioning
    )
)

Workflow Pinning: Workflow bleibt auf ursprünglicher Worker-Version

Vorteile:

  • Eliminiert Code-Level Patching
  • Workflows können Breaking Changes enthalten
  • Einfacheres Code-Management

Nachteile:

  • Muss mehrere Worker-Pools laufen lassen
  • Alte Versionen blockieren bis alle Workflows complete

8.4.2 Deployment Strategies

Blue-Green Deployment:

graph LR
    A[Traffic] --> B{Router}
    B -->|100%| C[Blue Workers<br/>v1.0.0]
    B -.->|0%| D[Green Workers<br/>v2.0.0]

    style C fill:#87CEEB
    style D fill:#90EE90

Nach Cutover:

graph LR
    A[Traffic] --> B{Router}
    B -.->|0%| C[Blue Workers<br/>v1.0.0<br/>Draining]
    B -->|100%| D[Green Workers<br/>v2.0.0]

    style C fill:#FFB6C1
    style D fill:#90EE90

Eigenschaften:

  • Zwei simultane Versionen
  • Klarer Cutover-Point
  • Instant Rollback möglich
  • Einfach zu verstehen

Rainbow Deployment:

graph TD
    A[Traffic] --> B{Router}
    B -->|Pinned| C[v1.0.0<br/>Draining]
    B -->|5%| D[v1.5.0<br/>Active]
    B -->|90%| E[v2.0.0<br/>Current]
    B -->|5%| F[v2.1.0<br/>Ramping]

    style C fill:#FFB6C1
    style D fill:#FFD700
    style E fill:#90EE90
    style F fill:#87CEEB

Eigenschaften:

  • Viele simultane Versionen
  • Graduelle Migration
  • Workflow Pinning optimal
  • Komplexer aber flexibler

8.4.3 Gradual Rollout

Ramp Percentages:

# Start: 1% Traffic zu neuer Version
temporal task-queue versioning insert-assignment-rule \
  --task-queue orders \
  --build-id v2.0.0 \
  --percentage 1

# Monitoring...
# Error rate OK? → Increase

# 5% Traffic
temporal task-queue versioning insert-assignment-rule \
  --task-queue orders \
  --build-id v2.0.0 \
  --percentage 5

# 25% Traffic
temporal task-queue versioning insert-assignment-rule \
  --task-queue orders \
  --build-id v2.0.0 \
  --percentage 25

# 100% Traffic
temporal task-queue versioning insert-assignment-rule \
  --task-queue orders \
  --build-id v2.0.0 \
  --percentage 100

Ablauf:

1% → Monitor 1 day → 5% → Monitor 1 day → 25% → Monitor → 100%

Was wird monitored:

  • Error Rate
  • Latency
  • Completion Rate
  • Activity Failures

8.4.4 Version Lifecycle States

stateDiagram-v2
    [*] --> Inactive: Deploy new version
    Inactive --> Active: Set as target
    Active --> Ramping: Set ramp %
    Ramping --> Current: Set to 100%
    Current --> Draining: New version deployed
    Draining --> Drained: All workflows complete
    Drained --> [*]: Decommission workers

    note right of Inactive
        Version existiert
        Kein Traffic
    end note

    note right of Ramping
        X% neuer Traffic
        Testing rollout
    end note

    note right of Current
        100% neuer Traffic
        Production version
    end note

    note right of Draining
        Nur pinned workflows
        Keine neuen workflows
    end note

8.4.5 Python Worker Configuration

import asyncio

from temporalio.client import Client
from temporalio.common import WorkerDeploymentVersion
from temporalio.worker import Worker, WorkerDeploymentConfig

async def create_worker(build_id: str):
    """Create worker with versioning"""
    client = await Client.connect("localhost:7233")

    worker = Worker(
        client,
        task_queue="orders",
        workflows=[OrderWorkflowV2],  # Neue Version
        activities=[process_payment_v2, check_fraud_v2],
        deployment_config=WorkerDeploymentConfig(
            deployment_name="order-service",
            build_id=build_id,
        ),
        # Optional: Max concurrent workflows
        max_concurrent_workflow_tasks=100,
    )

    return worker

# Deploy v1 workers
worker_v1 = await create_worker("v1.5.2")

# Deploy v2 workers (parallel)
worker_v2 = await create_worker("v2.0.0")

# Beide Workers laufen parallel
await asyncio.gather(
    worker_v1.run(),
    worker_v2.run(),
)

8.5 Testing Versioned Workflows

8.5.1 Replay Testing

Zweck: Verifizieren dass neuer Code kompatibel mit existierenden Workflow Histories ist.

Basic Replay Test:

import json
import pytest
from temporalio.client import WorkflowHistory
from temporalio.worker import Replayer

from workflows import OrderWorkflow

@pytest.mark.asyncio
async def test_replay_workflow_history():
    """Test dass neuer Code alte Histories replyen kann"""

    # History von Production laden
    with open("tests/histories/order_workflow_history.json") as f:
        history_json = json.load(f)

    # Replayer mit NEUEM Workflow-Code
    replayer = Replayer(workflows=[OrderWorkflow])

    # Replay - wirft Exception bei Non-Determinism
    await replayer.replay_workflow(
        WorkflowHistory.from_json("test-workflow-id", history_json)
    )

    # Test passed = Neuer Code ist kompatibel!

History von Production fetchen:

# CLI: History als JSON exportieren
temporal workflow show \
  --workflow-id order-12345 \
  --namespace production \
  --output json > workflow_history.json

Programmatisch:

from temporalio.client import Client

async def fetch_workflow_history(workflow_id: str) -> dict:
    """Fetch history für Replay Testing"""
    client = await Client.connect("localhost:7233")

    handle = client.get_workflow_handle(workflow_id)
    history = await handle.fetch_history()

    return history.to_json()

8.5.2 CI/CD Integration

GitHub Actions Workflow:

# .github/workflows/replay-test.yml
name: Temporal Replay Tests

on:
  pull_request:
    paths:
      - 'workflows/**'
      - 'activities/**'

jobs:
  replay-test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Download production histories
        env:
          TEMPORAL_ADDRESS: ${{ secrets.TEMPORAL_PROD_ADDRESS }}
          TEMPORAL_NAMESPACE: production
        run: |
          # Download recent workflow histories
          python scripts/download_histories.py \
            --workflow-type OrderWorkflow \
            --limit 50 \
            --output tests/histories/

      - name: Run replay tests
        run: |
          pytest tests/test_replay.py -v

      - name: Report non-determinism
        if: failure()
        run: |
          echo "❌ Non-determinism detected! Do not merge."
          exit 1

      - name: Report success
        if: success()
        run: echo "✅ All replay tests passed"

Replay Test Script:

# tests/test_replay.py
import json
import pytest
from pathlib import Path
from temporalio.worker import Replayer
from temporalio.client import WorkflowHistory

from workflows import OrderWorkflow, PaymentWorkflow

ALL_WORKFLOWS = [OrderWorkflow, PaymentWorkflow]

@pytest.mark.asyncio
async def test_replay_all_production_histories():
    """Test neuen Code gegen Production Histories"""

    histories_dir = Path("tests/histories")

    if not histories_dir.exists():
        pytest.skip("No histories to test")

    replayer = Replayer(workflows=ALL_WORKFLOWS)

    failed_replays = []

    for history_file in histories_dir.glob("*.json"):
        with open(history_file) as f:
            history_data = json.load(f)

        try:
            await replayer.replay_workflow(
                WorkflowHistory.from_json(
                    history_file.stem,
                    history_data
                )
            )
            print(f"✓ Successfully replayed {history_file.name}")

        except Exception as e:
            failed_replays.append({
                "file": history_file.name,
                "error": str(e)
            })
            print(f"✗ Failed to replay {history_file.name}: {e}")

    # Fail test wenn irgendein Replay fehlschlug
    if failed_replays:
        error_msg = "Non-determinism detected in:\n"
        for failure in failed_replays:
            error_msg += f"  - {failure['file']}: {failure['error']}\n"
        pytest.fail(error_msg)

Script zum History Download:

# scripts/download_histories.py
import asyncio
import json
import argparse
import os
from pathlib import Path
from temporalio.client import Client

async def download_histories(
    workflow_type: str,
    limit: int,
    output_dir: Path,
):
    """Download recent workflow histories für Testing"""

    client = await Client.connect(
        os.environ.get("TEMPORAL_ADDRESS", "localhost:7233"),
        namespace=os.environ.get("TEMPORAL_NAMESPACE", "default"),
    )

    # Query für laufende Workflows
    query = f'WorkflowType="{workflow_type}" AND ExecutionStatus="Running"'

    workflows = client.list_workflows(query=query)

    count = 0
    async for workflow in workflows:
        if count >= limit:
            break

        # Fetch history
        handle = client.get_workflow_handle(workflow.id)
        history = await handle.fetch_history()

        # Save to file
        output_file = output_dir / f"{workflow.id}.json"
        with open(output_file, "w") as f:
            json.dump(history.to_json(), f, indent=2)

        print(f"Downloaded: {workflow.id}")
        count += 1

    print(f"\nTotal downloaded: {count} histories")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--workflow-type", required=True)
    parser.add_argument("--limit", type=int, default=50)
    parser.add_argument("--output", required=True)

    args = parser.parse_args()

    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    asyncio.run(download_histories(
        args.workflow_type,
        args.limit,
        output_dir,
    ))

8.5.3 Testing Version Transitions

@pytest.mark.asyncio
async def test_patched_workflow_new_execution():
    """Test dass neue Workflows neuen Code-Pfad nutzen"""

    async with await WorkflowEnvironment.start_time_skipping() as env:
        async with Worker(
            env.client,
            task_queue="test-queue",
            workflows=[OrderWorkflow],
            activities=[process_payment, check_fraud],
        ):
            # Neue Workflow-Execution
            result = await env.client.execute_workflow(
                OrderWorkflow.run,
                order,
                id="test-new-workflow",
                task_queue="test-queue",
            )

            # Verify: Fraud check wurde ausgeführt
            assert result.fraud_checked is True

@pytest.mark.asyncio
async def test_patched_workflow_replay():
    """Test dass alte Workflows alten Code-Pfad nutzen"""

    # History VOR Patch laden
    with open("tests/pre_patch_history.json") as f:
        old_history = json.load(f)

    # Replay sollte erfolgreich sein
    replayer = Replayer(workflows=[OrderWorkflow])
    await replayer.replay_workflow(
        WorkflowHistory.from_json("old-workflow", old_history)
    )

    # Success = Alter Pfad wurde korrekt gefolgt

8.6 Migration Patterns

8.6.1 Multi-Step Backward-Compatible Migration

Scenario: Activity Parameter ändern

Step 1: Optional Fields

from dataclasses import dataclass
from typing import Optional

# Alte Struktur
@dataclass
class PaymentParams:
    order_id: str
    amount: float

# Step 1: Neue Fields optional hinzufügen
@dataclass
class PaymentParams:
    order_id: str
    amount: float
    payment_method: Optional[str] = None  # NEU, optional
    currency: Optional[str] = "USD"        # NEU, mit Default

Step 2: Activity handhabt beide

@activity.defn
async def process_payment(params: PaymentParams) -> Payment:
    """Handle alte und neue Parameter"""

    # Defaults für alte Calls
    payment_method = params.payment_method or "credit_card"
    currency = params.currency or "USD"

    # Process mit neuer Logic
    result = await payment_processor.process(
        order_id=params.order_id,
        amount=params.amount,
        method=payment_method,
        currency=currency,
    )

    return result

Step 3: Workflow nutzt neue Parameters (mit Patching)

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> Result:
        if workflow.patched("payment-params-v2"):
            # Neuer Code - nutzt neue Parameters
            payment = await workflow.execute_activity(
                process_payment,
                args=[PaymentParams(
                    order_id=order.id,
                    amount=order.total,
                    payment_method=order.payment_method,
                    currency=order.currency,
                )],
                schedule_to_close_timeout=timedelta(minutes=5),
            )
        else:
            # Alter Code - nutzt alte Parameters
            payment = await workflow.execute_activity(
                process_payment,
                args=[PaymentParams(
                    order_id=order.id,
                    amount=order.total,
                )],
                schedule_to_close_timeout=timedelta(minutes=5),
            )

        return Result(payment=payment)

Step 4: Nach Migration Fields required machen

# Nach allen alten Workflows complete
@dataclass
class PaymentParams:
    order_id: str
    amount: float
    payment_method: str  # Jetzt required
    currency: str = "USD"  # Required mit Default

8.6.2 Continue-As-New mit Versioning

Pattern: Long-running Entity Workflows die periodisch Version Updates bekommen.

@workflow.defn
class EntityWorkflow:
    """
    Entity Workflow läuft unbegrenzt mit Continue-As-New.
    Updates automatisch auf neue Versionen.
    """

    def __init__(self) -> None:
        self._state: dict = {}
        self._iteration = 0

    @workflow.run
    async def run(self, initial_state: dict) -> None:
        self._state = initial_state

        while True:
            # Process iteration
            await self._process_iteration()

            self._iteration += 1

            # Continue-As-New alle 100 Iterationen
            if (
                self._iteration >= 100
                or workflow.info().is_continue_as_new_suggested()
            ):
                workflow.logger.info(
                    f"Continue-As-New after {self._iteration} iterations"
                )

                # Continue-As-New picked automatisch neue Version auf!
                workflow.continue_as_new(self._state)

            await asyncio.sleep(60)

    async def _process_iteration(self):
        """Versioned iteration logic"""

        # Version 2: Validation hinzugefügt
        if workflow.patched("add-validation-v2"):
            await self._validate_state()

        # Core logic
        await workflow.execute_activity(
            process_entity_state,
            args=[self._state],
            schedule_to_close_timeout=timedelta(minutes=5),
        )

Vorteile:

  • Natural version upgrade points
  • History bleibt bounded
  • Keine Manual Migration nötig

8.7 Zusammenfassung

Kernkonzepte:

  1. Determinismus: Temporals Replay-Mechanismus erfordert, dass Workflows deterministisch sind
  2. Versioning erforderlich: Jede Code-Änderung die Commands ändert braucht Versioning
  3. Drei Ansätze: Patching API, Worker Versioning, Workflow-Name Versioning
  4. Safe Changes: Activity Implementation, Logging, Queries können ohne Versioning geändert werden
  5. Unsafe Changes: Activity Calls hinzufügen/entfernen, Reihenfolge ändern, Parameter ändern

Patching API Workflow:

1. workflow.patched("id") → if/else blocks
2. workflow.deprecate_patch("id") → nur neuer Code
3. Remove patch call → clean code

Best Practices:

  • ✅ Replay Tests in CI/CD
  • ✅ Production Histories regelmäßig testen
  • ✅ Max 3 active Patches pro Workflow
  • ✅ Dokumentation für jedes Patch
  • ✅ Cleanup Timeline planen
  • ✅ Monitoring für Version Adoption

Häufige Fehler:

  • ❌ Versioning vergessen
  • ❌ random.random() statt workflow.random()
  • ❌ Kein Replay Testing
  • ❌ Version Sprawl (zu viele Patches)
  • ❌ Alte Versionen nicht entfernen

Im nächsten Kapitel (Kapitel 9) werden wir Fortgeschrittene Resilienz-Patterns behandeln - komplexe Patterns für Production-Ready Systeme.


⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 9: Fortgeschrittene Resilienz-Patterns

Code-Beispiele für dieses Kapitel: examples/part-03/chapter-08/

Kapitel 9: Fortgeschrittene Resilienz-Patterns

Einleitung

In den vorherigen Kapiteln haben Sie die Grundlagen von Error Handling, Retry Policies und Workflow Evolution kennengelernt. Diese Konzepte bilden das Fundament für resiliente Temporal-Anwendungen. Doch in Production-Systemen stoßen Sie auf komplexere Herausforderungen: Workflows die Monate laufen, Hunderttausende Events akkumulieren, Millionen parallele Executions koordinieren oder mit externen Rate Limits umgehen müssen.

Dieses Kapitel behandelt fortgeschrittene Patterns, die Temporal von einer robusten Orchestration-Platform zu einem hochskalierbaren, produktionsreifen System machen.

Wann brauchen Sie Advanced Patterns?

  • Continue-As-New: Ihr Workflow läuft 6 Monate und hat 500,000 Events in der History
  • Child Workflows: Sie müssen 10,000 Sub-Tasks parallel orchestrieren
  • Parallel Execution: Batch-Verarbeitung von 100,000 Orders pro Stunde
  • Rate Limiting: Externe API erlaubt nur 100 Requests/Minute
  • Graceful Degradation: Service-Ausfälle dürfen Business nicht stoppen
  • State Management: Workflow-State ist 5 MB groß
  • Advanced Recovery: Manuelle Intervention bei kritischen Fehlern erforderlich

Lernziele

Nach diesem Kapitel können Sie:

  • Continue-As-New korrekt einsetzen um unbegrenzt lange Workflows zu ermöglichen
  • Child Workflows für Skalierung und Isolation nutzen
  • Parallele Activity Execution effizient orchestrieren
  • Rate Limiting auf Worker- und Activity-Level implementieren
  • Graceful Degradation mit Fallback-Mechanismen bauen
  • Large State und Payloads in Workflows handhaben
  • Human-in-the-Loop Patterns für manuelle Approvals implementieren
  • Production-Ready Monitoring und Observability aufsetzen

9.1 Continue-As-New Pattern

9.1.1 Das Event History Problem

Jede Workflow-Execution speichert ihre komplette Event History: jede Activity, jeder Timer, jedes Signal, jede State-Transition. Diese History wächst mit jeder Operation.

Problem: History hat praktische Limits:

Workflow läuft 365 Tage
├─ 1 Activity pro Stunde
├─ 1 Signal alle 10 Minuten
├─ State Updates jede Minute
└─ Total Events: ~500,000

Event History Size: ~50 MB
Performance Impact: Replay dauert Minuten!

Temporal Limits:

  • Empfohlenes Maximum: 50,000 Events
  • Hard Limit: Konfigurierbar (Default 50,000-200,000)
  • Performance Degradation: Ab 10,000 Events merklich

Was passiert bei zu großer History:

graph TD
    A[Workflow mit 100k Events] --> B[Worker fetcht History]
    B --> C[50 MB Download]
    C --> D[Replay dauert 5 Minuten]
    D --> E{Timeout?}
    E -->|Ja| F[Workflow Task Timeout]
    E -->|Nein| G[Extrem langsam]

    F --> H[Retry Loop]
    H --> B

    style F fill:#FF4444
    style G fill:#FFA500

9.1.2 Continue-As-New Lösung

Konzept: “Reboot” des Workflows mit neuem State, History wird zurückgesetzt.

from temporalio import workflow
from datetime import timedelta
import asyncio

@workflow.defn
class LongRunningEntityWorkflow:
    """
    Entity Workflow läuft unbegrenzt mit Continue-As-New.
    Beispiel: IoT Device Management, User Session, Subscription
    """

    def __init__(self) -> None:
        self._events_processed = 0
        self._state = {}
        self._iteration = 0
        self._pending_events: list[dict] = []

    @workflow.run
    async def run(self, entity_id: str, initial_state: dict) -> None:
        """Main workflow - läuft bis Continue-As-New"""

        self._state = initial_state
        workflow.logger.info(
            f"Entity workflow started (iteration {self._iteration})",
            extra={"entity_id": entity_id}
        )

        while True:
            # Warte auf Signals - spätestens nach einer Stunde periodisch aufwachen
            try:
                await workflow.wait_condition(
                    lambda: len(self._pending_events) > 0,
                    timeout=timedelta(hours=1)
                )
            except asyncio.TimeoutError:
                pass  # Kein Event innerhalb einer Stunde - einfach weiter

            # Process events
            for event in self._pending_events:
                await self._process_event(event)
                self._events_processed += 1

            self._pending_events.clear()

            # Check für Continue-As-New
            if self._should_continue_as_new():
                workflow.logger.info(
                    f"Continuing as new (processed {self._events_processed} events)",
                    extra={"iteration": self._iteration}
                )

                # Continue-As-New mit updated State
                workflow.continue_as_new(
                    args=[entity_id, self._state]
                )

    def _should_continue_as_new(self) -> bool:
        """Decision logic für Continue-As-New"""

        # Option 1: Nach fixer Anzahl Events
        if self._events_processed >= 1000:
            return True

        # Option 2: Temporal's Suggestion (basierend auf History Size)
        if workflow.info().is_continue_as_new_suggested():
            return True

        # Option 3: Nach Zeit
        elapsed = workflow.now() - workflow.info().start_time
        if elapsed > timedelta(days=7):
            return True

        return False

    @workflow.signal
    async def process_event(self, event: dict) -> None:
        """Signal Handler für Events"""
        self._pending_events.append(event)

Was passiert bei Continue-As-New:

sequenceDiagram
    participant W1 as Workflow Run 1
    participant Server as Temporal Server
    participant W2 as Workflow Run 2

    W1->>W1: Process 1000 events
    W1->>W1: History = 5000 events
    W1->>W1: continue_as_new(state)

    W1->>Server: CompleteWorkflowExecution<br/>+ StartWorkflowExecution
    Note over Server: Atomically complete<br/>old run and start new

    Server->>W2: Start with fresh history
    W2->>W2: State = inherited
    W2->>W2: History = 0 events

    Note over W2: Clean slate!<br/>Performance restored

Wichtige Eigenschaften:

  1. Atomic: Altes Workflow-Ende + Neues Workflow-Start = eine Operation
  2. Same Workflow ID: Workflow ID bleibt gleich
  3. New Run ID: Jeder Continue bekommt neue Run ID (siehe Client-Beispiel unten)
  4. State Migration: Übergebe explizit State via args
  5. Fresh History: Event History startet bei 0
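
Die Eigenschaften 2 und 3 lassen sich client-seitig leicht nachvollziehen. Die folgende Skizze (die Workflow ID "entity-123" ist nur ein Beispielwert) fragt die aktuell laufende Execution ab: Die workflow_id bleibt über Continue-As-New hinweg stabil, die run_id ändert sich mit jedem neuen Run.

from temporalio.client import Client

async def show_current_run(workflow_id: str = "entity-123") -> None:
    """Zeigt Workflow ID und aktuelle Run ID einer laufenden Execution"""
    client = await Client.connect("localhost:7233")

    # Handle ohne explizite Run ID zeigt immer auf den aktuellsten Run
    handle = client.get_workflow_handle(workflow_id)
    description = await handle.describe()

    # workflow_id bleibt gleich, run_id wechselt nach jedem Continue-As-New
    print(f"Workflow ID: {description.id}")
    print(f"Run ID:      {description.run_id}")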

9.1.3 State Migration

Best Practices für State Übergabe:

from dataclasses import dataclass, asdict
from typing import Optional, List

from temporalio import workflow

@dataclass
class EntityState:
    """Serializable State für Continue-As-New"""
    entity_id: str
    balance: float
    transactions: List[dict]
    metadata: dict
    version: int = 1  # State Schema Version

    def to_dict(self) -> dict:
        """Serialize für Continue-As-New"""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "EntityState":
        """Deserialize mit Version Handling"""
        version = data.get("version", 1)

        if version == 1:
            # Aktuelle Schema-Version: direkt deserialisieren
            return cls(**data)

        # Für zukünftige Schema-Versionen würde hier Migrationslogik folgen
        # (z.B. eine Hilfsmethode _migrate_v1_to_v2 für version == 2)
        raise ValueError(f"Unknown state version: {version}")

@workflow.defn
class AccountWorkflow:
    def __init__(self) -> None:
        self._state: Optional[EntityState] = None

    @workflow.run
    async def run(self, entity_id: str, state_dict: Optional[dict] = None) -> None:
        # Initialize oder restore State
        if state_dict:
            self._state = EntityState.from_dict(state_dict)
            workflow.logger.info("Restored from Continue-As-New")
        else:
            self._state = EntityState(
                entity_id=entity_id,
                balance=0.0,
                transactions=[],
                metadata={}
            )
            workflow.logger.info("Fresh workflow start")

        # ... workflow logic ...

        # Continue-As-New
        if self._should_continue():
            workflow.continue_as_new(
                args=[entity_id, self._state.to_dict()]
            )

Größenbeschränkungen beachten:

# ❌ FALSCH: State zu groß
@dataclass
class BadState:
    large_list: List[dict]  # 10 MB!

# Continue-As-New schlägt fehl wenn State > 2 MB

# ✅ RICHTIG: State compact halten
@dataclass
class GoodState:
    summary: dict  # Nur Zusammenfassung
    last_checkpoint: str

# Details in Activities/externe Storage auslagern

9.1.4 Frequency-Based vs Suggested Continue-As-New

Pattern 1: Frequency-Based (Deterministisch)

@workflow.defn
class FrequencyBasedWorkflow:
    def __init__(self) -> None:
        self._counter = 0

    @workflow.run
    async def run(self, state: dict) -> None:
        while True:
            await self._process_batch()
            self._counter += 1

            # Continue alle 100 Batches
            if self._counter >= 100:
                workflow.logger.info("Continue-As-New after 100 batches")
                workflow.continue_as_new(state)

            await asyncio.sleep(timedelta(minutes=5).total_seconds())

Vorteile: Vorhersehbar, testbar
Nachteile: Ignoriert tatsächliche History Size

Pattern 2: Suggested Continue-As-New (Dynamisch)

@workflow.defn
class SuggestedBasedWorkflow:
    @workflow.run
    async def run(self, state: dict) -> None:
        while True:
            await self._process_batch()

            # Temporal schlägt Continue vor wenn nötig
            if workflow.info().is_continue_as_new_suggested():
                workflow.logger.info(
                    "Continue-As-New suggested by Temporal",
                    extra={"history_size": workflow.info().get_current_history_length()}
                )
                workflow.continue_as_new(state)

            await asyncio.sleep(timedelta(minutes=5).total_seconds())

Vorteile: Adaptiv, optimal für die History Size
Nachteile: Zeitpunkt des Continue-As-New ist weniger vorhersehbar als bei einer festen Iterationszahl

Best Practice: Hybrid Approach

@workflow.defn
class HybridWorkflow:
    @workflow.run
    async def run(self, state: dict) -> None:
        iteration = 0

        while True:
            await self._process_batch()
            iteration += 1

            # Continue wenn EINE der Bedingungen erfüllt
            should_continue = (
                iteration >= 1000  # Max Iterations
                or workflow.info().is_continue_as_new_suggested()  # History zu groß
                or workflow.now() - workflow.info().start_time > timedelta(days=30)  # Max Time
            )

            if should_continue:
                workflow.continue_as_new(state)

            await asyncio.sleep(timedelta(hours=1).total_seconds())

9.2 Child Workflows

9.2.1 Wann Child Workflows nutzen?

Child Workflows sind eigenständige Workflow-Executions, gestartet von einem Parent Workflow. Sie bieten Event History Isolation und Independent Lifecycle.

Use Cases:

1. Fan-Out Pattern

# Parent orchestriert 1000 Child Workflows
await asyncio.gather(*[
    workflow.execute_child_workflow(ProcessOrder.run, order)
    for order in orders
])

2. Long-Running Sub-Tasks

# Child läuft Wochen, Parent monitored nur
child_handle = await workflow.start_child_workflow(
    DataPipelineWorkflow.run,
    dataset_id
)
# Parent kann weiter arbeiten

3. Retry Isolation

# Child hat eigene Retry Policy, unabhängig vom Parent
await workflow.execute_child_workflow(
    RiskyOperation.run,
    data,
    retry_policy=RetryPolicy(maximum_attempts=10)
)

4. Multi-Tenant Isolation

# Jeder Tenant bekommt eigenen Child Workflow
for tenant in tenants:
    await workflow.execute_child_workflow(
        TenantProcessor.run,
        tenant,
        id=f"tenant-{tenant.id}"  # Separate Workflow ID
    )

9.2.2 Parent vs Child Event History

Kritischer Unterschied: Child Workflows haben separate Event Histories.

graph TD
    subgraph Parent Workflow History
        P1[WorkflowExecutionStarted]
        P2[ChildWorkflowExecutionStarted]
        P3[ChildWorkflowExecutionCompleted]
        P4[WorkflowExecutionCompleted]

        P1 --> P2
        P2 --> P3
        P3 --> P4
    end

    subgraph Child Workflow History
        C1[WorkflowExecutionStarted]
        C2[ActivityTaskScheduled x 1000]
        C3[ActivityTaskCompleted x 1000]
        C4[WorkflowExecutionCompleted]

        C1 --> C2
        C2 --> C3
        C3 --> C4
    end

    P2 -.->|Spawns| C1
    C4 -.->|Returns Result| P3

    style Parent fill:#E6F3FF
    style Child fill:#FFE6E6

Parent History: Nur Start/Complete Events für das Child
Child History: Komplette Execution Details

Vorteil: Parent bleibt schlank, auch wenn Child komplex ist.

9.2.3 Child Workflow Patterns

Pattern 1: Fire-and-Forget

@workflow.defn
class ParentWorkflow:
    @workflow.run
    async def run(self, tasks: List[Task]) -> str:
        """Start Children und warte NICHT auf Completion"""

        for task in tasks:
            # start_child_workflow gibt Handle zurück ohne zu warten
            handle = await workflow.start_child_workflow(
                ProcessTaskWorkflow.run,
                args=[task],
                id=f"task-{task.id}",
                parent_close_policy=ParentClosePolicy.ABANDON
            )

            workflow.logger.info(f"Started child {task.id}")

        # Parent beendet, Children laufen weiter!
        return "All children started"

Pattern 2: Wait-All (Fan-Out/Fan-In)

@workflow.defn
class BatchProcessorWorkflow:
    @workflow.run
    async def run(self, items: List[Item]) -> dict:
        """Parallel processing mit Warten auf alle Results"""

        # Start alle Children parallel
        child_futures = [
            workflow.execute_child_workflow(
                ProcessItemWorkflow.run,
                item,
                id=f"item-{item.id}"
            )
            for item in items
        ]

        # Warte auf ALLE
        results = await asyncio.gather(*child_futures, return_exceptions=True)

        # Analyze results
        successful = [r for r in results if not isinstance(r, Exception)]
        failed = [r for r in results if isinstance(r, Exception)]

        return {
            "total": len(items),
            "successful": len(successful),
            "failed": len(failed),
            "results": successful
        }

Pattern 3: Throttled Parallel Execution

import asyncio
from temporalio import workflow

@workflow.defn
class ThrottledBatchWorkflow:
    @workflow.run
    async def run(self, items: List[Item]) -> dict:
        """Process mit max 10 parallelen Children"""

        semaphore = asyncio.Semaphore(10)  # Max 10 concurrent
        results = []

        async def process_with_semaphore(item: Item):
            async with semaphore:
                return await workflow.execute_child_workflow(
                    ProcessItemWorkflow.run,
                    item,
                    id=f"item-{item.id}"
                )

        # Start alle, aber Semaphore limitiert Parallelität
        results = await asyncio.gather(*[
            process_with_semaphore(item)
            for item in items
        ])

        return {"processed": len(results)}

Pattern 4: Rolling Window

@workflow.defn
class RollingWindowWorkflow:
    @workflow.run
    async def run(self, items: List[Item]) -> dict:
        """Process in Batches von 100"""

        batch_size = 100
        results = []

        for i in range(0, len(items), batch_size):
            batch = items[i:i + batch_size]

            workflow.logger.info(f"Processing batch {i//batch_size + 1}")

            # Process Batch parallel
            batch_results = await asyncio.gather(*[
                workflow.execute_child_workflow(
                    ProcessItemWorkflow.run,
                    item,
                    id=f"item-{item.id}"
                )
                for item in batch
            ])

            results.extend(batch_results)

            # Optional: Pause zwischen Batches
            await asyncio.sleep(5)

        return {"total_processed": len(results)}

9.2.4 Parent Close Policies

Was passiert mit Children wenn Parent beendet/canceled/terminated wird?

from temporalio.workflow import ParentClosePolicy

# Policy 1: TERMINATE - Kill children wenn parent schließt
await workflow.start_child_workflow(
    ChildWorkflow.run,
    args=[data],
    parent_close_policy=ParentClosePolicy.TERMINATE
)

# Policy 2: REQUEST_CANCEL - Cancellation request an children
await workflow.start_child_workflow(
    ChildWorkflow.run,
    args=[data],
    parent_close_policy=ParentClosePolicy.REQUEST_CANCEL
)

# Policy 3: ABANDON - Children laufen weiter (Achtung: der Default ist TERMINATE)
await workflow.start_child_workflow(
    ChildWorkflow.run,
    args=[data],
    parent_close_policy=ParentClosePolicy.ABANDON
)

Decision Matrix:

| Scenario | Empfohlene Policy |
|---|---|
| Parent verwaltet Child Lifecycle vollständig | TERMINATE |
| Child kann gracefully canceln | REQUEST_CANCEL |
| Child ist unabhängig | ABANDON |
| Data Pipeline (Parent orchestriert) | TERMINATE |
| Long-running Entity Workflow | ABANDON |
| User Session Management | REQUEST_CANCEL |

9.3 Parallel Execution Patterns

9.3.1 Activity Parallelism mit asyncio.gather()

Basic Parallel Activities:

from temporalio import workflow
import asyncio
from datetime import timedelta

@workflow.defn
class ParallelActivitiesWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> dict:
        """Execute multiple activities in parallel"""

        # Start all activities concurrently
        inventory_future = workflow.execute_activity(
            reserve_inventory,
            args=[order_id],
            start_to_close_timeout=timedelta(seconds=30)
        )

        payment_future = workflow.execute_activity(
            process_payment,
            args=[order_id],
            start_to_close_timeout=timedelta(seconds=30)
        )

        shipping_quote_future = workflow.execute_activity(
            get_shipping_quote,
            args=[order_id],
            start_to_close_timeout=timedelta(seconds=30)
        )

        # Wait for all to complete
        inventory, payment, shipping = await asyncio.gather(
            inventory_future,
            payment_future,
            shipping_quote_future
        )

        return {
            "inventory_reserved": inventory,
            "payment_processed": payment,
            "shipping_cost": shipping
        }

Warum parallel?

Sequential:
├─ reserve_inventory: 2s
├─ process_payment: 3s
└─ get_shipping_quote: 1s
Total: 6s

Parallel:
├─ reserve_inventory: 2s ┐
├─ process_payment: 3s   ├─ Concurrent
└─ get_shipping_quote: 1s┘
Total: 3s (longest activity)

9.3.2 Partial Failure Handling

Problem: Was wenn eine Activity fehlschlägt?

@workflow.defn
class PartialFailureWorkflow:
    @workflow.run
    async def run(self, items: List[str]) -> dict:
        """Handle partial failures gracefully"""

        # Execute all, capture exceptions
        results = await asyncio.gather(*[
            workflow.execute_activity(
                process_item,
                args=[item],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            for item in items
        ], return_exceptions=True)

        # Separate successful from failed
        successful = []
        failed = []

        for i, result in enumerate(results):
            if isinstance(result, Exception):
                failed.append({
                    "item": items[i],
                    "error": str(result)
                })
            else:
                successful.append({
                    "item": items[i],
                    "result": result
                })

        workflow.logger.info(
            f"Batch complete: {len(successful)} success, {len(failed)} failed"
        )

        # Decide: Fail workflow if any failed?
        if failed and len(failed) / len(items) > 0.1:  # >10% failure rate
            raise ApplicationError(
                f"Batch processing failed: {len(failed)} items",
                {"failed_items": failed}  # Details werden als positionale Argumente übergeben
            )

        return {
            "successful": successful,
            "failed": failed,
            "success_rate": len(successful) / len(items)
        }

9.3.3 Dynamic Parallelism mit Batching

Problem: 10,000 Items zu verarbeiten → 10,000 Activities spawnen?

Lösung: Batching

@workflow.defn
class BatchedParallelWorkflow:
    @workflow.run
    async def run(self, items: List[Item]) -> dict:
        """Process large dataset mit batching"""

        batch_size = 100  # Activity verarbeitet 100 Items
        max_parallel = 10  # Max 10 Activities parallel

        # Split in batches
        batches = [
            items[i:i + batch_size]
            for i in range(0, len(items), batch_size)
        ]

        workflow.logger.info(f"Processing {len(items)} items in {len(batches)} batches")

        all_results = []

        # Process batches mit concurrency limit
        for batch_group in self._chunk_list(batches, max_parallel):
            # Execute batch group parallel
            batch_results = await asyncio.gather(*[
                workflow.execute_activity(
                    process_batch,
                    args=[batch],
                    start_to_close_timeout=timedelta(minutes=5),
                    retry_policy=RetryPolicy(maximum_attempts=3)
                )
                for batch in batch_group
            ])

            all_results.extend(batch_results)

        return {
            "batches_processed": len(batches),
            "items_processed": len(items)
        }

    def _chunk_list(self, lst: List, chunk_size: int) -> List[List]:
        """Split list into chunks"""
        return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

Performance Comparison:

Scenario: 10,000 Items

Approach 1: Sequential
└─ 10,000 activities x 1s = 10,000s (~3 hours)

Approach 2: Unbounded Parallel
└─ 10,000 activities spawned
└─ Worker overload, Temporal Server pressure

Approach 3: Batched (100 items/batch, 10 parallel)
└─ 100 batches à ~100s, jeweils 10 parallel
└─ ~1,000s total time (~17 Minuten)

9.3.4 MapReduce Pattern

Full MapReduce Workflow:

from datetime import timedelta
from typing import Any, List

from temporalio import activity, workflow
from temporalio.common import RetryPolicy
import asyncio

@workflow.defn
class MapReduceWorkflow:
    """MapReduce Pattern für verteilte Verarbeitung"""

    @workflow.run
    async def run(
        self,
        dataset: List[Any],
        map_activity: str,
        reduce_activity: str,
        chunk_size: int = 100,
        max_parallel: int = 10
    ) -> Any:
        """
        Map-Reduce Execution:
        1. Split dataset in chunks
        2. Map: Process chunks parallel
        3. Reduce: Aggregate results
        """

        # ========== MAP PHASE ==========
        workflow.logger.info(f"MAP: Processing {len(dataset)} items")

        # Split in chunks
        chunks = [
            dataset[i:i + chunk_size]
            for i in range(0, len(dataset), chunk_size)
        ]

        # Map: Process all chunks parallel (mit limit)
        map_results = []
        for chunk_group in self._chunk_list(chunks, max_parallel):
            results = await asyncio.gather(*[
                workflow.execute_activity(
                    map_activity,
                    args=[chunk],
                    start_to_close_timeout=timedelta(minutes=5),
                    retry_policy=RetryPolicy(maximum_attempts=3)
                )
                for chunk in chunk_group
            ])
            map_results.extend(results)

        workflow.logger.info(f"MAP complete: {len(map_results)} results")

        # ========== REDUCE PHASE ==========
        workflow.logger.info(f"REDUCE: Aggregating {len(map_results)} results")

        # Reduce: Aggregate all map results
        final_result = await workflow.execute_activity(
            reduce_activity,
            args=[map_results],
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        workflow.logger.info("REDUCE complete")

        return final_result

    def _chunk_list(self, lst: List, chunk_size: int) -> List[List]:
        return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


# Activities für MapReduce
@activity.defn
async def map_activity(chunk: List[dict]) -> dict:
    """Map function: Process chunk"""
    total = 0
    for item in chunk:
        # Process item
        total += item.get("value", 0)

    return {"chunk_sum": total, "count": len(chunk)}

@activity.defn
async def reduce_activity(map_results: List[dict]) -> dict:
    """Reduce function: Aggregate"""
    grand_total = sum(r["chunk_sum"] for r in map_results)
    total_count = sum(r["count"] for r in map_results)

    return {
        "total": grand_total,
        "count": total_count,
        "average": grand_total / total_count if total_count > 0 else 0
    }

9.4 Rate Limiting und Throttling

9.4.1 Warum Rate Limiting?

Externe API Limits:

Third-Party API:
├─ 100 requests/minute
├─ 1000 requests/hour
└─ 10,000 requests/day

Ohne Rate Limiting → Exceeding limits → API errors → Workflow failures

9.4.2 Worker-Level Rate Limiting

Global Rate Limit über Worker Configuration:

from temporalio.worker import Worker

worker = Worker(
    client,
    task_queue="api-calls",
    workflows=[APIWorkflow],
    activities=[call_external_api],
    # Max concurrent activity executions
    max_concurrent_activities=10,
    # Max concurrent workflow tasks
    max_concurrent_workflow_tasks=50,
)

Limitation: Gilt pro Worker. Bei Scale-out (multiple Workers) → multiply limit.
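
Soll das Limit für die gesamte Task Queue gelten (unabhängig davon, wie viele Worker laufen), bietet das Python SDK zusätzlich den Parameter max_task_queue_activities_per_second an, der serverseitig pro Task Queue durchgesetzt wird. Eine Skizze dazu; Task Queue, Workflow und Activity sind aus dem Beispiel oben übernommen, die konkreten Werte sind Annahmen:

from temporalio.client import Client
from temporalio.worker import Worker

async def run_rate_limited_worker() -> None:
    client = await Client.connect("localhost:7233")

    worker = Worker(
        client,
        task_queue="api-calls",
        workflows=[APIWorkflow],          # Annahme: wie oben definiert
        activities=[call_external_api],   # Annahme: wie oben definiert
        # Rate Limit pro Worker-Prozess
        max_activities_per_second=5.0,
        # Rate Limit für die gesamte Task Queue (gilt über alle Worker hinweg)
        max_task_queue_activities_per_second=100 / 60,  # ~1.67/s = 100/min
    )

    await worker.run()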

9.4.3 Activity-Level Rate Limiting mit Semaphore

Workflow-Managed Semaphore:

import asyncio
from datetime import timedelta
from typing import List

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class RateLimitedWorkflow:
    """Workflow mit Activity Rate Limiting"""

    def __init__(self) -> None:
        # Semaphore: Max 5 concurrent API calls
        self._api_semaphore = asyncio.Semaphore(5)

    @workflow.run
    async def run(self, requests: List[dict]) -> List[dict]:
        """Process requests mit rate limit"""

        async def call_with_limit(request: dict):
            async with self._api_semaphore:
                return await workflow.execute_activity(
                    call_external_api,
                    args=[request],
                    start_to_close_timeout=timedelta(seconds=30),
                    retry_policy=RetryPolicy(
                        maximum_attempts=3,
                        initial_interval=timedelta(seconds=2)
                    )
                )

        # All requests, aber max 5 concurrent
        results = await asyncio.gather(*[
            call_with_limit(req)
            for req in requests
        ])

        return results

9.4.4 Token Bucket Rate Limiter

Advanced Pattern für präzise Rate Limits:

from dataclasses import dataclass
from datetime import timedelta
from typing import List

from temporalio import workflow
import asyncio

@dataclass
class TokenBucket:
    """Token Bucket für Rate Limiting"""
    capacity: int  # Max tokens
    refill_rate: float  # Tokens pro Sekunde
    tokens: float  # Current tokens
    last_refill: float  # Last refill timestamp

    def consume(self, tokens: int, current_time: float) -> bool:
        """Attempt to consume tokens"""
        # Refill tokens basierend auf elapsed time
        elapsed = current_time - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + (elapsed * self.refill_rate)
        )
        self.last_refill = current_time

        # Check if enough tokens
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True

        return False

    def time_until_available(self, tokens: int) -> float:
        """Sekunden bis genug Tokens verfügbar"""
        if self.tokens >= tokens:
            return 0.0

        needed = tokens - self.tokens
        return needed / self.refill_rate

@workflow.defn
class TokenBucketWorkflow:
    """Rate Limiting mit Token Bucket"""

    def __init__(self) -> None:
        # 100 requests/minute = 1.67 requests/second
        self._bucket = TokenBucket(
            capacity=100,
            refill_rate=100 / 60,  # 1.67/s
            tokens=100,
            last_refill=workflow.time_ns() / 1e9
        )

    @workflow.run
    async def run(self, requests: List[dict]) -> List[dict]:
        """Process mit token bucket rate limit"""
        results = []

        for request in requests:
            # Wait for token availability
            await self._wait_for_token()

            # Execute activity
            result = await workflow.execute_activity(
                call_api,
                args=[request],
                start_to_close_timeout=timedelta(seconds=30)
            )

            results.append(result)

        return results

    async def _wait_for_token(self) -> None:
        """Wait until token available"""
        while True:
            current_time = workflow.time_ns() / 1e9

            if self._bucket.consume(1, current_time):
                # Token consumed
                return

            # Wait until token available
            wait_time = self._bucket.time_until_available(1)
            await asyncio.sleep(wait_time)

9.4.5 Sliding Window Rate Limiter

Pattern für zeitbasierte Limits (z.B. 1000/hour):

from collections import deque
from datetime import timedelta
from typing import List

from temporalio import workflow
import asyncio

@workflow.defn
class SlidingWindowWorkflow:
    """Rate Limiting mit Sliding Window"""

    def __init__(self) -> None:
        # Track requests in sliding window
        self._request_timestamps: deque = deque()
        self._max_requests = 1000
        self._window_seconds = 3600  # 1 hour

    @workflow.run
    async def run(self, requests: List[dict]) -> List[dict]:
        results = []

        for request in requests:
            # Wait for slot in window
            await self._wait_for_window_slot()

            # Execute
            result = await workflow.execute_activity(
                call_api,
                args=[request],
                start_to_close_timeout=timedelta(seconds=30)
            )

            results.append(result)

        return results

    async def _wait_for_window_slot(self) -> None:
        """Wait until request slot available in window"""
        while True:
            current_time = workflow.time_ns() / 1e9
            cutoff_time = current_time - self._window_seconds

            # Remove expired timestamps
            while self._request_timestamps and self._request_timestamps[0] < cutoff_time:
                self._request_timestamps.popleft()

            # Check if slot available
            if len(self._request_timestamps) < self._max_requests:
                self._request_timestamps.append(current_time)
                return

            # Wait until oldest request expires
            oldest = self._request_timestamps[0]
            wait_until = oldest + self._window_seconds
            wait_seconds = wait_until - current_time

            if wait_seconds > 0:
                await asyncio.sleep(wait_seconds)

9.5 Graceful Degradation

9.5.1 Fallback Pattern

Konzept: Bei Service-Ausfall auf Fallback-Mechanismus zurückfallen.

from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

@workflow.defn
class GracefulDegradationWorkflow:
    """Workflow mit Fallback-Mechanismen"""

    @workflow.run
    async def run(self, user_id: str) -> dict:
        """
        Get user recommendations:
        1. Try ML-based recommendations (primary)
        2. Fallback to rule-based (secondary)
        3. Fallback to popular items (tertiary)
        """

        # Try primary: ML Recommendations
        try:
            workflow.logger.info("Attempting ML recommendations")
            recommendations = await workflow.execute_activity(
                get_ml_recommendations,
                args=[user_id],
                start_to_close_timeout=timedelta(seconds=10),
                retry_policy=RetryPolicy(
                    maximum_attempts=2,
                    initial_interval=timedelta(seconds=1)
                )
            )

            return {
                "recommendations": recommendations,
                "source": "ml",
                "degraded": False
            }

        except ActivityError as e:
            workflow.logger.warning(f"ML recommendations failed: {e}")

        # Fallback 1: Rule-Based
        try:
            workflow.logger.info("Falling back to rule-based recommendations")
            recommendations = await workflow.execute_activity(
                get_rule_based_recommendations,
                args=[user_id],
                start_to_close_timeout=timedelta(seconds=5),
                retry_policy=RetryPolicy(maximum_attempts=2)
            )

            return {
                "recommendations": recommendations,
                "source": "rules",
                "degraded": True
            }

        except ActivityError as e:
            workflow.logger.warning(f"Rule-based recommendations failed: {e}")

        # Fallback 2: Popular Items (always works)
        workflow.logger.info("Falling back to popular items")
        recommendations = await workflow.execute_activity(
            get_popular_items,
            start_to_close_timeout=timedelta(seconds=3),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        return {
            "recommendations": recommendations,
            "source": "popular",
            "degraded": True
        }

9.5.2 Circuit Breaker mit Fallback

Integration von Circuit Breaker + Fallback:

from enum import Enum
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    timeout_seconds: int = 60
    half_open_attempts: int = 2

@workflow.defn
class CircuitBreakerFallbackWorkflow:
    """Circuit Breaker mit automatischem Fallback"""

    def __init__(self) -> None:
        self._circuit_state = CircuitState.CLOSED
        self._failure_count = 0
        self._last_failure_time: Optional[float] = None
        self._config = CircuitBreakerConfig()

    @workflow.run
    async def run(self, requests: List[dict]) -> List[dict]:
        results = []

        for request in requests:
            result = await self._execute_with_circuit_breaker(request)
            results.append(result)

        return results

    async def _execute_with_circuit_breaker(self, request: dict) -> dict:
        """Execute mit Circuit Breaker Protection + Fallback"""

        # Check circuit state
        await self._update_circuit_state()

        if self._circuit_state == CircuitState.OPEN:
            # Circuit OPEN → Immediate fallback
            workflow.logger.warning("Circuit OPEN - using fallback")
            return await self._execute_fallback(request)

        # Try primary service
        try:
            result = await workflow.execute_activity(
                call_primary_service,
                args=[request],
                start_to_close_timeout=timedelta(seconds=10),
                retry_policy=RetryPolicy(maximum_attempts=1)  # No retries
            )

            # Success → Reset circuit
            await self._on_success()
            return result

        except ActivityError as e:
            # Failure → Update circuit
            await self._on_failure()

            workflow.logger.warning(f"Primary service failed: {e}")

            # Use fallback
            return await self._execute_fallback(request)

    async def _execute_fallback(self, request: dict) -> dict:
        """Fallback execution"""
        return await workflow.execute_activity(
            call_fallback_service,
            args=[request],
            start_to_close_timeout=timedelta(seconds=5),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

    async def _update_circuit_state(self) -> None:
        """Update circuit state basierend auf Timeouts"""
        if self._circuit_state == CircuitState.OPEN:
            current_time = workflow.time_ns() / 1e9

            if self._last_failure_time:
                elapsed = current_time - self._last_failure_time

                if elapsed > self._config.timeout_seconds:
                    # Timeout elapsed → Try half-open
                    self._circuit_state = CircuitState.HALF_OPEN
                    workflow.logger.info("Circuit HALF_OPEN - testing recovery")

    async def _on_success(self) -> None:
        """Handle successful call"""
        if self._circuit_state == CircuitState.HALF_OPEN:
            # Success in half-open → Close circuit
            self._circuit_state = CircuitState.CLOSED
            self._failure_count = 0
            workflow.logger.info("Circuit CLOSED - service recovered")

        elif self._circuit_state == CircuitState.CLOSED:
            # Reset failure count
            self._failure_count = 0

    async def _on_failure(self) -> None:
        """Handle failed call"""
        self._failure_count += 1
        self._last_failure_time = workflow.time_ns() / 1e9

        if self._circuit_state == CircuitState.HALF_OPEN:
            # Failure in half-open → Reopen
            self._circuit_state = CircuitState.OPEN
            workflow.logger.warning("Circuit OPEN - service still failing")

        elif self._failure_count >= self._config.failure_threshold:
            # Too many failures → Open circuit
            self._circuit_state = CircuitState.OPEN
            workflow.logger.warning(
                f"Circuit OPEN after {self._failure_count} failures"
            )

9.6 State Management Patterns

9.6.1 Large Payload Problem

Problem: Workflow State > 2 MB → Performance Issues

# ❌ FALSCH: Large state in workflow
@workflow.defn
class BadWorkflow:
    def __init__(self) -> None:
        self._large_dataset: List[dict] = []  # Can grow to 10 MB!

    @workflow.run
    async def run(self) -> None:
        # Load large data
        self._large_dataset = await workflow.execute_activity(...)
        # State wird bei jedem Workflow Task übertragen → slow!

Temporal Limits:

  • Payload Size Limit: 2 MB (default, konfigurierbar bis 4 MB)
  • History Size Limit: 50 MB (empfohlen)
  • Performance: State wird bei jedem Task serialisiert/deserialisiert

9.6.2 External Storage Pattern

Lösung: Large Data in externes Storage, nur Reference in Workflow.

from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

from temporalio import workflow

@dataclass
class DataReference:
    """Reference zu externem Storage"""
    storage_key: str
    size_bytes: int
    checksum: str
    storage_type: str = "s3"  # s3, gcs, blob, etc.

@workflow.defn
class ExternalStorageWorkflow:
    """Workflow mit External Storage Pattern"""

    def __init__(self) -> None:
        self._data_ref: Optional[DataReference] = None

    @workflow.run
    async def run(self, dataset_id: str) -> dict:
        """Process large dataset via external storage"""

        # Step 1: Load data und store externally
        workflow.logger.info(f"Loading dataset {dataset_id}")
        self._data_ref = await workflow.execute_activity(
            load_and_store_dataset,
            args=[dataset_id],
            start_to_close_timeout=timedelta(minutes=10),
            heartbeat_timeout=timedelta(seconds=30)
        )

        workflow.logger.info(
            f"Dataset stored: {self._data_ref.storage_key} "
            f"({self._data_ref.size_bytes} bytes)"
        )

        # Step 2: Process via reference
        result = await workflow.execute_activity(
            process_dataset_from_storage,
            args=[self._data_ref],
            start_to_close_timeout=timedelta(minutes=30),
            heartbeat_timeout=timedelta(minutes=1)
        )

        # Step 3: Cleanup external storage
        await workflow.execute_activity(
            cleanup_storage,
            args=[self._data_ref.storage_key],
            start_to_close_timeout=timedelta(minutes=5)
        )

        return result

# Activities (Annahme: database und s3_client sind bereits initialisierte async Clients, z.B. via aioboto3)
import hashlib
import json
import uuid

from temporalio import activity

@activity.defn
async def load_and_store_dataset(dataset_id: str) -> DataReference:
    """Load large dataset und store in S3"""
    # Load from database
    data = await database.load_dataset(dataset_id)

    # Store in S3
    storage_key = f"datasets/{dataset_id}/{uuid.uuid4()}.json"
    await s3_client.put_object(
        Bucket="workflow-data",
        Key=storage_key,
        Body=json.dumps(data)
    )

    return DataReference(
        storage_key=storage_key,
        size_bytes=len(json.dumps(data)),
        checksum=hashlib.md5(json.dumps(data).encode()).hexdigest(),
        storage_type="s3"
    )

@activity.defn
async def process_dataset_from_storage(ref: DataReference) -> dict:
    """Process dataset from external storage"""
    # Load from S3
    response = await s3_client.get_object(
        Bucket="workflow-data",
        Key=ref.storage_key
    )
    data = json.loads(response['Body'].read())

    # Process
    result = process_data(data)

    return result

9.6.3 Compression Pattern

Alternative: State komprimieren bei Storage.

import gzip
import base64
import json
from datetime import timedelta
from typing import Any, Optional

from temporalio import workflow

@workflow.defn
class CompressedStateWorkflow:
    """Workflow mit State Compression"""

    def __init__(self) -> None:
        self._compressed_state: Optional[str] = None

    @workflow.run
    async def run(self) -> dict:
        # Load large state
        large_state = await workflow.execute_activity(
            load_large_state,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Compress state
        self._compressed_state = self._compress_state(large_state)

        workflow.logger.info(
            f"State compressed: "
            f"{len(json.dumps(large_state))} → {len(self._compressed_state)} bytes"
        )

        # Later: Decompress when needed
        state = self._decompress_state(self._compressed_state)

        return {"status": "complete"}

    def _compress_state(self, state: Any) -> str:
        """Compress state für storage"""
        json_bytes = json.dumps(state).encode('utf-8')
        compressed = gzip.compress(json_bytes)
        return base64.b64encode(compressed).decode('ascii')

    def _decompress_state(self, compressed: str) -> Any:
        """Decompress state"""
        compressed_bytes = base64.b64decode(compressed.encode('ascii'))
        json_bytes = gzip.decompress(compressed_bytes)
        return json.loads(json_bytes.decode('utf-8'))

9.6.4 Incremental State Updates

Pattern: Nur Änderungen tracken statt kompletten State.

from dataclasses import dataclass, field
from datetime import timedelta
from typing import Any, Dict, List, Set

from temporalio import workflow

@dataclass
class IncrementalState:
    """State mit incremental updates"""
    processed_ids: Set[str] = field(default_factory=set)
    failed_ids: Set[str] = field(default_factory=set)
    metadata: Dict[str, Any] = field(default_factory=dict)

    # Track only changes
    _added_since_checkpoint: List[str] = field(default_factory=list)
    _checkpoint_size: int = 0

    def add_processed(self, item_id: str):
        """Add processed item"""
        if item_id not in self.processed_ids:
            self.processed_ids.add(item_id)
            self._added_since_checkpoint.append(item_id)

    def should_checkpoint(self, threshold: int = 1000) -> bool:
        """Check if checkpoint needed"""
        return len(self._added_since_checkpoint) >= threshold

@workflow.defn
class IncrementalStateWorkflow:
    """Workflow mit incremental state updates"""

    def __init__(self) -> None:
        self._state = IncrementalState()

    @workflow.run
    async def run(self, items: List[str]) -> dict:
        for item in items:
            # Process item
            await workflow.execute_activity(
                process_item,
                args=[item],
                start_to_close_timeout=timedelta(seconds=30)
            )

            self._state.add_processed(item)

            # Checkpoint state periodically
            if self._state.should_checkpoint():
                await self._checkpoint_state()

        return {
            "processed": len(self._state.processed_ids),
            "failed": len(self._state.failed_ids)
        }

    async def _checkpoint_state(self):
        """Persist state checkpoint"""
        await workflow.execute_activity(
            save_checkpoint,
            args=[{
                "processed_ids": list(self._state.processed_ids),
                "failed_ids": list(self._state.failed_ids),
                "metadata": self._state.metadata
            }],
            start_to_close_timeout=timedelta(seconds=30)
        )

        # Reset incremental tracking
        self._state._added_since_checkpoint.clear()

        workflow.logger.info(
            f"Checkpoint saved: {len(self._state.processed_ids)} processed"
        )

9.7 Human-in-the-Loop Patterns

9.7.1 Manual Approval Pattern

Use Case: Kritische Workflows brauchen menschliche Genehmigung.

from temporalio import workflow
import asyncio
from datetime import timedelta
from typing import Optional

@workflow.defn
class ApprovalWorkflow:
    """Workflow mit Manual Approval"""

    def __init__(self) -> None:
        self._approval_granted: bool = False
        self._rejection_reason: Optional[str] = None

    @workflow.run
    async def run(self, request: dict) -> dict:
        """Execute mit approval requirement"""

        # Step 1: Validation
        workflow.logger.info("Validating request")
        validation = await workflow.execute_activity(
            validate_request,
            args=[request],
            start_to_close_timeout=timedelta(seconds=30)
        )

        if not validation.is_valid:
            return {"status": "rejected", "reason": "validation_failed"}

        # Step 2: Request Approval
        workflow.logger.info("Requesting manual approval")
        await workflow.execute_activity(
            send_approval_request,
            args=[request, workflow.info().workflow_id],
            start_to_close_timeout=timedelta(seconds=30)
        )

        # Step 3: Wait for approval (max 7 days)
        try:
            await workflow.wait_condition(
                lambda: self._approval_granted or self._rejection_reason is not None,
                timeout=timedelta(days=7)
            )
        except asyncio.TimeoutError:
            workflow.logger.warning("Approval timeout - auto-rejecting")
            return {"status": "timeout", "reason": "no_approval_within_7_days"}

        # Check approval result
        if self._rejection_reason:
            workflow.logger.info(f"Request rejected: {self._rejection_reason}")
            return {"status": "rejected", "reason": self._rejection_reason}

        # Step 4: Execute approved action
        workflow.logger.info("Approval granted - executing")
        result = await workflow.execute_activity(
            execute_approved_action,
            args=[request],
            start_to_close_timeout=timedelta(minutes=30)
        )

        return {"status": "approved", "result": result}

    @workflow.signal
    async def approve(self) -> None:
        """Signal: Approve request"""
        self._approval_granted = True
        workflow.logger.info("Approval signal received")

    @workflow.signal
    async def reject(self, reason: str) -> None:
        """Signal: Reject request"""
        self._rejection_reason = reason
        workflow.logger.info(f"Rejection signal received: {reason}")

    @workflow.query
    def get_status(self) -> dict:
        """Query: Current approval status"""
        return {
            "approved": self._approval_granted,
            "rejected": self._rejection_reason is not None,
            "rejection_reason": self._rejection_reason,
            "waiting": not self._approval_granted and not self._rejection_reason
        }

Client-seitige Approval:

from temporalio.client import Client

async def approve_workflow(workflow_id: str):
    """Approve workflow extern"""
    client = await Client.connect("localhost:7233")

    handle = client.get_workflow_handle(workflow_id)

    # Send approval signal
    await handle.signal(ApprovalWorkflow.approve)

    print(f"Workflow {workflow_id} approved")

async def reject_workflow(workflow_id: str, reason: str):
    """Reject workflow extern"""
    client = await Client.connect("localhost:7233")

    handle = client.get_workflow_handle(workflow_id)

    # Send rejection signal
    await handle.signal(ApprovalWorkflow.reject, reason)

    print(f"Workflow {workflow_id} rejected: {reason}")

9.7.2 Multi-Step Approval Chain

Pattern: Mehrstufige Genehmigungskette.

import asyncio
from enum import Enum
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

from temporalio import workflow

class ApprovalLevel(Enum):
    MANAGER = "manager"
    DIRECTOR = "director"
    VP = "vp"
    CEO = "ceo"

@dataclass
class ApprovalStep:
    level: ApprovalLevel
    approved_by: Optional[str] = None
    approved_at: Optional[float] = None
    rejected: bool = False
    rejection_reason: Optional[str] = None

@workflow.defn
class MultiStepApprovalWorkflow:
    """Workflow mit Multi-Level Approval Chain"""

    def __init__(self) -> None:
        self._approval_chain: List[ApprovalStep] = []
        self._current_step = 0

    @workflow.run
    async def run(self, request: dict, amount: float) -> dict:
        """Execute mit approval chain basierend auf amount"""

        # Determine approval chain basierend auf amount
        if amount < 10000:
            levels = [ApprovalLevel.MANAGER]
        elif amount < 100000:
            levels = [ApprovalLevel.MANAGER, ApprovalLevel.DIRECTOR]
        elif amount < 1000000:
            levels = [ApprovalLevel.MANAGER, ApprovalLevel.DIRECTOR, ApprovalLevel.VP]
        else:
            levels = [ApprovalLevel.MANAGER, ApprovalLevel.DIRECTOR, ApprovalLevel.VP, ApprovalLevel.CEO]

        # Initialize approval chain
        self._approval_chain = [ApprovalStep(level=level) for level in levels]

        workflow.logger.info(
            f"Approval chain: {[step.level.value for step in self._approval_chain]}"
        )

        # Process each approval step
        for i, step in enumerate(self._approval_chain):
            self._current_step = i

            workflow.logger.info(f"Requesting {step.level.value} approval")

            # Notify approver
            await workflow.execute_activity(
                notify_approver,
                args=[step.level.value, request, workflow.info().workflow_id],
                start_to_close_timeout=timedelta(seconds=30)
            )

            # Wait for approval (24 hours timeout per level)
            try:
                await workflow.wait_condition(
                    lambda: step.approved_by is not None or step.rejected,
                    timeout=timedelta(hours=24)
                )
            except asyncio.TimeoutError:
                workflow.logger.warning(f"{step.level.value} approval timeout")
                return {
                    "status": "timeout",
                    "level": step.level.value,
                    "reason": "approval_timeout"
                }

            # Check result
            if step.rejected:
                workflow.logger.info(
                    f"Rejected at {step.level.value}: {step.rejection_reason}"
                )
                return {
                    "status": "rejected",
                    "level": step.level.value,
                    "reason": step.rejection_reason
                }

            workflow.logger.info(
                f"{step.level.value} approved by {step.approved_by}"
            )

        # All approvals granted → Execute
        workflow.logger.info("All approvals granted - executing")
        result = await workflow.execute_activity(
            execute_approved_action,
            args=[request],
            start_to_close_timeout=timedelta(minutes=30)
        )

        return {
            "status": "approved",
            "approval_chain": [
                {
                    "level": step.level.value,
                    "approved_by": step.approved_by,
                    "approved_at": step.approved_at
                }
                for step in self._approval_chain
            ],
            "result": result
        }

    @workflow.signal
    async def approve_step(self, level: str, approver: str) -> None:
        """Approve current step"""
        current_step = self._approval_chain[self._current_step]

        if current_step.level.value == level:
            current_step.approved_by = approver
            current_step.approved_at = workflow.time_ns() / 1e9
            workflow.logger.info(f"Step approved: {level} by {approver}")

    @workflow.signal
    async def reject_step(self, level: str, reason: str) -> None:
        """Reject current step"""
        current_step = self._approval_chain[self._current_step]

        if current_step.level.value == level:
            current_step.rejected = True
            current_step.rejection_reason = reason
            workflow.logger.info(f"Step rejected: {level} - {reason}")

    @workflow.query
    def get_approval_status(self) -> dict:
        """Query approval status"""
        return {
            "current_step": self._current_step,
            "total_steps": len(self._approval_chain),
            "steps": [
                {
                    "level": step.level.value,
                    "approved": step.approved_by is not None,
                    "rejected": step.rejected,
                    "pending": step.approved_by is None and not step.rejected
                }
                for step in self._approval_chain
            ]
        }
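
Analog zu 9.7.1 gibt ein Approver den jeweils aktuellen Schritt per Signal frei. Eine Client-Skizze dazu; Level und Approver sind hier frei gewählte Beispielwerte:

from temporalio.client import Client

async def approve_current_step(workflow_id: str, level: str, approver: str) -> None:
    """Approve den aktuell wartenden Schritt der Approval Chain"""
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle(workflow_id)

    # Mehrere Signal-Argumente werden über args übergeben
    await handle.signal(
        MultiStepApprovalWorkflow.approve_step,
        args=[level, approver],
    )

    # Fortschritt der Chain per Query prüfen
    status = await handle.query(MultiStepApprovalWorkflow.get_approval_status)
    print(f"Approval-Status: {status}")

# Beispiel-Aufruf:
# await approve_current_step("expense-123", "manager", "alice")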

9.8 Zusammenfassung

Kernkonzepte:

  1. Continue-As-New: Unbegrenzte Workflow-Laufzeit durch History-Reset
  2. Child Workflows: Event History Isolation und unabhängiger Lifecycle
  3. Parallel Execution: Effiziente Batch-Verarbeitung mit asyncio.gather()
  4. Rate Limiting: Token Bucket, Sliding Window, Semaphore-Patterns
  5. Graceful Degradation: Fallback-Mechanismen und Circuit Breaker
  6. State Management: External Storage, Compression, Incremental Updates
  7. Human-in-the-Loop: Manual Approvals und Multi-Level Chains

Best Practices Checkliste:

  • ✅ Continue-As-New bei >10,000 Events oder Temporal Suggestion
  • ✅ Child Workflows für Event History Isolation
  • ✅ Batching bei >1000 parallelen Activities
  • ✅ Rate Limiting für externe APIs implementieren
  • ✅ Fallback-Mechanismen für kritische Services
  • ✅ State < 2 MB halten, sonst External Storage
  • ✅ Approval Workflows mit Timeout
  • ✅ Monitoring für alle Advanced Patterns

Wann welches Pattern?

| Scenario | Pattern |
|---|---|
| Workflow > 30 Tage | Continue-As-New |
| >1000 Sub-Tasks | Child Workflows |
| Batch Processing | Parallel Execution + Batching |
| Externe API mit Limits | Rate Limiting (Token Bucket) |
| Service-Ausfälle möglich | Graceful Degradation |
| State > 2 MB | External Storage |
| Manuelle Genehmigung | Human-in-the-Loop |

Im nächsten Kapitel (Teil IV: Betrieb) werden wir uns mit Production Deployments, Monitoring und Operations beschäftigen - wie Sie Temporal-Systeme in Production betreiben.


⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 10: Produktions-Deployment

Code-Beispiele für dieses Kapitel: examples/part-03/chapter-09/

Kapitel 10: Deployment und Production Best Practices

Einleitung

Temporal in der Entwicklung zum Laufen zu bringen ist einfach – temporal server start-dev und los geht’s. Aber der Sprung in die Production ist eine andere Herausforderung. Sie müssen über High Availability, Zero-Downtime Deployments, Kapazitätsplanung, Disaster Recovery und vieles mehr nachdenken.

Dieses Kapitel behandelt alles, was Sie wissen müssen, um Temporal sicher und zuverlässig in Production zu betreiben. Von Deployment-Strategien über Worker-Management bis hin zu Temporal Server-Konfiguration.

Das Grundproblem

Scenario: Sie haben 50,000 laufende Workflows in Production. Ein neues Feature muss deployed werden. Aber:

  • Workflows laufen über Wochen/Monate
  • Worker dürfen nicht einfach beendet werden (laufende Activities!)
  • Code-Änderungen müssen abwärtskompatibel sein
  • Zero Downtime ist Pflicht
  • Rollback muss möglich sein

Ohne Best Practices:

# FALSCH: Workers einfach neu starten
kubectl delete pods -l app=temporal-worker  # ❌ Kills running activities!
kubectl apply -f new-worker.yaml

Ergebnis:

  • Activities werden abgebrochen
  • Workflows müssen retries durchführen
  • Potentieller Datenverlust
  • Service-Degradation

Mit Best Practices:

# RICHTIG: Graceful shutdown + Rolling deployment
kubectl set image deployment/temporal-worker worker=v2.0.0  # ✅
# Kubernetes terminiert Pods graceful
# Workers beenden aktuelle Tasks
# Neue Workers starten parallel

Lernziele

Nach diesem Kapitel können Sie:

  • Verschiedene Deployment-Strategien (Blue-Green, Canary, Rolling) anwenden
  • Workers graceful shutdown und zero-downtime deployments durchführen
  • Temporal Server selbst hosten oder Cloud nutzen
  • High Availability und Disaster Recovery implementieren
  • Capacity Planning durchführen
  • Production Checklisten anwenden
  • Monitoring und Alerting aufsetzen (Details in Kapitel 11)
  • Security Best Practices implementieren (Details in Kapitel 13)

10.1 Worker Deployment Strategies

10.1.1 Graceful Shutdown

Warum wichtig?

Workers führen Activities aus, die Minuten oder Stunden dauern können. Ein abruptes Beenden würde:

  • Laufende Activities abbrechen
  • Externe State inkonsistent lassen
  • Unnötige Retries auslösen

Lifecycle eines Workers:

stateDiagram-v2
    [*] --> Starting: Start Worker
    Starting --> Running: Connected to Temporal
    Running --> Draining: SIGTERM received
    Draining --> Stopped: All tasks completed
    Stopped --> [*]

    note right of Draining
        - Accepts no new tasks
        - Completes running tasks
        - Timeout: 30s - 5min
    end note

Python Implementation:

"""
Graceful Worker Shutdown
"""

import asyncio
import logging
import signal
from datetime import timedelta

from temporalio.client import Client
from temporalio.worker import Worker

logger = logging.getLogger(__name__)

class GracefulWorker:
    """Worker with graceful shutdown support"""

    def __init__(
        self,
        client: Client,
        task_queue: str,
        workflows: list,
        activities: list,
        shutdown_timeout: float = 300.0  # 5 minutes
    ):
        self.client = client
        self.task_queue = task_queue
        self.workflows = workflows
        self.activities = activities
        self.shutdown_timeout = shutdown_timeout
        self.worker: Worker | None = None
        self._shutdown_event = asyncio.Event()

    async def run(self):
        """Run worker with graceful shutdown handling"""

        # Setup signal handlers
        loop = asyncio.get_running_loop()

        def signal_handler(sig):
            logger.info(f"Received signal {sig}, initiating graceful shutdown...")
            self._shutdown_event.set()

        # Handle SIGTERM (Kubernetes pod termination)
        loop.add_signal_handler(signal.SIGTERM, lambda: signal_handler("SIGTERM"))
        # Handle SIGINT (Ctrl+C)
        loop.add_signal_handler(signal.SIGINT, lambda: signal_handler("SIGINT"))

        # Create and start worker
        logger.info(f"Starting worker on task queue: {self.task_queue}")

        async with Worker(
            self.client,
            task_queue=self.task_queue,
            workflows=self.workflows,
            activities=self.activities,
            # Important: Enable graceful shutdown
            graceful_shutdown_timeout=timedelta(seconds=self.shutdown_timeout)
        ) as self.worker:

            logger.info("✓ Worker started and polling for tasks")

            # Wait for shutdown signal
            await self._shutdown_event.wait()

            logger.info("Shutdown signal received, draining tasks...")
            logger.info(f"Waiting up to {self.shutdown_timeout}s for tasks to complete")

            # Worker will automatically:
            # 1. Stop accepting new tasks
            # 2. Wait for running tasks to complete
            # 3. Timeout after graceful_shutdown_timeout

        logger.info("✓ Worker stopped gracefully")

# Usage
async def main():
    client = await Client.connect("localhost:7233")

    worker = GracefulWorker(
        client=client,
        task_queue="production-queue",
        workflows=[MyWorkflow],
        activities=[my_activity],
        shutdown_timeout=300.0
    )

    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())

Kubernetes Deployment mit Graceful Shutdown:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

  template:
    metadata:
      labels:
        app: temporal-worker
        version: v2.0.0

    spec:
      containers:
      - name: worker
        image: myregistry/temporal-worker:v2.0.0

        # Resource limits
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

        # Graceful shutdown configuration
        lifecycle:
          preStop:
            exec:
              # Optional: Custom pre-stop hook
              # Worker already handles SIGTERM gracefully
              command: ["/bin/sh", "-c", "sleep 5"]

        # Health checks
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5

        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

      # Termination grace period (must be > graceful_shutdown_timeout!)
      terminationGracePeriodSeconds: 360  # 6 minutes

Best Practices:

DO:

  • Set graceful_shutdown_timeout > longest expected activity duration
  • Set Kubernetes terminationGracePeriodSeconds > graceful_shutdown_timeout + buffer
  • Log shutdown progress for observability
  • Monitor drain duration metrics
  • Test graceful shutdown in staging

DON’T:

  • Use SIGKILL for routine shutdowns
  • Set timeout too short (activities will be aborted)
  • Ignore health check failures
  • Skip testing shutdown behavior

10.1.2 Rolling Deployment

Pattern: Schrittweises Ersetzen alter Workers durch neue.

Vorteile:

  • ✅ Zero Downtime
  • ✅ Automatisches Rollback bei Fehlern
  • ✅ Kapazität bleibt konstant
  • ✅ Standard in Kubernetes

Nachteile:

  • ⚠️ Zwei Versionen laufen parallel
  • ⚠️ Code muss backward-compatible sein
  • ⚠️ Langsamer als Blue-Green

Flow:

graph TD
    A[3 Workers v1.0] --> B[2 Workers v1.0<br/>1 Worker v2.0]
    B --> C[1 Worker v1.0<br/>2 Workers v2.0]
    C --> D[3 Workers v2.0]

    style A fill:#e1f5ff
    style D fill:#d4f1d4

Kubernetes RollingUpdate:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2  # Max 2 workers down at once
      maxSurge: 2        # Max 2 extra workers during rollout

  template:
    spec:
      containers:
      - name: worker
        image: myregistry/temporal-worker:v2.0.0

Deployment Process:

# 1. Update image
kubectl set image deployment/temporal-worker \
  worker=myregistry/temporal-worker:v2.0.0

# 2. Monitor rollout
kubectl rollout status deployment/temporal-worker

# Output:
# Waiting for deployment "temporal-worker" rollout to finish: 2 out of 10 new replicas have been updated...
# Waiting for deployment "temporal-worker" rollout to finish: 5 out of 10 new replicas have been updated...
# deployment "temporal-worker" successfully rolled out

# 3. If problems occur, rollback
kubectl rollout undo deployment/temporal-worker

Health Check during Rollout:

"""
Health check endpoint for Kubernetes probes
"""

import asyncio

from fastapi import FastAPI
from fastapi.responses import JSONResponse
from temporalio.worker import Worker

app = FastAPI()

worker: Worker | None = None

@app.get("/health")
async def health():
    """Liveness probe: Is the process alive?"""
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    """Readiness probe: Is worker ready to accept tasks?"""
    if worker is None or not worker.is_running:
        # FastAPI: Status-Code explizit über JSONResponse setzen
        return JSONResponse(status_code=503, content={"status": "not_ready"})

    return {"status": "ready"}

# Run alongside worker
async def run_worker_with_health_check():
    import uvicorn

    # Start health check server
    config = uvicorn.Config(app, host="0.0.0.0", port=8080)
    server = uvicorn.Server(config)

    # Run both concurrently
    await asyncio.gather(
        server.serve(),
        run_worker()  # Your worker logic
    )

10.1.3 Blue-Green Deployment

Pattern: Zwei identische Environments (Blue = alt, Green = neu). Traffic wird komplett umgeschaltet.

Vorteile:

  • ✅ Instant Rollback (switch back)
  • ✅ Beide Versionen getestet vor Switch
  • ✅ Zero Downtime
  • ✅ Volle Kontrolle über Cutover

Nachteile:

  • ⚠️ Doppelte Ressourcen während Deployment
  • ⚠️ Komplexere Infrastruktur
  • ⚠️ Database Schema muss kompatibel sein

Flow:

graph LR
    A[Traffic 100%] --> B[Blue Workers v1.0]

    C[Green Workers v2.0<br/>Deployed & Tested] -.-> D[Switch Traffic]

    D --> E[Traffic 100%]
    E --> F[Green Workers v2.0]

    G[Blue Workers v1.0] -.-> H[Keep for Rollback]

    style B fill:#e1f5ff
    style F fill:#d4f1d4
    style G fill:#ffe1e1

Implementation with Worker Versioning:

"""
Blue-Green Deployment mit Worker Versioning (Build IDs)
"""

from temporalio.client import (
    Client,
    # Hinweis: Die Namen der Build-ID-Operationen können je nach temporalio-Version abweichen
    BuildIdOpAddNewDefault,
    BuildIdOpPromoteSetByBuildId,
)
from temporalio.worker import Worker

async def deploy_green_workers():
    """Deploy new GREEN workers with new Build ID"""

    client = await Client.connect("localhost:7233")

    # Green workers mit neuem Build ID
    worker = Worker(
        client,
        task_queue="production-queue",
        workflows=[MyWorkflowV2],  # New version
        activities=[my_activity_v2],
        build_id="v2.0.0",  # GREEN Build ID
        use_worker_versioning=True
    )

    await worker.run()

async def cutover_to_green():
    """Switch traffic from BLUE to GREEN"""

    client = await Client.connect("localhost:7233")

    # Make v2.0.0 the default for new workflows
    await client.update_worker_build_id_compatibility(
        task_queue="production-queue",
        operation=BuildIdOpAddNewDefault("v2.0.0")
    )

    print("✓ Traffic switched to GREEN (v2.0.0)")
    print("  - New workflows → v2.0.0")
    print("  - Running workflows → continue on v1.0.0")

async def rollback_to_blue():
    """Rollback to BLUE version"""

    client = await Client.connect("localhost:7233")

    # Revert to v1.0.0
    await client.update_worker_build_id_compatibility(
        task_queue="production-queue",
        operation=BuildIdOpPromoteSetByBuildId("v1.0.0")
    )

    print("✓ Rolled back to BLUE (v1.0.0)")

Kubernetes Setup:

# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker-blue
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: temporal-worker
        color: blue
        version: v1.0.0
    spec:
      containers:
      - name: worker
        image: myregistry/temporal-worker:v1.0.0
        env:
        - name: BUILD_ID
          value: "v1.0.0"

---

# Green deployment (new version, ready for cutover)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker-green
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: temporal-worker
        color: green
        version: v2.0.0
    spec:
      containers:
      - name: worker
        image: myregistry/temporal-worker:v2.0.0
        env:
        - name: BUILD_ID
          value: "v2.0.0"

Deployment Procedure:

# 1. Deploy GREEN alongside BLUE
kubectl apply -f worker-green.yaml

# 2. Verify GREEN health
kubectl get pods -l color=green
kubectl logs -l color=green --tail=100

# 3. Run smoke tests on GREEN
python scripts/smoke_test.py --build-id v2.0.0

# 4. Cutover traffic to GREEN
python scripts/cutover.py --to green

# 5. Monitor GREEN for issues
# ... wait 1-2 hours ...

# 6. If all good, decommission BLUE
kubectl delete deployment temporal-worker-blue

# 7. If issues, instant rollback
python scripts/cutover.py --to blue
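
Das oben referenzierte Skript scripts/cutover.py ist hier nicht abgedruckt; eine mögliche Minimal-Variante ist ein dünner CLI-Wrapper um die weiter oben gezeigten Funktionen cutover_to_green() und rollback_to_blue() (Skizze; das Modul "deployment" ist eine Annahme für den Ablageort dieser Funktionen):

# scripts/cutover.py (Skizze)
import argparse
import asyncio

# Annahme: die oben definierten Funktionen liegen in einem Modul "deployment"
from deployment import cutover_to_green, rollback_to_blue

async def main() -> None:
    parser = argparse.ArgumentParser(description="Blue-Green Cutover")
    parser.add_argument("--to", choices=["green", "blue"], required=True)
    args = parser.parse_args()

    if args.to == "green":
        await cutover_to_green()
    else:
        await rollback_to_blue()

if __name__ == "__main__":
    asyncio.run(main())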

10.1.4 Canary Deployment

Pattern: Die neue Version erhält zunächst nur einen kleinen Prozentsatz des Traffics (z.B. 5%) und wird dann schrittweise hochskaliert.

Vorteile:

  • ✅ Minimal Risk (nur 5% betroffen)
  • ✅ Frühe Fehler-Erkennung
  • ✅ Gradual Rollout
  • ✅ A/B Testing möglich

Nachteile:

  • ⚠️ Komplexere Observability
  • ⚠️ Längerer Deployment-Prozess
  • ⚠️ Requires Traffic Splitting

Flow:

graph TD
    A[100% v1.0] --> B[95% v1.0 + 5% v2.0<br/>Canary]
    B --> C{Metrics OK?}
    C -->|Yes| D[80% v1.0 + 20% v2.0]
    C -->|No| E[Rollback to 100% v1.0]
    D --> F[50% v1.0 + 50% v2.0]
    F --> G[100% v2.0]

    style E fill:#ffe1e1
    style G fill:#d4f1d4

Implementation mit Worker Versioning:

"""
Canary Deployment mit schrittweisem Rollout
"""

from temporalio.client import Client
import asyncio

async def canary_rollout():
    """Gradual canary rollout: 5% → 20% → 50% → 100%"""

    client = await Client.connect("localhost:7233")

    stages = [
        {"canary_pct": 5, "wait_minutes": 30},
        {"canary_pct": 20, "wait_minutes": 60},
        {"canary_pct": 50, "wait_minutes": 120},
        {"canary_pct": 100, "wait_minutes": 0},
    ]

    for stage in stages:
        pct = stage["canary_pct"]
        wait = stage["wait_minutes"]

        print(f"\n🚀 Stage: {pct}% canary traffic to v2.0.0")

        # Adjust worker replicas based on percentage
        canary_replicas = max(1, int(10 * pct / 100))
        stable_replicas = 10 - canary_replicas

        # Scale workers (using kubectl or k8s API)
        await scale_workers("blue", stable_replicas)
        await scale_workers("green", canary_replicas)

        print(f"  - Blue (v1.0.0): {stable_replicas} replicas")
        print(f"  - Green (v2.0.0): {canary_replicas} replicas")

        if wait > 0:
            print(f"⏳ Monitoring for {wait} minutes...")
            await asyncio.sleep(wait * 60)

            # Check metrics
            metrics = await check_canary_metrics()

            if not metrics["healthy"]:
                print("❌ Canary metrics unhealthy, rolling back!")
                await rollback()
                return

            print("✅ Canary metrics healthy, continuing rollout")

    print("\n🎉 Canary rollout completed successfully!")
    print("   100% traffic now on v2.0.0")

async def check_canary_metrics() -> dict:
    """Check if canary version is healthy"""
    # Check:
    # - Error rate
    # - Latency p99
    # - Success rate
    # - Custom business metrics

    return {
        "healthy": True,
        "error_rate": 0.01,
        "latency_p99": 450,
        "success_rate": 99.9
    }
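
Die im Rollout-Skript verwendeten Helfer scale_workers() und rollback() sind oben nicht ausprogrammiert. Eine einfache Variante ruft kubectl per subprocess auf (Skizze; die Deployment-Namen temporal-worker-blue/-green sind wie im Blue-Green-Setup angenommen):

import asyncio

async def scale_workers(color: str, replicas: int) -> None:
    """Skaliert das Worker-Deployment der angegebenen Farbe via kubectl."""
    deployment = f"temporal-worker-{color}"
    proc = await asyncio.create_subprocess_exec(
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}",
    )
    returncode = await proc.wait()
    if returncode != 0:
        raise RuntimeError(f"kubectl scale failed for {deployment}")

async def rollback() -> None:
    """Rollback: gesamter Traffic zurück auf die stabile (blue) Version."""
    await scale_workers("green", 0)
    await scale_workers("blue", 10)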

Kubernetes + Argo Rollouts:

# Canary with Argo Rollouts (advanced)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: temporal-worker
spec:
  replicas: 10

  strategy:
    canary:
      steps:
      - setWeight: 5      # 5% canary
      - pause: {duration: 30m}

      - setWeight: 20     # 20% canary
      - pause: {duration: 1h}

      - setWeight: 50     # 50% canary
      - pause: {duration: 2h}

      - setWeight: 100    # Full rollout

      # Auto-rollback on metrics failure
      analysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: temporal-worker

  template:
    spec:
      containers:
      - name: worker
        image: myregistry/temporal-worker:v2.0.0

10.1.5 Deployment Strategy Decision Matrix

| Faktor | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Rollback Speed | Slow | Instant | Fast |
| Resource Cost | Low | High (2x) | Medium |
| Risk | Medium | Low | Very Low |
| Best For | Routine updates, Small teams | Critical releases, Need instant rollback | High-risk changes, A/B testing |
| Temporal Feature | Standard | Worker Versioning | Worker Versioning |

Empfehlung:

def choose_deployment_strategy(change_type: str) -> str:
    """Decision tree for deployment strategy"""

    if change_type == "hotfix":
        return "rolling"  # Fast, simple

    elif change_type == "major_release":
        return "blue-green"  # Safe, instant rollback

    elif change_type == "experimental_feature":
        return "canary"  # Gradual, low risk

    elif change_type == "routine_update":
        return "rolling"  # Standard, cost-effective

    else:
        return "canary"  # When in doubt, go safe

10.2 Temporal Server Deployment

10.2.1 Temporal Cloud vs Self-Hosted

Decision Matrix:

| Faktor | Temporal Cloud | Self-Hosted |
|---|---|---|
| Setup Time | Minutes | Days/Weeks |
| Operational Overhead | None | High |
| Cost | Pay-per-use | Infrastructure + Team |
| Control | Limited | Full |
| Compliance | SOC2, HIPAA | Your responsibility |
| Customization | Limited | Unlimited |
| Scaling | Automatic | Manual |
| Best For | Startups, Focus on business logic, Fast time-to-market | Enterprise, Strict compliance, Full control needs |

Temporal Cloud:

"""
Connecting to Temporal Cloud
"""

from temporalio.client import Client, TLSConfig

async def connect_to_cloud():
    """Connect to Temporal Cloud"""

    client = await Client.connect(
        # Your Temporal Cloud namespace
        target_host="my-namespace.tmprl.cloud:7233",

        # Namespace
        namespace="my-namespace.account-id",

        # TLS configuration (required for Cloud)
        tls=TLSConfig(
            client_cert=open("client-cert.pem", "rb").read(),
            client_private_key=open("client-key.pem", "rb").read(),
        )
    )

    return client

# Usage
client = await connect_to_cloud()

Pros:

  • ✅ No infrastructure management
  • ✅ Automatic scaling
  • ✅ Built-in monitoring
  • ✅ Multi-region support
  • ✅ 99.99% SLA

Cons:

  • ❌ Less control over configuration
  • ❌ Pay-per-action pricing
  • ❌ Vendor lock-in

10.2.2 Self-Hosted: Docker Compose

For: Development, small deployments

# docker-compose.yml
version: '3.8'

services:
  # PostgreSQL (persistence)
  postgresql:
    image: postgres:13
    environment:
      POSTGRES_PASSWORD: temporal
      POSTGRES_USER: temporal
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - temporal-network

  # Temporal Server
  temporal:
    image: temporalio/auto-setup:latest
    depends_on:
      - postgresql
    environment:
      - DB=postgresql
      - DB_PORT=5432
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
      - POSTGRES_SEEDS=postgresql
      - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development-sql.yaml
    ports:
      - "7233:7233"  # gRPC
      - "8233:8233"  # HTTP
    volumes:
      - ./dynamicconfig:/etc/temporal/config/dynamicconfig
    networks:
      - temporal-network

  # Temporal Web UI
  temporal-ui:
    image: temporalio/ui:latest
    depends_on:
      - temporal
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
    ports:
      - "8080:8080"
    networks:
      - temporal-network

volumes:
  postgres-data:

networks:
  temporal-network:
    driver: bridge

Start:

docker-compose up -d

Pros:

  • ✅ Simple setup
  • ✅ Good for dev/test
  • ✅ All-in-one

Cons:

  • ❌ Not production-grade
  • ❌ Single point of failure
  • ❌ No HA

10.2.3 Self-Hosted: Kubernetes (Production)

For: Production, high availability

Architecture:

graph TB
    subgraph "Temporal Cluster"
        Frontend[Frontend Service<br/>gRPC API]
        History[History Service<br/>Workflow Orchestration]
        Matching[Matching Service<br/>Task Queue Management]
        Worker_Service[Worker Service<br/>Background Jobs]
    end

    subgraph "Persistence"
        DB[(PostgreSQL/Cassandra<br/>Workflow State)]
        ES[(Elasticsearch<br/>Visibility)]
    end

    subgraph "Workers"
        W1[Worker Pod 1]
        W2[Worker Pod 2]
        W3[Worker Pod N]
    end

    Frontend --> DB
    History --> DB
    History --> ES
    Matching --> DB
    Worker_Service --> DB

    W1 --> Frontend
    W2 --> Frontend
    W3 --> Frontend

Helm Chart Deployment:

# 1. Add Temporal Helm repo
helm repo add temporalio https://go.temporal.io/helm-charts
helm repo update

# 2. Create namespace
kubectl create namespace temporal

# 3. Install with custom values
helm install temporal temporalio/temporal \
  --namespace temporal \
  --values temporal-values.yaml

temporal-values.yaml:

# Temporal Server configuration for production

# High Availability: Multiple replicas
server:
  replicaCount: 3

  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi

  # Frontend service
  frontend:
    replicaCount: 3
    service:
      type: LoadBalancer
      port: 7233

  # History service (most critical)
  history:
    replicaCount: 5
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi

  # Matching service
  matching:
    replicaCount: 3

  # Worker service
  worker:
    replicaCount: 2

# PostgreSQL (persistence)
postgresql:
  enabled: true
  persistence:
    enabled: true
    size: 100Gi
    storageClass: "fast-ssd"

  # HA setup
  replication:
    enabled: true
    readReplicas: 2

  resources:
    requests:
      cpu: 2000m
      memory: 8Gi

# Elasticsearch (visibility)
elasticsearch:
  enabled: true
  replicas: 3
  minimumMasterNodes: 2

  volumeClaimTemplate:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 100Gi

# Prometheus metrics
prometheus:
  enabled: true

# Grafana dashboards
grafana:
  enabled: true

Verify Installation:

# Check pods
kubectl get pods -n temporal

# Expected output:
# NAME                                   READY   STATUS    RESTARTS   AGE
# temporal-frontend-xxxxx                1/1     Running   0          5m
# temporal-history-xxxxx                 1/1     Running   0          5m
# temporal-matching-xxxxx                1/1     Running   0          5m
# temporal-worker-xxxxx                  1/1     Running   0          5m
# temporal-postgresql-0                  1/1     Running   0          5m
# temporal-elasticsearch-0               1/1     Running   0          5m

# Check services
kubectl get svc -n temporal

# Port-forward to access UI
kubectl port-forward -n temporal svc/temporal-frontend 7233:7233
kubectl port-forward -n temporal svc/temporal-web 8080:8080

10.2.4 High Availability Setup

Multi-Region Deployment:

graph TB
    subgraph "Region US-East"
        US_LB[Load Balancer]
        US_T1[Temporal Cluster]
        US_DB[(PostgreSQL Primary)]
    end

    subgraph "Region EU-West"
        EU_LB[Load Balancer]
        EU_T1[Temporal Cluster]
        EU_DB[(PostgreSQL Replica)]
    end

    subgraph "Region AP-South"
        AP_LB[Load Balancer]
        AP_T1[Temporal Cluster]
        AP_DB[(PostgreSQL Replica)]
    end

    US_DB -.Replication.-> EU_DB
    US_DB -.Replication.-> AP_DB

    Global_DNS[Global DNS/CDN] --> US_LB
    Global_DNS --> EU_LB
    Global_DNS --> AP_LB

HA Checklist:

Infrastructure:

  • Multiple availability zones
  • Database replication (PostgreSQL streaming)
  • Load balancer health checks
  • Auto-scaling groups
  • Network redundancy

Temporal Server:

  • Frontend: ≥3 replicas
  • History: ≥5 replicas (most critical)
  • Matching: ≥3 replicas
  • Worker: ≥2 replicas

Database:

  • PostgreSQL with streaming replication
  • Automated backups (daily)
  • Point-in-time recovery enabled
  • Separate disk for WAL logs
  • Connection pooling (PgBouncer)

Monitoring:

  • Prometheus + Grafana
  • Alert on service degradation
  • Dashboard for all services
  • Log aggregation (ELK/Loki)

10.2.5 Disaster Recovery

Backup Strategy:

# Automated PostgreSQL backup script

#!/bin/bash
# backup-temporal-db.sh

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/temporal"
DB_NAME="temporal"
DB_USER="temporal"

# Full backup
pg_dump -U $DB_USER -d $DB_NAME -F c -b -v \
  -f $BACKUP_DIR/temporal_$TIMESTAMP.dump

# Compress
gzip $BACKUP_DIR/temporal_$TIMESTAMP.dump

# Upload to S3
aws s3 cp $BACKUP_DIR/temporal_$TIMESTAMP.dump.gz \
  s3://my-temporal-backups/daily/

# Cleanup old backups (keep 30 days)
find $BACKUP_DIR -name "*.dump.gz" -mtime +30 -delete

echo "Backup completed: temporal_$TIMESTAMP.dump.gz"

Cron Schedule:

# Daily backup at 2 AM
0 2 * * * /scripts/backup-temporal-db.sh >> /var/log/temporal-backup.log 2>&1

# Hourly incremental backup (WAL archiving)
0 * * * * /scripts/archive-wal.sh >> /var/log/wal-archive.log 2>&1

Restore Procedure:

#!/bin/bash
# restore-temporal-db.sh

BACKUP_FILE=$1

# Download from S3
aws s3 cp s3://my-temporal-backups/daily/$BACKUP_FILE .

# Decompress
gunzip $BACKUP_FILE

# Restore
pg_restore -U temporal -d temporal -c -v ${BACKUP_FILE%.gz}

echo "Restore completed from $BACKUP_FILE"

DR Runbook:

Disaster Recovery Runbook

Scenario 1: Database Corruption

  1. Stop Temporal services

    kubectl scale deployment temporal-frontend --replicas=0 -n temporal
    kubectl scale deployment temporal-history --replicas=0 -n temporal
    kubectl scale deployment temporal-matching --replicas=0 -n temporal

  2. Restore from latest backup

    ./restore-temporal-db.sh temporal_20250118_020000.dump.gz

  3. Verify database integrity

    psql -U temporal -d temporal -c "SELECT COUNT(*) FROM executions;"

  4. Restart services

    kubectl scale deployment temporal-frontend --replicas=3 -n temporal
    kubectl scale deployment temporal-history --replicas=5 -n temporal
    kubectl scale deployment temporal-matching --replicas=3 -n temporal

  5. Verify workflows resuming

    temporal workflow list

Scenario 2: Complete Region Failure

  1. Switch DNS to DR region

    aws route53 change-resource-record-sets \
      --hosted-zone-id Z1234567890ABC \
      --change-batch file://failover.json

  2. Promote replica to primary

    kubectl exec -it postgresql-replica-0 -n temporal -- \
      pg_ctl promote

  3. Scale up DR services

    kubectl scale deployment temporal-frontend --replicas=3 -n temporal-dr
    kubectl scale deployment temporal-history --replicas=5 -n temporal-dr

  4. Update worker connections

    # Workers automatically reconnect to new endpoint via DNS

RTO/RPO Targets:

  • RTO (Recovery Time Objective): 15 minutes
  • RPO (Recovery Point Objective): 1 hour (last backup)

10.3 Capacity Planning

10.3.1 Worker Sizing

Factors:

  1. Workflow Throughput

    • Workflows/second
    • Average workflow duration
    • Concurrent workflow limit

  2. Activity Characteristics

    • Average duration
    • CPU/Memory usage
    • External dependencies (API rate limits)

  3. Task Queue Backlog

    • Acceptable lag
    • Peak vs average load

Formula:

"""
Worker Capacity Calculator
"""

from dataclasses import dataclass
from typing import List

@dataclass
class WorkloadProfile:
    """Characterize your workload"""
    workflows_per_second: float
    avg_workflow_duration_sec: float
    activities_per_workflow: int
    avg_activity_duration_sec: float
    activity_cpu_cores: float = 0.1
    activity_memory_mb: float = 256

def calculate_required_workers(profile: WorkloadProfile) -> dict:
    """Calculate required worker capacity"""

    # Concurrent workflows
    concurrent_workflows = profile.workflows_per_second * profile.avg_workflow_duration_sec

    # Concurrent activities (Worst-Case-Annahme: alle Activities eines Workflows
    # sind über die gesamte Workflow-Dauer gleichzeitig aktiv; eine knappere
    # Schätzung wäre workflows_per_second * activities_per_workflow * avg_activity_duration_sec)
    concurrent_activities = (
        concurrent_workflows *
        profile.activities_per_workflow
    )

    # Worker slots (assuming 100 slots per worker)
    slots_per_worker = 100
    required_workers = max(1, int(concurrent_activities / slots_per_worker) + 1)

    # Resource requirements
    total_cpu = concurrent_activities * profile.activity_cpu_cores
    total_memory_gb = (concurrent_activities * profile.activity_memory_mb) / 1024

    return {
        "concurrent_workflows": int(concurrent_workflows),
        "concurrent_activities": int(concurrent_activities),
        "required_workers": required_workers,
        "total_cpu_cores": round(total_cpu, 2),
        "total_memory_gb": round(total_memory_gb, 2),
        "cpu_per_worker": round(total_cpu / required_workers, 2),
        "memory_per_worker_gb": round(total_memory_gb / required_workers, 2)
    }

# Example
profile = WorkloadProfile(
    workflows_per_second=10,
    avg_workflow_duration_sec=300,  # 5 minutes
    activities_per_workflow=5,
    avg_activity_duration_sec=10,
    activity_cpu_cores=0.1,
    activity_memory_mb=256
)

result = calculate_required_workers(profile)

print("Capacity Planning Results:")
print(f"  Concurrent Workflows: {result['concurrent_workflows']}")
print(f"  Concurrent Activities: {result['concurrent_activities']}")
print(f"  Required Workers: {result['required_workers']}")
print(f"  Total CPU: {result['total_cpu_cores']} cores")
print(f"  Total Memory: {result['total_memory_gb']} GB")
print(f"  Per Worker: {result['cpu_per_worker']} CPU, {result['memory_per_worker_gb']} GB RAM")

Output:

Capacity Planning Results:
  Concurrent Workflows: 3000
  Concurrent Activities: 15000
  Required Workers: 151
  Total CPU: 1500.0 cores
  Total Memory: 3750.0 GB
  Per Worker: 9.93 CPU, 24.83 GB RAM

10.3.2 Horizontal Pod Autoscaling

Kubernetes HPA:

# hpa-worker.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: temporal-worker-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temporal-worker

  minReplicas: 5
  maxReplicas: 50

  metrics:
  # Scale based on CPU
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

  # Scale based on Memory
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

  # Custom metric: Task queue backlog
  - type: Pods
    pods:
      metric:
        name: temporal_task_queue_backlog
      target:
        type: AverageValue
        averageValue: "100"

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
      - type: Percent
        value: 50  # Max 50% pods removed at once
        periodSeconds: 60

    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Max 100% pods added at once
        periodSeconds: 15
      - type: Pods
        value: 5  # Max 5 pods added at once
        periodSeconds: 15
      selectPolicy: Max  # Use most aggressive policy

Custom Metrics (Prometheus Adapter):

# prometheus-adapter-config.yaml
rules:
- seriesQuery: 'temporal_task_queue_backlog'
  resources:
    template: <<.Resource>>
  name:
    matches: "^(.*)$"
    as: "temporal_task_queue_backlog"
  metricsQuery: 'avg(temporal_task_queue_backlog{queue="production-queue"})'
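
Die im HPA referenzierte Custom Metric temporal_task_queue_backlog muss von Ihnen selbst exportiert werden. Eine mögliche Skizze mit prometheus_client – die eigentliche Backlog-Ermittlung (get_task_queue_backlog) ist hier nur ein Platzhalter, z.B. auf Basis von "temporal task-queue describe" oder eigener Zählung:

import asyncio
from prometheus_client import Gauge, start_http_server

# Gauge, die der Prometheus Adapter als HPA-Metrik aufgreift
task_queue_backlog = Gauge(
    "temporal_task_queue_backlog",
    "Approximierter Task-Backlog pro Queue",
    ["queue"],
)

async def get_task_queue_backlog(queue: str) -> int:
    """Platzhalter: tatsächlichen Backlog ermitteln (Implementierung projektspezifisch)."""
    return 0

async def export_backlog(queue: str = "production-queue") -> None:
    start_http_server(9091)  # eigener Port für diesen Exporter
    while True:
        backlog = await get_task_queue_backlog(queue)
        task_queue_backlog.labels(queue=queue).set(backlog)
        await asyncio.sleep(15)

if __name__ == "__main__":
    asyncio.run(export_backlog())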

10.3.3 Database Sizing

PostgreSQL Sizing Guidelines:

| Workflows | Storage | CPU | RAM | IOPS |
|---|---|---|---|---|
| 1M active | 100 GB | 4 cores | 16 GB | 3,000 |
| 10M active | 500 GB | 8 cores | 32 GB | 10,000 |
| 100M active | 2 TB | 16 cores | 64 GB | 30,000 |

Storage Growth Estimation:

def estimate_storage_growth(
    workflows_per_day: int,
    avg_events_per_workflow: int,
    avg_event_size_bytes: int = 1024,
    retention_days: int = 90
) -> dict:
    """Estimate PostgreSQL storage requirements"""

    # Total workflows in retention window
    total_workflows = workflows_per_day * retention_days

    # Events
    total_events = total_workflows * avg_events_per_workflow

    # Storage (with overhead)
    raw_storage_gb = (total_events * avg_event_size_bytes) / (1024**3)
    storage_with_overhead_gb = raw_storage_gb * 1.5  # 50% overhead for indexes, etc.

    # Growth per day
    daily_growth_gb = (workflows_per_day * avg_events_per_workflow * avg_event_size_bytes) / (1024**3)

    return {
        "total_workflows": total_workflows,
        "total_events": total_events,
        "storage_required_gb": round(storage_with_overhead_gb, 2),
        "daily_growth_gb": round(daily_growth_gb, 2)
    }

# Example
result = estimate_storage_growth(
    workflows_per_day=100000,
    avg_events_per_workflow=50,
    retention_days=90
)

print(f"Storage required: {result['storage_required_gb']} GB")
print(f"Daily growth: {result['daily_growth_gb']} GB")

10.4 Production Checklist

10.4.1 Pre-Deployment

Code:

  • All tests passing (unit, integration, replay)
  • Workflow versioning implemented (patching or Build IDs)
  • Error handling and retries configured
  • Logging at appropriate levels
  • No secrets in code (use Secret Manager)
  • Code reviewed and approved

Infrastructure:

  • Temporal Server deployed (Cloud or self-hosted)
  • Database configured with replication
  • Backups automated and tested
  • Monitoring and alerting setup
  • Resource limits configured
  • Network policies applied

Security:

  • TLS enabled for all connections
  • mTLS configured (if required)
  • RBAC/authorization configured
  • Secrets encrypted at rest
  • Audit logging enabled
  • Vulnerability scanning completed

Operations:

  • Runbooks created (incident response, DR)
  • On-call rotation scheduled
  • Escalation paths defined
  • SLOs/SLAs documented

10.4.2 Deployment

  • Deploy in off-peak hours
  • Use deployment strategy (Rolling/Blue-Green/Canary)
  • Monitor metrics in real-time
  • Validate with smoke tests
  • Communicate to stakeholders

10.4.3 Post-Deployment

  • Verify all workers healthy
  • Check task queue backlog
  • Monitor error rates
  • Review logs for warnings
  • Confirm workflows completing successfully
  • Update documentation
  • Retrospective (lessons learned)

10.5 Zusammenfassung

Wichtigste Konzepte

  1. Graceful Shutdown

    • Workers müssen laufende Activities abschließen
    • graceful_shutdown_timeout > längste Activity
    • Kubernetes terminationGracePeriodSeconds entsprechend setzen
  2. Deployment Strategies

    • Rolling: Standard, kostengünstig, moderate Risk
    • Blue-Green: Instant Rollback, höhere Kosten
    • Canary: Minimales Risk, schrittweise Rollout
  3. Temporal Server

    • Cloud: Einfach, managed, pay-per-use
    • Self-Hosted: Volle Kontrolle, höherer Aufwand
    • HA Setup: Multi-AZ, Replikation, Load Balancing
  4. Capacity Planning

    • Worker-Sizing basierend auf Workload-Profil
    • Horizontal Pod Autoscaling für elastische Kapazität
    • Database-Sizing für Storage und IOPS
  5. Production Readiness

    • Comprehensive Checklisten
    • Automated Backups & DR
    • Monitoring & Alerting (Kapitel 11)

Best Practices

DO:

  • Implement graceful shutdown
  • Use deployment strategies (not ad-hoc restarts)
  • Automate capacity planning
  • Test DR procedures regularly
  • Monitor all the things (Kapitel 11)

DON’T:

  • Kill workers abruptly (SIGKILL)
  • Deploy without versioning
  • Skip capacity planning
  • Ignore backup testing
  • Deploy without monitoring

Nächste Schritte

  • Kapitel 11: Monitoring und Observability – Wie Sie Production-Workflows überwachen
  • Kapitel 12: Testing Strategies – Comprehensive testing für Temporal
  • Kapitel 13: Best Practices und Anti-Muster – Production-ready Temporal-Anwendungen

Weiterführende Ressourcen


⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 11: Monitoring und Observability

Code-Beispiele für dieses Kapitel: examples/part-04/chapter-10/

Praxis-Tipp: Beginnen Sie mit Temporal Cloud für schnellen Start. Wenn Sie spezifische Compliance- oder Kosten-Anforderungen haben, evaluieren Sie Self-Hosted. Unabhängig davon: Implementieren Sie von Anfang an graceful shutdown und Deployment-Strategien!

Kapitel 11: Monitoring und Observability

Einleitung

Sie haben Temporal in Production deployed, Workers laufen, Workflows werden ausgeführt. Alles scheint gut zu funktionieren. Bis plötzlich:

  • Workflows verzögern sich ohne erkennbaren Grund
  • Activities schlagen häufiger fehl als erwartet
  • Task Queues füllen sich auf
  • Die Business-Logik funktioniert nicht mehr wie gewünscht

Ohne Monitoring sind Sie blind. Sie merken Probleme erst, wenn Kunden sich beschweren. Sie haben keine Ahnung, wo das Problem liegt. Debugging wird zum Rätselraten.

Mit richtigem Monitoring und Observability sehen Sie:

  • Wie viele Workflows gerade laufen
  • Wo Bottlenecks sind
  • Welche Activities am längsten dauern
  • Ob Worker überlastet sind
  • Wann Probleme begannen und warum

Temporal bietet umfassende Observability-Features, aber sie müssen richtig konfiguriert und genutzt werden.

Das Grundproblem

Scenario: Sie betreiben einen Order Processing Service mit Temporal:

@workflow.defn
class OrderWorkflow:
    async def run(self, order_id: str) -> str:
        # 10+ Activities: payment, inventory, shipping, notifications, etc.
        payment = await workflow.execute_activity(process_payment, ...)
        inventory = await workflow.execute_activity(check_inventory, ...)
        shipping = await workflow.execute_activity(create_shipment, ...)
        # ... more activities

Plötzlich: Kunden berichten, dass Orders langsamer verarbeitet werden. Von 2 Minuten auf 10+ Minuten.

Ohne Monitoring:

❓ Welche Activity ist langsam?
❓ Ist es ein spezifischer Worker?
❓ Ist die Datenbank überlastet?
❓ Sind externe APIs langsam?
❓ Gibt es ein Deployment-Problem?

→ Stunden mit Guesswork verbringen
→ Logs manuell durchsuchen
→ Code instrumentieren und neu deployen

Mit Monitoring & Observability:

✓ Grafana Dashboard öffnen
✓ "process_payment" Activity latency: 9 Minuten (normal: 30s)
✓ Trace zeigt: Payment API antwortet nicht
✓ Logs zeigen: Connection timeouts zu payment.api.com
✓ Alert wurde bereits ausgelöst

→ Problem in 2 Minuten identifiziert
→ Payment Service Team kontaktieren
→ Fallback-Lösung aktivieren

Die drei Säulen der Observability

1. Metrics (Was passiert?)

  • Workflow execution rate
  • Activity success/failure rates
  • Queue depth
  • Worker utilization
  • Latency percentiles (p50, p95, p99)

2. Logs (Warum passiert es?)

  • Structured logging in Workflows/Activities
  • Correlation mit Workflow/Activity IDs
  • Error messages und stack traces
  • Business-relevante Events

3. Traces (Wie fließen Requests?)

  • End-to-end Workflow execution traces
  • Activity spans
  • Distributed tracing über Service-Grenzen
  • Bottleneck-Identifikation

Lernziele

Nach diesem Kapitel können Sie:

  • SDK Metrics mit Prometheus exportieren und monitoren
  • Temporal Cloud/Server Metrics nutzen
  • Grafana Dashboards für Temporal erstellen und nutzen
  • OpenTelemetry für Distributed Tracing integrieren
  • Strukturierte Logs mit Correlation implementieren
  • SLO-basiertes Alerting für kritische Workflows aufsetzen
  • Debugging mit Observability-Tools durchführen

11.1 SDK Metrics mit Prometheus

11.1.1 Warum SDK Metrics?

Temporal bietet zwei Arten von Metrics:

| Metric Source | Perspektive | Was wird gemessen? |
|---|---|---|
| SDK Metrics | Client/Worker | Ihre Application-Performance |
| Server Metrics | Temporal Service | Temporal Infrastructure Health |

Für Application Monitoring → SDK Metrics sind die Source of Truth!

SDK Metrics zeigen:

  • Activity execution time aus Sicht Ihrer Worker
  • Workflow execution success rate Ihrer Workflows
  • Task Queue lag Ihrer Queues
  • Worker resource usage Ihrer Deployments

11.1.2 Prometheus Setup für Python SDK

Schritt 1: Dependencies

# requirements.txt
temporalio>=1.5.0
prometheus-client>=0.19.0

Schritt 2: Prometheus Exporter in Worker

"""
Worker mit Prometheus Metrics Export

Chapter: 11 - Monitoring und Observability
"""

import asyncio
import logging

from temporalio.client import Client
from temporalio.worker import Worker
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig

logger = logging.getLogger(__name__)


class MonitoredWorker:
    """Worker mit Prometheus Metrics"""

    def __init__(
        self,
        temporal_host: str,
        task_queue: str,
        workflows: list,
        activities: list,
        metrics_port: int = 9090
    ):
        self.temporal_host = temporal_host
        self.task_queue = task_queue
        self.workflows = workflows
        self.activities = activities
        self.metrics_port = metrics_port

    async def run(self):
        """Start worker mit Prometheus metrics export"""

        # 1. Runtime mit Prometheus-Exporter erstellen
        #    (der SDK-Core startet dafür selbst einen HTTP-Server mit /metrics)
        runtime = self._create_runtime_with_metrics()

        # 2. Temporal Client mit dieser Runtime verbinden
        client = await Client.connect(
            self.temporal_host,
            runtime=runtime
        )
        logger.info(f"✓ Prometheus metrics exposed on :{self.metrics_port}/metrics")

        # 3. Worker starten – alle SDK Metrics laufen über die Runtime
        async with Worker(
            client,
            task_queue=self.task_queue,
            workflows=self.workflows,
            activities=self.activities
        ):
            logger.info(f"✓ Worker started on queue: {self.task_queue}")
            await asyncio.Event().wait()  # Run forever

    def _create_runtime_with_metrics(self) -> Runtime:
        """Runtime mit Prometheus Metrics Export konfigurieren"""
        return Runtime(telemetry=TelemetryConfig(
            metrics=PrometheusConfig(
                # SDK-Core exponiert die Metrics direkt auf diesem Port
                bind_address=f"0.0.0.0:{self.metrics_port}"
            )
        ))


# Verwendung
if __name__ == "__main__":
    from my_workflows import OrderWorkflow
    from my_activities import process_payment, check_inventory

    worker = MonitoredWorker(
        temporal_host="localhost:7233",
        task_queue="order-processing",
        workflows=[OrderWorkflow],
        activities=[process_payment, check_inventory],
        metrics_port=9090
    )

    asyncio.run(worker.run())

Schritt 3: Metrics abrufen

# Metrics endpoint testen
curl http://localhost:9090/metrics

# Ausgabe (Beispiel):
# temporal_workflow_task_execution_count{namespace="default",task_queue="order-processing"} 142
# temporal_activity_execution_count{activity_type="process_payment"} 89
# temporal_activity_execution_latency_seconds_sum{activity_type="process_payment"} 45.2
# temporal_worker_task_slots_available{task_queue="order-processing"} 98
# ...

11.1.3 Prometheus Scrape Configuration

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Temporal Workers
  - job_name: 'temporal-workers'
    static_configs:
      - targets:
          - 'worker-1:9090'
          - 'worker-2:9090'
          - 'worker-3:9090'
    # Labels für besseres Filtering
    relabel_configs:
      - source_labels: [__address__]
        regex: 'worker-(\d+):.*'
        target_label: 'worker_id'
        replacement: '$1'

  # Temporal Server (self-hosted)
  - job_name: 'temporal-server'
    static_configs:
      - targets:
          - 'temporal-frontend:9090'
          - 'temporal-history:9090'
          - 'temporal-matching:9090'
          - 'temporal-worker:9090'

  # Temporal Cloud (via Prometheus API)
  - job_name: 'temporal-cloud'
    scheme: https
    static_configs:
      - targets:
          - 'cloud-metrics.temporal.io'
    authorization:
      credentials: '<YOUR_TEMPORAL_CLOUD_API_KEY>'
    params:
      namespace: ['your-namespace.account']

Kubernetes Service Discovery (fortgeschritten):

scrape_configs:
  - job_name: 'temporal-workers-k8s'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Nur Pods mit Label app=temporal-worker
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: temporal-worker
      # Port 9090 targeten
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: '$1:9090'
      # Labels hinzufügen
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace

11.1.4 Wichtige SDK Metrics

Workflow Metrics:

# Workflow Execution Rate
rate(temporal_workflow_task_execution_count[5m])

# Workflow Success Rate
rate(temporal_workflow_completed_count{status="completed"}[5m])
  /
rate(temporal_workflow_completed_count[5m])

# Workflow Latency (p95)
histogram_quantile(0.95,
  rate(temporal_workflow_execution_latency_seconds_bucket[5m])
)

Activity Metrics:

# Activity Execution Rate by Type
rate(temporal_activity_execution_count[5m]) by (activity_type)

# Activity Failure Rate
rate(temporal_activity_execution_failed_count[5m]) by (activity_type)

# Activity Latency by Type
histogram_quantile(0.95,
  rate(temporal_activity_execution_latency_seconds_bucket[5m])
) by (activity_type)

# Slowest Activities (Top 5)
topk(5,
  avg(rate(temporal_activity_execution_latency_seconds_sum[5m]))
  by (activity_type)
)

Worker Metrics:

# Task Slots Used vs Available
temporal_worker_task_slots_used / temporal_worker_task_slots_available

# Task Queue Lag (Backlog)
temporal_task_queue_lag_seconds

# Worker Poll Success Rate
rate(temporal_worker_poll_success_count[5m])
  /
rate(temporal_worker_poll_count[5m])

11.1.5 Custom Business Metrics

Problem: SDK Metrics zeigen technische Metriken, aber nicht Ihre Business KPIs.

Lösung: Custom Metrics in Activities exportieren.

"""
Custom Business Metrics in Activities
"""

import time

from temporalio import activity
from prometheus_client import Counter, Histogram, Gauge

# Custom Metrics
orders_processed = Counter(
    'orders_processed_total',
    'Total orders processed',
    ['status', 'payment_method']
)

order_value = Histogram(
    'order_value_usd',
    'Order value in USD',
    buckets=[10, 50, 100, 500, 1000, 5000]
)

payment_latency = Histogram(
    'payment_processing_seconds',
    'Payment processing time',
    ['payment_provider']
)

active_orders = Gauge(
    'active_orders',
    'Currently processing orders'
)


@activity.defn
async def process_order(order_id: str, amount: float, payment_method: str) -> str:
    """Process order mit custom metrics"""

    # Gauge: Active orders erhöhen
    active_orders.inc()

    try:
        # Business-Logic
        start = time.time()
        payment_result = await process_payment(amount, payment_method)
        latency = time.time() - start

        # Metrics erfassen
        payment_latency.labels(payment_provider=payment_method).observe(latency)
        order_value.observe(amount)
        orders_processed.labels(
            status='completed',
            payment_method=payment_method
        ).inc()

        return f"Order {order_id} completed"

    except Exception as e:
        orders_processed.labels(
            status='failed',
            payment_method=payment_method
        ).inc()
        raise

    finally:
        # Gauge: Active orders reduzieren
        active_orders.dec()

PromQL Queries für Business Metrics:

# Revenue per Hour
sum(rate(order_value_usd_sum[1h]))

# Orders per Minute by Payment Method
sum(rate(orders_processed_total[1m])) by (payment_method)

# Payment Provider Performance
histogram_quantile(0.95,
  rate(payment_processing_seconds_bucket[5m])
) by (payment_provider)

# Success Rate by Payment Method
sum(rate(orders_processed_total{status="completed"}[5m])) by (payment_method)
  /
sum(rate(orders_processed_total[5m])) by (payment_method)

11.2 Grafana Dashboards

11.2.1 Grafana Setup

Docker Compose Setup (Development):

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data:

Grafana Datasource Provisioning:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

11.2.2 Community Dashboards

Temporal stellt Community Grafana Dashboards bereit:

Installation:

# Dashboard JSON herunterladen
curl -O https://raw.githubusercontent.com/temporalio/dashboards/main/grafana/temporal-sdk.json

# In Grafana importieren:
# Dashboards > Import > Upload JSON file
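
Der Import lässt sich auch skripten, z.B. über die Grafana HTTP API (Skizze; GRAFANA_URL und API-Token sind anzupassen, und es wird angenommen, dass die Datei das Dashboard-Objekt selbst enthält):

import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"   # anzupassen
API_TOKEN = "<GRAFANA_API_TOKEN>"       # Service-Account- bzw. API-Token

def import_dashboard(path: str) -> None:
    """Importiert ein Dashboard-JSON über POST /api/dashboards/db."""
    with open(path) as f:
        dashboard = json.load(f)  # Annahme: Datei enthält das Dashboard-Objekt

    payload = json.dumps({
        "dashboard": dashboard,
        "overwrite": True,
    }).encode()

    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print("Import status:", resp.status)

import_dashboard("temporal-sdk.json")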

Verfügbare Dashboards:

  1. Temporal SDK Overview

    • Workflow execution rates
    • Activity success/failure rates
    • Worker health
    • Task queue metrics
  2. Temporal Server

    • Service health (Frontend, History, Matching, Worker)
    • Request rates und latency
    • Database performance
    • Resource usage
  3. Temporal Cloud

    • Namespace metrics
    • API request rates
    • Workflow execution trends
    • Billing metrics

11.2.3 Custom Dashboard erstellen

Panel 1: Workflow Execution Rate

{
  "title": "Workflow Execution Rate",
  "targets": [{
    "expr": "rate(temporal_workflow_task_execution_count{namespace=\"$namespace\"}[5m])",
    "legendFormat": "{{task_queue}}"
  }],
  "type": "graph"
}

Panel 2: Activity Latency Heatmap

{
  "title": "Activity Latency Distribution",
  "targets": [{
    "expr": "rate(temporal_activity_execution_latency_seconds_bucket{activity_type=\"$activity\"}[5m])",
    "format": "heatmap"
  }],
  "type": "heatmap",
  "yAxis": { "format": "s" }
}

Panel 3: Worker Task Slots

{
  "title": "Worker Task Slots",
  "targets": [
    {
      "expr": "temporal_worker_task_slots_available",
      "legendFormat": "Available - {{worker_id}}"
    },
    {
      "expr": "temporal_worker_task_slots_used",
      "legendFormat": "Used - {{worker_id}}"
    }
  ],
  "type": "graph",
  "stack": true
}

Panel 4: Top Slowest Activities

{
  "title": "Top 10 Slowest Activities",
  "targets": [{
    "expr": "topk(10, avg(rate(temporal_activity_execution_latency_seconds_sum[5m])) by (activity_type))",
    "legendFormat": "{{activity_type}}",
    "instant": true
  }],
  "type": "table"
}

Complete Dashboard Example (kompakt):

{
  "dashboard": {
    "title": "Temporal - Order Processing",
    "timezone": "browser",
    "templating": {
      "list": [
        {
          "name": "namespace",
          "type": "query",
          "query": "label_values(temporal_workflow_task_execution_count, namespace)"
        },
        {
          "name": "task_queue",
          "type": "query",
          "query": "label_values(temporal_workflow_task_execution_count{namespace=\"$namespace\"}, task_queue)"
        }
      ]
    },
    "panels": [
      {
        "title": "Workflow Execution Rate",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{
          "expr": "rate(temporal_workflow_task_execution_count{namespace=\"$namespace\",task_queue=\"$task_queue\"}[5m])"
        }]
      },
      {
        "title": "Activity Success Rate",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [{
          "expr": "rate(temporal_activity_execution_count{status=\"completed\"}[5m]) / rate(temporal_activity_execution_count[5m])"
        }]
      },
      {
        "title": "Task Queue Lag",
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
        "targets": [{
          "expr": "temporal_task_queue_lag_seconds{task_queue=\"$task_queue\"}"
        }]
      }
    ]
  }
}

11.2.4 Alerting in Grafana

Alert 1: High Workflow Failure Rate

# Alert Definition
- alert: HighWorkflowFailureRate
  expr: |
    (
      rate(temporal_workflow_completed_count{status="failed"}[5m])
      /
      rate(temporal_workflow_completed_count[5m])
    ) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High workflow failure rate"
    description: "{{ $value | humanizePercentage }} of workflows are failing"

Alert 2: Task Queue Backlog

- alert: TaskQueueBacklog
  expr: temporal_task_queue_lag_seconds > 300
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Task queue has significant backlog"
    description: "Task queue {{ $labels.task_queue }} has {{ $value }}s lag"

Alert 3: Worker Unavailable

- alert: WorkerUnavailable
  expr: up{job="temporal-workers"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Worker is down"
    description: "Worker {{ $labels.instance }} is not responding"

Alert 4: Activity Latency Spike

- alert: ActivityLatencySpike
  expr: |
    histogram_quantile(0.95,
      rate(temporal_activity_execution_latency_seconds_bucket[5m])
    ) > 60
  for: 5m
  labels:
    severity: warning
    activity_type: "{{ $labels.activity_type }}"
  annotations:
    summary: "Activity latency is high"
    description: "p95 latency for {{ $labels.activity_type }}: {{ $value }}s"

11.3 OpenTelemetry Integration

11.3.1 Warum OpenTelemetry?

Prometheus + Grafana geben Ihnen Metrics. Aber für Distributed Tracing brauchen Sie mehr:

  • End-to-End Traces: Verfolgen Sie einen Request durch Ihr gesamtes System
  • Spans: Sehen Sie, wie lange jede Activity dauert
  • Context Propagation: Korrelieren Sie Logs, Metrics und Traces
  • Service Dependencies: Visualisieren Sie, wie Services miteinander kommunizieren

Use Case: Ein Workflow ruft 5 verschiedene Microservices auf. Welcher Service verursacht die Latenz?

HTTP Request → API Gateway → Order Workflow
                                  ├─> Payment Service (500ms)
                                  ├─> Inventory Service (200ms)
                                  ├─> Shipping Service (3000ms) ← BOTTLENECK!
                                  ├─> Email Service (100ms)
                                  └─> Analytics Service (50ms)

Mit OpenTelemetry sehen Sie diese gesamte Kette als einen zusammenhängenden Trace.

11.3.2 OpenTelemetry Setup

Dependencies:

pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp

Die Temporal-spezifische Tracing-Integration (TracingInterceptor) ist bereits im Python SDK enthalten (temporalio.contrib.opentelemetry) und benötigt kein zusätzliches Paket.

Tracer Setup:

"""
OpenTelemetry Integration für Temporal

Chapter: 11 - Monitoring und Observability
"""

import asyncio
from datetime import timedelta

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from temporalio.client import Client
from temporalio.worker import Worker
from temporalio import workflow, activity


def setup_telemetry(service_name: str):
    """Setup OpenTelemetry Tracing"""

    # Resource: Identifiziert diesen Service
    resource = Resource.create({
        "service.name": service_name,
        "service.version": "1.0.0",
        "deployment.environment": "production"
    })

    # Tracer Provider
    provider = TracerProvider(resource=resource)

    # OTLP Exporter (zu Tempo, Jaeger, etc.)
    otlp_exporter = OTLPSpanExporter(
        endpoint="http://tempo:4317",
        insecure=True
    )

    # Batch Processor (für Performance)
    span_processor = BatchSpanProcessor(otlp_exporter)
    provider.add_span_processor(span_processor)

    # Global Tracer setzen
    trace.set_tracer_provider(provider)

    return trace.get_tracer(service_name)


# Tracer erstellen
tracer = setup_telemetry("order-service")


@activity.defn
async def process_payment(order_id: str, amount: float) -> dict:
    """Activity mit manual tracing"""

    # Span für diese Activity
    with tracer.start_as_current_span("process_payment") as span:

        # Span Attributes (Metadata)
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)
        span.set_attribute("activity.type", "process_payment")

        # External Service Call tracen
        with tracer.start_as_current_span("call_payment_api") as api_span:
            api_span.set_attribute("http.method", "POST")
            api_span.set_attribute("http.url", "https://payment.api/charge")

            # Simulierter API Call
            await asyncio.sleep(0.5)

            api_span.set_attribute("http.status_code", 200)

        # Span Status
        span.set_status(trace.StatusCode.OK)

        return {
            "success": True,
            "transaction_id": f"txn_{order_id}"
        }


@workflow.defn
class OrderWorkflow:
    """Workflow mit Tracing"""

    @workflow.run
    async def run(self, order_id: str) -> dict:

        # Workflow-Context als Span
        # (automatisch durch Temporal SDK + OpenTelemetry Instrumentation)

        workflow.logger.info(f"Processing order {order_id}")

        # Activities werden automatisch als Child Spans getrackt
        payment = await workflow.execute_activity(
            process_payment,
            args=[order_id, 99.99],
            start_to_close_timeout=timedelta(seconds=30)
        )

        # Weitere Activities...

        return {"status": "completed", "payment": payment}

11.3.3 Automatic Instrumentation

Einfachere Alternative: Der TracingInterceptor aus dem Temporal Python SDK (temporalio.contrib.opentelemetry):

from temporalio.client import Client
from temporalio.contrib.opentelemetry import TracingInterceptor

# Automatische Instrumentation: Interceptor beim Client registrieren.
# Worker, die diesen Client verwenden, übernehmen den Interceptor automatisch.
client = await Client.connect(
    "localhost:7233",
    interceptors=[TracingInterceptor()]
)

Was wird automatisch getrackt:

  • Workflow Start/Complete
  • Activity Execution
  • Task Queue Operations
  • Signals/Queries
  • Child Workflows

11.3.4 Tempo + Grafana Setup

Docker Compose:

version: '3.8'

services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "3200:3200"  # Tempo Query Frontend
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/var/tempo

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml

volumes:
  tempo-data:

tempo.yaml:

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces

query_frontend:
  search:
    enabled: true

grafana-datasources.yaml:

apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    isDefault: false

11.3.5 Trace Visualisierung

In Grafana Explore:

1. Data Source: Tempo
2. Query: trace_id = "abc123..."
3. Visualisierung:

   OrderWorkflow                     [========== 5.2s ==========]
   ├─ process_payment               [=== 0.5s ===]
   │  └─ call_payment_api          [== 0.48s ==]
   ├─ check_inventory               [= 0.2s =]
   ├─ create_shipment              [======== 3.0s ========] ← SLOW!
   ├─ send_confirmation_email      [= 0.1s =]
   └─ update_analytics             [= 0.05s =]

Trace Search Queries:

# Alle Traces für einen Workflow
service.name="order-service" && workflow.type="OrderWorkflow"

# Langsame Traces (> 5s)
service.name="order-service" && duration > 5s

# Fehlerhafte Traces
status=error

# Traces für bestimmte Order
order_id="order-12345"

11.3.6 Correlation: Metrics + Logs + Traces

Das Problem: Metrics zeigen ein Problem, aber Sie brauchen Details.

Lösung: Exemplars + Trace IDs in Logs

Prometheus Exemplars:

import time

from prometheus_client import Histogram
from opentelemetry import trace
from temporalio import activity

# Histogram mit Exemplar Support
activity_latency = Histogram(
    'activity_execution_seconds',
    'Activity execution time'
)

@activity.defn
async def my_activity():
    start = time.time()

    # ... Activity Logic ...

    latency = time.time() - start

    # Metric + Trace ID als Exemplar
    current_span = trace.get_current_span()
    trace_id = current_span.get_span_context().trace_id

    activity_latency.observe(
        latency,
        exemplar={'trace_id': format(trace_id, '032x')}
    )

In Grafana: Click auf Metric Point → Jump zu Trace!

Structured Logging mit Trace Context:

import logging

from opentelemetry import trace
from temporalio import activity

logger = logging.getLogger(__name__)

@activity.defn
async def my_activity(order_id: str):

    # Trace Context extrahieren
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, '032x')
    span_id = format(span.get_span_context().span_id, '016x')

    # Structured Log mit Trace IDs
    logger.info(
        "Processing order",
        extra={
            "order_id": order_id,
            "trace_id": trace_id,
            "span_id": span_id,
            "workflow_id": activity.info().workflow_id,
            "activity_type": activity.info().activity_type
        }
    )

Log Output (JSON):

{
  "timestamp": "2025-01-19T10:30:45Z",
  "level": "INFO",
  "message": "Processing order",
  "order_id": "order-12345",
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "789abc...",
  "workflow_id": "order-workflow-12345",
  "activity_type": "process_payment"
}

In Grafana Loki: Search for trace_id="a1b2c3d4e5f6..." → Alle Logs für diesen Trace!


11.4 Logging Best Practices

11.4.1 Structured Logging Setup

Warum Structured Logging?

Unstructured (schlecht):

logger.info(f"Order {order_id} completed in {duration}s")

Structured (gut):

logger.info("Order completed", extra={
    "order_id": order_id,
    "duration_seconds": duration,
    "status": "success"
})

Vorteile:

  • Suchbar nach Feldern
  • Aggregierbar
  • Maschinenlesbar
  • Integriert mit Observability Tools

Python Setup mit structlog:

import structlog
from temporalio import activity, workflow

# Structlog konfigurieren
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ],
    logger_factory=structlog.PrintLoggerFactory(),
)

logger = structlog.get_logger()


@activity.defn
async def process_order(order_id: str):
    """Activity mit strukturiertem Logging"""

    # Workflow Context hinzufügen
    log = logger.bind(
        workflow_id=activity.info().workflow_id,
        activity_id=activity.info().activity_id,
        activity_type="process_order",
        order_id=order_id
    )

    log.info("activity_started")

    try:
        # Business Logic
        result = await do_something(order_id)

        log.info(
            "activity_completed",
            result=result,
            duration_ms=123
        )

        return result

    except Exception as e:
        log.error(
            "activity_failed",
            error=str(e),
            error_type=type(e).__name__
        )
        raise

Log Output:

{
  "timestamp": "2025-01-19T10:30:45.123456Z",
  "level": "info",
  "event": "activity_started",
  "workflow_id": "order-workflow-abc",
  "activity_id": "activity-xyz",
  "activity_type": "process_order",
  "order_id": "order-12345"
}

{
  "timestamp": "2025-01-19T10:30:45.345678Z",
  "level": "info",
  "event": "activity_completed",
  "workflow_id": "order-workflow-abc",
  "activity_id": "activity-xyz",
  "result": "success",
  "duration_ms": 123,
  "order_id": "order-12345"
}

11.4.2 Temporal Logger Integration

Temporal SDK Logger nutzen:

from temporalio import workflow, activity


@workflow.defn
class MyWorkflow:

    @workflow.run
    async def run(self):
        # Temporal Workflow Logger (automatisch mit Context)
        workflow.logger.info(
            "Workflow started",
            extra={"custom_field": "value"}
        )

        # Logging ist replay-safe!
        # Logs werden nur bei echter Execution ausgegeben


@activity.defn
async def my_activity():
    # Temporal Activity Logger (automatisch mit Context)
    activity.logger.info(
        "Activity started",
        extra={"custom_field": "value"}
    )

Automatischer Context:

Temporal Logger fügen automatisch hinzu:

  • workflow_id
  • workflow_type
  • run_id
  • activity_id
  • activity_type
  • namespace
  • task_queue
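
Damit diese Kontextfelder auch in JSON-Logs sichtbar werden, muss der Formatter die extra-Felder des LogRecords mit ausgeben. Ein minimaler Sketch dafür (Annahme: der Temporal-Kontext liegt als zusätzliche Attribute am LogRecord; die genauen Feldnamen hängen von SDK-Version und Logging-Setup ab, daher gibt der Formatter einfach alle Nicht-Standard-Felder aus):

import json
import logging

# Attribute, die jeder LogRecord standardmäßig besitzt; alles andere ist "extra"
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Gibt Message plus alle extra-Felder eines LogRecords als JSON aus"""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS and not key.startswith("_"):
                payload[key] = value
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)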

11.4.3 Log Aggregation mit Loki

Loki Setup:

# docker-compose.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:

loki-config.yaml:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

promtail-config.yaml:

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: temporal-workers
    static_configs:
      - targets:
          - localhost
        labels:
          job: temporal-workers
          __path__: /var/log/temporal-worker/*.log

    # JSON Log Parsing
    pipeline_stages:
      - json:
          expressions:
            timestamp: timestamp
            level: level
            message: event
            workflow_id: workflow_id
            activity_type: activity_type
      - labels:
          level:
          workflow_id:
          activity_type:
      - timestamp:
          source: timestamp
          format: RFC3339

LogQL Queries in Grafana:

# Alle Logs für einen Workflow
{job="temporal-workers"} | json | workflow_id="order-workflow-abc"

# Fehler-Logs
{job="temporal-workers"} | json | level="error"

# Langsame Activities (> 5s)
{job="temporal-workers"}
  | json
  | duration_ms > 5000
  | line_format "{{.activity_type}}: {{.duration_ms}}ms"

# Rate von Errors
rate({job="temporal-workers"} | json | level="error" [5m])

# Top Activities nach Count
topk(10,
  sum by (activity_type) (
    count_over_time({job="temporal-workers"} | json [1h])
  )
)

11.4.4 Best Practices

DO:

  • ✅ Strukturierte Logs (JSON)
  • ✅ Correlation IDs (workflow_id, trace_id)
  • ✅ Log Level appropriate nutzen (DEBUG, INFO, WARN, ERROR)
  • ✅ Performance-relevante Metrics loggen
  • ✅ Business Events loggen
  • ✅ Fehler mit Context loggen

DON’T:

  • ❌ Sensitive Daten loggen (Passwords, PII, Credit Cards)
  • ❌ Zu viel loggen (Performance-Impact)
  • ❌ Unstrukturierte Logs
  • ❌ Logging in Workflows ohne Replay-Safety

Replay-Safe Logging:

@workflow.defn
class MyWorkflow:

    @workflow.run
    async def run(self):
        # FALSCH: Logging ohne Replay-Check
        print(f"Workflow started at {datetime.now()}")  # ❌ Non-deterministic!

        # RICHTIG: Temporal Logger (replay-safe)
        workflow.logger.info("Workflow started")  # ✅ Nur bei echter Execution

Sensitive Data redaktieren:

import re

def redact_sensitive(data: dict) -> dict:
    """Redact sensitive fields"""
    sensitive_fields = ['password', 'credit_card', 'ssn', 'api_key']

    redacted = data.copy()
    for key in redacted:
        if any(field in key.lower() for field in sensitive_fields):
            redacted[key] = "***REDACTED***"

    return redacted


@activity.defn
async def process_payment(payment_data: dict):
    # Log mit redaktierten Daten
    activity.logger.info(
        "Processing payment",
        extra=redact_sensitive(payment_data)
    )

11.5 SLO-basiertes Alerting

11.5.1 Was sind SLIs, SLOs, SLAs?

SLI (Service Level Indicator): Messgröße für die Service-Qualität

  • Beispiel: “Anteil erfolgreich abgeschlossener Workflows (aktuell gemessen: 99.95%)”

SLO (Service Level Objective): internes Ziel für einen SLI

  • Beispiel: “SLO: 99.9% Workflow Success Rate über 30 Tage”

SLA (Service Level Agreement): vertragliche Vereinbarung mit Konsequenzen

  • Beispiel: “SLA: 99.5% Uptime, sonst finanzielle Kompensation”

Verhältnis: Der SLI ist die Messgröße; SLO und SLA sind Ziele darauf. Das interne SLO sollte strenger sein als das vertragliche SLA, damit ein Puffer bleibt, bevor Vertragsstrafen greifen.

11.5.2 SLIs für Temporal Workflows

Request Success Rate (wichtigster SLI):

# Workflow Success Rate
sum(rate(temporal_workflow_completed_count{status="completed"}[5m]))
  /
sum(rate(temporal_workflow_completed_count[5m]))

Latency (p50, p95, p99):

# Workflow p95 Latency
histogram_quantile(0.95,
  rate(temporal_workflow_execution_latency_seconds_bucket[5m])
)

Availability:

# Worker Availability
avg(up{job="temporal-workers"})

Beispiel SLOs:

| SLI | SLO | Messung |
|-----|-----|---------|
| Workflow Success Rate | ≥ 99.9% | Last 30d |
| Order Workflow p95 Latency | ≤ 5s | Last 1h |
| Worker Availability | ≥ 99.5% | Last 30d |
| Task Queue Lag | ≤ 30s | Last 5m |
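
Diese SLIs lassen sich z.B. als Prometheus Recording Rules vorberechnen, damit Dashboards und Alerts nicht jedes Mal die teuren Roh-Queries ausführen müssen. Ein Sketch (die Rule-Namen sind frei gewählt, das Label workflow_type ist eine Annahme über Ihr Metrik-Labeling):

groups:
  - name: temporal_sli_rules
    rules:
      # Workflow Success Rate (30 Tage)
      - record: sli:temporal_workflow_success_rate:30d
        expr: |
          sum(increase(temporal_workflow_completed_count{status="completed"}[30d]))
            /
          sum(increase(temporal_workflow_completed_count[30d]))

      # Order Workflow p95 Latency (1h)
      - record: sli:order_workflow_latency_p95:1h
        expr: |
          histogram_quantile(0.95,
            sum by (le) (
              rate(temporal_workflow_execution_latency_seconds_bucket{workflow_type="OrderWorkflow"}[1h])
            )
          )

      # Worker Availability (30 Tage)
      - record: sli:worker_availability:30d
        expr: avg(avg_over_time(up{job="temporal-workers"}[30d]))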

11.5.3 Error Budget

Konzept: Wie viel “Failure” ist erlaubt?

Berechnung:

Error Budget = 100% - SLO

Beispiel:

SLO: 99.9% Success Rate
Error Budget: 0.1% = 1 von 1000 Requests darf fehlschlagen

Bei 1M Workflows/Monat:
Error Budget = 1M * 0.001 = 1,000 erlaubte Failures

Error Budget Tracking:

# Verbleibender Error Budget (30d window)
(
  1 - (
    sum(increase(temporal_workflow_completed_count{status="completed"}[30d]))
    /
    sum(increase(temporal_workflow_completed_count[30d]))
  )
) / 0.001  # 0.001 = Error Budget für 99.9% SLO

Interpretation:

Result = 0.5  → 50% Error Budget verbraucht ✅
Result = 0.9  → 90% Error Budget verbraucht ⚠️
Result = 1.2  → 120% Error Budget verbraucht ❌ SLO missed!
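
Zur Einordnung dieser Werte ein kleiner Python-Sketch, der den Error-Budget-Verbrauch direkt aus Erfolgs- und Gesamtzahlen berechnet (reine Beispielrechnung, unabhängig von Prometheus):

def error_budget_consumed(successful: int, total: int, slo: float = 0.999) -> float:
    """Anteil des Error Budgets, der im Betrachtungszeitraum verbraucht wurde"""
    if total == 0:
        return 0.0
    error_rate = 1 - (successful / total)
    error_budget = 1 - slo  # z.B. 0.001 bei 99.9% SLO
    return error_rate / error_budget


# Beispiel: 1.000.000 Workflows, davon 1.200 fehlgeschlagen, SLO 99.9%
print(error_budget_consumed(successful=998_800, total=1_000_000))  # 1.2 → 120% verbraucht, SLO verfehlt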

11.5.4 Multi-Window Multi-Burn-Rate Alerts

Problem mit einfachen Alerts:

# Zu simpel
- alert: HighErrorRate
  expr: error_rate > 0.01
  for: 5m

Probleme:

  • Flapping bei kurzen Spikes
  • Langsame Reaktion bei echten Problemen
  • Keine Unterscheidung: Kurzer Spike vs. anhaltender Ausfall

Lösung: Multi-Window Alerts (aus Google SRE Workbook)

Konzept:

| Severity | Burn Rate | Long Window | Short Window | Alert |
|----------|-----------|-------------|--------------|-------|
| Critical | 14.4x | 1h | 5m | Page immediately |
| High | 6x | 6h | 30m | Page during business hours |
| Medium | 3x | 1d | 2h | Ticket |
| Low | 1x | 3d | 6h | No alert |

Implementation:

groups:
  - name: temporal_slo_alerts
    rules:
      # Critical: 14.4x Burn Rate (Long Window 1h, Short Window 5m)
      - alert: WorkflowSLOCritical
        expr: |
          (
            (1 - (
              sum(rate(temporal_workflow_completed_count{status="completed"}[1h]))
              /
              sum(rate(temporal_workflow_completed_count[1h]))
            )) > (14.4 * 0.001)
          )
          and
          (
            (1 - (
              sum(rate(temporal_workflow_completed_count{status="completed"}[5m]))
              /
              sum(rate(temporal_workflow_completed_count[5m]))
            )) > (14.4 * 0.001)
          )
        labels:
          severity: critical
        annotations:
          summary: "Critical: Workflow SLO burn rate too high"
          description: "Error budget will be exhausted in < 2 days at current rate"

      # High: 6x burn rate
      - alert: WorkflowSLOHigh
        expr: |
          (
            (1 - (
              sum(rate(temporal_workflow_completed_count{status="completed"}[6h]))
              /
              sum(rate(temporal_workflow_completed_count[6h]))
            )) > (6 * 0.001)
          )
          and
          (
            (1 - (
              sum(rate(temporal_workflow_completed_count{status="completed"}[30m]))
              /
              sum(rate(temporal_workflow_completed_count[30m]))
            )) > (6 * 0.001)
          )
        labels:
          severity: warning
        annotations:
          summary: "High: Workflow SLO burn rate elevated"
          description: "Error budget will be exhausted in < 5 days at current rate"

      # Medium: 3x burn rate
      - alert: WorkflowSLOMedium
        expr: |
          (
            (1 - (
              sum(rate(temporal_workflow_completed_count{status="completed"}[1d]))
              /
              sum(rate(temporal_workflow_completed_count[1d]))
            )) > (3 * 0.001)
          )
          and
          (
            (1 - (
              sum(rate(temporal_workflow_completed_count{status="completed"}[2h]))
              /
              sum(rate(temporal_workflow_completed_count[2h]))
            )) > (3 * 0.001)
          )
        labels:
          severity: info
        annotations:
          summary: "Medium: Workflow SLO burn rate concerning"
          description: "Error budget will be exhausted in < 10 days at current rate"

11.5.5 Activity-Specific SLOs

Nicht alle Activities sind gleich wichtig!

Beispiel:

# Critical Activity: Payment Processing
- alert: PaymentActivitySLOBreach
  expr: |
    (
      sum(rate(temporal_activity_execution_count{
        activity_type="process_payment",
        status="completed"
      }[5m]))
      /
      sum(rate(temporal_activity_execution_count{
        activity_type="process_payment"
      }[5m]))
    ) < 0.999  # 99.9% SLO
  for: 5m
  labels:
    severity: critical
    activity: process_payment
  annotations:
    summary: "Payment activity SLO breach"
    description: "Success rate: {{ $value | humanizePercentage }}"

# Low-Priority Activity: Analytics Update
- alert: AnalyticsActivitySLOBreach
  expr: |
    (
      sum(rate(temporal_activity_execution_count{
        activity_type="update_analytics",
        status="completed"
      }[30m]))
      /
      sum(rate(temporal_activity_execution_count{
        activity_type="update_analytics"
      }[30m]))
    ) < 0.95  # 95% SLO (relaxed)
  for: 30m
  labels:
    severity: warning
    activity: update_analytics
  annotations:
    summary: "Analytics activity degraded"

11.5.6 Alertmanager Configuration

alertmanager.yml:

global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

    # Critical alerts → Slack #alerts
    - match:
        severity: critical
      receiver: slack-critical

    # Warnings → Slack #monitoring
    - match:
        severity: warning
      receiver: slack-monitoring

    # Info → Slack #monitoring (low priority)
    - match:
        severity: info
      receiver: slack-monitoring
      group_wait: 5m
      repeat_interval: 12h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#monitoring'
        title: 'Temporal Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'danger'

  - name: 'slack-monitoring'
    slack_configs:
      - channel: '#monitoring'
        title: '⚠️  {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'

11.6 Temporal Cloud Observability

11.6.1 Cloud Metrics Zugriff

Temporal Cloud bietet zwei Metrics Endpoints:

  1. Prometheus Endpoint (Scraping):
     https://cloud-metrics.temporal.io/prometheus/<account-id>/<namespace>
  2. PromQL Endpoint (Querying):
     https://cloud-metrics.temporal.io/api/v1/query

Authentication:

# API Key generieren (Temporal Cloud UI)
# Settings > Integrations > Prometheus

# Metrics abrufen
curl -H "Authorization: Bearer <API_KEY>" \
  https://cloud-metrics.temporal.io/prometheus/<account-id>/<namespace>/metrics

11.6.2 Prometheus Scrape Config

scrape_configs:
  - job_name: 'temporal-cloud'
    scheme: https
    static_configs:
      - targets:
          - 'cloud-metrics.temporal.io'
    authorization:
      credentials: '<YOUR_API_KEY>'
    params:
      account: ['<account-id>']
      namespace: ['<namespace>']
    scrape_interval: 60s  # Cloud Metrics: Max 1/minute

11.6.3 Verfügbare Cloud Metrics

Namespace Metrics:

# Workflow Execution Rate
temporal_cloud_v0_workflow_started

# Workflow Success/Failure
temporal_cloud_v0_workflow_success
temporal_cloud_v0_workflow_failed

# Active Workflows
temporal_cloud_v0_workflow_running

# Task Queue Depth
temporal_cloud_v0_task_queue_depth{task_queue="order-processing"}

Resource Metrics:

# Actions per Second (Billing)
temporal_cloud_v0_resource_actions_count

# Storage Usage
temporal_cloud_v0_resource_storage_bytes

11.6.4 Grafana Dashboard für Cloud

Cloud-specific Dashboard:

{
  "title": "Temporal Cloud Overview",
  "panels": [
    {
      "title": "Workflow Start Rate",
      "targets": [{
        "expr": "rate(temporal_cloud_v0_workflow_started[5m])",
        "legendFormat": "{{namespace}}"
      }]
    },
    {
      "title": "Workflow Success Rate",
      "targets": [{
        "expr": "rate(temporal_cloud_v0_workflow_success[5m]) / rate(temporal_cloud_v0_workflow_started[5m])",
        "legendFormat": "Success Rate"
      }]
    },
    {
      "title": "Active Workflows",
      "targets": [{
        "expr": "temporal_cloud_v0_workflow_running",
        "legendFormat": "{{workflow_type}}"
      }]
    },
    {
      "title": "Actions per Second (Billing)",
      "targets": [{
        "expr": "rate(temporal_cloud_v0_resource_actions_count[5m])",
        "legendFormat": "Actions/s"
      }]
    }
  ]
}

11.6.5 SDK Metrics vs. Cloud Metrics

Wichtig: Verwenden Sie die richtige Metrik-Quelle!

| Use Case | Source | Warum |
|----------|--------|-------|
| “Wie lange dauert meine Activity?” | SDK Metrics | Misst aus Worker-Sicht |
| “Wie viele Workflows sind aktiv?” | Cloud Metrics | Server-seitige Sicht |
| “Ist mein Worker überlastet?” | SDK Metrics | Worker-spezifisch |
| “Task Queue Backlog?” | Cloud Metrics | Server-seitiger Zustand |
| “Billing/Cost?” | Cloud Metrics | Nur Cloud kennt Actions |

Best Practice: Beide kombinieren!

# Workflow End-to-End Latency (Cloud)
temporal_cloud_v0_workflow_execution_time

# Activity Latency within Workflow (SDK)
temporal_activity_execution_latency_seconds{activity_type="process_payment"}

11.7 Debugging mit Observability

11.7.1 Problem → Metrics → Traces → Logs

Workflow: Von groß zu klein

1. Metrics: "Payment workflows sind langsam (p95: 30s statt 5s)"
   ↓
2. Traces: "process_payment Activity dauert 25s"
   ↓
3. Logs: "Connection timeout zu payment.api.com"
   ↓
4. Root Cause: Payment API ist down

Grafana Workflow:

1. Öffne Dashboard "Temporal - Orders"
2. Panel "Activity Latency" zeigt Spike
3. Click auf Spike → "View Traces"
4. Trace zeigt: "process_payment span: 25s"
5. Click auf Span → "View Logs"
6. Log: "ERROR: connection timeout after 20s"

11.7.2 Temporal Web UI Integration

Web UI: https://cloud.temporal.io oder http://localhost:8080

Features:

  • Workflow Execution History
  • Event Timeline
  • Pending Activities
  • Stack Traces
  • Retry History

Von Grafana zu Web UI:

Grafana Alert: "Workflow order-workflow-abc failed"
  ↓
Annotation Link: https://cloud.temporal.io/namespaces/default/workflows/order-workflow-abc
  ↓
Web UI: Zeigt komplette Workflow History

Grafana Annotation Setup:

import time

import httpx
from temporalio import activity


async def send_workflow_annotation(workflow_id: str, message: str):
    """Send Grafana annotation for workflow event"""

    # Async HTTP-Call, damit der Worker-Event-Loop nicht blockiert wird
    async with httpx.AsyncClient() as client:
        await client.post(
            'http://grafana:3000/api/annotations',
            json={
                'text': message,
                'tags': ['temporal', 'workflow', workflow_id],
                'time': int(time.time() * 1000),  # Unix timestamp ms
            },
            headers={
                'Authorization': 'Bearer <GRAFANA_API_KEY>',
                'Content-Type': 'application/json'
            }
        )


@activity.defn
async def critical_activity():
    workflow_id = activity.info().workflow_id

    try:
        result = await do_something()
        await send_workflow_annotation(
            workflow_id,
            "✓ Critical activity completed"
        )
        return result
    except Exception as e:
        await send_workflow_annotation(
            workflow_id,
            f"❌ Critical activity failed: {e}"
        )
        raise

11.7.3 Correlation Queries

Problem: Metrics/Traces/Logs sind isoliert.

Lösung: Queries mit Correlation IDs.

Find all data for a workflow:

# 1. Prometheus: Get workflow start time (promtool braucht die Server-URL)
workflow_start_time=$(
  promtool query instant http://prometheus:9090 \
    'temporal_workflow_started_time{workflow_id="order-abc"}'
)

# 2. Tempo: Find traces for workflow
curl -G http://tempo:3200/api/search \
  --data-urlencode 'q={workflow_id="order-abc"}'

# 3. Loki: Find logs for workflow
curl -G http://loki:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="workers"} | json | workflow_id="order-abc"' \
  --data-urlencode "start=$workflow_start_time"

In Grafana Explore (einfacher):

1. Data Source: Prometheus
2. Query: temporal_workflow_started{workflow_id="order-abc"}
3. Click auf Datapoint → "View in Tempo"
4. Trace öffnet sich → Click auf Span → "View in Loki"
5. Logs erscheinen für diesen Span

11.7.4 Common Debugging Scenarios

Scenario 1: “Workflows are slow”

1. Check: Workflow p95 latency metric
   → Which workflow type is slow?

2. Check: Activity latency breakdown
   → Which activity is the bottleneck?

3. Check: Traces for slow workflow instances
   → Is it always slow or intermittent?

4. Check: Logs for slow activity executions
   → What error/timeout is occurring?

5. Check: External service metrics
   → Is downstream service degraded?

Scenario 2: “High failure rate”

1. Check: Workflow failure rate by type
   → Which workflow is failing?

2. Check: Activity failure rate
   → Which activity is failing?

3. Check: Error logs
   → What error messages appear?

4. Check: Temporal Web UI
   → Look at failed workflow history

5. Check: Deployment timeline
   → Did failure start after deployment?

Scenario 3: “Task queue is backing up”

1. Check: Task queue lag metric
   → How large is the backlog?

2. Check: Worker availability
   → Are workers up?

3. Check: Worker task slots
   → Are workers saturated?

4. Check: Activity execution rate
   → Is processing rate dropping?

5. Check: Worker logs
   → Are workers crashing/restarting?

11.8 Zusammenfassung

Was Sie gelernt haben

SDK Metrics:

  • ✅ Prometheus Export aus Python Workers konfigurieren
  • ✅ Wichtige Metrics: Workflow/Activity Rate, Latency, Success Rate
  • ✅ Custom Business Metrics in Activities
  • ✅ Prometheus Scraping für Kubernetes

Grafana:

  • ✅ Community Dashboards installieren
  • ✅ Custom Dashboards erstellen
  • ✅ PromQL Queries für Temporal Metrics
  • ✅ Alerting Rules definieren

OpenTelemetry:

  • ✅ Distributed Tracing Setup
  • ✅ Automatic Instrumentation für Workflows
  • ✅ Manual Spans in Activities
  • ✅ Tempo Integration
  • ✅ Correlation: Metrics + Traces + Logs

Logging:

  • ✅ Structured Logging mit structlog
  • ✅ Temporal Logger mit Auto-Context
  • ✅ Loki für Log Aggregation
  • ✅ LogQL Queries
  • ✅ Replay-Safe Logging

SLO-basiertes Alerting:

  • ✅ SLI/SLO/SLA Konzepte
  • ✅ Error Budget Tracking
  • ✅ Multi-Window Multi-Burn-Rate Alerts
  • ✅ Activity-specific SLOs
  • ✅ Alertmanager Configuration

Temporal Cloud:

  • ✅ Cloud Metrics API
  • ✅ Prometheus Scraping
  • ✅ SDK vs. Cloud Metrics
  • ✅ Billing Metrics

Debugging:

  • ✅ Von Metrics zu Traces zu Logs
  • ✅ Temporal Web UI Integration
  • ✅ Correlation Queries
  • ✅ Common Debugging Scenarios

Production Checklist

Monitoring Setup:

  • SDK Metrics Export konfiguriert
  • Prometheus scraping Workers
  • Grafana Dashboards deployed
  • Alerting Rules definiert
  • Alertmanager konfiguriert (Slack/PagerDuty)
  • On-Call Rotation definiert

Observability:

  • Structured Logging implementiert
  • Log Aggregation (Loki/ELK) läuft
  • OpenTelemetry Tracing aktiviert
  • Trace Backend (Tempo/Jaeger) deployed
  • Correlation IDs in allen Logs

SLOs:

  • SLIs für kritische Workflows definiert
  • SLOs festgelegt (99.9%? 99.5%?)
  • Error Budget Dashboard erstellt
  • Multi-Burn-Rate Alerts konfiguriert
  • Activity-specific SLOs dokumentiert

Dashboards:

  • Workflow Overview Dashboard
  • Worker Health Dashboard
  • Activity Performance Dashboard
  • Business Metrics Dashboard
  • SLO Tracking Dashboard

Alerts:

  • High Workflow Failure Rate
  • Task Queue Backlog
  • Worker Unavailable
  • Activity Latency Spike
  • SLO Burn Rate Critical
  • Error Budget Exhausted

Häufige Fehler

Zu wenig monitoren

Problem: Nur Server-Metrics, keine SDK Metrics
Folge: Keine Sicht auf Ihre Application-Performance

Richtig:

Beide monitoren: Server + SDK Metrics
SDK Metrics = Source of Truth für Application Performance

Nur Metrics, keine Traces

Problem: Wissen, dass es langsam ist, aber nicht wo
Folge: Debugging dauert Stunden

Richtig:

Metrics → Traces → Logs Pipeline
Correlation IDs überall

Alert Fatigue

Problem: 100 Alerts pro Tag
Folge: Wichtige Alerts werden ignoriert

Richtig:

SLO-basiertes Alerting
Multi-Burn-Rate Alerts (weniger False Positives)
Alert nur auf SLO-Verletzungen

Keine Correlation

Problem: Metrics, Logs, Traces sind isoliert
Folge: Müssen manuell korrelieren

Richtig:

Exemplars in Metrics
Trace IDs in Logs
Grafana-Integration

Best Practices

  1. Metriken hierarchisch organisieren

    System Metrics (Server CPU, Memory)
      → Temporal Metrics (Workflows, Activities)
        → Business Metrics (Orders, Revenue)
    
  2. Alerts nach Severity gruppieren

    Critical → Page immediately (SLO breach)
    Warning → Page during business hours
    Info → Ticket for next sprint
    
  3. Dashboards für Rollen

    Executive: Business KPIs (Orders/hour, Revenue)
    Engineering: Technical Metrics (Latency, Error Rate)
    SRE: Operational (Worker Health, Queue Depth)
    On-Call: Incident Response (Recent Alerts, Anomalies)
    
  4. Retention Policies

    Metrics: 30 days high-res, 1 year downsampled
    Logs: 7 days full, 30 days search indices
    Traces: 7 days (sampling: 10% background, 100% errors)
    
  5. Cost Optimization

    - Use sampling for traces (not every request)
    - Downsample old metrics
    - Compress logs
    - Use Cloud Metrics API efficiently (max 1 req/min)
    

Weiterführende Ressourcen

Temporal Docs:

Grafana:

OpenTelemetry:

SRE:

Nächste Schritte

Sie können jetzt Production-ready Monitoring aufsetzen! Aber Observability ist nur ein Teil des Betriebsalltags.

Weiter geht’s mit:

  • Kapitel 12: Testing Strategies – Wie Sie Workflows umfassend testen
  • Kapitel 13: Best Practices und Anti-Muster – Production-ready Temporal-Anwendungen
  • Kapitel 14-15: Kochbuch – Konkrete Patterns und Rezepte für häufige Use Cases

⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 12: Testing Strategies

Code-Beispiele für dieses Kapitel: examples/part-04/chapter-11/

💡 Tipp: Monitoring ist nicht “set and forget”. Überprüfen Sie Ihre Dashboards und Alerts regelmäßig:

  • Monatlich: SLO Review (wurden sie eingehalten?)
  • Quartalsweise: Alert Review (zu viele False Positives?)
  • Nach Incidents: Post-Mortem → Update Alerts/Dashboards

Kapitel 12: Testing Strategies

Einleitung

Sie haben einen komplexen Workflow implementiert, der mehrere External Services orchestriert, komplizierte Retry-Logik hat und über Tage hinweg laufen kann. Alles funktioniert lokal. Sie deployen in Production – und plötzlich:

  • Ein Edge Case bricht den Workflow
  • Eine kürzlich geänderte Activity verhält sich anders als erwartet
  • Ein Refactoring führt zu Non-Determinismus-Fehlern
  • Ein Workflow, der Tage dauert, kann nicht schnell getestet werden

Ohne Testing-Strategie sind Sie:

  • Unsicher bei jedem Deployment
  • Abhängig von manuellen Tests in Production
  • Blind gegenüber Breaking Changes
  • Langsam beim Debugging

Mit einer robusten Testing-Strategie haben Sie:

  • Vertrauen in Ihre Changes
  • Schnelles Feedback (Sekunden statt Tage)
  • Automatische Regression-Detection
  • Sichere Workflow-Evolution

Temporal bietet leistungsstarke Testing-Tools, die speziell für durable, long-running Workflows entwickelt wurden. Dieses Kapitel zeigt Ihnen, wie Sie sie effektiv nutzen.

Das Grundproblem

Scenario: Sie entwickeln einen Order Processing Workflow:

@workflow.defn
class OrderWorkflow:
    def __init__(self) -> None:
        self.approved = False

    @workflow.signal
    def approve(self) -> None:
        self.approved = True

    @workflow.run
    async def run(self, order_id: str) -> str:
        # Payment (mit Retry-Logik)
        payment = await workflow.execute_activity(
            process_payment,
            order_id,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        # Inventory (kann lange dauern)
        await workflow.execute_activity(
            reserve_inventory,
            order_id,
            start_to_close_timeout=timedelta(hours=24)
        )

        # Warte auf manuelle Approval (via Signal)
        await workflow.wait_condition(lambda: self.approved)

        # Shipping
        tracking = await workflow.execute_activity(
            create_shipment,
            order_id,
            start_to_close_timeout=timedelta(hours=1)
        )

        return tracking

Ohne Testing-Framework:

❌ Test dauert 24+ Stunden (wegen inventory timeout)
❌ Manuelle Approval muss simuliert werden
❌ External Services müssen verfügbar sein
❌ Retry-Logik schwer zu testen
❌ Workflow-Evolution kann nicht validiert werden

→ Tests werden nicht geschrieben
→ Bugs landen in Production
→ Debugging dauert Stunden

Mit Temporal Testing:

✓ Test läuft in Sekunden (time-skipping)
✓ Activities werden gemockt
✓ Signals können simuliert werden
✓ Retry-Verhalten ist testbar
✓ Workflow History kann replayed werden

→ Comprehensive Test Suite
→ Bugs werden vor Deployment gefunden
→ Sichere Refactorings

Lernziele

Nach diesem Kapitel können Sie:

  • Unit Tests für Activities und Workflows schreiben
  • Integration Tests mit WorkflowEnvironment implementieren
  • Time-Skipping für Tests mit langen Timeouts nutzen
  • Activities mocken für isolierte Workflow-Tests
  • Replay Tests für Workflow-Evolution durchführen
  • pytest Fixtures für Test-Isolation aufsetzen
  • CI/CD Integration mit automatisierten Tests
  • Production Histories in Tests verwenden

12.1 Unit Testing: Activities in Isolation

Der einfachste Test-Ansatz: Activities direkt aufrufen, ohne Worker oder Workflow.

12.1.1 Warum Activity Unit Tests?

Vorteile:

  • ⚡ Schnell (keine Temporal-Infrastruktur nötig)
  • 🎯 Fokussiert (nur Business-Logik)
  • 🔄 Einfach zu debuggen
  • 📊 Hohe Code Coverage

Best Practice: 80% Unit Tests, 15% Integration Tests, 5% E2E Tests

12.1.2 Activity Unit Test Beispiel

# activities.py
from temporalio import activity
from dataclasses import dataclass
import httpx

@dataclass
class PaymentRequest:
    order_id: str
    amount: float

@dataclass
class PaymentResult:
    success: bool
    transaction_id: str

@activity.defn
async def process_payment(request: PaymentRequest) -> PaymentResult:
    """Process payment via external API"""
    activity.logger.info(f"Processing payment for {request.order_id}")

    # Call external payment API
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://payment.api.com/charge",
            json={
                "order_id": request.order_id,
                "amount": request.amount
            },
            timeout=30.0
        )
        response.raise_for_status()
        data = response.json()

    return PaymentResult(
        success=data["status"] == "success",
        transaction_id=data["transaction_id"]
    )

Test (ohne Temporal):

# tests/test_activities.py
import pytest
import httpx
from unittest.mock import AsyncMock, MagicMock, patch
from activities import process_payment, PaymentRequest, PaymentResult

@pytest.mark.asyncio
async def test_process_payment_success():
    """Test successful payment processing"""

    # Mock httpx Response (json() und raise_for_status() sind bei httpx synchron)
    mock_response = MagicMock()
    mock_response.json.return_value = {
        "status": "success",
        "transaction_id": "txn_12345"
    }

    with patch("httpx.AsyncClient") as mock_client:
        mock_client.return_value.__aenter__.return_value.post = AsyncMock(
            return_value=mock_response
        )

        # Call activity directly (no Temporal needed!)
        result = await process_payment(
            PaymentRequest(order_id="order-001", amount=99.99)
        )

        # Assert
        assert result.success is True
        assert result.transaction_id == "txn_12345"

@pytest.mark.asyncio
async def test_process_payment_failure():
    """Test payment processing failure"""

    with patch("httpx.AsyncClient") as mock_client:
        # Simulate API error
        mock_client.return_value.__aenter__.return_value.post = AsyncMock(
            side_effect=httpx.HTTPStatusError(
                "Payment failed",
                request=AsyncMock(),
                response=AsyncMock(status_code=400)
            )
        )

        # Expect activity to raise
        with pytest.raises(httpx.HTTPStatusError):
            await process_payment(
                PaymentRequest(order_id="order-002", amount=199.99)
            )

Vorteile:

  • ✅ Keine Temporal Server nötig
  • ✅ Tests laufen in Millisekunden
  • ✅ External API wird gemockt
  • ✅ Error Cases sind testbar

12.2 Integration Testing mit WorkflowEnvironment

Integration Tests testen Workflows UND Activities zusammen, mit einem in-memory Temporal Server.

12.2.1 WorkflowEnvironment Setup

# tests/test_workflows.py
import pytest
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker
from workflows import OrderWorkflow
from activities import process_payment, reserve_inventory, create_shipment

@pytest.fixture
async def workflow_env():
    """Fixture: Temporal test environment"""
    async with await WorkflowEnvironment.start_time_skipping() as env:
        yield env

@pytest.fixture
async def worker(workflow_env):
    """Fixture: Worker mit Workflows und Activities"""
    async with Worker(
        workflow_env.client,
        task_queue="test-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment, reserve_inventory, create_shipment]
    ):
        yield

Wichtig: start_time_skipping() aktiviert automatisches Time-Skipping!

12.2.2 Workflow Integration Test

@pytest.mark.asyncio
async def test_order_workflow_success(workflow_env, worker):
    """Test successful order workflow execution"""

    # Start workflow
    handle = await workflow_env.client.start_workflow(
        OrderWorkflow.run,
        "order-test-001",
        id="test-order-001",
        task_queue="test-queue"
    )

    # Send approval signal (simulating manual step)
    await handle.signal(OrderWorkflow.approve)

    # Wait for result
    result = await handle.result()

    # Assert
    assert result.startswith("TRACKING-")

Was passiert hier?

  1. workflow_env startet in-memory Temporal Server
  2. worker registriert Workflows/Activities
  3. Workflow wird gestartet
  4. Signal wird gesendet (simuliert manuellen Schritt)
  5. Ergebnis wird validiert

Time-Skipping: 24-Stunden Timeout dauert nur Sekunden!


12.3 Time-Skipping: Tage in Sekunden testen

12.3.1 Das Problem: Lange Timeouts

@workflow.defn
class NotificationWorkflow:
    @workflow.run
    async def run(self, user_id: str):
        # Send initial notification
        await workflow.execute_activity(
            send_email,
            user_id,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Wait 3 days
        await asyncio.sleep(timedelta(days=3).total_seconds())

        # Send reminder
        await workflow.execute_activity(
            send_reminder,
            user_id,
            start_to_close_timeout=timedelta(minutes=5)
        )

Ohne Time-Skipping: Test dauert 3 Tage 😱

Mit Time-Skipping: Test dauert Sekunden ⚡

12.3.2 Time-Skipping in Action

@pytest.mark.asyncio
async def test_notification_workflow_with_delay(workflow_env, worker):
    """Test workflow with 3-day sleep (executes in seconds!)"""

    # Start workflow
    handle = await workflow_env.client.start_workflow(
        NotificationWorkflow.run,
        "user-123",
        id="test-notification",
        task_queue="test-queue"
    )

    # Wait for completion (time is automatically skipped!)
    await handle.result()

    # Verify both activities were called
    from temporalio.api.enums.v1 import EventType

    history = await handle.fetch_history()
    activity_events = [
        e for e in history.events
        if e.event_type == EventType.EVENT_TYPE_ACTIVITY_TASK_SCHEDULED
    ]
    assert len(activity_events) == 2  # send_email + send_reminder

Wie funktioniert Time-Skipping?

  • WorkflowEnvironment erkennt, dass keine Activities laufen
  • Zeit wird automatisch vorwärts gespult bis zum nächsten Event
  • asyncio.sleep(3 days) wird instant übersprungen
  • Test läuft in <1 Sekunde

12.3.3 Manuelles Time-Skipping

@pytest.mark.asyncio
async def test_manual_time_skip(workflow_env, worker):
    """Manually control time skipping"""

    # Start workflow
    handle = await workflow_env.client.start_workflow(
        NotificationWorkflow.run,
        "user-456",
        id="test-manual-skip",
        task_queue="test-queue"
    )

    # Manually skip time
    await workflow_env.sleep(timedelta(days=3))

    # Check workflow state via query
    state = await handle.query("get_state")
    assert state == "reminder_sent"

12.4 Mocking Activities

Problem: Activities rufen externe Services auf (Datenbanken, APIs, etc.). Im Test wollen wir diese nicht aufrufen.

12.4.1 Activity Mocking mit Mock-Implementierung

# activities.py (production code)
@activity.defn
async def send_email(user_id: str, subject: str, body: str):
    """Send email via SendGrid"""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.sendgrid.com/v3/mail/send",
            json={
                "to": f"user-{user_id}@example.com",
                "subject": subject,
                "body": body
            },
            headers={"Authorization": f"Bearer {SENDGRID_API_KEY}"}
        )
        response.raise_for_status()

# tests/mocks.py (test code)
@activity.defn(name="send_email")  # Same name as production activity!
async def mock_send_email(user_id: str, subject: str, body: str):
    """Mock email sending (no external call)"""
    activity.logger.info(f"MOCK: Sending email to user {user_id}")
    # No actual API call - just return success
    return None

Test mit Mock:

from tests.mocks import mock_send_email

@pytest.mark.asyncio
async def test_with_mock_activity(workflow_env):
    """Test workflow with mocked activity"""

    # Worker uses MOCK activity instead of production one
    async with Worker(
        workflow_env.client,
        task_queue="test-queue",
        workflows=[NotificationWorkflow],
        activities=[mock_send_email]  # Mock statt Production!
    ):
        handle = await workflow_env.client.start_workflow(
            NotificationWorkflow.run,
            "user-789",
            id="test-with-mock",
            task_queue="test-queue"
        )

        await handle.result()

        # Verify workflow completed without calling SendGrid

Vorteile:

  • ✅ Keine external dependencies
  • ✅ Tests laufen offline
  • ✅ Schneller (keine Network Latency)
  • ✅ Deterministisch (keine Flakiness)

12.4.2 Conditional Mocking (Production vs Test)

# config.py
import os

IS_TEST = os.getenv("TESTING", "false") == "true"

# activities.py
@activity.defn
async def send_email(user_id: str, subject: str, body: str):
    if IS_TEST:
        activity.logger.info(f"TEST MODE: Would send email to {user_id}")
        return

    # Production code
    async with httpx.AsyncClient() as client:
        # ... real API call
        pass

Nachteile dieses Ansatzes:

  • ⚠️ Vermischt Production und Test-Code
  • ⚠️ Schwieriger zu maintainen
  • Besser: Separate Mock-Implementierungen (siehe oben)

12.5 Replay Testing: Workflow-Evolution validieren

Replay Testing ist Temporals Killer-Feature für sichere Workflow-Evolution.

12.5.1 Was ist Replay Testing?

Konzept:

  1. Workflow wird ausgeführt → History wird aufgezeichnet
  2. Workflow-Code wird geändert
  3. Replay: Alte History wird mit neuem Code replayed
  4. Validierung: Prüfen, ob neuer Code deterministisch ist

Use Case: Sie deployen eine neue Workflow-Version. Replay Testing stellt sicher, dass alte, noch laufende Workflows nicht brechen.

12.5.2 Replay Test Setup

# tests/test_replay.py
import pytest
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Replayer, Worker
from temporalio.client import WorkflowHistory
from workflows import OrderWorkflowV1, OrderWorkflowV2
from activities import process_payment

@pytest.mark.asyncio
async def test_workflow_v2_replays_v1_history():
    """Test that v2 workflow can replay v1 history"""

    # 1. Execute v1 workflow and capture history
    async with await WorkflowEnvironment.start_time_skipping() as env:
        async with Worker(
            env.client,
            task_queue="test-queue",
            workflows=[OrderWorkflowV1],
            activities=[process_payment]
        ):
            handle = await env.client.start_workflow(
                OrderWorkflowV1.run,
                "order-replay-test",
                id="replay-test",
                task_queue="test-queue"
            )

            await handle.result()

            # Capture workflow history
            history = await handle.fetch_history()

    # 2. Create Replayer with v2 workflow
    # (Activities werden beim Replay nicht ausgeführt, daher nur Workflows registrieren)
    replayer = Replayer(workflows=[OrderWorkflowV2])

    # 3. Replay v1 history with v2 code
    try:
        await replayer.replay_workflow(history)
        print("✅ Replay successful - v2 is compatible!")
    except Exception as e:
        pytest.fail(f"❌ Replay failed - non-determinism detected: {e}")

12.5.3 Breaking Change Detection

Scenario: Sie ändern Activity-Reihenfolge (Breaking Change!)

# workflows.py (v1)
@workflow.defn
class OrderWorkflowV1:
    @workflow.run
    async def run(self, order_id: str):
        payment = await workflow.execute_activity(process_payment, ...)
        inventory = await workflow.execute_activity(reserve_inventory, ...)
        return "done"

# workflows.py (v2 - BREAKING!)
@workflow.defn
class OrderWorkflowV2:
    @workflow.run
    async def run(self, order_id: str):
        # WRONG: Changed order!
        inventory = await workflow.execute_activity(reserve_inventory, ...)
        payment = await workflow.execute_activity(process_payment, ...)
        return "done"

Replay Test fängt das ab:

❌ Replay failed - non-determinism detected:
Expected ActivityScheduled(process_payment)
Got ActivityScheduled(reserve_inventory)

Lösung: Verwende workflow.patched() (siehe Kapitel 8)

@workflow.defn
class OrderWorkflowV2Fixed:
    @workflow.run
    async def run(self, order_id: str):
        if workflow.patched("swap-order-v2"):
            # New order
            inventory = await workflow.execute_activity(reserve_inventory, ...)
            payment = await workflow.execute_activity(process_payment, ...)
        else:
            # Old order (for replay)
            payment = await workflow.execute_activity(process_payment, ...)
            inventory = await workflow.execute_activity(reserve_inventory, ...)

        return "done"

12.5.4 Production History Replay

Best Practice: Replay echte Production Histories in CI/CD!

# tests/test_production_replay.py
import json
from pathlib import Path

@pytest.mark.asyncio
async def test_replay_production_histories():
    """Replay 100 most recent production histories"""

    # Load histories from exported JSON files
    history_dir = Path("tests/fixtures/production_histories")

    # Activities werden beim Replay nicht ausgeführt, daher nur Workflows registrieren
    replayer = Replayer(workflows=[OrderWorkflowV2])

    for history_file in history_dir.glob("*.json"):
        with open(history_file) as f:
            history_data = json.load(f)

        workflow_id = history_file.stem
        history = WorkflowHistory.from_json(workflow_id, history_data)

        try:
            await replayer.replay_workflow(history)
            print(f"✅ {workflow_id} replayed successfully")
        except Exception as e:
            pytest.fail(f"❌ {workflow_id} failed: {e}")

Workflow Histories exportieren:

# Export history for a workflow
temporal workflow show \
  --workflow-id order-12345 \
  --output json > tests/fixtures/production_histories/order-12345.json

# Batch export (last 100 workflows)
# Hinweis: Die Redirection muss pro Workflow in einer Subshell erfolgen,
# sonst landet die gesamte Ausgabe in einer einzigen Datei.
temporal workflow list \
  --query 'WorkflowType="OrderWorkflow"' \
  --limit 100 \
  --fields WorkflowId \
  | xargs -I {} sh -c 'temporal workflow show --workflow-id "$1" --output json > "$1.json"' _ {}

12.6 pytest Fixtures für Test-Isolation

Problem: Tests beeinflussen sich gegenseitig, wenn sie Workflows mit denselben IDs starten.

Lösung: pytest Fixtures + eindeutige Workflow IDs

12.6.1 Wiederverwendbare Fixtures

# tests/conftest.py (shared fixtures)
import pytest
import uuid
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker
from workflows import OrderWorkflow
from activities import process_payment, reserve_inventory

@pytest.fixture
async def temporal_env():
    """Fixture: Temporal test environment (time-skipping)"""
    async with await WorkflowEnvironment.start_time_skipping() as env:
        yield env

@pytest.fixture
async def worker(temporal_env):
    """Fixture: Worker with all workflows/activities"""
    async with Worker(
        temporal_env.client,
        task_queue="test-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment, reserve_inventory]
    ):
        yield

@pytest.fixture
def unique_workflow_id():
    """Fixture: Generate unique workflow ID for each test"""
    return f"test-{uuid.uuid4()}"

12.6.2 Test Isolation

# tests/test_order_workflow.py
import pytest

@pytest.mark.asyncio
async def test_order_success(temporal_env, worker, unique_workflow_id):
    """Test successful order (isolated via unique ID)"""

    handle = await temporal_env.client.start_workflow(
        OrderWorkflow.run,
        "order-001",
        id=unique_workflow_id,  # Unique ID!
        task_queue="test-queue"
    )

    result = await handle.result()
    assert result == "ORDER_COMPLETED"

@pytest.mark.asyncio
async def test_order_payment_failure(temporal_env, worker, unique_workflow_id):
    """Test order with payment failure (isolated)"""

    handle = await temporal_env.client.start_workflow(
        OrderWorkflow.run,
        "order-002",
        id=unique_workflow_id,  # Different unique ID!
        task_queue="test-queue"
    )

    # Expect workflow to fail
    with pytest.raises(Exception, match="Payment failed"):
        await handle.result()

Vorteile:

  • ✅ Keine Test-Interferenz
  • ✅ Tests können parallel laufen
  • ✅ Deterministisch (kein Flakiness)

12.6.3 Parametrisierte Tests

@pytest.mark.parametrize("order_id,expected_status", [
    ("order-001", "COMPLETED"),
    ("order-002", "PAYMENT_FAILED"),
    ("order-003", "INVENTORY_UNAVAILABLE"),
])
@pytest.mark.asyncio
async def test_order_scenarios(
    temporal_env,
    worker,
    unique_workflow_id,
    order_id,
    expected_status
):
    """Test multiple order scenarios"""

    handle = await temporal_env.client.start_workflow(
        OrderWorkflow.run,
        order_id,
        id=unique_workflow_id,
        task_queue="test-queue"
    )

    result = await handle.result()
    assert result["status"] == expected_status

12.7 CI/CD Integration

12.7.1 pytest in CI/CD Pipeline

GitHub Actions Beispiel:

# .github/workflows/test.yml
name: Temporal Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.13'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-asyncio

      - name: Run unit tests
        run: pytest tests/test_activities.py -v

      - name: Run integration tests
        run: pytest tests/test_workflows.py -v

      - name: Run replay tests
        run: pytest tests/test_replay.py -v

      - name: Generate coverage report
        run: |
          pip install pytest-cov
          pytest --cov=workflows --cov=activities --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3

12.7.2 Test-Organisation

tests/
├── conftest.py              # Shared fixtures
├── test_activities.py       # Unit tests (fast)
├── test_workflows.py        # Integration tests (slower)
├── test_replay.py           # Replay tests (critical)
├── fixtures/
│   └── production_histories/  # Exported workflow histories
│       ├── order-12345.json
│       └── order-67890.json
└── mocks/
    └── mock_activities.py   # Mock implementations

pytest Marker für Test-Kategorien:

# tests/test_workflows.py
import pytest

@pytest.mark.unit
@pytest.mark.asyncio
async def test_activity_directly():
    """Fast unit test"""
    result = await process_payment(...)
    assert result.success

@pytest.mark.integration
@pytest.mark.asyncio
async def test_workflow_with_worker(temporal_env, worker):
    """Slower integration test"""
    handle = await temporal_env.client.start_workflow(...)
    await handle.result()

@pytest.mark.replay
@pytest.mark.asyncio
async def test_replay_production_history():
    """Critical replay test"""
    replayer = Replayer(...)
    await replayer.replay_workflow(history)

Selektives Ausführen:

# Nur Unit Tests (schnell)
pytest -m unit

# Nur Integration Tests
pytest -m integration

# Nur Replay Tests (vor Deployment!)
pytest -m replay

# Alle Tests
pytest

12.7.3 Pre-Commit Hook für Replay Tests

# .git/hooks/pre-commit
#!/bin/bash

echo "Running replay tests before commit..."
pytest tests/test_replay.py -v

if [ $? -ne 0 ]; then
    echo "❌ Replay tests failed! Commit blocked."
    exit 1
fi

echo "✅ Replay tests passed!"

12.8 Advanced: Testing mit echten Temporal Server

Use Case: End-to-End Tests mit realem Temporal Server (nicht in-memory).

12.8.1 Temporal Dev Server in CI

# .github/workflows/e2e.yml
jobs:
  e2e-test:
    runs-on: ubuntu-latest

    services:
      temporal:
        image: temporalio/auto-setup:latest
        ports:
          - 7233:7233
        env:
          TEMPORAL_ADDRESS: localhost:7233

    steps:
      - uses: actions/checkout@v3

      - name: Wait for Temporal
        run: |
          timeout 60 bash -c 'until nc -z localhost 7233; do sleep 1; done'

      - name: Run E2E tests
        run: pytest tests/e2e/ -v
        env:
          TEMPORAL_ADDRESS: localhost:7233

12.8.2 E2E Test mit realem Server

# tests/e2e/test_order_e2e.py
import pytest
from temporalio.client import Client
from temporalio.worker import Worker

@pytest.mark.e2e
@pytest.mark.asyncio
async def test_order_workflow_e2e():
    """E2E test with real Temporal server"""

    # Connect to real Temporal server
    client = await Client.connect("localhost:7233")

    # Start real worker
    async with Worker(
        client,
        task_queue="e2e-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment, reserve_inventory]
    ):
        # Execute workflow
        handle = await client.start_workflow(
            OrderWorkflow.run,
            "order-e2e-001",
            id="e2e-test-001",
            task_queue="e2e-queue"
        )

        result = await handle.result()
        assert result == "ORDER_COMPLETED"

        # Verify via Temporal UI (optional)
        history = await handle.fetch_history()
        assert len(history.events) > 0

12.9 Testing Best Practices

12.9.1 Test-Pyramide für Temporal

         /\
        /  \  E2E Tests (5%)
       /____\  - Real Temporal Server
      /      \  - All services integrated
     /________\ Integration Tests (15%)
    /          \ - WorkflowEnvironment
   /____________\ - Time-skipping
  /              \ - Mocked activities
 /________________\ Unit Tests (80%)
                   - Direct activity calls
                   - Fast, isolated

12.9.2 Checkliste: Was testen?

Workflows:

  • ✅ Happy Path (erfolgreiches Durchlaufen)
  • ✅ Error Cases (Activity Failures, Timeouts)
  • ✅ Signal Handling (korrekte Reaktion auf Signals)
  • ✅ Query Responses (richtige State-Rückgabe; ein Signal-/Query-Test-Sketch folgt nach dieser Checkliste)
  • ✅ Retry Behavior (Retries funktionieren wie erwartet)
  • ✅ Long-running Scenarios (mit Time-Skipping)
  • ✅ Replay Compatibility (nach Code-Änderungen)

Activities:

  • ✅ Business Logic (korrekte Berechnung/Verarbeitung)
  • ✅ Error Handling (Exceptions werden richtig geworfen)
  • ✅ Edge Cases (null, empty, extreme values)
  • ✅ External API Mocking (keine echten Calls im Test)

Workflow Evolution:

  • ✅ Replay Tests (alte Histories mit neuem Code)
  • ✅ Patching Scenarios (workflow.patched() funktioniert)
  • ✅ Breaking Change Detection (Non-Determinismus)
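
Für Signal- und Query-Handling kann ein Test z.B. so aussehen (Sketch; angenommen wird, dass der OrderWorkflow neben dem approve-Signal auch eine Query namens status anbietet, die hier nicht gezeigt ist):

@pytest.mark.asyncio
async def test_signal_and_query(temporal_env, worker, unique_workflow_id):
    """Sketch: Query vor der Freigabe prüfen, dann Signal senden"""

    handle = await temporal_env.client.start_workflow(
        OrderWorkflow.run,
        "order-signal-test",
        id=unique_workflow_id,
        task_queue="test-queue"
    )

    # Query: Zustand vor der Freigabe (Annahme: Query "status" existiert)
    assert await handle.query("status") == "waiting_for_approval"

    # Signal: manuelle Freigabe simulieren
    await handle.signal(OrderWorkflow.approve)

    # Danach läuft der Workflow bis zum Ende durch
    result = await handle.result()
    assert result.startswith("TRACKING-")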

12.9.3 Common Testing Mistakes

| Fehler | Problem | Lösung |
|--------|---------|--------|
| Keine Replay Tests | Breaking Changes in Production | Replay Tests in CI/CD |
| Tests dauern zu lang | Keine Time-Skipping-Nutzung | start_time_skipping() |
| Flaky Tests | Shared Workflow IDs | Unique IDs pro Test |
| Nur Happy Path | Bugs in Error Cases | Edge Cases testen |
| External Calls im Test | Langsam, flaky, Kosten | Activities mocken |
| Keine Production History | Ungetestete Edge Cases | Production Histories exportieren |

12.9.4 Performance-Optimierung

# SLOW: Neues Environment pro Test
@pytest.mark.asyncio
async def test_workflow_1():
    async with await WorkflowEnvironment.start_time_skipping() as env:
        # Test...
        pass

@pytest.mark.asyncio
async def test_workflow_2():
    async with await WorkflowEnvironment.start_time_skipping() as env:
        # Test...
        pass

# FAST: Shared environment via fixture (session scope)
@pytest.fixture(scope="session")
async def shared_env():
    async with await WorkflowEnvironment.start_time_skipping() as env:
        yield env

@pytest.mark.asyncio
async def test_workflow_1(shared_env):
    # Test... (uses same environment)
    pass

@pytest.mark.asyncio
async def test_workflow_2(shared_env):
    # Test... (uses same environment)
    pass

Speedup: 10x schneller bei vielen Tests!
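
Hinweis: Session-scoped async Fixtures setzen voraus, dass auch der Event Loop von pytest-asyncio Session-Scope hat (je nach Version z.B. über die Option asyncio_default_fixture_loop_scope = session in der pytest-Konfiguration); andernfalls schlägt die Fixture mit einem ScopeMismatch-Fehler fehl.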


12.10 Zusammenfassung

Testing Strategy Checklist

Development:

  • Unit Tests für alle Activities
  • Integration Tests für kritische Workflows
  • Replay Tests für Workflow-Versionen
  • Mocks für externe Services
  • Time-Skipping für lange Workflows

CI/CD:

  • pytest in GitHub Actions / GitLab CI
  • Replay Tests vor jedem Deployment
  • Production History Replay (wöchentlich)
  • Test Coverage Tracking (>80%)
  • Pre-Commit Hooks für Replay Tests

Production:

  • Workflow Histories regelmäßig exportieren
  • Replay Tests mit Production Histories
  • Monitoring für Test-Failures in CI
  • Rollback-Plan bei Breaking Changes

Häufige Fehler

FEHLER 1: Keine Replay Tests

# Deployment ohne Replay Testing
# → Breaking Changes landen in Production

RICHTIG:

@pytest.mark.asyncio
async def test_replay_before_deploy():
    replayer = Replayer(workflows=[WorkflowV2])
    await replayer.replay_workflow(production_history)

FEHLER 2: Tests dauern ewig

# Warten auf echte Timeouts
await asyncio.sleep(3600)  # 1 Stunde

RICHTIG:

# Time-Skipping nutzen
async with await WorkflowEnvironment.start_time_skipping() as env:
    # 1 Stunde wird instant übersprungen

FEHLER 3: Flaky Tests

# Feste Workflow ID
id="test-workflow"  # Mehrere Tests kollidieren!

RICHTIG:

# Unique ID pro Test
id=f"test-{uuid.uuid4()}"

Best Practices

  1. 80/15/5 Regel: 80% Unit, 15% Integration, 5% E2E
  2. Time-Skipping immer nutzen für Integration Tests
  3. Replay Tests in CI/CD vor jedem Deployment
  4. Production Histories regelmäßig exportieren und testen
  5. Activities mocken für schnelle, deterministische Tests
  6. Unique Workflow IDs für Test-Isolation
  7. pytest Fixtures für Wiederverwendbarkeit
  8. Test-Marker für selektives Ausführen

Testing Anti-Patterns

| Anti-Pattern | Warum schlecht? | Alternative |
|--------------|-----------------|-------------|
| Nur manuelle Tests | Langsam, fehleranfällig | Automatisierte pytest Suite |
| Keine Mocks | Tests brauchen externe Services | Mock Activities |
| Feste Workflow IDs | Tests beeinflussen sich | Unique IDs via uuid |
| Warten auf echte Zeit | Tests dauern Stunden/Tage | Time-Skipping |
| Kein Replay Testing | Breaking Changes unentdeckt | Replay in CI/CD |
| Nur Happy Path | Bugs in Edge Cases | Error Cases testen |

Nächste Schritte

Nach diesem Kapitel sollten Sie:

  1. Test Suite aufsetzen:

    mkdir tests
    touch tests/conftest.py tests/test_activities.py tests/test_workflows.py
    
  2. pytest konfigurieren:

    # pytest.ini
    [pytest]
    asyncio_mode = auto
    markers =
        unit: Unit tests
        integration: Integration tests
        replay: Replay tests
    
  3. CI/CD Pipeline erweitern:

    # .github/workflows/test.yml
    - name: Run tests
      run: pytest -v --cov
    
  4. Production History Export automatisieren:

    # Wöchentlicher Cron Job
    temporal workflow list --limit 100 | xargs -I {} temporal workflow show ...
    

Ressourcen


⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 13: Best Practices und Anti-Muster

Code-Beispiele für dieses Kapitel: examples/part-04/chapter-12/

Kapitel 13: Best Practices und Anti-Muster

Einleitung

Sie haben die Grundlagen von Temporal gelernt, Workflows geschrieben, Testing implementiert und Monitoring aufgesetzt. Ihr System läuft in Production. Doch dann kommt der Moment:

  • Ein Workflow bricht plötzlich mit Non-Determinismus-Fehlern ab
  • Die Event History überschreitet 50.000 Events und der Workflow wird terminiert
  • Worker können die Last nicht bewältigen, obwohl genug Ressourcen verfügbar sind
  • Ein vermeintlich kleines Refactoring führt zu Production-Incidents
  • Code-Reviews dauern Stunden, weil niemand die Workflow-Struktur versteht

Diese Probleme sind vermeidbar – wenn Sie von Anfang an bewährte Patterns folgen und häufige Anti-Patterns vermeiden.

Dieses Kapitel destilliert Jahre an Production-Erfahrung aus der Temporal-Community in konkrete, umsetzbare Guidelines. Sie lernen was funktioniert, was nicht funktioniert, und warum.

Das Grundproblem

Scenario: Ein Team entwickelt einen E-Commerce Workflow. Nach einigen Monaten in Production:

# ❌ ANTI-PATTERN: Alles in einem gigantischen Workflow

@workflow.defn
class MonolithWorkflow:
    """Ein 3000-Zeilen Monster-Workflow"""

    def __init__(self):
        self.orders = []  # ❌ Unbegrenzte Liste
        self.user_sessions = {}  # ❌ Wächst endlos
        self.cache = {}  # ❌ Memory Leak

    @workflow.run
    async def run(self, user_id: str):
        # ❌ Non-deterministic!
        if random.random() > 0.5:
            discount = 0.1

        # ❌ Business Logic im Workflow
        price = self.calculate_complex_pricing(...)

        # ❌ Externe API direkt aufrufen
        async with httpx.AsyncClient() as client:
            response = await client.post("https://payment.api/charge")

        # ❌ Workflow läuft ewig ohne Continue-As-New
        while True:
            order = await workflow.wait_condition(lambda: len(self.orders) > 0)
            # Process order...
            # Event History wächst ins Unendliche

        # ❌ Map-Iteration (random order!)
        for session_id, session in self.user_sessions.items():
            await self.process_session(session)

Konsequenzen nach 6 Monaten:

❌ Event History: 75.000 Events → Workflow terminiert
❌ Non-Determinismus bei Replay → 30% der Workflows brechen ab
❌ Worker Overload → Schedule-To-Start > 10 Minuten
❌ Deployment dauert 6 Stunden → Rollback bei jedem Change
❌ Debugging unmöglich → Team ist frustriert

Mit Best Practices:

# ✅ BEST PRACTICE: Clean, maintainable, production-ready

@dataclass
class OrderInput:
    """Single object input pattern"""
    user_id: str
    cart_items: List[str]
    discount_code: Optional[str] = None

@workflow.defn
class OrderWorkflow:
    """Focused workflow: Orchestrate, don't implement"""

    @workflow.run
    async def run(self, input: OrderInput) -> OrderResult:
        # ✅ Deterministic: All randomness in activities
        discount = await workflow.execute_activity(
            calculate_discount,
            input.discount_code,
            start_to_close_timeout=timedelta(seconds=30)
        )

        # ✅ Business logic in activities
        payment = await workflow.execute_activity(
            process_payment,
            PaymentInput(user_id=input.user_id, discount=discount),
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        # ✅ External calls in activities
        tracking = await workflow.execute_activity(
            create_shipment,
            payment.order_id,
            start_to_close_timeout=timedelta(hours=1)
        )

        return OrderResult(
            order_id=payment.order_id,
            tracking_number=tracking
        )

Resultat:

✓ Event History: ~20 Events pro Workflow
✓ 100% Replay Success Rate
✓ Schedule-To-Start: <100ms
✓ Zero-Downtime Deployments
✓ Debugging in Minuten statt Stunden

Lernziele

Nach diesem Kapitel können Sie:

  • Best Practices für Workflow-Design, Code-Organisation und Worker-Konfiguration anwenden
  • Anti-Patterns erkennen und vermeiden, bevor sie Production-Probleme verursachen
  • Determinismus garantieren durch korrekte Pattern-Anwendung
  • Performance optimieren durch Worker-Tuning und Event History Management
  • Code-Organisation strukturieren für Wartbarkeit und Skalierbarkeit
  • Production-Ready Workflows schreiben, die jahrelang laufen
  • Code Reviews durchführen mit klarer Checkliste
  • Refactorings sicher vornehmen ohne Breaking Changes

13.1 Workflow Design Best Practices

Orchestration vs. Implementation

Regel: Workflows orchestrieren, Activities implementieren.

# ❌ ANTI-PATTERN: Business Logic im Workflow

@workflow.defn
class PricingWorkflowBad:
    @workflow.run
    async def run(self, product_id: str) -> float:
        # ❌ Complex logic in workflow (non-testable, non-deterministic risk)
        base_price = 100.0

        # ❌ Time-based logic (non-deterministic!)
        current_hour = datetime.now().hour
        if current_hour >= 18:
            base_price *= 1.2  # Evening surge pricing

        # ❌ Heavy computation
        for i in range(1000000):
            base_price += math.sin(i) * 0.0001

        return base_price
# ✅ BEST PRACTICE: Orchestration only

@workflow.defn
class PricingWorkflowGood:
    @workflow.run
    async def run(self, product_id: str) -> float:
        # ✅ Delegate to activity
        price = await workflow.execute_activity(
            calculate_price,
            product_id,
            start_to_close_timeout=timedelta(seconds=30)
        )

        return price

# ✅ Logic in activity
@activity.defn
async def calculate_price(product_id: str) -> float:
    """Complex pricing logic isolated in activity"""
    base_price = await fetch_base_price(product_id)

    # Time-based logic OK in activity
    current_hour = datetime.now().hour
    if current_hour >= 18:
        base_price *= 1.2

    # Heavy computation OK in activity
    for i in range(1000000):
        base_price += math.sin(i) * 0.0001

    return base_price

Warum?

  • ✅ Workflows bleiben deterministisch
  • ✅ Activities sind unit-testbar
  • ✅ Retry-Logik funktioniert korrekt
  • ✅ Workflow History bleibt klein

Single Object Input Pattern

Regel: Ein Input-Objekt statt mehrere Parameter.

# ❌ ANTI-PATTERN: Multiple primitive arguments

@workflow.defn
class OrderWorkflowBad:
    @workflow.run
    async def run(
        self,
        user_id: str,
        product_id: str,
        quantity: int,
        discount: float,
        shipping_address: str,
        billing_address: str,
        gift_wrap: bool,
        express_shipping: bool
    ) -> str:
        # ❌ Signature-Änderungen brechen alte Workflows
        # ❌ Schwer zu lesen
        # ❌ Keine Validierung
        ...
# ✅ BEST PRACTICE: Single dataclass input

from dataclasses import dataclass
from typing import Optional

@dataclass
class OrderInput:
    """Order workflow input (versioned)"""
    user_id: str
    product_id: str
    quantity: int
    shipping_address: str

    # Optional fields für Evolution
    discount: Optional[float] = None
    billing_address: Optional[str] = None
    gift_wrap: bool = False
    express_shipping: bool = False

    def __post_init__(self):
        # ✅ Validation at input
        if self.quantity <= 0:
            raise ValueError("Quantity must be positive")

@workflow.defn
class OrderWorkflowGood:
    @workflow.run
    async def run(self, input: OrderInput) -> OrderResult:
        # ✅ Neue Felder hinzufügen ist safe
        # ✅ Validierung ist gekapselt
        # ✅ Lesbar und wartbar
        ...

Vorteile:

  • ✅ Einfacher zu erweitern (neue optionale Felder)
  • ✅ Bessere Validierung
  • ✅ Lesbarerer Code
  • ✅ Type-Safety
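
So könnte der Start mit einem Single-Object-Input aussehen (Skizze; Client, Task Queue und die konkreten Feldwerte sind Annahmen):

# Skizze: Workflow-Start mit Dataclass-Input
order = OrderInput(
    user_id="u-42",
    product_id="p-7",
    quantity=2,
    shipping_address="Musterstraße 1, 12345 Berlin",
    gift_wrap=True,  # später ergänztes optionales Feld – alte Starter bleiben kompatibel
)

handle = await client.start_workflow(
    OrderWorkflowGood.run,
    order,
    id=f"order-{order.user_id}",
    task_queue="order-queue",
)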

Continue-As-New für Long-Running Workflows

Regel: Verwenden Sie Continue-As-New, wenn Event History groß wird.

# ❌ ANTI-PATTERN: Endlos-Workflow ohne Continue-As-New

@workflow.defn
class UserSessionWorkflowBad:
    def __init__(self):
        self.pending_events = []  # via Signal befüllt
        self.events = []  # ❌ Wächst unbegrenzt

    @workflow.run
    async def run(self, user_id: str):
        while True:  # ❌ Läuft ewig
            await workflow.wait_condition(
                lambda: len(self.pending_events) > 0
            )
            event = self.pending_events.pop(0)
            self.events.append(event)  # ❌ Event History explodiert

            # Nach 1 Jahr: 50.000+ Events
            # → Workflow wird terminiert!
# ✅ BEST PRACTICE: Continue-As-New mit Limit

@workflow.defn
class UserSessionWorkflowGood:
    def __init__(self):
        self.pending_events = []  # via Signal befüllt
        self.processed_count = 0

    @workflow.run
    async def run(self, user_id: str, total_processed: int = 0):
        while True:
            # ✅ Check history size regularly
            info = workflow.info()
            if info.get_current_history_length() > 1000:
                workflow.logger.info(
                    f"History size: {info.get_current_history_length()}, "
                    "continuing as new"
                )
                # ✅ Continue with fresh history
                workflow.continue_as_new(
                    args=[user_id, total_processed + self.processed_count]
                )

            await workflow.wait_condition(
                lambda: len(self.pending_events) > 0
            )
            event = self.pending_events.pop(0)

            await workflow.execute_activity(
                process_event,
                event,
                start_to_close_timeout=timedelta(seconds=30)
            )

            self.processed_count += 1

Wann Continue-As-New verwenden:

  • Event History > 1.000 Events
  • Workflow läuft > 1 Jahr
  • State wächst unbegrenzt
  • Workflow ist ein “Entity Workflow” (z.B. User Session, Shopping Cart)

Limits:

  • ⚠️ Workflow terminiert automatisch bei 50.000 Events
  • ⚠️ Workflow terminiert bei 50 MB History Size
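
Alternativ zur festen Schwelle kann der Workflow den Hinweis des Temporal-Servers abfragen – eine kurze Skizze (Annahme: aktuelle Version des Python SDK):

# Skizze: Server-Hinweis statt fester Event-Schwelle
if workflow.info().is_continue_as_new_suggested():
    workflow.continue_as_new(
        args=[user_id, total_processed + self.processed_count]
    )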

13.2 Determinismus Best Practices

Alles Non-Deterministische in Activities

Regel: Workflows müssen deterministisch sein. Alles andere → Activity.

# ❌ ANTI-PATTERN: Non-deterministic workflow code

@workflow.defn
class FraudCheckWorkflowBad:
    @workflow.run
    async def run(self, transaction_id: str) -> bool:
        # ❌ random() ist non-deterministic!
        risk_score = random.random()

        # ❌ datetime.now() ist non-deterministic!
        if datetime.now().hour > 22:
            risk_score += 0.3

        # ❌ UUID generation non-deterministic!
        audit_id = str(uuid.uuid4())

        # ❌ Map iteration order non-deterministic!
        checks = {"ip": check_ip, "device": check_device}
        for check_name, check_fn in checks.items():  # ❌ Random order!
            await check_fn()

        return risk_score < 0.5
# ✅ BEST PRACTICE: Deterministic workflow

@workflow.defn
class FraudCheckWorkflowGood:
    @workflow.run
    async def run(self, transaction_id: str) -> bool:
        # ✅ Random logic in activity
        risk_score = await workflow.execute_activity(
            calculate_risk_score,
            transaction_id,
            start_to_close_timeout=timedelta(seconds=30)
        )

        # ✅ Time-based logic in activity
        time_modifier = await workflow.execute_activity(
            get_time_based_modifier,
            start_to_close_timeout=timedelta(seconds=5)
        )

        # ✅ UUID generation in activity
        audit_id = await workflow.execute_activity(
            generate_audit_id,
            start_to_close_timeout=timedelta(seconds=5)
        )

        # ✅ Deterministic iteration order
        check_names = sorted(["ip", "device", "location"])  # ✅ Sorted!
        for check_name in check_names:
            result = await workflow.execute_activity(
                run_fraud_check,
                FraudCheckInput(transaction_id, check_name),
                start_to_close_timeout=timedelta(seconds=30)
            )

        return risk_score + time_modifier < 0.5

# ✅ Non-deterministic logic in activities
@activity.defn
async def calculate_risk_score(transaction_id: str) -> float:
    """Random logic OK in activity"""
    return random.random()

@activity.defn
async def get_time_based_modifier() -> float:
    """Time-based logic OK in activity"""
    if datetime.now().hour > 22:
        return 0.3
    return 0.0

@activity.defn
async def generate_audit_id() -> str:
    """UUID generation OK in activity"""
    return str(uuid.uuid4())

Non-Deterministische Operationen:

Operation                | Wo?         | Warum?
random.random()          | ❌ Workflow | Replay generiert anderen Wert
datetime.now()           | ❌ Workflow | Replay hat andere Zeit
uuid.uuid4()             | ❌ Workflow | Replay generiert andere UUID
time.time()              | ❌ Workflow | Replay hat anderen Timestamp
dict.items() Iteration   | ❌ Workflow | Order ist non-deterministic in Python <3.7
set Iteration            | ❌ Workflow | Order ist non-deterministic
External API Calls       | ❌ Workflow | Response kann sich ändern
File I/O                 | ❌ Workflow | Datei-Inhalt kann sich ändern
Database Queries         | ❌ Workflow | Daten können sich ändern

✅ Alle diese Operationen sind OK in Activities!
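
Hinweis: Für Zeit, Zufall und UUIDs bietet das Python SDK zusätzlich deterministische Ersatzfunktionen direkt im Workflow an – eine kleine Skizze (Annahme: aktuelle SDK-Version):

# Deterministische Alternativen innerhalb des Workflow-Codes
now = workflow.now()               # statt datetime.now(): bei Replay identischer Wert
rng = workflow.random()            # deterministischer random.Random pro Workflow-Run
audit_id = str(workflow.uuid4())   # deterministische UUID statt uuid.uuid4()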


Workflow-Code-Order nie ändern

Regel: Activity-Aufrufe dürfen nicht umgeordnet werden.

# v1: Original Workflow
@workflow.defn
class OnboardingWorkflowV1:
    @workflow.run
    async def run(self, user_id: str):
        # Step 1: Validate
        await workflow.execute_activity(
            validate_user,
            user_id,
            start_to_close_timeout=timedelta(seconds=30)
        )

        # Step 2: Create account
        await workflow.execute_activity(
            create_account,
            user_id,
            start_to_close_timeout=timedelta(seconds=30)
        )
# ❌ v2-bad: Reihenfolge geändert (NON-DETERMINISTIC!)

@workflow.defn
class OnboardingWorkflowV2Bad:
    @workflow.run
    async def run(self, user_id: str):
        # ❌ FEHLER: Reihenfolge geändert!
        # Step 1: Create account (war vorher Step 2)
        await workflow.execute_activity(
            create_account,  # ❌ Replay erwartet validate_user!
            user_id,
            start_to_close_timeout=timedelta(seconds=30)
        )

        # Step 2: Validate (war vorher Step 1)
        await workflow.execute_activity(
            validate_user,  # ❌ Replay erwartet create_account!
            user_id,
            start_to_close_timeout=timedelta(seconds=30)
        )

Was passiert bei Replay:

History Event: ActivityTaskScheduled(activity_name="validate_user")
Replayed Code: workflow.execute_activity(create_account, ...)

❌ ERROR: Non-deterministic workflow!
   Expected: validate_user
   Got: create_account
# ✅ v2-good: Mit workflow.patched() ist Order-Änderung safe

@workflow.defn
class OnboardingWorkflowV2Good:
    @workflow.run
    async def run(self, user_id: str):
        if workflow.patched("reorder-validation-v2"):
            # ✅ NEW CODE PATH: Neue Reihenfolge
            await workflow.execute_activity(create_account, ...)
            await workflow.execute_activity(validate_user, ...)
        else:
            # ✅ OLD CODE PATH: Alte Reihenfolge für Replay
            await workflow.execute_activity(validate_user, ...)
            await workflow.execute_activity(create_account, ...)

13.3 State Management Best Practices

Vermeiden Sie großen Workflow-State

Regel: Workflow-State klein halten. Große Daten in Activities oder extern speichern.

# ❌ ANTI-PATTERN: Große Daten im Workflow State

@workflow.defn
class DataProcessingWorkflowBad:
    def __init__(self):
        self.processed_records = []  # ❌ Wächst unbegrenzt!
        self.results = {}  # ❌ Kann riesig werden!

    @workflow.run
    async def run(self, dataset_id: str):
        # ❌ 1 Million Records in Memory
        records = await workflow.execute_activity(
            fetch_all_records,  # Returns 1M records
            dataset_id,
            start_to_close_timeout=timedelta(minutes=10)
        )

        for record in records:
            result = await workflow.execute_activity(
                process_record,
                record,
                start_to_close_timeout=timedelta(seconds=30)
            )
            self.processed_records.append(record)  # ❌ State explodiert!
            self.results[record.id] = result  # ❌ Speichert alles!

        # Event History: 50 MB+ → Workflow terminiert!
# ✅ BEST PRACTICE: Minimaler State, externe Speicherung

@workflow.defn
class DataProcessingWorkflowGood:
    def __init__(self):
        self.processed_count = 0  # ✅ Nur Counter
        self.batch_id = None  # ✅ Nur ID

    @workflow.run
    async def run(self, dataset_id: str):
        # ✅ Activity gibt nur Batch-ID zurück (nicht die Daten!)
        self.batch_id = await workflow.execute_activity(
            create_processing_batch,
            dataset_id,
            start_to_close_timeout=timedelta(minutes=1)
        )

        # ✅ Activity returned nur Count
        total_records = await workflow.execute_activity(
            get_record_count,
            self.batch_id,
            start_to_close_timeout=timedelta(seconds=30)
        )

        # ✅ Process in batches
        batch_size = 1000
        for offset in range(0, total_records, batch_size):
            # ✅ Activity verarbeitet Batch und speichert extern
            processed = await workflow.execute_activity(
                process_batch,
                ProcessBatchInput(self.batch_id, offset, batch_size),
                start_to_close_timeout=timedelta(minutes=5)
            )

            self.processed_count += processed  # ✅ Nur Counter im State

        # ✅ Final result aus externer DB
        return await workflow.execute_activity(
            finalize_batch,
            self.batch_id,
            start_to_close_timeout=timedelta(minutes=1)
        )

# ✅ Activities speichern große Daten extern
@activity.defn
async def process_batch(input: ProcessBatchInput) -> int:
    """Process batch and store results in external DB"""
    records = fetch_records_from_db(input.batch_id, input.offset, input.limit)

    results = []
    for record in records:
        result = process_record(record)
        results.append(result)

    # ✅ Store in external database (S3, PostgreSQL, etc.)
    store_results_in_db(input.batch_id, results)

    return len(results)  # ✅ Return only count, not data

Best Practices:

  • ✅ Speichern Sie IDs, nicht Daten
  • ✅ Verwenden Sie Counters statt Listen
  • ✅ Große Daten in Activities → S3, DB, Redis
  • ✅ Workflow State < 1 KB ideal

Query Handlers sind Read-Only

Regel: Queries dürfen niemals State mutieren.

# ❌ ANTI-PATTERN: Query mutiert State

@workflow.defn
class OrderWorkflowBad:
    def __init__(self):
        self.status = "pending"
        self.view_count = 0

    @workflow.query
    def get_status(self) -> str:
        self.view_count += 1  # ❌ MUTATION in Query!
        return self.status  # ❌ Non-deterministic!

Warum ist das schlimm?

  • Queries werden nicht in History gespeichert
  • Bei Replay werden Queries nicht ausgeführt
  • State ist nach Replay anders als vor Replay
  • Non-Determinismus!
# ✅ BEST PRACTICE: Read-only Queries

@workflow.defn
class OrderWorkflowGood:
    def __init__(self):
        self.status = "pending"
        self.view_count = 0  # Tracked via Signal instead

    @workflow.query
    def get_status(self) -> str:
        """Read-only query"""
        return self.status  # ✅ No mutation

    @workflow.signal
    def track_view(self):
        """Use Signal for mutations"""
        self.view_count += 1  # ✅ Signal ist in History
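
Client-seitig könnte die Nutzung so aussehen (Skizze; Workflow-ID und Client-Setup sind Annahmen):

# Skizze: Lesen per Query, Mutation per Signal
handle = client.get_workflow_handle("order-12345")

status = await handle.query(OrderWorkflowGood.get_status)  # read-only, kein History-Event
await handle.signal(OrderWorkflowGood.track_view)          # Mutation landet als Signal in der History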

13.4 Code-Organisation Best Practices

Struktur: Workflows, Activities, Worker getrennt

Regel: Klare Trennung zwischen Workflows, Activities und Worker.

# ❌ ANTI-PATTERN: Alles in einer Datei

my_project/
  └── main.py  # 5000 Zeilen: Workflows, Activities, Worker, Client, alles!
# ✅ BEST PRACTICE: Modulare Struktur

my_project/
  ├── workflows/
  │   ├── __init__.py
  │   ├── order_workflow.py          # ✅ Ein Workflow pro File
  │   ├── payment_workflow.py
  │   └── shipping_workflow.py
  │
  ├── activities/
  │   ├── __init__.py
  │   ├── order_activities.py        # ✅ Activities grouped by domain
  │   ├── payment_activities.py
  │   ├── shipping_activities.py
  │   └── shared_activities.py       # ✅ Shared utilities
  │
  ├── models/
  │   ├── __init__.py
  │   ├── order_models.py            # ✅ Dataclasses für Inputs/Outputs
  │   └── payment_models.py
  │
  ├── workers/
  │   ├── __init__.py
  │   ├── order_worker.py            # ✅ Worker per domain
  │   └── payment_worker.py
  │
  ├── client/
  │   └── temporal_client.py         # ✅ Client-Setup
  │
  └── tests/
      ├── test_workflows/
      ├── test_activities/
      └── test_integration/

Beispiel: Order Workflow strukturiert

# workflows/order_workflow.py
from models.order_models import OrderInput, OrderResult
from activities.order_activities import validate_order, process_payment

@workflow.defn
class OrderWorkflow:
    """Order processing workflow"""

    @workflow.run
    async def run(self, input: OrderInput) -> OrderResult:
        # Clean orchestration only
        ...

# activities/order_activities.py
@activity.defn
async def validate_order(input: OrderInput) -> bool:
    """Validate order data"""
    ...

@activity.defn
async def process_payment(order_id: str) -> PaymentResult:
    """Process payment"""
    ...

# models/order_models.py
@dataclass
class OrderInput:
    """Order workflow input"""
    order_id: str
    user_id: str
    items: List[OrderItem]

@dataclass
class OrderResult:
    """Order workflow result"""
    order_id: str
    status: str
    tracking_number: str

# workers/order_worker.py
async def main():
    """Order worker entrypoint"""
    client = await create_temporal_client()

    worker = Worker(
        client,
        task_queue="order-queue",
        workflows=[OrderWorkflow],
        activities=[validate_order, process_payment]
    )

    await worker.run()

Vorteile:

  • ✅ Testbarkeit: Jede Komponente isoliert testbar
  • ✅ Wartbarkeit: Klare Zuständigkeiten
  • ✅ Code Reviews: Kleinere, fokussierte Files
  • ✅ Onboarding: Neue Entwickler finden sich schnell zurecht

Worker pro Domain/Use Case

Regel: Separate Workers für verschiedene Domains.

# ❌ ANTI-PATTERN: Ein Monolith-Worker für alles

async def main():
    worker = Worker(
        client,
        task_queue="everything-queue",  # ❌ Alle Workflows auf einer Queue
        workflows=[
            OrderWorkflow,
            PaymentWorkflow,
            ShippingWorkflow,
            UserWorkflow,
            NotificationWorkflow,
            ReportWorkflow,
            # ... 50+ Workflows
        ],
        activities=[
            # ... 200+ Activities
        ]
    )
    # ❌ Probleme:
    # - Kann nicht unabhängig skaliert werden
    # - Deployment ist All-or-Nothing
    # - Ein Bug betrifft alle Workflows
# ✅ BEST PRACTICE: Worker pro Domain

# workers/order_worker.py
async def run_order_worker():
    """Dedicated worker for order workflows"""
    client = await create_temporal_client()

    worker = Worker(
        client,
        task_queue="order-queue",  # ✅ Dedicated queue
        workflows=[OrderWorkflow],
        activities=[
            validate_order,
            process_payment,
            reserve_inventory,
            create_shipment
        ]
    )

    await worker.run()

# workers/notification_worker.py
async def run_notification_worker():
    """Dedicated worker for notifications"""
    client = await create_temporal_client()

    worker = Worker(
        client,
        task_queue="notification-queue",  # ✅ Dedicated queue
        workflows=[NotificationWorkflow],
        activities=[
            send_email,
            send_sms,
            send_push_notification
        ]
    )

    await worker.run()

Deployment:

# kubernetes/order-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-worker
spec:
  replicas: 5  # ✅ Skaliert unabhängig
  template:
    spec:
      containers:
      - name: order-worker
        image: myapp/order-worker:v2.3.0  # ✅ Unabhängige Versionen

---
# kubernetes/notification-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notification-worker
spec:
  replicas: 10  # ✅ Mehr Replicas für hohe Last
  template:
    spec:
      containers:
      - name: notification-worker
        image: myapp/notification-worker:v1.5.0  # ✅ Andere Version OK

Vorteile:

  • ✅ Unabhängige Skalierung
  • ✅ Unabhängige Deployments
  • ✅ Blast Radius Isolation
  • ✅ Team Autonomie

13.5 Worker Configuration Best Practices

Immer mehr als ein Worker

Regel: Production braucht mehrere Workers pro Queue – mindestens 2, besser 3.

# ❌ ANTI-PATTERN: Single Worker in Production

# ❌ Single Point of Failure!
# Wenn dieser Worker crashed:
#   → Alle Tasks bleiben liegen
#   → Schedule-To-Start explodiert
#   → Workflows timeout

docker run my-worker:latest  # ❌ Nur 1 Instance
# ✅ BEST PRACTICE: Multiple Workers für HA

# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-worker
spec:
  replicas: 3  # ✅ Minimum 3 für High Availability
  template:
    spec:
      containers:
      - name: worker
        image: my-worker:latest
        env:
        - name: TEMPORAL_TASK_QUEUE
          value: "order-queue"

        # ✅ Resource Limits
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

        # ✅ Health Checks
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

        # ✅ Graceful Shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]

Warum mehrere Workers?

  • ✅ High Availability: Worker-Crash betrifft nur Teil der Kapazität
  • ✅ Rolling Updates: Zero-Downtime Deployments
  • ✅ Load Balancing: Temporal verteilt automatisch
  • ✅ Redundanz: Hardware-Failure resilient
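
Ergänzend zum preStop-Hook lässt sich das Graceful Shutdown auch im Worker-Prozess selbst umsetzen, z.B. über einen SIGTERM-Handler – eine Skizze (Annahme: create_temporal_client und die Workflows/Activities wie in den übrigen Beispielen):

# Skizze: Graceful Shutdown des Workers bei SIGTERM
import asyncio
import signal

async def main():
    client = await create_temporal_client()
    worker = Worker(
        client,
        task_queue="order-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment],
    )

    stop = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)

    run_task = asyncio.create_task(worker.run())
    await stop.wait()
    await worker.shutdown()  # leitet das Herunterfahren ein und wartet auf den Abschluss
    await run_task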

Worker Tuning

Regel: Tunen Sie Worker basierend auf Schedule-To-Start Metrics.

# ❌ ANTI-PATTERN: Default Settings in Production

worker = Worker(
    client,
    task_queue="order-queue",
    workflows=[OrderWorkflow],
    activities=[process_payment, create_shipment]
    # ❌ Keine Tuning-Parameter
    # → Worker kann überlastet werden
    # → Oder underutilized sein
)
# ✅ BEST PRACTICE: Getunter Worker

from temporalio.worker import Worker

worker = Worker(
    client,
    task_queue="order-queue",
    workflows=[OrderWorkflow],
    activities=[process_payment, create_shipment],

    # ✅ Max concurrent Workflow Tasks
    max_concurrent_workflow_tasks=100,  # Default: 100

    # ✅ Max concurrent Activity Tasks
    max_concurrent_activities=50,  # Default: 100

    # ✅ Max concurrent Local Activities
    max_concurrent_local_activities=200,  # Default: 200

    # ✅ Workflow Cache Size
    max_cached_workflows=500,  # Default: 600

    # ✅ Sticky Queue Schedule-To-Start Timeout
    sticky_queue_schedule_to_start_timeout=timedelta(seconds=5)
)

Tuning Guidelines:

Metric                     | Wert     | Aktion
Schedule-To-Start > 1s     | Steigend | Mehr Workers oder max_concurrent erhöhen
Schedule-To-Start < 100ms  | Konstant | ✅ Optimal
Worker CPU > 80%           | Konstant | Weniger Concurrency oder mehr Workers
Worker Memory > 80%        | Steigend | max_cached_workflows reduzieren

Monitoring-basiertes Tuning:

# workers/tuned_worker.py
import logging
import os

# ✅ Environment-based tuning
MAX_WORKFLOW_TASKS = int(os.getenv("MAX_WORKFLOW_TASKS", "100"))
MAX_ACTIVITIES = int(os.getenv("MAX_ACTIVITIES", "50"))

async def main():
    client = await create_temporal_client()

    worker = Worker(
        client,
        task_queue="order-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment],
        max_concurrent_workflow_tasks=MAX_WORKFLOW_TASKS,
        max_concurrent_activities=MAX_ACTIVITIES
    )

    logging.info(
        f"Starting worker with "
        f"max_workflow_tasks={MAX_WORKFLOW_TASKS}, "
        f"max_activities={MAX_ACTIVITIES}"
    )

    await worker.run()
# Deployment mit tuning
kubectl set env deployment/order-worker \
  MAX_WORKFLOW_TASKS=200 \
  MAX_ACTIVITIES=100

# ✅ Live-Tuning ohne Code-Change!

13.6 Performance Best Practices

Sandbox Performance Optimization

Regel: Pass deterministic modules through für bessere Performance.

# ❌ ANTI-PATTERN: Langsamer Sandbox (alles wird gesandboxed)

from temporalio.worker import Worker

worker = Worker(
    client,
    task_queue="order-queue",
    workflows=[OrderWorkflow],
    activities=[process_payment]
    # ❌ Alle Module werden gesandboxed
    # → Pydantic Models sind sehr langsam
)
# ✅ BEST PRACTICE: Optimierter Sandbox

from temporalio.worker import Worker
from temporalio.worker.workflow_sandbox import SandboxedWorkflowRunner, SandboxRestrictions

# ✅ Pass-through für deterministische Module
passthrough_modules = [
    "pydantic",  # ✅ Pydantic ist deterministisch
    "dataclasses",  # ✅ Dataclasses sind deterministisch
    "models",  # ✅ Unsere eigenen Models
    "workflows.order_models",  # ✅ Order-spezifische Models
]

worker = Worker(
    client,
    task_queue="order-queue",
    workflows=[OrderWorkflow],
    activities=[process_payment],

    # ✅ Custom Sandbox Configuration
    workflow_runner=SandboxedWorkflowRunner(
        restrictions=SandboxRestrictions.default.with_passthrough_modules(
            *passthrough_modules
        )
    )
)

# ✅ Resultat: 5-10x schnellerer Workflow-Start!

Event History Size Monitoring

Regel: Monitoren Sie History Size und reagieren Sie frühzeitig.

# ✅ BEST PRACTICE: History Size Monitoring im Workflow

@workflow.defn
class LongRunningWorkflow:
    @workflow.run
    async def run(self, input: JobInput):
        processed = 0

        for item in input.items:
            # ✅ Regelmäßig History Size checken
            info = workflow.info()
            history_length = info.get_current_history_length()

            if history_length > 8000:  # ✅ Warning bei 8k (limit: 50k)
                workflow.logger.warning(
                    f"History size: {history_length} events, "
                    "approaching limit (50k). Consider Continue-As-New."
                )

            if history_length > 10000:  # ✅ Continue-As-New bei 10k
                workflow.logger.info(
                    f"History size: {history_length}, continuing as new"
                )
                workflow.continue_as_new(
                    JobInput(
                        items=input.items[processed:],
                        total_processed=input.total_processed + processed
                    )
                )

            result = await workflow.execute_activity(
                process_item,
                item,
                start_to_close_timeout=timedelta(seconds=30)
            )

            processed += 1

Prometheus Metrics:

# workers/metrics.py
from prometheus_client import Histogram, Counter

workflow_history_size = Histogram(
    'temporal_workflow_history_size',
    'Workflow history event count',
    buckets=[10, 50, 100, 500, 1000, 5000, 10000, 50000]
)

continue_as_new_counter = Counter(
    'temporal_continue_as_new_total',
    'Continue-As-New executions'
)

# Im Workflow
workflow_history_size.observe(history_length)

if history_length > 10000:
    continue_as_new_counter.inc()
    workflow.continue_as_new(...)

13.7 Anti-Pattern Katalog

1. SDK Over-Wrapping

Anti-Pattern: Temporal SDK zu stark wrappen.

# ❌ ANTI-PATTERN: Zu starkes Wrapping versteckt Features

class MyTemporalWrapper:
    """❌ Versteckt wichtige Temporal-Features"""

    def __init__(self, namespace: str):
        # ❌ Versteckt Client-Konfiguration
        self.client = Client.connect(namespace)

    async def run_workflow(self, name: str, data: dict):
        # ❌ Kein Zugriff auf:
        #   - Workflow ID customization
        #   - Retry Policies
        #   - Timeouts
        #   - Signals/Queries
        return await self.client.execute_workflow(name, data)

    # ❌ SDK-Updates sind schwierig
    # ❌ Team kennt Temporal nicht wirklich
    # ❌ Features wie Schedules, Updates nicht nutzbar
# ✅ BEST PRACTICE: Dünner Helper, voller SDK-Zugriff

# helpers/temporal_helpers.py
async def create_temporal_client(
    namespace: str = "default"
) -> Client:
    """Thin helper for client creation"""
    return await Client.connect(
        f"localhost:7233",
        namespace=namespace,
        # ✅ Weitere Config durchreichbar
    )

# Application code: Voller SDK-Zugriff
async def main():
    client = await create_temporal_client()

    # ✅ Direkter SDK-Zugriff für alle Features
    handle = await client.start_workflow(
        OrderWorkflow.run,
        order_input,
        id=f"order-{order_id}",
        task_queue="order-queue",
        retry_policy=RetryPolicy(maximum_attempts=3),
        execution_timeout=timedelta(days=7)
    )

    # ✅ Signals
    await handle.signal(OrderWorkflow.approve)

    # ✅ Queries
    status = await handle.query(OrderWorkflow.get_status)

2. Local Activities ohne Idempotenz

Anti-Pattern: Local Activities verwenden ohne Idempotenz-Keys.

# ❌ ANTI-PATTERN: Non-Idempotent Local Activity

@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, amount: float):
        # ❌ Local Activity (kann mehrfach ausgeführt werden!)
        await workflow.execute_local_activity(
            charge_credit_card,
            amount,
            start_to_close_timeout=timedelta(seconds=5)
        )
        # ❌ Bei Retry: Kunde wird doppelt belastet!

@activity.defn
async def charge_credit_card(amount: float):
    """❌ Nicht idempotent!"""
    # Charge without idempotency key
    await payment_api.charge(amount)  # ❌ Kann mehrfach passieren!

Was passiert:

1. Local Activity startet: charge_credit_card(100.0)
2. Payment API wird aufgerufen: $100 charged
3. Worker crashed vor Activity-Completion
4. Workflow replay: Local Activity wird NOCHMAL ausgeführt
5. Payment API wird NOCHMAL aufgerufen: $100 charged AGAIN
6. Kunde wurde $200 belastet statt $100!
# ✅ BEST PRACTICE: Idempotente Local Activity ODER Regular Activity

# Option 1: Idempotent Local Activity
@activity.defn
async def charge_credit_card_idempotent(
    amount: float,
    idempotency_key: str  # ✅ Idempotency Key!
):
    """✅ Idempotent mit Key"""
    await payment_api.charge(
        amount,
        idempotency_key=idempotency_key  # ✅ API merkt Duplikate
    )

@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, payment_id: str, amount: float):
        # ✅ Unique Key basierend auf Workflow
        idempotency_key = f"{workflow.info().workflow_id}-payment"

        await workflow.execute_local_activity(
            charge_credit_card_idempotent,
            args=[amount, idempotency_key],
            start_to_close_timeout=timedelta(seconds=5)
        )

# Option 2: Regular Activity (recommended!)
@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, amount: float):
        # ✅ Regular Activity: abgeschlossenes Ergebnis wird in der History vermerkt und bei Replay nicht erneut ausgeführt
        await workflow.execute_activity(
            charge_credit_card,
            amount,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

Regel: Verwenden Sie Regular Activities als Default. Local Activities nur für:

  • Sehr schnelle Operationen (<1s)
  • Read-Only Operationen
  • Operations mit eingebauter Idempotenz
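
Eine Skizze für einen solchen Fall – ein schneller, rein lesender Konfigurations-Lookup als Local Activity (get_feature_flag ist eine angenommene, read-only Activity):

# Skizze: Local Activity nur für schnelle, read-only Operationen
flag_enabled = await workflow.execute_local_activity(
    get_feature_flag,  # Annahme: liest nur, keine Seiteneffekte
    "new-checkout",
    start_to_close_timeout=timedelta(seconds=2),
)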

3. Workers Side-by-Side mit Application Code

Anti-Pattern: Workers im gleichen Process wie Application Code deployen.

# ❌ ANTI-PATTERN: Worker + Web Server im gleichen Process

# main.py
from fastapi import FastAPI
from temporalio.worker import Worker

app = FastAPI()

@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    """Web API endpoint"""
    ...

async def main():
    # ❌ Worker und Web Server im gleichen Process!
    client = await create_temporal_client()

    # Start Worker (blocking!)
    worker = Worker(
        client,
        task_queue="order-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment]
    )

    # ❌ Probleme:
    # - Worker blockiert Web Server (oder umgekehrt)
    # - Resource Contention (CPU/Memory)
    # - Deployment ist gekoppelt
    # - Scaling ist gekoppelt
    # - Ein Crash betrifft beides

    await worker.run()
# ✅ BEST PRACTICE: Separate Processes

# web_server.py (separate deployment)
from fastapi import FastAPI

app = FastAPI()

@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    """Web API endpoint"""
    client = await create_temporal_client()
    handle = client.get_workflow_handle(order_id)
    status = await handle.query(OrderWorkflow.get_status)
    return {"status": status}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

# worker.py (separate deployment)
from temporalio.worker import Worker

async def main():
    """Dedicated worker process"""
    client = await create_temporal_client()

    worker = Worker(
        client,
        task_queue="order-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment]
    )

    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())

Separate Deployments:

# kubernetes/web-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 10  # ✅ Viele Replicas für Web Traffic
  template:
    spec:
      containers:
      - name: web
        image: myapp/web:latest
        command: ["python", "web_server.py"]
        resources:
          requests:
            cpu: "200m"  # ✅ Wenig CPU für Web
            memory: "256Mi"

---
# kubernetes/worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker
spec:
  replicas: 3  # ✅ Weniger Replicas, aber mehr Ressourcen
  template:
    spec:
      containers:
      - name: worker
        image: myapp/worker:latest
        command: ["python", "worker.py"]
        resources:
          requests:
            cpu: "1000m"  # ✅ Mehr CPU für Worker
            memory: "2Gi"  # ✅ Mehr Memory für Workflow Caching

13.8 Production Readiness Checklist

Code-Ebene

✅ Workflows orchestrieren nur, implementieren nicht
✅ Single Object Input Pattern für alle Workflows
✅ Alle non-deterministic Operationen in Activities
✅ Continue-As-New für long-running Workflows
✅ History Size Monitoring implementiert
✅ Query Handlers sind read-only
✅ Replay Tests in CI/CD
✅ Comprehensive Unit Tests für Activities
✅ Integration Tests mit WorkflowEnvironment

Deployment-Ebene

✅ Minimum 3 Worker Replicas pro Queue
✅ Workers separiert von Application Code
✅ Resource Limits definiert (CPU/Memory)
✅ Health Checks konfiguriert
✅ Graceful Shutdown implementiert
✅ Worker pro Domain/Use Case
✅ Worker Tuning basierend auf Metrics
✅ Rolling Update Strategy konfiguriert

Monitoring-Ebene

✅ Schedule-To-Start Metrics
✅ Workflow Success/Failure Rate
✅ Activity Duration & Error Rate
✅ Event History Size Tracking
✅ Worker CPU/Memory Monitoring
✅ Continue-As-New Rate
✅ Alerts konfiguriert (PagerDuty/Slack)

Testing-Ebene

✅ Replay Tests für jede Workflow-Version
✅ Unit Tests für jede Activity
✅ Integration Tests für Happy Path
✅ Integration Tests für Error Cases
✅ Production History Replay in CI
✅ Load Testing für Worker Capacity
✅ Chaos Engineering Tests (Worker Failures)

13.9 Code Review Checkliste

Verwenden Sie diese Checkliste bei Code Reviews:

Workflow Code Review

✅ Workflow orchestriert nur (keine Business Logic)?
✅ Single Object Input statt multiple Parameters?
✅ Keine non-deterministic Operationen (random, datetime.now, etc.)?
✅ Keine Activity-Reihenfolge geändert ohne workflow.patched()?
✅ Continue-As-New für long-running Workflows?
✅ History Size Monitoring vorhanden?
✅ Workflow State klein (<1 KB)?
✅ Query Handlers sind read-only?
✅ Replay Tests hinzugefügt?

Activity Code Review

✅ Activity ist idempotent?
✅ Activity hat Retry-Logic (oder RetryPolicy)?
✅ Activity hat Timeout definiert?
✅ Activity ist unit-testbar?
✅ Externe Calls haben Circuit Breaker?
✅ Activity loggt Errors mit Context?
✅ Activity gibt strukturiertes Result zurück (nicht primitives)?

Worker Code Review

✅ Worker hat max_concurrent_* konfiguriert?
✅ Worker hat Health Check Endpoint?
✅ Worker hat Graceful Shutdown?
✅ Worker ist unabhängig deploybar?
✅ Worker hat Resource Limits?
✅ Worker hat Monitoring/Metrics?

13.10 Zusammenfassung

Top 10 Best Practices

  1. Orchestration, nicht Implementation: Workflows orchestrieren, Activities implementieren
  2. Single Object Input: Ein Dataclass-Input statt viele Parameter
  3. Determinismus: Alles Non-Deterministische in Activities
  4. Continue-As-New: Bei >1.000 Events oder long-running Workflows
  5. Minimaler State: IDs speichern, nicht Daten
  6. Code-Organisation: Workflows, Activities, Workers getrennt
  7. Multiple Workers: Minimum 3 Replicas in Production
  8. Worker Tuning: Basierend auf Schedule-To-Start Metrics
  9. Replay Testing: Jede Workflow-Änderung testen
  10. Monitoring: Schedule-To-Start, Success Rate, History Size

Top 10 Anti-Patterns

  1. Non-Determinismus: random(), datetime.now(), uuid.uuid4() im Workflow
  2. Activity-Reihenfolge ändern: Ohne workflow.patched()
  3. Große Event History: >10.000 Events ohne Continue-As-New
  4. Großer Workflow State: Listen/Dicts statt IDs
  5. Query Mutation: State in Query Handler ändern
  6. SDK Over-Wrapping: Temporal SDK zu stark abstrahieren
  7. Local Activities ohne Idempotenz: Duplikate werden nicht verhindert
  8. Single Worker: Kein Failover, kein Rolling Update
  9. Workers mit App Code: Resource Contention, gekoppeltes Deployment
  10. Fehlende Tests: Keine Replay Tests, keine Integration Tests

Quick Reference: Was ist OK wo?

Operation                   | Workflow     | Activity | Warum
random.random()             | ❌           | ✅       | Non-deterministic
datetime.now()              | ❌           | ✅       | Non-deterministic
uuid.uuid4()                | ❌           | ✅       | Non-deterministic
External API Call           | ❌           | ✅       | Non-deterministic
Database Query              | ❌           | ✅       | Non-deterministic
File I/O                    | ❌           | ✅       | Non-deterministic
Heavy Computation           | ❌           | ✅       | Should be retryable
workflow.sleep()            | ✅           | ❌       | Deterministic timer
workflow.execute_activity() | ✅           | ❌       | Workflow orchestration
State Management            | ✅ (minimal) | ❌       | Workflow owns state
Logging                     | ✅           | ✅       | Both OK

Nächste Schritte

Sie haben jetzt:

  • Best Practices für Production-Ready Workflows
  • Anti-Patterns Katalog zur Vermeidung häufiger Fehler
  • Code-Organisation Patterns für Wartbarkeit
  • Worker-Tuning Guidelines für Performance
  • Production Readiness Checkliste

In Teil V (Kochbuch) werden wir konkrete Rezepte für häufige Use Cases sehen:

  • E-Commerce Order Processing
  • Payment Processing with Retries
  • Long-Running Approval Workflows
  • Scheduled Cleanup Jobs
  • Fan-Out/Fan-In Patterns
  • Saga Pattern Implementation

⬆ Zurück zum Inhaltsverzeichnis

Nächstes Kapitel: Kapitel 14: Muster-Rezepte (Human-in-Loop, Cron, Order Fulfillment)

Code-Beispiele für dieses Kapitel: examples/part-04/chapter-13/

Kapitel 14: Muster-Rezepte (Human-in-Loop, Cron, Order Fulfillment)

In diesem Kapitel untersuchen wir drei bewährte Workflow-Muster, die in der Praxis häufig vorkommen und für die Temporal besonders gut geeignet ist. Diese Muster zeigen die Stärken von Temporal bei der Orchestrierung komplexer Geschäftsprozesse.

14.1 Überblick: Warum Muster-Rezepte?

Während wir in den vorherigen Kapiteln die Grundlagen von Temporal kennengelernt haben, geht es nun darum, wie man häufige Geschäftsszenarien elegant und robust implementiert. Die drei Muster, die wir behandeln werden, repräsentieren typische Herausforderungen in verteilten Systemen:

  • Human-in-the-Loop: Prozesse, die menschliche Eingaben oder Genehmigungen erfordern
  • Cron/Scheduling: Zeitgesteuerte, wiederkehrende Aufgaben
  • Order Fulfillment (Saga): Verteilte Transaktionen über mehrere Services hinweg

14.2 Human-in-the-Loop Pattern

Das Problem

Viele Geschäftsprozesse erfordern an bestimmten Punkten menschliche Entscheidungen oder Eingaben:

  • Genehmigung von Urlaubsanträgen
  • Durchführung von Hintergrundüberprüfungen (Background Checks)
  • Freigabe von Zahlungen über einem bestimmten Betrag
  • Klärung von Mehrdeutigkeiten in automatisierten Prozessen

Die Herausforderung besteht darin, dass diese menschlichen Interaktionen unvorhersehbar lange dauern können – von Minuten bis zu Tagen oder sogar Wochen.

Die Temporal-Lösung

Temporal ermöglicht es Workflows, auf menschliche Eingaben zu warten, ohne Ressourcen zu blockieren. Der Workflow kann für Stunden oder Tage “schlafen” und wird genau dort fortgesetzt, wo er gestoppt hat, sobald die Eingabe erfolgt.

Wichtige Konzepte:

  • Signals: Ermöglichen es, Daten in einen laufenden Workflow zu senden
  • Queries: Erlauben das Abfragen des aktuellen Workflow-Status
  • Timers: Können als Timeout für zu lange Wartezeiten dienen

Implementierungsbeispiel: Genehmigungsprozess

import asyncio

from temporalio import workflow
from datetime import timedelta

@workflow.defn
class ApprovalWorkflow:
    def __init__(self):
        self.approved = False
        self.rejection_reason = None

    @workflow.run
    async def run(self, request_data: dict) -> str:
        # 1. Sende Benachrichtigung an Genehmiger
        await workflow.execute_activity(
            send_approval_notification,
            request_data,
            start_to_close_timeout=timedelta(seconds=30)
        )

        # 2. Warte auf Genehmigung mit Timeout von 7 Tagen
        try:
            await workflow.wait_condition(
                lambda: self.approved or self.rejection_reason,
                timeout=timedelta(days=7)
            )
        except asyncio.TimeoutError:
            # Automatische Eskalation nach 7 Tagen
            await workflow.execute_activity(
                escalate_to_manager,
                request_data,
                start_to_close_timeout=timedelta(seconds=30)
            )
            # Warte weitere 3 Tage auf Manager
            await workflow.wait_condition(
                lambda: self.approved or self.rejection_reason,
                timeout=timedelta(days=3)
            )

        # 3. Verarbeite das Ergebnis
        if self.approved:
            await workflow.execute_activity(
                process_approval,
                request_data,
                start_to_close_timeout=timedelta(minutes=5)
            )
            return "approved"
        else:
            await workflow.execute_activity(
                notify_rejection,
                args=[request_data, self.rejection_reason],
                start_to_close_timeout=timedelta(seconds=30)
            )
            return f"rejected: {self.rejection_reason}"

    @workflow.signal
    async def approve(self):
        """Signal zum Genehmigen des Antrags"""
        self.approved = True

    @workflow.signal
    async def reject(self, reason: str):
        """Signal zum Ablehnen des Antrags"""
        self.rejection_reason = reason

    @workflow.query
    def get_status(self) -> dict:
        """Abfrage des aktuellen Status"""
        return {
            "approved": self.approved,
            "rejected": self.rejection_reason is not None,
            "pending": not self.approved and not self.rejection_reason
        }

Verwendung des Workflows:

# Workflow starten
handle = await client.start_workflow(
    ApprovalWorkflow.run,
    request_data,
    id="approval-12345",
    task_queue="approval-tasks"
)

# Status abfragen (jederzeit möglich)
status = await handle.query(ApprovalWorkflow.get_status)
print(f"Current status: {status}")

# Genehmigung senden (kann Tage später erfolgen)
await handle.signal(ApprovalWorkflow.approve)

# Auf Ergebnis warten
result = await handle.result()

Best Practices

  1. Timeouts verwenden: Implementiere immer Timeouts und Eskalationsmechanismen
  2. Status abfragbar machen: Nutze Queries, damit Benutzer den Status jederzeit prüfen können
  3. Benachrichtigungen senden: Informiere Menschen aktiv über ausstehende Aktionen
  4. Idempotenz beachten: Signals können mehrfach gesendet werden – handle dies entsprechend
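
Zum letzten Punkt: Doppelt gesendete Signale lassen sich direkt im Handler abfangen – eine Skizze für den obigen ApprovalWorkflow:

# Skizze: idempotenter Signal-Handler innerhalb von ApprovalWorkflow
@workflow.signal
async def approve(self):
    """Spätere Duplikate des Approve-Signals werden ignoriert"""
    if self.approved or self.rejection_reason is not None:
        workflow.logger.info("Entscheidung liegt bereits vor – Signal wird ignoriert")
        return
    self.approved = True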

14.3 Cron und Scheduling Pattern

Warum nicht einfach Cron?

Traditionelle Cron-Jobs haben mehrere Probleme:

  • Keine Visibilität in den Ausführungsstatus
  • Keine einfache Möglichkeit, Jobs zu pausieren oder zu stoppen
  • Schwierig zu testen und zu überwachen
  • Keine verlässlichen Ausführungsgarantien (verpasste Runs bei Host-Ausfällen, keine Exactly-Once-Semantik)
  • Kein eingebautes Retry-Verhalten

Temporal Schedules: Die bessere Alternative

Temporal Schedules bieten:

  • Vollständige Kontrolle: Start, Stop, Pause, Update zur Laufzeit
  • Observability: Einsicht in alle vergangenen und zukünftigen Ausführungen
  • Backfill: Nachträgliches Ausführen verpasster Runs
  • Overlap-Policies: Kontrolliere, was passiert, wenn ein Workflow noch läuft, während der nächste starten soll

Schedule-Optionen

1. Cron-Style Scheduling:

from temporalio.client import (
    Client,
    Schedule,
    ScheduleActionStartWorkflow,
    ScheduleIntervalSpec,
    ScheduleOverlapPolicy,
    SchedulePolicy,
    ScheduleSpec,
)
from datetime import datetime, timedelta

async def create_cron_schedule():
    client = await Client.connect("localhost:7233")

    await client.create_schedule(
        id="daily-report-schedule",
        schedule=Schedule(
            action=ScheduleActionStartWorkflow(
                GenerateReportWorkflow.run,
                args=["daily"],
                id=f"daily-report-{datetime.now().strftime('%Y%m%d')}",
                task_queue="reporting"
            ),
            spec=ScheduleSpec(
                # Jeden Tag um 6 Uhr morgens UTC
                cron_expressions=["0 6 * * *"],
            ),
            # Was tun bei Überlappungen?
            policy=SchedulePolicy(
                overlap=ScheduleOverlapPolicy.SKIP,  # Überspringe, wenn noch läuft
            )
        )
    )

Cron-Format in Temporal:

┌───────────── Minute (0-59)
│ ┌───────────── Stunde (0-23)
│ │ ┌───────────── Tag des Monats (1-31)
│ │ │ ┌───────────── Monat (1-12)
│ │ │ │ ┌───────────── Tag der Woche (0-6, Sonntag = 0)
│ │ │ │ │
* * * * *

Beispiele:

  • 0 9 * * 1-5: Werktags um 9 Uhr
  • */15 * * * *: Alle 15 Minuten
  • 0 0 1 * *: Am ersten Tag jeden Monats um Mitternacht

2. Interval-basiertes Scheduling:

await client.create_schedule(
    id="health-check-schedule",
    schedule=Schedule(
        action=ScheduleActionStartWorkflow(
            workflow_type=HealthCheckWorkflow,
            task_queue="monitoring"
        ),
        spec=ScheduleSpec(
            # Alle 5 Minuten
            intervals=[ScheduleIntervalSpec(
                every=timedelta(minutes=5)
            )],
        )
    )
)

Overlap-Policies

Was passiert, wenn ein Workflow noch läuft, während der nächste geplant ist?

from temporalio.client import SchedulePolicy, ScheduleOverlapPolicy

# SKIP: Überspringe die neue Ausführung
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.SKIP)

# BUFFER_ONE: Führe maximal eine weitere Ausführung in der Warteschlange
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.BUFFER_ONE)

# BUFFER_ALL: Puffere alle Ausführungen (Vorsicht: kann zu Stau führen!)
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.BUFFER_ALL)

# CANCEL_OTHER: Breche den laufenden Workflow ab und starte neu
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.CANCEL_OTHER)

# ALLOW_ALL: Erlaube parallele Ausführungen
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.ALLOW_ALL)

Schedule-Management

# Schedule abrufen
schedule_handle = client.get_schedule_handle("daily-report-schedule")

# Beschreibung abrufen
description = await schedule_handle.describe()
print(f"Next 5 runs: {description.info.next_action_times[:5]}")

# Pausieren
await schedule_handle.pause(note="Maintenance window")

# Wieder aktivieren
await schedule_handle.unpause(note="Maintenance complete")

# Einmalig manuell auslösen
await schedule_handle.trigger(overlap=ScheduleOverlapPolicy.ALLOW_ALL)

# Backfill: Verpasste Ausführungen nachholen
await schedule_handle.backfill(
    ScheduleBackfill(  # ScheduleBackfill kommt ebenfalls aus temporalio.client
        start_at=datetime(2024, 1, 1),
        end_at=datetime(2024, 1, 31),
        overlap=ScheduleOverlapPolicy.ALLOW_ALL
    )
)

# Schedule löschen
await schedule_handle.delete()

Workflow-Implementierung für Schedules

@workflow.defn
class DataSyncWorkflow:
    @workflow.run
    async def run(self) -> dict:
        # Workflow weiß, ob er via Schedule gestartet wurde
        info = workflow.info()

        workflow.logger.info(
            f"Running scheduled sync. Attempt: {info.attempt}"
        )

        # Normale Workflow-Logik
        records = await workflow.execute_activity(
            fetch_new_records,
            start_to_close_timeout=timedelta(minutes=10)
        )

        await workflow.execute_activity(
            sync_to_database,
            records,
            start_to_close_timeout=timedelta(minutes=5)
        )

        return {
            "synced_records": len(records),
            "timestamp": datetime.now().isoformat()
        }

Best Practices für Scheduling

  1. Idempotenz: Schedules können Workflows mehrfach starten – stelle sicher, dass deine Logik idempotent ist
  2. Monitoring: Nutze Temporal UI, um verpasste oder fehlgeschlagene Runs zu überwachen
  3. Overlap-Policy wählen: Überlege genau, was bei Überlappungen passieren soll
  4. Zeitzone beachten: Cron-Ausdrücke werden standardmäßig in UTC interpretiert
  5. Workflow-IDs: Verwende dynamische Workflow-IDs mit Zeitstempel, um Duplikate zu vermeiden

14.4 Order Fulfillment mit dem Saga Pattern

Das Problem: Verteilte Transaktionen

Stellen wir uns einen E-Commerce-Bestellprozess vor, der mehrere Services involviert:

  1. Inventory Service: Prüfe Verfügbarkeit und reserviere Artikel
  2. Payment Service: Belaste Kreditkarte
  3. Shipping Service: Erstelle Versandetikett und beauftrage Versand
  4. Notification Service: Sende Bestätigungsmail

Was passiert, wenn Schritt 3 fehlschlägt, nachdem wir bereits Schritt 1 und 2 ausgeführt haben? Wir müssen:

  • Die Kreditkartenbelastung rückgängig machen (Schritt 2)
  • Die Inventar-Reservierung aufheben (Schritt 1)

Dies ist das klassische Problem verteilter Transaktionen: Entweder alle Schritte erfolgreich, oder keiner.

Das Saga Pattern

Ein Saga ist eine Sequenz von lokalen Transaktionen, wobei jede Transaktion eine Kompensation (Rückgängigmachung) hat. Falls ein Schritt fehlschlägt, werden alle vorherigen Schritte durch ihre Kompensationen rückgängig gemacht.

Zwei Hauptkomponenten:

  1. Forward-Recovery: Die normalen Schritte vorwärts
  2. Backward-Recovery (Compensations): Die Rückgängigmachung bei Fehler

Temporal vereinfacht Sagas

Ohne Temporal müsstest du:

  • Selbst den Fortschritt tracken (Event Sourcing)
  • Retry-Logik implementieren
  • State Management über Services hinweg
  • Crash-Recovery-Mechanismen bauen

Mit Temporal bekommst du all das kostenlos. Du musst nur die Kompensationen definieren.

Implementierung: Order Fulfillment

from temporalio import workflow, activity
from temporalio.common import RetryPolicy
from datetime import timedelta
from dataclasses import dataclass
from typing import Optional

@dataclass
class OrderInfo:
    order_id: str
    customer_id: str
    items: list[dict]
    total_amount: float
    idempotency_key: str  # Wichtig für Idempotenz!

@dataclass
class SagaCompensation:
    activity_name: str
    args: list

class Saga:
    """Helper-Klasse zum Verwalten von Kompensationen"""

    def __init__(self):
        self.compensations: list[SagaCompensation] = []

    def add_compensation(self, activity_name: str, *args):
        """Füge eine Kompensation hinzu"""
        self.compensations.append(
            SagaCompensation(activity_name, list(args))
        )

    async def compensate(self):
        """Führe alle Kompensationen in umgekehrter Reihenfolge aus"""
        # LIFO: Last In, First Out
        for comp in reversed(self.compensations):
            try:
                await workflow.execute_activity(
                    comp.activity_name,
                    args=comp.args,
                    start_to_close_timeout=timedelta(minutes=5),
                    retry_policy=RetryPolicy(
                        maximum_attempts=5,
                        initial_interval=timedelta(seconds=1),
                        maximum_interval=timedelta(seconds=60)
                    )
                )
                workflow.logger.info(f"Compensated: {comp.activity_name}")
            except Exception as e:
                workflow.logger.error(
                    f"Compensation failed for {comp.activity_name}: {e}"
                )
                # In Produktion: Dead Letter Queue, Alerting, etc.

@workflow.defn
class OrderFulfillmentWorkflow:
    @workflow.run
    async def run(self, order: OrderInfo) -> dict:
        saga = Saga()

        try:
            # Schritt 1: Inventar prüfen und reservieren
            inventory_reserved = await workflow.execute_activity(
                reserve_inventory,
                order,
                start_to_close_timeout=timedelta(minutes=2)
            )
            # Kompensation hinzufügen: Reservierung aufheben
            saga.add_compensation(
                "release_inventory",
                order.order_id,
                order.idempotency_key
            )
            workflow.logger.info(f"Inventory reserved: {inventory_reserved}")

            # Schritt 2: Zahlung durchführen
            payment_result = await workflow.execute_activity(
                charge_payment,
                order,
                start_to_close_timeout=timedelta(minutes=5)
            )
            # Kompensation hinzufügen: Zahlung erstatten
            saga.add_compensation(
                "refund_payment",
                payment_result["transaction_id"],
                order.total_amount,
                order.idempotency_key
            )
            workflow.logger.info(f"Payment charged: {payment_result}")

            # Schritt 3: Versand erstellen
            shipping_result = await workflow.execute_activity(
                create_shipment,
                order,
                start_to_close_timeout=timedelta(minutes=3)
            )
            # Kompensation hinzufügen: Versand stornieren
            saga.add_compensation(
                "cancel_shipment",
                shipping_result["shipment_id"],
                order.idempotency_key
            )
            workflow.logger.info(f"Shipment created: {shipping_result}")

            # Schritt 4: Bestätigung senden (keine Kompensation nötig)
            await workflow.execute_activity(
                send_confirmation_email,
                order,
                start_to_close_timeout=timedelta(seconds=30)
            )

            # Erfolg!
            return {
                "status": "fulfilled",
                "order_id": order.order_id,
                "tracking_number": shipping_result["tracking_number"]
            }

        except Exception as e:
            workflow.logger.error(f"Order fulfillment failed: {e}")
            # Kompensiere alle bisherigen Schritte
            await saga.compensate()

            # Sende Fehlerbenachrichtigung
            await workflow.execute_activity(
                send_error_notification,
                args=[order, str(e)],
                start_to_close_timeout=timedelta(seconds=30)
            )

            return {
                "status": "failed",
                "order_id": order.order_id,
                "error": str(e)
            }

Activity-Implementierungen

# Activities mit Idempotenz

@activity.defn
async def reserve_inventory(order: OrderInfo) -> bool:
    """Reserviere Artikel im Inventar"""
    # Verwende idempotency_key, um Duplikate zu vermeiden
    response = await inventory_service.reserve(
        items=order.items,
        order_id=order.order_id,
        idempotency_key=order.idempotency_key
    )
    return response.success

@activity.defn
async def release_inventory(order_id: str, idempotency_key: str):
    """Kompensation: Gib Inventar-Reservierung frei"""
    await inventory_service.release(
        order_id=order_id,
        idempotency_key=f"{idempotency_key}-release"
    )

@activity.defn
async def charge_payment(order: OrderInfo) -> dict:
    """Belaste Zahlungsmittel"""
    # Viele Payment-APIs akzeptieren bereits idempotency_keys
    response = await payment_service.charge(
        customer_id=order.customer_id,
        amount=order.total_amount,
        idempotency_key=order.idempotency_key
    )
    return {
        "transaction_id": response.transaction_id,
        "status": response.status
    }

@activity.defn
async def refund_payment(
    transaction_id: str,
    amount: float,
    idempotency_key: str
):
    """Kompensation: Erstatte Zahlung"""
    await payment_service.refund(
        transaction_id=transaction_id,
        amount=amount,
        idempotency_key=f"{idempotency_key}-refund"
    )

@activity.defn
async def create_shipment(order: OrderInfo) -> dict:
    """Erstelle Versandetikett"""
    response = await shipping_service.create_shipment(
        order=order,
        idempotency_key=order.idempotency_key
    )
    return {
        "shipment_id": response.shipment_id,
        "tracking_number": response.tracking_number
    }

@activity.defn
async def cancel_shipment(shipment_id: str, idempotency_key: str):
    """Kompensation: Storniere Versand"""
    await shipping_service.cancel(
        shipment_id=shipment_id,
        idempotency_key=f"{idempotency_key}-cancel"
    )

@activity.defn
async def send_confirmation_email(order: OrderInfo):
    """Sende Bestätigungs-E-Mail"""
    await email_service.send(
        to=order.customer_id,
        template="order_confirmation",
        data=order
    )

@activity.defn
async def send_error_notification(order: OrderInfo, error: str):
    """Sende Fehler-Benachrichtigung"""
    await email_service.send(
        to=order.customer_id,
        template="order_failed",
        data={"order": order, "error": error}
    )

Kritisches Konzept: Idempotenz

Da Temporal Activities automatisch wiederholt, müssen alle Activities idempotent sein:

# Schlechtes Beispiel: Nicht idempotent
async def charge_payment_bad(customer_id: str, amount: float):
    # Könnte bei Retry mehrfach belasten!
    return payment_api.charge(customer_id, amount)

# Gutes Beispiel: Idempotent mit Key
async def charge_payment_good(
    customer_id: str,
    amount: float,
    idempotency_key: str
):
    # Payment-API prüft den Key und führt nur einmal aus
    return payment_api.charge(
        customer_id,
        amount,
        idempotency_key=idempotency_key
    )

Best Practices für Idempotenz:

  1. Idempotenz-Keys verwenden: UUIDs oder zusammengesetzte Keys (z.B. {order_id}-{operation})
  2. API-Unterstützung nutzen: Viele APIs (Stripe, PayPal, etc.) akzeptieren bereits Idempotenz-Keys
  3. Datenbank-Constraints: Unique-Constraints auf Keys in der Datenbank
  4. State-Checks: Prüfe vor Ausführung, ob die Operation bereits durchgeführt wurde (siehe Skizze nach dieser Liste)

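Eine minimale Skizze, wie sich die Punkte 3 und 4 in einer Activity kombinieren lassen (Annahme: eine relationale Datenbank mit Unique-Constraint auf idempotency_key; db und payment_api sind hier hypothetische Helfer):

from temporalio import activity

@activity.defn
async def charge_payment_idempotent(order_id: str, amount: float, idempotency_key: str) -> str:
    """Skizze: Idempotenz über State-Check plus Unique-Constraint in der eigenen Datenbank."""
    # 1. State-Check: Wurde die Operation bereits durchgeführt?
    existing = await db.fetch_one(
        "SELECT transaction_id FROM payments WHERE idempotency_key = :key",
        {"key": idempotency_key},
    )
    if existing:
        return existing["transaction_id"]

    # 2. Operation ausführen und Ergebnis zusammen mit dem Key speichern.
    #    Der Unique-Constraint auf idempotency_key verhindert Duplikate,
    #    falls zwei Retries gleichzeitig laufen.
    transaction_id = await payment_api.charge(order_id, amount)
    await db.execute(
        "INSERT INTO payments (idempotency_key, transaction_id) VALUES (:key, :tx)",
        {"key": idempotency_key, "tx": transaction_id},
    )
    return transaction_id

In der Praxis ist die Variante mit Idempotenz-Key direkt an der Payment-API (Punkt 2) robuster, weil Belastung und Persistenz in dieser Skizze nicht atomar sind.
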
Erweiterte Saga-Techniken

Parallele Kompensationen:

async def compensate_parallel(self):
    """Führe Kompensationen parallel aus für bessere Performance"""
    tasks = []
    for comp in reversed(self.compensations):
        task = workflow.execute_activity(
            comp.activity_name,
            args=comp.args,
            start_to_close_timeout=timedelta(minutes=5)
        )
        tasks.append(task)

    # Warte auf alle Kompensationen
    results = await asyncio.gather(*tasks, return_exceptions=True)

    for i, result in enumerate(results):
        if isinstance(result, Exception):
            workflow.logger.error(
                f"Compensation failed: {self.compensations[i].activity_name}"
            )

Teilweise Kompensation:

class Saga:
    def __init__(self):
        self.compensations: list[SagaCompensation] = []
        self.checkpoints: dict[str, int] = {}

    def add_checkpoint(self, name: str):
        """Setze einen Checkpoint für teilweise Kompensation"""
        # Merke, wie viele Kompensationen zu diesem Zeitpunkt registriert waren
        self.checkpoints[name] = len(self.compensations)

    async def compensate_to_checkpoint(self, checkpoint_name: str):
        """Kompensiere alle Schritte zurück bis zum angegebenen Checkpoint"""
        start_index = self.checkpoints[checkpoint_name]
        for comp in reversed(self.compensations[start_index:]):
            await workflow.execute_activity(...)

Wann Sagas verwenden?

Geeignet für:

  • E-Commerce Order Processing
  • Reisebuchungen (Flug + Hotel + Mietwagen)
  • Finanzielle Transaktionen über mehrere Konten
  • Multi-Service Workflows mit “Alles-oder-Nichts”-Semantik

Nicht geeignet für:

  • Einfache, nicht-transaktionale Workflows
  • Workflows ohne Notwendigkeit für Rollback
  • Szenarien, wo echte ACID-Transaktionen möglich sind

14.5 Zusammenfassung

In diesem Kapitel haben wir drei essenzielle Workflow-Muster kennengelernt:

Human-in-the-Loop

  • Problem: Workflows benötigen menschliche Eingaben mit unvorhersehbarer Dauer
  • Lösung: Signals zum Senden von Eingaben, Queries zum Abfragen des Status, Timers für Timeouts
  • Key Takeaway: Temporal-Workflows können problemlos Tage oder Wochen auf Input warten

Cron/Scheduling

  • Problem: Traditionelle Cron-Jobs sind schwer zu überwachen und zu steuern
  • Lösung: Temporal Schedules mit voller Kontrolle, Observability und Overlap-Policies
  • Key Takeaway: Schedules sind Cron-Jobs mit Superkräften

Order Fulfillment (Saga Pattern)

  • Problem: Verteilte Transaktionen über mehrere Services ohne echte ACID-Garantien
  • Lösung: Saga Pattern mit Kompensationen für Rollback, Temporal übernimmt State-Management
  • Key Takeaway: Idempotenz ist kritisch, Temporal macht Sagas einfach

Gemeinsame Prinzipien

Alle drei Muster profitieren von Temporals Kernstärken:

  1. Durability: State wird automatisch persistiert
  2. Reliability: Automatische Retries und Fehlerbehandlung
  3. Observability: Vollständige Einsicht in Workflow-Ausführungen
  4. Scalability: Workflows können über lange Zeiträume laufen

Im nächsten Kapitel wenden wir uns erweiterten Rezepten zu: AI Agents, Serverless-Integration und Polyglot-Workflows zeigen, wie sich diese Muster auf moderne, heterogene Architekturen übertragen lassen.

Kapitel 15: Erweiterte Rezepte (AI Agents, Lambda, Polyglot)

In diesem Kapitel behandeln wir drei fortgeschrittene Anwendungsfälle, die zeigen, wie Temporal in modernen, heterogenen Architekturen eingesetzt wird. Diese Rezepte demonstrieren die Flexibilität und Erweiterbarkeit der Plattform.

15.1 Überblick: Die Evolution von Temporal

Während Kapitel 14 klassische Workflow-Muster behandelte, konzentriert sich dieses Kapitel auf neuere, spezialisierte Anwendungsfälle:

  • AI Agents: Orchestrierung von KI-Agenten mit LLMs und langlebigen Konversationen
  • Serverless Integration: Kombination von Temporal mit AWS Lambda und anderen FaaS-Plattformen
  • Polyglot Workflows: Mehrsprachige Workflows über verschiedene SDKs hinweg

Diese Muster repräsentieren den aktuellen Stand der Temporal-Nutzung in der Industrie (Stand 2024/2025).

15.2 AI Agents mit Temporal

15.2.1 Warum Temporal für AI Agents?

Die Entwicklung von AI-Agenten bringt spezifische Herausforderungen mit sich:

  • Langlebige Konversationen: Gespräche können über Stunden oder Tage verlaufen
  • Zustandsverwaltung: Kontext, Ziele und bisherige Interaktionen müssen persistent gespeichert werden
  • Fehlertoleranz: LLM-APIs können fehlschlagen, Rate-Limits erreichen oder inkonsistente Antworten liefern
  • Human-in-the-Loop: Menschen müssen in kritischen Momenten eingreifen können
  • Tool-Orchestrierung: Agenten rufen verschiedene externe Tools auf

Temporal bietet für all diese Herausforderungen native Lösungen:

graph TB
    subgraph "AI Agent Architecture mit Temporal"
        WF[Workflow: Agent Orchestrator]

        subgraph "Activities"
            LLM[LLM API Call]
            TOOL1[Tool: Database Query]
            TOOL2[Tool: Web Search]
            TOOL3[Tool: File Analysis]
            HUMAN[Human Intervention]
        end

        STATE[(Workflow State:<br/>- Conversation History<br/>- Agent Goal<br/>- Tool Results)]

        WF --> LLM
        WF --> TOOL1
        WF --> TOOL2
        WF --> TOOL3
        WF --> HUMAN
        WF -.stores.-> STATE
    end

    style WF fill:#e1f5ff
    style LLM fill:#ffe1f5
    style STATE fill:#fff4e1

15.2.2 Real-World Adoption

Unternehmen, die Temporal für AI Agents nutzen (Stand 2024):

  • Lindy, Dust, ZoomInfo: AI Agents mit State-Durability
  • Descript & Neosync: Datenpipelines und GPU-Ressourcen-Koordination
  • OpenAI Integration: Temporal hat eine offizielle Integration mit dem OpenAI Agents SDK (Public Preview, Python SDK)

15.2.3 Grundlegendes AI Agent Pattern

from temporalio import workflow, activity
from temporalio.common import RetryPolicy
from datetime import timedelta
from dataclasses import dataclass, field
from typing import List, Optional
import asyncio
import json
import openai

@dataclass
class Message:
    role: str  # "system", "user", "assistant", "tool"
    content: str
    name: Optional[str] = None  # Tool name
    tool_call_id: Optional[str] = None

@dataclass
class AgentState:
    goal: str
    conversation_history: List[Message] = field(default_factory=list)
    tools_used: List[str] = field(default_factory=list)
    completed: bool = False
    result: Optional[str] = None

# Activities: Non-deterministische LLM und Tool Calls

@activity.defn
async def call_llm(messages: List[Message], tools: List[dict]) -> dict:
    """
    Ruft LLM API auf (OpenAI, Claude, etc.).
    Vollständig non-deterministisch - perfekt für Activity.
    """
    activity.logger.info(f"Calling LLM with {len(messages)} messages")

    client = openai.AsyncOpenAI()

    try:
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": m.role, "content": m.content} for m in messages],
            tools=tools,
            temperature=0.7,
        )
        message = response.choices[0].message

        # Tool-Calls in JSON-serialisierbare dicts umwandeln;
        # die Argumente kommen von der API als JSON-String und werden hier geparst
        tool_calls = None
        if message.tool_calls:
            tool_calls = [
                {
                    "id": tc.id,
                    "function": {
                        "name": tc.function.name,
                        "arguments": json.loads(tc.function.arguments),
                    },
                }
                for tc in message.tool_calls
            ]

        return {
            "content": message.content,
            "tool_calls": tool_calls,
            "finish_reason": response.choices[0].finish_reason
        }
    except Exception as e:
        activity.logger.error(f"LLM API error: {e}")
        raise

@activity.defn
async def execute_tool(tool_name: str, arguments: dict) -> str:
    """
    Führt Tool-Aufrufe aus (Database, APIs, File System, etc.).
    """
    activity.logger.info(f"Executing tool: {tool_name}")

    if tool_name == "search_database":
        # Simuliere Datenbanksuche
        query = arguments.get("query")
        results = await database_search(query)
        return f"Found {len(results)} results: {results}"

    elif tool_name == "web_search":
        # Web-Suche
        query = arguments.get("query")
        results = await web_search_api(query)
        return f"Web search results: {results}"

    elif tool_name == "read_file":
        # Datei lesen
        filepath = arguments.get("filepath")
        content = await read_file_async(filepath)
        return content

    else:
        raise ValueError(f"Unknown tool: {tool_name}")

@activity.defn
async def request_human_input(question: str, context: dict) -> str:
    """
    Fordert menschliche Eingabe an (via UI, Email, Slack, etc.).
    """
    activity.logger.info(f"Requesting human input: {question}")

    # In Production: Sende Notification via Slack/Email
    # und warte auf Webhook/API Call zurück
    notification_result = await send_notification(
        channel="slack",
        message=f"AI Agent needs your input: {question}",
        context=context
    )

    # Platzhalter: In der Realität kommt die Antwort über ein Signal an den Workflow zurück
    return notification_result

# Workflow: Deterministische Orchestrierung

@workflow.defn
class AIAgentWorkflow:
    """
    Orchestriert einen AI Agent mit Tools und optionalem Human-in-the-Loop.

    Der Workflow ist deterministisch, aber die LLM-Calls und Tools sind
    non-deterministisch (daher als Activities implementiert).
    """

    def __init__(self) -> None:
        self.state = AgentState(goal="")
        self.human_input_received = None
        self.max_iterations = 20  # Verhindere infinite loops

    @workflow.run
    async def run(self, goal: str, initial_context: str = "") -> AgentState:
        """
        Führe Agent aus bis Ziel erreicht oder max_iterations.

        Args:
            goal: Das zu erreichende Ziel des Agents
            initial_context: Optionaler initialer Kontext
        """
        self.state.goal = goal

        # System Message
        system_msg = Message(
            role="system",
            content=f"""You are a helpful AI assistant. Your goal is: {goal}

You have access to the following tools:
- search_database: Search internal database
- web_search: Search the web
- read_file: Read a file from the file system
- request_human_help: Ask a human for help

When you have achieved the goal, respond with "GOAL_ACHIEVED: [result]"."""
        )
        self.state.conversation_history.append(system_msg)

        # Initial User Message
        if initial_context:
            user_msg = Message(role="user", content=initial_context)
            self.state.conversation_history.append(user_msg)

        # Available Tools
        tools = [
            {
                "type": "function",
                "function": {
                    "name": "search_database",
                    "description": "Search the internal database",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string", "description": "Search query"}
                        },
                        "required": ["query"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "web_search",
                    "description": "Search the web",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string", "description": "Search query"}
                        },
                        "required": ["query"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "read_file",
                    "description": "Read a file",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "filepath": {"type": "string", "description": "Path to file"}
                        },
                        "required": ["filepath"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "request_human_help",
                    "description": "Ask a human for help",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "question": {"type": "string", "description": "Question for human"}
                        },
                        "required": ["question"]
                    }
                }
            }
        ]

        # Agent Loop
        for iteration in range(self.max_iterations):
            workflow.logger.info(f"Agent iteration {iteration + 1}/{self.max_iterations}")

            # Call LLM
            llm_response = await workflow.execute_activity(
                call_llm,
                args=[self.state.conversation_history, tools],
                start_to_close_timeout=timedelta(seconds=60),
                retry_policy=RetryPolicy(
                    initial_interval=timedelta(seconds=1),
                    maximum_interval=timedelta(seconds=30),
                    maximum_attempts=5,
                    non_retryable_error_types=["InvalidRequestError"]
                )
            )

            # Prüfe, ob Ziel erreicht
            if llm_response.get("content") and "GOAL_ACHIEVED:" in llm_response["content"]:
                self.state.completed = True
                self.state.result = llm_response["content"].replace("GOAL_ACHIEVED:", "").strip()

                # Füge finale Antwort zur History hinzu
                self.state.conversation_history.append(
                    Message(role="assistant", content=llm_response["content"])
                )

                workflow.logger.info(f"Goal achieved: {self.state.result}")
                return self.state

            # Verarbeite Tool Calls
            if llm_response.get("tool_calls"):
                for tool_call in llm_response["tool_calls"]:
                    tool_name = tool_call["function"]["name"]
                    tool_args = tool_call["function"]["arguments"]

                    workflow.logger.info(f"Executing tool: {tool_name}")
                    self.state.tools_used.append(tool_name)

                    # Spezialbehandlung: Human Input
                    if tool_name == "request_human_help":
                        # Warte auf menschliche Eingabe via Signal
                        question = tool_args.get("question")

                        workflow.logger.info(f"Waiting for human input: {question}")

                        # Sende Benachrichtigung (Fire-and-Forget Activity)
                        await workflow.execute_activity(
                            request_human_input,
                            args=[question, {"goal": self.state.goal}],
                            start_to_close_timeout=timedelta(seconds=30)
                        )

                        # Warte auf Signal (kann Stunden/Tage dauern!)
                        try:
                            await workflow.wait_condition(
                                lambda: self.human_input_received is not None,
                                timeout=timedelta(hours=24)
                            )
                            tool_result = self.human_input_received
                        except asyncio.TimeoutError:
                            # Fallback statt hängendem Workflow (vgl. Best Practice 4 in 15.2.6)
                            tool_result = "TIMEOUT: no human input received"

                        self.human_input_received = None  # Reset für nächstes Mal

                    else:
                        # Normale Tool Execution
                        tool_result = await workflow.execute_activity(
                            execute_tool,
                            args=[tool_name, tool_args],
                            start_to_close_timeout=timedelta(minutes=5),
                            retry_policy=RetryPolicy(
                                initial_interval=timedelta(seconds=2),
                                maximum_attempts=3
                            )
                        )

                    # Füge Tool-Result zur Conversation History hinzu
                    self.state.conversation_history.append(
                        Message(
                            role="tool",
                            name=tool_name,
                            content=str(tool_result),
                            tool_call_id=tool_call["id"]
                        )
                    )

            # Füge LLM Response zur History hinzu (wenn kein Tool Call)
            elif llm_response.get("content"):
                self.state.conversation_history.append(
                    Message(role="assistant", content=llm_response["content"])
                )

        # Max iterations erreicht
        workflow.logger.warning("Max iterations reached without achieving goal")
        self.state.completed = False
        self.state.result = "Max iterations reached"
        return self.state

    @workflow.signal
    async def provide_human_input(self, input_text: str):
        """Signal: Menschliche Eingabe bereitstellen."""
        workflow.logger.info(f"Received human input: {input_text}")
        self.human_input_received = input_text

    @workflow.signal
    async def add_user_message(self, message: str):
        """Signal: Neue User-Message hinzufügen (für Multi-Turn)."""
        self.state.conversation_history.append(
            Message(role="user", content=message)
        )

    @workflow.query
    def get_state(self) -> AgentState:
        """Query: Aktueller Agent State."""
        return self.state

    @workflow.query
    def get_conversation_history(self) -> List[Message]:
        """Query: Conversation History."""
        return self.state.conversation_history

    @workflow.query
    def get_tools_used(self) -> List[str]:
        """Query: Welche Tools wurden verwendet?"""
        return self.state.tools_used

15.2.4 Client: Agent starten und überwachen

import asyncio
import uuid

from temporalio.client import Client

async def run_ai_agent():
    """Starte AI Agent und überwache Progress."""

    client = await Client.connect("localhost:7233")

    # Starte Agent Workflow
    handle = await client.start_workflow(
        AIAgentWorkflow.run,
        args=[
            "Analyze the sales data from Q4 2024 and create a summary report",
            "Please focus on trends and outliers."
        ],
        id=f"ai-agent-{uuid.uuid4()}",
        task_queue="ai-agents"
    )

    print(f"Started AI Agent: {handle.id}")

    # Überwache Progress
    while True:
        state = await handle.query(AIAgentWorkflow.get_state)

        print(f"\nAgent Status:")
        print(f"  Completed: {state.completed}")
        print(f"  Tools used: {', '.join(state.tools_used)}")
        print(f"  Conversation length: {len(state.conversation_history)} messages")

        if state.completed:
            print(f"\n✅ Goal achieved!")
            print(f"Result: {state.result}")
            break

        await asyncio.sleep(5)

    # Hole finale Conversation History
    history = await handle.query(AIAgentWorkflow.get_conversation_history)
    print("\n=== Conversation History ===")
    for msg in history:
        print(f"{msg.role}: {msg.content[:100]}...")

    result = await handle.result()
    return result

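Die eigentliche Antwort des Menschen erreicht den laufenden Agent über das Signal provide_human_input. Eine minimale Skizze, wie ein hypothetischer Webhook- oder Slack-Handler (on_human_reply) die Antwort weiterreichen könnte:

from temporalio.client import Client

async def on_human_reply(workflow_id: str, reply_text: str):
    """Hypothetischer Handler: leitet die menschliche Antwort als Signal an den Agent weiter."""
    client = await Client.connect("localhost:7233")

    # Handle auf den laufenden Agent-Workflow holen und Signal senden
    handle = client.get_workflow_handle_for(AIAgentWorkflow.run, workflow_id)
    await handle.signal(AIAgentWorkflow.provide_human_input, reply_text)
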
15.2.5 Multi-Agent Orchestration

Für komplexere Szenarien können mehrere Agents koordiniert werden:

@workflow.defn
class MultiAgentCoordinatorWorkflow:
    """
    Koordiniert mehrere spezialisierte AI Agents.

    Beispiel: Research Agent → Analysis Agent → Report Agent
    """

    @workflow.run
    async def run(self, task: str) -> dict:
        workflow.logger.info(f"Multi-Agent Coordinator started for: {task}")

        # Agent 1: Research Agent
        research_handle = await workflow.start_child_workflow(
            AIAgentWorkflow.run,
            args=[
                f"Research the following topic: {task}",
                "Collect relevant data from database and web."
            ],
            id=f"research-agent-{workflow.info().workflow_id}",
            task_queue="ai-agents"
        )

        research_result = await research_handle

        # Agent 2: Analysis Agent
        analysis_handle = await workflow.start_child_workflow(
            AIAgentWorkflow.run,
            args=[
                "Analyze the following research data and identify key insights",
                f"Research data: {research_result.result}"
            ],
            id=f"analysis-agent-{workflow.info().workflow_id}",
            task_queue="ai-agents"
        )

        analysis_result = await analysis_handle

        # Agent 3: Report Agent
        report_handle = await workflow.start_child_workflow(
            AIAgentWorkflow.run,
            args=[
                "Create a professional report based on the analysis",
                f"Analysis: {analysis_result.result}"
            ],
            id=f"report-agent-{workflow.info().workflow_id}",
            task_queue="ai-agents"
        )

        report_result = await report_handle

        return {
            "task": task,
            "research": research_result.result,
            "analysis": analysis_result.result,
            "report": report_result.result,
            "total_tools_used": (
                len(research_result.tools_used) +
                len(analysis_result.tools_used) +
                len(report_result.tools_used)
            )
        }

15.2.6 Best Practices für AI Agents

1. LLM Calls immer als Activities

# ✅ Richtig: LLM Call als Activity
@activity.defn
async def call_llm(prompt: str) -> str:
    return await openai.complete(prompt)

# ❌ Falsch: LLM Call direkt im Workflow
@workflow.defn
class BadWorkflow:
    @workflow.run
    async def run(self):
        # NICHT deterministisch! Workflow wird fehlschlagen beim Replay
        result = await openai.complete("Hello")

2. Retry Policies für LLM APIs

# LLMs können Rate-Limits haben oder temporär fehlschlagen
from temporalio.common import RetryPolicy

llm_response = await workflow.execute_activity(
    call_llm,
    prompt,
    start_to_close_timeout=timedelta(seconds=60),
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=1),
        backoff_coefficient=2.0,
        maximum_interval=timedelta(seconds=60),
        maximum_attempts=5,
        # Nicht wiederholen bei Invalid Request
        non_retryable_error_types=["InvalidRequestError", "AuthenticationError"]
    )
)

3. Conversation History Management

# Begrenze History-Größe für lange Konversationen
def truncate_history(messages: List[Message], max_tokens: int = 4000) -> List[Message]:
    """Behalte nur neueste Messages innerhalb Token-Limit."""
    # Behalte immer System Message
    system_msgs = [m for m in messages if m.role == "system"]
    other_msgs = [m for m in messages if m.role != "system"]

    # Schneide älteste Messages ab
    # (In Production: Token Counting nutzen)
    return system_msgs + other_msgs[-50:]  # Letzte 50 Messages

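Eine mögliche Umsetzung des Token-Countings, skizziert mit tiktoken als Tokenizer (Annahme: die grobe Zählung über content genügt als Heuristik):

import tiktoken

def truncate_history_by_tokens(messages: List[Message], max_tokens: int = 4000) -> List[Message]:
    """Skizze: System-Messages behalten, dann neueste Messages bis zum Token-Budget auffüllen."""
    encoding = tiktoken.encoding_for_model("gpt-4")

    system_msgs = [m for m in messages if m.role == "system"]
    other_msgs = [m for m in messages if m.role != "system"]

    budget = max_tokens - sum(len(encoding.encode(m.content)) for m in system_msgs)

    kept: List[Message] = []
    # Von der neuesten zur ältesten Message auffüllen, bis das Budget erschöpft ist
    for msg in reversed(other_msgs):
        cost = len(encoding.encode(msg.content))
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost

    return system_msgs + list(reversed(kept))
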
4. Timeouts für Human-in-the-Loop

try:
    await workflow.wait_condition(
        lambda: self.human_input_received is not None,
        timeout=timedelta(hours=24)
    )
except asyncio.TimeoutError:
    # Automatische Eskalation oder Fallback
    workflow.logger.warning("Human input timeout - using fallback")
    self.human_input_received = "TIMEOUT: Proceeding without human input"

15.3 Serverless Integration (AWS Lambda & Co.)

15.3.1 Das Serverless-Dilemma

Temporal und Serverless haben unterschiedliche Ausführungsmodelle:

| Aspekt         | Temporal Worker             | AWS Lambda                           |
|----------------|-----------------------------|--------------------------------------|
| Ausführung     | Long-running Prozess        | Kurzlebige Invocations (max. 15 Min) |
| State          | In-Memory                   | Stateless                            |
| Infrastruktur  | VM, Container (persistent)  | On-Demand                            |
| Kosten         | Basierend auf Laufzeit      | Pay-per-Invocation                   |

Kernproblem: Temporal Worker benötigen lange laufende Compute-Infrastruktur, während Lambda/Serverless kurzlebig und stateless ist.

Aber: Temporal kann trotzdem genutzt werden, um Serverless-Funktionen zu orchestrieren!

15.3.2 Integration Pattern 1: SQS + Lambda + Temporal

graph LR
    S3[S3 Upload] --> SQS[SQS Queue]
    SQS --> Lambda[Lambda Function]
    Lambda -->|Start Workflow| Temporal[Temporal Service]

    Temporal --> Worker[Temporal Worker<br/>ECS/EC2]
    Worker -->|Invoke| Lambda2[Lambda Activities]

    style Lambda fill:#ff9900
    style Lambda2 fill:#ff9900
    style Temporal fill:#ffd700
    style Worker fill:#e1f5ff

Architecture:

  1. S3 Upload triggert SQS Message
  2. Lambda Function startet Temporal Workflow
  3. Temporal Worker (auf ECS/EC2) führt Workflow aus
  4. Workflow ruft Lambda-Funktionen als Activities auf

Implementierung:

# Lambda Function: Workflow Starter
import asyncio
import json
from temporalio.client import Client

def lambda_handler(event, context):
    """
    AWS Lambda: Startet Temporal Workflow basierend auf SQS Message.

    Die Python-Lambda-Runtime erwartet einen synchronen Handler,
    deshalb wird die async-Logik über asyncio.run() ausgeführt.
    """
    return asyncio.run(_handle(event))

async def _handle(event) -> dict:
    # Eine Verbindung für alle Records wiederverwenden
    client = await Client.connect("temporal.example.com:7233")

    # Parse SQS Event
    for record in event['Records']:
        body = json.loads(record['body'])
        s3_key = body['Records'][0]['s3']['object']['key']

        # Starte Workflow
        handle = await client.start_workflow(
            DataProcessingWorkflow.run,
            args=[s3_key],
            id=f"process-{s3_key}",
            task_queue="data-processing"
        )

        print(f"Started workflow: {handle.id}")

    return {
        'statusCode': 200,
        'body': json.dumps('Workflow started')
    }
# Temporal Worker (auf ECS/EC2): Ruft Lambda als Activity auf
import boto3
import json
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

lambda_client = boto3.client('lambda')

@activity.defn
async def invoke_lambda_activity(function_name: str, payload: dict) -> dict:
    """
    Activity: Ruft AWS Lambda Function auf.
    """
    activity.logger.info(f"Invoking Lambda: {function_name}")

    try:
        # Hinweis: boto3 ist synchron und blockiert den Event-Loop des Workers.
        # Für hohen Durchsatz besser asyncio.to_thread(...) oder eine synchrone Activity nutzen.
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType='RequestResponse',  # Synchron
            Payload=json.dumps(payload)
        )

        result = json.loads(response['Payload'].read())

        activity.logger.info(f"Lambda response: {result}")
        return result

    except Exception as e:
        activity.logger.error(f"Lambda invocation failed: {e}")
        raise

@workflow.defn
class DataProcessingWorkflow:
    """
    Workflow: Orchestriert mehrere Lambda Functions.
    """

    @workflow.run
    async def run(self, s3_key: str) -> dict:
        workflow.logger.info(f"Processing S3 file: {s3_key}")

        # Step 1: Lambda für Data Extraction
        extraction_result = await workflow.execute_activity(
            invoke_lambda_activity,
            args=[
                "data-extraction-function",
                {"s3_key": s3_key}
            ],
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(
                maximum_attempts=3,
                initial_interval=timedelta(seconds=5),
            )
        )

        # Step 2: Lambda für Data Transformation
        transform_result = await workflow.execute_activity(
            invoke_lambda_activity,
            args=[
                "data-transform-function",
                {"data": extraction_result}
            ],
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Step 3: Lambda für Data Loading
        load_result = await workflow.execute_activity(
            invoke_lambda_activity,
            args=[
                "data-load-function",
                {"data": transform_result}
            ],
            start_to_close_timeout=timedelta(minutes=5)
        )

        return {
            "s3_key": s3_key,
            "records_processed": load_result.get("count"),
            "status": "completed"
        }

15.3.3 Integration Pattern 2: Step Functions Alternative

Temporal kann als robustere Alternative zu AWS Step Functions dienen:

| Feature        | AWS Step Functions      | Temporal                           |
|----------------|-------------------------|------------------------------------|
| Sprache        | JSON (ASL)              | Python, Go, Java, TypeScript, etc. |
| Debugging      | Schwierig               | Native IDE Support                 |
| Testing        | Komplex                 | Unit Tests möglich                 |
| Versionierung  | Limitiert               | Native Code-Versionierung          |
| Local Dev      | Schwierig (Localstack)  | Temporal Dev Server                |
| Vendor Lock-In | AWS only                | Cloud-agnostisch                   |
| Kosten         | Pro State Transition    | Selbst gehostet oder Cloud         |

Migration von Step Functions zu Temporal:

# Vorher: Step Functions (JSON ASL)
"""
{
  "StartAt": "ProcessData",
  "States": {
    "ProcessData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:process",
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:transform",
      "Next": "LoadData"
    },
    "LoadData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:load",
      "End": true
    }
  }
}
"""

# Nachher: Temporal Workflow (Python)
@workflow.defn
class ETLWorkflow:
    @workflow.run
    async def run(self, input_data: dict) -> dict:
        # Step 1: Process
        processed = await workflow.execute_activity(
            process_data,
            input_data,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Step 2: Transform
        transformed = await workflow.execute_activity(
            transform_data,
            processed,
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Step 3: Load
        result = await workflow.execute_activity(
            load_data,
            transformed,
            start_to_close_timeout=timedelta(minutes=5)
        )

        return result

15.3.4 Deployment-Strategien für Worker

Option 1: AWS ECS (Fargate oder EC2)

# ecs-task-definition.json
{
  "family": "temporal-worker",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "temporal-worker",
      "image": "myorg/temporal-worker:latest",
      "environment": [
        {
          "name": "TEMPORAL_ADDRESS",
          "value": "temporal.example.com:7233"
        },
        {
          "name": "TASK_QUEUE",
          "value": "data-processing"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/temporal-worker",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Option 2: Kubernetes (EKS)

# temporal-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: temporal-worker
  template:
    metadata:
      labels:
        app: temporal-worker
    spec:
      containers:
      - name: worker
        image: myorg/temporal-worker:latest
        env:
        - name: TEMPORAL_ADDRESS
          value: "temporal.example.com:7233"
        - name: TASK_QUEUE
          value: "data-processing"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"

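Der Einstiegspunkt im Worker-Image liest die oben gesetzten Umgebungsvariablen aus. Eine minimale Skizze (Annahme: DataProcessingWorkflow und invoke_lambda_activity aus 15.3.2 sind im Image enthalten):

# worker_main.py - Einstiegspunkt für das Worker-Image (Skizze)
import asyncio
import os

from temporalio.client import Client
from temporalio.worker import Worker

async def main():
    client = await Client.connect(os.environ["TEMPORAL_ADDRESS"])

    worker = Worker(
        client,
        task_queue=os.environ.get("TASK_QUEUE", "data-processing"),
        workflows=[DataProcessingWorkflow],
        activities=[invoke_lambda_activity],
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
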
15.3.5 Cost Optimization

Hybrid Approach: Worker auf Reserved Instances + Lambda für Burst

@workflow.defn
class HybridWorkflow:
    """
    Nutzt reguläre Activities für Standard-Tasks,
    Lambda für CPU-intensive Burst-Workloads.
    """

    @workflow.run
    async def run(self, data: dict) -> dict:
        # Standard Processing auf ECS Worker
        normalized = await workflow.execute_activity(
            normalize_data,
            data,
            start_to_close_timeout=timedelta(minutes=2)
        )

        # CPU-intensive Task auf Lambda (burst capacity)
        if data.get("requires_heavy_processing"):
            processed = await workflow.execute_activity(
                invoke_lambda_activity,
                args=["heavy-processing-function", normalized],
                start_to_close_timeout=timedelta(minutes=10)
            )
        else:
            processed = normalized

        # Finale Speicherung auf ECS Worker
        result = await workflow.execute_activity(
            save_to_database,
            processed,
            start_to_close_timeout=timedelta(minutes=1)
        )

        return result

15.4 Polyglot Workflows

15.4.1 Warum Polyglot?

In der Realität nutzen Teams unterschiedliche Sprachen für unterschiedliche Aufgaben:

  • Python: Data Science, ML, Scripting
  • Go: High-Performance Services, Infrastructure
  • TypeScript/Node.js: Frontend-Integration, APIs
  • Java: Enterprise Applications, Legacy Systems

Temporal ermöglicht es, diese Sprachen in einem Workflow zu kombinieren!

15.4.2 Architektur-Prinzipien

graph TB
    Client[Client: TypeScript]

    subgraph "Temporal Service"
        TS[Temporal Server]
    end

    subgraph "Workflow: Python"
        WF[Workflow Definition<br/>Python]
    end

    subgraph "Activities"
        ACT1[Activity: Python<br/>Data Processing]
        ACT2[Activity: Go<br/>Image Processing]
        ACT3[Activity: Java<br/>Legacy Integration]
        ACT4[Activity: TypeScript<br/>API Calls]
    end

    subgraph "Workers"
        W1[Worker: Python<br/>Task Queue: python-tasks]
        W2[Worker: Go<br/>Task Queue: go-tasks]
        W3[Worker: Java<br/>Task Queue: java-tasks]
        W4[Worker: TypeScript<br/>Task Queue: ts-tasks]
    end

    Client -->|Start Workflow| TS
    TS <--> WF

    WF -->|Execute Activity| ACT1
    WF -->|Execute Activity| ACT2
    WF -->|Execute Activity| ACT3
    WF -->|Execute Activity| ACT4

    ACT1 -.-> W1
    ACT2 -.-> W2
    ACT3 -.-> W3
    ACT4 -.-> W4

    style WF fill:#e1f5ff
    style W1 fill:#ffe1e1
    style W2 fill:#e1ffe1
    style W3 fill:#ffe1ff
    style W4 fill:#ffffe1

Wichtige Regel:

  • ✅ Ein Workflow wird in einer Sprache geschrieben
  • ✅ Activities können in verschiedenen Sprachen sein
  • ❌ Ein Workflow kann nicht mehrere Sprachen mischen

15.4.3 Beispiel: Media Processing Pipeline

Workflow: Python (Orchestration)

# workflow.py (Python Worker)
from temporalio import workflow
from datetime import timedelta

@workflow.defn
class MediaProcessingWorkflow:
    """
    Polyglot Workflow: Orchestriert Activities in Python, Go, TypeScript.
    """

    @workflow.run
    async def run(self, video_url: str) -> dict:
        workflow.logger.info(f"Processing video: {video_url}")

        # Activity 1: Download Video (Python)
        # Task Queue: python-tasks
        downloaded_path = await workflow.execute_activity(
            "download_video",  # Activity Name (String-based)
            video_url,
            task_queue="python-tasks",
            start_to_close_timeout=timedelta(minutes=10)
        )

        # Activity 2: Extract Frames (Go - High Performance)
        # Task Queue: go-tasks
        frames = await workflow.execute_activity(
            "extract_frames",
            downloaded_path,
            task_queue="go-tasks",
            start_to_close_timeout=timedelta(minutes=5)
        )

        # Activity 3: AI Analysis (Python - ML Libraries)
        # Task Queue: python-tasks
        analysis_result = await workflow.execute_activity(
            "analyze_frames",
            frames,
            task_queue="python-tasks",
            start_to_close_timeout=timedelta(minutes=15)
        )

        # Activity 4: Generate Thumbnail (Go - Image Processing)
        # Task Queue: go-tasks
        thumbnail_url = await workflow.execute_activity(
            "generate_thumbnail",
            frames[0],
            task_queue="go-tasks",
            start_to_close_timeout=timedelta(minutes=2)
        )

        # Activity 5: Store Metadata (TypeScript - API Integration)
        # Task Queue: ts-tasks
        metadata_id = await workflow.execute_activity(
            "store_metadata",
            args=[{
                "video_url": video_url,
                "analysis": analysis_result,
                "thumbnail": thumbnail_url
            }],
            task_queue="ts-tasks",
            start_to_close_timeout=timedelta(minutes=1)
        )

        return {
            "video_url": video_url,
            "thumbnail_url": thumbnail_url,
            "analysis": analysis_result,
            "metadata_id": metadata_id
        }

Activity 1: Python (Download & ML)

# activities_python.py (Python Worker)
from temporalio import activity
import httpx
import tensorflow as tf

@activity.defn
async def download_video(url: str) -> str:
    """Download video from URL."""
    activity.logger.info(f"Downloading video: {url}")

    async with httpx.AsyncClient() as client:
        response = await client.get(url)

        filepath = f"/tmp/video_{activity.info().workflow_id}.mp4"
        with open(filepath, "wb") as f:
            f.write(response.content)

        return filepath

@activity.defn
async def analyze_frames(frames: list[str]) -> dict:
    """Analyze frames using ML model (Python/TensorFlow)."""
    activity.logger.info(f"Analyzing {len(frames)} frames")

    # Load ML Model
    model = tf.keras.models.load_model("/models/video_classifier.h5")

    results = []
    for frame_path in frames:
        image = tf.keras.preprocessing.image.load_img(frame_path)
        image_array = tf.keras.preprocessing.image.img_to_array(image)
        prediction = model.predict(image_array)
        results.append(prediction.tolist())

    return {
        "frames_analyzed": len(frames),
        "predictions": results
    }

# Worker
async def main():
    from temporalio.client import Client
    from temporalio.worker import Worker

    client = await Client.connect("localhost:7233")

    worker = Worker(
        client,
        task_queue="python-tasks",
        workflows=[],  # Nur Activities auf diesem Worker
        activities=[download_video, analyze_frames]
    )

    await worker.run()

Activity 2: Go (High-Performance Image Processing)

// activities_go.go (Go Worker)
package main

import (
    "context"
    "fmt"
    "os/exec"

    "go.temporal.io/sdk/activity"
    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
)

// ExtractFrames extracts frames from video using FFmpeg
func ExtractFrames(ctx context.Context, videoPath string) ([]string, error) {
    logger := activity.GetLogger(ctx)
    logger.Info("Extracting frames", "video", videoPath)

    // FFmpeg command: Extract 1 frame per second
    outputPattern := "/tmp/frame_%04d.jpg"
    cmd := exec.Command(
        "ffmpeg",
        "-i", videoPath,
        "-vf", "fps=1",
        outputPattern,
    )

    if err := cmd.Run(); err != nil {
        return nil, fmt.Errorf("ffmpeg failed: %w", err)
    }

    // Return list of generated frame paths
    frames := []string{
        "/tmp/frame_0001.jpg",
        "/tmp/frame_0002.jpg",
        // ... would actually scan directory
    }

    logger.Info("Extracted frames", "count", len(frames))
    return frames, nil
}

// GenerateThumbnail creates a thumbnail from image
func GenerateThumbnail(ctx context.Context, imagePath string) (string, error) {
    logger := activity.GetLogger(ctx)
    logger.Info("Generating thumbnail", "image", imagePath)

    thumbnailPath := "/tmp/thumbnail.jpg"

    // ImageMagick: Resize to 300x300
    cmd := exec.Command(
        "convert",
        imagePath,
        "-resize", "300x300",
        thumbnailPath,
    )

    if err := cmd.Run(); err != nil {
        return "", fmt.Errorf("thumbnail generation failed: %w", err)
    }

    // Upload to S3 (simplified)
    s3Url := uploadToS3(thumbnailPath)

    return s3Url, nil
}

func main() {
    c, err := client.Dial(client.Options{
        HostPort: "localhost:7233",
    })
    if err != nil {
        panic(err)
    }
    defer c.Close()

    w := worker.New(c, "go-tasks", worker.Options{})

    // Register Activities
    w.RegisterActivity(ExtractFrames)
    w.RegisterActivity(GenerateThumbnail)

    if err := w.Run(worker.InterruptCh()); err != nil {
        panic(err)
    }
}

Activity 3: TypeScript (API Integration)

// activities_typescript.ts (TypeScript Worker)
import { log } from '@temporalio/activity';

interface MetadataInput {
  video_url: string;
  analysis: any;
  thumbnail: string;
}

/**
 * Store metadata in external API
 */
export async function storeMetadata(
  metadata: MetadataInput
): Promise<string> {
  log.info('Storing metadata', { videoUrl: metadata.video_url });

  // Call external API
  const response = await fetch('https://api.example.com/videos', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url: metadata.video_url,
      thumbnailUrl: metadata.thumbnail,
      analysis: metadata.analysis,
      processedAt: new Date().toISOString(),
    }),
  });

  if (!response.ok) {
    throw new Error(`API call failed: ${response.statusText}`);
  }

  const result = await response.json();

  log.info('Metadata stored', { id: result.id });
  return result.id;
}

// Worker
import { Worker } from '@temporalio/worker';

async function run() {
  // Dieser Worker stellt nur Activities bereit - der Workflow selbst
  // läuft auf dem Python-Worker (Task Queue "python-tasks")
  const worker = await Worker.create({
    activities: {
      storeMetadata,
    },
    taskQueue: 'ts-tasks',
  });

  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});

15.4.4 Data Serialization zwischen Sprachen

Temporal konvertiert automatisch zwischen Sprachen:

# Python → Go
await workflow.execute_activity(
    "extract_frames",
    "/tmp/video.mp4",  # Python string → Go string
    task_queue="go-tasks"
)

# Python → TypeScript
await workflow.execute_activity(
    "store_metadata",
    {  # Python dict → TypeScript object
        "video_url": "https://...",
        "analysis": {"score": 0.95}
    },
    task_queue="ts-tasks"
)

Unterstützte Typen (Automatic Conversion):

  • Primitives: int, float, string, bool
  • Collections: list, dict, array, object
  • Custom Types: Dataclasses, Structs, Interfaces (als JSON)

Komplexe Typen:

# Python
from dataclasses import dataclass

@dataclass
class VideoMetadata:
    url: str
    duration_seconds: int
    resolution: dict
    tags: list[str]

# Temporal serialisiert automatisch zu JSON
metadata = VideoMetadata(
    url="https://...",
    duration_seconds=120,
    resolution={"width": 1920, "height": 1080},
    tags=["tutorial", "python"]
)

# Go empfängt als Struct
"""
type VideoMetadata struct {
    URL             string   `json:"url"`
    DurationSeconds int      `json:"duration_seconds"`
    Resolution      struct {
        Width  int `json:"width"`
        Height int `json:"height"`
    } `json:"resolution"`
    Tags            []string `json:"tags"`
}
"""

15.4.5 Workflow Starter in verschiedenen Sprachen

# Python Client
from temporalio.client import Client

client = await Client.connect("localhost:7233")
handle = await client.start_workflow(
    "MediaProcessingWorkflow",  # Workflow Name (String)
    "https://example.com/video.mp4",
    id="video-123",
    task_queue="python-tasks"  # Workflow läuft auf Python Worker
)
// TypeScript Client
import { Client } from '@temporalio/client';

const client = new Client();
const handle = await client.workflow.start('MediaProcessingWorkflow', {
  args: ['https://example.com/video.mp4'],
  workflowId: 'video-123',
  taskQueue: 'python-tasks',
});
// Go Client
import (
    "context"

    "go.temporal.io/sdk/client"
)

c, _ := client.Dial(client.Options{})
defer c.Close()

options := client.StartWorkflowOptions{
    ID:        "video-123",
    TaskQueue: "python-tasks",
}

we, _ := c.ExecuteWorkflow(
    context.Background(),
    options,
    "MediaProcessingWorkflow",
    "https://example.com/video.mp4",
)

15.4.6 Best Practices für Polyglot

1. Task Queue Naming Convention

# Sprache im Task Queue Namen
task_queue = f"{language}-{service}-tasks"

# Beispiele:
"python-ml-tasks"
"go-image-processing-tasks"
"typescript-api-tasks"
"java-legacy-integration-tasks"

2. Activity Namen als Strings

# ✅ Verwende String-Namen für Cross-Language
await workflow.execute_activity(
    "extract_frames",  # String name
    video_path,
    task_queue="go-tasks"
)

# ❌ Funktionsreferenzen funktionieren nur innerhalb einer Sprache
await workflow.execute_activity(
    extract_frames,  # Function reference
    video_path
)

3. Schema Validation

# Nutze Pydantic für Schema-Validierung
from pydantic import BaseModel

class VideoProcessingInput(BaseModel):
    video_url: str
    resolution: dict
    tags: list[str]

@workflow.defn
class MediaWorkflow:
    @workflow.run
    async def run(self, input_dict: dict) -> dict:
        # Validiere Input
        input_data = VideoProcessingInput(**input_dict)

        # Arbeite mit validiertem Input
        result = await workflow.execute_activity(
            "process_video",
            input_data.model_dump(),  # Serialisierung zu dict (Pydantic v2)
            task_queue="go-tasks"
        )
        return result

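Alternativ zur manuellen Validierung im Workflow gibt es im Python SDK ein Pydantic-Contrib-Modul (Annahme: eine neuere temporalio-Version, in der temporalio.contrib.pydantic verfügbar ist), mit dem Pydantic-Modelle direkt als Parameter serialisiert werden:

# Skizze: Pydantic-Modelle direkt als Workflow-/Activity-Parameter nutzen
from temporalio.client import Client
from temporalio.contrib.pydantic import pydantic_data_converter  # Annahme: neuere SDK-Version

async def connect_with_pydantic() -> Client:
    # Derselbe Data Converter muss auch beim Worker konfiguriert werden
    return await Client.connect("localhost:7233", data_converter=pydantic_data_converter)
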
4. Deployment Coordination

# docker-compose.yaml für Multi-Language Development
version: '3.8'
services:
  temporal:
    image: temporalio/auto-setup:latest
    ports:
      - "7233:7233"

  python-worker:
    build: ./python-worker
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TASK_QUEUE=python-tasks
    depends_on:
      - temporal

  go-worker:
    build: ./go-worker
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TASK_QUEUE=go-tasks
    depends_on:
      - temporal

  typescript-worker:
    build: ./typescript-worker
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TASK_QUEUE=ts-tasks
    depends_on:
      - temporal

15.5 Zusammenfassung

In diesem Kapitel haben wir drei fortgeschrittene Temporal-Patterns kennengelernt:

AI Agents mit Temporal

Kernkonzepte:

  • LLM Calls als Activities (non-deterministisch)
  • Langlebige Konversationen mit State Management
  • Tool-Orchestrierung und Human-in-the-Loop
  • Multi-Agent Coordination mit Child Workflows

Vorteile:

  • ✅ State persistiert automatisch über Stunden/Tage
  • ✅ Retry Policies für fehleranfällige LLM APIs
  • ✅ Vollständige Observability der Agent-Aktionen
  • ✅ Einfache Integration von Tools und menschlicher Intervention

Real-World Adoption:

  • OpenAI Agents SDK Integration (2024)
  • Genutzt von Lindy, Dust, ZoomInfo

Serverless Integration

Kernkonzepte:

  • Temporal Worker auf ECS/EKS (long-running)
  • Lambda Functions als Activities invoken
  • SQS + Lambda als Workflow-Trigger
  • Alternative zu AWS Step Functions

Deployment-Optionen:

  • ECS Fargate: Serverless Workers
  • EKS: Kubernetes-basierte Workers
  • Hybrid: Worker auf Reserved Instances + Lambda für Burst

Vorteile:

  • ✅ Cloud-agnostisch (vs. Step Functions)
  • ✅ Echte Programmiersprachen (vs. JSON ASL)
  • ✅ Besseres Debugging und Testing
  • ✅ Cost Optimization durch Hybrid-Ansatz

Polyglot Workflows

Kernkonzepte:

  • Ein Workflow = Eine Sprache
  • Activities in verschiedenen Sprachen
  • Task Queues pro Sprache/Service
  • Automatische Serialisierung zwischen SDKs

Unterstützte Sprachen:

  • Python, Go, Java, TypeScript, .NET, PHP, Ruby

Vorteile:

  • ✅ Nutze beste Sprache für jede Aufgabe
  • ✅ Integration von Legacy-Systemen
  • ✅ Team-Autonomie (jedes Team nutzt eigene Sprache)
  • ✅ Einfache Daten-Konvertierung
Die folgende Übersicht fasst die drei Patterns und ihre Bausteine zusammen:

graph TB
    Start[Erweiterte Temporal Patterns]

    AI[AI Agents]
    Lambda[Serverless/Lambda]
    Polyglot[Polyglot Workflows]

    Start --> AI
    Start --> Lambda
    Start --> Polyglot

    AI --> AI1[LLM Orchestration]
    AI --> AI2[Tool Integration]
    AI --> AI3[Multi-Agent Systems]

    Lambda --> L1[Worker auf ECS/EKS]
    Lambda --> L2[Lambda als Activities]
    Lambda --> L3[Step Functions Alternative]

    Polyglot --> P1[Cross-Language Activities]
    Polyglot --> P2[Task Queue per Language]
    Polyglot --> P3[Automatic Serialization]

    AI1 --> Production[Production-Ready Advanced Workflows]
    AI2 --> Production
    AI3 --> Production
    L1 --> Production
    L2 --> Production
    L3 --> Production
    P1 --> Production
    P2 --> Production
    P3 --> Production

    style AI fill:#e1f5ff
    style Lambda fill:#ff9900,color:#fff
    style Polyglot fill:#90EE90
    style Production fill:#ffd700

Gemeinsame Themen

Alle drei Patterns profitieren von Temporals Kernstärken:

  1. State Durability: Workflows können unterbrochen und wiederaufgenommen werden
  2. Retry Policies: Automatische Wiederholung bei Fehlern
  3. Observability: Vollständige Event History für Debugging
  4. Scalability: Horizontal skalierbare Worker
  5. Flexibility: Anpassbar an verschiedene Architekturen

Ein mögliches Folgekapitel würde Testing-Strategien für diese komplexen Workflows behandeln.


⬆ Zurück zum Inhaltsverzeichnis

Vorheriges Kapitel: Kapitel 14: Muster-Rezepte (Human-in-Loop, Cron, Order Fulfillment)

Praktische Übung: Implementieren Sie einen AI Agent mit Tool-Calls oder eine Polyglot-Pipeline mit mindestens zwei verschiedenen Sprachen!

Ressourcen und Referenzen

Hier finden Sie eine kuratierte Liste von Ressourcen, die Ihnen beim Lernen und Arbeiten mit Temporal helfen.

Offizielle Temporal-Ressourcen

Dokumentation

  • Temporal Documentation: https://docs.temporal.io/
  • Temporal Python SDK Documentation: https://docs.temporal.io/develop/python
  • Temporal TypeScript SDK Documentation: https://docs.temporal.io/develop/typescript

Community und Support

  • Temporal Community Forum: https://community.temporal.io/
  • Temporal GitHub: https://github.com/temporalio
  • Temporal Slack: https://temporal.io/slack

Lernmaterialien

  • Temporal Blog: https://temporal.io/blog
  • Temporal YouTube Channel: https://www.youtube.com/c/Temporalio
  • Temporal Academy: https://learn.temporal.io/

Python-spezifische Ressourcen

  • temporalio/sdk-python: https://github.com/temporalio/sdk-python
  • Python Samples Repository: https://github.com/temporalio/samples-python
  • uv Package Manager: https://github.com/astral-sh/uv

Dieses Buch

  • GitHub Repository: https://github.com/TheCodeEngine/temporal-durable-execution-mastery
  • Issues und Feedback: https://github.com/TheCodeEngine/temporal-durable-execution-mastery/issues

Hinweis: Links werden regelmäßig aktualisiert. Bei Problemen erstellen Sie bitte ein Issue auf GitHub.