Temporal.io – Durable Execution Mastery
Ein umfassender Deep Dive in die Orchestrierung verteilter Systeme mit Temporal
Über dieses Buch
Dieses Buch ist eine vollständige Einführung in Temporal.io, die führende Plattform für Durable Execution. Hier lernen Sie, wie Sie zuverlässige, skalierbare und wartbare verteilte Systeme entwickeln, indem Sie komplexe Workflows als einfachen Code schreiben.
Das Buch kombiniert theoretische Grundlagen mit praktischen Python-Beispielen, die Sie direkt ausführen können. Jedes Kapitel enthält lauffähige Code-Beispiele aus dem GitHub Repository, die Temporal-Konzepte demonstrieren.
Entstehung und Methodik
Dieses Buch wurde als persönliches Lernprojekt entwickelt, um Temporal.io umfassend zu verstehen und zu meistern. Die Inhalte entstanden in Zusammenarbeit mit generativer KI (Claude by Anthropic), wobei ich als Autor:
- Die Lernziele, Struktur und inhaltliche Ausrichtung definiert habe
- Alle Konzepte aktiv erarbeitet und hinterfragt habe
- Die Code-Beispiele entwickelt und getestet habe
- Die technische Korrektheit und praktische Anwendbarkeit sichergestellt habe
Die KI diente dabei als interaktiver Lernpartner: Sie half mir, komplexe Temporal-Konzepte zu strukturieren, verschiedene Perspektiven zu beleuchten und das Gelernte in verständliche Erklärungen zu übersetzen. Dieser kollaborative Ansatz ermöglichte es mir, tiefer in die Materie einzutauchen und ein umfassendes Verständnis von Durable Execution zu entwickeln.
Das Ergebnis ist ein Buch, das meine persönliche Lernreise dokumentiert und anderen helfen soll, Temporal.io systematisch zu erlernen.
Voraussetzungen
- Python 3.13+
- uv package manager
- Temporal CLI oder Docker (für Code-Beispiele)
- Grundkenntnisse in Python und verteilten Systemen
Was Sie lernen werden
Teil I: Grundlagen der Durable Execution
Lernen Sie die Kernkonzepte von Temporal kennen und verstehen Sie, warum Durable Execution die Zukunft verteilter Systeme ist.
- Kapitel 1: Einführung in Temporal
- Kapitel 2: Kernbausteine: Workflows, Activities, Worker
- Kapitel 3: Architektur des Temporal Service
Teil II: Entwicklung von Temporal-Anwendungen (SDK-Fokus)
Tauchen Sie ein in die praktische Entwicklung mit dem Temporal Python SDK.
- Kapitel 4: Entwicklungs-Setup und SDK-Auswahl
- Kapitel 5: Workflows programmieren
- Kapitel 6: Kommunikation (Signale und Queries)
Teil III: Resilienz, Evolution und Muster
Meistern Sie fortgeschrittene Muster für robuste, evolvierbare Systeme.
- Kapitel 7: Fehlerbehandlung und Retries
- Kapitel 8: SAGA Pattern
- Kapitel 9: Workflow-Evolution und Versionierung
Teil IV: Betrieb, Skalierung und Best Practices
Bringen Sie Ihre Temporal-Anwendungen in die Produktion.
- Kapitel 10: Produktions-Deployment
- Kapitel 11: Monitoring und Observability
- Kapitel 12: Testing Strategies
- Kapitel 13: Best Practices und Anti-Muster
Teil V: Das Temporal Kochbuch
Praktische Rezepte für häufige Anwendungsfälle.
- Kapitel 14: Muster-Rezepte (Human-in-Loop, Cron, Order Fulfillment)
- Kapitel 15: Erweiterte Rezepte (AI Agents, Lambda, Polyglot)
Code-Beispiele
Alle Code-Beispiele aus diesem Buch finden Sie im GitHub Repository unter examples/. Jedes Kapitel hat sein eigenes lauffähiges Python-Projekt:
# Beispiel ausführen (z.B. Kapitel 1)
cd examples/part-01/chapter-01
uv sync
uv run python simple_workflow.py
Ressourcen
- Temporal Documentation: https://docs.temporal.io/
- Temporal Python SDK: https://docs.temporal.io/develop/python
- Temporal Community: https://community.temporal.io/
Viel Erfolg beim Lernen von Temporal!
Kapitel 1: Einführung in Temporal
Lernziele:
- Verstehen, was Temporal ist und warum es wichtig ist
- Die Grundprinzipien der Durable Execution kennenlernen
- Die Geschichte von Temporal nachvollziehen
- Anwendungsfälle für Temporal identifizieren können
1.1 Das Problem verteilter Systeme
Stellen Sie sich vor, Sie entwickeln ein E-Commerce-System. Ein Kunde bestellt ein Produkt, und Ihr System muss folgende Schritte ausführen:
- Zahlung bei einem Zahlungsdienstleister (z.B. Stripe) durchführen
- Lagerbestand im Inventory-Service reduzieren
- Versand beim Logistikpartner beauftragen
- Bestätigungs-E-Mail versenden
Was passiert, wenn:
- Der Zahlungsdienstleister erst nach 30 Sekunden antwortet, Ihre Anfrage aber bereits in einen Timeout gelaufen ist?
- Der Inventory-Service abstürzt, nachdem die Zahlung durchging?
- Der Versanddienstleister nicht erreichbar ist?
- Ihr Server während des Prozesses neu startet?
Bei traditionellen Ansätzen müssen Sie:
- Manuell Zustand in einer Datenbank speichern
- Komplexe Retry-Logik implementieren
- Kompensations-Transaktionen für Rollbacks programmieren
- Idempotenz-Schlüssel verwalten
- Worker-Prozesse und Message Queues koordinieren
Dies führt zu hunderten Zeilen Boilerplate-Code, nur um sicherzustellen, dass Ihr Geschäftsprozess zuverlässig funktioniert.
Temporal löst diese Probleme auf fundamentale Weise.
1.2 Was ist Temporal?
Definition
Temporal ist eine Open-Source-Plattform (MIT-Lizenz) für Durable Execution – dauerhafte, ausfallsichere Codeausführung. Es handelt sich um eine zuverlässige Laufzeitumgebung, die garantiert, dass Ihr Code vollständig ausgeführt wird, unabhängig davon, wie viele Fehler auftreten.
Das Kernversprechen von Temporal:
“Build applications that never lose state, even when everything else fails”
Entwickeln Sie Anwendungen, die niemals ihren Zustand verlieren, selbst wenn alles andere ausfällt.
Was ist Durable Execution?
Durable Execution (Dauerhafte Ausführung) ist crash-sichere Codeausführung mit folgenden Eigenschaften:
1. Virtualisierte Ausführung
Ihr Code läuft über mehrere Prozesse hinweg, potenziell auf verschiedenen Maschinen. Bei einem Crash wird die Arbeit transparent in einem neuen Prozess fortgesetzt, wobei der Anwendungszustand automatisch wiederhergestellt wird.
sequenceDiagram
participant Code as Ihr Workflow-Code
participant W1 as Worker 1
participant TS as Temporal Service
participant W2 as Worker 2
Code->>W1: Schritt 1: Zahlung
W1->>TS: Event: Zahlung erfolgreich
W1-xW1: ❌ Worker abstürzt
TS->>W2: Wiederherstellung: Replay Events
W2->>Code: Zustand wiederhergestellt
Code->>W2: Schritt 2: Inventory
W2->>TS: Event: Inventory aktualisiert
2. Automatische Zustandspersistierung
Der Zustand wird bei jedem Schritt automatisch erfasst und gespeichert. Bei einem Fehler kann die Ausführung exakt dort fortgesetzt werden, wo sie aufgehört hat – ohne Fortschrittsverlust.
3. Zeitunabhängiger Betrieb
Anwendungen können unbegrenzt laufen – von Millisekunden bis zu Jahren – ohne zeitliche Beschränkungen und ohne externe Scheduler.
4. Hardware-agnostisches Design
Zuverlässigkeit ist in die Software eingebaut, nicht abhängig von teurer fehlertoleranter Hardware. Funktioniert in VMs, Containern und Cloud-Umgebungen.
Temporal vs. Traditionelle Ansätze
Die folgende Tabelle zeigt den fundamentalen Unterschied:
| Aspekt | Traditionelle Zustandsmaschine | Temporal Durable Execution |
|---|---|---|
| Zustandsmanagement | Manuell in Datenbanken persistieren | Automatisch durch Event Sourcing |
| Fehlerbehandlung | Manuell Retries und Timeouts implementieren | Eingebaute, konfigurierbare Retry-Policies |
| Wiederherstellung | Komplexe Checkpoint-Logik programmieren | Automatische Wiederherstellung am exakten Unterbrechungspunkt |
| Debugging | Zustand über verteilte Logs suchen | Vollständige Event-History in einem Log |
| Code-Stil | Zustandsübergänge explizit definieren | Normale if/else und Schleifen in Ihrer Programmiersprache |
1.3 Geschichte: Von AWS SWF über Cadence zu Temporal
Die Ursprünge bei Amazon (2002-2010)
Max Fateev arbeitete bei Amazon und leitete die Architektur und Entwicklung von:
- AWS Simple Workflow Service (SWF) – Einer der ersten Workflow-Engines in der Cloud
- AWS Simple Queue Service (SQS) – das Storage-Backend für einen der meistgenutzten Queue-Services weltweit
Diese Erfahrungen zeigten die Notwendigkeit für zuverlässige Orchestrierung verteilter Systeme.
Microsoft Azure Durable Functions
Parallel entwickelte Samar Abbas bei Microsoft das Durable Task Framework – eine Orchestrierungs-Bibliothek für langlebige, zustandsbehaftete Workflows, die zur Grundlage für Azure Functions wurde.
Cadence bei Uber (2015)
timeline
title Von Cadence zu Temporal
2002-2010 : Max Fateev bei Amazon
: AWS SWF & SQS
2015 : Cadence bei Uber
: Max Fateev + Samar Abbas
: Open Source von Anfang an
2019 : Temporal gegründet
: 2. Mai 2019
: Fork von Cadence
2020 : Series A
: 18,75 Mio USD
2021 : Series B
: 75 Mio USD
2024 : Bewertung > 1,5 Mrd USD
: Tausende Kunden weltweit
2015 kamen Max Fateev und Samar Abbas bei Uber zusammen und schufen Cadence – eine transformative Workflow-Engine, die von Anfang an vollständig Open Source war.
Produktionsdaten bei Uber:
- 100+ verschiedene Anwendungsfälle
- 50 Millionen laufende Ausführungen zu jedem Zeitpunkt
- 3+ Milliarden Ausführungen pro Monat
Temporal gegründet (2019)
Am 2. Mai 2019 gründeten die ursprünglichen Tech-Leads von Cadence – Maxim Fateev und Samar Abbas – Temporal Technologies und forkten das Cadence-Projekt.
Warum ein Fork?
Temporal wurde gegründet, um:
- Die Entwicklung zu beschleunigen
- Cloud-nativen Support zu verbessern
- Eine bessere Developer Experience zu schaffen
- Ein nachhaltiges Business-Modell zu etablieren
Finanzierung und Wachstum:
- Oktober 2020: Series A mit 18,75 Millionen USD
- Juni 2021: Series B mit 75 Millionen USD
- 2024: Series B erweitert auf 103 Millionen USD, Unternehmensbewertung über 1,5 Milliarden USD
1.4 Kernkonzepte im Überblick
Temporal basiert auf drei Hauptkomponenten:
1. Workflows
Ein Workflow definiert eine Abfolge von Schritten durch Code.
Eigenschaften:
- Geschrieben in Ihrer bevorzugten Programmiersprache (Go, Java, Python, TypeScript, .NET, PHP, Ruby)
- Resilient: Workflows können jahrelang laufen, selbst bei Infrastrukturausfällen
- Ressourceneffizient: Im Wartezustand verbrauchen sie null Rechenressourcen
- Deterministisch: Muss bei gleichen Eingaben immer gleich ablaufen (für Replay-Mechanismus)
2. Activities
Eine Activity ist eine Methode oder Funktion, die fehleranfällige Geschäftslogik kapselt.
Eigenschaften:
- Führt eine einzelne, klar definierte Aktion aus (z.B. API-Aufruf, E-Mail senden, Datei verarbeiten)
- Nicht deterministisch: Darf externe Systeme aufrufen
- Automatisch wiederholbar: Das System kann Activities bei Fehlern automatisch wiederholen
- Timeout-geschützt: Konfigurierbare Timeouts verhindern hängende Operations
3. Workers
Ein Worker führt Workflow- und Activity-Code aus.
Eigenschaften:
- Prozess, der als Brücke zwischen Anwendungslogik und Temporal Server dient
- Pollt eine Task Queue, die ihm Aufgaben zur Ausführung zuweist
- Meldet Ergebnisse zurück an den Temporal Service
- Kann horizontal skaliert werden
graph TB
subgraph "Ihre Anwendung"
WF[Workflow Definition]
ACT[Activity Implementierung]
WORKER[Worker Prozess]
WF -->|ausgeführt von| WORKER
ACT -->|ausgeführt von| WORKER
end
subgraph "Temporal Platform"
TS[Temporal Service]
DB[(Event History Database)]
TS -->|speichert| DB
end
WORKER <-->|Task Queue| TS
style WF fill:#e1f5ff
style ACT fill:#ffe1f5
style WORKER fill:#f5ffe1
style TS fill:#ffd700
style DB fill:#ddd
1.5 Hauptanwendungsfälle
Temporal wird von tausenden Unternehmen für mission-critical Anwendungen eingesetzt. Hier sind reale Beispiele:
Financial Operations
- Stripe: Payment Processing
- Coinbase: Jede Coinbase-Transaktion nutzt Temporal für Geldtransfers
- ANZ Bank: Hypotheken-Underwriting – langlebige, zustandsbehaftete Prozesse über Wochen
E-Commerce und Logistik
- Turo: Buchungssystem für Carsharing
- Maersk: Logistik-Orchestrierung – Verfolgung von Containern weltweit
- Box: Content Management
Infrastruktur und DevOps
- Netflix: Custom CI/CD-Systeme – “fundamentaler Wandel in der Art, wie Anwendungen entwickelt werden können”
- Datadog: Infrastruktur-Services – von einer Anwendung auf über 100 Nutzer in Dutzenden Teams innerhalb eines Jahres
- Snap: Jede Snap Story verwendet Temporal
Kommunikation
- Twilio: Jede Nachricht auf Twilio nutzt Temporal
- Airbnb: Marketing-Kampagnen-Orchestrierung
AI und Machine Learning
- Lindy, Dust, ZoomInfo: AI Agents mit State-Durability und menschlicher Intervention
- Descript & Neosync: Datenpipelines und GPU-Ressourcen-Koordination
1.6 Warum ist Temporal wichtig?
Problem 1: Fehlerresilienz
Traditionell:
def process_order(order_id):
try:
payment = charge_credit_card(order_id) # Was, wenn Timeout?
save_payment_to_db(payment) # Was, wenn Server hier abstürzt?
inventory = update_inventory(order_id) # Was, wenn Service nicht erreichbar?
save_inventory_to_db(inventory) # Was, wenn DB-Connection verloren?
shipping = schedule_shipping(order_id) # Was, wenn nach 2 Retries immer noch Fehler?
send_confirmation_email(order_id) # Was, wenn E-Mail-Service down ist?
except Exception as e:
# Manuelle Rollback-Logik für jeden möglichen Fehlerzustand?
# Welche Schritte waren erfolgreich?
# Wie kompensieren wir bereits durchgeführte Aktionen?
# Wie stellen wir sicher, dass wir nicht doppelt buchen?
pass
Mit Temporal:
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # Temporal garantiert, dass dieser Code vollständig ausgeführt wird
        payment = await workflow.execute_activity(
            charge_credit_card,
            order_id,
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(maximum_attempts=5),
        )
        inventory = await workflow.execute_activity(
            update_inventory,
            order_id,
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        shipping = await workflow.execute_activity(
            schedule_shipping,
            order_id,
            start_to_close_timeout=timedelta(minutes=2),
        )
        await workflow.execute_activity(
            send_confirmation_email,
            order_id,
            start_to_close_timeout=timedelta(minutes=2),
        )
        # Kein manuelles State-Management
        # Keine manuellen Retries
        # Automatische Wiederherstellung bei Crashes
Problem 2: Langlebige Prozesse
Beispiel: Kreditantrag
Ein Hypothekenantrag kann Wochen dauern:
- Antrag eingereicht → Wartet auf Dokumente
- Dokumente hochgeladen → Wartet auf manuelle Prüfung
- Prüfung abgeschlossen → Wartet auf Gutachten
- Gutachten erhalten → Finale Entscheidung
Mit traditionellen Ansätzen:
- Cron-Jobs, die den Status in der DB prüfen
- Komplexe Zustandsmaschinen
- Anfällig für Race Conditions
- Schwer zu debuggen
Mit Temporal:
@workflow.defn
class MortgageApplicationWorkflow:
    def __init__(self) -> None:
        # Diese Flags werden über Signale gesetzt (siehe Kapitel 6)
        self.documents_uploaded = False
        self.review_completed = False
        self.appraisal_received = False

    @workflow.run
    async def run(self, application_id: str):
        # Wartet auf Dokumente (kann Tage dauern)
        await workflow.wait_condition(lambda: self.documents_uploaded)
        # Wartet auf manuelle Prüfung
        await workflow.wait_condition(lambda: self.review_completed)
        # Wartet auf Gutachten
        await workflow.wait_condition(lambda: self.appraisal_received)
        # Finale Entscheidung
        decision = await workflow.execute_activity(
            make_decision,
            application_id,
            start_to_close_timeout=timedelta(minutes=5),
        )
        return decision
Der Workflow kann Wochen oder Monate laufen, ohne Ressourcen zu verbrauchen, während er wartet.
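Die Warte-Flags aus dem Beispiel werden in der Praxis über Signale gesetzt (ausführlich in Kapitel 6). Eine minimale Skizze der Client-Seite; Workflow-ID und Signalname sind hier frei gewählte Annahmen:
from temporalio.client import Client

async def notify_documents_uploaded(application_id: str) -> None:
    # Verbindung zum lokal laufenden Temporal Service (Annahme: Standardport 7233)
    client = await Client.connect("localhost:7233")

    # Handle auf die laufende Workflow Execution holen
    handle = client.get_workflow_handle(f"mortgage-{application_id}")

    # Ruft einen @workflow.signal-Handler auf, der im Workflow
    # self.documents_uploaded = True setzt
    await handle.signal("documents_received")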
Problem 3: Observability
graph LR
subgraph "Ohne Temporal"
A[Logs in Service A]
B[Logs in Service B]
C[DB State]
D[Queue Messages]
E[Entwickler sucht Fehler]
E -.->|durchsucht| A
E -.->|durchsucht| B
E -.->|prüft| C
E -.->|prüft| D
end
subgraph "Mit Temporal"
F[Temporal Web UI]
G[Event History]
H[Entwickler sieht komplette History]
H -->|ein Klick| F
F -->|zeigt| G
end
style E fill:#ffcccc
style H fill:#ccffcc
Mit Temporal haben Sie:
- Vollständige Event-History jeder Workflow-Ausführung
- Time-Travel Debugging: Sehen Sie exakt, was zu jedem Zeitpunkt passiert ist
- Web UI: Visualisierung aller laufenden und abgeschlossenen Workflows
- Stack Traces: Sehen Sie, wo ein Workflow gerade “hängt”
1.7 Fundamentaler Paradigmenwechsel
Temporal hebt die Anwendungsentwicklung auf eine neue Ebene, indem es die Last der Fehlerbehandlung entfernt – ähnlich wie höhere Programmiersprachen die Komplexität der Maschinenprogrammierung abstrahiert haben.
Analogie: Von Assembler zu Python
| Assembler (1950er) | Python (heute) |
|---|---|
| Manuelle Speicherverwaltung | Garbage Collection |
| Register manuell verwalten | Variablen einfach deklarieren |
| Goto-Statements | Strukturierte Programmierung |
| Hunderte Zeilen für einfache Aufgaben | Wenige Zeilen aussagekräftiger Code |
| Ohne Temporal | Mit Temporal |
|---|---|
| Manuelle Zustandsspeicherung in DB | Automatisches State-Management |
| Retry-Logik überall | Deklarative Retry-Policies |
| Timeout-Handling manuell | Automatische Timeouts |
| Fehlersuche über viele Services | Zentrale Event-History |
| Defensive Programmierung | Fokus auf Geschäftslogik |
Temporal macht verteilte Systeme so zuverlässig wie Schwerkraft.
1.8 Zusammenfassung
In diesem Kapitel haben Sie gelernt:
✅ Was Temporal ist: Eine Plattform für Durable Execution, die garantiert, dass Ihr Code vollständig ausgeführt wird, unabhängig von Fehlern
✅ Die Geschichte: Von AWS SWF über Cadence bei Uber zu Temporal als führende Open-Source-Lösung mit Milliarden-Bewertung
✅ Kernkonzepte: Workflows (Orchestrierung), Activities (Aktionen), Workers (Ausführung)
✅ Anwendungsfälle: Von Payment Processing bei Stripe/Coinbase über Logistik bei Maersk bis hin zu CI/CD bei Netflix
✅ Warum es wichtig ist: Temporal löst fundamentale Probleme verteilter Systeme – Fehlerresilienz, langlebige Prozesse, Observability
Im nächsten Kapitel werden wir tiefer in die Kernbausteine eintauchen und verstehen, wie Workflows, Activities und Worker im Detail funktionieren.
Praktisches Beispiel
Im Verzeichnis ../examples/part-01/chapter-01/ finden Sie ein lauffähiges Beispiel eines einfachen Temporal Workflows:
cd ../examples/part-01/chapter-01
uv sync
uv run python simple_workflow.py
Dieses Beispiel zeigt:
- Wie ein Workflow definiert wird
- Wie eine Verbindung zum Temporal Server hergestellt wird
- Wie ein Workflow gestartet und ausgeführt wird
- Wie das Ergebnis abgerufen wird
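Eine stark vereinfachte Skizze, wie ein solches Minimalbeispiel aufgebaut sein kann (das tatsächliche Skript im Repository kann im Detail abweichen; Task-Queue-Name und Workflow-ID sind hier frei gewählt):
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def say_hello(name: str) -> str:
    return f"Hello, {name}!"


@workflow.defn
class SimpleWorkflow:
    @workflow.run
    async def run(self, name: str) -> str:
        return await workflow.execute_activity(
            say_hello,
            name,
            start_to_close_timeout=timedelta(seconds=10),
        )


async def main() -> None:
    # Annahme: Temporal Service läuft lokal (z.B. via `temporal server start-dev`)
    client = await Client.connect("localhost:7233")

    # Worker im Hintergrund starten und Workflow ausführen
    async with Worker(
        client,
        task_queue="hello-task-queue",
        workflows=[SimpleWorkflow],
        activities=[say_hello],
    ):
        result = await client.execute_workflow(
            SimpleWorkflow.run,
            "Temporal",
            id="simple-workflow-example",
            task_queue="hello-task-queue",
        )
        print(f"Ergebnis: {result}")


if __name__ == "__main__":
    asyncio.run(main())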
Weiterführende Ressourcen
- 📚 Offizielle Dokumentation: https://docs.temporal.io/
- 🎥 Temporal YouTube Channel: Tutorials und Talks
- 💬 Community Slack: https://temporal.io/slack
- 🐙 GitHub: https://github.com/temporalio/temporal
- 📰 Temporal Blog: https://temporal.io/blog – Case Studies und Best Practices
Zurück zum Inhaltsverzeichnis | Nächstes Kapitel: Kernbausteine →
Kapitel 2: Kernbausteine – Workflows, Activities, Worker
Nach der Einführung in Temporal im vorherigen Kapitel tauchen wir nun tief in die drei fundamentalen Bausteine ein, die das Herzstück jeder Temporal-Anwendung bilden: Workflows, Activities und Worker. Das Verständnis dieser Komponenten und ihres Zusammenspiels ist entscheidend für die erfolgreiche Entwicklung mit Temporal.
2.1 Überblick: Die drei Säulen von Temporal
Temporal basiert auf einer klaren Trennung der Verantwortlichkeiten (Separation of Concerns), die in drei Hauptkomponenten unterteilt ist:
graph TB
subgraph "Temporal Application"
W[Workflows<br/>Orchestrierung]
A[Activities<br/>Ausführung]
WK[Workers<br/>Hosting]
end
subgraph "Charakteristika"
W --> W1[Deterministisch]
W --> W2[Koordinieren]
W --> W3[Event Sourcing]
A --> A1[Nicht-deterministisch]
A --> A2[Ausführen]
A --> A3[Side Effects]
WK --> WK1[Stateless]
WK --> WK2[Polling]
WK --> WK3[Skalierbar]
end
style W fill:#e1f5ff
style A fill:#ffe1e1
style WK fill:#e1ffe1
Die Rollen im Detail:
- Workflows: Die Dirigenten des Orchesters – sie definieren was passieren soll und in welcher Reihenfolge, führen aber selbst keine Business Logic aus.
- Activities: Die Musiker – sie führen die eigentliche Arbeit aus, von Datenbankzugriffen über API-Aufrufe bis hin zu komplexen Berechnungen.
- Workers: Die Konzerthalle – sie bieten die Infrastruktur, in der Workflows und Activities ausgeführt werden, und kommunizieren mit dem Temporal Service.
2.2 Workflows: Die Orchestrierungslogik
2.2.1 Was ist ein Workflow?
Ein Workflow in Temporal ist eine Funktion oder Methode, die die Orchestrierungslogik Ihrer Anwendung definiert. Anders als in vielen anderen Workflow-Engines wird ein Temporal-Workflow in einer echten Programmiersprache geschrieben – nicht in YAML, XML oder einer DSL.
Fundamentale Eigenschaften:
- Deterministisch: Bei gleichen Inputs immer gleiche Outputs und Commands
- Langlebig: Kann Tage, Monate oder Jahre laufen
- Ausfallsicher: Übersteht Infrastruktur- und Code-Deployments
- Versionierbar: Unterstützt Code-Änderungen bei laufenden Workflows
Ein einfaches Beispiel aus dem Code:
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy
@workflow.defn
class DataProcessingWorkflow:
"""
Ein Workflow orchestriert Activities - er führt sie nicht selbst aus.
"""
@workflow.run
async def run(self, data: str) -> dict:
# Ruft Activity auf - delegiert die eigentliche Arbeit
processed = await workflow.execute_activity(
process_data,
data,
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
maximum_attempts=3,
initial_interval=timedelta(seconds=1),
),
)
# Orchestriert weitere Schritte
await workflow.execute_activity(
send_notification,
f"Processed: {processed}",
start_to_close_timeout=timedelta(seconds=10),
)
return {"input": data, "output": processed, "status": "completed"}
📁 Code-Beispiel:
../examples/part-01/chapter-02/workflow.py
2.2.2 Der Determinismus-Constraint
Warum Determinismus?
Temporal nutzt einen Replay-Mechanismus, um Workflow-State zu rekonstruieren. Stellen Sie sich vor, ein Worker-Prozess stürzt ab, während ein Workflow läuft. Wenn der Workflow später auf einem anderen Worker fortgesetzt wird, muss Temporal den exakten State wiederherstellen. Dies geschieht durch:
- Laden der Event History (alle bisherigen Events)
- Replay des Workflow-Codes gegen diese History
- Vergleich der generierten Commands mit der History
- Bei Übereinstimmung: Workflow kann fortgesetzt werden
sequenceDiagram
participant WC as Workflow Code
participant Worker
participant History as Event History
participant Service as Temporal Service
Note over Worker: Worker startet neu nach Crash
Worker->>Service: Poll für Workflow Task
Service->>Worker: Workflow Task + Event History
Worker->>History: Lade alle Events
History-->>Worker: [Start, Activity1Scheduled, Activity1Complete, ...]
Worker->>WC: Replay Code von Anfang
WC->>WC: Führe Code aus
WC-->>Worker: Commands [ScheduleActivity1, ...]
Worker->>Worker: Validiere Commands gegen History
alt Commands stimmen überein
Worker->>Service: Workflow Task Complete
Note over Worker: State erfolgreich rekonstruiert
else Commands weichen ab
Worker->>Service: Non-Deterministic Error
Note over Worker: Code hat sich geändert!
end
Was ist verboten in Workflows?
# ❌ FALSCH: Nicht-deterministisch
@workflow.defn
class BadWorkflow:
@workflow.run
async def run(self):
# ❌ Zufallszahlen
random_value = random.random()
# ❌ Aktuelle Zeit
now = datetime.now()
# ❌ Direkte I/O-Operationen
with open('file.txt') as f:
data = f.read()
# ❌ Direkte API-Aufrufe
response = requests.get('https://api.example.com')
return random_value
# ✅ RICHTIG: Deterministisch
@workflow.defn
class GoodWorkflow:
@workflow.run
async def run(self):
# ✅ Temporal's Zeit-API
now = workflow.now()
# ✅ Temporal's Sleep
await workflow.sleep(timedelta(hours=1))
# ✅ I/O in Activities auslagern
data = await workflow.execute_activity(
read_file,
'file.txt',
start_to_close_timeout=timedelta(seconds=10)
)
# ✅ API-Aufrufe in Activities
response = await workflow.execute_activity(
call_external_api,
'https://api.example.com',
start_to_close_timeout=timedelta(seconds=30)
)
return {"data": data, "response": response}
Die goldene Regel: Workflows orchestrieren, Activities führen aus.
2.2.3 Long-Running Workflows und Continue-As-New
Temporal-Workflows können theoretisch unbegrenzt lange laufen. In der Praxis gibt es jedoch technische Grenzen:
Event History Limits:
- Maximum 50.000 Events (technisch 51.200)
- Maximum 50 MB History-Größe
- Performance-Degradation ab ~10.000 Events
Continue-As-New Pattern:
Für langlebige Workflows nutzt man das Continue-As-New Pattern:
@workflow.defn
class LongRunningWorkflow:
@workflow.run
async def run(self, iteration: int = 0) -> str:
# Führe Batch von Arbeit aus
for i in range(100):
await workflow.execute_activity(
process_item,
f"item-{iteration}-{i}",
start_to_close_timeout=timedelta(minutes=1)
)
# Nach 100 Items: Continue-As-New
# Neue Workflow-Instanz mit frischer Event History
workflow.continue_as_new(iteration + 1)
Wie Continue-As-New funktioniert:
timeline
title Workflow Lifecycle mit Continue-As-New
section Run 1
Start Workflow : Event History [0-100 Events]
Process Items : 100 Activities
Decision Point : Continue-As-New?
section Run 2
New Run ID : Neue Event History [0 Events]
Transfer State : iteration = 1
Process Items : 100 Activities
Decision Point : Continue-As-New?
section Run 3
New Run ID : Neue Event History [0 Events]
Transfer State : iteration = 2
Continue... : Unbegrenzt möglich
Wann Continue-As-New nutzen?
- Bei regelmäßigen Checkpoints (täglich, wöchentlich)
- Wenn Event History >10.000 Events erreicht
- Bei häufigen Code-Deployments (vermeidet Versioning-Probleme)
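Der passende Zeitpunkt lässt sich auch zur Laufzeit ermitteln: Der Temporal Service schlägt selbst vor, wann ein Continue-As-New sinnvoll ist. Eine Skizze als Variante des Beispiels oben (Annahme: eine Python-SDK-Version, die is_continue_as_new_suggested() bereitstellt):
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class LongRunningWorkflow:
    @workflow.run
    async def run(self, processed: int = 0) -> None:
        while True:
            await workflow.execute_activity(
                process_item,                      # Activity aus dem Beispiel oben
                f"item-{processed}",
                start_to_close_timeout=timedelta(minutes=1),
            )
            processed += 1

            # Der Service signalisiert, wenn die Event History zu groß wird
            if workflow.info().is_continue_as_new_suggested():
                workflow.continue_as_new(processed)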
2.2.4 Workflow Versioning
Code ändert sich. Workflows laufen lange. Was passiert, wenn laufende Workflows auf neue Code-Versionen treffen?
Problem:
# Version 1 (deployed, Workflows laufen)
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order_id: str):
await workflow.execute_activity(process_payment, ...)
return "done"
# Version 2 (neues Deployment)
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order_id: str):
# NEU: Validierung hinzugefügt
await workflow.execute_activity(validate_order, ...)
await workflow.execute_activity(process_payment, ...)
return "done"
Beim Replay eines alten Workflows würde der neue Code eine zusätzliche Activity schedulen – Non-Deterministic Error!
Lösung: Patching API
from temporalio import workflow
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order_id: str):
# Patching: Unterstütze alte UND neue Workflows
if workflow.patched("add-validation"):
# Neuer Code
await workflow.execute_activity(validate_order, ...)
# Alter Code (läuft in beiden Versionen)
await workflow.execute_activity(process_payment, ...)
return "done"
Patching-Workflow:
- Patch hinzufügen mit neuem und altem Code
- Warten bis alle alten Workflows abgeschlossen sind
- deprecate_patch() verwenden (siehe Skizze unten)
- Patch-Code entfernen im nächsten Deployment
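Schritt 3 sieht im Python SDK etwa so aus – eine Skizze auf Basis des Beispiels oben; deprecate_patch markiert den Patch als veraltet, solange der alte Code-Pfad noch nicht entfernt ist:
from temporalio import workflow

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str):
        # Schritt 3: Alle Workflows, die vor dem Patch gestartet wurden, sind
        # abgeschlossen – der Patch wird nur noch als "deprecated" markiert
        workflow.deprecate_patch("add-validation")

        # Es läuft nur noch der neue Code-Pfad
        await workflow.execute_activity(validate_order, ...)
        await workflow.execute_activity(process_payment, ...)
        return "done"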
2.2.5 Workflow-Timeouts
Temporal bietet verschiedene Timeout-Typen:
graph LR
subgraph "Workflow Execution Timeouts"
WET[Workflow Execution Timeout<br/>Gesamte Execution inkl. Retries]
WRT[Workflow Run Timeout<br/>Ein einzelner Run]
WTT[Workflow Task Timeout<br/>Worker Task Processing]
end
WET --> WRT
WRT --> WTT
style WET fill:#ffcccc
style WRT fill:#ffffcc
style WTT fill:#ccffcc
Empfehlung: Workflow-Timeouts sollten nur in Ausnahmefällen gesetzt werden. Workflows sind für langlebige, resiliente Ausführung konzipiert; ein abgelaufener Execution Timeout beendet die Workflow Execution endgültig. Falls doch nötig, werden die Timeouts beim Start des Workflows gesetzt (siehe Skizze unten).
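Eine Skizze, wie der Client die drei Timeout-Typen beim Start setzen kann; Workflow-ID, Task Queue und die konkreten Werte sind frei gewählte Annahmen:
from datetime import timedelta
from temporalio.client import Client

async def start_with_timeouts(client: Client) -> None:
    await client.start_workflow(
        OrderWorkflow.run,                      # Workflow aus den Beispielen oben
        "order-123",
        id="order-123",
        task_queue="book-examples",
        execution_timeout=timedelta(days=30),   # gesamte Execution (inkl. Retries)
        run_timeout=timedelta(days=7),          # ein einzelner Run
        task_timeout=timedelta(seconds=10),     # Verarbeitung eines Workflow Tasks
    )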
2.3 Activities: Die Business Logic
2.3.1 Was sind Activities?
Activities sind normale Funktionen, die einzelne, klar definierte Aktionen ausführen. Im Gegensatz zu Workflows dürfen Activities:
- ✅ I/O-Operationen durchführen
- ✅ Externe APIs aufrufen
- ✅ Datenbanken lesen/schreiben
- ✅ Zufallszahlen generieren
- ✅ Aktuelle Systemzeit verwenden
- ✅ Side Effects haben
Activities sind der Ort für die eigentliche Business Logic.
Beispiel aus dem Code:
from temporalio import activity
@activity.defn
async def process_data(data: str) -> str:
"""
Activity für Datenverarbeitung.
Darf nicht-deterministische Operationen durchführen.
"""
activity.logger.info(f"Processing data: {data}")
# Simuliert externe API-Aufrufe, DB-Zugriffe, etc.
result = data.upper()
activity.logger.info(f"Data processed: {result}")
return result
@activity.defn
async def send_notification(message: str) -> None:
"""
Activity für Side Effects (E-Mail, Webhook, etc.)
"""
activity.logger.info(f"Sending notification: {message}")
# In der Praxis: Echter API-Aufruf
# await email_service.send(message)
# await webhook.post(message)
print(f"📧 Notification: {message}")
📁 Code-Beispiel:
../examples/part-01/chapter-02/activities.py
2.3.2 Activity-Timeouts
Activities haben vier verschiedene Timeout-Typen:
graph TB
subgraph "Activity Lifecycle"
Scheduled[Activity Scheduled<br/>in Queue]
Started[Activity Started<br/>by Worker]
Running[Activity Executing]
Complete[Activity Complete]
Scheduled -->|Schedule-To-Start| Started
Started -->|Start-To-Close| Complete
Running -->|Heartbeat| Running
Scheduled -->|Schedule-To-Close| Complete
end
style Scheduled fill:#e1f5ff
style Started fill:#fff4e1
style Running fill:#ffe1e1
style Complete fill:#e1ffe1
1. Start-To-Close Timeout (der wichtigste):
await workflow.execute_activity(
process_data,
data,
start_to_close_timeout=timedelta(minutes=5), # Max. 5 Min pro Versuch
)
2. Schedule-To-Close Timeout (inkl. Retries):
await workflow.execute_activity(
process_data,
data,
schedule_to_close_timeout=timedelta(minutes=30), # Max. 30 Min total
)
3. Schedule-To-Start Timeout (selten benötigt):
await workflow.execute_activity(
process_data,
data,
schedule_to_start_timeout=timedelta(minutes=10), # Max. 10 Min in Queue
)
4. Heartbeat Timeout (für langlebige Activities):
await workflow.execute_activity(
long_running_task,
params,
heartbeat_timeout=timedelta(seconds=30), # Heartbeat alle 30s
)
Mindestens ein Timeout erforderlich: Start-To-Close ODER Schedule-To-Close.
2.3.3 Retry-Policies und Error Handling
Default Retry Policy (wenn nicht anders konfiguriert):
RetryPolicy(
initial_interval=timedelta(seconds=1),
backoff_coefficient=2.0,
maximum_interval=timedelta(seconds=100),
maximum_attempts=0, # 0 = unbegrenzt
)
Retry-Berechnung (Wartezeit vor dem n-ten Retry):
retry_wait(n) = min(
    initial_interval × backoff_coefficient^(n-1),
    maximum_interval
)
Beispiel: Bei initial_interval=1s und backoff_coefficient=2:
- Retry 1: nach 1 Sekunde
- Retry 2: nach 2 Sekunden
- Retry 3: nach 4 Sekunden
- Retry 4: nach 8 Sekunden
- …
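Die Reihe lässt sich mit wenigen Zeilen Python nachrechnen – eine rein illustrative Hilfsfunktion, kein Bestandteil des SDK:
from datetime import timedelta


def retry_wait(attempt: int,
               initial: timedelta = timedelta(seconds=1),
               coefficient: float = 2.0,
               maximum: timedelta = timedelta(seconds=100)) -> timedelta:
    """Wartezeit vor dem n-ten Retry (attempt >= 1) nach obiger Formel."""
    wait = initial * (coefficient ** (attempt - 1))
    return min(wait, maximum)


for n in range(1, 9):
    print(f"Retry {n}: nach {retry_wait(n).total_seconds():.0f} Sekunden")
# Retry 8 wartet nicht 128, sondern 100 Sekunden – gedeckelt durch maximum_interval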
Custom Retry Policy:
from temporalio.common import RetryPolicy
@workflow.defn
class RobustWorkflow:
@workflow.run
async def run(self):
result = await workflow.execute_activity(
flaky_external_api,
params,
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=1),
maximum_interval=timedelta(minutes=1),
backoff_coefficient=2,
maximum_attempts=5,
# Diese Fehler NICHT wiederholen
non_retryable_error_types=["InvalidInputError", "AuthError"],
),
)
return result
Non-Retryable Errors:
from temporalio.exceptions import ApplicationError
@activity.defn
async def validate_input(data: str) -> str:
if not data:
# Dieser Fehler wird NICHT wiederholt
raise ApplicationError(
"Input cannot be empty",
non_retryable=True
)
return data
2.3.4 Heartbeats für langlebige Activities
Für Activities, die lange laufen (mehrere Minuten oder länger), bieten Heartbeats zwei Vorteile:
- Schnellere Failure Detection: Service erkennt Worker-Crashes sofort
- Progress Tracking: Bei Restart kann Activity von letztem Checkpoint fortsetzen
from temporalio import activity
@activity.defn
async def process_large_file(file_path: str, total_items: int) -> str:
"""
Verarbeitet große Datei mit Progress-Tracking.
"""
    start_index = 0
    # Recover von vorherigem Progress: Heartbeat-Details des letzten Versuchs
    heartbeat_details = activity.info().heartbeat_details
    if heartbeat_details:
        start_index = heartbeat_details[0]
        activity.logger.info(f"Resuming from index {start_index}")
for i in range(start_index, total_items):
# Verarbeite Item
await process_item(i)
# Heartbeat mit Progress
activity.heartbeat(i)
return f"Processed {total_items} items"
Workflow-Seite:
result = await workflow.execute_activity(
process_large_file,
args=["big_file.csv", 10000],
start_to_close_timeout=timedelta(hours=2),
heartbeat_timeout=timedelta(seconds=30), # Erwarte Heartbeat alle 30s
)
Wann Heartbeats nutzen?
- ✅ Große Datei-Downloads oder -Verarbeitung
- ✅ ML-Model-Training
- ✅ Batch-Processing mit vielen Items
- ❌ Schnelle API-Aufrufe (< 1 Minute)
2.3.5 Idempotenz – Die wichtigste Best Practice
Activities sollten IMMER idempotent sein: Mehrfache Ausführung = gleiches Ergebnis.
Warum?
- Temporal garantiert At-Least-Once Execution für Activities
- Bei Netzwerkfehlern kann unklar sein, ob Activity erfolgreich war
- Temporal wiederholt die Activity im Zweifel
Beispiel: Geldüberweisung (nicht idempotent):
# ❌ GEFÄHRLICH: Nicht idempotent
@activity.defn
async def transfer_money(from_account: str, to_account: str, amount: float):
# Was passiert bei Retry?
# → Geld wird mehrfach überwiesen!
await bank_api.transfer(from_account, to_account, amount)
Lösung: Idempotency Keys
# ✅ SICHER: Idempotent
@activity.defn
async def transfer_money(
from_account: str,
to_account: str,
amount: float,
idempotency_key: str
):
# Prüfe ob bereits ausgeführt
if await bank_api.is_processed(idempotency_key):
return await bank_api.get_result(idempotency_key)
# Führe Überweisung aus
result = await bank_api.transfer(
from_account,
to_account,
amount,
idempotency_key=idempotency_key
)
return result
Idempotency Key Generierung im Workflow:
@workflow.defn
class PaymentWorkflow:
@workflow.run
async def run(self, order_id: str, amount: float):
# Generiere deterministischen Idempotency Key
idempotency_key = f"payment-{order_id}-{workflow.info().run_id}"
await workflow.execute_activity(
transfer_money,
args=[
"account-A",
"account-B",
amount,
idempotency_key
],
start_to_close_timeout=timedelta(minutes=5),
)
2.3.6 Local Activities – Der Spezialfall
Local Activities werden im gleichen Prozess wie der Workflow ausgeführt, ohne separate Task Queue:
result = await workflow.execute_local_activity(
quick_calculation,
params,
start_to_close_timeout=timedelta(seconds=5),
)
Wann nutzen?
- ✅ Sehr kurze Activities (< 1 Sekunde)
- ✅ Hoher Throughput erforderlich (1000+ Activities/Sekunde)
- ✅ Einfache Berechnungen ohne externe Dependencies
Limitierungen:
- ❌ Keine Heartbeats
- ❌ Bei Retry wird gesamte Activity wiederholt (kein Checkpoint)
- ❌ Höheres Risiko bei nicht-idempotenten Operationen
Empfehlung: Nutze reguläre Activities als Default. Local Activities nur für sehr spezifische Performance-Optimierungen.
2.4 Workers: Die Laufzeitumgebung
2.4.1 Worker-Architektur
Workers sind eigenständige Prozesse, die außerhalb des Temporal Service laufen und:
- Task Queues pollen (long-polling RPC)
- Workflow- und Activity-Code ausführen
- Ergebnisse zurück an Temporal Service senden
graph TB
subgraph "Worker Process"
WW[Workflow Worker<br/>Führt Workflow-Code aus]
AW[Activity Worker<br/>Führt Activity-Code aus]
Poller1[Workflow Task Poller]
Poller2[Activity Task Poller]
end
subgraph "Temporal Service"
WQ[Workflow Task Queue]
AQ[Activity Task Queue]
end
Poller1 -.->|Long Poll| WQ
WQ -.->|Task| Poller1
Poller1 --> WW
Poller2 -.->|Long Poll| AQ
AQ -.->|Task| Poller2
Poller2 --> AW
style WW fill:#e1f5ff
style AW fill:#ffe1e1
Worker Setup – Beispiel aus dem Code:
import asyncio

from temporalio.worker import Worker
from shared.temporal_helpers import create_temporal_client

# DataProcessingWorkflow (workflow.py) sowie process_data und
# send_notification (activities.py) werden hier als importiert vorausgesetzt

async def main():
    # 1. Verbinde zu Temporal
    client = await create_temporal_client()

    # 2. Erstelle Worker
    worker = Worker(
        client,
        task_queue="book-examples",
        workflows=[DataProcessingWorkflow],             # Registriere Workflows
        activities=[process_data, send_notification],   # Registriere Activities
    )

    # 3. Starte Worker (blockiert bis Ctrl+C)
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
📁 Code-Beispiel:
../examples/part-01/chapter-02/worker.py
2.4.2 Task Queues und Polling
Task Queue Eigenschaften:
- Lightweight: Dynamisch erstellt, keine explizite Registration
- On-Demand: Wird beim ersten Workflow/Activity-Start erstellt
- Persistent: Tasks bleiben erhalten bei Worker-Ausfällen
- Load Balancing: Automatische Verteilung über alle Worker
Long-Polling Mechanismus:
sequenceDiagram
participant Worker
participant Service as Temporal Service
participant Queue as Task Queue
loop Kontinuierliches Polling
Worker->>Service: Poll für Tasks (RPC)
alt Task verfügbar
Queue->>Service: Task
Service-->>Worker: Task
Worker->>Worker: Execute Task
Worker->>Service: Complete Task
else Keine Tasks
Note over Service: Verbindung bleibt offen
Note over Service: Wartet bis Task oder Timeout
Service-->>Worker: Keine Tasks (nach Timeout)
end
end
Pull-basiert, nicht Push:
- Worker holen Tasks nur, wenn Kapazität vorhanden
- Verhindert Überlastung
- Automatisches Backpressure-Handling
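Entscheidend ist dabei, dass Client und Worker exakt denselben Task-Queue-Namen verwenden – sonst bleiben Tasks unbearbeitet in der Queue liegen. Eine Skizze der Client-Seite zum Worker-Beispiel aus Abschnitt 2.4.1 (Workflow-ID und Server-Adresse sind Annahmen):
from temporalio.client import Client

async def start_processing() -> None:
    client = await Client.connect("localhost:7233")

    # Muss dem task_queue-Parameter des Workers entsprechen ("book-examples")
    result = await client.execute_workflow(
        DataProcessingWorkflow.run,
        "Sample Data",
        id="data-processing-1",
        task_queue="book-examples",
    )
    print(f"Workflow Result: {result}")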
2.4.3 Task Queue Routing und Partitioning
Routing-Strategien:
# 1. Standard: Ein Task Queue für alles
worker = Worker(
client,
task_queue="default",
workflows=[WorkflowA, WorkflowB],
activities=[activity1, activity2, activity3],
)
# 2. Separierung nach Funktion
workflow_worker = Worker(
client,
task_queue="workflows",
workflows=[WorkflowA, WorkflowB],
)
activity_worker = Worker(
client,
task_queue="activities",
activities=[activity1, activity2, activity3],
)
# 3. Isolation kritischer Activities (Bulkheading)
critical_worker = Worker(
client,
task_queue="critical-activities",
activities=[payment_activity],
)
background_worker = Worker(
client,
task_queue="background-activities",
activities=[send_email, generate_report],
)
Warum Isolation?
- Verhindert, dass langsame Activities kritische blockieren
- Bessere Ressourcen-Allokation
- Dedizierte Skalierung möglich
Task Queue Partitioning:
# Default: 4 Partitionen
# → Höherer Throughput, keine FIFO-Garantie
# Single Partition für FIFO-Garantie
# (via Temporal Server Config)
2.4.4 Sticky Execution – Performance-Optimierung
Problem: Ohne Optimierung muss der Worker für jeden Workflow Task die komplette Event History laden und den Workflow per Replay neu abspielen.
Lösung: Sticky Execution
sequenceDiagram
participant W1 as Worker 1
participant W2 as Worker 2
participant Service
participant NQ as Normal Queue
participant SQ as Sticky Queue (Worker 1)
W1->>Service: Poll Normal Queue
Service-->>W1: Workflow Task (WF-123)
W1->>W1: Execute + Cache State
W1->>Service: Complete
Service->>SQ: Nächster Task für WF-123 → Sticky Queue
W1->>Service: Poll Sticky Queue
Service-->>W1: Workflow Task (WF-123)
Note over W1: State im Cache!<br/>Kein History Reload
W1->>W1: Execute (sehr schnell)
W1->>Service: Complete
Note over Service: Timeout (5s default)
Service->>NQ: Task zurück zu Normal Queue
W2->>Service: Poll Normal Queue
Service-->>W2: Workflow Task (WF-123)
Note over W2: Kein Cache<br/>History Reload + Replay
Vorteile:
- 10-100x schnellere Task-Verarbeitung
- Reduzierte Last auf History Service
- Geringere Latenz
Automatisch aktiviert – keine Konfiguration erforderlich!
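Anpassen lässt sich lediglich die Größe des Workflow-Caches pro Worker, über den Parameter max_cached_workflows des Python SDK. Eine Skizze; der konkrete Wert ist eine Annahme:
from temporalio.worker import Worker

worker = Worker(
    client,                                   # bestehender Temporal-Client
    task_queue="book-examples",
    workflows=[DataProcessingWorkflow],
    activities=[process_data, send_notification],
    # Anzahl der Workflow Executions, deren State im Worker gecacht bleibt.
    # Größerer Cache = mehr Sticky-Hits, aber höherer Speicherverbrauch.
    max_cached_workflows=2000,
)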
2.4.5 Worker Scaling und Deployment
Horizontal Scaling:
Workers sind stateless – Workflow-State ist im Temporal Service, nicht im Worker.
# Gleicher Code auf allen Workers
# Kann beliebig skaliert werden
# Worker 1 (Server A)
worker1 = Worker(client, task_queue="production", ...)
await worker1.run()
# Worker 2 (Server B)
worker2 = Worker(client, task_queue="production", ...)
await worker2.run()
# Worker 3 (Server C)
worker3 = Worker(client, task_queue="production", ...)
await worker3.run()
Deployment Patterns:
- Dedizierte Worker Processes (empfohlen für Production):
# Separate Prozesse nur für Temporal
python worker.py
- Combined Worker + Application:
# Im gleichen Prozess wie Web Server
# Nur für Development/kleine Apps
async def start_services():
# Starte Web Server
web_server = await start_web_server()
# Starte Worker (im Hintergrund)
worker = Worker(...)
asyncio.create_task(worker.run())
- Worker Fleets (High Availability):
Kubernetes Deployment:
- 10+ Worker Pods
- Auto-Scaling basierend auf Task Queue Länge
- Rolling Updates ohne Downtime
Skalierungs-Strategien:
| Szenario | Lösung |
|---|---|
| Höherer Workflow-Throughput | Mehr Worker Processes |
| Langlebige Activities | Mehr Activity Task Slots pro Worker |
| CPU-intensive Activities | Weniger Slots, mehr CPU pro Worker |
| I/O-bound Activities | Mehr Slots, weniger CPU pro Worker |
| Kritische Activities isolieren | Separate Task Queue + dedizierte Worker |
2.4.6 Worker Tuning und Konfiguration
Task Slots – Concurrency Control:
worker = Worker(
client,
task_queue="production",
workflows=[...],
activities=[...],
max_concurrent_workflow_tasks=100, # Max. parallele Workflow Tasks
max_concurrent_activities=50, # Max. parallele Activities
max_concurrent_local_activities=100, # Max. parallele Local Activities
)
Resource-Based Auto-Tuning (empfohlen):
# Hinweis: Exakte Parameternamen können je nach SDK-Version leicht abweichen
from temporalio.worker import Worker, WorkerTuner, ResourceBasedSlotConfig

tuner = WorkerTuner.create_resource_based(
    target_cpu_usage=0.8,        # Ziel: 80% CPU-Auslastung
    target_memory_usage=0.8,     # Ziel: 80% Memory-Auslastung
    # Workflow Task Slots
    workflow_config=ResourceBasedSlotConfig(
        minimum_slots=5,
        maximum_slots=100,
    ),
    # Activity Task Slots
    activity_config=ResourceBasedSlotConfig(
        minimum_slots=1,
        maximum_slots=50,
    ),
)

worker = Worker(
    client,
    task_queue="production",
    workflows=[...],
    activities=[...],
    tuner=tuner,
)
Vorteile:
- Verhindert Out-of-Memory Errors
- Optimiert Durchsatz automatisch
- Passt sich an Workload an
2.5 Das Zusammenspiel: Ein komplettes Beispiel
Betrachten wir ein vollständiges Beispiel: Datenverarbeitung mit Benachrichtigung.
2.5.1 Der komplette Flow
sequenceDiagram
participant Client
participant Service as Temporal Service
participant WQ as Workflow Task Queue
participant AQ as Activity Task Queue
participant Worker
Client->>Service: Start Workflow "DataProcessing"
Service->>Service: Create Event History
Service->>Service: Write WorkflowExecutionStarted
Service->>WQ: Create Workflow Task
Worker->>WQ: Poll
WQ-->>Worker: Workflow Task
Worker->>Worker: Execute Workflow Code
Note over Worker: Code: execute_activity(process_data)
Worker->>Service: Commands [ScheduleActivity(process_data)]
Service->>Service: Write ActivityTaskScheduled
Service->>AQ: Create Activity Task
Worker->>AQ: Poll
AQ-->>Worker: Activity Task (process_data)
Worker->>Worker: Execute Activity Function
Note over Worker: Actual data processing
Worker->>Service: Activity Result
Service->>Service: Write ActivityTaskCompleted
Service->>WQ: Create new Workflow Task
Worker->>WQ: Poll
WQ-->>Worker: Workflow Task
Worker->>Worker: Replay + Continue
Note over Worker: Code: execute_activity(send_notification)
Worker->>Service: Commands [ScheduleActivity(send_notification)]
Service->>AQ: Create Activity Task
Worker->>AQ: Poll
AQ-->>Worker: Activity Task (send_notification)
Worker->>Worker: Execute send_notification
Worker->>Service: Activity Result
Service->>Service: Write ActivityTaskCompleted
Service->>WQ: Create Workflow Task
Worker->>WQ: Poll
WQ-->>Worker: Workflow Task
Worker->>Worker: Replay + Complete
Worker->>Service: Commands [CompleteWorkflow]
Service->>Service: Write WorkflowExecutionCompleted
Client->>Service: Get Result
Service-->>Client: {"status": "completed", ...}
2.5.2 Event History Timeline
Die Event History für diesen Flow:
1. WorkflowExecutionStarted
- WorkflowType: DataProcessingWorkflow
- Input: "Sample Data"
2. WorkflowTaskScheduled
3. WorkflowTaskStarted
4. WorkflowTaskCompleted
- Commands: [ScheduleActivityTask(process_data)]
5. ActivityTaskScheduled
- ActivityType: process_data
6. ActivityTaskStarted
7. ActivityTaskCompleted
- Result: "SAMPLE DATA"
8. WorkflowTaskScheduled
9. WorkflowTaskStarted
10. WorkflowTaskCompleted
- Commands: [ScheduleActivityTask(send_notification)]
11. ActivityTaskScheduled
- ActivityType: send_notification
12. ActivityTaskStarted
13. ActivityTaskCompleted
14. WorkflowTaskScheduled
15. WorkflowTaskStarted
16. WorkflowTaskCompleted
- Commands: [CompleteWorkflowExecution]
17. WorkflowExecutionCompleted
- Result: {"input": "Sample Data", "output": "SAMPLE DATA", "status": "completed"}
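Dieselbe Event History lässt sich nicht nur in der Web UI, sondern auch programmatisch abrufen – eine Skizze mit dem Python SDK (Workflow-ID wie im Client-Beispiel aus Kapitel 2 angenommen):
from temporalio.client import Client

async def print_history() -> None:
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle("data-processing-1")

    # Liefert die komplette Event History der Workflow Execution
    history = await handle.fetch_history()
    for event in history.events:
        print(event.event_id, event.event_type)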
2.5.3 Code-Beispiel ausführen
Voraussetzungen:
# 1. Temporal Server starten (Docker)
docker compose up -d
# 2. Dependencies installieren
cd ../examples/part-01/chapter-02
uv sync
Terminal 1 – Worker starten:
uv run python worker.py
Ausgabe:
INFO - Starting Temporal Worker...
INFO - Worker registered workflows and activities:
INFO - - Workflows: ['DataProcessingWorkflow']
INFO - - Activities: ['process_data', 'send_notification']
INFO - Worker is running and polling for tasks...
INFO - Press Ctrl+C to stop
Terminal 2 – Workflow starten:
uv run python workflow.py
Ausgabe:
INFO - Processing data: Sample Data
INFO - Data processed successfully: SAMPLE DATA
INFO - Sending notification: Processed: SAMPLE DATA
📧 Notification: Processed: SAMPLE DATA
INFO - Notification sent successfully
✅ Workflow Result: {'input': 'Sample Data', 'output': 'SAMPLE DATA', 'status': 'completed'}
2.6 Best Practices
2.6.1 Workflow Best Practices
1. Orchestrieren, nicht Implementieren

# ❌ Schlecht: Business Logic im Workflow
@workflow.defn
class BadWorkflow:
    @workflow.run
    async def run(self, data: list):
        result = []
        for item in data:
            # Komplexe Business Logic
            processed = item.strip().upper().replace("_", "-")
            result.append(processed)
        return result

# ✅ Gut: Logic in Activity
@workflow.defn
class GoodWorkflow:
    @workflow.run
    async def run(self, data: list):
        return await workflow.execute_activity(
            process_items,
            data,
            start_to_close_timeout=timedelta(minutes=5)
        )

2. Kurze Workflow-Funktionen
- Lange Workflows in kleinere Child Workflows aufteilen
- Verbessert Wartbarkeit und Testbarkeit

3. Continue-As-New bei langen Laufzeiten
- Spätestens bei 10.000 Events
- Oder: Regelmäßige Checkpoints (täglich/wöchentlich)

4. Determinismus-Tests schreiben

from temporalio.testing import WorkflowEnvironment

async def test_workflow_determinism():
    async with await WorkflowEnvironment.start_time_skipping() as env:
        # Teste Workflow mit verschiedenen Szenarien
        ...
2.6.2 Activity Best Practices
1. IMMER idempotent
- Nutze Idempotency Keys
- Prüfe ob Operation bereits durchgeführt wurde

2. Passende Granularität
- Nicht zu fein: Bloated History
- Nicht zu grob: Schwierige Idempotenz, ineffiziente Retries

3. Timeouts immer setzen
- Mindestens Start-To-Close
- Heartbeats für langlebige Activities

4. Error Handling

@activity.defn
async def robust_activity(params):
    try:
        return await external_api.call(params)
    except TemporaryError as e:
        # Retry durch Temporal
        raise
    except PermanentError as e:
        # Nicht wiederholen
        raise ApplicationError(str(e), non_retryable=True)
2.6.3 Worker Best Practices
1. Dedizierte Worker Processes in Production
- Nicht im gleichen Prozess wie Web Server

2. Task Queue Isolation für kritische Activities

# Zahlungen isoliert
payment_worker = Worker(
    client,
    task_queue="payments",
    activities=[payment_activity],
)

# Background Jobs separat
background_worker = Worker(
    client,
    task_queue="background",
    activities=[email_activity, report_activity],
)

3. Resource-Based Tuning nutzen
- Verhindert Out-of-Memory
- Optimiert Throughput automatisch

4. Monitoring und Metriken

# Wichtige Metriken überwachen:
# - worker_task_slots_available (sollte >0 sein)
# - temporal_sticky_cache_hit_total
# - temporal_activity_execution_failed_total
2.7 Zusammenfassung
In diesem Kapitel haben wir die drei Kernbausteine von Temporal kennengelernt:
Workflows orchestrieren den gesamten Prozess:
- Deterministisch und replay-fähig
- Langlebig (Tage bis Jahre)
- Geschrieben in normalen Programmiersprachen
- Dürfen KEINE I/O-Operationen durchführen
Activities führen die eigentliche Arbeit aus:
- Nicht-deterministisch
- Dürfen I/O, externe APIs, Side Effects
- Automatische Retries mit konfigurierbaren Policies
- Sollten IMMER idempotent sein
Workers hosten Workflow- und Activity-Code:
- Pollen Task Queues via long-polling
- Stateless und horizontal skalierbar
- Führen Workflow-Replay und Activity-Execution aus
- Sticky Execution für Performance
Das große Bild:
graph TB
Client[Client Code]
Service[Temporal Service]
Worker[Worker Process]
Client -->|Start Workflow| Service
Service -->|Tasks via Queue| Worker
Worker -->|Workflow Code| WF[Workflows<br/>Orchestrierung]
Worker -->|Activity Code| AC[Activities<br/>Ausführung]
WF -->|Schedule| AC
Worker -->|Results| Service
Service -->|History| DB[(Event<br/>History)]
style WF fill:#e1f5ff
style AC fill:#ffe1e1
style Worker fill:#e1ffe1
Mit diesem Verständnis der Kernbausteine können wir im nächsten Kapitel tiefer in die Architektur des Temporal Service eintauchen und verstehen, wie Frontend, History Service, Matching Service und Persistence Layer zusammenarbeiten.
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 3: Architektur des Temporal Service
Code-Beispiele für dieses Kapitel: examples/part-01/chapter-02/
Kapitel 3: Architektur des Temporal Service
Nachdem wir in den vorherigen Kapiteln die Grundkonzepte und Kernbausteine von Temporal kennengelernt haben, tauchen wir nun tief in die Architektur des Temporal Service ein. Der Temporal Service ist das Herzstück des gesamten Systems – er koordiniert Workflows, speichert den State, verwaltet Task Queues und garantiert die Ausführung. Ein fundiertes Verständnis dieser Architektur ist entscheidend für den Betrieb und die Skalierung von Temporal in Production.
3.1 Architektur-Übersicht
3.1.1 Die vier Kernkomponenten
Der Temporal Service besteht aus vier unabhängig skalierbaren Diensten:
graph TB
subgraph "Temporal Service"
FE[Frontend Service<br/>API Gateway]
HS[History Service<br/>State Management]
MS[Matching Service<br/>Task Queues]
WS[Worker Service<br/>Internal Operations]
end
subgraph "External Components"
Client[Clients]
Workers[Worker Processes]
DB[(Persistence<br/>Database)]
ES[(Visibility<br/>Elasticsearch)]
end
Client -->|gRPC| FE
Workers -->|Long Poll| FE
FE --> HS
FE --> MS
HS -->|Read/Write| DB
HS --> MS
MS -->|Tasks| DB
WS --> HS
HS -->|Events| ES
style FE fill:#e1f5ff
style HS fill:#ffe1e1
style MS fill:#fff4e1
style WS fill:#e1ffe1
Frontend Service:
- Stateless API Gateway
- Entry Point für alle Client- und Worker-Requests
- Request-Validierung und Rate Limiting
- Routing zu History und Matching Service
History Service:
- Verwaltet Workflow Execution State
- Speichert Event History (Event Sourcing)
- Koordiniert Workflow-Lifecycle
- Sharded: Feste Anzahl von Shards, die Workflow-Executions zugeordnet werden
Matching Service:
- Verwaltet Task Queues
- Dispatcht Tasks an Worker
- Long-Polling Mechanismus
- Partitioned: Task Queues in Partitionen für Skalierung
Worker Service (interner Dienst):
- Führt interne System-Workflows aus
- Replication Queue Processing
- Archival Operations
- Nicht die Worker-Prozesse der Anwender!
3.1.2 Architekturprinzipien
Event Sourcing als Fundament: Temporal speichert eine append-only Event History für jede Workflow Execution. Der komplette Workflow-State kann durch Replay dieser History rekonstruiert werden.
Separation of Concerns:
- Frontend: API und Routing
- History: State Management und Koordination
- Matching: Task Dispatching
- Persistence: Daten-Speicherung
Unabhängige Skalierung: Jeder Dienst kann unabhängig horizontal skaliert werden, um verschiedenen Workload-Charakteristiken gerecht zu werden.
3.2 Frontend Service: Das API Gateway
3.2.1 Rolle und Verantwortlichkeiten
Der Frontend Service ist der einzige öffentliche Entry Point zum Temporal Service:
graph LR
subgraph "External"
C1[Client 1]
C2[Client 2]
W1[Worker 1]
W2[Worker 2]
end
subgraph "Frontend Service"
FE1[Frontend Instance 1]
FE2[Frontend Instance 2]
FE3[Frontend Instance 3]
end
LB[Load Balancer]
C1 --> LB
C2 --> LB
W1 --> LB
W2 --> LB
LB --> FE1
LB --> FE2
LB --> FE3
FE1 -.->|Route| History[History Service]
FE2 -.->|Route| Matching[Matching Service]
FE3 -.->|Route| History
style LB fill:#cccccc
style FE1 fill:#e1f5ff
style FE2 fill:#e1f5ff
style FE3 fill:#e1f5ff
API Exposure:
- gRPC API (Port 7233): Primäres Protokoll für Clients und Workers
- HTTP API (Port 8233): HTTP-Proxy für Web UI und HTTP-Clients
- Protocol Buffers: Serialisierung mit protobuf
Request Handling:
- Empfängt API-Requests (StartWorkflow, SignalWorkflow, PollWorkflowTask, etc.)
- Validiert Requests auf Korrektheit
- Führt Rate Limiting durch
- Routet zu History oder Matching Service
3.2.2 Rate Limiting
Frontend implementiert Multi-Level Rate Limiting:
# Namespace-Level RPS Limit
# Pro Namespace maximal N Requests/Sekunde
frontend.namespaceRPS = 1200
# Persistence-Level QPS Limit
# Schützt Datenbank vor Überlastung
frontend.persistenceMaxQPS = 10000
# Task Queue-Level Limits
# Pro Task Queue maximal M Dispatches/Sekunde
Warum Rate Limiting?
- Schutz vor übermäßiger Last
- Fairness zwischen Namespaces (Multi-Tenancy)
- Vermeidung von Database-Überlastung
- Backpressure für Clients
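Gesetzt werden solche Limits über die Dynamic Config des Temporal Servers. Eine Skizze einer dynamicconfig-YAML-Datei – Format und Schlüsselnamen nach bestem Wissen, die Werte und der Namespace-Name sind Annahmen:
frontend.namespaceRPS:
  - value: 1200
    constraints:
      namespace: "orders-production"

frontend.persistenceMaxQPS:
  - value: 10000
    constraints: {}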
3.2.3 Namespace Routing
Multi-Tenancy durch Namespaces:
Namespaces bieten logische Isolation:
- Workflow Executions isoliert pro Namespace
- Separate Resource Limits
- Unabhängige Retention Policies
- Verschiedene Archival Configurations
Routing-Mechanismus: Frontend bestimmt aus Request-Header, welcher Namespace betroffen ist, und routet entsprechend.
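Auf Client-Seite wird der Namespace beim Verbindungsaufbau angegeben; eine minimale Skizze (der Namespace-Name ist eine Annahme):
from temporalio.client import Client

async def connect_to_namespace() -> Client:
    # Alle Workflows, die über diesen Client gestartet werden,
    # laufen isoliert im Namespace "orders-production"
    return await Client.connect("localhost:7233", namespace="orders-production")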
3.2.4 Stateless Design
Horizontale Skalierung ohne Limits:
# Einfaches Hinzufügen neuer Frontend Instances
kubectl scale deployment temporal-frontend --replicas=10
Eigenschaften:
- Keine Session-Affinität erforderlich
- Kein Shared State zwischen Instances
- Load Balancer verteilt Traffic
- Einfaches Rolling Update
3.3 History Service: Das Herzstück
3.3.1 Event Sourcing und State Management
Der History Service verwaltet den kompletten Lifecycle jeder Workflow Execution:
stateDiagram-v2
[*] --> WorkflowStarted: Client starts workflow
WorkflowStarted --> WorkflowTaskScheduled: Create first task
WorkflowTaskScheduled --> WorkflowTaskStarted: Worker polls
WorkflowTaskStarted --> WorkflowTaskCompleted: Worker returns commands
WorkflowTaskCompleted --> ActivityTaskScheduled: Schedule activity
ActivityTaskScheduled --> ActivityTaskStarted: Worker polls
ActivityTaskStarted --> ActivityTaskCompleted: Activity finishes
ActivityTaskCompleted --> WorkflowTaskScheduled: New workflow task
WorkflowTaskScheduled --> WorkflowTaskStarted
WorkflowTaskStarted --> WorkflowExecutionCompleted: Workflow completes
WorkflowExecutionCompleted --> [*]
Zwei Formen von State:
1. Mutable State (veränderlich):
- Aktueller Snapshot der Workflow Execution
- Tracked: Laufende Activities, Timer, Child Workflows, pending Signals
- In-Memory Cache für kürzlich verwendete Executions
- In Database persistiert (typischerweise eine Zeile)
- Wird bei jeder State Transition aktualisiert
2. Immutable Event History (unveränderlich):
- Append-Only Log aller Workflow Events
- Source of Truth: Workflow-State kann komplett rekonstruiert werden
- Definiert in Protocol Buffer Specifications
- Limits: 51.200 Events oder 50 MB (Warnung bei 10.240 Events/10 MB)
3.3.2 Sharding-Architektur
Fixed Shard Count:
Der History Service nutzt Sharding für Parallelität:
graph TB
subgraph "Workflow Executions"
WF1[Workflow 1<br/>ID: order-123]
WF2[Workflow 2<br/>ID: payment-456]
WF3[Workflow 3<br/>ID: order-789]
WF4[Workflow 4<br/>ID: shipment-111]
end
subgraph "History Shards (Fixed: 512)"
S1[Shard 1]
S2[Shard 2]
S3[Shard 3]
S4[Shard 512]
end
WF1 -->|Hash| S1
WF2 -->|Hash| S2
WF3 -->|Hash| S1
WF4 -->|Hash| S3
style S1 fill:#ffe1e1
style S2 fill:#ffe1e1
style S3 fill:#ffe1e1
style S4 fill:#ffe1e1
Shard Assignment:
shard_id = hash(workflow_id + namespace) % shard_count
Eigenschaften:
- Shard Count wird bei Cluster-Erstellung festgelegt
- Nicht änderbar nach Cluster-Start
- Empfohlen: 128-512 Shards für kleine Cluster, selten >4096
- Jeder Shard ist eine Unit of Parallelism
- Alle Updates innerhalb eines Shards sind sequenziell
Performance-Implikationen:
Max Throughput pro Shard = 1 / (Database Operation Latency)
Beispiel:
- DB Latency: 10ms
- Max Throughput: 1 / 0.01s = 100 Updates/Sekunde pro Shard
- 512 Shards → ~51.200 Updates/Sekunde gesamt
3.3.3 Interne Task Queues
Jeder History Shard verwaltet interne Queues für verschiedene Task-Typen:
graph TB
subgraph "History Shard"
TQ[Transfer Queue<br/>Sofort ausführbar]
TimerQ[Timer Queue<br/>Zeitbasiert]
VisQ[Visibility Queue<br/>Search Updates]
RepQ[Replication Queue<br/>Multi-Cluster]
ArchQ[Archival Queue<br/>Long-term Storage]
end
TQ -->|Triggers| Matching[Matching Service]
TimerQ -->|Fires at time| TQ
VisQ -->|Updates| ES[(Elasticsearch)]
RepQ -->|Replicates| Remote[Remote Cluster]
ArchQ -->|Archives| S3[(S3/GCS)]
style TQ fill:#e1f5ff
style TimerQ fill:#fff4e1
style VisQ fill:#ffe1e1
style RepQ fill:#e1ffe1
style ArchQ fill:#ffffcc
1. Transfer Queue:
- Sofort ausführbare Tasks
- Enqueued Workflow/Activity Tasks zu Matching
- Erzeugt Timer
2. Timer Queue:
- Zeitbasierte Events
- Workflow Timeouts, Retries, Delays
- Cron Triggers
- Fires zur definierten Zeit, erzeugt oft Transfer Tasks
3. Visibility Queue:
- Updates für Visibility Store (Elasticsearch)
- Ermöglicht Workflow-Suche und -Filterung
- Powert Web UI Queries
4. Replication Queue (Multi-Cluster):
- Repliziert Events zu Remote Clusters
- Async Replication für High Availability
5. Archival Queue:
- Triggert Archivierung nach Retention Period
- Langzeitspeicherung (S3, GCS, etc.)
3.3.4 Workflow State Transition
Transaktionaler Ablauf:
sequenceDiagram
participant Input as Input<br/>(RPC, Timer, Signal)
participant HS as History Service
participant Mem as In-Memory State
participant DB as Database
Input->>HS: State Transition Trigger
HS->>Mem: Load Mutable State (from cache/DB)
HS->>Mem: Create new Events
HS->>Mem: Update Mutable State
HS->>Mem: Generate Internal Tasks
HS->>DB: BEGIN TRANSACTION
HS->>DB: Write Events to History Table
HS->>DB: Update Mutable State Row
HS->>DB: Write Transfer/Timer Tasks
HS->>DB: COMMIT TRANSACTION
DB-->>HS: Transaction Success
HS->>HS: Cache Updated State
Consistency durch Transactions:
- Mutable State und Event History werden atomar committed
- Verhindert Inkonsistenzen bei Crashes
- Database Transactions garantieren ACID-Eigenschaften
Transactional Outbox Pattern:
- Transfer Tasks werden mit State in DB persistiert
- Task Processing erfolgt asynchron
- Verhindert Divergenz zwischen State und Task Queues
3.3.5 Cache-Mechanismen
Mutable State Cache:
# Pro-Shard Cache
# Cached kürzlich verwendete Workflow Executions
# Vermeidet teure History Replays
cache_size_per_shard = 1000 # Beispiel
Vorteile:
- Schneller Zugriff auf aktiven Workflow State
- Reduziert Database Reads
- Kritisch für Performance bei hoher Last
Cache Miss: Bei einem Cache Miss muss der History Service:
- Event History aus der DB laden
- Komplette History replayen
- State rekonstruieren
- In Cache einfügen
Geplante Verbesserung: Host-Level Cache, der von allen Shards geteilt wird.
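Konzeptionell entspricht der Mutable State Cache einem LRU-Cache pro Shard, der bei einem Miss den State aus der Event History rekonstruiert. Die folgende Skizze ist eine starke Vereinfachung und zeigt keine echten Temporal-Interna:
from collections import OrderedDict
from typing import Callable, Sequence

class MutableStateCache:
    """Stark vereinfachter Pro-Shard-Cache (nur zur Veranschaulichung)."""

    def __init__(self, max_size: int = 1000) -> None:
        self.max_size = max_size
        self._entries: OrderedDict[str, dict] = OrderedDict()

    def get(self, run_id: str, load_history: Callable[[str], Sequence]) -> dict:
        if run_id in self._entries:
            # Cache Hit: State direkt verfügbar, kein Replay nötig
            self._entries.move_to_end(run_id)
            return self._entries[run_id]
        # Cache Miss: History laden, State rekonstruieren, in den Cache einfügen
        events = load_history(run_id)
        state = {"replayed_events": len(events)}
        self._entries[run_id] = state
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)   # ältesten Eintrag verdrängen
        return state

cache = MutableStateCache()
print(cache.get("run-1", lambda _: range(42)))   # Miss -> Replay
print(cache.get("run-1", lambda _: range(42)))   # Hit  -> aus dem Cache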
3.4 Matching Service: Task Queue Management
3.4.1 Aufgaben und Verantwortlichkeiten
Der Matching Service verwaltet alle user-facing Task Queues:
graph TB
subgraph "Task Queues"
WQ[Workflow Task Queue<br/>'production']
AQ[Activity Task Queue<br/>'production']
AQ2[Activity Task Queue<br/>'background']
end
subgraph "Matching Service"
P1[Partition 1]
P2[Partition 2]
P3[Partition 3]
P4[Partition 4]
end
subgraph "Workers"
W1[Worker 1]
W2[Worker 2]
W3[Worker 3]
end
History[History Service] -->|Enqueue| P1
History -->|Enqueue| P2
W1 -->|Long Poll| P1
W2 -->|Long Poll| P3
W3 -->|Long Poll| P4
P1 -.-> WQ
P2 -.-> AQ
P3 -.-> AQ2
P4 -.-> WQ
style P1 fill:#fff4e1
style P2 fill:#fff4e1
style P3 fill:#fff4e1
style P4 fill:#fff4e1
Core Functions:
- Task Queue Verwaltung
- Task Dispatching an Workers
- Long-Poll Protocol Implementation
- Load Balancing über Worker Processes
3.4.2 Task Queue Partitioning
Default: 4 Partitionen pro Task Queue
# Task Queue "production" mit 4 Partitionen
task_queue_partitions = {
"production": [
"production_partition_0",
"production_partition_1",
"production_partition_2",
"production_partition_3",
]
}
Partition Charakteristika:
- Tasks werden zufällig einer Partition zugeordnet
- Worker Polls werden gleichmäßig verteilt
- Partitionen sind Units of Scaling für Matching Service
- Partition Count anpassbar basierend auf Last
Hierarchische Organisation:
graph TB
Root[Root Partition]
P1[Partition 1]
P2[Partition 2]
P3[Partition 3]
P4[Partition 4]
Root --> P1
Root --> P2
Root --> P3
Root --> P4
P1 -.->|Forward if empty| Root
P2 -.->|Forward if empty| Root
P3 -.->|Forward tasks| Root
P4 -.->|Forward if no pollers| Root
Forwarding Mechanismus:
- Leere Partitionen forwarden Polls zur Parent Partition
- Partitionen ohne Poller forwarden Tasks zur Parent
- Ermöglicht effiziente Ressourcennutzung
3.4.3 Sync Match vs Async Match
Zwei Dispatch-Modi:
sequenceDiagram
participant HS as History Service
participant MS as Matching Service
participant W as Worker
participant DB as Database
Note over MS,W: Sync Match (Optimal Path)
HS->>MS: Enqueue Task
W->>MS: Poll (waiting)
MS->>W: Task (immediate)
Note over MS: No DB write!
Note over MS,DB: Async Match (Backlog Path)
HS->>MS: Enqueue Task
MS->>DB: Persist Task
Note over W: Worker polls later
W->>MS: Poll
MS->>DB: Read Task
DB-->>MS: Task
MS->>W: Task
Sync Match (Optimal):
- Task sofort an wartenden Worker geliefert
- Keine Database-Persistierung erforderlich
- Zero Backlog Szenario
- Höchste Performance
- Metrik: sync_match_rate sollte hoch sein (>90%)
Async Match (Backlog):
- Task wird in DB persistiert
- Worker holt später aus Backlog
- Tritt auf wenn keine Worker verfügbar
- Niedrigere Performance (DB Round-Trip)
- Tasks FIFO aus Backlog
Special Cases:
- Nexus/Query Tasks: Niemals persistiert, nur Sync Match
- Sticky Workflow Tasks: Bei Sync Match Fail → DB Persistence
3.4.4 Load Balancing
Worker-Pull Model:
graph LR
subgraph "Workers (Pull-Based)"
W1[Worker 1<br/>Capacity: 50]
W2[Worker 2<br/>Capacity: 30]
W3[Worker 3<br/>Capacity: 100]
end
subgraph "Matching Service"
TQ[Task Queue<br/>Tasks: 200]
end
W1 -.->|Poll when capacity| TQ
W2 -.->|Poll when capacity| TQ
W3 -.->|Poll when capacity| TQ
TQ -->|Distribute| W1
TQ -->|Distribute| W2
TQ -->|Distribute| W3
style W1 fill:#e1ffe1
style W2 fill:#e1ffe1
style W3 fill:#e1ffe1
Vorteile:
- Natürliches Load Balancing
- Workers holen nur wenn Kapazität vorhanden
- Verhindert Worker-Überlastung
- Kein Worker Discovery/DNS erforderlich
Backlog Management:
- Monitor BacklogIncreaseRate Metrik
- Balance Worker Count mit Task Volume
- Scale Workers um Sync Match Rate zu maximieren
3.4.5 Sticky Execution Optimization
Problem: Bei jedem Workflow Task muss der Worker die Event History laden und replayen.
Lösung: Sticky Task Queues
sequenceDiagram
participant HS as History Service
participant MS as Matching Service
participant NQ as Normal Queue
participant SQ as Sticky Queue (Worker 1)
participant W1 as Worker 1
participant W2 as Worker 2
HS->>MS: Enqueue Task (WF-123, first time)
MS->>NQ: Add to Normal Queue
W1->>MS: Poll Normal Queue
MS-->>W1: Task (WF-123)
W1->>W1: Execute + Cache State
W1->>HS: Complete
Note over MS: Create Sticky Queue for Worker 1
HS->>MS: Enqueue Task (WF-123, second time)
MS->>SQ: Add to Sticky Queue (Worker 1)
W1->>MS: Poll Sticky Queue
MS-->>W1: Task (WF-123)
Note over W1: State im Cache!<br/>Kein Replay!
W1->>W1: Execute (sehr schnell)
W1->>HS: Complete
Note over MS: Timeout (5s) - Worker 1 nicht verfügbar
HS->>MS: Enqueue Task (WF-123, third time)
MS->>SQ: Try Sticky Queue
MS->>NQ: Fallback to Normal Queue
W2->>MS: Poll Normal Queue
MS-->>W2: Task (WF-123)
Note over W2: Kein Cache<br/>History Reload + Replay
Vorteile:
- 10-100x schnellere Task-Verarbeitung
- Reduzierte Last auf History Service
- Geringere Latenz für Workflows
Automatisch aktiviert – keine Konfiguration erforderlich!
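Sticky Execution ist zwar automatisch aktiv, lässt sich aber über Worker-Parameter des Python SDK tunen. Eine minimale Skizze (Workflow-Name und Werte sind Beispielannahmen):
from datetime import timedelta
from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker

@workflow.defn
class StickyDemoWorkflow:
    @workflow.run
    async def run(self) -> str:
        return "done"

async def run_worker() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="production",
        workflows=[StickyDemoWorkflow],
        # Wie viele Workflow Executions im Worker-Cache gehalten werden
        # (Cache Hit = kein History Replay)
        max_cached_workflows=1000,
        # Nach diesem Timeout fällt ein Sticky Task auf die Normal Queue zurück
        sticky_queue_schedule_to_start_timeout=timedelta(seconds=10),
    )
    await worker.run()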
3.5 Worker Service: Interne Operationen
3.5.1 Unterschied zu User Workers
WICHTIG: Worker Service ≠ User Worker Processes!
graph TB
subgraph "Temporal Cluster (Managed)"
WS[Worker Service<br/>Internal System Service]
end
subgraph "User Application (Self-Hosted)"
UW1[User Worker 1]
UW2[User Worker 2]
UW3[User Worker 3]
end
WS -->|Processes| IWF[Internal System Workflows]
WS -->|Handles| Rep[Replication Queue]
WS -->|Manages| Arch[Archival Operations]
UW1 -->|Executes| AppWF[Application Workflows]
UW2 -->|Executes| AppWF
UW3 -->|Executes| AppWF
style WS fill:#e1ffe1
style UW1 fill:#e1f5ff
style UW2 fill:#e1f5ff
style UW3 fill:#e1f5ff
3.5.2 Aufgaben des Worker Service
Interne Background-Verarbeitung:
1. System Workflows:
- Workflow Deletions
- Dead-Letter Queue Handling
- Batch Operations
2. Replication Queue Processing:
- Multi-Cluster Replication
- Event-Synchronisation zu Remote Clusters
3. Archival Operations:
- Langzeit-Archivierung abgeschlossener Workflows
- Upload zu S3, GCS, etc.
4. Kafka Visibility Processor (Version < 1.5.0):
- Event Processing für Elasticsearch
Self-Hosting: Nutzt Temporal’s eigene Workflow Engine für Cluster-Level Operationen – “Temporal orchestriert Temporal”!
3.6 Persistence Layer: Datenspeicherung
3.6.1 Unterstützte Datenbanken
Primary Persistence (temporal_default):
graph TB
subgraph "Supported Databases"
Cass[Cassandra 3.x+<br/>NoSQL, Horizontal Scaling]
PG[PostgreSQL 9.6+<br/>SQL, Transactional]
MySQL[MySQL 5.7+<br/>SQL, Transactional]
end
subgraph "Temporal Services"
HS[History Service]
MS[Matching Service]
end
HS -->|Read/Write| Cass
HS -->|Read/Write| PG
HS -->|Read/Write| MySQL
MS -->|Task Backlog| Cass
MS -->|Task Backlog| PG
MS -->|Task Backlog| MySQL
style Cass fill:#e1f5ff
style PG fill:#ffe1e1
style MySQL fill:#fff4e1
Cassandra:
- Natürliche horizontale Skalierung
- Multi-Datacenter Replication
- Eventual Consistency Model
- Empfohlen für massive Scale
PostgreSQL/MySQL:
- Vertikale Skalierung
- Read Replicas für Visibility Queries
- Connection Pooling kritisch
- Ausreichend für die meisten Production Deployments
3.6.2 Datenmodell
Zwei-Schema-Ansatz:
1. temporal_default (Core Persistence):
Tables:
- executions: Mutable State of Workflow Executions
- history_node: Append-Only Event Log (partitioned)
- tasks: Transfer, Timer, Visibility, Replication Queues
- namespaces: Namespace Metadata, Retention Policies
- queue_metadata: Task Queue Checkpoints
2. temporal_visibility (Search/Query):
Tables:
- executions_visibility: Indexed Workflow Metadata
- workflow_id, workflow_type, status, start_time, close_time
- custom_search_attributes (JSON/Searchable)
Event History Storage Pattern:
# Events werden in Batches gespeichert (History Nodes)
# Jeder Node: ~100-200 Events
# Optimiert für sequentielles Lesen
history_nodes = [
{
"node_id": 1,
"events": [1..100], # WorkflowStarted bis Event 100
"prev_txn_id": 0,
"txn_id": 12345
},
{
"node_id": 2,
"events": [101..200],
"prev_txn_id": 12345,
"txn_id": 12456
},
]
3.6.3 Visibility Store
Database Visibility (Basic):
-- Einfache SQL Queries
SELECT * FROM executions_visibility
WHERE workflow_type = 'OrderProcessing'
AND status = 'Running'
AND start_time > '2025-01-01'
ORDER BY start_time DESC
LIMIT 100;
Limitierungen:
- Begrenzte Query-Capabilities
- Performance-Probleme bei großen Datasets
- Verfügbar: PostgreSQL 12+, MySQL 8.0.17+
Elasticsearch Visibility (Advanced, empfohlen):
// Komplexe Queries möglich
{
"query": {
"bool": {
"must": [
{"term": {"WorkflowType": "OrderProcessing"}},
{"term": {"ExecutionStatus": "Running"}},
{"range": {"StartTime": {"gte": "2025-01-01"}}}
],
"filter": [
{"term": {"CustomStringField": "VIP"}}
]
}
},
"sort": [{"StartTime": "desc"}],
"size": 100
}
Vorteile:
- High-Performance Indexing
- Komplexe Such-Queries
- Custom Attributes und Filter
- Entlastet Haupt-Datenbank
Design Consideration: Elasticsearch nimmt Query-Last von der Main Database – kritisch für Skalierung!
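Aus Anwendungssicht wird der Visibility Store über List Filter abgefragt, im Python SDK etwa mit client.list_workflows(). Eine kurze Skizze (Workflow Type und Filter sind Beispielwerte):
import asyncio
from temporalio.client import Client

async def list_running_orders() -> None:
    client = await Client.connect("localhost:7233")

    # Visibility Query (List Filter) - beantwortet vom Visibility Store,
    # nicht von der Core Persistence
    query = "WorkflowType = 'OrderProcessing' AND ExecutionStatus = 'Running'"

    async for wf in client.list_workflows(query):
        print(wf.id, wf.workflow_type, wf.status)

asyncio.run(list_running_orders())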
3.6.4 Konsistenz-Garantien
Strong Consistency (Writes):
-- Database Transaction gewährleistet Konsistenz
BEGIN TRANSACTION
UPDATE executions SET mutable_state = ... WHERE ...
INSERT INTO history_node VALUES (...)
INSERT INTO tasks VALUES (...)
COMMIT
- History Service nutzt DB Transactions
- Mutable State + Events atomar committed
- Einzelner Shard verarbeitet alle Updates sequenziell
- Verhindert Race Conditions
Eventual Consistency (Reads):
- Visibility Data eventual consistent
- Multi-Cluster Replication asynchron
- Replication Lag möglich bei Failover
Event Sourcing Benefits:
- Exactly-Once Execution Semantics
- Komplette Audit Trail
- State Reconstruction jederzeit möglich
- Replay für Debugging
3.7 Kommunikationsflüsse
3.7.1 Workflow Start Flow
Der komplette Flow vom Client bis zur ersten Workflow Task Execution:
sequenceDiagram
participant C as Client
participant FE as Frontend
participant HS as History
participant DB as Database
participant MS as Matching
participant W as Worker
C->>FE: StartWorkflowExecution(id, type, input)
FE->>FE: Validate & Rate Limit
FE->>FE: Hash(workflow_id) → Shard 42
FE->>HS: Forward to History Shard 42
HS->>DB: BEGIN TRANSACTION
HS->>DB: INSERT WorkflowExecutionStarted Event
HS->>DB: INSERT WorkflowTaskScheduled Event
HS->>DB: INSERT Mutable State
HS->>DB: INSERT Transfer Task (workflow task)
HS->>DB: COMMIT TRANSACTION
DB-->>HS: Success
HS-->>FE: Execution Created
FE-->>C: RunID + Success
Note over HS: Transfer Queue Processor
HS->>MS: AddWorkflowTask(task_queue, task)
MS->>MS: Try Sync Match
alt Sync Match Success
W->>MS: PollWorkflowTaskQueue (waiting)
MS-->>W: Task (immediate)
else No Pollers
MS->>DB: Persist Task to Backlog
Note over W: Worker polls later
W->>MS: PollWorkflowTaskQueue
MS->>DB: Read from Backlog
DB-->>MS: Task
MS-->>W: Task
end
W->>W: Execute Workflow Code
W->>FE: RespondWorkflowTaskCompleted(commands)
FE->>HS: Process Commands
3.7.2 Activity Execution Flow
sequenceDiagram
participant W as Worker<br/>(Workflow)
participant FE as Frontend
participant HS as History
participant MS as Matching
participant AW as Worker<br/>(Activity)
Note over W: Workflow Code schedules Activity
W->>FE: RespondWorkflowTask([ScheduleActivity])
FE->>HS: Process Commands
HS->>HS: Create ActivityTaskScheduled Event
HS->>HS: Generate Transfer Task
HS->>MS: AddActivityTask(task_queue, task)
MS->>MS: Try Sync Match
AW->>MS: PollActivityTaskQueue
MS-->>AW: Activity Task
AW->>AW: Execute Activity Function
alt Activity Success
AW->>FE: RespondActivityTaskCompleted(result)
FE->>HS: Process Result
HS->>HS: Create ActivityTaskCompleted Event
else Activity Failure
AW->>FE: RespondActivityTaskFailed(error)
FE->>HS: Process Failure
HS->>HS: Create ActivityTaskFailed Event
Note over HS: Retry Logic applies
end
HS->>HS: Create new WorkflowTask
HS->>MS: AddWorkflowTask
Note over W: Worker receives continuation task
3.7.3 Long-Polling Mechanismus
Worker Long-Poll Detail:
# Worker SDK Code (vereinfacht)
async def poll_workflow_tasks():
while True:
try:
# Long Poll mit ~60s Timeout
response = await client.poll_workflow_task_queue(
task_queue="production",
timeout=60 # Sekunden
)
if response.has_task:
# Task sofort erhalten (Sync Match!)
await execute_workflow_task(response.task)
else:
# Timeout - keine Tasks verfügbar
# Sofort erneut pollen
continue
except Exception as e:
# Fehlerbehandlung
await asyncio.sleep(1)
Server-Seite (Matching Service):
# Matching Service (konzeptuell)
async def handle_poll_request(poll_request):
# Try Sync Match
task = try_get_task_immediately(poll_request.task_queue)
if task:
# Sync Match erfolgreich!
return task
# Kein Task verfügbar - halte Verbindung offen
task = await wait_for_task_or_timeout(
poll_request.task_queue,
timeout=60
)
if task:
return task
else:
return empty_response
Vorteile:
- Minimale Latenz bei Task-Verfügbarkeit
- Reduzierte Netzwerk-Overhead (keine Poll-Loops)
- Natürliches Backpressure Handling
3.8 Skalierung und High Availability
3.8.1 Unabhängige Service-Skalierung
graph TB
subgraph "Scaling Strategy"
FE1[Frontend<br/>3 Instances]
HS1[History<br/>10 Instances]
MS1[Matching<br/>5 Instances]
WS1[Worker<br/>2 Instances]
end
subgraph "Charakteristika"
FE1 -.-> FE_C[Stateless<br/>Unbegrenzt skalierbar]
HS1 -.-> HS_C[Sharded<br/>Shards über Instances verteilt]
MS1 -.-> MS_C[Partitioned<br/>Partitionen über Instances]
WS1 -.-> WS_C[Internal Workload<br/>Separat skalierbar]
end
Frontend Service:
- Stateless → Beliebig horizontal skalieren
- Hinter Load Balancer
- Keine Koordinations-Overhead
History Service:
- Instanzen hinzufügen
- Shards dynamisch über Instances verteilt
- Ringpop koordiniert Shard Ownership
- Constraint: Total Shard Count fixed
Matching Service:
- Instanzen hinzufügen
- Task Queue Partitionen über Instances verteilt
- Consistent Hashing für Partition Placement
3.8.2 Database Scaling
Bottleneck: Database oft ultimatives Performance-Limit
Cassandra:
# Natürliche horizontale Skalierung
# Neue Nodes hinzufügen
nodetool status
# Rebalancing automatisch
PostgreSQL/MySQL:
-- Vertikale Skalierung: Größere Instances
-- Read Replicas für Visibility Queries
-- Connection Pooling kritisch
max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
3.8.3 Multi-Cluster Replication
Global Namespaces für High Availability:
graph TB
subgraph "Cluster 1 (Primary - US-West)"
NS1[Namespace: production<br/>Active]
HS1[History Service]
DB1[(Database)]
end
subgraph "Cluster 2 (Standby - US-East)"
NS2[Namespace: production<br/>Standby]
HS2[History Service]
DB2[(Database)]
end
Client[Client Application]
Client -->|Writes| NS1
Client -.->|Reads| NS1
Client -.->|Reads| NS2
NS1 -->|Async Replication| NS2
style NS1 fill:#90EE90
style NS2 fill:#FFB6C1
Charakteristika:
- Async Replication: Hoher Throughput
- Nicht strongly consistent über Clusters
- Replication Lag bei Failover → potentieller Progress Loss
- Visibility APIs funktionieren auf Active und Standby
Failover Process:
- Namespace auf Backup Cluster aktiviert
- Workflows setzen fort vom letzten replizierten State
- Einige in-flight Activity Tasks können re-executed werden
- Akzeptabel für Disaster Recovery Szenarien
3.8.4 Performance-Metriken
Key Metrics zu überwachen:
# History Service
"shard_lock_latency": < 5ms, # Idealerweise ~1ms
"cache_hit_rate": > 80%,
"transfer_task_latency": < 100ms,
# Matching Service
"sync_match_rate": > 90%, # Hoch halten!
"backlog_size": < 1000,
"poll_success_rate": > 95%,
# Database
"query_latency_p99": < 50ms,
"connection_pool_utilization": 60-80%,
"persistence_rps": < max_qps,
Sticky Execution Optimization:
sticky_cache_hit_rate: > 70%
→ Drastisch reduzierte History Replays
→ 10-100x schnellere Task-Verarbeitung
3.9 Praktisches Beispiel: Service Interaction
Schauen wir uns das Code-Beispiel für Kapitel 3 an:
@workflow.defn
class ServiceArchitectureWorkflow:
"""
Demonstriert Service-Architektur-Konzepte.
"""
@workflow.run
async def run(self) -> dict:
workflow.logger.info("Workflow started - event logged in history")
# Frontend → History: Workflow gestartet
# History → Database: WorkflowExecutionStarted Event
# History → History Cache: Mutable State gecached
steps = ["Frontend processing", "History service update", "Task scheduling"]
for i, step in enumerate(steps, 1):
workflow.logger.info(f"Step {i}: {step}")
# Jedes Log → Event in History
# History → Matching: Workflow Task scheduled
# Matching → Worker: Task dispatched (hoffentlich Sync Match!)
workflow.logger.info("Workflow completed - final event in history")
return {
"message": "Architecture demonstration complete",
"steps_completed": len(steps)
}
📁 Code-Beispiel:
../examples/part-01/chapter-03/service_interaction.py
Workflow ausführen:
# Beispiel ausführen
cd ../examples/part-01/chapter-03
uv sync
uv run python service_interaction.py
Ausgabe zeigt Service-Interaktionen:
=== Temporal Service Architecture Demonstration ===
1. Client connecting to Temporal Frontend...
✓ Connected to Temporal service
2. Starting workflow (ID: architecture-demo-001)
Frontend schedules task...
History service creates event log...
✓ Workflow started
3. Waiting for workflow completion...
Worker polls task queue...
Worker executes workflow code...
History service logs each event...
✓ Workflow completed
4. Accessing workflow history...
✓ Retrieved 17 events from history service
=== Architecture Components Demonstrated ===
✓ Client - Initiated workflow
✓ Frontend - Accepted workflow request
✓ History Service - Stored event log
✓ Task Queue - Delivered tasks to worker
✓ Worker - Executed workflow code
3.10 Zusammenfassung
In diesem Kapitel haben wir die Architektur des Temporal Service im Detail kennengelernt:
Die vier Kernkomponenten:
1. Frontend Service – Stateless API Gateway
- Entry Point für alle Requests
- Rate Limiting und Validation
- Routing zu History und Matching
2. History Service – State Management
- Verwaltet Workflow Execution Lifecycle
- Event Sourcing mit Mutable State + Immutable Events
- Sharded für Parallelität
- Interne Task Queues (Transfer, Timer, Visibility, etc.)
3. Matching Service – Task Queue Management
- Verwaltet alle user-facing Task Queues
- Partitioned für Skalierung
- Sync Match (optimal) vs Async Match (Backlog)
- Long-Polling Protocol
4. Worker Service – Interne Operationen
- Replication, Archival, System Workflows
- Unterschied zu User Worker Processes
Persistence Layer:
- Cassandra, PostgreSQL, MySQL
- Event History + Mutable State
- Visibility Store (Database oder Elasticsearch)
- Strong Consistency bei Writes
Kommunikationsflüsse:
- Client → Frontend → History → Database
- History → Matching → Worker (Long-Poll)
- Event Sourcing garantiert Consistency
Skalierung:
- Unabhängige Service-Skalierung
- Frontend: Unbegrenzt horizontal
- History: Via Shard-Distribution
- Matching: Via Partition-Distribution
- Multi-Cluster für High Availability
Performance-Optimierungen:
- Sticky Execution (10-100x schneller)
- Sync Match (kein DB Round-Trip)
- Mutable State Cache
- Partitioning für Parallelität
graph TB
Client[Client/Worker]
FE[Frontend<br/>Stateless API]
HS[History<br/>Sharded State]
MS[Matching<br/>Partitioned Queues]
DB[(Database<br/>Cassandra/PG/MySQL)]
ES[(Elasticsearch<br/>Visibility)]
Client -->|gRPC| FE
FE --> HS
FE --> MS
HS -->|Events| DB
HS -->|Enqueue| MS
HS -->|Index| ES
MS -->|Backlog| DB
style FE fill:#e1f5ff
style HS fill:#ffe1e1
style MS fill:#fff4e1
style DB fill:#e1ffe1
style ES fill:#ffffcc
Mit diesem tiefen Verständnis der Temporal Service Architektur können wir nun in Teil II eintauchen, wo wir uns auf die praktische Nutzung der SDKs konzentrieren und fortgeschrittene Entwicklungstechniken erlernen.
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 4: Entwicklungs-Setup und SDK-Auswahl
Code-Beispiele für dieses Kapitel: examples/part-01/chapter-03/
Kapitel 4: Entwicklungs-Setup und SDK-Auswahl
Willkommen zu Teil II des Buches! Nachdem wir in Teil I die theoretischen Grundlagen von Temporal kennengelernt haben, tauchen wir nun in die praktische Entwicklung ein. In diesem Kapitel richten wir die komplette Entwicklungsumgebung ein, wählen das richtige SDK aus und lernen die Tools kennen, die uns bei der täglichen Arbeit mit Temporal unterstützen.
4.1 SDK-Übersicht: Die Wahl der richtigen Sprache
4.1.1 Verfügbare SDKs
Temporal bietet sieben offizielle SDKs für verschiedene Programmiersprachen (Stand 2025):
graph TB
subgraph "Native Implementations"
Go[Go SDK<br/>Native Implementation]
Java[Java SDK<br/>Native Implementation]
end
subgraph "Rust Core SDK Based"
TS[TypeScript SDK<br/>Built on Rust Core]
Python[Python SDK<br/>Built on Rust Core]
DotNet[.NET SDK<br/>Built on Rust Core]
PHP[PHP SDK<br/>Built on Rust Core]
Ruby[Ruby SDK<br/>Built on Rust Core<br/>Pre-release]
end
RustCore[Rust Core SDK<br/>Shared Implementation]
TS -.-> RustCore
Python -.-> RustCore
DotNet -.-> RustCore
PHP -.-> RustCore
Ruby -.-> RustCore
style Go fill:#e1f5ff
style Java fill:#ffe1e1
style Python fill:#90EE90
style TS fill:#fff4e1
style DotNet fill:#e1ffe1
style PHP fill:#ffffcc
style Ruby fill:#FFB6C1
style RustCore fill:#cccccc
Architektur-Unterschiede:
Native Implementationen (Go, Java):
- Komplett eigenständige Implementierungen
- Eigene Metric-Standards (Sekunden statt Millisekunden)
- Lange etabliert und battle-tested
Rust Core SDK Based (TypeScript, Python, .NET, PHP, Ruby):
- Teilen dieselbe Rust-basierte Core-Implementierung
- Metrics in Millisekunden
- Effizientere Ressourcennutzung
- Einheitliches Verhalten über SDKs hinweg
4.1.2 Python SDK: Unser Fokus
Warum Python?
# Python bietet native async/await Unterstützung
import asyncio
from temporalio import workflow
from datetime import timedelta
@workflow.defn
class DataProcessingWorkflow:
@workflow.run
async def run(self, data: list[str]) -> dict:
# Natürliches async/await für parallele Operationen
results = await asyncio.gather(
*[self.process_item(item) for item in data]
)
return {"processed": len(results)}
Python SDK Stärken:
- Async/Await Native: Perfekt für Workflows mit Timern und parallelen Tasks
- Type Safety: Vollständige Type Hints mit Generics
- Workflow Sandbox: Automatische Modul-Neuladung für Determinismus
- ML/AI Ecosystem: Ideal für Data Science, Machine Learning, LLM-Projekte
- Entwickler-Freundlichkeit: Pythonic API, saubere Syntax
Technische Anforderungen:
# Python Version
Python >= 3.10 (empfohlen: 3.13)
# Core Dependencies
temporalio >= 1.0.0
protobuf >= 3.20, < 7.0.0
4.1.3 Wann welches SDK?
Entscheidungsmatrix:
| Szenario | Empfohlenes SDK | Grund |
|---|---|---|
| Data Science, ML, AI | Python | Ecosystem, Libraries |
| High-Performance Microservices | Go | Performance, Concurrency |
| Enterprise Backend | Java | JVM Ecosystem, Legacy Integration |
| Web Development | TypeScript | Node.js, Frontend-Integration |
| .NET Shops | .NET | C# Integration, Performance |
| Polyglot Architektur | Mix | Go API + Python Workers für ML |
Feature Parity: Alle Major SDKs (Go, Java, TypeScript, Python, .NET) sind production-ready mit vollständiger Feature-Parität.
4.2 Lokale Entwicklungsumgebung
4.2.1 Temporal Server Optionen
Option 1: Temporal CLI Dev Server (Empfohlen für Einstieg)
# Installation Temporal CLI
# macOS/Linux:
brew install temporal
# Oder: Download binary von CDN
# Dev Server starten
temporal server start-dev
# Mit persistenter SQLite-Datenbank
temporal server start-dev --db-filename temporal.db
# In Docker
docker run --rm -p 7233:7233 -p 8233:8233 \
temporalio/temporal server start-dev --ip 0.0.0.0
Eigenschaften:
- Ports: gRPC auf localhost:7233, Web UI auf http://localhost:8233
- Database: In-Memory (ohne --db-filename) oder SQLite
- Features: Embedded Server, Web UI, Default Namespace
- Ideal für: Erste Schritte, Tutorials, lokales Testen
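Ob der Dev Server erreichbar ist, lässt sich mit einem minimalen Verbindungstest prüfen (Adresse und Namespace entsprechen den Defaults oben):
# check_connection.py - minimaler Verbindungstest gegen den Dev Server
import asyncio
from temporalio.client import Client

async def main() -> None:
    client = await Client.connect("localhost:7233", namespace="default")
    print(f"Verbunden, Namespace: {client.namespace}")

asyncio.run(main())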
Option 2: Docker Compose (Production-like)
# Temporal Docker Compose Setup klonen
git clone https://github.com/temporalio/docker-compose.git
cd docker-compose
# Starten
docker compose up
# Im Hintergrund
docker compose up -d
Komponenten:
services:
postgresql: # Port 5432, Credentials: temporal/temporal
elasticsearch: # Port 9200, Single-Node Mode
temporal: # gRPC: 7233, Web UI: 8080
temporal-admin-tools:
temporal-ui:
Ideal für:
- Production-ähnliche lokale Entwicklung
- Testing mit Elasticsearch Visibility
- Multi-Service Integration Tests
Option 3: Temporalite (Leichtgewichtig)
# Standalone Binary mit SQLite
# Weniger Ressourcen als Docker Compose
# Nur für Development/Testing
Vergleich der Optionen:
graph LR
subgraph "Development Journey"
Start[Start Learning]
Dev[Active Development]
PreProd[Pre-Production Testing]
end
Start -->|Use| CLI[CLI Dev Server<br/>Schnell, Einfach]
Dev -->|Use| Docker[Docker Compose<br/>Production-like]
PreProd -->|Use| Full[Full Deployment<br/>Kubernetes]
style CLI fill:#90EE90
style Docker fill:#fff4e1
style Full fill:#ffe1e1
4.2.2 Python Entwicklungsumgebung
Moderne Toolchain mit uv (Empfohlen 2025):
# uv installieren (10-100x schneller als pip)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Projekt erstellen
mkdir my-temporal-project
cd my-temporal-project
# Python-Version festlegen
echo "3.13" > .python-version
# pyproject.toml erstellen
cat > pyproject.toml << 'EOF'
[project]
name = "my-temporal-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
"temporalio>=1.0.0",
]
[project.optional-dependencies]
dev = [
"pytest>=7.4.0",
"pytest-asyncio>=0.21.0",
]
EOF
# Dependencies installieren
uv sync
# Script ausführen (uv managed venv automatisch)
uv run python worker.py
Traditioneller Ansatz (falls uv nicht verfügbar):
# Virtual Environment erstellen
python -m venv venv
# Aktivieren
source venv/bin/activate # Windows: venv\Scripts\activate
# Dependencies installieren
pip install temporalio
# Mit optionalen Features
pip install temporalio[opentelemetry,pydantic]
Temporal SDK Extras:
# gRPC Support
pip install temporalio[grpc]
# OpenTelemetry für Tracing
pip install temporalio[opentelemetry]
# Pydantic Integration
pip install temporalio[pydantic]
# Alles
pip install temporalio[grpc,opentelemetry,pydantic]
4.2.3 IDE Setup und Debugging
VSCode Configuration:
// .vscode/launch.json
{
"version": "0.2.0",
"configurations": [
{
"name": "Python: Worker",
"type": "python",
"request": "launch",
"program": "${workspaceFolder}/workers/worker.py",
"console": "integratedTerminal",
"justMyCode": false,
"env": {
"TEMPORAL_ADDRESS": "localhost:7233",
"LOG_LEVEL": "DEBUG"
}
},
{
"name": "Python: Current File",
"type": "python",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal"
}
]
}
Debugging-Einschränkungen:
from temporalio import workflow, activity
# ✅ Breakpoints funktionieren
@activity.defn
async def my_activity(param: str) -> str:
# Breakpoint hier funktioniert!
result = process(param)
return result
# ❌ Breakpoints funktionieren NICHT (Sandbox-Limitation)
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self, input: str) -> str:
# Breakpoint hier wird NICHT getroffen
result = await workflow.execute_activity(...)
return result
Workaround: Nutze workflow.logger für Debugging in Workflows:
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self, order_id: str) -> str:
workflow.logger.info(f"Processing order: {order_id}")
result = await workflow.execute_activity(...)
workflow.logger.debug(f"Activity result: {result}")
return result
4.3 Projekt-Struktur und Best Practices
4.3.1 Empfohlene Verzeichnisstruktur
my-temporal-project/
├── .env # Environment Variables (nicht committen!)
├── .gitignore
├── .python-version # Python 3.13
├── pyproject.toml # Projekt-Konfiguration
├── temporal.toml # Temporal Multi-Environment Config
├── README.md
│
├── src/
│ ├── __init__.py
│ │
│ ├── workflows/ # Alle Workflow-Definitionen
│ │ ├── __init__.py
│ │ ├── order_workflow.py
│ │ └── payment_workflow.py
│ │
│ ├── activities/ # Alle Activity-Implementierungen
│ │ ├── __init__.py
│ │ ├── email_activities.py
│ │ └── payment_activities.py
│ │
│ ├── models/ # Shared Types und Dataclasses
│ │ ├── __init__.py
│ │ └── order_models.py
│ │
│ └── workers/ # Worker-Prozesse
│ ├── __init__.py
│ └── main_worker.py
│
├── tests/ # Test Suite
│ ├── __init__.py
│ ├── conftest.py # pytest Fixtures
│ ├── test_workflows/
│ │ └── test_order_workflow.py
│ └── test_activities/
│ └── test_email_activities.py
│
└── scripts/ # Helper Scripts
├── start_worker.sh
└── deploy.sh
4.3.2 Type-Safe Workflow Development
Input/Output mit Dataclasses:
# src/models/order_models.py
from dataclasses import dataclass
from typing import Optional
@dataclass
class OrderInput:
order_id: str
customer_email: str
items: list[str]
total_amount: float
@dataclass
class OrderResult:
success: bool
transaction_id: Optional[str] = None
error_message: Optional[str] = None
# src/workflows/order_workflow.py
from temporalio import workflow
from datetime import timedelta
from ..models.order_models import OrderInput, OrderResult
from ..activities.payment_activities import process_payment
from ..activities.email_activities import send_confirmation
@workflow.defn
class OrderWorkflow:
"""
Orchestriert den Order-Processing-Flow.
"""
@workflow.run
async def run(self, input: OrderInput) -> OrderResult:
"""
Verarbeitet eine Bestellung.
Args:
input: Order-Daten mit allen relevanten Informationen
Returns:
OrderResult mit Transaction ID oder Fehler
"""
workflow.logger.info(f"Processing order {input.order_id}")
try:
# Payment verarbeiten
transaction_id = await workflow.execute_activity(
process_payment,
args=[input.total_amount, input.customer_email],
start_to_close_timeout=timedelta(seconds=30),
)
# Confirmation Email senden
await workflow.execute_activity(
send_confirmation,
args=[input.customer_email, input.order_id, transaction_id],
start_to_close_timeout=timedelta(seconds=10),
)
return OrderResult(
success=True,
transaction_id=transaction_id
)
except Exception as e:
workflow.logger.error(f"Order processing failed: {e}")
return OrderResult(
success=False,
error_message=str(e)
)
Vorteile von Dataclasses:
- ✅ Typsicherheit mit IDE-Unterstützung
- ✅ Einfaches Hinzufügen neuer Felder (mit Defaults)
- ✅ Automatische Serialisierung/Deserialisierung
- ✅ Bessere Lesbarkeit als *args, **kwargs
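Die oben importierten Activities könnten beispielsweise so aussehen – eine hypothetische Minimal-Implementierung, in der Payment-Provider und E-Mail-Versand nur simuliert werden:
# src/activities/payment_activities.py (hypothetische Minimal-Implementierung)
import uuid
from temporalio import activity

@activity.defn
async def process_payment(amount: float, customer_email: str) -> str:
    """Belastet den Betrag und liefert eine Transaction ID zurück."""
    activity.logger.info(f"Charging {amount} for {customer_email}")
    # Hier würde der echte Payment-Provider aufgerufen
    return f"txn_{uuid.uuid4().hex}"

# src/activities/email_activities.py (hypothetische Minimal-Implementierung)
@activity.defn
async def send_confirmation(customer_email: str, order_id: str, transaction_id: str) -> None:
    """Sendet die Bestellbestätigung."""
    activity.logger.info(
        f"Confirmation for order {order_id} ({transaction_id}) -> {customer_email}"
    )
    # Hier würde der echte E-Mail-Versand stattfinden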
4.3.3 Configuration Management
Multi-Environment Setup mit temporal.toml:
# temporal.toml - Multi-Environment Configuration
# Default für lokale Entwicklung
[default]
target = "localhost:7233"
namespace = "default"
# Development Environment
[dev]
target = "localhost:7233"
namespace = "development"
# Staging Environment
[staging]
target = "staging.temporal.example.com:7233"
namespace = "staging"
tls_cert_path = "/path/to/staging-cert.pem"
tls_key_path = "/path/to/staging-key.pem"
# Production Environment
[prod]
target = "prod.temporal.example.com:7233"
namespace = "production"
tls_cert_path = "/path/to/prod-cert.pem"
tls_key_path = "/path/to/prod-key.pem"
Environment-basierte Client-Konfiguration:
# src/config.py
import os
from temporalio.client import Client
from temporalio.envconfig import load_client_config
async def create_client() -> Client:
"""
Erstellt Temporal Client basierend auf TEMPORAL_PROFILE env var.
Beispiel:
export TEMPORAL_PROFILE=prod
python worker.py # Verbindet zu Production
"""
profile = os.getenv("TEMPORAL_PROFILE", "default")
config = load_client_config(profile=profile)
client = await Client.connect(**config)
return client
.env File (nicht committen!):
# .env - Lokale Entwicklung
TEMPORAL_ADDRESS=localhost:7233
TEMPORAL_NAMESPACE=default
TEMPORAL_TASK_QUEUE=order-processing
TEMPORAL_PROFILE=dev
# Application Config
LOG_LEVEL=DEBUG
DATABASE_URL=postgresql://user:pass@localhost:5432/orders
# External Services
SMTP_SERVER=smtp.example.com
SMTP_PORT=587
STRIPE_API_KEY=sk_test_...
Config Loading mit Pydantic:
# src/config.py
from pydantic_settings import BaseSettings
from functools import lru_cache
class Settings(BaseSettings):
# Temporal Config
temporal_address: str = "localhost:7233"
temporal_namespace: str = "default"
temporal_task_queue: str = "default"
# Application Config
log_level: str = "INFO"
database_url: str
# External Services
smtp_server: str
smtp_port: int = 587
stripe_api_key: str
class Config:
env_file = ".env"
case_sensitive = False
@lru_cache()
def get_settings() -> Settings:
"""Singleton Settings Instance."""
return Settings()
# Usage
from config import get_settings
settings = get_settings()
4.4 Development Workflow
4.4.1 Worker Development Loop
Typischer Entwicklungs-Workflow:
graph TB
Start[Code Ändern]
Restart[Worker Neustarten]
Test[Workflow Testen]
Debug[Debuggen mit Web UI]
Fix[Fehler Fixen]
Start --> Restart
Restart --> Test
Test -->|Error| Debug
Debug --> Fix
Fix --> Start
Test -->|Success| Done[Feature Complete]
style Start fill:#e1f5ff
style Test fill:#fff4e1
style Debug fill:#ffe1e1
style Done fill:#90EE90
1. Worker starten:
# Terminal 1: Temporal Server (falls nicht bereits läuft)
temporal server start-dev --db-filename temporal.db
# Terminal 2: Worker starten
uv run python src/workers/main_worker.py
# Oder mit Environment
TEMPORAL_PROFILE=dev LOG_LEVEL=DEBUG uv run python src/workers/main_worker.py
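Der Worker-Prozess selbst besteht nur aus wenigen Zeilen. Eine mögliche Minimal-Variante von src/workers/main_worker.py (die konkreten Imports hängen vom eigenen Projekt ab):
# src/workers/main_worker.py (mögliche Minimal-Variante)
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker

from src.workflows.order_workflow import OrderWorkflow
from src.activities.payment_activities import process_payment
from src.activities.email_activities import send_confirmation

async def main() -> None:
    # Verbindung zum Temporal Frontend (lokaler Dev Server)
    client = await Client.connect("localhost:7233")

    # Worker registriert Workflows und Activities auf einer Task Queue
    worker = Worker(
        client,
        task_queue="order-processing",
        workflows=[OrderWorkflow],
        activities=[process_payment, send_confirmation],
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())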
2. Workflow ausführen:
# scripts/run_workflow.py
import asyncio
from temporalio.client import Client
from src.workflows.order_workflow import OrderWorkflow
from src.models.order_models import OrderInput
async def main():
client = await Client.connect("localhost:7233")
input_data = OrderInput(
order_id="ORD-123",
customer_email="customer@example.com",
items=["Item A", "Item B"],
total_amount=99.99
)
result = await client.execute_workflow(
OrderWorkflow.run,
input_data,
id="order-ORD-123",
task_queue="order-processing",
)
print(f"Result: {result}")
if __name__ == "__main__":
asyncio.run(main())
# Workflow ausführen
uv run python scripts/run_workflow.py
3. Debugging mit Web UI:
# Web UI öffnen
http://localhost:8233
# Workflow suchen: order-ORD-123
# → Event History inspizieren
# → Input/Output anzeigen
# → Stack Trace bei Errors
4. Code-Änderungen ohne Downtime:
# Bei Code-Änderungen:
# 1. Worker mit Ctrl+C stoppen
# 2. Code ändern
# 3. Worker neu starten
# Laufende Workflows:
# → Werden automatisch fortgesetzt
# → Event History bleibt erhalten
# → Bei Breaking Changes: Workflow Versioning nutzen (Kapitel 9)
4.4.2 Temporal CLI Commands
Wichtigste Commands für Development:
# Workflows auflisten
temporal workflow list
# Workflow Details
temporal workflow describe -w order-ORD-123
# Event History anzeigen
temporal workflow show -w order-ORD-123
# Event History als JSON exportieren
temporal workflow show -w order-ORD-123 > history.json
# Workflow starten
temporal workflow start \
--task-queue order-processing \
--type OrderWorkflow \
--workflow-id order-ORD-456 \
--input '{"order_id": "ORD-456", "customer_email": "test@example.com", ...}'
# Workflow ausführen und auf Result warten
temporal workflow execute \
--task-queue order-processing \
--type OrderWorkflow \
--workflow-id order-ORD-789 \
--input @input.json
# Workflow canceln
temporal workflow cancel -w order-ORD-123
# Workflow terminieren (hard stop)
temporal workflow terminate -w order-ORD-123
# Signal senden
temporal workflow signal \
--workflow-id order-ORD-123 \
--name add-item \
--input '{"item": "Item C"}'
# Query ausführen
temporal workflow query \
--workflow-id order-ORD-123 \
--type get-status
# Workflow Count
temporal workflow count
Temporal Web UI Navigation:
http://localhost:8233
│
├── Workflows → Alle Workflow Executions
│ ├── Filter (Status, Type, Time Range)
│ ├── Search (Workflow ID, Run ID)
│ └── Details → Event History
│ ├── Timeline View
│ ├── Compact View
│ ├── Full History
│ └── JSON Export
│
├── Namespaces → Namespace Management
├── Archival → Archived Workflows
└── Settings → Server Configuration
4.5 Testing und Debugging
4.5.1 Test-Setup mit pytest
Dependencies installieren:
# pyproject.toml
[project.optional-dependencies]
dev = [
"pytest>=7.4.0",
"pytest-asyncio>=0.21.0",
"pytest-cov>=4.1.0", # Coverage reporting
]
uv sync --all-extras
pytest Configuration:
# pyproject.toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
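Wiederkehrende Test-Infrastruktur wie die Time-Skipping-Umgebung gehört in tests/conftest.py. Eine mögliche Fixture-Skizze (der Fixture-Name ist frei gewählt):
# tests/conftest.py (mögliche Skizze)
import pytest_asyncio
from temporalio.testing import WorkflowEnvironment

@pytest_asyncio.fixture
async def temporal_env():
    """Time-Skipping Test Environment, wird pro Test auf- und abgebaut."""
    env = await WorkflowEnvironment.start_time_skipping()
    try:
        yield env
    finally:
        await env.shutdown()
Tests nehmen die Fixture dann einfach als Parameter an (async def test_x(temporal_env): ...).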
4.5.2 Workflow Testing mit Time-Skipping
Integration Test:
# tests/test_workflows/test_order_workflow.py
import pytest
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker
from src.workflows.order_workflow import OrderWorkflow
from src.activities.payment_activities import process_payment
from src.activities.email_activities import send_confirmation
from src.models.order_models import OrderInput
@pytest.mark.asyncio
async def test_order_workflow_success():
"""Test erfolgreicher Order-Processing Flow."""
# Time-Skipping Test Environment
async with await WorkflowEnvironment.start_time_skipping() as env:
# Worker mit Workflows und Activities
async with Worker(
env.client,
task_queue="test-queue",
workflows=[OrderWorkflow],
activities=[process_payment, send_confirmation],
):
# Workflow ausführen
input_data = OrderInput(
order_id="TEST-001",
customer_email="test@example.com",
items=["Item A"],
total_amount=49.99
)
result = await env.client.execute_workflow(
OrderWorkflow.run,
input_data,
id="test-order-001",
task_queue="test-queue",
)
# Assertions
assert result.success is True
assert result.transaction_id is not None
assert result.error_message is None
Test mit gemockten Activities:
from temporalio import activity
@activity.defn
async def mock_process_payment(amount: float, email: str) -> str:
"""Mock Payment Activity für Tests."""
return f"mock-txn-{amount}"
@activity.defn
async def mock_send_confirmation(email: str, order_id: str, txn_id: str) -> None:
"""Mock Email Activity für Tests."""
pass
@pytest.mark.asyncio
async def test_order_workflow_with_mocks():
"""Test mit gemockten Activities."""
async with await WorkflowEnvironment.start_time_skipping() as env:
async with Worker(
env.client,
task_queue="test-queue",
workflows=[OrderWorkflow],
activities=[mock_process_payment, mock_send_confirmation], # Mocks!
):
result = await env.client.execute_workflow(...)
assert result.success is True
Time Skipping für Timeouts:
from datetime import timedelta
@pytest.mark.asyncio
async def test_workflow_with_long_timeout():
"""Test Workflow mit 7 Tagen Sleep - läuft sofort!"""
async with await WorkflowEnvironment.start_time_skipping() as env:
async with Worker(...):
# Workflow mit 7-Tage Sleep
# Time-Skipping: Läuft in Millisekunden statt 7 Tagen!
result = await env.client.execute_workflow(
WeeklyReportWorkflow.run,
...
)
assert result.week_number == 1
4.5.3 Activity Unit Tests
# tests/test_activities/test_payment_activities.py
import pytest
from src.activities.payment_activities import process_payment
@pytest.mark.asyncio
async def test_process_payment_success():
"""Test Payment Activity in Isolation."""
# Activity direkt testen (ohne Temporal)
result = await process_payment(
amount=99.99,
email="test@example.com"
)
assert result.startswith("txn_")
assert len(result) > 10
@pytest.mark.asyncio
async def test_process_payment_invalid_amount():
"""Test Payment Activity mit ungültigem Amount."""
from temporalio.exceptions import ApplicationError
with pytest.raises(ApplicationError) as exc_info:
await process_payment(
amount=-10.00, # Ungültig!
email="test@example.com"
)
assert "Invalid amount" in str(exc_info.value)
assert exc_info.value.non_retryable is True
4.5.4 Replay Testing
Workflow History exportieren:
# Via CLI
temporal workflow show -w order-ORD-123 > workflow_history.json
# Via Web UI
# → Workflow Details → JSON Tab → Copy
Replay Test:
# tests/test_workflows/test_replay.py
import pytest
from temporalio.worker import Replayer
from temporalio.client import WorkflowHistory
from src.workflows.order_workflow import OrderWorkflow
@pytest.mark.asyncio
async def test_replay_order_workflow():
"""Test Workflow Replay für Non-Determinism."""
# History laden
with open("tests/fixtures/order_workflow_history.json") as f:
history = WorkflowHistory.from_json(f.read())
# Replayer erstellen
replayer = Replayer(
workflows=[OrderWorkflow]
)
# Replay (wirft Exception bei Non-Determinism)
await replayer.replay_workflow(history)
Warum Replay Testing?
- Erkennt Non-Deterministic Errors bei Code-Änderungen
- Verifiziert Workflow-Kompatibilität mit alten Executions
- Verhindert Production-Crashes durch Breaking Changes
4.5.5 Coverage und CI/CD
# Test Coverage
pytest --cov=src --cov-report=html tests/
# Output
# Coverage: 87%
# HTML Report: htmlcov/index.html
# Im CI/CD (GitHub Actions Beispiel)
# .github/workflows/test.yml
name: Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.13'
- name: Install uv
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Install dependencies
run: uv sync --all-extras
- name: Run tests
run: uv run pytest --cov=src --cov-report=xml
- name: Upload coverage
uses: codecov/codecov-action@v3
4.6 Debugging und Observability
4.6.1 Logging Best Practices
import logging
from temporalio import workflow, activity
# Root Logger konfigurieren
logging.basicConfig(
level=logging.INFO, # DEBUG für Development
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, input: OrderInput) -> OrderResult:
# Workflow Logger nutzen (nicht logging.getLogger!)
workflow.logger.info(f"Processing order {input.order_id}")
workflow.logger.debug(f"Input details: {input}")
try:
result = await workflow.execute_activity(...)
workflow.logger.info(f"Payment successful: {result}")
return OrderResult(success=True, transaction_id=result)
except Exception as e:
workflow.logger.error(f"Order failed: {e}", exc_info=True)
return OrderResult(success=False, error_message=str(e))
@activity.defn
async def process_payment(amount: float, email: str) -> str:
# Activity Logger nutzen
activity.logger.info(f"Processing payment: {amount} for {email}")
try:
txn_id = charge_card(amount, email)
activity.logger.info(f"Payment successful: {txn_id}")
return txn_id
except Exception as e:
activity.logger.error(f"Payment failed: {e}", exc_info=True)
raise
Log Levels:
- DEBUG: Development, detailliertes Troubleshooting
- INFO: Production, wichtige Events
- WARNING: Potentielle Probleme
- ERROR: Fehler die gehandled werden
- CRITICAL: Kritische Fehler
4.6.2 Prometheus Metrics
from temporalio.runtime import Runtime, TelemetryConfig, PrometheusConfig
from temporalio.client import Client
# Prometheus Endpoint konfigurieren
runtime = Runtime(
telemetry=TelemetryConfig(
metrics=PrometheusConfig(
bind_address="0.0.0.0:9000" # Metrics Port
)
)
)
client = await Client.connect(
"localhost:7233",
runtime=runtime
)
# Metrics verfügbar auf:
# http://localhost:9000/metrics
Verfügbare Metrics:
- temporal_workflow_task_execution_total
- temporal_activity_execution_total
- temporal_workflow_task_execution_latency_ms
- temporal_activity_execution_latency_ms
- temporal_worker_task_slots_available
- temporal_sticky_cache_hit_total
4.6.3 OpenTelemetry Tracing
# Installation
pip install temporalio[opentelemetry]
from temporalio.contrib.opentelemetry import TracingInterceptor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, BatchSpanProcessor
# OpenTelemetry Setup
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
# Temporal Client mit Tracing
interceptor = TracingInterceptor()
client = await Client.connect(
"localhost:7233",
interceptors=[interceptor]
)
# Alle Workflow/Activity Executions werden getracet
4.7 Zusammenfassung
In diesem Kapitel haben wir die komplette Entwicklungsumgebung für Temporal aufgesetzt:
SDK-Auswahl:
- 7 offizielle SDKs (Go, Java, TypeScript, Python, .NET, PHP, Ruby)
- Python SDK: Python >= 3.10, Rust Core SDK, Type-Safe, Async/Await
- Feature Parity über alle Major SDKs
Lokale Entwicklung:
- Temporal Server: CLI Dev Server (Einstieg), Docker Compose (Production-like)
- Python Setup: uv (modern, schnell) oder venv (traditionell)
- IDE: VSCode/PyCharm mit Debugging (Limitations in Workflows!)
Projekt-Struktur:
- Separation: workflows/, activities/, models/, workers/
- Type-Safe Dataclasses für Input/Output
- Multi-Environment Config (temporal.toml, .env)
Development Workflow:
- Worker Development Loop: Code → Restart → Test → Debug
- Temporal CLI: workflow list/show/execute/signal/query
- Web UI: Event History, Timeline, Stack Traces
Testing:
- pytest mit pytest-asyncio
- Time-Skipping Environment für schnelle Tests
- Activity Mocking
- Replay Testing für Non-Determinism
- Coverage Tracking
Debugging & Observability:
- Logging: workflow.logger, activity.logger
- Prometheus Metrics auf Port 9000
- OpenTelemetry Tracing
- Web UI Event History Inspection
graph TB
Code[Code Schreiben]
Test[Testen mit pytest]
Debug[Debuggen mit Web UI]
Deploy[Deployment]
Code -->|Type-Safe Dataclasses| Test
Test -->|Time-Skipping| Debug
Debug -->|Logging + Metrics| Deploy
Deploy -->|Observability| Monitor[Monitoring]
style Code fill:#e1f5ff
style Test fill:#fff4e1
style Debug fill:#ffe1e1
style Deploy fill:#90EE90
style Monitor fill:#ffffcc
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 5: Workflows programmieren
Code-Beispiele für dieses Kapitel: examples/part-02/chapter-04/
Praktische Übung: Richten Sie Ihre lokale Entwicklungsumgebung ein und erstellen Sie Ihren ersten eigenen Workflow!
Kapitel 5: Workflows programmieren
Nachdem wir die Entwicklungsumgebung in Kapitel 4 aufgesetzt haben, tauchen wir nun tief in die praktische Programmierung von Workflows ein. In diesem Kapitel lernen Sie fortgeschrittene Patterns kennen, die Ihnen helfen, robuste, skalierbare und wartbare Workflow-Anwendungen zu bauen.
5.1 Workflow-Komposition: Activities vs Child Workflows
5.1.1 Die goldene Regel
Activities sind die Default-Wahl. Nutzen Sie Child Workflows nur für spezifische Use Cases!
graph TB
Start{Aufgabe zu erledigen}
Start -->|Standard| Activity[Activity nutzen]
Start -->|Spezialfall| Decision{Benötigen Sie...}
Decision -->|Workload Partitionierung<br/>1000+ Activities| Child1[Child Workflow]
Decision -->|Service Separation<br/>Eigener Worker Pool| Child2[Child Workflow]
Decision -->|Resource Mapping<br/>Serialisierung| Child3[Child Workflow]
Decision -->|Periodische Logic<br/>Continue-As-New| Child4[Child Workflow]
Decision -->|Einfache Business Logic| Activity
style Activity fill:#90EE90
style Child1 fill:#fff4e1
style Child2 fill:#fff4e1
style Child3 fill:#fff4e1
style Child4 fill:#fff4e1
5.1.2 Wann Activities nutzen
Activities sind perfekt für:
- Business Logic (API-Aufrufe, Datenbank-Operationen)
- Alle nicht-deterministischen Operationen
- Automatische Retries
- Niedrigerer Overhead (weniger Events in History)
from temporalio import workflow, activity
from datetime import timedelta
@activity.defn
async def send_email(to: str, subject: str, body: str) -> bool:
"""Activity für E-Mail-Versand (nicht-deterministisch)."""
# Externer API-Aufruf - perfekt für Activity
result = await email_service.send(to, subject, body)
return result.success
@activity.defn
async def charge_credit_card(amount: float, card_token: str) -> str:
"""Activity für Payment Processing."""
# Externe Payment API
transaction = await payment_api.charge(amount, card_token)
return transaction.id
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order_data: dict) -> dict:
# Activities für Business Logic
transaction_id = await workflow.execute_activity(
charge_credit_card,
args=[order_data["amount"], order_data["card_token"]],
start_to_close_timeout=timedelta(seconds=30)
)
await workflow.execute_activity(
send_email,
args=[
order_data["customer_email"],
"Order Confirmation",
f"Your order is confirmed. Transaction: {transaction_id}"
],
start_to_close_timeout=timedelta(seconds=10)
)
return {"transaction_id": transaction_id, "status": "completed"}
5.1.3 Wann Child Workflows nutzen
Use Case 1: Workload Partitionierung
Für massive Fan-Outs (>1000 Activities):
import asyncio
from datetime import timedelta
from temporalio import workflow
from temporalio.workflow import ParentClosePolicy
@workflow.defn
class BatchCoordinatorWorkflow:
"""
Koordiniert Verarbeitung von 100.000 Items.
Nutzt Child Workflows zur Partitionierung.
"""
@workflow.run
async def run(self, total_items: int) -> dict:
batch_size = 1000
num_batches = (total_items + batch_size - 1) // batch_size
workflow.logger.info(f"Processing {total_items} items in {num_batches} batches")
# Starte Child Workflows (max ~1000)
batch_handles = []
for i in range(num_batches):
start_idx = i * batch_size
end_idx = min((i + 1) * batch_size, total_items)
handle = await workflow.start_child_workflow(
BatchProcessorWorkflow.run,
args=[{"start": start_idx, "end": end_idx}],
id=f"batch-{workflow.info().workflow_id}-{i}",
parent_close_policy=ParentClosePolicy.ABANDON,
)
batch_handles.append(handle)
# Warte auf alle Batches
results = await asyncio.gather(*batch_handles)
return {
"total_batches": num_batches,
"total_processed": sum(r["processed"] for r in results)
}
@workflow.defn
class BatchProcessorWorkflow:
"""
Verarbeitet einen Batch von ~1000 Items.
Jedes Child Workflow hat eigene Event History.
"""
@workflow.run
async def run(self, params: dict) -> dict:
# Verarbeite bis zu 1000 Activities
tasks = [
workflow.execute_activity(
process_single_item,
item_id,
start_to_close_timeout=timedelta(seconds=30)
)
for item_id in range(params["start"], params["end"])
]
results = await asyncio.gather(*tasks)
return {"processed": len(results)}
Warum Child Workflows hier?
- Parent Workflow: 100 Batches → ~200 Events
- Jedes Child: 1000 Activities → ~2000 Events
- Ohne Child Workflows: 100.000 Activities → ~200.000 Events in einer History (Fehler!)
- Mit Child Workflows: Verteilung über 100 separate Histories
Use Case 2: Service Separation
@workflow.defn
class OrderFulfillmentWorkflow:
"""
Koordiniert verschiedene Microservices via Child Workflows.
"""
@workflow.run
async def run(self, order_id: str) -> dict:
# Parallele Child Workflows auf verschiedenen Task Queues
inventory_handle = await workflow.start_child_workflow(
InventoryWorkflow.run,
args=[order_id],
task_queue="inventory-service", # Eigener Worker Pool
id=f"inventory-{order_id}",
)
shipping_handle = await workflow.start_child_workflow(
ShippingWorkflow.run,
args=[order_id],
task_queue="shipping-service", # Anderer Worker Pool
id=f"shipping-{order_id}",
)
# Warte auf beide Services
inventory_result, shipping_result = await asyncio.gather(
inventory_handle,
shipping_handle
)
return {
"inventory": inventory_result,
"shipping": shipping_result
}
Use Case 3: Resource Mapping (Entity Workflows)
@workflow.defn
class HostUpgradeCoordinatorWorkflow:
"""
Upgraded mehrere Hosts - ein Child Workflow pro Host.
"""
@workflow.run
async def run(self, hostnames: list[str]) -> dict:
# Jeder Hostname mapped zu eigenem Child Workflow
# Garantiert serialisierte Operationen pro Host
upgrade_handles = []
for hostname in hostnames:
handle = await workflow.start_child_workflow(
HostUpgradeWorkflow.run,
args=[hostname],
id=f"host-upgrade-{hostname}", # Eindeutige ID pro Host
)
upgrade_handles.append(handle)
results = await asyncio.gather(*upgrade_handles)
return {"upgraded": len(results)}
@workflow.defn
class HostUpgradeWorkflow:
"""
Upgraded einen einzelnen Host.
Multiple Aufrufe mit gleicher ID werden de-duplicated.
"""
@workflow.run
async def run(self, hostname: str) -> dict:
# Alle Operationen für diesen Host serialisiert
await workflow.execute_activity(
stop_host,
hostname,
start_to_close_timeout=timedelta(minutes=5)
)
await workflow.execute_activity(
upgrade_host,
hostname,
start_to_close_timeout=timedelta(minutes=30)
)
await workflow.execute_activity(
start_host,
hostname,
start_to_close_timeout=timedelta(minutes=5)
)
return {"hostname": hostname, "status": "upgraded"}
5.1.4 Parent-Child Kommunikation
import asyncio
from dataclasses import dataclass
from datetime import timedelta
from temporalio import workflow
@dataclass
class TaskUpdate:
task_id: str
status: str
progress: int
@workflow.defn
class ChildWorkerWorkflow:
def __init__(self) -> None:
self.task_data = None
self.paused = False
@workflow.run
async def run(self) -> str:
# Warte auf Task-Zuweisung via Signal
await workflow.wait_condition(lambda: self.task_data is not None)
# Verarbeite Task
for i in range(10):
# Prüfe Pause-Signal
if self.paused:
await workflow.wait_condition(lambda: not self.paused)
await workflow.execute_activity(
process_task_step,
args=[self.task_data, i],
start_to_close_timeout=timedelta(minutes=2)
)
return "completed"
@workflow.signal
def assign_task(self, task_data: dict) -> None:
"""Signal vom Parent: Task zuweisen."""
self.task_data = task_data
@workflow.signal
def pause(self) -> None:
"""Signal vom Parent: Pausieren."""
self.paused = True
@workflow.signal
def resume(self) -> None:
"""Signal vom Parent: Fortsetzen."""
self.paused = False
@workflow.query
def get_status(self) -> dict:
"""Query vom Parent oder External Client."""
return {
"has_task": self.task_data is not None,
"paused": self.paused
}
@workflow.defn
class CoordinatorWorkflow:
@workflow.run
async def run(self, tasks: list[dict]) -> dict:
# Starte Worker Child Workflows
worker_handles = []
for i in range(3): # 3 Worker
handle = await workflow.start_child_workflow(
ChildWorkerWorkflow.run,
id=f"worker-{i}",
)
worker_handles.append(handle)
# Verteile Tasks via Signals
for i, task in enumerate(tasks):
worker_idx = i % len(worker_handles)
await worker_handles[worker_idx].signal("assign_task", task)
        # Hinweis: Aus einem Workflow heraus sind Queries nicht möglich -
        # get_status wird von externen Clients abgefragt; der Parent
        # kommuniziert mit den Children ausschließlich über Signals.
# Warte auf Completion
await asyncio.gather(*worker_handles)
return {"completed_tasks": len(tasks)}
5.2 Parallele Ausführung
5.2.1 asyncio.gather für parallele Activities
import asyncio
from temporalio import workflow
from datetime import timedelta
@workflow.defn
class ParallelProcessingWorkflow:
@workflow.run
async def run(self, urls: list[str]) -> list[dict]:
# Alle URLs parallel scrapen
tasks = [
workflow.execute_activity(
scrape_url,
url,
start_to_close_timeout=timedelta(minutes=5)
)
for url in urls
]
# Warte auf alle (Results in Order der Input-Liste)
results = await asyncio.gather(*tasks)
return results
5.2.2 Fan-Out/Fan-In Pattern
graph LR
Start[Workflow Start]
FanOut[Fan-Out: Start parallel Activities]
A1[Activity 1]
A2[Activity 2]
A3[Activity 3]
A4[Activity N]
FanIn[Fan-In: Gather Results]
Aggregate[Aggregate Results]
End[Workflow End]
Start --> FanOut
FanOut --> A1
FanOut --> A2
FanOut --> A3
FanOut --> A4
A1 --> FanIn
A2 --> FanIn
A3 --> FanIn
A4 --> FanIn
FanIn --> Aggregate
Aggregate --> End
style FanOut fill:#e1f5ff
style FanIn fill:#ffe1e1
style Aggregate fill:#fff4e1
from typing import List
from dataclasses import dataclass
@dataclass
class ScrapedData:
url: str
title: str
content: str
word_count: int
@workflow.defn
class FanOutFanInWorkflow:
@workflow.run
async def run(self, data_urls: List[str]) -> dict:
workflow.logger.info(f"Fan-out: Scraping {len(data_urls)} URLs")
# Fan-Out: Parallele Activities starten
scrape_tasks = [
workflow.execute_activity(
scrape_url,
url,
start_to_close_timeout=timedelta(minutes=5)
)
for url in data_urls
]
# Fan-In: Alle Results sammeln
scraped_data: List[ScrapedData] = await asyncio.gather(*scrape_tasks)
workflow.logger.info(f"Fan-in: Scraped {len(scraped_data)} pages")
# Aggregation
aggregated = await workflow.execute_activity(
aggregate_scraped_data,
scraped_data,
start_to_close_timeout=timedelta(minutes=2)
)
return {
"total_pages": len(scraped_data),
"total_words": sum(d.word_count for d in scraped_data),
"aggregated_insights": aggregated
}
5.2.3 Performance-Limitierungen bei Fan-Outs
WICHTIG: Ein einzelner Workflow ist auf ~30 Activities/Sekunde limitiert, unabhängig von Ressourcen!
Lösung für massive Fan-Outs:
@workflow.defn
class ScalableFanOutWorkflow:
"""
Für 10.000+ Items: Nutze Child Workflows zur Partitionierung.
"""
@workflow.run
async def run(self, total_items: int) -> dict:
batch_size = 1000 # Items pro Child Workflow
# Berechne Anzahl Batches
num_batches = (total_items + batch_size - 1) // batch_size
workflow.logger.info(
f"Processing {total_items} items via {num_batches} child workflows"
)
# Fan-Out über Child Workflows
batch_workflows = []
for i in range(num_batches):
start_idx = i * batch_size
end_idx = min((i + 1) * batch_size, total_items)
handle = await workflow.start_child_workflow(
BatchProcessorWorkflow.run,
{"start": start_idx, "end": end_idx},
id=f"batch-{i}",
)
batch_workflows.append(handle)
# Fan-In: Warte auf alle Batches
batch_results = await asyncio.gather(*batch_workflows)
return {
"batches_processed": len(batch_results),
"total_items": total_items
}
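Der oben referenzierte BatchProcessorWorkflow ist nicht abgedruckt; eine minimale Skizze (process_item ist eine angenommene Activity):
@workflow.defn
class BatchProcessorWorkflow:
    """Skizze: verarbeitet einen Index-Bereich als Child Workflow."""

    @workflow.run
    async def run(self, batch: dict) -> int:
        processed = 0
        for idx in range(batch["start"], batch["end"]):
            await workflow.execute_activity(
                process_item,
                idx,
                start_to_close_timeout=timedelta(minutes=1),
            )
            processed += 1
        return processed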
Performance-Matrix:
| Items | Strategie | Geschätzte Zeit |
|---|---|---|
| 10-100 | Direkte Activities in Workflow | Sekunden |
| 100-1000 | asyncio.gather | Minuten |
| 1000-10.000 | Batch Processing | 5-10 Minuten |
| 10.000+ | Child Workflows | 30+ Minuten |
5.3 Timers und Scheduling
5.3.1 asyncio.sleep() für Delays (Durable Timers)
import asyncio
from datetime import timedelta
from temporalio import workflow
@workflow.defn
class DelayWorkflow:
@workflow.run
async def run(self) -> str:
workflow.logger.info("Starting workflow")
# Sleep für 10 Sekunden (durable timer)
await asyncio.sleep(10)
workflow.logger.info("10 seconds passed")
# Kann auch Monate schlafen - Resource-Light!
await asyncio.sleep(60 * 60 * 24 * 30) # 30 Tage
workflow.logger.info("30 days passed")
return "Timers completed"
Wichtig: Timers sind persistent! Worker/Service Restarts haben keinen Einfluss.
5.3.2 Timeout Patterns
@workflow.defn
class TimeoutWorkflow:
def __init__(self) -> None:
self.approval_received = False
@workflow.run
async def run(self, order_id: str) -> dict:
workflow.logger.info(f"Awaiting approval for order {order_id}")
try:
# Warte auf Approval Signal oder Timeout
await workflow.wait_condition(
lambda: self.approval_received,
timeout=timedelta(hours=24) # 24h Timeout
)
return {"status": "approved", "order_id": order_id}
except asyncio.TimeoutError:
workflow.logger.warning(f"Approval timeout for order {order_id}")
# Automatische Ablehnung nach Timeout
await workflow.execute_activity(
reject_order,
order_id,
start_to_close_timeout=timedelta(seconds=30)
)
return {"status": "rejected_timeout", "order_id": order_id}
@workflow.signal
def approve(self) -> None:
self.approval_received = True
5.3.3 Cron Workflows mit Schedules
Moderne Methode (Empfohlen):
from temporalio.client import (
Client,
Schedule,
ScheduleActionStartWorkflow,
ScheduleSpec,
ScheduleIntervalSpec
)
from datetime import timedelta
async def create_daily_report_schedule():
client = await Client.connect("localhost:7233")
# Schedule erstellen: Täglich um 9 Uhr
await client.create_schedule(
"daily-report-schedule",
Schedule(
            action=ScheduleActionStartWorkflow(
                DailyReportWorkflow.run,
                id="daily-report-workflow",  # Workflow-ID für die geplanten Runs
                task_queue="reports",
            ),
spec=ScheduleSpec(
# Cron Expression: Minute Hour Day Month Weekday
cron_expressions=["0 9 * * *"], # Täglich 9:00 UTC
),
),
)
# Interval-basiert: Jede Stunde
await client.create_schedule(
"hourly-sync-schedule",
Schedule(
            action=ScheduleActionStartWorkflow(
                SyncWorkflow.run,
                id="hourly-sync-workflow",  # Workflow-ID für die geplanten Runs
                task_queue="sync",
            ),
spec=ScheduleSpec(
intervals=[
ScheduleIntervalSpec(every=timedelta(hours=1))
],
),
),
)
Cron Expression Beispiele:
# Jede Minute
"* * * * *"
# Jeden Tag um Mitternacht
"0 0 * * *"
# Wochentags um 12 Uhr
"0 12 * * MON-FRI"
# Jeden Montag um 8:00
"0 8 * * MON"
# Am 1. jeden Monats
"0 0 1 * *"
# Alle 15 Minuten
"*/15 * * * *"
5.3.4 Timer Cancellation
@workflow.defn
class CancellableTimerWorkflow:
def __init__(self) -> None:
self.timer_cancelled = False
@workflow.run
async def run(self) -> str:
# Starte 1-Stunden Timer
sleep_task = asyncio.create_task(asyncio.sleep(3600))
# Warte auf Timer oder Cancellation
await workflow.wait_condition(
lambda: self.timer_cancelled or sleep_task.done()
)
if self.timer_cancelled:
# Timer canceln
sleep_task.cancel()
try:
await sleep_task
except asyncio.CancelledError:
return "Timer was cancelled"
return "Timer completed normally"
@workflow.signal
def cancel_timer(self) -> None:
self.timer_cancelled = True
5.4 State Management und Queries
5.4.1 Workflow Instance Variables
from dataclasses import dataclass, field
from typing import Dict, List
@dataclass
class OrderState:
order_id: str
items: List[dict] = field(default_factory=list)
total_amount: float = 0.0
status: str = "pending"
approvals: Dict[str, bool] = field(default_factory=dict)
@workflow.defn
class StatefulOrderWorkflow:
def __init__(self) -> None:
# Instance Variables halten State
self.state = OrderState(order_id="")
self.processing_complete = False
@workflow.run
async def run(self, order_id: str) -> OrderState:
self.state.order_id = order_id
self.state.status = "fetching_items"
# State persistiert über Activities
items = await workflow.execute_activity(
fetch_order_items,
order_id,
start_to_close_timeout=timedelta(minutes=1)
)
self.state.items = items
self.state.total_amount = sum(item["price"] for item in items)
# Conditional basierend auf State
if self.state.total_amount > 1000:
self.state.status = "awaiting_approval"
await workflow.wait_condition(
lambda: "manager" in self.state.approvals
)
self.state.status = "approved"
return self.state
@workflow.signal
def approve(self, approver: str) -> None:
"""Signal updated State."""
self.state.approvals[approver] = True
@workflow.query
def get_state(self) -> OrderState:
"""Query liest State (read-only!)."""
return self.state
@workflow.query
def get_total(self) -> float:
return self.state.total_amount
5.4.2 State Queries für Progress Tracking
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
@dataclass
class ProgressInfo:
phase: str
current_step: int
total_steps: int
percentage: float
    start_time: Optional[datetime] = None
    estimated_completion: Optional[datetime] = None
@workflow.defn
class ProgressTrackingWorkflow:
def __init__(self) -> None:
self.progress = ProgressInfo(
phase="initializing",
current_step=0,
total_steps=100,
percentage=0.0,
start_time=None
)
@workflow.run
async def run(self, total_items: int) -> dict:
self.progress.total_steps = total_items
        self.progress.start_time = workflow.now()  # deterministische Workflow-Zeit (datetime)
# Phase 1: Initialization
self.progress.phase = "initialization"
await workflow.execute_activity(
initialize_activity,
start_to_close_timeout=timedelta(minutes=1)
)
self._update_progress(10)
# Phase 2: Processing
self.progress.phase = "processing"
for i in range(total_items):
await workflow.execute_activity(
process_item,
i,
start_to_close_timeout=timedelta(minutes=2)
)
self.progress.current_step = i + 1
self._update_progress()
# Phase 3: Finalization
self.progress.phase = "finalization"
self._update_progress(100)
return {"completed": self.progress.current_step}
def _update_progress(self, override_percentage: float = None) -> None:
if override_percentage is not None:
self.progress.percentage = override_percentage
else:
self.progress.percentage = (
self.progress.current_step / self.progress.total_steps * 100
)
        # ETA-Berechnung (workflow.now() liefert die deterministische Workflow-Zeit)
        if self.progress.current_step > 0 and self.progress.start_time is not None:
            elapsed = (workflow.now() - self.progress.start_time).total_seconds()
            rate = self.progress.current_step / elapsed if elapsed > 0 else 0
            remaining = self.progress.total_steps - self.progress.current_step
            eta_seconds = remaining / rate if rate > 0 else 0
            self.progress.estimated_completion = (
                workflow.now() + timedelta(seconds=eta_seconds)
            )
@workflow.query
def get_progress(self) -> ProgressInfo:
"""Query für aktuellen Progress."""
return self.progress
@workflow.query
def get_percentage(self) -> float:
"""Query nur für Percentage."""
return self.progress.percentage
Client-Side Progress Monitoring:
from temporalio.client import Client
import asyncio
async def monitor_workflow_progress():
client = await Client.connect("localhost:7233")
handle = client.get_workflow_handle("workflow-id")
while True:
# Query Progress
progress = await handle.query("get_progress")
print(f"Phase: {progress.phase}")
print(f"Progress: {progress.percentage:.2f}%")
print(f"Step: {progress.current_step}/{progress.total_steps}")
if progress.estimated_completion:
print(f"ETA: {progress.estimated_completion}")
if progress.percentage >= 100:
print("Workflow complete!")
break
await asyncio.sleep(5) # Poll alle 5 Sekunden
5.5 Error Handling und Resilience
5.5.1 try/except in Workflows
from temporalio.exceptions import ApplicationError, ActivityError
from temporalio.common import RetryPolicy
@workflow.defn
class ErrorHandlingWorkflow:
@workflow.run
async def run(self, items: List[str]) -> dict:
successful = []
failed = []
for item in items:
try:
result = await workflow.execute_activity(
process_item,
item,
start_to_close_timeout=timedelta(minutes=2),
retry_policy=RetryPolicy(
maximum_attempts=3,
non_retryable_error_types=["InvalidInput"]
)
)
successful.append(result)
except ActivityError as e:
# Activity failed nach allen Retries
workflow.logger.warning(f"Failed to process {item}: {e.cause}")
                failed.append({
                    "item": item,
                    "error": str(e.cause),
                    # retry_state ist ein Enum (WARUM nicht weiter retried wurde),
                    # keine Versuchszahl
                    "retry_state": e.retry_state.name if e.retry_state else None
                })
# Workflow fährt fort!
return {
"successful": len(successful),
"failed": len(failed),
"total": len(items)
}
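Damit non_retryable_error_types=["InvalidInput"] greift, muss die Activity einen Fehler mit genau diesem Typ werfen. Eine Skizze der Activity-Seite (die Validierungslogik ist nur eine Annahme):
from temporalio import activity
from temporalio.exceptions import ApplicationError

@activity.defn
async def process_item(item: str) -> str:
    if not item:
        # Fehlertyp "InvalidInput" matcht non_retryable_error_types im Workflow
        raise ApplicationError(
            "Item darf nicht leer sein",
            type="InvalidInput",
            non_retryable=True,
        )
    return f"processed:{item}"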
5.5.2 SAGA Pattern für Compensation
from typing import List, Callable
@workflow.defn
class BookingWorkflow:
"""
SAGA Pattern: Bei Fehler Rollback aller vorherigen Schritte.
"""
@workflow.run
async def run(self, booking_data: dict) -> dict:
compensations: List[Callable] = []
try:
# Step 1: Buche Auto
car_result = await workflow.execute_activity(
book_car,
booking_data,
start_to_close_timeout=timedelta(seconds=10),
)
# Registriere Compensation
compensations.append(
lambda: workflow.execute_activity(
undo_book_car,
booking_data,
start_to_close_timeout=timedelta(seconds=10)
)
)
# Step 2: Buche Hotel
hotel_result = await workflow.execute_activity(
book_hotel,
booking_data,
start_to_close_timeout=timedelta(seconds=10),
)
compensations.append(
lambda: workflow.execute_activity(
undo_book_hotel,
booking_data,
start_to_close_timeout=timedelta(seconds=10)
)
)
# Step 3: Buche Flug
flight_result = await workflow.execute_activity(
book_flight,
booking_data,
start_to_close_timeout=timedelta(seconds=10),
)
compensations.append(
lambda: workflow.execute_activity(
undo_book_flight,
booking_data,
start_to_close_timeout=timedelta(seconds=10)
)
)
return {
"status": "success",
"car": car_result,
"hotel": hotel_result,
"flight": flight_result
}
except Exception as e:
# Fehler - Führe Compensations in umgekehrter Reihenfolge aus
workflow.logger.error(f"Booking failed: {e}, rolling back...")
for compensation in reversed(compensations):
try:
await compensation()
except Exception as comp_error:
workflow.logger.error(f"Compensation failed: {comp_error}")
return {
"status": "rolled_back",
"error": str(e)
}
5.6 Long-Running Workflows und Continue-As-New
5.6.1 Event History Management
Problem: Event History ist auf 51.200 Events oder 50 MB limitiert.
Lösung: Continue-As-New
from dataclasses import dataclass
@dataclass
class WorkflowState:
processed_count: int = 0
iteration: int = 0
@workflow.defn
class LongRunningWorkflow:
@workflow.run
async def run(self, state: WorkflowState = None) -> None:
# Initialisiere oder restore State
if state is None:
self.state = WorkflowState()
else:
self.state = state
workflow.logger.info(f"Resumed at iteration {self.state.iteration}")
# Verarbeite Batch
for i in range(100):
await workflow.execute_activity(
process_item,
self.state.processed_count,
start_to_close_timeout=timedelta(minutes=1)
)
self.state.processed_count += 1
self.state.iteration += 1
# Check Continue-As-New Suggestion
if workflow.info().is_continue_as_new_suggested():
workflow.logger.info(
f"Continuing as new after {self.state.processed_count} items"
)
workflow.continue_as_new(self.state)
# Oder: Custom Trigger
if self.state.processed_count % 10000 == 0:
workflow.continue_as_new(self.state)
5.6.2 Infinite Loop mit Continue-As-New
Entity Workflow Pattern (Actor Model):
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional
@dataclass
class AccountState:
account_id: str
balance: float = 0.0
transaction_count: int = 0
@workflow.defn
class AccountEntityWorkflow:
"""
Läuft unbegrenzt - Entity Workflow für ein Bank-Konto.
"""
    def __init__(self) -> None:
        self.state: Optional[AccountState] = None
        self.pending_transactions: List[dict] = []
self.should_shutdown = False
@workflow.run
async def run(self, initial_state: AccountState = None) -> None:
# Initialize oder restore
if initial_state:
self.state = initial_state
else:
self.state = AccountState(
account_id=workflow.info().workflow_id
)
workflow.logger.info(
f"Account {self.state.account_id} started. "
f"Balance: {self.state.balance}, "
f"Transactions: {self.state.transaction_count}"
)
        # Infinite Loop
        while not self.should_shutdown:
            # Warte auf Transactions oder Timeout (der Timeout dient nur als Wakeup)
            try:
                await workflow.wait_condition(
                    lambda: len(self.pending_transactions) > 0 or self.should_shutdown,
                    timeout=timedelta(seconds=30)
                )
            except asyncio.TimeoutError:
                pass  # kein Event innerhalb von 30 Sekunden - nächste Iteration
# Verarbeite Transactions
while self.pending_transactions:
transaction = self.pending_transactions.pop(0)
try:
result = await workflow.execute_activity(
process_transaction,
transaction,
start_to_close_timeout=timedelta(seconds=10)
)
self.state.balance += result["amount"]
self.state.transaction_count += 1
except Exception as e:
workflow.logger.error(f"Transaction failed: {e}")
            # Continue-As-New nach jeweils 1000 Transactions (nicht schon bei 0!)
            if self.state.transaction_count > 0 and self.state.transaction_count % 1000 == 0:
workflow.logger.info(
f"Continuing as new after {self.state.transaction_count} transactions"
)
workflow.continue_as_new(self.state)
# Graceful Shutdown
workflow.logger.info("Account workflow shutting down gracefully")
@workflow.signal
def deposit(self, amount: float) -> None:
"""Signal: Geld einzahlen."""
self.pending_transactions.append({
"type": "deposit",
"amount": amount,
"timestamp": workflow.time()
})
@workflow.signal
def withdraw(self, amount: float) -> None:
"""Signal: Geld abheben."""
self.pending_transactions.append({
"type": "withdraw",
"amount": -amount,
"timestamp": workflow.time()
})
@workflow.signal
def shutdown(self) -> None:
"""Signal: Workflow beenden."""
self.should_shutdown = True
@workflow.query
def get_balance(self) -> float:
"""Query: Aktueller Kontostand."""
return self.state.balance
@workflow.query
def get_transaction_count(self) -> int:
"""Query: Anzahl Transaktionen."""
return self.state.transaction_count
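Zur Einordnung eine Client-Skizze, die ein Konto startet, Signale sendet und den Kontostand abfragt (Workflow ID und Task Queue sind Annahmen):
from temporalio.client import Client

async def use_account() -> None:
    client = await Client.connect("localhost:7233")
    # Entity Workflow starten - die Workflow ID dient als Konto-ID
    handle = await client.start_workflow(
        AccountEntityWorkflow.run,
        id="account-4711",
        task_queue="accounts",
    )
    await handle.signal(AccountEntityWorkflow.deposit, 100.0)
    await handle.signal(AccountEntityWorkflow.withdraw, 25.0)
    # Hinweis: Signale werden asynchron verarbeitet - der Kontostand
    # kann unmittelbar nach dem Senden noch unverändert sein
    balance = await handle.query(AccountEntityWorkflow.get_balance)
    print(f"Balance: {balance}")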
5.7 Zusammenfassung
In diesem Kapitel haben wir fortgeschrittene Workflow-Programming-Patterns kennengelernt:
Workflow-Komposition:
- Activities: Default für Business Logic
- Child Workflows: Nur für Workload-Partitionierung, Service-Separation, Resource-Mapping
- Parent-Child Kommunikation via Signals
Parallele Ausführung:
- asyncio.gather() für parallele Activities
- Fan-Out/Fan-In Patterns
- Performance-Limit: ~30 Activities/Sekunde pro Workflow
- Lösung für 10.000+ Items: Child Workflows
Timers und Scheduling:
- asyncio.sleep() für Delays (Tage, Monate möglich!)
- Timeout Patterns mit wait_condition()
- Cron Workflows via Schedules
- Timer Cancellation
State Management:
- Instance Variables für Workflow-State
- Queries für Progress Tracking (read-only!)
- Signals für State Updates
- ETA-Berechnungen
Error Handling:
- try/except für Activity Failures
- SAGA Pattern für Compensations
- Graceful Degradation
- Workflows failen NICHT automatisch bei Activity Errors
Long-Running Workflows:
- Event History Limit: 51.200 Events / 50 MB
- workflow.info().is_continue_as_new_suggested()
- State Transfer via workflow.continue_as_new()
- Entity Workflows mit Infinite Loops
graph TB
Start[Workflow Development]
Design{Design Pattern}
Design -->|Business Logic| Activities[Use Activities]
Design -->|Massive Scale| ChildWF[Use Child Workflows]
Design -->|Parallel| Gather[asyncio.gather]
Design -->|Long-Running| CAN[Continue-As-New]
Design -->|Error Handling| SAGA[SAGA Pattern]
Activities --> Best[Best Practices]
ChildWF --> Best
Gather --> Best
CAN --> Best
SAGA --> Best
Best --> Production[Production-Ready Workflows]
style Activities fill:#90EE90
style ChildWF fill:#fff4e1
style Gather fill:#e1f5ff
style CAN fill:#ffe1e1
style SAGA fill:#ffffcc
style Production fill:#90EE90
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 6: Kommunikation (Signale und Queries)
Code-Beispiele für dieses Kapitel: examples/part-02/chapter-05/
Praktische Übung: Implementieren Sie einen Entity Workflow mit Signals, Queries und Continue-As-New!
Kapitel 6: Kommunikation mit Workflows - Signale, Queries und Updates
Einleitung
In den vorherigen Kapiteln haben Sie gelernt, wie man Workflows definiert und programmiert. Doch in der realen Welt existieren Workflows nicht isoliert – sie müssen mit externen Systemen, Benutzern und anderen Workflows kommunizieren. Ein Approval-Workflow muss auf Genehmigungsentscheidungen warten, ein Bestellprozess muss Statusabfragen beantworten, und lange laufende Prozesse müssen auf externe Events reagieren können.
Temporal bietet drei leistungsstarke Mechanismen für die Kommunikation mit laufenden Workflows:
- Signale (Signals): Asynchrone, fire-and-forget Nachrichten zur Zustandsänderung
- Queries: Synchrone, read-only Abfragen des Workflow-Zustands
- Updates: Synchrone Zustandsänderungen mit Rückgabewerten und Validierung
Dieses Kapitel erklärt detailliert, wie diese drei Mechanismen funktionieren, wann Sie welchen verwenden sollten, und welche Best Practices und Fallstricke es zu beachten gilt.
Lernziele
Nach diesem Kapitel können Sie:
- Signale implementieren und verstehen, wann sie geeignet sind
- Queries für effiziente Zustandsabfragen nutzen
- Updates mit Validierung für synchrone Operationen einsetzen
- Zwischen den drei Mechanismen fundiert entscheiden
- Human-in-the-Loop Workflows implementieren
- Häufige Fehler und Anti-Patterns vermeiden
6.1 Signale (Signals) - Asynchrone Workflow-Kommunikation
6.1.1 Definition und Zweck
Signale sind asynchrone, fire-and-forget Nachrichten, die den Zustand eines laufenden Workflows ändern, ohne dass der Sender auf eine Antwort warten muss. Wenn ein Signal empfangen wird, persistiert Temporal sowohl das Event als auch die Payload dauerhaft in der Event History.
Hauptmerkmale von Signalen:
- Asynchron: Server akzeptiert Signal sofort ohne auf Verarbeitung zu warten
- Non-blocking: Sender erhält keine Rückgabewerte oder Exceptions
- Dauerhaft: Erzeugt einen WorkflowExecutionSignaled Event in der Event History
- Geordnet: Signale werden in der Reihenfolge verarbeitet, in der sie empfangen wurden
- Gepuffert: Können vor Workflow-Start gesendet werden; werden dann gepuffert und beim Start verarbeitet
Sequenzdiagramm: Signal-Ablauf
sequenceDiagram
participant Client as External Client
participant Frontend as Temporal Frontend
participant History as History Service
participant Worker as Worker
participant Workflow as Workflow Code
Client->>Frontend: send_signal("approve", data)
Frontend->>History: Store WorkflowExecutionSignaled Event
History-->>Frontend: Event stored
Frontend-->>Client: Signal accepted (async return)
Note over Worker: Worker polls for tasks
Worker->>Frontend: Poll for Workflow Task
Frontend->>History: Get pending events
History-->>Frontend: Return events (including signal)
Frontend-->>Worker: Workflow Task with signal event
Worker->>Workflow: Replay + execute signal handler
Workflow->>Workflow: approve(data)
Workflow->>Workflow: Update state
Worker->>Frontend: Complete task with new commands
Frontend->>History: Store completion events
Wann Signale verwenden:
- ✅ Asynchrone Benachrichtigungen ohne Rückmeldung
- ✅ Event-gesteuerte Workflows (z.B. “neuer Upload”, “Zahlung eingegangen”)
- ✅ Human-in-the-Loop mit wait conditions
- ✅ Externe System-Integrationen mit fire-and-forget Semantik
- ❌ Wenn Sie wissen müssen, ob Operation erfolgreich war → Update verwenden
- ❌ Wenn Sie Validierung vor Ausführung brauchen → Update verwenden
6.1.2 Signal Handler definieren
Signal Handler werden mit dem @workflow.signal Decorator dekoriert und können synchron (def) oder asynchron (async def) sein.
Einfacher Signal Handler:
from temporalio import workflow
from typing import Optional
from dataclasses import dataclass
@dataclass
class ApprovalInput:
"""Typsichere Signal-Daten mit Dataclass"""
approver_name: str
approved: bool
comment: str = ""
@workflow.defn
class ApprovalWorkflow:
"""Workflow mit Signal-basierter Genehmigung"""
@workflow.init
def __init__(self) -> None:
# WICHTIG: Initialisierung mit @workflow.init
# garantiert Ausführung vor Signal Handlern
self.approved: Optional[bool] = None
self.approver_name: Optional[str] = None
self.comment: str = ""
@workflow.signal
def approve(self, input: ApprovalInput) -> None:
"""Signal Handler - ändert Workflow-Zustand"""
self.approved = input.approved
self.approver_name = input.approver_name
self.comment = input.comment
workflow.logger.info(
f"Approval decision received from {input.approver_name}: "
f"{'approved' if input.approved else 'rejected'}"
)
@workflow.run
async def run(self, request_id: str) -> str:
"""Wartet auf Signal via wait_condition"""
workflow.logger.info(f"Waiting for approval on request {request_id}")
# Warten bis Signal empfangen wurde
await workflow.wait_condition(lambda: self.approved is not None)
if self.approved:
return f"Request approved by {self.approver_name}"
else:
return f"Request rejected by {self.approver_name}: {self.comment}"
Asynchroner Signal Handler mit Activities:
Signal Handler können auch asynchron sein und Activities, Child Workflows oder Timers ausführen:
from datetime import timedelta
@workflow.defn
class OrderWorkflow:
@workflow.init
def __init__(self) -> None:
self.orders: list[dict] = []
@workflow.signal
async def process_order(self, order: dict) -> None:
"""Async Signal Handler kann Activities ausführen"""
# Validierung via Activity
validated_order = await workflow.execute_activity(
validate_order,
order,
start_to_close_timeout=timedelta(seconds=10),
)
# Zustand aktualisieren
self.orders.append(validated_order)
workflow.logger.info(f"Processed order: {validated_order['order_id']}")
Best Practice: Leichtgewichtige Signal Handler
# ✓ Empfohlen: Signal aktualisiert Zustand, Workflow verarbeitet
@workflow.signal
def add_order(self, order: dict) -> None:
"""Leichtgewichtiger Handler - nur Zustand ändern"""
self.pending_orders.append(order)
@workflow.run
async def run(self) -> None:
"""Haupt-Workflow verarbeitet gepufferte Orders"""
while True:
# Warte auf neue Orders
await workflow.wait_condition(lambda: len(self.pending_orders) > 0)
# Verarbeite alle gepufferten Orders
while self.pending_orders:
order = self.pending_orders.pop(0)
# Verarbeitung im Haupt-Workflow
await workflow.execute_activity(
process_order_activity,
order,
start_to_close_timeout=timedelta(minutes=5),
)
6.1.3 Signale von externen Clients senden
Um ein Signal an einen laufenden Workflow zu senden, benötigen Sie ein Workflow Handle:
from temporalio.client import Client
async def send_approval_signal():
"""Signal von externem Client senden"""
# Verbindung zum Temporal Service
client = await Client.connect("localhost:7233")
# Handle für existierenden Workflow abrufen
workflow_handle = client.get_workflow_handle_for(
ApprovalWorkflow,
workflow_id="approval-request-123"
)
# Signal senden (fire-and-forget)
await workflow_handle.signal(
ApprovalWorkflow.approve,
ApprovalInput(
approver_name="Alice Johnson",
approved=True,
comment="Budget approved"
)
)
print("Signal sent successfully")
# Kehrt sofort zurück - wartet nicht auf Verarbeitung!
Typsicheres Signaling mit Workflow-Referenz:
# Empfohlen: Typsicher mit Workflow-Klasse
await workflow_handle.signal(
ApprovalWorkflow.approve, # Methoden-Referenz (typsicher)
ApprovalInput(...)
)
# Alternativ: String-basiert (anfällig für Tippfehler)
await workflow_handle.signal(
"approve", # String-basiert
ApprovalInput(...)
)
6.1.4 Signal-with-Start Pattern
Das Signal-with-Start Pattern ermöglicht lazy Workflow-Initialisierung: Es sendet ein Signal an einen existierenden Workflow oder startet einen neuen, falls keiner existiert.
Use Case: Shopping Cart Workflow
@workflow.defn
class ShoppingCartWorkflow:
"""Lazy-initialisierter Shopping Cart via Signal-with-Start"""
@workflow.init
def __init__(self) -> None:
self.items: list[dict] = []
self.total = 0.0
@workflow.signal
def add_item(self, item: dict) -> None:
"""Items zum Warenkorb hinzufügen"""
self.items.append(item)
self.total += item['price']
workflow.logger.info(f"Added {item['name']} - Total: ${self.total:.2f}")
@workflow.run
async def run(self) -> dict:
"""Warenkorb läuft 24h, dann automatischer Checkout"""
# Warte 24 Stunden auf weitere Items
        await asyncio.sleep(timedelta(hours=24).total_seconds())  # asyncio.sleep erwartet Sekunden
# Automatischer Checkout
if len(self.items) > 0:
await workflow.execute_activity(
process_checkout,
{"items": self.items, "total": self.total},
start_to_close_timeout=timedelta(minutes=5),
)
return {"items": len(self.items), "total": self.total}
# Client: Signal-with-Start verwenden
async def add_to_cart(user_id: str, item: dict):
"""Item zum User-Warenkorb hinzufügen (lazy init)"""
client = await Client.connect("localhost:7233")
# Start Workflow mit Initial-Signal
await client.start_workflow(
ShoppingCartWorkflow.run,
id=f"cart-{user_id}", # Ein Warenkorb pro User
task_queue="shopping",
start_signal="add_item", # Signal-Name
start_signal_args=[item], # Signal-Argumente
)
print(f"Item added to cart for user {user_id}")
Ablaufdiagramm: Signal-with-Start
flowchart TD
A[Client: Signal-with-Start] --> B{Workflow existiert?}
B -->|Ja| C[Signal an existierenden Workflow senden]
B -->|Nein| D[Neuen Workflow starten]
D --> E[Gepuffertes Signal ausliefern]
C --> F[Signal verarbeitet]
E --> F
Vorteile:
- ✅ Idempotent: Mehrfache Aufrufe sicher
- ✅ Race-condition sicher: Kein “create before send signal” Problem
- ✅ Lazy Initialization: Workflow nur wenn nötig
- ✅ Perfekt für User-Sessions, Shopping Carts, etc.
6.1.5 Wait Conditions mit Signalen
Die workflow.wait_condition() Funktion ist das Kernstück für Signal-basierte Koordination:
Einfache Wait Condition:
@workflow.run
async def run(self) -> str:
# Warten bis Signal empfangen
await workflow.wait_condition(lambda: self.approved is not None)
# Fortfahren nach Signal
return "Approved!"
Wait Condition mit Timeout:
import asyncio
@workflow.run
async def run(self) -> str:
try:
# Warte maximal 7 Tage auf Approval
await workflow.wait_condition(
lambda: self.approved is not None,
timeout=timedelta(days=7)
)
return "Approved!"
except asyncio.TimeoutError:
# Automatische Ablehnung nach Timeout
return "Approval timeout - request auto-rejected"
Mehrere Bedingungen kombinieren:
@workflow.defn
class MultiStageWorkflow:
@workflow.init
def __init__(self) -> None:
self.stage1_complete = False
self.stage2_complete = False
self.payment_confirmed = False
@workflow.signal
def complete_stage1(self) -> None:
self.stage1_complete = True
@workflow.signal
def complete_stage2(self) -> None:
self.stage2_complete = True
@workflow.signal
def confirm_payment(self, amount: float) -> None:
self.payment_confirmed = True
@workflow.run
async def run(self) -> str:
# Warte auf ALLE Bedingungen
await workflow.wait_condition(
lambda: (
self.stage1_complete and
self.stage2_complete and
self.payment_confirmed
),
timeout=timedelta(hours=48)
)
return "All conditions met - proceeding"
Wait Condition Pattern Visualisierung:
stateDiagram-v2
[*] --> Waiting: workflow.wait_condition()
Waiting --> CheckCondition: Worker wakes up
CheckCondition --> Waiting: Condition false
CheckCondition --> Continue: Condition true
CheckCondition --> Timeout: Timeout reached
Continue --> [*]: Workflow proceeds
Timeout --> [*]: asyncio.TimeoutError
note right of Waiting
Workflow blockiert hier
bis Signal empfangen
oder Timeout
end note
6.1.6 Signal-Reihenfolge und Garantien
Ordering Guarantee:
Temporal garantiert, dass Signale in der Reihenfolge verarbeitet werden, in der sie empfangen wurden:
# Client sendet 3 Signale schnell hintereinander
await handle.signal(MyWorkflow.signal1, "first")
await handle.signal(MyWorkflow.signal2, "second")
await handle.signal(MyWorkflow.signal3, "third")
# Workflow-Handler werden GARANTIERT in dieser Reihenfolge ausgeführt:
# 1. signal1("first")
# 2. signal2("second")
# 3. signal3("third")
Event History Eintrag:
Jedes Signal erzeugt einen WorkflowExecutionSignaled Event:
{
"eventType": "WorkflowExecutionSignaled",
"eventId": 42,
"workflowExecutionSignaledEventAttributes": {
"signalName": "approve",
"input": {
"payloads": [...]
}
}
}
Replay-Sicherheit:
Bei Replay werden Signale in derselben Reihenfolge aus der Event History gelesen und ausgeführt - deterministisch.
6.1.7 Praktische Anwendungsfälle für Signale
Use Case 1: Human-in-the-Loop Expense Approval
from decimal import Decimal
from datetime import datetime
@dataclass
class ExpenseRequest:
request_id: str
employee: str
amount: Decimal
description: str
category: str
@dataclass
class ApprovalDecision:
approved: bool
approver: str
timestamp: datetime
comment: str
@workflow.defn
class ExpenseApprovalWorkflow:
@workflow.init
def __init__(self) -> None:
self.decision: Optional[ApprovalDecision] = None
@workflow.signal
def approve(self, approver: str, comment: str = "") -> None:
"""Manager genehmigt Expense"""
self.decision = ApprovalDecision(
approved=True,
approver=approver,
            timestamp=workflow.now(),  # deterministisch statt datetime.now()
comment=comment
)
@workflow.signal
def reject(self, approver: str, reason: str) -> None:
"""Manager lehnt Expense ab"""
self.decision = ApprovalDecision(
approved=False,
approver=approver,
            timestamp=workflow.now(),
comment=reason
)
@workflow.run
async def run(self, request: ExpenseRequest) -> str:
# Benachrichtigung an Manager senden
await workflow.execute_activity(
send_approval_notification,
request,
start_to_close_timeout=timedelta(minutes=5),
)
# Warte bis zu 7 Tage auf Entscheidung
try:
await workflow.wait_condition(
lambda: self.decision is not None,
timeout=timedelta(days=7)
)
except asyncio.TimeoutError:
# Auto-Reject nach Timeout
self.decision = ApprovalDecision(
approved=False,
approver="system",
                timestamp=workflow.now(),
comment="Approval timeout - automatically rejected"
)
# Entscheidung verarbeiten
if self.decision.approved:
await workflow.execute_activity(
process_approved_expense,
request,
start_to_close_timeout=timedelta(minutes=10),
)
return f"Expense ${request.amount} approved by {self.decision.approver}"
else:
await workflow.execute_activity(
notify_rejection,
request,
self.decision.comment,
start_to_close_timeout=timedelta(minutes=5),
)
return f"Expense rejected: {self.decision.comment}"
Use Case 2: Event-getriebener Multi-Stage Prozess
@workflow.defn
class DataPipelineWorkflow:
"""Event-getriebene Daten-Pipeline mit Signalen"""
@workflow.init
def __init__(self) -> None:
self.data_uploaded = False
self.validation_complete = False
self.transform_complete = False
self.uploaded_data_location: Optional[str] = None
@workflow.signal
def notify_upload_complete(self, data_location: str) -> None:
"""Signal: Daten-Upload abgeschlossen"""
self.data_uploaded = True
self.uploaded_data_location = data_location
workflow.logger.info(f"Data uploaded to {data_location}")
@workflow.signal
def notify_validation_complete(self) -> None:
"""Signal: Validierung abgeschlossen"""
self.validation_complete = True
workflow.logger.info("Validation complete")
@workflow.signal
def notify_transform_complete(self) -> None:
"""Signal: Transformation abgeschlossen"""
self.transform_complete = True
workflow.logger.info("Transform complete")
@workflow.run
async def run(self, pipeline_id: str) -> str:
workflow.logger.info(f"Starting pipeline {pipeline_id}")
# Stage 1: Warte auf Upload
await workflow.wait_condition(lambda: self.data_uploaded)
# Stage 2: Starte Validierung
await workflow.execute_activity(
start_validation,
self.uploaded_data_location,
start_to_close_timeout=timedelta(minutes=5),
)
# Warte auf Validierungs-Signal (externe Validierung)
await workflow.wait_condition(lambda: self.validation_complete)
# Stage 3: Starte Transformation
await workflow.execute_activity(
start_transformation,
self.uploaded_data_location,
start_to_close_timeout=timedelta(minutes=5),
)
# Warte auf Transform-Signal
await workflow.wait_condition(lambda: self.transform_complete)
# Stage 4: Finalisierung
await workflow.execute_activity(
finalize_pipeline,
pipeline_id,
start_to_close_timeout=timedelta(minutes=10),
)
return f"Pipeline {pipeline_id} completed successfully"
Pipeline Zustandsdiagramm:
stateDiagram-v2
[*] --> WaitingForUpload: Workflow gestartet
WaitingForUpload --> ValidatingData: notify_upload_complete
ValidatingData --> WaitingForValidation: Activity started
WaitingForValidation --> TransformingData: notify_validation_complete
TransformingData --> WaitingForTransform: Activity started
WaitingForTransform --> Finalizing: notify_transform_complete
Finalizing --> [*]: Pipeline complete
6.1.8 Signal Best Practices
1. Dataclasses für Signal-Parameter:
# ✓ Gut: Typsicher und erweiterbar
@dataclass
class SignalInput:
field1: str
field2: int
field3: Optional[str] = None # Einfach neue Felder hinzufügen
@workflow.signal
def my_signal(self, input: SignalInput) -> None:
...
# ✗ Schlecht: Untypisiert und schwer zu warten
@workflow.signal
def my_signal(self, data: dict) -> None:
...
2. Idempotenz implementieren:
@workflow.signal
def process_payment(self, payment_id: str, amount: Decimal) -> None:
"""Idempotenter Signal Handler"""
# Prüfen ob bereits verarbeitet
if payment_id in self.processed_payments:
        workflow.logger.warning(f"Duplicate payment signal: {payment_id}")
return
# Verarbeiten und markieren
self.processed_payments.add(payment_id)
self.total_amount += amount
3. Signal-Limits beachten:
# ⚠️ Problem: Zu viele Signale
for i in range(10000):
await handle.signal(MyWorkflow.process_item, f"item-{i}")
# Kann Event History Limit (51.200 Events) überschreiten!
# ✓ Lösung: Batch-Signale
items = [f"item-{i}" for i in range(10000)]
await handle.signal(MyWorkflow.process_batch, items)
4. @workflow.init für Initialisierung:
# ✓ Korrekt: @workflow.init garantiert Ausführung vor Handlern
@workflow.init
def __init__(self) -> None:
self.counter = 0
self.items = []
# ✗ Falsch: run() könnte NACH Signal Handler ausgeführt werden
@workflow.run
async def run(self) -> None:
self.counter = 0 # Zu spät!
6.2 Queries - Synchrone Zustandsabfragen
6.2.1 Definition und Zweck
Queries sind synchrone, read-only Anfragen, die den Zustand eines Workflows inspizieren ohne ihn zu verändern. Queries erzeugen keine Events in der Event History und können sogar auf abgeschlossene Workflows (innerhalb der Retention Period) ausgeführt werden.
Hauptmerkmale von Queries:
- Synchron: Aufrufer wartet auf Antwort
- Read-only: Können Workflow-Zustand NICHT ändern
- Non-blocking: Können keine Activities oder Timers ausführen
- History-free: Erzeugen KEINE Event History Einträge
- Auf abgeschlossenen Workflows: Query funktioniert nach Workflow-Ende
Query Sequenzdiagramm:
sequenceDiagram
participant Client as External Client
participant Frontend as Temporal Frontend
participant Worker as Worker
participant Workflow as Workflow Code
Client->>Frontend: query("get_status")
Frontend->>Worker: Query Task
Worker->>Workflow: Replay History (deterministic state)
Workflow->>Workflow: Execute query handler (read-only)
Workflow-->>Worker: Return query result
Worker-->>Frontend: Query result
Frontend-->>Client: Return result (synchron)
Note over Frontend: KEINE Event History Einträge!
Wann Queries verwenden:
- ✅ Zustand abfragen ohne zu ändern
- ✅ Progress Tracking (Fortschritt anzeigen)
- ✅ Debugging (aktueller Zustand inspizieren)
- ✅ Dashboards mit Echtzeit-Status
- ✅ Nach Workflow-Ende Status abrufen
- ❌ Zustand ändern → Signal oder Update verwenden
- ❌ Kontinuierliches Polling → Update mit wait_condition besser
6.2.2 Query Handler definieren
Query Handler werden mit @workflow.query dekoriert und MÜSSEN synchron sein (def, NICHT async def):
from enum import Enum
from typing import List
class OrderStatus(Enum):
PENDING = "pending"
PROCESSING = "processing"
SHIPPED = "shipped"
DELIVERED = "delivered"
@dataclass
class OrderProgress:
"""Query-Rückgabewert mit vollständiger Info"""
status: OrderStatus
items_processed: int
total_items: int
percent_complete: float
current_step: str
@workflow.defn
class OrderProcessingWorkflow:
@workflow.init
def __init__(self) -> None:
self.status = OrderStatus.PENDING
self.items_processed = 0
self.total_items = 0
self.current_step = "Initialization"
@workflow.query
def get_status(self) -> str:
"""Einfacher Status-Query"""
return self.status.value
@workflow.query
def get_progress(self) -> OrderProgress:
"""Detaillierter Progress-Query mit Dataclass"""
return OrderProgress(
status=self.status,
items_processed=self.items_processed,
total_items=self.total_items,
percent_complete=(
(self.items_processed / self.total_items * 100)
if self.total_items > 0 else 0
),
current_step=self.current_step
)
@workflow.query
def get_items_remaining(self) -> int:
"""Berechneter Query-Wert"""
return self.total_items - self.items_processed
@workflow.run
async def run(self, order_id: str, items: List[str]) -> str:
self.total_items = len(items)
self.status = OrderStatus.PROCESSING
for i, item in enumerate(items):
self.current_step = f"Processing item {item}"
await workflow.execute_activity(
process_item,
item,
start_to_close_timeout=timedelta(minutes=5),
)
self.items_processed = i + 1
self.status = OrderStatus.SHIPPED
self.current_step = "Shipped to customer"
return f"Order {order_id} completed"
6.2.3 Queries von Clients ausführen
async def monitor_order_progress(order_id: str):
"""Query-basiertes Progress Monitoring"""
client = await Client.connect("localhost:7233")
# Handle für Workflow abrufen
handle = client.get_workflow_handle_for(
OrderProcessingWorkflow,
workflow_id=order_id
)
# Einfacher Status-Query
status = await handle.query(OrderProcessingWorkflow.get_status)
print(f"Order status: {status}")
# Detaillierter Progress-Query
progress = await handle.query(OrderProcessingWorkflow.get_progress)
print(f"Progress: {progress.percent_complete:.1f}%")
print(f"Current step: {progress.current_step}")
print(f"Items: {progress.items_processed}/{progress.total_items}")
# Berechneter Query
remaining = await handle.query(OrderProcessingWorkflow.get_items_remaining)
print(f"Items remaining: {remaining}")
Query auf abgeschlossenen Workflow:
async def query_completed_workflow(workflow_id: str):
"""Query funktioniert auch nach Workflow-Ende"""
client = await Client.connect("localhost:7233")
handle = client.get_workflow_handle(workflow_id)
try:
# Funktioniert innerhalb der Retention Period!
final_status = await handle.query("get_status")
print(f"Final status: {final_status}")
except Exception as e:
print(f"Query failed: {e}")
6.2.4 Query-Einschränkungen
1. Queries MÜSSEN synchron sein:
# ✓ Korrekt: Synchrone Query
@workflow.query
def get_data(self) -> dict:
return {"status": self.status}
# ✗ FALSCH: Async nicht erlaubt!
@workflow.query
async def get_data(self) -> dict: # TypeError!
return {"status": self.status}
2. Queries können KEINE Activities ausführen:
# ✗ FALSCH: Keine Activities in Queries!
@workflow.query
def get_external_data(self) -> dict:
    # Schon ein SyntaxError: await ist in einer synchronen Funktion nicht möglich -
    # und Activities sind in Queries grundsätzlich verboten
    result = await workflow.execute_activity(...)  # ERROR!
return result
3. Queries dürfen Zustand NICHT ändern:
# ✗ FALSCH: State-Mutation in Query
@workflow.query
def increment_counter(self) -> int:
self.counter += 1 # Verletzt Read-Only Constraint!
return self.counter
# ✓ Korrekt: Read-Only Query
@workflow.query
def get_counter(self) -> int:
return self.counter
# ✓ Für Mutation: Update verwenden
@workflow.update
def increment_counter(self) -> int:
self.counter += 1
return self.counter
Vergleich: Query vs Update für State Access
flowchart LR
A[Workflow State Access] --> B{Lesen oder Schreiben?}
B -->|Nur Lesen| C[Query verwenden]
B -->|Schreiben| D[Update verwenden]
C --> E[Vorteile:<br/>- Keine Event History<br/>- Funktioniert nach Workflow-Ende<br/>- Niedrige Latenz]
D --> F[Vorteile:<br/>- State-Mutation erlaubt<br/>- Validierung möglich<br/>- Fehler-Feedback]
style C fill:#90EE90
style D fill:#87CEEB
6.2.5 Stack Trace Query für Debugging
Temporal bietet einen eingebauten __stack_trace Query für Debugging:
# CLI: Stack Trace abrufen
temporal workflow stack --workflow-id stuck-workflow-123
# Programmatisch: Stack Trace abrufen
async def debug_workflow(workflow_id: str):
client = await Client.connect("localhost:7233")
handle = client.get_workflow_handle(workflow_id)
# Eingebauter Stack Trace Query
stack_trace = await handle.query("__stack_trace")
print(f"Workflow stack trace:\n{stack_trace}")
Wann Stack Trace verwenden:
- ✅ Workflow scheint “hängen zu bleiben”
- ✅ Debugging von wait_condition Problemen
- ✅ Verstehen wo Workflow aktuell wartet
- ✅ Identifizieren von Deadlocks
6.2.6 Praktische Query Use Cases
Use Case 1: Real-Time Dashboard
@dataclass
class DashboardData:
"""Aggregierte Daten für Dashboard"""
total_processed: int
total_failed: int
current_batch: int
average_processing_time: float
estimated_completion: datetime
@workflow.defn
class BatchProcessingWorkflow:
@workflow.init
    def __init__(self) -> None:
        self.processed = 0
        self.failed = 0
        self.current_batch = 0
        self.total_batches = 0  # wird in run() gesetzt, hier für frühe Queries initialisiert
        self.processing_times: List[float] = []
        self.start_time = workflow.now()  # deterministische Workflow-Zeit
@workflow.query
def get_dashboard_data(self) -> DashboardData:
"""Real-time Dashboard Query"""
avg_time = (
sum(self.processing_times) / len(self.processing_times)
if self.processing_times else 0
)
remaining = self.total_batches - self.current_batch
eta_seconds = remaining * avg_time
        eta = workflow.now() + timedelta(seconds=eta_seconds)  # deterministische Workflow-Zeit
return DashboardData(
total_processed=self.processed,
total_failed=self.failed,
current_batch=self.current_batch,
average_processing_time=avg_time,
estimated_completion=eta
)
@workflow.run
async def run(self, total_batches: int) -> str:
self.total_batches = total_batches
for batch_num in range(total_batches):
self.current_batch = batch_num
            batch_start = workflow.time()  # deterministische Zeit in Sekunden
try:
await workflow.execute_activity(
process_batch,
batch_num,
start_to_close_timeout=timedelta(minutes=10),
)
self.processed += 1
except Exception as e:
self.failed += 1
workflow.logger.error(f"Batch {batch_num} failed: {e}")
# Tracking für ETA
            batch_time = workflow.time() - batch_start
self.processing_times.append(batch_time)
return f"Processed {self.processed} batches, {self.failed} failed"
Dashboard Client:
async def display_dashboard(workflow_id: str):
"""Live Dashboard mit Query Polling"""
client = await Client.connect("localhost:7233")
handle = client.get_workflow_handle_for(
BatchProcessingWorkflow,
workflow_id
)
while True:
try:
# Dashboard-Daten abfragen
data = await handle.query(
BatchProcessingWorkflow.get_dashboard_data
)
# Anzeigen (vereinfacht)
print(f"\n{'='*50}")
print(f"Batch Progress Dashboard")
print(f"{'='*50}")
print(f"Processed: {data.total_processed}")
print(f"Failed: {data.total_failed}")
print(f"Current Batch: {data.current_batch}")
print(f"Avg Time: {data.average_processing_time:.2f}s")
print(f"ETA: {data.estimated_completion}")
            # Prüfen, ob abgeschlossen - WorkflowHandle.result() hat keinen
            # timeout-Parameter, daher kurzer Check via asyncio.wait_for
            result = await asyncio.wait_for(handle.result(), timeout=0.1)
print(f"\nWorkflow completed: {result}")
break
except asyncio.TimeoutError:
# Workflow läuft noch
await asyncio.sleep(2) # Update alle 2 Sekunden
Use Case 2: State Inspection für Debugging
@workflow.defn
class ComplexWorkflow:
"""Workflow mit umfangreichem State für Debugging"""
@workflow.query
def get_full_state(self) -> dict:
"""Vollständiger State Dump für Debugging"""
return {
"status": self.status.value,
"current_stage": self.current_stage,
"pending_operations": len(self.pending_ops),
"completed_tasks": self.completed_tasks,
"errors": [str(e) for e in self.errors],
"metadata": self.metadata,
"last_updated": self.last_updated.isoformat(),
}
@workflow.query
def get_pending_operations(self) -> List[dict]:
"""Detaillierte Pending Operations"""
return [
{
"id": op.id,
"type": op.type,
"created_at": op.created_at.isoformat(),
"retry_count": op.retry_count,
}
for op in self.pending_ops
]
6.2.7 Query Best Practices
1. Pre-compute komplexe Werte:
# ✗ Schlecht: Schwere Berechnung in Query
@workflow.query
def get_statistics(self) -> dict:
# Vermeiden: O(n) Berechnung bei jedem Query!
total = sum(item.price for item in self.items)
avg = total / len(self.items)
return {"total": total, "average": avg}
# ✓ Gut: Inkrementell updaten, Query nur lesen
@workflow.signal
def add_item(self, item: Item) -> None:
self.items.append(item)
# Statistiken inkrementell updaten
self.total += item.price
self.count += 1
self.average = self.total / self.count
@workflow.query
def get_statistics(self) -> dict:
# Instant return - keine Berechnung
return {"total": self.total, "average": self.average}
2. Dataclass für Query-Responses:
# ✓ Gut: Typsicher und selbst-dokumentierend
@dataclass
class WorkflowStatus:
state: str
progress_percent: float
items_processed: int
errors: List[str]
@workflow.query
def get_status(self) -> WorkflowStatus:
return WorkflowStatus(...)
3. Nicht kontinuierlich pollen:
# ✗ Ineffizient: Tight polling loop
async def wait_for_completion_bad(handle):
while True:
status = await handle.query(MyWorkflow.get_status)
if status == "completed":
break
await asyncio.sleep(0.5) # Sehr ineffizient!
# ✓ Besser: Update mit wait_condition (siehe Updates Sektion)
# Oder: Workflow result() awaiten
async def wait_for_completion_good(handle):
result = await handle.result() # Wartet automatisch
return result
6.3 Updates - Synchrone Zustandsänderungen
6.3.1 Definition und Zweck
Updates sind synchrone, getrackte Write-Operationen, die Workflow-Zustand ändern UND ein Ergebnis an den Aufrufer zurückgeben. Sie kombinieren die State-Mutation von Signalen mit der synchronen Response von Queries, plus optionale Validierung.
Hauptmerkmale von Updates:
- Synchron: Aufrufer erhält Response oder Error
- Validiert: Optionale Validators lehnen ungültige Updates ab
- Tracked: Erzeugt Event History Einträge
- Blocking: Kann Activities, Child Workflows, wait conditions ausführen
- Reliable: Sender weiß ob Update erfolgreich war oder fehlschlug
Update Sequenzdiagramm:
sequenceDiagram
participant Client as External Client
participant Frontend as Temporal Frontend
participant History as History Service
participant Worker as Worker
participant Workflow as Workflow Code
Client->>Frontend: execute_update("set_language", GERMAN)
Frontend->>History: Store WorkflowExecutionUpdateAccepted
History-->>Frontend: Event stored
Frontend->>Worker: Workflow Task with update
Worker->>Workflow: Replay history
Workflow->>Workflow: Execute validator (optional)
alt Validator fails
Workflow-->>Worker: Validation error
Worker-->>Frontend: Update rejected
Frontend-->>Client: WorkflowUpdateFailedError
else Validator passes
Workflow->>Workflow: Execute update handler
Workflow->>Workflow: Can execute activities
Workflow-->>Worker: Update result
Worker->>Frontend: Update completed
Frontend->>History: Store WorkflowExecutionUpdateCompleted
Frontend-->>Client: Return result (synchron)
end
Updates vs Signals Entscheidungsbaum:
flowchart TD
A[Workflow-Zustand ändern] --> B{Brauche ich Response?}
B -->|Nein| C[Signal verwenden]
B -->|Ja| D{Brauche ich Validierung?}
D -->|Nein| E[Update ohne Validator]
D -->|Ja| F[Update mit Validator]
C --> G[Vorteile:<br/>- Niedrige Latenz<br/>- Fire-and-forget<br/>- Einfach]
E --> H[Vorteile:<br/>- Synchrone Response<br/>- Fehler-Feedback<br/>- Activity-Ausführung]
F --> I[Vorteile:<br/>- Frühe Ablehnung<br/>- Input-Validierung<br/>- Keine ungültigen Events]
style C fill:#90EE90
style E fill:#87CEEB
style F fill:#FFD700
6.3.2 Update Handler mit Validierung
Einfacher Update Handler:
from enum import Enum
class Language(Enum):
ENGLISH = "en"
GERMAN = "de"
SPANISH = "es"
FRENCH = "fr"
@workflow.defn
class GreetingWorkflow:
@workflow.init
def __init__(self) -> None:
self.language = Language.ENGLISH
self.greetings = {
Language.ENGLISH: "Hello",
Language.GERMAN: "Hallo",
Language.SPANISH: "Hola",
Language.FRENCH: "Bonjour",
}
@workflow.update
def set_language(self, language: Language) -> Language:
"""Update Handler - gibt vorherige Sprache zurück"""
previous_language = self.language
self.language = language
workflow.logger.info(f"Language changed from {previous_language} to {language}")
return previous_language
@set_language.validator
def validate_language(self, language: Language) -> None:
"""Validator - lehnt nicht unterstützte Sprachen ab"""
if language not in self.greetings:
raise ValueError(f"Language {language.name} not supported")
@workflow.run
async def run(self) -> str:
# Warte auf Language-Updates...
        await asyncio.sleep(timedelta(hours=24).total_seconds())  # asyncio.sleep erwartet Sekunden
return self.greetings[self.language]
Update Handler mit Activity-Ausführung:
from asyncio import Lock
from decimal import Decimal
@dataclass
class PaymentInfo:
payment_method: str
amount: Decimal
card_token: str
@dataclass
class PaymentResult:
success: bool
transaction_id: str
amount: Decimal
@workflow.defn
class OrderWorkflow:
@workflow.init
def __init__(self) -> None:
self.lock = Lock() # Concurrency-Schutz
self.order_status = "pending"
self.total_amount = Decimal("0.00")
@workflow.update
async def process_payment(self, payment: PaymentInfo) -> PaymentResult:
"""Async Update - kann Activities ausführen"""
async with self.lock: # Verhindert concurrent execution
# Activity ausführen für Zahlung
result = await workflow.execute_activity(
charge_payment,
payment,
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=1),
maximum_attempts=3,
)
)
# Workflow-State updaten
if result.success:
self.order_status = "paid"
workflow.logger.info(f"Payment successful: {result.transaction_id}")
else:
workflow.logger.error("Payment failed")
return result
@process_payment.validator
def validate_payment(self, payment: PaymentInfo) -> None:
"""Validator - prüft vor Activity-Ausführung"""
# Status-Check
if self.order_status != "pending":
raise ValueError(
f"Cannot process payment in status: {self.order_status}"
)
# Amount-Check
if payment.amount <= 0:
raise ValueError("Payment amount must be positive")
if payment.amount != self.total_amount:
raise ValueError(
f"Payment amount {payment.amount} does not match "
f"order total {self.total_amount}"
)
# Card token Check
if not payment.card_token or len(payment.card_token) < 10:
raise ValueError("Invalid card token")
6.3.3 Validator-Charakteristiken
Validators müssen folgende Regeln einhalten:
- Synchron - Nur def, NICHT async def
- Read-Only - Dürfen State NICHT mutieren
- Non-Blocking - Keine Activities, Timers, wait conditions
- Selbe Parameter - Wie der Update Handler
- Return None - Raise Exception zum Ablehnen
# ✓ Korrekt: Validator Implementierung
@workflow.update
def add_item(self, item: CartItem) -> int:
"""Item hinzufügen, return neue Anzahl"""
self.items.append(item)
return len(self.items)
@add_item.validator
def validate_add_item(self, item: CartItem) -> None:
"""Validator - synchron, read-only"""
# Item-Validierung
if not item.sku or len(item.sku) == 0:
raise ValueError("Item SKU is required")
# Cart-Limits
if len(self.items) >= 100:
raise ValueError("Cart is full (max 100 items)")
# Duplikat-Check
if any(i.sku == item.sku for i in self.items):
raise ValueError(f"Item {item.sku} already in cart")
# Validation passed - kein expliziter Return
# ✗ FALSCH: Async Validator
@add_item.validator
async def validate_add_item(self, item: CartItem) -> None: # ERROR!
# Async nicht erlaubt
result = await some_async_check(item)
# ✗ FALSCH: State Mutation
@add_item.validator
def validate_add_item(self, item: CartItem) -> None:
self.validation_count += 1 # ERROR! Read-only
# ✗ FALSCH: Activities ausführen
@add_item.validator
def validate_add_item(self, item: CartItem) -> None:
# ERROR! Keine Activities in Validator
result = await workflow.execute_activity(...)
Validator Execution Flow:
flowchart TD
A[Update Request] --> B[Execute Validator]
B --> C{Validator Result}
C -->|Exception| D[Reject Update]
C -->|No Exception| E[Accept Update]
D --> F[Return Error to Client]
D --> G[NO Event History Entry]
E --> H[Store UpdateAccepted Event]
E --> I[Execute Update Handler]
I --> J[Store UpdateCompleted Event]
J --> K[Return Result to Client]
style D fill:#FFB6C1
style E fill:#90EE90
style G fill:#FFB6C1
6.3.4 Updates von Clients senden
Execute Update (Wait for Completion):
from temporalio.client import Client, WorkflowUpdateFailedError
async def update_workflow_language():
"""Update ausführen und auf Completion warten"""
client = await Client.connect("localhost:7233")
workflow_handle = client.get_workflow_handle_for(
GreetingWorkflow,
workflow_id="greeting-123"
)
try:
# Update ausführen - wartet auf Validator + Handler
previous_lang = await workflow_handle.execute_update(
GreetingWorkflow.set_language,
Language.GERMAN
)
print(f"Language changed from {previous_lang} to German")
except WorkflowUpdateFailedError as e:
# Validator rejected ODER Handler exception
print(f"Update failed: {e}")
# Event History: KEINE Einträge wenn Validator rejected
Start Update (Non-blocking):
from temporalio.client import WorkflowUpdateStage
async def start_update_non_blocking():
"""Update starten, später auf Result warten"""
client = await Client.connect("localhost:7233")
workflow_handle = client.get_workflow_handle_for(
OrderWorkflow,
workflow_id="order-456"
)
payment = PaymentInfo(
payment_method="credit_card",
amount=Decimal("99.99"),
card_token="tok_abc123xyz"
)
# Update starten - wartet nur bis ACCEPTED
update_handle = await workflow_handle.start_update(
OrderWorkflow.process_payment,
payment,
wait_for_stage=WorkflowUpdateStage.ACCEPTED,
)
print("Update accepted by workflow (validator passed)")
# Andere Arbeit erledigen...
await do_other_work()
# Später: Auf Completion warten
try:
result = await update_handle.result()
print(f"Payment processed: {result.transaction_id}")
except Exception as e:
print(f"Payment failed: {e}")
WorkflowUpdateStage Optionen:
# ADMITTED: Warten bis Server Update empfangen hat (selten verwendet)
handle = await workflow_handle.start_update(
MyWorkflow.my_update,
data,
wait_for_stage=WorkflowUpdateStage.ADMITTED,
)
# ACCEPTED: Warten bis Validator passed (empfohlen für non-blocking)
handle = await workflow_handle.start_update(
MyWorkflow.my_update,
data,
wait_for_stage=WorkflowUpdateStage.ACCEPTED, # Default für start_update
)
# COMPLETED: Warten bis Handler fertig (default für execute_update)
result = await workflow_handle.execute_update(
MyWorkflow.my_update,
data,
# Implizit: wait_for_stage=WorkflowUpdateStage.COMPLETED
)
6.3.5 Update-with-Start Pattern
Ähnlich wie Signal-with-Start - Update an existierenden Workflow senden ODER neuen starten:
from temporalio.client import WithStartWorkflowOperation
from temporalio.common import WorkflowIDConflictPolicy
async def update_or_start_shopping_cart(user_id: str, item: CartItem):
"""Update-with-Start für Shopping Cart"""
client = await Client.connect("localhost:7233")
# Workflow Start Operation definieren
start_op = WithStartWorkflowOperation(
ShoppingCartWorkflow.run,
id=f"cart-{user_id}",
id_conflict_policy=WorkflowIDConflictPolicy.USE_EXISTING,
task_queue="shopping",
)
try:
# Update-with-Start ausführen
cart_total = await client.execute_update_with_start_workflow(
ShoppingCartWorkflow.add_item,
item,
start_workflow_operation=start_op,
)
print(f"Item added. Cart total: ${cart_total}")
except WorkflowUpdateFailedError as e:
print(f"Failed to add item: {e}")
# Workflow Handle abrufen
workflow_handle = await start_op.workflow_handle()
return workflow_handle
6.3.6 Updates vs Signals - Detaillierter Vergleich
| Feature | Signal | Update |
|---|---|---|
| Response | Keine (fire-and-forget) | Gibt Result oder Error zurück |
| Synchron | Nein | Ja |
| Validierung | Nein | Optional mit Validator |
| Event History | Immer (WorkflowExecutionSignaled) | Nur wenn validiert (Accepted + Completed) |
| Latenz | Niedrig (async return) | Höher (wartet auf Response) |
| Fehler-Feedback | Nein | Ja (Exception an Client) |
| Activities ausführen | Ja (async handler) | Ja (async handler) |
| Use Case | Async Notifications | Synchrone State Changes |
| Read-After-Write | Polling mit Query nötig | Built-in Response |
| Worker Required | Kann ohne Worker senden | Worker muss acknowledgment geben |
| Best für | Event-driven, fire-and-forget | Request-Response, Validierung |
Entscheidungsmatrix:
flowchart TD
A[Workflow State ändern] --> B{Brauche ich sofortige<br/>Bestätigung?}
B -->|Nein| C{Ist niedrige Latenz<br/>wichtig?}
B -->|Ja| D{Brauche ich<br/>Validierung?}
C -->|Ja| E[Signal]
C -->|Nein| F[Signal oder Update]
D -->|Ja| G[Update mit Validator]
D -->|Nein| H[Update ohne Validator]
E --> I[Fire-and-forget<br/>Async notification]
F --> I
G --> J[Synchrone Operation<br/>mit Input-Validierung]
H --> K[Synchrone Operation<br/>ohne Validierung]
style E fill:#90EE90
style G fill:#FFD700
style H fill:#87CEEB
Code-Beispiel: Signal vs Update
# Scenario: Item zum Warenkorb hinzufügen
# Option 1: Mit Signal (fire-and-forget)
@workflow.signal
def add_item_signal(self, item: CartItem) -> None:
"""Signal - keine Response"""
self.items.append(item)
self.total += item.price
# Client (Signal)
await handle.signal(CartWorkflow.add_item_signal, item)
# Kehrt sofort zurück - weiß nicht ob erfolgreich!
# Wenn Status prüfen will: Query nötig
total = await handle.query(CartWorkflow.get_total) # Extra roundtrip
# ============================================
# Option 2: Mit Update (synchrone Response)
@workflow.update
def add_item_update(self, item: CartItem) -> dict:
"""Update - gibt neuen State zurück"""
self.items.append(item)
self.total += item.price
return {"items": len(self.items), "total": self.total}
@add_item_update.validator
def validate_add_item(self, item: CartItem) -> None:
"""Frühe Ablehnung ungültiger Items"""
if len(self.items) >= 100:
raise ValueError("Cart full")
if item.price < 0:
raise ValueError("Invalid price")
# Client (Update)
try:
result = await handle.execute_update(CartWorkflow.add_item_update, item)
print(f"Added! Total: ${result['total']}")
except WorkflowUpdateFailedError as e:
print(f"Failed: {e}") # Sofortiges Feedback!
6.3.7 Error Handling bei Updates
Validator Rejection:
# Client Code
try:
result = await workflow_handle.execute_update(
CartWorkflow.add_item,
invalid_item # z.B. leere SKU
)
except WorkflowUpdateFailedError as e:
# Validator hat rejected
print(f"Validation failed: {e}")
# Event History: KEINE Einträge (frühe Ablehnung)
Handler Exception:
@workflow.update
async def process_order(self, order: Order) -> Receipt:
"""Handler mit Activity - Exception propagiert zu Client"""
# Activity failure propagiert
receipt = await workflow.execute_activity(
charge_customer,
order,
start_to_close_timeout=timedelta(seconds=30),
)
return receipt
# Client
try:
receipt = await workflow_handle.execute_update(
OrderWorkflow.process_order,
order
)
print(f"Order processed: {receipt.id}")
except WorkflowUpdateFailedError as e:
# Handler raised exception ODER Activity failed
print(f"Order processing failed: {e}")
# Event History: UpdateAccepted + UpdateFailed Events
Idempotenz mit Update Info:
from temporalio import workflow
@workflow.update
async def process_payment(self, payment_id: str, amount: Decimal) -> str:
"""Idempotenter Update Handler"""
# Update ID für Deduplizierung abrufen
update_info = workflow.current_update_info()
if update_info and update_info.id in self.processed_updates:
# Bereits verarbeitet (wichtig bei Continue-As-New)
workflow.logger.info(f"Duplicate update: {update_info.id}")
return self.update_results[update_info.id]
# Payment verarbeiten
result = await workflow.execute_activity(
charge_payment,
args=[payment_id, amount],
start_to_close_timeout=timedelta(seconds=30),
)
# Für Deduplizierung speichern
if update_info:
self.processed_updates.add(update_info.id)
self.update_results[update_info.id] = result
return result
6.3.8 Unfinished Handler Policy
Kontrolle über Verhalten wenn Workflow endet während Update läuft:
@workflow.update(
unfinished_policy=workflow.HandlerUnfinishedPolicy.ABANDON
)
async def optional_update(self, data: str) -> None:
"""Update der abgebrochen werden kann"""
# Lange laufende Operation
await workflow.execute_activity(
process_data,
data,
start_to_close_timeout=timedelta(hours=1),
)
# Best Practice: Auf Handler-Completion warten vor Workflow-Ende
@workflow.run
async def run(self) -> str:
# Haupt-Workflow Logik
...
# Alle Handler fertigstellen lassen
await workflow.wait_condition(
lambda: workflow.all_handlers_finished()
)
return "Completed"
6.4 Patterns und Best Practices
6.4.1 Human-in-the-Loop Approval Pattern
Ein häufiges Pattern: Workflow wartet auf menschliche Genehmigung mit Timeout.
Multi-Level Approval Workflow:
from typing import Dict, Optional
@dataclass
class ApprovalRequest:
request_id: str
requester: str
amount: Decimal
description: str
@dataclass
class ApprovalDecision:
approved: bool
approver: str
timestamp: datetime
comment: str
@workflow.defn
class MultiLevelApprovalWorkflow:
"""Mehrstufige Genehmigung basierend auf Betrag"""
@workflow.init
def __init__(self, request: ApprovalRequest) -> None:
self.approvals: Dict[str, ApprovalDecision] = {}
self.required_approvers: List[str] = []
def _get_required_approvers(self, amount: Decimal) -> List[str]:
"""Bestimme erforderliche Genehmiger basierend auf Betrag"""
if amount < Decimal("1000"):
return ["manager"]
elif amount < Decimal("10000"):
return ["manager", "director"]
else:
return ["manager", "director", "vp"]
@workflow.signal
def approve_manager(self, decision: ApprovalDecision) -> None:
"""Manager Genehmigung"""
self.approvals["manager"] = decision
workflow.logger.info(f"Manager approval: {decision.approved}")
@workflow.signal
def approve_director(self, decision: ApprovalDecision) -> None:
"""Director Genehmigung"""
self.approvals["director"] = decision
workflow.logger.info(f"Director approval: {decision.approved}")
@workflow.signal
def approve_vp(self, decision: ApprovalDecision) -> None:
"""VP Genehmigung"""
self.approvals["vp"] = decision
workflow.logger.info(f"VP approval: {decision.approved}")
@workflow.query
def get_approval_status(self) -> Dict[str, str]:
"""Aktueller Genehmigungs-Status"""
status = {}
for role in self.required_approvers:
if role in self.approvals:
decision = self.approvals[role]
status[role] = "approved" if decision.approved else "rejected"
else:
status[role] = "pending"
return status
@workflow.query
def is_fully_approved(self) -> bool:
"""Alle erforderlichen Genehmigungen vorhanden?"""
if len(self.approvals) < len(self.required_approvers):
return False
return all(
role in self.approvals and self.approvals[role].approved
for role in self.required_approvers
)
@workflow.run
async def run(self, request: ApprovalRequest) -> str:
# Erforderliche Genehmiger bestimmen
self.required_approvers = self._get_required_approvers(request.amount)
workflow.logger.info(
f"Request {request.request_id} requires approval from: "
f"{', '.join(self.required_approvers)}"
)
# Genehmigungs-Requests senden
await workflow.execute_activity(
send_approval_requests,
args=[request, self.required_approvers],
start_to_close_timeout=timedelta(minutes=5),
)
# Auf alle Genehmigungen warten (max 14 Tage)
try:
await workflow.wait_condition(
lambda: len(self.approvals) >= len(self.required_approvers),
timeout=timedelta(days=14)
)
except asyncio.TimeoutError:
return (
f"Approval timeout - only {len(self.approvals)} of "
f"{len(self.required_approvers)} approvals received"
)
# Prüfen ob alle approved
if not self.is_fully_approved():
rejected_by = [
role for role, decision in self.approvals.items()
if not decision.approved
]
# Ablehnung verarbeiten
await workflow.execute_activity(
notify_rejection,
args=[request, rejected_by],
start_to_close_timeout=timedelta(minutes=5),
)
return f"Request rejected by: {', '.join(rejected_by)}"
# Alle approved - Request verarbeiten
await workflow.execute_activity(
process_approved_request,
request,
start_to_close_timeout=timedelta(minutes=10),
)
return (
f"Request approved by all {len(self.required_approvers)} approvers "
f"and processed successfully"
)
Approval Workflow Zustandsdiagramm:
stateDiagram-v2
[*] --> SendingRequests: Workflow gestartet
SendingRequests --> WaitingForApprovals: Notifications sent
WaitingForApprovals --> CheckingStatus: Signal received
CheckingStatus --> WaitingForApprovals: Not all approvals yet
CheckingStatus --> Timeout: 14 days timeout
CheckingStatus --> ValidatingDecisions: All approvals received
ValidatingDecisions --> Rejected: Any rejection
ValidatingDecisions --> Processing: All approved
Processing --> [*]: Success
Rejected --> [*]: Rejected
Timeout --> [*]: Timeout
note right of WaitingForApprovals
Wartet auf Signale:
- approve_manager
- approve_director
- approve_vp
end note
6.4.2 Progress Tracking mit Updates statt Query Polling
Ineffizient: Query Polling
# ✗ Ineffizient: Kontinuierliches Query Polling
async def wait_for_progress_old(handle, target_progress: int):
"""Veraltetes Pattern - vermeiden!"""
while True:
progress = await handle.query(MyWorkflow.get_progress)
if progress >= target_progress:
return progress
await asyncio.sleep(1) # Verschwendung!
Effizient: Update mit wait_condition
@workflow.defn
class DataMigrationWorkflow:
"""Progress Tracking mit Update statt Polling"""
@workflow.init
def __init__(self, total_records: int) -> None:
self.progress = 0
self.total_records = total_records
self.completed = False
@workflow.update
async def wait_for_progress(self, min_progress: int) -> int:
"""Warte bis Progress erreicht, dann return"""
# Workflow benachrichtigt Client wenn bereit
await workflow.wait_condition(
lambda: self.progress >= min_progress or self.completed
)
return self.progress
@workflow.query
def get_current_progress(self) -> int:
"""Sofortiger Progress-Check (wenn nötig)"""
return self.progress
@workflow.run
async def run(self, total_records: int) -> str:
self.total_records = total_records
for i in range(total_records):
await workflow.execute_activity(
migrate_record,
i,
start_to_close_timeout=timedelta(seconds=30),
)
self.progress = i + 1
# Log alle 10% (Schutz gegen Division durch 0 bei weniger als 10 Records)
if total_records >= 10 and (i + 1) % (total_records // 10) == 0:
workflow.logger.info(
f"Progress: {(i+1)/total_records*100:.0f}%"
)
self.completed = True
return f"Migrated {total_records} records"
# Client: Effiziente Progress-Überwachung
async def monitor_migration_progress(handle):
"""✓ Effizienter Ansatz mit Update"""
# Warte auf 50% Progress
progress = await handle.execute_update(
DataMigrationWorkflow.wait_for_progress,
50
)
print(f"50% checkpoint reached: {progress} records")
# Warte auf 100%
progress = await handle.execute_update(
DataMigrationWorkflow.wait_for_progress,
100
)
print(f"100% complete: {progress} records")
Vorteile Update-basiertes Progress Tracking:
- ✅ Ein Request statt hunderte Queries
- ✅ Workflow benachrichtigt Client aktiv
- ✅ Keine Polling-Overhead
- ✅ Präzise Benachrichtigung genau wenn Milestone erreicht
- ✅ Server-Push statt Client-Pull
6.4.3 Externe Workflow Handles
Workflows können mit anderen Workflows kommunizieren via externe Handles:
@workflow.defn
class OrchestratorWorkflow:
"""Koordiniert mehrere Worker Workflows"""
@workflow.run
async def run(self, worker_ids: List[str]) -> dict:
results = {}
for worker_id in worker_ids:
# Externes Workflow Handle abrufen
worker_handle = workflow.get_external_workflow_handle_for(
WorkerWorkflow.run,
workflow_id=f"worker-{worker_id}"
)
# Signal an externes Workflow senden
await worker_handle.signal(
WorkerWorkflow.process_task,
TaskData(task_id=f"task-{worker_id}", priority=1)
)
workflow.logger.info(f"Task sent to worker {worker_id}")
# Warte auf alle Worker
await asyncio.sleep(timedelta(minutes=10).total_seconds())  # asyncio.sleep erwartet Sekunden
# Optional: Externe Workflows canceln
# await worker_handle.cancel()
return {"workers_coordinated": len(worker_ids)}
@workflow.defn
class WorkerWorkflow:
"""Worker Workflow empfängt Signale"""
@workflow.init
def __init__(self) -> None:
self.tasks: List[str] = []  # Ergebnisse der verarbeiteten Tasks
@workflow.signal
async def process_task(self, task: TaskData) -> None:
"""Empfange Task vom Orchestrator"""
result = await workflow.execute_activity(
process_task_activity,
task,
start_to_close_timeout=timedelta(minutes=5),
)
self.tasks.append(result)
@workflow.run
async def run(self) -> List[str]:
# Warte auf Tasks oder Timeout
await workflow.wait_condition(
lambda: len(self.tasks) >= 5,
timeout=timedelta(hours=1)
)
return self.tasks
Event History bei externen Signalen:
- SignalExternalWorkflowExecutionInitiated in der Sender-History
- WorkflowExecutionSignaled in der Empfänger-History
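Zur Veranschaulichung eine kleine Client-Skizze, die diese Events aus der Event History eines Workflows ausliest. Annahmen: Temporal Service auf localhost:7233 und eine existierende Workflow-ID ("worker-123" ist nur ein Beispiel); Details wie fetch_history können je nach SDK-Version abweichen:
import asyncio
from temporalio.client import Client
from temporalio.api.enums.v1 import EventType

async def print_signal_events(workflow_id: str) -> None:
    """Listet Signal-bezogene Events aus der Event History eines Workflows."""
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle(workflow_id)
    history = await handle.fetch_history()
    relevant = {
        EventType.EVENT_TYPE_SIGNAL_EXTERNAL_WORKFLOW_EXECUTION_INITIATED,
        EventType.EVENT_TYPE_WORKFLOW_EXECUTION_SIGNALED,
    }
    for event in history.events:
        if event.event_type in relevant:
            print(f"Event {event.event_id}: {EventType.Name(event.event_type)}")

if __name__ == "__main__":
    # Angenommene Beispiel-Workflow-ID
    asyncio.run(print_signal_events("worker-123"))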
6.4.4 Signal Buffering vor Workflow-Start
Signale die VOR Workflow-Start gesendet werden, werden automatisch gepuffert:
async def start_with_buffered_signals():
"""Signale werden gepuffert bis Workflow startet"""
client = await Client.connect("localhost:7233")
# Workflow starten (Worker braucht Zeit zum Aufnehmen)
handle = await client.start_workflow(
MyWorkflow.run,
id="workflow-123",
task_queue="my-queue",
)
# Signale SOFORT senden (werden gepuffert wenn Workflow noch nicht läuft)
await handle.signal(MyWorkflow.signal_1, "data1")
await handle.signal(MyWorkflow.signal_2, "data2")
await handle.signal(MyWorkflow.signal_3, "data3")
# Wenn Workflow startet: Alle gepufferten Signale in Reihenfolge ausgeliefert
Wichtige Überlegungen:
- Signale in Reihenfolge gepuffert
- Alle vor erstem Workflow Task ausgeliefert
- Workflow-State muss initialisiert sein BEVOR Handler ausführen
- @workflow.init verwenden, um uninitialisierte Variablen zu vermeiden
Buffering Visualisierung:
sequenceDiagram
participant Client
participant Frontend as Temporal Frontend
participant Worker
participant Workflow
Client->>Frontend: start_workflow()
Frontend-->>Client: Workflow ID
Client->>Frontend: signal_1("data1")
Frontend->>Frontend: Buffer signal_1
Client->>Frontend: signal_2("data2")
Frontend->>Frontend: Buffer signal_2
Worker->>Frontend: Poll for tasks
Frontend->>Worker: Workflow Task + buffered signals
Worker->>Workflow: __init__() [via @workflow.init]
Worker->>Workflow: signal_1("data1")
Worker->>Workflow: signal_2("data2")
Worker->>Workflow: run()
Note over Workflow: Alle Signale verarbeitet<br/>BEVOR run() startet
6.4.5 Concurrency Safety bei async Handlern
Problem: Race Conditions
# ✗ Problem: Race Condition bei concurrent Updates
@workflow.update
async def withdraw(self, amount: Decimal) -> Decimal:
# Mehrere Updates können interleaven!
current = await workflow.execute_activity(
get_balance, ...
) # Point A
# Anderer Handler könnte hier ausführen!
new_balance = current - amount # Point B
# Und hier!
await workflow.execute_activity(
set_balance, new_balance, ...
) # Point C
return new_balance
Lösung: asyncio.Lock
from asyncio import Lock
@workflow.defn
class BankAccountWorkflow:
@workflow.init
def __init__(self) -> None:
self.lock = Lock() # Concurrency-Schutz
self.balance = Decimal("1000.00")
@workflow.update
async def withdraw(self, amount: Decimal) -> Decimal:
"""Thread-safe Withdrawal mit Lock"""
async with self.lock: # Nur ein Handler gleichzeitig
# Kritischer Bereich
current_balance = await workflow.execute_activity(
get_current_balance,
start_to_close_timeout=timedelta(seconds=10),
)
if current_balance < amount:
raise ValueError("Insufficient funds")
# Betrag abziehen
self.balance = current_balance - amount
# Externes System updaten
await workflow.execute_activity(
update_balance,
self.balance,
start_to_close_timeout=timedelta(seconds=10),
)
return self.balance
@withdraw.validator
def validate_withdraw(self, amount: Decimal) -> None:
if amount <= 0:
raise ValueError("Amount must be positive")
if amount > Decimal("10000.00"):
raise ValueError("Amount exceeds daily limit")
Alternative: Message Queue Pattern
from collections import deque
from typing import Deque
@workflow.defn
class QueueBasedWorkflow:
"""Natürliche Serialisierung via Queue"""
@workflow.init
def __init__(self) -> None:
self.message_queue: Deque[str] = deque()
@workflow.signal
def add_message(self, message: str) -> None:
"""Leichtgewichtiger Handler - nur queuen"""
if len(self.message_queue) >= 1000:
workflow.logger.warn("Queue full, dropping message")
return
self.message_queue.append(message)
@workflow.run
async def run(self) -> None:
"""Haupt-Workflow verarbeitet Queue"""
while True:
# Warte auf Messages
await workflow.wait_condition(
lambda: len(self.message_queue) > 0
)
# Verarbeite alle gepufferten Messages
while self.message_queue:
message = self.message_queue.popleft()
# Verarbeitung (natürlich serialisiert)
await workflow.execute_activity(
process_message,
message,
start_to_close_timeout=timedelta(seconds=30),
)
# Prüfe ob fortfahren
if self.should_shutdown:
break
Vorteile Queue Pattern:
- ✅ Natürliche FIFO Serialisierung
- ✅ Keine Race Conditions
- ✅ Kann Messages batchen
- ✅ Keine Locks nötig
Nachteile:
- ❌ Komplexerer Code
- ❌ Schwieriger typsicher zu machen
- ❌ Weniger bequem als direkte Handler
6.4.6 Continue-As-New mit Handlern
Problem: Unfertige Handler bei Continue-As-New
# ⚠️ Problem: Handler könnte bei Continue-As-New abgebrochen werden
@workflow.run
async def run(self) -> None:
while True:
# Arbeit erledigen
self.iteration += 1
if workflow.info().is_continue_as_new_suggested():
# PROBLEM: Signal/Update Handler könnten noch laufen!
workflow.continue_as_new(iteration=self.iteration)
Lösung: Handler-Completion warten
@workflow.defn
class LongRunningWorkflow:
@workflow.init
def __init__(self, iteration: int = 0, total_processed: int = 0) -> None:
self.iteration = iteration
self.total_processed = total_processed
@workflow.signal
async def process_item(self, item: str) -> None:
"""Async Signal Handler"""
result = await workflow.execute_activity(
process_item_activity,
item,
start_to_close_timeout=timedelta(minutes=5),
)
self.total_processed += 1
@workflow.run
async def run(self, iteration: int = 0, total_processed: int = 0) -> None:
while True:
self.iteration += 1
# Batch-Verarbeitung
await workflow.execute_activity(
batch_process,
start_to_close_timeout=timedelta(minutes=10),
)
# Prüfe ob Continue-As-New nötig
if workflow.info().is_continue_as_new_suggested():
workflow.logger.info(
"Event history approaching limits - Continue-As-New"
)
# ✓ WICHTIG: Warte auf alle Handler
await workflow.wait_condition(
lambda: workflow.all_handlers_finished()
)
# Jetzt sicher für Continue-As-New
workflow.continue_as_new(
iteration=self.iteration,
total_processed=self.total_processed
)
# Nächste Iteration
await asyncio.sleep(timedelta(hours=1).total_seconds())  # asyncio.sleep erwartet Sekunden
Idempotenz über Continue-As-New hinweg:
from typing import Optional, Set
@workflow.defn
class IdempotentWorkflow:
@workflow.init
def __init__(self, processed_update_ids: Optional[Set[str]] = None) -> None:
# IDs bereits verarbeiteter Updates
self.processed_update_ids = processed_update_ids or set()
@workflow.update
async def process_payment(self, payment_id: str) -> str:
"""Idempotenter Update über Continue-As-New"""
# Update ID für Deduplizierung
update_info = workflow.current_update_info()
if update_info and update_info.id in self.processed_update_ids:
workflow.logger.info(f"Skipping duplicate update: {update_info.id}")
return "already_processed"
# Payment verarbeiten
result = await workflow.execute_activity(
charge_payment,
payment_id,
start_to_close_timeout=timedelta(seconds=30),
)
# Als verarbeitet markieren
if update_info:
self.processed_update_ids.add(update_info.id)
return result
@workflow.run
async def run(self, processed_update_ids: Optional[Set[str]] = None) -> None:
while True:
# Workflow-Logik...
if workflow.info().is_continue_as_new_suggested():
await workflow.wait_condition(
lambda: workflow.all_handlers_finished()
)
# IDs an nächste Iteration übergeben
workflow.continue_as_new(
processed_update_ids=self.processed_update_ids
)
6.5 Häufige Fehler und Anti-Patterns
6.5.1 Uninitialisierte State-Zugriffe
Problem:
# ✗ FALSCH: Handler kann vor run() ausgeführt werden!
@workflow.defn
class BadWorkflow:
@workflow.run
async def run(self, name: str) -> str:
self.name = name # Initialisiert hier
await workflow.wait_condition(lambda: self.approved)
return f"Hello {self.name}"
@workflow.signal
def approve(self) -> None:
self.approved = True
# Wenn Signal vor run() gesendet: self.name existiert nicht!
# AttributeError!
Lösung:
# ✓ KORREKT: @workflow.init garantiert Ausführung vor Handlern
@workflow.defn
class GoodWorkflow:
@workflow.init
def __init__(self, name: str) -> None:
self.name = name # Garantiert zuerst ausgeführt
self.approved = False
@workflow.run
async def run(self, name: str) -> str:
await workflow.wait_condition(lambda: self.approved)
return f"Hello {self.name}"
@workflow.signal
def approve(self) -> None:
self.approved = True # self.name existiert garantiert
6.5.2 State-Mutation in Queries
Problem:
# ✗ FALSCH: Queries müssen read-only sein!
@workflow.query
def get_and_increment_counter(self) -> int:
self.counter += 1 # ERROR! State-Mutation
return self.counter
Lösung:
# ✓ KORREKT: Query nur lesen, Update für Mutation
@workflow.query
def get_counter(self) -> int:
return self.counter # Read-only
@workflow.update
def increment_counter(self) -> int:
self.counter += 1 # Mutations in Updates erlaubt
return self.counter
6.5.3 Async Query Handler
Problem:
# ✗ FALSCH: Queries können nicht async sein!
@workflow.query
async def get_status(self) -> str: # TypeError!
return self.status
Lösung:
# ✓ KORREKT: Queries müssen synchron sein
@workflow.query
def get_status(self) -> str:
return self.status
6.5.4 Nicht auf Handler-Completion warten
Problem:
# ✗ FALSCH: Workflow endet während Handler laufen
@workflow.run
async def run(self) -> str:
await workflow.execute_activity(...)
return "Done" # Handler könnten noch laufen!
Lösung:
# ✓ KORREKT: Auf Handler warten
@workflow.run
async def run(self) -> str:
await workflow.execute_activity(...)
# Alle Handler fertigstellen
await workflow.wait_condition(
lambda: workflow.all_handlers_finished()
)
return "Done"
6.5.5 Exzessive Signal-Volumes
Problem:
# ✗ FALSCH: Zu viele Signale
for i in range(10000):
await handle.signal(MyWorkflow.process_item, f"item-{i}")
# Überschreitet Event History Limits!
Lösung:
# ✓ BESSER: Batch-Signale
items = [f"item-{i}" for i in range(10000)]
await handle.signal(MyWorkflow.process_batch, items)
# Oder: Child Workflows für hohe Volumes
for i in range(100):
await workflow.execute_child_workflow(
ProcessingWorkflow.run,
batch=items[i*100:(i+1)*100],
id=f"batch-{i}",
)
6.5.6 Kontinuierliches Query Polling
Problem:
# ✗ INEFFIZIENT: Tight polling loop
async def wait_for_completion_bad(handle):
while True:
status = await handle.query(MyWorkflow.get_status)
if status == "completed":
break
await asyncio.sleep(0.5) # Verschwendung!
Lösung:
# ✓ BESSER: Update mit wait_condition
@workflow.update
async def wait_for_completion(self, target_status: str) -> str:
await workflow.wait_condition(lambda: self.status == target_status)
return self.status
# Client
status = await handle.execute_update(
MyWorkflow.wait_for_completion,
"completed"
)
6.6 Praktisches Beispiel: E-Commerce Order Workflow
Vollständiges Beispiel mit Signalen, Queries und Updates:
"""
E-Commerce Order Workflow
Demonstriert Signals, Queries und Updates
"""
from temporalio import workflow, activity
from temporalio.client import Client
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional
from datetime import timedelta, datetime, timezone
from enum import Enum
import asyncio
# ==================== Data Models ====================
class OrderStatus(Enum):
PENDING = "pending"
PAID = "paid"
SHIPPED = "shipped"
DELIVERED = "delivered"
CANCELLED = "cancelled"
@dataclass
class OrderItem:
sku: str
name: str
quantity: int
price: Decimal
@dataclass
class PaymentInfo:
payment_method: str
amount: Decimal
card_token: str
@dataclass
class ShippingInfo:
address: str
carrier: str
tracking_number: str
@dataclass
class OrderProgress:
"""Query Response: Order Progress"""
status: OrderStatus
items_count: int
total_amount: str
payment_status: str
shipping_status: str
# ==================== Workflow ====================
@workflow.defn
class OrderWorkflow:
"""E-Commerce Order mit Signals, Queries und Updates"""
@workflow.init
def __init__(self, order_id: str, customer_id: str) -> None:
self.lock = asyncio.Lock()
# Order State
self.order_id: str = order_id
self.customer_id: str = customer_id
self.status = OrderStatus.PENDING
# Items
self.items: List[OrderItem] = []
self.total = Decimal("0.00")
# Payment
self.payment_transaction_id: Optional[str] = None
# Shipping
self.shipping_info: Optional[ShippingInfo] = None
# ========== Queries: Read-Only State Access ==========
@workflow.query
def get_status(self) -> str:
"""Aktueller Order Status"""
return self.status.value
@workflow.query
def get_total(self) -> str:
"""Aktueller Total-Betrag"""
return str(self.total)
@workflow.query
def get_progress(self) -> OrderProgress:
"""Detaillierter Progress"""
return OrderProgress(
status=self.status,
items_count=len(self.items),
total_amount=str(self.total),
payment_status=(
"paid" if self.payment_transaction_id
else "pending"
),
shipping_status=(
f"shipped via {self.shipping_info.carrier}"
if self.shipping_info
else "not shipped"
)
)
# ========== Updates: Validated State Changes ==========
@workflow.update
async def add_item(self, item: OrderItem) -> dict:
"""Item hinzufügen (mit Validierung)"""
async with self.lock:
# Inventory prüfen
available = await workflow.execute_activity(
check_inventory,
item,
start_to_close_timeout=timedelta(seconds=10),
)
if not available:
raise ValueError(f"Item {item.sku} out of stock")
# Item hinzufügen
self.items.append(item)
self.total += item.price * item.quantity
workflow.logger.info(
f"Added {item.quantity}x {item.name} - "
f"Total: ${self.total}"
)
return {
"items": len(self.items),
"total": str(self.total)
}
@add_item.validator
def validate_add_item(self, item: OrderItem) -> None:
"""Validator: Item nur wenn Order pending"""
if self.status != OrderStatus.PENDING:
raise ValueError(
f"Cannot add items in status: {self.status.value}"
)
if item.quantity <= 0:
raise ValueError("Quantity must be positive")
if len(self.items) >= 50:
raise ValueError("Maximum 50 items per order")
@workflow.update
async def process_payment(self, payment: PaymentInfo) -> str:
"""Payment verarbeiten (mit Validierung)"""
async with self.lock:
# Payment Amount validieren
if payment.amount != self.total:
raise ValueError(
f"Payment amount {payment.amount} != "
f"order total {self.total}"
)
# Payment Activity ausführen
transaction_id = await workflow.execute_activity(
charge_payment,
payment,
start_to_close_timeout=timedelta(seconds=30),
)
# State updaten
self.payment_transaction_id = transaction_id
self.status = OrderStatus.PAID
workflow.logger.info(
f"Payment successful: {transaction_id}"
)
return transaction_id
@process_payment.validator
def validate_payment(self, payment: PaymentInfo) -> None:
"""Validator: Payment nur wenn pending und Items vorhanden"""
if self.status != OrderStatus.PENDING:
raise ValueError(
f"Cannot process payment in status: {self.status.value}"
)
if len(self.items) == 0:
raise ValueError("Cannot pay for empty order")
if not payment.card_token or len(payment.card_token) < 10:
raise ValueError("Invalid card token")
# ========== Signals: Async Notifications ==========
@workflow.signal
async def mark_shipped(self, shipping: ShippingInfo) -> None:
"""Order als shipped markieren"""
async with self.lock:
if self.status != OrderStatus.PAID:
workflow.logger.warn(
f"Cannot ship order in status: {self.status.value}"
)
return
# Shipping System updaten
await workflow.execute_activity(
update_shipping_system,
args=[self.order_id, shipping],
start_to_close_timeout=timedelta(seconds=10),
)
self.shipping_info = shipping
self.status = OrderStatus.SHIPPED
workflow.logger.info(
f"Order shipped via {shipping.carrier} - "
f"Tracking: {shipping.tracking_number}"
)
@workflow.signal
def cancel_order(self, reason: str) -> None:
"""Order canceln"""
if self.status in [OrderStatus.SHIPPED, OrderStatus.DELIVERED]:
workflow.logger.warn(
f"Cannot cancel order in status: {self.status.value}"
)
return
self.status = OrderStatus.CANCELLED
workflow.logger.info(f"Order cancelled: {reason}")
# ========== Main Workflow ==========
@workflow.run
async def run(self, order_id: str, customer_id: str) -> str:
"""Order Workflow Main Logic"""
self.order_id = order_id
self.customer_id = customer_id
workflow.logger.info(
f"Order {order_id} created for customer {customer_id}"
)
# Warte auf Payment oder Cancellation (max 7 Tage)
try:
await workflow.wait_condition(
lambda: self.status in (OrderStatus.PAID, OrderStatus.CANCELLED),
timeout=timedelta(days=7)
)
except asyncio.TimeoutError:
self.status = OrderStatus.CANCELLED
return f"Order {order_id} cancelled - payment timeout"
if self.status == OrderStatus.CANCELLED:
return f"Order {order_id} cancelled"
# Warte auf Shipment (max 30 Tage)
try:
await workflow.wait_condition(
lambda: self.status == OrderStatus.SHIPPED,
timeout=timedelta(days=30)
)
except asyncio.TimeoutError:
workflow.logger.error("Shipment timeout!")
return f"Order {order_id} paid but not shipped"
# Simuliere Delivery Tracking
await asyncio.sleep(timedelta(days=3).total_seconds())  # asyncio.sleep erwartet Sekunden
# Mark als delivered
self.status = OrderStatus.DELIVERED
# Delivery Confirmation senden
await workflow.execute_activity(
send_delivery_confirmation,
args=[self.customer_id, self.order_id],
start_to_close_timeout=timedelta(seconds=30),
)
workflow.logger.info(f"Order {order_id} delivered successfully")
return f"Order {order_id} completed - ${self.total} charged"
# ==================== Activities ====================
@activity.defn
async def check_inventory(item: OrderItem) -> bool:
"""Prüfe Inventory-Verfügbarkeit"""
# Simuliert Inventory-Check
activity.logger.info(f"Checking inventory for {item.sku}")
return True
@activity.defn
async def charge_payment(payment: PaymentInfo) -> str:
"""Verarbeite Payment"""
activity.logger.info(
f"Charging {payment.amount} via {payment.payment_method}"
)
# Simuliert Payment Gateway
return f"txn_{payment.card_token[:8]}"
@activity.defn
async def update_shipping_system(
order_id: str,
shipping: ShippingInfo
) -> None:
"""Update Shipping System"""
activity.logger.info(
f"Updating shipping for {order_id} - {shipping.carrier}"
)
@activity.defn
async def send_delivery_confirmation(
customer_id: str,
order_id: str
) -> None:
"""Sende Delivery Confirmation"""
activity.logger.info(
f"Sending delivery confirmation to {customer_id} for {order_id}"
)
# ==================== Client Usage ====================
async def main():
"""Client-seitiger Order Flow"""
client = await Client.connect("localhost:7233")
# Order Workflow starten
order_id = "order-12345"
handle = await client.start_workflow(
OrderWorkflow.run,
args=[order_id, "customer-789"],
id=order_id,
task_queue="orders",
)
print(f"Order {order_id} created")
# Items hinzufügen (Update)
try:
result = await handle.execute_update(
OrderWorkflow.add_item,
OrderItem(
sku="LAPTOP-001",
name="Gaming Laptop",
quantity=1,
price=Decimal("1299.99")
)
)
print(f"Item added: {result}")
result = await handle.execute_update(
OrderWorkflow.add_item,
OrderItem(
sku="MOUSE-001",
name="Wireless Mouse",
quantity=2,
price=Decimal("29.99")
)
)
print(f"Item added: {result}")
except Exception as e:
print(f"Failed to add item: {e}")
return
# Total abfragen (Query)
total = await handle.query(OrderWorkflow.get_total)
print(f"Order total: ${total}")
# Progress abfragen (Query)
progress = await handle.query(OrderWorkflow.get_progress)
print(f"Progress: {progress}")
# Payment verarbeiten (Update mit Validierung)
try:
txn_id = await handle.execute_update(
OrderWorkflow.process_payment,
PaymentInfo(
payment_method="credit_card",
amount=Decimal(total),
card_token="tok_1234567890abcdef"
)
)
print(f"Payment processed: {txn_id}")
except Exception as e:
print(f"Payment failed: {e}")
return
# Status abfragen
status = await handle.query(OrderWorkflow.get_status)
print(f"Order status: {status}")
# Shipment markieren (Signal)
await handle.signal(
OrderWorkflow.mark_shipped,
ShippingInfo(
address="123 Main St, City, State 12345",
carrier="UPS",
tracking_number="1Z999AA10123456784"
)
)
print("Order marked as shipped")
# Final result abwarten
result = await handle.result()
print(f"\nWorkflow completed: {result}")
if __name__ == "__main__":
asyncio.run(main())
Beispiel Output:
Order order-12345 created
Item added: {'items': 1, 'total': '1299.99'}
Item added: {'items': 2, 'total': '1359.97'}
Order total: $1359.97
Progress: OrderProgress(status=<OrderStatus.PENDING: 'pending'>, items_count=2, total_amount='1359.97', payment_status='pending', shipping_status='not shipped')
Payment processed: txn_1234567890abcdef
Order status: paid
Order marked as shipped
Workflow completed: Order order-12345 completed - $1359.97 charged
6.7 Zusammenfassung
Kernkonzepte
Signals (Signale):
- Asynchrone, fire-and-forget Zustandsänderungen
- Erzeugen Event History Einträge
- Können vor Workflow-Start gepuffert werden
- Perfekt für Event-driven Patterns und Human-in-the-Loop
Queries (Abfragen):
- Synchrone, read-only Zustandsabfragen
- KEINE Event History Einträge
- Funktionieren auf abgeschlossenen Workflows
- Ideal für Dashboards und Monitoring
Updates (Aktualisierungen):
- Synchrone Zustandsänderungen mit Response
- Optionale Validierung vor Ausführung
- Event History nur bei erfolgreicher Validierung
- Beste Wahl für Request-Response Patterns
Entscheidungsbaum
flowchart TD
A[Workflow Interaktion] --> B{Zustand ändern?}
B -->|Nein - Nur lesen| C[Query verwenden]
B -->|Ja - Zustand ändern| D{Response nötig?}
D -->|Nein| E{Niedrige Latenz<br/>kritisch?}
E -->|Ja| F[Signal]
E -->|Nein| F
D -->|Ja| G{Validierung<br/>nötig?}
G -->|Ja| H[Update mit Validator]
G -->|Nein| I[Update ohne Validator]
C --> J[Vorteile:<br/>- Keine History<br/>- Nach Workflow-Ende<br/>- Niedrige Latenz]
F --> K[Vorteile:<br/>- Fire-and-forget<br/>- Niedrige Latenz<br/>- Event-driven]
H --> L[Vorteile:<br/>- Frühe Ablehnung<br/>- Input-Validierung<br/>- Synchrone Response]
I --> M[Vorteile:<br/>- Synchrone Response<br/>- Fehler-Feedback<br/>- Activity-Ausführung]
style C fill:#90EE90
style F fill:#87CEEB
style H fill:#FFD700
style I fill:#FFA500
Best Practices Checkliste
Allgemein:
- ✅ @workflow.init für State-Initialisierung verwenden
- ✅ Dataclasses für typsichere Parameter nutzen
- ✅ Auf workflow.all_handlers_finished() vor Workflow-Ende warten
- ✅ Event History Limits beachten (Continue-As-New)
Signale:
- ✅ Handler leichtgewichtig halten
- ✅ Idempotenz implementieren
- ✅ Nicht tausende Signale senden (batchen!)
- ✅ Signal-with-Start für lazy initialization
Queries:
- ✅ Nur synchrone (def) Handler
- ✅ KEINE State-Mutation
- ✅ Pre-compute komplexe Werte
- ✅ NICHT kontinuierlich pollen
Updates:
- ✅ Validators für Input-Validierung
- ✅ asyncio.Lock bei concurrent async Handlern
- ✅ Update statt Signal+Query Polling
- ✅ Idempotenz über Continue-As-New
Häufige Anti-Patterns
| Anti-Pattern | Problem | Lösung |
|---|---|---|
| State in run() initialisieren | Handler könnten vor run() ausführen | @workflow.init verwenden |
| Async Query Handler | TypeError | Nur def, nicht async def |
| State in Query ändern | Verletzt Read-Only | Update verwenden |
| Nicht auf Handler warten | Workflow endet vorzeitig | workflow.all_handlers_finished() |
| Kontinuierliches Query Polling | Ineffizient | Update mit wait_condition |
| Race Conditions in async Handlern | Concurrent execution | asyncio.Lock verwenden |
| Tausende Signale | Event History Limit | Batching oder Child Workflows |
Nächste Schritte
In diesem Kapitel haben Sie die drei Kommunikationsmechanismen von Temporal kennengelernt. Im nächsten Teil des Buchs (Kapitel 7-9) werden wir uns mit Resilienz beschäftigen:
- Kapitel 7: Error Handling und Retry Policies
- Kapitel 8: SAGA Pattern
- Kapitel 9: Workflow-Evolution und Versionierung
Die Kommunikationsmuster aus diesem Kapitel bilden die Grundlage für robuste, produktionsreife Temporal-Anwendungen.
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 7: Fehlerbehandlung und Retries
Code-Beispiele für dieses Kapitel: examples/part-02/chapter-06/
Kapitel 7: Error Handling und Retry Policies
Einleitung
In verteilten Systemen sind Fehler unvermeidlich: Netzwerkverbindungen brechen ab, externe Services werden langsam oder unerreichbar, Datenbanken geraten unter Last, und Timeouts treten auf. Die Fähigkeit, gracefully mit diesen Fehlern umzugehen, ist entscheidend für resiliente Anwendungen.
Temporal wurde von Grund auf entwickelt, um Fehlerbehandlung zu vereinfachen und zu automatisieren. Das Framework übernimmt einen Großteil der komplexen Retry-Logik, während Entwickler die Kontrolle über kritische Business-Entscheidungen behalten.
In diesem Kapitel lernen Sie:
- Den fundamentalen Unterschied zwischen Activity- und Workflow-Fehlern
- Exception-Typen und deren Verwendung im Python SDK
- Retry Policies konfigurieren und anpassen
- Timeouts richtig einsetzen (Activity und Workflow)
- Advanced Error Patterns (SAGA, Circuit Breaker, Dead Letter Queue)
- Testing und Debugging von Fehlerszenarien
- Best Practices für produktionsreife Fehlerbehandlung
Warum Error Handling in Temporal anders ist
In traditionellen Systemen müssen Entwickler:
- Retry-Logik selbst implementieren
- Exponential Backoff manuell programmieren
- Idempotenz explizit sicherstellen
- Fehlerzustände in Datenbanken speichern
- Circuit Breaker selbst bauen
Mit Temporal:
- Activities haben automatische Retries (konfigurierbar)
- Exponential Backoff ist eingebaut
- Event History speichert den gesamten State (kein externes Persistence Layer nötig)
- Deterministische Replay-Garantien ermöglichen sichere Fehlerbehandlung
- Deklarative Retry Policies statt imperativer Code
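Zur Veranschaulichung des letzten Punkts (deklarative Retry Policies statt imperativer Retry-Schleifen) eine kleine Gegenüberstellung als Skizze; call_external_service und der Activity-Name sind dabei nur angenommene Platzhalter:
# Gegenüberstellung (Skizze): manuelle Retry-Schleife vs. deklarative RetryPolicy
import asyncio
import random
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Traditionell: Retry-Logik von Hand
async def call_with_manual_retry(max_attempts: int = 5) -> str:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            # call_external_service ist ein angenommener Platzhalter
            return await call_external_service()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential Backoff mit Jitter, manuell implementiert
            await asyncio.sleep(delay + random.random())
            delay *= 2

# Mit Temporal: Retry-Verhalten deklarativ an der Activity-Aufrufstelle
@workflow.defn
class DeclarativeRetryWorkflow:
    @workflow.run
    async def run(self, data: str) -> str:
        return await workflow.execute_activity(
            "call_external_service",  # Activity-Name, hier nur angenommen
            data,
            start_to_close_timeout=timedelta(seconds=10),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                backoff_coefficient=2.0,
                maximum_attempts=5,
            ),
        )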
7.1 Error Handling Grundlagen
7.1.1 Activity Errors vs Workflow Errors
Der fundamentalste Unterschied in Temporal’s Error-Modell: Activity-Fehler führen NICHT automatisch zu Workflow-Fehlern. Dies ist ein bewusstes Design-Pattern für Resilienz.
Activity Errors:
- Jede Python Exception in einer Activity wird automatisch in einen ApplicationError konvertiert
- Activities haben Default Retry Policies und versuchen automatisch erneut
- Activity-Fehler werden im Workflow als ActivityError weitergegeben
- Der Workflow entscheidet, wie mit dem Fehler umgegangen wird
Workflow Errors:
- Workflows haben KEINE Default Retry Policy
- Nur explizite ApplicationError-Raises führen zum Workflow-Fehler
- Andere Python Exceptions (z.B. NameError, TypeError) führen zu Workflow Task Retries
- Non-Temporal Exceptions werden als Bugs betrachtet, die durch Code-Fixes behoben werden können
Visualisierung: Error Flow
flowchart TD
A[Activity wirft Exception] --> B{Activity Retry Policy}
B -->|Retry erlaubt| C[Exponential Backoff]
C --> D[Activity Retry]
D --> B
B -->|Max Attempts erreicht| E[ActivityError an Workflow]
E --> F{Workflow Error Handling}
F -->|try/except| G[Workflow behandelt Fehler]
F -->|Nicht gefangen| H[Workflow wirft ApplicationError]
G --> I[Workflow fährt fort]
H --> J[Workflow Status: Failed]
style A fill:#FFB6C1
style E fill:#FFA500
style H fill:#FF4444
style I fill:#90EE90
Code-Beispiel:
from temporalio import workflow, activity
from temporalio.exceptions import ActivityError, ApplicationError
from temporalio.common import RetryPolicy
from datetime import timedelta
@activity.defn
async def risky_operation(data: str) -> str:
"""Activity die fehlschlagen kann"""
if "invalid" in data:
# Diese Exception wird zu ActivityError im Workflow
raise ValueError(f"Invalid data: {data}")
# Simuliere externe API
result = await external_api.call(data)
return result
@workflow.defn
class ErrorHandlingWorkflow:
@workflow.run
async def run(self, data: str) -> str:
try:
# Activity Retry Policy: Max 3 Versuche
result = await workflow.execute_activity(
risky_operation,
data,
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(maximum_attempts=3)
)
return f"Success: {result}"
except ActivityError as e:
# Activity-Fehler abfangen und behandeln
workflow.logger.error(f"Activity failed after retries: {e}")
# Entscheidung: Workflow-Fehler oder graceful handling?
if "critical" in data:
# Kritischer Fehler → Workflow schlägt fehl
raise ApplicationError(
"Critical operation failed",
non_retryable=True
) from e
else:
# Nicht-kritisch → Fallback
return f"Failed with fallback: {e.cause.message if e.cause else str(e)}"
Wichtige Erkenntnisse:
- Separation of Concerns: Activities führen Work aus (fehleranfällig), Workflows orchestrieren (resilient)
- Automatic Retries: Platform kümmert sich um Retries, Sie konfigurieren nur
- Explicit Failure: Workflows müssen explizit fehlschlagen via
ApplicationError - Graceful Degradation: Workflows können Activity-Fehler abfangen und mit Fallback fortfahren
7.1.2 Exception-Hierarchie im Python SDK
Das Temporal Python SDK hat eine klare Exception-Hierarchie:
TemporalError (Basis für alle Temporal Exceptions)
├── FailureError (Basis für Runtime-Failures)
│ ├── ApplicationError (User-thrown, kontrolliert)
│ ├── ActivityError (Activity fehlgeschlagen)
│ ├── ChildWorkflowError (Child Workflow fehlgeschlagen)
│ ├── CancelledError (Cancellation)
│ ├── TerminatedError (Terminierung)
│ ├── TimeoutError (Timeout)
│ └── ServerError (Server-seitige Fehler)
Exception-Klassen im Detail:
1. ApplicationError - Die primäre Exception für bewusste Fehler
from temporalio.exceptions import ApplicationError
# Einfach
raise ApplicationError("Something went wrong")
# Mit Typ (für Retry Policy)
raise ApplicationError(
"Payment failed",
type="PaymentError"
)
# Non-retryable
raise ApplicationError(
"Invalid input",
type="ValidationError",
non_retryable=True # Keine Retries!
)
# Mit Details (serialisierbar)
raise ApplicationError(
"Order processing failed",
type="OrderError",
details=[{
"order_id": "12345",
"reason": "inventory_unavailable",
"requested": 10,
"available": 5
}]
)
# Mit custom retry delay
raise ApplicationError(
"Rate limited",
type="RateLimitError",
next_retry_delay=timedelta(seconds=60)
)
2. ActivityError - Activity Failure Wrapper
try:
result = await workflow.execute_activity(...)
except ActivityError as e:
# Zugriff auf Error-Properties
workflow.logger.error(f"Activity failed: {e.activity_type}")
workflow.logger.error(f"Activity ID: {e.activity_id}")
workflow.logger.error(f"Retry state: {e.retry_state}")
# Zugriff auf die ursprüngliche Exception (cause)
if isinstance(e.cause, ApplicationError):
workflow.logger.error(f"Root cause type: {e.cause.type}")
workflow.logger.error(f"Root cause message: {e.cause.message}")
workflow.logger.error(f"Details: {e.cause.details}")
3. ChildWorkflowError - Child Workflow Failure Wrapper
try:
result = await workflow.execute_child_workflow(...)
except ChildWorkflowError as e:
workflow.logger.error(f"Child workflow {e.workflow_type} failed")
workflow.logger.error(f"Workflow ID: {e.workflow_id}")
workflow.logger.error(f"Run ID: {e.run_id}")
# Nested causes navigieren
current = e.cause
while current:
workflow.logger.error(f"Cause: {type(current).__name__}: {current}")
if hasattr(current, 'cause'):
current = current.cause
else:
break
4. TimeoutError - Timeout Wrapper
try:
result = await workflow.execute_activity(...)
except TimeoutError as e:
# Timeout-Typ ermitteln
from temporalio.api.enums.v1 import TimeoutType
if e.type == TimeoutType.TIMEOUT_TYPE_START_TO_CLOSE:
workflow.logger.error("Activity execution timed out")
elif e.type == TimeoutType.TIMEOUT_TYPE_HEARTBEAT:
workflow.logger.error("Activity heartbeat timed out")
# Last heartbeat details abrufen
if e.last_heartbeat_details:
workflow.logger.info(f"Last progress: {e.last_heartbeat_details}")
Exception-Hierarchie Visualisierung:
classDiagram
TemporalError <|-- FailureError
FailureError <|-- ApplicationError
FailureError <|-- ActivityError
FailureError <|-- ChildWorkflowError
FailureError <|-- CancelledError
FailureError <|-- TimeoutError
FailureError <|-- TerminatedError
class TemporalError {
+message: str
}
class ApplicationError {
+type: str
+message: str
+details: list
+non_retryable: bool
+next_retry_delay: timedelta
+cause: Exception
}
class ActivityError {
+activity_id: str
+activity_type: str
+retry_state: RetryState
+cause: Exception
}
class ChildWorkflowError {
+workflow_id: str
+workflow_type: str
+run_id: str
+retry_state: RetryState
+cause: Exception
}
7.1.3 Retriable vs Non-Retriable Errors
Ein kritisches Konzept: Welche Fehler sollen retry-ed werden, welche nicht?
Retriable Errors (Transiente Fehler):
- Netzwerk-Timeouts
- Verbindungsfehler
- Service temporarily unavailable (503)
- Rate Limiting (429)
- Datenbank Deadlocks
- Temporäre Ressourcen-Knappheit
Non-Retryable Errors (Permanente Fehler):
- Authentication failures (401, 403)
- Resource not found (404)
- Bad Request / Validierung (400)
- Business Logic Failures (Insufficient funds, Invalid state)
- Permanente Authorization Errors
Entscheidungsbaum:
flowchart TD
A[Fehler aufgetreten] --> B{Wird der Fehler<br/>durch Warten behoben?}
B -->|Ja| C[RETRIABLE]
B -->|Nein| D{Kann User<br/>das Problem lösen?}
D -->|Ja| E[NON-RETRIABLE<br/>mit aussagekräftiger Message]
D -->|Nein| F{Ist es ein Bug<br/>im Code?}
F -->|Ja| G[NON-RETRIABLE<br/>+ Alert an Ops]
F -->|Nein| H[NON-RETRIABLE<br/>+ Detailed Error Info]
C --> I[Retry Policy<br/>mit Exponential Backoff]
E --> J[Sofortiges Failure<br/>mit Details für User]
style C fill:#90EE90
style E fill:#FFB6C1
style G fill:#FF4444
style H fill:#FFA500
Implementierungs-Beispiel:
from enum import Enum
class ErrorCategory(Enum):
TRANSIENT = "transient" # Retry
VALIDATION = "validation" # Don't retry
AUTH = "authentication" # Don't retry
BUSINESS = "business_logic" # Don't retry
RATE_LIMIT = "rate_limit" # Retry mit delay
@activity.defn
async def smart_error_handling(order_id: str) -> str:
"""Activity mit intelligentem Error Handling"""
try:
# External API call
response = await payment_api.charge(order_id)
return response.transaction_id
except NetworkError as e:
# Transient → Allow retry
raise ApplicationError(
f"Network error (will retry): {e}",
type="NetworkError"
) from e
except RateLimitError as e:
# Rate limit → Retry mit custom delay
raise ApplicationError(
"API rate limit exceeded",
type="RateLimitError",
next_retry_delay=timedelta(seconds=60)
) from e
except AuthenticationError as e:
# Auth failure → Don't retry
raise ApplicationError(
"Payment service authentication failed",
type="AuthenticationError",
non_retryable=True
) from e
except ValidationError as e:
# Invalid input → Don't retry
raise ApplicationError(
f"Invalid order data: {e.errors}",
type="ValidationError",
non_retryable=True,
details=[{"order_id": order_id, "errors": e.errors}]
) from e
except InsufficientFundsError as e:
# Business logic → Don't retry
raise ApplicationError(
"Payment declined: Insufficient funds",
type="PaymentDeclinedError",
non_retryable=True,
details=[{
"order_id": order_id,
"amount_requested": e.amount,
"balance": e.current_balance
}]
) from e
# Im Workflow: Retry Policy mit non_retryable_error_types
@workflow.defn
class SmartOrderWorkflow:
@workflow.run
async def run(self, order_id: str) -> dict:
try:
transaction_id = await workflow.execute_activity(
smart_error_handling,
order_id,
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=1),
maximum_attempts=5,
backoff_coefficient=2.0,
# Diese Error-Typen NICHT retry-en
non_retryable_error_types=[
"ValidationError",
"AuthenticationError",
"PaymentDeclinedError"
]
)
)
return {"success": True, "transaction_id": transaction_id}
except ActivityError as e:
if isinstance(e.cause, ApplicationError):
# Detailliertes Error-Handling basierend auf Typ
error_type = e.cause.type
if error_type == "PaymentDeclinedError":
# Kunde benachrichtigen
await self.notify_customer("Payment declined")
return {"success": False, "reason": "insufficient_funds"}
elif error_type == "ValidationError":
# Log für Debugging
workflow.logger.error(f"Validation failed: {e.cause.details}")
return {"success": False, "reason": "invalid_data"}
# Unbehandelter Fehler → Workflow schlägt fehl
raise
7.2 Retry Policies
Retry Policies sind das Herzstück von Temporal’s Resilienz. Sie definieren wie und wie oft ein fehlgeschlagener Versuch wiederholt wird.
7.2.1 RetryPolicy Konfiguration
Vollständige Parameter:
from temporalio.common import RetryPolicy
from datetime import timedelta
retry_policy = RetryPolicy(
# Backoff-Intervall für ersten Retry (Default: 1s)
initial_interval=timedelta(seconds=1),
# Multiplikator für jeden weiteren Retry (Default: 2.0)
backoff_coefficient=2.0,
# Maximales Backoff-Intervall (Default: 100x initial_interval)
maximum_interval=timedelta(seconds=100),
# Maximale Anzahl Versuche (0 = unbegrenzt, Default: 0)
maximum_attempts=5,
# Error-Typen die NICHT retry-ed werden
non_retryable_error_types=["ValidationError", "AuthError"]
)
Parameter-Beschreibung:
| Parameter | Typ | Default | Beschreibung |
|---|---|---|---|
| initial_interval | timedelta | 1s | Wartezeit vor erstem Retry |
| backoff_coefficient | float | 2.0 | Multiplikator pro Retry |
| maximum_interval | timedelta | 100x initial | Max Wartezeit zwischen Retries |
| maximum_attempts | int | 0 (∞) | Max Anzahl Versuche (inkl. Original) |
| non_retryable_error_types | List[str] | None | Error-Types ohne Retry |
7.2.2 Exponential Backoff
Formel:
next_interval = min(
initial_interval × backoff_coefficient ^ (attempt − 1),
maximum_interval
)
mit attempt = Nummer des soeben fehlgeschlagenen Versuchs (Versuch 1 = Original)
Beispiel-Progression (initial=1s, coefficient=2.0, max=100s):
Versuch 1: Sofort (Original)
Versuch 2: +1s = 1s Wartezeit
Versuch 3: +2s = 3s Gesamtzeit
Versuch 4: +4s = 7s Gesamtzeit
Versuch 5: +8s = 15s Gesamtzeit
Versuch 6: +16s = 31s Gesamtzeit
Versuch 7: +32s = 63s Gesamtzeit
Versuch 8: +64s = 127s Gesamtzeit
Versuch 9+: jeweils +100s (berechnetes Intervall überschreitet maximum_interval und wird gecapped)
Visualisierung:
graph LR
A[Attempt 1<br/>Immediate] -->|Wait 1s| B[Attempt 2]
B -->|Wait 2s| C[Attempt 3]
C -->|Wait 4s| D[Attempt 4]
D -->|Wait 8s| E[Attempt 5]
E -->|Wait 16s| F[Attempt 6]
F -->|Wait 32s| G[Attempt 7]
G -->|Wait 64s| H[Attempt 8]
H -->|Wait 100s<br/>capped| I[Attempt 9]
style A fill:#90EE90
style I fill:#FFB6C1
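Wer die Progression selbst nachrechnen möchte: eine kleine Hilfsfunktion in reinem Python (unabhängig vom Temporal SDK), die die obige Formel auswertet:
from datetime import timedelta

def backoff_intervals(
    initial: timedelta,
    coefficient: float,
    maximum: timedelta,
    attempts: int,
) -> list[timedelta]:
    """Wartezeit vor Versuch 2..attempts gemäß der obigen Formel."""
    intervals = []
    for n in range(2, attempts + 1):
        seconds = initial.total_seconds() * coefficient ** (n - 2)
        intervals.append(min(timedelta(seconds=seconds), maximum))
    return intervals

# Reproduziert die Beispiel-Progression: initial=1s, coefficient=2.0, max=100s
for n, wait in enumerate(
    backoff_intervals(timedelta(seconds=1), 2.0, timedelta(seconds=100), 10),
    start=2,
):
    print(f"Versuch {n}: Wartezeit {wait.total_seconds():.0f}s")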
Warum Exponential Backoff?
- Thundering Herd vermeiden: Nicht alle Clients retry gleichzeitig
- Service-Erholung: Geben externen Services Zeit zu recovern
- Ressourcen-Schonung: Reduziert Load während Ausfall-Perioden
- Progressive Degradation: Schnelle erste Retries, dann geduldiger
Code-Beispiel mit verschiedenen Strategien:
# Strategie 1: Aggressive Retries (schnelle Transients)
quick_retry = RetryPolicy(
initial_interval=timedelta(milliseconds=100),
maximum_interval=timedelta(seconds=1),
backoff_coefficient=1.5,
maximum_attempts=10
)
# Strategie 2: Geduldige Retries (externe Services)
patient_retry = RetryPolicy(
initial_interval=timedelta(seconds=5),
maximum_interval=timedelta(minutes=5),
backoff_coefficient=2.0,
maximum_attempts=20
)
# Strategie 3: Limited Retries (kritische Operationen)
limited_retry = RetryPolicy(
initial_interval=timedelta(seconds=1),
maximum_interval=timedelta(seconds=10),
backoff_coefficient=2.0,
maximum_attempts=3 # Nur 3 Versuche
)
# Strategie 4: Custom Delay (Rate Limiting)
@activity.defn
async def rate_limited_activity() -> str:
try:
return await external_api.call()
except RateLimitError as e:
# Custom delay basierend auf API Response
retry_after = e.retry_after_seconds
raise ApplicationError(
"Rate limited",
next_retry_delay=timedelta(seconds=retry_after)
) from e
7.2.3 Default Retry-Verhalten
Activities:
- Haben automatisch eine Default Retry Policy
- Retry unbegrenzt bis Erfolg
- Default initial_interval: 1 Sekunde
- Default backoff_coefficient: 2.0
- Default maximum_interval: 100 Sekunden
- Default maximum_attempts: 0 (unbegrenzt)
Workflows:
- Haben KEINE Default Retry Policy
- Müssen explizit konfiguriert werden wenn Retries gewünscht
- Design-Philosophie: Workflows sollten Issues durch Activities behandeln
Child Workflows:
- Können Retry Policies konfiguriert bekommen
- Unabhängig vom Parent Workflow
Beispiel: Defaults überschreiben
# Activity OHNE Retries
await workflow.execute_activity(
one_shot_activity,
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(maximum_attempts=1) # Nur 1 Versuch
)
# Activity MIT custom Retries
await workflow.execute_activity(
my_activity,
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=2),
maximum_attempts=5
)
)
# Workflow MIT Retries (vom Client)
await client.execute_workflow(
MyWorkflow.run,
args=["data"],
id="workflow-id",
task_queue="my-queue",
retry_policy=RetryPolicy(
maximum_interval=timedelta(seconds=10),
maximum_attempts=3
)
)
7.2.4 Retry Policy für Activities, Workflows und Child Workflows
1. Activity Retry Policy:
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self) -> str:
return await workflow.execute_activity(
my_activity,
"arg",
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=1),
maximum_attempts=5,
non_retryable_error_types=["ValidationError"]
)
)
2. Workflow Retry Policy (vom Client):
from temporalio.client import Client
client = await Client.connect("localhost:7233")
# Workflow mit Retry
handle = await client.execute_workflow(
MyWorkflow.run,
"argument",
id="workflow-id",
task_queue="my-queue",
retry_policy=RetryPolicy(
maximum_interval=timedelta(seconds=10),
maximum_attempts=3
)
)
3. Child Workflow Retry Policy:
@workflow.defn
class ParentWorkflow:
@workflow.run
async def run(self) -> str:
# Child mit Retry Policy
result = await workflow.execute_child_workflow(
ChildWorkflow.run,
"arg",
retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=2),
maximum_attempts=3
)
)
return result
Vergleichstabelle:
| Aspekt | Activity | Workflow | Child Workflow |
|---|---|---|---|
| Default Policy | Ja (unbegrenzt) | Nein | Nein |
| Konfiguration | Bei execute_activity | Beim Client-Start | Bei execute_child_workflow |
| Scope | Pro Activity-Call | Gesamte Execution | Pro Child Workflow |
| Empfehlung | Fast immer Retry | Selten Retry | Manchmal Retry |
7.3 Activity Error Handling
7.3.1 Exceptions werfen und fangen
In Activities werfen:
from temporalio import activity
from temporalio.exceptions import ApplicationError
@activity.defn
async def process_payment(amount: float, card_token: str) -> str:
"""Payment Activity mit detailliertem Error Handling"""
attempt = activity.info().attempt
activity.logger.info(
f"Processing payment (attempt {attempt})",
extra={"amount": amount, "card_token": card_token[:4] + "****"}
)
try:
# Call Payment Service
result = await payment_service.charge(amount, card_token)
return result.transaction_id
except InsufficientFundsError as e:
# Business Logic Error → Don't retry
raise ApplicationError(
"Payment declined: Insufficient funds",
type="InsufficientFundsError",
non_retryable=True,
details=[{
"amount": amount,
"available_balance": e.available_balance,
"shortfall": amount - e.available_balance
}]
) from e
except NetworkTimeoutError as e:
# Transient Network Error → Allow retry mit custom delay
delay = min(5 * attempt, 30) # 5s, 10s, 15s, ..., max 30s
raise ApplicationError(
f"Network timeout on attempt {attempt}",
type="NetworkError",
next_retry_delay=timedelta(seconds=delay)
) from e
except CardDeclinedError as e:
# Permanent Card Issue → Don't retry
raise ApplicationError(
f"Card declined: {e.reason}",
type="CardDeclinedError",
non_retryable=True,
details=[{"reason": e.reason, "card_last4": card_token[-4:]}]
) from e
except Exception as e:
# Unknown Error → Retry mit logging
activity.logger.error(
f"Unexpected error processing payment",
extra={"error_type": type(e).__name__},
exc_info=True
)
raise ApplicationError(
f"Payment processing failed: {type(e).__name__}",
type="UnexpectedError"
) from e
Im Workflow fangen:
@workflow.defn
class PaymentWorkflow:
@workflow.run
async def run(self, amount: float, card_token: str) -> dict:
"""Workflow mit umfassendem Activity Error Handling"""
try:
transaction_id = await workflow.execute_activity(
process_payment,
args=[amount, card_token],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
maximum_attempts=3,
non_retryable_error_types=[
"InsufficientFundsError",
"CardDeclinedError"
]
)
)
# Success
return {
"success": True,
"transaction_id": transaction_id
}
except ActivityError as e:
# Activity failed nach allen Retries
workflow.logger.error(
f"Payment activity failed",
extra={
"activity_type": e.activity_type,
"retry_state": e.retry_state  # activity.info() ist im Workflow-Kontext nicht verfügbar
}
)
# Zugriff auf Root Cause
if isinstance(e.cause, ApplicationError):
error_type = e.cause.type
error_message = e.cause.message
error_details = e.cause.details
workflow.logger.error(
f"Root cause: {error_type}: {error_message}",
extra={"details": error_details}
)
# Typ-spezifisches Handling
if error_type == "InsufficientFundsError":
# Kunde benachrichtigen
await workflow.execute_activity(
send_notification,
args=[{
"type": "payment_declined",
"reason": "insufficient_funds",
"amount": amount
}],
start_to_close_timeout=timedelta(seconds=10)
)
return {
"success": False,
"error": "insufficient_funds",
"details": error_details
}
elif error_type == "CardDeclinedError":
return {
"success": False,
"error": "card_declined",
"details": error_details
}
# Unbehandelte Fehler → Workflow schlägt fehl
raise ApplicationError(
f"Payment failed: {e}",
non_retryable=True
) from e
7.3.2 Heartbeats für Long-Running Activities
Heartbeats erfüllen drei kritische Funktionen:
- Progress Tracking: Signalisiert Fortschritt an Temporal Service
- Cancellation Detection: Ermöglicht Activity Cancellation
- Resumption Support: Bei Retry kann Activity von letztem Heartbeat fortfahren
Heartbeat Implementation:
import asyncio
from temporalio import activity
@activity.defn
async def long_batch_processing(total_items: int) -> str:
"""Long-running Activity mit Heartbeats"""
processed = 0
activity.logger.info(f"Starting batch processing: {total_items} items")
try:
for item_id in range(total_items):
# Check for cancellation
if activity.is_cancelled():
activity.logger.info(f"Activity cancelled after {processed} items")
raise asyncio.CancelledError("Activity cancelled by user")
# Process item
await process_single_item(item_id)
processed += 1
# Send heartbeat with progress
activity.heartbeat({
"processed": processed,
"total": total_items,
"percent": (processed / total_items) * 100,
"current_item": item_id
})
# Log progress alle 10%
            if total_items >= 10 and processed % (total_items // 10) == 0:
activity.logger.info(
f"Progress: {processed}/{total_items} "
f"({processed/total_items*100:.0f}%)"
)
activity.logger.info(f"Batch processing completed: {processed} items")
return f"Processed {processed} items successfully"
except asyncio.CancelledError:
# Cleanup bei Cancellation
await cleanup_partial_work(processed)
activity.logger.info(f"Cleaned up after processing {processed} items")
raise # Must re-raise
Resumable Activity (mit Heartbeat Details):
@activity.defn
async def resumable_batch_processing(total_items: int) -> str:
"""Activity die bei Retry von letztem Heartbeat fortfährt"""
# Check für vorherigen Fortschritt
heartbeat_details = activity.info().heartbeat_details
start_from = 0
if heartbeat_details:
# Resume von letztem Heartbeat
last_progress = heartbeat_details[0]
start_from = last_progress.get("processed", 0)
activity.logger.info(f"Resuming from item {start_from}")
else:
activity.logger.info("Starting fresh batch processing")
processed = start_from
for item_id in range(start_from, total_items):
# Process item
await process_single_item(item_id)
processed += 1
# Heartbeat mit aktuellem Fortschritt
activity.heartbeat({
"processed": processed,
"total": total_items,
"last_item_id": item_id,
"timestamp": time.time()
})
return f"Processed {processed} items (resumed from {start_from})"
Heartbeat Timeout konfigurieren:
@workflow.defn
class BatchWorkflow:
@workflow.run
async def run(self, total_items: int) -> str:
return await workflow.execute_activity(
long_batch_processing,
args=[total_items],
start_to_close_timeout=timedelta(minutes=30), # Gesamtzeit
heartbeat_timeout=timedelta(seconds=30), # Max Zeit zwischen Heartbeats
retry_policy=RetryPolicy(maximum_attempts=3)
)
Wichtige Heartbeat-Regeln:
- Throttling: Heartbeats werden automatisch gedrosselt (ca. 30-60s)
- Cancellation: Nur Activities mit heartbeat_timeout können gecancelt werden
- Resumption: Heartbeat Details persistieren über Retries
- Performance: Heartbeats nicht zu häufig senden (alle paar Sekunden reicht)
7.3.3 Activity Timeouts
Es gibt vier Activity Timeout-Typen, die verschiedene Aspekte kontrollieren:
Timeout-Übersicht:
gantt
title Activity Timeout Timeline
dateFormat ss
axisFormat %S
section Queue
Schedule :milestone, 00, 0s
Schedule-To-Start :active, 00, 05
section Execution
Start :milestone, 05, 0s
Start-To-Close :active, 05, 30
Heartbeat Interval :crit, 05, 10
Heartbeat Interval :crit, 15, 10
Heartbeat Interval :crit, 25, 10
Close :milestone, 35, 0s
section Overall
Schedule-To-Close :done, 00, 35
1. Start-To-Close Timeout (Empfohlen!)
- Maximale Zeit für einzelne Activity Task Execution
- Gilt pro Retry-Versuch
- Triggers Retry bei Überschreitung
- WICHTIG: Dieser Timeout ist stark empfohlen!
await workflow.execute_activity(
my_activity,
args=["data"],
start_to_close_timeout=timedelta(seconds=30) # Jeder Versuch max 30s
)
2. Schedule-To-Close Timeout
- Maximale Zeit für gesamte Activity Execution (inkl. aller Retries)
- Stoppt alle weiteren Retries bei Überschreitung
- Triggers KEIN Retry (Budget erschöpft)
await workflow.execute_activity(
my_activity,
args=["data"],
start_to_close_timeout=timedelta(seconds=10), # Pro Versuch
schedule_to_close_timeout=timedelta(minutes=5) # Gesamt über alle Versuche
)
3. Schedule-To-Start Timeout
- Maximale Zeit von Scheduling bis Worker Pickup
- Erkennt Worker Crashes oder Queue Congestion
- Triggers KEIN Retry (würde in gleiche Queue zurück)
await workflow.execute_activity(
my_activity,
args=["data"],
start_to_close_timeout=timedelta(seconds=30),
schedule_to_start_timeout=timedelta(minutes=5) # Max 5 Min in Queue
)
4. Heartbeat Timeout
- Maximale Zeit zwischen Activity Heartbeats
- Erforderlich für Activity Cancellation
- Triggers Retry bei Überschreitung
await workflow.execute_activity(
long_running_activity,
args=[1000],
start_to_close_timeout=timedelta(minutes=30),
heartbeat_timeout=timedelta(seconds=30) # Heartbeat alle 30s erforderlich
)
Vollständiges Beispiel:
@workflow.defn
class TimeoutDemoWorkflow:
@workflow.run
async def run(self) -> dict:
results = {}
# Scenario 1: Quick API call
results["api"] = await workflow.execute_activity(
quick_api_call,
start_to_close_timeout=timedelta(seconds=5),
schedule_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=5)
)
# Scenario 2: Long-running with heartbeat
results["batch"] = await workflow.execute_activity(
long_batch_process,
args=[1000],
start_to_close_timeout=timedelta(minutes=30),
heartbeat_timeout=timedelta(seconds=30),
schedule_to_close_timeout=timedelta(hours=2)
)
# Scenario 3: With queue monitoring
results["scalable"] = await workflow.execute_activity(
scalable_activity,
start_to_close_timeout=timedelta(seconds=10),
schedule_to_start_timeout=timedelta(minutes=5),
schedule_to_close_timeout=timedelta(minutes=10)
)
return results
Timeout vs Retry Interaktion:
| Timeout Typ | Triggers Retry? | Use Case |
|---|---|---|
| Start-To-Close | ✅ Ja | Einzelne Execution überwachen |
| Schedule-To-Close | ❌ Nein | Gesamt-Budget kontrollieren |
| Schedule-To-Start | ❌ Nein | Queue Issues erkennen |
| Heartbeat | ✅ Ja | Long-running Progress überwachen |
Best Practices:
- Immer Start-To-Close setzen (Temporal empfiehlt dies stark)
- Schedule-To-Close optional für Budget-Kontrolle
- Schedule-To-Start bei Scaling-Concerns
- Heartbeat für Long-Running (> 1 Minute)
7.3.4 Activity Cancellation
Activity Cancellation ermöglicht graceful shutdown von laufenden Activities.
Requirements:
- Activity muss Heartbeats senden
- Heartbeat Timeout muss gesetzt sein
- Activity muss asyncio.CancelledError behandeln
Cancellable Activity Implementation:
import asyncio
from temporalio import activity
@activity.defn
async def cancellable_long_operation(data_size: int) -> str:
"""Activity mit Cancellation Support"""
processed = 0
activity.logger.info(f"Starting operation: {data_size} items")
try:
while processed < data_size:
# Check Cancellation
if activity.is_cancelled():
activity.logger.info(
f"Cancellation detected at {processed}/{data_size}"
)
raise asyncio.CancelledError("Operation cancelled by user")
# Do work chunk
await process_chunk(processed, min(processed + 100, data_size))
processed += 100
# Send heartbeat (enables cancellation detection)
activity.heartbeat({
"processed": processed,
"total": data_size,
"percent": (processed / data_size) * 100
})
# Small sleep to avoid tight loop
await asyncio.sleep(0.5)
activity.logger.info("Operation completed successfully")
return f"Processed {processed} items"
except asyncio.CancelledError:
# Cleanup logic
activity.logger.info(f"Cleaning up after processing {processed} items")
await cleanup_resources(processed)
# Save state for potential resume
await save_progress(processed)
# Must re-raise to signal cancellation
raise
except Exception as e:
activity.logger.error(f"Operation failed: {e}")
await cleanup_resources(processed)
raise
Workflow-seitige Cancellation:
@workflow.defn
class CancellableWorkflow:
@workflow.run
async def run(self, data_size: int, timeout_seconds: int) -> str:
# Start activity (nicht sofort await)
activity_handle = workflow.start_activity(
cancellable_long_operation,
args=[data_size],
start_to_close_timeout=timedelta(minutes=30),
heartbeat_timeout=timedelta(seconds=30), # Required!
)
try:
# Wait mit Custom Timeout
result = await asyncio.wait_for(
activity_handle,
timeout=timeout_seconds
)
return result
except asyncio.TimeoutError:
# Timeout → Cancel Activity
workflow.logger.info(f"Timeout after {timeout_seconds}s - cancelling activity")
activity_handle.cancel()
try:
# Wait for cancellation to complete
await activity_handle
except asyncio.CancelledError:
workflow.logger.info("Activity cancelled successfully")
return "Operation timed out and was cancelled"
Client-seitige Cancellation:
from temporalio.client import Client, WorkflowFailureError
# Start Workflow
client = await Client.connect("localhost:7233")
handle = await client.start_workflow(
CancellableWorkflow.run,
args=[10000, 60],
id="cancellable-workflow-1",
task_queue="my-queue"
)
# Cancel von extern
await asyncio.sleep(30) # Nach 30 Sekunden canceln
await handle.cancel()
# Check result
try:
    result = await handle.result()
except WorkflowFailureError as e:
    # Cancellation kommt beim Client als WorkflowFailureError mit CancelledError als Cause an
    print(f"Workflow was cancelled: {e.cause}")
7.4 Workflow Error Handling
7.4.1 Try/Except in Workflows
Workflows können Activity-Fehler abfangen und behandeln:
Basic Pattern:
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> OrderResult:
inventory_id = None
payment_id = None
try:
# Step 1: Reserve Inventory
workflow.logger.info("Reserving inventory...")
inventory_id = await workflow.execute_activity(
reserve_inventory,
args=[order.items],
start_to_close_timeout=timedelta(seconds=30)
)
# Step 2: Process Payment
workflow.logger.info("Processing payment...")
payment_id = await workflow.execute_activity(
process_payment,
args=[order.payment_info, order.total],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
maximum_attempts=3,
non_retryable_error_types=["InsufficientFundsError"]
)
)
# Step 3: Ship Order
workflow.logger.info("Shipping order...")
tracking = await workflow.execute_activity(
ship_order,
args=[order.address, inventory_id],
start_to_close_timeout=timedelta(minutes=5)
)
return OrderResult(success=True, tracking=tracking)
except ActivityError as e:
workflow.logger.error(f"Order failed: {e}")
# Compensating Transactions (SAGA Pattern)
if payment_id:
workflow.logger.info("Refunding payment...")
await workflow.execute_activity(
refund_payment,
args=[payment_id],
start_to_close_timeout=timedelta(seconds=30)
)
if inventory_id:
workflow.logger.info("Releasing inventory...")
await workflow.execute_activity(
release_inventory,
args=[inventory_id],
start_to_close_timeout=timedelta(seconds=30)
)
# Determine failure reason
if isinstance(e.cause, ApplicationError):
if e.cause.type == "InsufficientFundsError":
return OrderResult(
success=False,
error="Payment declined: Insufficient funds"
)
# Re-raise für unbehandelte Fehler
raise
7.4.2 ApplicationError vs Non-Determinism Errors
ApplicationError (Bewusster Workflow-Fehler):
@workflow.defn
class ValidationWorkflow:
@workflow.run
async def run(self, data: dict) -> str:
# Business Logic Validation
if not data.get("required_field"):
# Explizites Workflow Failure
raise ApplicationError(
"Invalid workflow input: missing required_field",
type="ValidationError",
non_retryable=True,
details=[{"received_data": data}]
)
# Workflow fährt fort
return "Success"
Non-Determinism Errors (Bug im Workflow Code):
import random
import datetime
@workflow.defn
class BadWorkflow:
@workflow.run
async def run(self) -> str:
# ❌ FALSCH: Non-deterministic!
random_value = random.random() # Anders bei Replay!
now = datetime.datetime.now() # Anders bei Replay!
# ✅ RICHTIG: Temporal APIs verwenden
random_value = workflow.random().random()
now = workflow.now()
return "Success"
Unterschiede:
| Aspekt | ApplicationError | Non-Determinism Error |
|---|---|---|
| Zweck | Business Logic Failure | Code Bug |
| Ursache | Bewusster Raise | Geänderter/Non-Det Code |
| Retry | Konfigurierbar | Task Retry unendlich |
| Fix | Im Workflow-Code behandeln | Code fixen + redeploy |
| History | Execution → Failed | Task retry loop |
Non-Determinism vermeiden:
- Kein non-deterministic Code im Workflow:
  - ❌ random.random() → ✅ workflow.random()
  - ❌ datetime.now() → ✅ workflow.now()
  - ❌ uuid.uuid4() → ✅ workflow.uuid4()
  - ❌ External I/O im Workflow → in eine Activity auslagern
- Sandbox nutzen (aktiviert per Default):
from temporalio.worker import Worker
worker = Worker(
client,
task_queue="my-queue",
workflows=[MyWorkflow],
activities=[my_activity],
# Sandbox enabled by default - schützt vor Non-Determinism
)
7.4.3 Child Workflow Error Handling
Child Workflows mit Error Handling:
@workflow.defn
class ParentWorkflow:
@workflow.run
async def run(self, orders: list[Order]) -> dict:
"""Parent verarbeitet mehrere Child Workflows"""
successful = []
failed = []
for order in orders:
try:
# Execute Child Workflow
result = await workflow.execute_child_workflow(
OrderWorkflow.run,
args=[order],
retry_policy=RetryPolicy(
maximum_attempts=3,
initial_interval=timedelta(seconds=2)
),
# Parent Close Policy
parent_close_policy=ParentClosePolicy.ABANDON
)
successful.append({
"order_id": order.id,
"result": result
})
workflow.logger.info(f"Order {order.id} completed")
except ChildWorkflowError as e:
workflow.logger.error(f"Order {order.id} failed: {e}")
# Navigate nested error causes
root_cause = e.cause
while hasattr(root_cause, 'cause') and root_cause.cause:
root_cause = root_cause.cause
failed.append({
"order_id": order.id,
"error": str(e),
"root_cause": str(root_cause)
})
return {
"successful_count": len(successful),
"failed_count": len(failed),
"successful": successful,
"failed": failed
}
Parent Close Policies:
from temporalio.common import ParentClosePolicy
# Terminate child when parent closes (default)
parent_close_policy=ParentClosePolicy.TERMINATE
# Cancel child when parent closes
parent_close_policy=ParentClosePolicy.REQUEST_CANCEL
# Let child continue independently
parent_close_policy=ParentClosePolicy.ABANDON
7.5 Advanced Error Patterns
7.5.1 SAGA Pattern für Distributed Transactions
Das SAGA Pattern implementiert verteilte Transaktionen durch Compensation Actions.
Konzept:
- Jeder Schritt hat eine entsprechende Compensation (Undo)
- Bei Fehler werden alle bisherigen Schritte kompensiert
- Reihenfolge: Reverse Order (LIFO)
Vollständige SAGA Implementation:
from dataclasses import dataclass, field
from typing import Optional, List, Tuple, Callable
@dataclass
class BookingRequest:
user_id: str
car_id: str
hotel_id: str
flight_id: str
@dataclass
class BookingResult:
success: bool
car_booking: Optional[str] = None
hotel_booking: Optional[str] = None
flight_booking: Optional[str] = None
error: Optional[str] = None
# Forward Activities
@activity.defn
async def book_car(car_id: str) -> str:
"""Book car reservation"""
if "invalid" in car_id:
raise ValueError("Invalid car ID")
activity.logger.info(f"Car booked: {car_id}")
return f"car_booking_{car_id}"
@activity.defn
async def book_hotel(hotel_id: str) -> str:
"""Book hotel reservation"""
if "invalid" in hotel_id:
raise ValueError("Invalid hotel ID")
activity.logger.info(f"Hotel booked: {hotel_id}")
return f"hotel_booking_{hotel_id}"
@activity.defn
async def book_flight(flight_id: str) -> str:
"""Book flight reservation"""
if "invalid" in flight_id:
raise ValueError("Invalid flight ID")
activity.logger.info(f"Flight booked: {flight_id}")
return f"flight_booking_{flight_id}"
# Compensation Activities
@activity.defn
async def undo_book_car(booking_id: str) -> None:
"""Cancel car reservation"""
activity.logger.info(f"Cancelling car booking: {booking_id}")
await asyncio.sleep(0.5)
@activity.defn
async def undo_book_hotel(booking_id: str) -> None:
"""Cancel hotel reservation"""
activity.logger.info(f"Cancelling hotel booking: {booking_id}")
await asyncio.sleep(0.5)
@activity.defn
async def undo_book_flight(booking_id: str) -> None:
"""Cancel flight reservation"""
activity.logger.info(f"Cancelling flight booking: {booking_id}")
await asyncio.sleep(0.5)
# SAGA Workflow
@workflow.defn
class TripBookingSaga:
"""SAGA Pattern für Trip Booking"""
@workflow.run
async def run(self, request: BookingRequest) -> BookingResult:
# Track completed steps mit Compensations
compensations: List[Tuple[Callable, str]] = []
result = BookingResult(success=False)
try:
# Step 1: Book Car
workflow.logger.info("Step 1: Booking car...")
result.car_booking = await workflow.execute_activity(
book_car,
args=[request.car_id],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=3)
)
compensations.append((undo_book_car, result.car_booking))
workflow.logger.info(f"✓ Car booked: {result.car_booking}")
# Step 2: Book Hotel
workflow.logger.info("Step 2: Booking hotel...")
result.hotel_booking = await workflow.execute_activity(
book_hotel,
args=[request.hotel_id],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
maximum_attempts=3,
non_retryable_error_types=["ValueError"]
)
)
compensations.append((undo_book_hotel, result.hotel_booking))
workflow.logger.info(f"✓ Hotel booked: {result.hotel_booking}")
# Step 3: Book Flight
workflow.logger.info("Step 3: Booking flight...")
result.flight_booking = await workflow.execute_activity(
book_flight,
args=[request.flight_id],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=3)
)
compensations.append((undo_book_flight, result.flight_booking))
workflow.logger.info(f"✓ Flight booked: {result.flight_booking}")
# All steps successful!
result.success = True
workflow.logger.info("🎉 Trip booking completed successfully")
return result
except Exception as e:
# Fehler → Execute Compensations in REVERSE order
workflow.logger.error(f"❌ Booking failed: {e}. Executing compensations...")
result.error = str(e)
for compensation_activity, booking_id in reversed(compensations):
try:
await workflow.execute_activity(
compensation_activity,
args=[booking_id],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
maximum_attempts=5, # Compensations robuster!
initial_interval=timedelta(seconds=2)
)
)
workflow.logger.info(
f"✓ Compensation successful: {compensation_activity.__name__}"
)
except Exception as comp_error:
# Log aber fortfahren mit anderen Compensations
workflow.logger.error(
f"⚠ Compensation failed: {compensation_activity.__name__}: {comp_error}"
)
workflow.logger.info("All compensations completed")
return result
SAGA Flow Visualisierung:
sequenceDiagram
participant W as Workflow
participant A1 as Book Car Activity
participant A2 as Book Hotel Activity
participant A3 as Book Flight Activity
participant C3 as Undo Flight
participant C2 as Undo Hotel
participant C1 as Undo Car
W->>A1: book_car()
A1-->>W: car_booking_123
Note over W: Add undo_book_car to stack
W->>A2: book_hotel()
A2-->>W: hotel_booking_456
Note over W: Add undo_book_hotel to stack
W->>A3: book_flight()
A3--xW: Flight booking FAILED
Note over W: Execute compensations<br/>in REVERSE order
W->>C2: undo_book_hotel(456)
C2-->>W: Hotel cancelled
W->>C1: undo_book_car(123)
C1-->>W: Car cancelled
Note over W: All compensations done
SAGA Best Practices:
- Idempotenz: Alle Forward & Compensation Activities müssen idempotent sein
- Compensation Resilience: Compensations mit aggressiveren Retry Policies
- Partial Success Tracking: Genau tracken welche Steps erfolgreich waren
- Compensation Logging: Ausführliches Logging für Debugging
- State Preservation: Workflow State nutzen für SAGA Progress
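Zur ersten Regel (Idempotenz) eine minimale Skizze. Annahme: Der Stornierungs-Status liegt hier nur in einem Modul-Set; in einer echten Implementierung müsste er extern gespeichert sein (Buchungssystem oder Datenbank), damit Retries auf anderen Workern ihn ebenfalls sehen.
from temporalio import activity

# Annahme: Nur zur Illustration. In Produktion gehört dieser Status in das
# Buchungssystem oder eine Datenbank, nicht in den Worker-Prozess.
_cancelled_bookings: set[str] = set()

@activity.defn
async def undo_book_hotel_idempotent(booking_id: str) -> None:
    """Compensation, die bei beliebig vielen Retries keinen Schaden anrichtet."""
    if booking_id in _cancelled_bookings:
        # Bereits kompensiert -> nichts mehr tun (Idempotenz)
        activity.logger.info(f"Booking {booking_id} already cancelled - skipping")
        return
    # ... hier würde der eigentliche Cancel-Aufruf an das Buchungssystem gehen ...
    _cancelled_bookings.add(booking_id)
    activity.logger.info(f"Booking {booking_id} cancelled")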
7.5.2 Circuit Breaker Pattern
Der Circuit Breaker verhindert Cascade Failures, indem er Requests nach wiederholten Fehlern blockiert.
States:
- CLOSED: Normal operation (Requests gehen durch)
- OPEN: Blocking requests (Service hat Probleme)
- HALF_OPEN: Testing recovery (Einzelne Requests testen)
Implementation:
from enum import Enum
from datetime import datetime, timedelta
class CircuitState(Enum):
CLOSED = "closed"
OPEN = "open"
HALF_OPEN = "half_open"
@dataclass
class CircuitBreakerState:
state: CircuitState = CircuitState.CLOSED
failure_count: int = 0
last_failure_time: Optional[datetime] = None
success_count: int = 0
@workflow.defn
class CircuitBreakerWorkflow:
def __init__(self) -> None:
self.circuit = CircuitBreakerState()
self.failure_threshold = 5
self.timeout = timedelta(seconds=60)
self.half_open_success_threshold = 2
@workflow.run
async def run(self, requests: list[str]) -> dict:
"""Process requests mit Circuit Breaker"""
results = []
for request in requests:
try:
result = await self.call_with_circuit_breaker(request)
results.append({"request": request, "result": result, "status": "success"})
except ApplicationError as e:
results.append({"request": request, "error": str(e), "status": "failed"})
return {
"total": len(requests),
"successful": sum(1 for r in results if r["status"] == "success"),
"failed": sum(1 for r in results if r["status"] == "failed"),
"results": results
}
async def call_with_circuit_breaker(self, request: str) -> str:
"""Call mit Circuit Breaker Protection"""
# Check circuit state
if self.circuit.state == CircuitState.OPEN:
# Check timeout
time_since_failure = workflow.now() - self.circuit.last_failure_time
if time_since_failure < self.timeout:
# Circuit still open
raise ApplicationError(
f"Circuit breaker is OPEN (failures: {self.circuit.failure_count})",
type="CircuitBreakerOpen",
non_retryable=True
)
else:
# Try half-open
self.circuit.state = CircuitState.HALF_OPEN
self.circuit.success_count = 0
workflow.logger.info("Circuit breaker entering HALF_OPEN state")
# Attempt the call
try:
result = await workflow.execute_activity(
external_service_call,
args=[request],
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(maximum_attempts=1) # No retries!
)
# Success
await self.on_success()
return result
except ActivityError as e:
# Failure
await self.on_failure()
raise
async def on_success(self) -> None:
"""Handle successful call"""
if self.circuit.state == CircuitState.HALF_OPEN:
self.circuit.success_count += 1
if self.circuit.success_count >= self.half_open_success_threshold:
# Enough successes → Close circuit
self.circuit.state = CircuitState.CLOSED
self.circuit.failure_count = 0
workflow.logger.info("✓ Circuit breaker CLOSED")
elif self.circuit.state == CircuitState.CLOSED:
# Reset failure count
self.circuit.failure_count = 0
async def on_failure(self) -> None:
"""Handle failed call"""
self.circuit.failure_count += 1
self.circuit.last_failure_time = workflow.now()
if self.circuit.state == CircuitState.HALF_OPEN:
# Failure in half-open → Reopen
self.circuit.state = CircuitState.OPEN
workflow.logger.warning("⚠ Circuit breaker reopened due to failure")
elif self.circuit.failure_count >= self.failure_threshold:
# Too many failures → Open circuit
self.circuit.state = CircuitState.OPEN
workflow.logger.warning(
f"⚠ Circuit breaker OPENED after {self.circuit.failure_count} failures"
)
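Die oben verwendete Activity external_service_call ist im Text nicht definiert. Eine minimale Stub-Skizze (Annahme: ein simulierter externer Service, der zufällig fehlschlägt) könnte so aussehen; random ist in Activities erlaubt, nur Workflow-Code muss deterministisch sein:
import random
from temporalio import activity

@activity.defn
async def external_service_call(request: str) -> str:
    """Stub: simuliert einen externen Service, der gelegentlich fehlschlägt."""
    # Annahme: 30% Fehlerquote, rein für Demo-Zwecke
    if random.random() < 0.3:
        raise RuntimeError(f"External service error for request {request!r}")
    return f"response_for_{request}"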
Circuit Breaker State Machine:
stateDiagram-v2
[*] --> Closed: Initial State
Closed --> Open: Failure threshold reached
Closed --> Closed: Success (reset counter)
Open --> HalfOpen: Timeout elapsed
Open --> Open: Requests blocked
HalfOpen --> Closed: Success threshold reached
HalfOpen --> Open: Any failure
note right of Closed
Normal operation
Track failures
end note
note right of Open
Block all requests
Wait for timeout
end note
note right of HalfOpen
Test recovery
Limited requests
end note
7.5.3 Dead Letter Queue Pattern
Das DLQ Pattern routet dauerhaft fehlschlagende Items in eine separate Queue zur manuellen Nachbearbeitung.
@dataclass
class ProcessingItem:
id: str
data: str
retry_count: int = 0
errors: list[str] = field(default_factory=list)
@workflow.defn
class DLQWorkflow:
"""Workflow mit Dead Letter Queue Pattern"""
def __init__(self) -> None:
self.max_retries = 3
self.dlq_items: List[ProcessingItem] = []
@workflow.run
async def run(self, items: list[ProcessingItem]) -> dict:
successful = []
failed = []
for item in items:
try:
result = await self.process_with_dlq(item)
successful.append({"id": item.id, "result": result})
except ApplicationError as e:
workflow.logger.error(f"Item {item.id} sent to DLQ: {e}")
failed.append(item.id)
# Send DLQ items to persistent storage
if self.dlq_items:
await self.send_to_dlq(self.dlq_items)
return {
"successful": len(successful),
"failed": len(failed),
"dlq_count": len(self.dlq_items),
"results": successful
}
async def process_with_dlq(self, item: ProcessingItem) -> str:
"""Process mit DLQ fallback"""
while item.retry_count < self.max_retries:
try:
# Attempt processing
result = await workflow.execute_activity(
process_item,
args=[item.data],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=1)
)
return result
except ActivityError as e:
item.retry_count += 1
item.errors.append(str(e))
if item.retry_count < self.max_retries:
# Exponential backoff
wait_time = 2 ** item.retry_count
                    await asyncio.sleep(wait_time)  # asyncio.sleep erwartet Sekunden, keine timedelta
workflow.logger.warning(
f"Retrying item {item.id} (attempt {item.retry_count})"
)
else:
# Max retries → DLQ
workflow.logger.error(
f"Item {item.id} failed {item.retry_count} times - sending to DLQ"
)
self.dlq_items.append(item)
raise ApplicationError(
f"Item {item.id} sent to DLQ after {item.retry_count} failures",
type="MaxRetriesExceeded",
details=[{
"item_id": item.id,
"retry_count": item.retry_count,
"errors": item.errors
}]
)
raise ApplicationError("Unexpected state")
async def send_to_dlq(self, items: List[ProcessingItem]) -> None:
"""Send items to Dead Letter Queue"""
await workflow.execute_activity(
write_to_dlq,
args=[items],
start_to_close_timeout=timedelta(seconds=60),
retry_policy=RetryPolicy(
maximum_attempts=5, # DLQ writes must be reliable!
initial_interval=timedelta(seconds=5)
)
)
workflow.logger.info(f"✓ Sent {len(items)} items to DLQ")
@activity.defn
async def write_to_dlq(items: List[ProcessingItem]) -> None:
"""Write failed items to DLQ storage"""
for item in items:
activity.logger.error(
f"DLQ item: {item.id}",
extra={
"retry_count": item.retry_count,
"errors": item.errors,
"data": item.data
}
)
# Write to database, SQS, file, etc.
await dlq_storage.write(item)
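Der Aufruf dlq_storage.write() oben ist bewusst offen gelassen. Eine rein hypothetische Minimal-Skizze (Annahme: eine JSON-Lines-Datei als DLQ-Speicher; in Produktion eher Datenbank oder SQS) könnte so aussehen:
import asyncio
import dataclasses
import json

class FileDLQStorage:
    """Hypothetischer DLQ-Speicher: hängt Items als JSON-Zeilen an eine Datei an."""

    def __init__(self, path: str = "dlq_items.jsonl") -> None:
        self._path = path

    async def write(self, item: ProcessingItem) -> None:
        line = json.dumps(dataclasses.asdict(item))
        # Blocking File-I/O in einen Thread auslagern; in Activities ist I/O erlaubt
        await asyncio.to_thread(self._append, line)

    def _append(self, line: str) -> None:
        with open(self._path, "a", encoding="utf-8") as f:
            f.write(line + "\n")

dlq_storage = FileDLQStorage()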
7.6 Zusammenfassung
Kernkonzepte Error Handling:
- Activity vs Workflow Errors: Activities haben Default Retries, Workflows nicht
- Exception-Hierarchie: ApplicationError für bewusste Fehler, ActivityError als Wrapper
- Retry Policies: Deklarative Konfiguration mit Exponential Backoff
- Timeouts: 4 Activity-Timeouts, 3 Workflow-Timeouts
- Advanced Patterns: SAGA für Compensations, Circuit Breaker für Cascades, DLQ für Persistent Failures
Best Practices Checkliste:
- ✅ Start-To-Close Timeout immer setzen
- ✅ Non-Retryable Errors explizit markieren
- ✅ Idempotenz für alle Activities implementieren
- ✅ SAGA Pattern für Distributed Transactions
- ✅ Heartbeats für Long-Running Activities
- ✅ Circuit Breaker bei externen Services
- ✅ DLQ für persistierende Failures
- ✅ Ausführliches Error Logging mit Context
- ✅ Replay Tests für Non-Determinism
- ✅ Monitoring und Alerting
Im nächsten Kapitel (Kapitel 8) werden wir uns mit Workflow Evolution und Versioning beschäftigen - wie Sie Workflows sicher ändern können während sie laufen.
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 8: Workflow Evolution und Versioning
Code-Beispiele für dieses Kapitel: examples/part-03/chapter-07/
Kapitel 8: Workflow Evolution und Versioning
Einleitung
Eine der größten Herausforderungen in verteilten Systemen ist die Evolution von langlebigem Code. Während traditionelle Web-Services einfach neu deployed werden können, laufen Temporal Workflows oft über Tage, Wochen, Monate oder sogar Jahre. Was passiert, wenn Sie den Code ändern müssen, während tausende Workflows noch laufen?
Temporal löst dieses Problem durch ein ausgeklügeltes Versioning-System, das Determinismus erhält und gleichzeitig Code-Evolution ermöglicht. Ohne Versioning würden Code-Änderungen laufende Workflows brechen. Mit Versioning können Sie sicher deployen, Features hinzufügen und Bugs fixen – ohne existierende Executions zu gefährden.
Das Grundproblem
Scenario: Sie haben 10,000 laufende Order-Workflows. Jeder läuft 30 Tage. Sie müssen einen zusätzlichen Fraud-Check hinzufügen.
Ohne Versioning:
# Alter Code
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
return Result(payment=payment)
# Neuer Code - deployed auf laufende Workflows
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# NEU: Fraud Check hinzugefügt
fraud = await workflow.execute_activity(check_fraud, ...) # ❌ BREAKS REPLAY!
return Result(payment=payment)
Problem: Wenn ein alter Workflow replayed wird, erwartet das System die gleiche Befehlsfolge. Die neue Activity check_fraud existiert aber nicht in der History → Non-Determinism Error.
Mit Versioning:
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# Versioning: Neue Workflows nutzen neuen Code, alte den alten
if workflow.patched("add-fraud-check"):
fraud = await workflow.execute_activity(check_fraud, ...) # ✅ SAFE!
return Result(payment=payment)
Jetzt können beide Versionen parallel laufen!
Lernziele
Nach diesem Kapitel können Sie:
- Verstehen warum Determinismus Versioning erforderlich macht
- Die Patching API verwenden (workflow.patched())
- Worker Versioning mit Build IDs implementieren
- Sichere vs. unsichere Code-Änderungen identifizieren
- Replay Tests schreiben
- Migrations-Patterns für Breaking Changes anwenden
- Version Sprawl vermeiden
- Workflows sicher über Jahre hinweg weiterentwickeln
8.1 Versioning Fundamentals
8.1.1 Determinismus und Replay
Was ist Determinismus?
Ein Workflow ist deterministisch, wenn jede Execution bei gleichem Input die gleichen Commands in der gleichen Reihenfolge produziert. Diese Eigenschaft ermöglicht:
- Workflow Replay nach Worker Crashes
- Lange schlafende Workflows (Monate/Jahre)
- Zuverlässige Workflow-Relocation zwischen Workers
- State-Rekonstruktion aus Event History
Wie Replay funktioniert:
sequenceDiagram
participant History as Event History
participant Worker as Worker
participant Code as Workflow Code
Note over History: Workflow executed<br/>Commands recorded
Worker->>History: Fetch Events
History-->>Worker: Return Event 1-N
Worker->>Code: Execute run()
Code->>Code: Generate Commands
Worker->>Worker: Compare:<br/>Commands vs Events
alt Commands match Events
Worker->>Worker: ✓ Replay successful
else Commands differ
Worker->>Worker: ✗ Non-Determinism Error
end
Kritischer Punkt: Das System führt die Commands aus der History nicht erneut aus, sondern verwendet die aufgezeichneten Results, um den State zu rekonstruieren. Wenn der neue Code eine andere Befehlsfolge produziert → Fehler.
Beispiel: Non-Determinism Error
# Version 1 (deployed, 1000 workflows running)
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
# Command 1: ScheduleActivityTask (process_payment)
payment = await workflow.execute_activity(process_payment, ...)
# Command 2: CompleteWorkflowExecution
return Result(payment=payment)
# Version 2 (deployed while v1 workflows still running)
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
# Command 1: ScheduleActivityTask (validate_order) ← NEU!
await workflow.execute_activity(validate_order, ...)
# Command 2: ScheduleActivityTask (process_payment)
payment = await workflow.execute_activity(process_payment, ...)
# Command 3: CompleteWorkflowExecution
return Result(payment=payment)
Was passiert beim Replay:
Event History (v1):
Event 1: WorkflowExecutionStarted
Event 2: ActivityTaskScheduled (process_payment)
Event 3: ActivityTaskCompleted
Event 4: WorkflowExecutionCompleted
Replay mit v2 Code:
✗ Erwartet: ScheduleActivityTask (validate_order)
✗ Gefunden: ActivityTaskScheduled (process_payment)
→ NondeterminismError!
8.1.2 Warum Versioning komplex ist
Langlebigkeit von Workflows:
gantt
title Workflow Lifetime vs Code Changes
dateFormat YYYY-MM-DD
axisFormat %b %d
section Workflow 1
Running :w1, 2025-01-01, 30d
section Workflow 2
Running :w2, 2025-01-15, 30d
section Code Deployments
Version 1.0 :milestone, v1, 2025-01-01, 0d
Version 1.1 :milestone, v2, 2025-01-10, 0d
Version 1.2 :milestone, v3, 2025-01-20, 0d
Version 2.0 :milestone, v4, 2025-01-30, 0d
Workflow 1 durchlebt 4 Code-Versionen während seiner Laufzeit!
Herausforderungen:
- Backwards Compatibility: Die neue Code-Version muss alte Workflows replayen können
- Version Sprawl: Zu viele Versionen → Code-Komplexität
- Testing: Replay-Tests für alle Versionen
- Cleanup: Wann können alte Versionen entfernt werden?
- Documentation: Welche Version macht was?
8.1.3 Drei Versioning-Ansätze
Temporal bietet drei Hauptstrategien:
1. Patching API (Code-Level Versioning)
if workflow.patched("my-change"):
# Neuer Code-Pfad
await new_implementation()
else:
# Alter Code-Pfad
await old_implementation()
Vorteile:
- Granulare Kontrolle
- Beide Pfade im gleichen Code
- Funktioniert sofort
Nachteile:
- Code-Komplexität wächst
- Manuelle Verwaltung
- Version Sprawl bei vielen Changes
2. Worker Versioning (Infrastructure-Level)
worker = Worker(
client,
task_queue="orders",
workflows=[OrderWorkflow],
deployment_config=WorkerDeploymentConfig(
deployment_name="order-service",
build_id="v2.0.0", # Version identifier
)
)
Vorteile:
- Saubere Code-Trennung
- Automatisches Routing
- Gradual Rollout möglich
Nachteile:
- Infrastruktur-Overhead (mehrere Worker-Pools)
- Noch in Public Preview
- Komplexere Deployments
3. Workflow-Name Versioning (Cutover)
@workflow.defn(name="ProcessOrder_v2")
class ProcessOrderWorkflowV2:
# Völlig neue Implementation
pass
# Alter Workflow bleibt für Kompatibilität
@workflow.defn(name="ProcessOrder")
class ProcessOrderWorkflowV1:
# Legacy code
pass
Vorteile:
- Klare Trennung
- Einfach zu verstehen
- Keine Patching-Logik
Nachteile:
- Code-Duplizierung
- Kann laufende Workflows nicht versionieren
- Client-Code muss updaten
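Zum letzten Punkt ("Client-Code muss updaten") eine kleine Skizze. Annahmen: Task Queue und Workflow-ID sind frei gewählte Beispielwerte; nach dem Cutover muss der Client explizit den neuen Workflow-Namen starten:
from temporalio.client import Client

async def start_order_v2(order: dict) -> None:
    client = await Client.connect("localhost:7233")
    # Vorher wurde "ProcessOrder" gestartet; nach dem Cutover zeigt der Client auf die neue Version
    await client.start_workflow(
        "ProcessOrder_v2",            # neuer Workflow-Name
        order,
        id="order-v2-12345",          # Annahme: Beispiel-ID
        task_queue="orders",          # Annahme: Beispiel-Task-Queue
    )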
Wann welchen Ansatz?
| Scenario | Empfohlener Ansatz |
|---|---|
| Kleine Änderungen, wenige Versionen | Patching API |
| Häufige Updates, viele Versionen | Worker Versioning |
| Komplettes Redesign | Workflow-Name Versioning |
| < 10 laufende Workflows | Workflow-Name Versioning |
| Breaking Changes in Datenstruktur | Worker Versioning |
8.2 Patching API
Der Python SDK nutzt workflow.patched() für Code-Level Versioning.
8.2.1 Grundlagen
API:
from temporalio import workflow
# Signatur: workflow.patched(patch_id: str) -> bool
if workflow.patched("my-patch-id"):
# Neuer Code-Pfad
pass
else:
# Alter Code-Pfad
pass
Verhalten:
| Situation | Rückgabewert | Reason |
|---|---|---|
| Erste Execution (neu) | True | Marker wird hinzugefügt, neuer Code läuft |
| Replay MIT Marker | True | Marker in History, neuer Code läuft |
| Replay OHNE Marker | False | Alter Workflow, alter Code läuft |
Beispiel:
from temporalio import workflow
from datetime import timedelta
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
# Payment verarbeiten
payment = await workflow.execute_activity(
process_payment,
args=[order],
start_to_close_timeout=timedelta(minutes=5),
)
# Patch: Fraud Check hinzufügen
if workflow.patched("add-fraud-check-v1"):
# Neuer Code-Pfad (nach Deployment)
workflow.logger.info("Running fraud check (new version)")
fraud_result = await workflow.execute_activity(
check_fraud,
args=[order, payment],
start_to_close_timeout=timedelta(minutes=2),
)
if not fraud_result.is_safe:
raise FraudDetectedError(f"Fraud detected: {fraud_result.reason}")
else:
# Alter Code-Pfad (für Replay alter Workflows)
workflow.logger.info("Skipping fraud check (old version)")
return Result(payment=payment)
Was passiert:
flowchart TD
A[Workflow run] --> B{patched aufgerufen}
B --> C{Marker in History?}
C -->|Marker vorhanden| D[Return True<br/>Neuer Code]
C -->|Kein Marker| E{Erste Execution?}
E -->|Ja| F[Marker hinzufügen<br/>Return True<br/>Neuer Code]
E -->|Nein = Replay| G[Return False<br/>Alter Code]
D --> H[Workflow fährt fort]
F --> H
G --> H
style D fill:#90EE90
style F fill:#90EE90
style G fill:#FFB6C1
8.2.2 Drei-Schritte-Prozess
Schritt 1: Patch einführen
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# STEP 1: Patch mit if/else
if workflow.patched("add-fraud-check-v1"):
fraud = await workflow.execute_activity(check_fraud, ...)
# Else-Block leer = alter Code macht nichts
return Result(payment=payment)
Deployment: Alle neuen Workflows nutzen Fraud Check, alte nicht.
Schritt 2: Patch deprecaten
Nachdem alle alten Workflows abgeschlossen sind:
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# STEP 2: deprecate_patch() + nur neuer Code
workflow.deprecate_patch("add-fraud-check-v1")
fraud = await workflow.execute_activity(check_fraud, ...)
return Result(payment=payment)
Zweck von deprecate_patch():
- Fügt Marker hinzu OHNE Replay zu brechen
- Erlaubt Entfernung des if/else
- Brücke zwischen Patching und Clean Code
Schritt 3: Patch entfernen
Nachdem die Retention Period abgelaufen ist:
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# STEP 3: Clean code - kein Versioning mehr
fraud = await workflow.execute_activity(check_fraud, ...)
return Result(payment=payment)
Timeline:
Tag 0: Deploy Patch (Step 1)
- Neue Workflows: Fraud Check
- Alte Workflows: Kein Fraud Check
Tag 30: Alle alten Workflows abgeschlossen
- Verify: Keine laufenden Workflows ohne Patch
Tag 31: Deploy deprecate_patch (Step 2)
- Code hat nur noch neuen Pfad
- Kompatibel mit alter History
Tag 61: Retention Period abgelaufen
- Alte Histories gelöscht
Tag 68: Remove Patch (Step 3 + Safety Margin)
- Clean Code ohne Versioning Calls
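Für den Schritt "Verify: Keine laufenden Workflows ohne Patch" (Tag 30) eine minimale Skizze. Annahmen: Workflow-Typ, Adresse und das Deploy-Datum sind Platzhalter; als Näherung werden laufende Workflows gezählt, die vor dem Patch-Deployment gestartet wurden:
import asyncio
from temporalio.client import Client

async def count_pre_patch_workflows() -> int:
    """Zählt laufende OrderWorkflows, die vor dem Patch-Deployment gestartet wurden."""
    client = await Client.connect("localhost:7233")
    query = (
        'WorkflowType="OrderWorkflow" AND ExecutionStatus="Running" '
        'AND StartTime < "2025-01-01T00:00:00Z"'  # Annahme: Datum des Patch-Deployments
    )
    count = 0
    async for _ in client.list_workflows(query=query):
        count += 1
    return count

if __name__ == "__main__":
    remaining = asyncio.run(count_pre_patch_workflows())
    print(f"Noch laufende Pre-Patch-Workflows: {remaining}")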
8.2.3 Mehrere Patches / Nested Patches
Pattern für multiple Versionen:
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# Version 3 (neueste)
if workflow.patched("add-notifications-v3"):
await self._send_notifications(order, payment)
fraud = await workflow.execute_activity(check_fraud_v2, ...)
# Version 2
elif workflow.patched("add-fraud-check-v2"):
fraud = await workflow.execute_activity(check_fraud_v2, ...)
# Version 1 (älteste)
else:
# Original code - kein Fraud Check
pass
return Result(payment=payment)
Wichtige Regel: Neuesten Code ZUERST (top of if-block).
Warum? Frische Executions sollen immer die neueste Version nutzen. Wenn ältere Versionen zuerst geprüft werden, könnte eine neue Execution fälschlicherweise einen älteren Pfad nehmen.
Beispiel - FALSCH:
# ✗ FALSCH: Alte Version zuerst
if workflow.patched("v1"):
# Version 1 code
elif workflow.patched("v2"):
# Version 2 code - neue Executions könnten v1 nehmen!
Beispiel - RICHTIG:
# ✓ RICHTIG: Neue Version zuerst
if workflow.patched("v2"):
# Version 2 code - neue Executions nehmen diesen
elif workflow.patched("v1"):
# Version 1 code
else:
# Version 0 (original)
8.2.4 Best Practices für Patch IDs
Gute Naming Conventions:
# ✓ GUT: Beschreibend + Versionnummer
workflow.patched("add-fraud-check-v1")
workflow.patched("change-payment-params-v2")
workflow.patched("remove-legacy-validation-v1")
# ✓ GUT: Datum für Tracking
workflow.patched("refactor-2025-01-15")
# ✓ GUT: Ticket-Referenz
workflow.patched("JIRA-1234-add-validation")
# ✗ SCHLECHT: Nicht beschreibend
workflow.patched("patch1")
workflow.patched("fix")
workflow.patched("update")
# ✗ SCHLECHT: Keine Version Info
workflow.patched("add-fraud-check") # Was wenn wir v2 brauchen?
Dokumentation im Code:
@workflow.defn
class OrderWorkflow:
"""
Order Processing Workflow.
Versioning History:
- v1 (2024-01-01): Initial implementation
- v2 (2024-06-15): Added fraud check
Patch: "add-fraud-check-v1"
Deployed: 2024-06-15
Deprecated: 2024-08-01
Removed: 2024-10-01
- v3 (2024-09-01): Multi-currency support
Patch: "multi-currency-v1"
Deployed: 2024-09-01
Status: ACTIVE
"""
@workflow.run
async def run(self, order: Order) -> Result:
# Patch: add-fraud-check-v1
# Added: 2024-06-15
# Status: REMOVED (2024-10-01)
# All workflows now have fraud check
fraud = await workflow.execute_activity(check_fraud, ...)
# Patch: multi-currency-v1
# Added: 2024-09-01
# Status: ACTIVE
if workflow.patched("multi-currency-v1"):
currency = order.currency
else:
currency = "USD" # Default für alte Workflows
# ... rest of workflow
8.3 Sichere vs. Unsichere Code-Änderungen
8.3.1 Was kann OHNE Versioning geändert werden?
Kategorie 1: Activity Implementation
# ✓ SICHER: Activity-Logik ändern
@activity.defn
async def process_payment(payment: Payment) -> Receipt:
# ALLE Änderungen hier sind safe:
# - Database Schema ändern
# - API Endpoints ändern
# - Error Handling anpassen
# - Business Logic updaten
# - Performance optimieren
pass
Warum sicher? Activities werden außerhalb des Replay-Mechanismus ausgeführt. Nur das Result wird in der History gespeichert, nicht die Logic.
Kategorie 2: Workflow Logging
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self) -> None:
workflow.logger.info("Starting") # ✓ SAFE to add/remove/change
result = await workflow.execute_activity(...)
workflow.logger.debug(f"Result: {result}") # ✓ SAFE
Warum sicher? Logging erzeugt keine Events in der History.
Kategorie 3: Query Handler (read-only)
@workflow.defn
class MyWorkflow:
@workflow.query
def get_status(self) -> str: # ✓ SAFE to add
return self._status
@workflow.query
def get_progress(self) -> dict: # ✓ SAFE to modify
return {"processed": self._processed, "total": self._total}
Kategorie 4: Signal Handler hinzufügen
@workflow.defn
class MyWorkflow:
@workflow.signal
async def new_signal(self, data: str) -> None: # ✓ SAFE wenn noch nie gesendet
self._data = data
Wichtig: Nur safe wenn das Signal noch nie gesendet wurde!
Kategorie 5: Dataclass Fields mit Defaults
@dataclass
class WorkflowInput:
name: str
age: int
email: str = "" # ✓ SAFE to add with default
phone: str = "" # ✓ SAFE to add with default
Forward Compatible: Payloads alter Workflows lassen sich auch mit der neuen Dataclass-Version deserialisieren.
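Eine kleine Skizze dazu (Annahme: der Standard-JSON-Konverter; Feldnamen sind Beispielwerte): ein Payload eines alten Workflows ohne die neuen Felder lässt sich weiterhin in die erweiterte Dataclass laden, weil die Defaults greifen.
import json
from dataclasses import dataclass

@dataclass
class WorkflowInput:
    name: str
    age: int
    email: str = ""   # neu, mit Default
    phone: str = ""   # neu, mit Default

# Payload, wie ihn ein alter Workflow serialisiert hätte (ohne die neuen Felder)
old_payload = json.loads('{"name": "Alice", "age": 30}')

# Die Defaults füllen die fehlenden Felder auf
restored = WorkflowInput(**old_payload)
print(restored)  # WorkflowInput(name='Alice', age=30, email='', phone='')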
8.3.2 Was BRICHT Determinismus?
Kategorie 1: Activity Calls hinzufügen/entfernen
# VORHER
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self) -> None:
result1 = await workflow.execute_activity(activity1, ...)
result2 = await workflow.execute_activity(activity2, ...)
# NACHHER - ❌ BREAKS DETERMINISM
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self) -> None:
result1 = await workflow.execute_activity(activity1, ...)
# activity2 entfernt - BREAKS REPLAY
result3 = await workflow.execute_activity(activity3, ...) # Neu - BREAKS REPLAY
Warum broken? Event History erwartet ScheduleActivityTask für activity2, bekommt aber activity3.
Kategorie 2: Activity Reihenfolge ändern
# VORHER
result1 = await workflow.execute_activity(activity1, ...)
result2 = await workflow.execute_activity(activity2, ...)
# NACHHER - ❌ BREAKS DETERMINISM
result2 = await workflow.execute_activity(activity2, ...) # Reihenfolge getauscht
result1 = await workflow.execute_activity(activity1, ...)
Kategorie 3: Activity Parameter ändern
# VORHER
await workflow.execute_activity(
process_order,
args=[order_id, customer_id], # 2 Parameter
...
)
# NACHHER - ❌ BREAKS DETERMINISM
await workflow.execute_activity(
process_order,
args=[order_id, customer_id, payment_method], # 3 Parameter
...
)
Kategorie 4: Conditional Logic ändern
# VORHER
if amount > 100:
await workflow.execute_activity(large_order, ...)
else:
await workflow.execute_activity(small_order, ...)
# NACHHER - ❌ BREAKS DETERMINISM
if amount > 500: # Threshold geändert
await workflow.execute_activity(large_order, ...)
Bei Replay: Ein Workflow mit amount=300 nahm vorher den large_order Pfad, jetzt nimmt er small_order → Non-Determinism.
Kategorie 5: Sleep/Timer ändern
# VORHER
await asyncio.sleep(300) # 5 Minuten
# NACHHER - ❌ BREAKS DETERMINISM
await asyncio.sleep(600) # 10 Minuten - anderer Timer
# Oder Timer entfernen
Kategorie 6: Non-Deterministic Functions
# ❌ FALSCH - Non-Deterministic
import random
import datetime
import uuid
@workflow.defn
class BadWorkflow:
@workflow.run
async def run(self) -> None:
random_val = random.randint(1, 100) # ❌ WRONG
current_time = datetime.datetime.now() # ❌ WRONG
unique_id = str(uuid.uuid4()) # ❌ WRONG
Beim Replay: Unterschiedliche Werte → unterschiedliche Commands → Non-Determinism.
8.3.3 Deterministische Alternativen
Python SDK Deterministic APIs:
from temporalio import workflow
@workflow.defn
class DeterministicWorkflow:
@workflow.run
async def run(self) -> None:
# ✓ RICHTIG - Deterministic time
current_time = workflow.now()
timestamp_ns = workflow.time_ns()
# ✓ RICHTIG - Deterministic random
rng = workflow.random()
random_number = rng.randint(1, 100)
random_float = rng.random()
# ✓ RICHTIG - UUID via Activity
unique_id = await workflow.execute_activity(
generate_uuid,
schedule_to_close_timeout=timedelta(seconds=5),
)
# ✓ RICHTIG - Deterministic logging
workflow.logger.info(f"Processing at {current_time}")
@activity.defn
async def generate_uuid() -> str:
"""Activities können non-deterministisch sein"""
return str(uuid.uuid4())
Warum funktioniert das?
- workflow.now(): Liefert deterministische Zeit aus der Event History (konstant bei Replay)
- workflow.random(): Seeded RNG basierend auf der History
- Activities: Results kommen aus der History und werden nicht neu ausgeführt
Decision Matrix:
flowchart TD
A[Code-Änderung geplant] --> B{Ändert Commands<br/>oder deren Reihenfolge?}
B -->|Nein| C[✓ SAFE<br/>Ohne Versioning]
B -->|Ja| D{Nur Activity<br/>Implementation?}
D -->|Ja| C
D -->|Nein| E[❌ UNSAFE<br/>Versioning erforderlich]
C --> F[Deploy direkt]
E --> G[Patching API<br/>oder<br/>Worker Versioning]
8.3.4 Safe Change Pattern mit Versioning
Beispiel: Activity hinzufügen
# Schritt 1: Original Code
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
return Result(payment=payment)
# Schritt 2: Mit Patching neue Activity hinzufügen
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# Patch: Neue Activity
if workflow.patched("add-fraud-check-v1"):
fraud = await workflow.execute_activity(check_fraud, ...)
if not fraud.is_safe:
raise FraudDetectedError()
return Result(payment=payment)
# Schritt 3: Später deprecate
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
workflow.deprecate_patch("add-fraud-check-v1")
fraud = await workflow.execute_activity(check_fraud, ...)
if not fraud.is_safe:
raise FraudDetectedError()
return Result(payment=payment)
# Schritt 4: Schließlich clean code
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
payment = await workflow.execute_activity(process_payment, ...)
# Clean code - Fraud Check ist Standard
fraud = await workflow.execute_activity(check_fraud, ...)
if not fraud.is_safe:
raise FraudDetectedError()
return Result(payment=payment)
8.4 Worker Versioning (Build IDs)
Worker Versioning ist der moderne Ansatz für Workflow-Versionierung (Public Preview, GA erwartet Q4 2025).
8.4.1 Konzepte
Build ID: Eindeutiger Identifier für eine Worker-Version
from temporalio.worker import Worker, WorkerDeploymentConfig
worker = Worker(
client,
task_queue="orders",
workflows=[OrderWorkflow],
activities=[process_payment, check_fraud],
deployment_config=WorkerDeploymentConfig(
deployment_name="order-service",
build_id="v1.5.2", # Semantic versioning
)
)
Workflow Pinning: Workflow bleibt auf ursprünglicher Worker-Version
Vorteile:
- Eliminiert Code-Level Patching
- Workflows können Breaking Changes enthalten
- Einfacheres Code-Management
Nachteile:
- Mehrere Worker-Pools müssen parallel betrieben werden
- Alte Versionen bleiben aktiv, bis alle darauf gepinnten Workflows abgeschlossen sind
8.4.2 Deployment Strategies
Blue-Green Deployment:
graph LR
A[Traffic] --> B{Router}
B -->|100%| C[Blue Workers<br/>v1.0.0]
B -.->|0%| D[Green Workers<br/>v2.0.0]
style C fill:#87CEEB
style D fill:#90EE90
Nach Cutover:
graph LR
A[Traffic] --> B{Router}
B -.->|0%| C[Blue Workers<br/>v1.0.0<br/>Draining]
B -->|100%| D[Green Workers<br/>v2.0.0]
style C fill:#FFB6C1
style D fill:#90EE90
Eigenschaften:
- Zwei simultane Versionen
- Klarer Cutover-Point
- Instant Rollback möglich
- Einfach zu verstehen
Rainbow Deployment:
graph TD
A[Traffic] --> B{Router}
B -->|Pinned| C[v1.0.0<br/>Draining]
B -->|5%| D[v1.5.0<br/>Active]
B -->|90%| E[v2.0.0<br/>Current]
B -->|5%| F[v2.1.0<br/>Ramping]
style C fill:#FFB6C1
style D fill:#FFD700
style E fill:#90EE90
style F fill:#87CEEB
Eigenschaften:
- Viele simultane Versionen
- Graduelle Migration
- Workflow Pinning optimal
- Komplexer aber flexibler
8.4.3 Gradual Rollout
Ramp Percentages:
# Start: 1% Traffic zu neuer Version
temporal task-queue versioning insert-assignment-rule \
--task-queue orders \
--build-id v2.0.0 \
--percentage 1
# Monitoring...
# Error rate OK? → Increase
# 5% Traffic
temporal task-queue versioning insert-assignment-rule \
--task-queue orders \
--build-id v2.0.0 \
--percentage 5
# 25% Traffic
temporal task-queue versioning insert-assignment-rule \
--task-queue orders \
--build-id v2.0.0 \
--percentage 25
# 100% Traffic
temporal task-queue versioning insert-assignment-rule \
--task-queue orders \
--build-id v2.0.0 \
--percentage 100
Ablauf:
1% → Monitor 1 day → 5% → Monitor 1 day → 25% → Monitor → 100%
Was wird monitored:
- Error Rate
- Latency
- Completion Rate
- Activity Failures
8.4.4 Version Lifecycle States
stateDiagram-v2
[*] --> Inactive: Deploy new version
Inactive --> Active: Set as target
Active --> Ramping: Set ramp %
Ramping --> Current: Set to 100%
Current --> Draining: New version deployed
Draining --> Drained: All workflows complete
Drained --> [*]: Decommission workers
note right of Inactive
Version existiert
Kein Traffic
end note
note right of Ramping
X% neuer Traffic
Testing rollout
end note
note right of Current
100% neuer Traffic
Production version
end note
note right of Draining
Nur pinned workflows
Keine neuen workflows
end note
8.4.5 Python Worker Configuration
from temporalio.common import WorkerDeploymentVersion
from temporalio.worker import Worker, WorkerDeploymentConfig
async def create_worker(build_id: str):
"""Create worker with versioning"""
client = await Client.connect("localhost:7233")
worker = Worker(
client,
task_queue="orders",
workflows=[OrderWorkflowV2], # Neue Version
activities=[process_payment_v2, check_fraud_v2],
deployment_config=WorkerDeploymentConfig(
deployment_name="order-service",
build_id=build_id,
),
# Optional: Max concurrent workflows
max_concurrent_workflow_tasks=100,
)
return worker
# Deploy v1 workers
worker_v1 = await create_worker("v1.5.2")
# Deploy v2 workers (parallel)
worker_v2 = await create_worker("v2.0.0")
# Beide Workers laufen parallel
await asyncio.gather(
worker_v1.run(),
worker_v2.run(),
)
8.5 Testing Versioned Workflows
8.5.1 Replay Testing
Zweck: Verifizieren, dass neuer Code mit existierenden Workflow Histories kompatibel ist.
Basic Replay Test:
import json
import pytest
from temporalio.client import WorkflowHistory
from temporalio.worker import Replayer
from workflows import OrderWorkflow
@pytest.mark.asyncio
async def test_replay_workflow_history():
"""Test dass neuer Code alte Histories replyen kann"""
# History von Production laden
with open("tests/histories/order_workflow_history.json") as f:
history_json = json.load(f)
# Replayer mit NEUEM Workflow-Code
replayer = Replayer(workflows=[OrderWorkflow])
# Replay - wirft Exception bei Non-Determinism
await replayer.replay_workflow(
WorkflowHistory.from_json("test-workflow-id", history_json)
)
# Test passed = Neuer Code ist kompatibel!
History von Production fetchen:
# CLI: History als JSON exportieren
temporal workflow show \
--workflow-id order-12345 \
--namespace production \
--output json > workflow_history.json
Programmatisch:
from temporalio.client import Client
async def fetch_workflow_history(workflow_id: str) -> dict:
"""Fetch history für Replay Testing"""
client = await Client.connect("localhost:7233")
handle = client.get_workflow_handle(workflow_id)
history = await handle.fetch_history()
return history.to_json()
8.5.2 CI/CD Integration
GitHub Actions Workflow:
# .github/workflows/replay-test.yml
name: Temporal Replay Tests
on:
pull_request:
paths:
- 'workflows/**'
- 'activities/**'
jobs:
replay-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install -r requirements.txt
- name: Download production histories
env:
TEMPORAL_ADDRESS: ${{ secrets.TEMPORAL_PROD_ADDRESS }}
TEMPORAL_NAMESPACE: production
run: |
# Download recent workflow histories
python scripts/download_histories.py \
--workflow-type OrderWorkflow \
--limit 50 \
--output tests/histories/
- name: Run replay tests
run: |
pytest tests/test_replay.py -v
      - name: Fail on non-determinism
        if: failure()
        run: |
          echo "❌ Non-determinism detected! Do not merge."
          exit 1
      - name: Confirm replay tests passed
        run: echo "✅ All replay tests passed"
Replay Test Script:
# tests/test_replay.py
import json
import pytest
from pathlib import Path
from temporalio.worker import Replayer
from temporalio.client import WorkflowHistory
from workflows import OrderWorkflow, PaymentWorkflow
ALL_WORKFLOWS = [OrderWorkflow, PaymentWorkflow]
@pytest.mark.asyncio
async def test_replay_all_production_histories():
"""Test neuen Code gegen Production Histories"""
histories_dir = Path("tests/histories")
if not histories_dir.exists():
pytest.skip("No histories to test")
replayer = Replayer(workflows=ALL_WORKFLOWS)
failed_replays = []
for history_file in histories_dir.glob("*.json"):
with open(history_file) as f:
history_data = json.load(f)
try:
await replayer.replay_workflow(
WorkflowHistory.from_json(
history_file.stem,
history_data
)
)
print(f"✓ Successfully replayed {history_file.name}")
except Exception as e:
failed_replays.append({
"file": history_file.name,
"error": str(e)
})
print(f"✗ Failed to replay {history_file.name}: {e}")
# Fail test wenn irgendein Replay fehlschlug
if failed_replays:
error_msg = "Non-determinism detected in:\n"
for failure in failed_replays:
error_msg += f" - {failure['file']}: {failure['error']}\n"
pytest.fail(error_msg)
Script zum History Download:
# scripts/download_histories.py
import asyncio
import json
import argparse
from pathlib import Path
from temporalio.client import Client
async def download_histories(
workflow_type: str,
limit: int,
output_dir: Path,
):
"""Download recent workflow histories für Testing"""
client = await Client.connect("localhost:7233")
# Query für laufende Workflows
query = f'WorkflowType="{workflow_type}" AND ExecutionStatus="Running"'
workflows = client.list_workflows(query=query)
count = 0
async for workflow in workflows:
if count >= limit:
break
# Fetch history
handle = client.get_workflow_handle(workflow.id)
history = await handle.fetch_history()
# Save to file
output_file = output_dir / f"{workflow.id}.json"
with open(output_file, "w") as f:
json.dump(history.to_json(), f, indent=2)
print(f"Downloaded: {workflow.id}")
count += 1
print(f"\nTotal downloaded: {count} histories")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--workflow-type", required=True)
parser.add_argument("--limit", type=int, default=50)
parser.add_argument("--output", required=True)
args = parser.parse_args()
output_dir = Path(args.output)
output_dir.mkdir(parents=True, exist_ok=True)
asyncio.run(download_histories(
args.workflow_type,
args.limit,
output_dir,
))
8.5.3 Testing Version Transitions
@pytest.mark.asyncio
async def test_patched_workflow_new_execution():
"""Test dass neue Workflows neuen Code-Pfad nutzen"""
async with await WorkflowEnvironment.start_time_skipping() as env:
async with Worker(
env.client,
task_queue="test-queue",
workflows=[OrderWorkflow],
activities=[process_payment, check_fraud],
):
# Neue Workflow-Execution
result = await env.client.execute_workflow(
OrderWorkflow.run,
order,
id="test-new-workflow",
task_queue="test-queue",
)
# Verify: Fraud check wurde ausgeführt
assert result.fraud_checked is True
@pytest.mark.asyncio
async def test_patched_workflow_replay():
"""Test dass alte Workflows alten Code-Pfad nutzen"""
# History VOR Patch laden
with open("tests/pre_patch_history.json") as f:
old_history = json.load(f)
# Replay sollte erfolgreich sein
replayer = Replayer(workflows=[OrderWorkflow])
await replayer.replay_workflow(
WorkflowHistory.from_json("old-workflow", old_history)
)
# Success = Alter Pfad wurde korrekt gefolgt
8.6 Migration Patterns
8.6.1 Multi-Step Backward-Compatible Migration
Scenario: Activity Parameter ändern
Step 1: Optional Fields
from dataclasses import dataclass
from typing import Optional
# Alte Struktur
@dataclass
class PaymentParams:
order_id: str
amount: float
# Step 1: Neue Fields optional hinzufügen
@dataclass
class PaymentParams:
order_id: str
amount: float
payment_method: Optional[str] = None # NEU, optional
currency: Optional[str] = "USD" # NEU, mit Default
Step 2: Activity handhabt beide
@activity.defn
async def process_payment(params: PaymentParams) -> Payment:
"""Handle alte und neue Parameter"""
# Defaults für alte Calls
payment_method = params.payment_method or "credit_card"
currency = params.currency or "USD"
# Process mit neuer Logic
result = await payment_processor.process(
order_id=params.order_id,
amount=params.amount,
method=payment_method,
currency=currency,
)
return result
Step 3: Workflow nutzt neue Parameters (mit Patching)
@workflow.defn
class OrderWorkflow:
@workflow.run
async def run(self, order: Order) -> Result:
if workflow.patched("payment-params-v2"):
# Neuer Code - nutzt neue Parameters
payment = await workflow.execute_activity(
process_payment,
args=[PaymentParams(
order_id=order.id,
amount=order.total,
payment_method=order.payment_method,
currency=order.currency,
)],
schedule_to_close_timeout=timedelta(minutes=5),
)
else:
# Alter Code - nutzt alte Parameters
payment = await workflow.execute_activity(
process_payment,
args=[PaymentParams(
order_id=order.id,
amount=order.total,
)],
schedule_to_close_timeout=timedelta(minutes=5),
)
return Result(payment=payment)
Step 4: Nach Migration Fields required machen
# Nach allen alten Workflows complete
@dataclass
class PaymentParams:
order_id: str
amount: float
payment_method: str # Jetzt required
currency: str = "USD" # Required mit Default
8.6.2 Continue-As-New mit Versioning
Pattern: Long-running Entity Workflows die periodisch Version Updates bekommen.
@workflow.defn
class EntityWorkflow:
"""
Entity Workflow läuft unbegrenzt mit Continue-As-New.
Updates automatisch auf neue Versionen.
"""
def __init__(self) -> None:
self._state: dict = {}
self._iteration = 0
@workflow.run
async def run(self, initial_state: dict) -> None:
self._state = initial_state
while True:
# Process iteration
await self._process_iteration()
self._iteration += 1
# Continue-As-New alle 100 Iterationen
if (
self._iteration >= 100
or workflow.info().is_continue_as_new_suggested()
):
workflow.logger.info(
f"Continue-As-New after {self._iteration} iterations"
)
# Continue-As-New picked automatisch neue Version auf!
workflow.continue_as_new(self._state)
await asyncio.sleep(60)
async def _process_iteration(self):
"""Versioned iteration logic"""
# Version 2: Validation hinzugefügt
if workflow.patched("add-validation-v2"):
await self._validate_state()
# Core logic
await workflow.execute_activity(
process_entity_state,
args=[self._state],
schedule_to_close_timeout=timedelta(minutes=5),
)
Vorteile:
- Natural version upgrade points
- History bleibt bounded
- Keine Manual Migration nötig
8.7 Zusammenfassung
Kernkonzepte:
- Determinismus: Temporals Replay-Mechanismus erfordert, dass Workflows deterministisch sind
- Versioning erforderlich: Jede Code-Änderung, die Commands ändert, braucht Versioning
- Drei Ansätze: Patching API, Worker Versioning, Workflow-Name Versioning
- Safe Changes: Activity Implementation, Logging, Queries können ohne Versioning geändert werden
- Unsafe Changes: Activity Calls hinzufügen/entfernen, Reihenfolge ändern, Parameter ändern
Patching API Workflow:
1. workflow.patched("id") → if/else blocks
2. workflow.deprecate_patch("id") → nur neuer Code
3. Remove patch call → clean code
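Zur Veranschaulichung eine komprimierte Skizze der drei Stufen, der Lesbarkeit halber als drei Momentaufnahmen derselben Workflow-Definition nebeneinander (Patch-ID und die Activity check_fraud sind hier nur als Platzhalter angenommen):
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def check_fraud(order: dict) -> bool:
    """Platzhalter-Activity, nur damit die Skizze für sich lauffähig ist."""
    return True

@workflow.defn
class OrderWorkflowStufe1:
    """Stufe 1: workflow.patched() mit if/else, alte und neue Histories bleiben gültig."""
    @workflow.run
    async def run(self, order: dict) -> None:
        if workflow.patched("add-fraud-check"):
            await workflow.execute_activity(
                check_fraud,
                args=[order],
                schedule_to_close_timeout=timedelta(minutes=1),
            )
        # else: alter Code-Pfad ohne Fraud Check

@workflow.defn
class OrderWorkflowStufe2:
    """Stufe 2: deprecate_patch(), sobald keine Pre-Patch-Workflows mehr laufen."""
    @workflow.run
    async def run(self, order: dict) -> None:
        workflow.deprecate_patch("add-fraud-check")
        await workflow.execute_activity(
            check_fraud,
            args=[order],
            schedule_to_close_timeout=timedelta(minutes=1),
        )

@workflow.defn
class OrderWorkflowStufe3:
    """Stufe 3: Patch-Aufruf entfernt, nur noch der neue Code."""
    @workflow.run
    async def run(self, order: dict) -> None:
        await workflow.execute_activity(
            check_fraud,
            args=[order],
            schedule_to_close_timeout=timedelta(minutes=1),
        )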
Best Practices:
- ✅ Replay Tests in CI/CD
- ✅ Production Histories regelmäßig testen
- ✅ Max 3 active Patches pro Workflow
- ✅ Dokumentation für jedes Patch
- ✅ Cleanup Timeline planen
- ✅ Monitoring für Version Adoption
Häufige Fehler:
- ❌ Versioning vergessen
- ❌ random.random() statt workflow.random() (siehe Skizze unten)
- ❌ Kein Replay Testing
- ❌ Version Sprawl (zu viele Patches)
- ❌ Alte Versionen nicht entfernen
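Der zweite Punkt ist so häufig, dass sich eine Mini-Skizze lohnt (rein illustrativ):
from temporalio import workflow

@workflow.defn
class JitterWorkflow:
    @workflow.run
    async def run(self) -> float:
        # ❌ random.random() liefert bei jedem Replay andere Werte -> Non-Determinismus
        # ✅ workflow.random() ist eine deterministisch geseedete random.Random-Instanz
        return workflow.random().uniform(0.5, 1.5)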
Im nächsten Kapitel (Kapitel 9) werden wir Fortgeschrittene Resilienz-Patterns behandeln - komplexe Patterns für Production-Ready Systeme.
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 9: Fortgeschrittene Resilienz-Patterns
Code-Beispiele für dieses Kapitel: examples/part-03/chapter-08/
Kapitel 9: Fortgeschrittene Resilienz-Patterns
Einleitung
In den vorherigen Kapiteln haben Sie die Grundlagen von Error Handling, Retry Policies und Workflow Evolution kennengelernt. Diese Konzepte bilden das Fundament für resiliente Temporal-Anwendungen. Doch in Production-Systemen stoßen Sie auf komplexere Herausforderungen: Workflows, die Monate laufen, Hunderttausende Events akkumulieren, Millionen parallele Executions koordinieren oder mit externen Rate Limits umgehen müssen.
Dieses Kapitel behandelt fortgeschrittene Patterns, die Temporal von einer robusten Orchestration-Platform zu einem hochskalierbaren, produktionsreifen System machen.
Wann brauchen Sie Advanced Patterns?
- Continue-As-New: Ihr Workflow läuft 6 Monate und hat 500,000 Events in der History
- Child Workflows: Sie müssen 10,000 Sub-Tasks parallel orchestrieren
- Parallel Execution: Batch-Verarbeitung von 100,000 Orders pro Stunde
- Rate Limiting: Externe API erlaubt nur 100 Requests/Minute
- Graceful Degradation: Service-Ausfälle dürfen Business nicht stoppen
- State Management: Workflow-State ist 5 MB groß
- Advanced Recovery: Manuelle Intervention bei kritischen Fehlern erforderlich
Lernziele
Nach diesem Kapitel können Sie:
- Continue-As-New korrekt einsetzen um unbegrenzt lange Workflows zu ermöglichen
- Child Workflows für Skalierung und Isolation nutzen
- Parallele Activity Execution effizient orchestrieren
- Rate Limiting auf Worker- und Activity-Level implementieren
- Graceful Degradation mit Fallback-Mechanismen bauen
- Large State und Payloads in Workflows handhaben
- Human-in-the-Loop Patterns für manuelle Approvals implementieren
- Production-Ready Monitoring und Observability aufsetzen
9.1 Continue-As-New Pattern
9.1.1 Das Event History Problem
Jede Workflow-Execution speichert ihre komplette Event History: jede Activity, jeder Timer, jedes Signal, jede State-Transition. Diese History wächst mit jeder Operation.
Problem: History hat praktische Limits:
Workflow läuft 365 Tage
├─ 1 Activity pro Stunde
├─ 1 Signal alle 10 Minuten
├─ State Updates jede Minute
└─ Total Events: ~500,000
Event History Size: ~50 MB
Performance Impact: Replay dauert Minuten!
Temporal Limits:
- Empfohlenes Maximum: 50,000 Events
- Hard Limit: Konfigurierbar (Default 50,000-200,000)
- Performance Degradation: Ab 10,000 Events merklich
Was passiert bei zu großer History:
graph TD
A[Workflow mit 100k Events] --> B[Worker fetcht History]
B --> C[50 MB Download]
C --> D[Replay dauert 5 Minuten]
D --> E{Timeout?}
E -->|Ja| F[Workflow Task Timeout]
E -->|Nein| G[Extrem langsam]
F --> H[Retry Loop]
H --> B
style F fill:#FF4444
style G fill:#FFA500
9.1.2 Continue-As-New Lösung
Konzept: “Reboot” des Workflows mit neuem State, History wird zurückgesetzt.
from temporalio import workflow
import asyncio
@workflow.defn
class LongRunningEntityWorkflow:
"""
Entity Workflow läuft unbegrenzt mit Continue-As-New.
Beispiel: IoT Device Management, User Session, Subscription
"""
def __init__(self) -> None:
self._events_processed = 0
self._state = {}
self._iteration = 0
self._pending_events: list = []  # Puffer für Events aus dem Signal Handler
@workflow.run
async def run(self, entity_id: str, initial_state: dict) -> None:
"""Main workflow - läuft bis Continue-As-New"""
self._state = initial_state
workflow.logger.info(
f"Entity workflow started (iteration {self._iteration})",
extra={"entity_id": entity_id}
)
while True:
# Warte auf Signals oder Timer
await workflow.wait_condition(
lambda: len(self._pending_events) > 0,
timeout=timedelta(hours=1)
)
# Process events
for event in self._pending_events:
await self._process_event(event)
self._events_processed += 1
self._pending_events.clear()
# Check für Continue-As-New
if self._should_continue_as_new():
workflow.logger.info(
f"Continuing as new (processed {self._events_processed} events)",
extra={"iteration": self._iteration}
)
# Continue-As-New mit updated State
workflow.continue_as_new(
args=[entity_id, self._state]
)
def _should_continue_as_new(self) -> bool:
"""Decision logic für Continue-As-New"""
# Option 1: Nach fixer Anzahl Events
if self._events_processed >= 1000:
return True
# Option 2: Temporal's Suggestion (basierend auf History Size)
if workflow.info().is_continue_as_new_suggested():
return True
# Option 3: Nach Zeit
elapsed = workflow.now() - workflow.info().start_time
if elapsed > timedelta(days=7):
return True
return False
@workflow.signal
async def process_event(self, event: dict) -> None:
"""Signal Handler für Events"""
self._pending_events.append(event)
Was passiert bei Continue-As-New:
sequenceDiagram
participant W1 as Workflow Run 1
participant Server as Temporal Server
participant W2 as Workflow Run 2
W1->>W1: Process 1000 events
W1->>W1: History = 5000 events
W1->>W1: continue_as_new(state)
W1->>Server: CompleteWorkflowExecution<br/>+ StartWorkflowExecution
Note over Server: Atomically complete<br/>old run and start new
Server->>W2: Start with fresh history
W2->>W2: State = inherited
W2->>W2: History = 0 events
Note over W2: Clean slate!<br/>Performance restored
Wichtige Eigenschaften:
- Atomic: Altes Workflow-Ende + Neues Workflow-Start = eine Operation
- Same Workflow ID: Workflow ID bleibt gleich
- New Run ID: Jeder Continue bekommt neue Run ID
- State Migration: Übergebe explizit State via args
- Fresh History: Event History startet bei 0
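Eigenschaften wie "Same Workflow ID" und "New Run ID" lassen sich im Workflow selbst beobachten; eine minimale Skizze (die Feldnamen entsprechen workflow.info() im Python SDK):
from temporalio import workflow

@workflow.defn
class RunInfoWorkflow:
    @workflow.run
    async def run(self) -> None:
        info = workflow.info()
        workflow.logger.info(
            "workflow_id=%s run_id=%s continued_from=%s",
            info.workflow_id,       # bleibt über Continue-As-New hinweg gleich
            info.run_id,            # für jeden Run neu
            info.continued_run_id,  # Run-ID des Vorgänger-Runs, None beim ersten Start
        )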
9.1.3 State Migration
Best Practices für State Übergabe:
from dataclasses import dataclass, asdict
from typing import Optional, List
import json
@dataclass
class EntityState:
"""Serializable State für Continue-As-New"""
entity_id: str
balance: float
transactions: List[dict]
metadata: dict
version: int = 1 # State Schema Version
def to_dict(self) -> dict:
"""Serialize für Continue-As-New"""
return asdict(self)
@classmethod
def from_dict(cls, data: dict) -> "EntityState":
"""Deserialize mit Version Handling"""
version = data.get("version", 1)
if version == 1:
return cls(**data)
elif version == 2:
# Migration logic für v1 -> v2
return cls._migrate_v1_to_v2(data)
else:
raise ValueError(f"Unknown state version: {version}")
@workflow.defn
class AccountWorkflow:
def __init__(self) -> None:
self._state: Optional[EntityState] = None
@workflow.run
async def run(self, entity_id: str, state_dict: Optional[dict] = None) -> None:
# Initialize oder restore State
if state_dict:
self._state = EntityState.from_dict(state_dict)
workflow.logger.info("Restored from Continue-As-New")
else:
self._state = EntityState(
entity_id=entity_id,
balance=0.0,
transactions=[],
metadata={}
)
workflow.logger.info("Fresh workflow start")
# ... workflow logic ...
# Continue-As-New
if self._should_continue():
workflow.continue_as_new(
args=[entity_id, self._state.to_dict()]
)
Größenbeschränkungen beachten:
# ❌ FALSCH: State zu groß
@dataclass
class BadState:
large_list: List[dict] # 10 MB!
# Continue-As-New schlägt fehl wenn State > 2 MB
# ✅ RICHTIG: State compact halten
@dataclass
class GoodState:
summary: dict # Nur Zusammenfassung
last_checkpoint: str
# Details in Activities/externe Storage auslagern
9.1.4 Frequency-Based vs Suggested Continue-As-New
Pattern 1: Frequency-Based (Deterministisch)
@workflow.defn
class FrequencyBasedWorkflow:
def __init__(self) -> None:
self._counter = 0
@workflow.run
async def run(self, state: dict) -> None:
while True:
await self._process_batch()
self._counter += 1
# Continue alle 100 Batches
if self._counter >= 100:
workflow.logger.info("Continue-As-New after 100 batches")
workflow.continue_as_new(state)
await asyncio.sleep(5 * 60)  # 5 Minuten (asyncio.sleep erwartet Sekunden)
Vorteile: Vorhersehbar, testbar
Nachteile: Ignoriert tatsächliche History Size
Pattern 2: Suggested Continue-As-New (Dynamisch)
@workflow.defn
class SuggestedBasedWorkflow:
@workflow.run
async def run(self, state: dict) -> None:
while True:
await self._process_batch()
# Temporal schlägt Continue vor wenn nötig
if workflow.info().is_continue_as_new_suggested():
workflow.logger.info(
"Continue-As-New suggested by Temporal",
extra={"history_size": workflow.info().get_current_history_length()}
)
workflow.continue_as_new(state)
await asyncio.sleep(5 * 60)  # 5 Minuten
Vorteile: Adaptiv, optimal für History Size
Nachteile: Non-deterministisch (Suggestion kann bei Replay anders sein)
Best Practice: Hybrid Approach
@workflow.defn
class HybridWorkflow:
@workflow.run
async def run(self, state: dict) -> None:
iteration = 0
while True:
await self._process_batch()
iteration += 1
# Continue wenn EINE der Bedingungen erfüllt
should_continue = (
iteration >= 1000 # Max Iterations
or workflow.info().is_continue_as_new_suggested() # History zu groß
or workflow.now() - workflow.info().start_time > timedelta(days=30) # Max Time
)
if should_continue:
workflow.continue_as_new(state)
await asyncio.sleep(60 * 60)  # 1 Stunde
9.2 Child Workflows
9.2.1 Wann Child Workflows nutzen?
Child Workflows sind eigenständige Workflow-Executions, gestartet von einem Parent Workflow. Sie bieten Event History Isolation und Independent Lifecycle.
Use Cases:
1. Fan-Out Pattern
# Parent orchestriert 1000 Child Workflows
await asyncio.gather(*[
workflow.execute_child_workflow(ProcessOrder.run, order)
for order in orders
])
2. Long-Running Sub-Tasks
# Child läuft Wochen, Parent monitored nur
child_handle = await workflow.start_child_workflow(
DataPipelineWorkflow.run,
dataset_id
)
# Parent kann weiter arbeiten
3. Retry Isolation
# Child hat eigene Retry Policy, unabhängig vom Parent
await workflow.execute_child_workflow(
RiskyOperation.run,
data,
retry_policy=RetryPolicy(maximum_attempts=10)
)
4. Multi-Tenant Isolation
# Jeder Tenant bekommt eigenen Child Workflow
for tenant in tenants:
await workflow.execute_child_workflow(
TenantProcessor.run,
tenant,
id=f"tenant-{tenant.id}" # Separate Workflow ID
)
9.2.2 Parent vs Child Event History
Kritischer Unterschied: Child Workflows haben separate Event Histories.
graph TD
subgraph Parent Workflow History
P1[WorkflowExecutionStarted]
P2[ChildWorkflowExecutionStarted]
P3[ChildWorkflowExecutionCompleted]
P4[WorkflowExecutionCompleted]
P1 --> P2
P2 --> P3
P3 --> P4
end
subgraph Child Workflow History
C1[WorkflowExecutionStarted]
C2[ActivityTaskScheduled x 1000]
C3[ActivityTaskCompleted x 1000]
C4[WorkflowExecutionCompleted]
C1 --> C2
C2 --> C3
C3 --> C4
end
P2 -.->|Spawns| C1
C4 -.->|Returns Result| P3
style Parent fill:#E6F3FF
style Child fill:#FFE6E6
Parent History: Nur Start/Complete Events für Child
Child History: Komplette Execution Details
Vorteil: Parent bleibt schlank, auch wenn Child komplex ist.
9.2.3 Child Workflow Patterns
Pattern 1: Fire-and-Forget
@workflow.defn
class ParentWorkflow:
@workflow.run
async def run(self, tasks: List[Task]) -> str:
"""Start Children und warte NICHT auf Completion"""
for task in tasks:
# start_child_workflow gibt Handle zurück ohne zu warten
handle = await workflow.start_child_workflow(
ProcessTaskWorkflow.run,
args=[task],
id=f"task-{task.id}",
parent_close_policy=ParentClosePolicy.ABANDON
)
workflow.logger.info(f"Started child {task.id}")
# Parent beendet, Children laufen weiter!
return "All children started"
Pattern 2: Wait-All (Fan-Out/Fan-In)
@workflow.defn
class BatchProcessorWorkflow:
@workflow.run
async def run(self, items: List[Item]) -> dict:
"""Parallel processing mit Warten auf alle Results"""
# Start alle Children parallel
child_futures = [
workflow.execute_child_workflow(
ProcessItemWorkflow.run,
item,
id=f"item-{item.id}"
)
for item in items
]
# Warte auf ALLE
results = await asyncio.gather(*child_futures, return_exceptions=True)
# Analyze results
successful = [r for r in results if not isinstance(r, Exception)]
failed = [r for r in results if isinstance(r, Exception)]
return {
"total": len(items),
"successful": len(successful),
"failed": len(failed),
"results": successful
}
Pattern 3: Throttled Parallel Execution
import asyncio
from temporalio import workflow
@workflow.defn
class ThrottledBatchWorkflow:
@workflow.run
async def run(self, items: List[Item]) -> dict:
"""Process mit max 10 parallelen Children"""
semaphore = asyncio.Semaphore(10) # Max 10 concurrent
results = []
async def process_with_semaphore(item: Item):
async with semaphore:
return await workflow.execute_child_workflow(
ProcessItemWorkflow.run,
item,
id=f"item-{item.id}"
)
# Start alle, aber Semaphore limitiert Parallelität
results = await asyncio.gather(*[
process_with_semaphore(item)
for item in items
])
return {"processed": len(results)}
Pattern 4: Rolling Window
@workflow.defn
class RollingWindowWorkflow:
@workflow.run
async def run(self, items: List[Item]) -> dict:
"""Process in Batches von 100"""
batch_size = 100
results = []
for i in range(0, len(items), batch_size):
batch = items[i:i + batch_size]
workflow.logger.info(f"Processing batch {i//batch_size + 1}")
# Process Batch parallel
batch_results = await asyncio.gather(*[
workflow.execute_child_workflow(
ProcessItemWorkflow.run,
item,
id=f"item-{item.id}"
)
for item in batch
])
results.extend(batch_results)
# Optional: Pause zwischen Batches
await asyncio.sleep(5)
return {"total_processed": len(results)}
9.2.4 Parent Close Policies
Was passiert mit Children, wenn der Parent beendet, canceled oder terminated wird?
from temporalio.workflow import ParentClosePolicy
# Policy 1: TERMINATE - Kill children wenn parent schließt (Default)
await workflow.start_child_workflow(
ChildWorkflow.run,
args=[data],
parent_close_policy=ParentClosePolicy.TERMINATE
)
# Policy 2: REQUEST_CANCEL - Cancellation request an children
await workflow.start_child_workflow(
ChildWorkflow.run,
args=[data],
parent_close_policy=ParentClosePolicy.REQUEST_CANCEL
)
# Policy 3: ABANDON - Children laufen weiter
await workflow.start_child_workflow(
ChildWorkflow.run,
args=[data],
parent_close_policy=ParentClosePolicy.ABANDON
)
Decision Matrix:
| Scenario | Empfohlene Policy |
|---|---|
| Parent verwaltet Child Lifecycle vollständig | TERMINATE |
| Child kann gracefully canceln | REQUEST_CANCEL |
| Child ist unabhängig | ABANDON |
| Data Pipeline (Parent orchestriert) | TERMINATE |
| Long-running Entity Workflow | ABANDON |
| User Session Management | REQUEST_CANCEL |
9.3 Parallel Execution Patterns
9.3.1 Activity Parallelism mit asyncio.gather()
Basic Parallel Activities:
from temporalio import workflow
import asyncio
from datetime import timedelta
@workflow.defn
class ParallelActivitiesWorkflow:
@workflow.run
async def run(self, order_id: str) -> dict:
"""Execute multiple activities in parallel"""
# Start all activities concurrently
inventory_future = workflow.execute_activity(
reserve_inventory,
args=[order_id],
start_to_close_timeout=timedelta(seconds=30)
)
payment_future = workflow.execute_activity(
process_payment,
args=[order_id],
start_to_close_timeout=timedelta(seconds=30)
)
shipping_quote_future = workflow.execute_activity(
get_shipping_quote,
args=[order_id],
start_to_close_timeout=timedelta(seconds=30)
)
# Wait for all to complete
inventory, payment, shipping = await asyncio.gather(
inventory_future,
payment_future,
shipping_quote_future
)
return {
"inventory_reserved": inventory,
"payment_processed": payment,
"shipping_cost": shipping
}
Warum parallel?
Sequential:
├─ reserve_inventory: 2s
├─ process_payment: 3s
└─ get_shipping_quote: 1s
Total: 6s
Parallel:
├─ reserve_inventory: 2s ┐
├─ process_payment: 3s ├─ Concurrent
└─ get_shipping_quote: 1s┘
Total: 3s (longest activity)
9.3.2 Partial Failure Handling
Problem: Was wenn eine Activity fehlschlägt?
@workflow.defn
class PartialFailureWorkflow:
@workflow.run
async def run(self, items: List[str]) -> dict:
"""Handle partial failures gracefully"""
# Execute all, capture exceptions
results = await asyncio.gather(*[
workflow.execute_activity(
process_item,
args=[item],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=3)
)
for item in items
], return_exceptions=True)
# Separate successful from failed
successful = []
failed = []
for i, result in enumerate(results):
if isinstance(result, Exception):
failed.append({
"item": items[i],
"error": str(result)
})
else:
successful.append({
"item": items[i],
"result": result
})
workflow.logger.info(
f"Batch complete: {len(successful)} success, {len(failed)} failed"
)
# Decide: Fail workflow if any failed?
if failed and len(failed) / len(items) > 0.1: # >10% failure rate
raise ApplicationError(
f"Batch processing failed: {len(failed)} items",
details=[{"failed_items": failed}]
)
return {
"successful": successful,
"failed": failed,
"success_rate": len(successful) / len(items)
}
9.3.3 Dynamic Parallelism mit Batching
Problem: 10,000 Items zu verarbeiten → 10,000 Activities spawnen?
Lösung: Batching
@workflow.defn
class BatchedParallelWorkflow:
@workflow.run
async def run(self, items: List[Item]) -> dict:
"""Process large dataset mit batching"""
batch_size = 100 # Activity verarbeitet 100 Items
max_parallel = 10 # Max 10 Activities parallel
# Split in batches
batches = [
items[i:i + batch_size]
for i in range(0, len(items), batch_size)
]
workflow.logger.info(f"Processing {len(items)} items in {len(batches)} batches")
all_results = []
# Process batches mit concurrency limit
for batch_group in self._chunk_list(batches, max_parallel):
# Execute batch group parallel
batch_results = await asyncio.gather(*[
workflow.execute_activity(
process_batch,
args=[batch],
start_to_close_timeout=timedelta(minutes=5),
retry_policy=RetryPolicy(maximum_attempts=3)
)
for batch in batch_group
])
all_results.extend(batch_results)
return {
"batches_processed": len(batches),
"items_processed": len(items)
}
def _chunk_list(self, lst: List, chunk_size: int) -> List[List]:
"""Split list into chunks"""
return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]
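Die oben aufgerufene Activity process_batch ist im Beispiel nicht gezeigt; eine minimale Skizze (die eigentliche Verarbeitungslogik ist nur angedeutet) könnte so aussehen:
from temporalio import activity

@activity.defn
async def process_batch(batch: list[dict]) -> dict:
    """Verarbeitet einen Batch von Items in einem einzigen Activity-Aufruf."""
    processed = 0
    for item in batch:
        # hier: eigentliche Verarbeitung (DB-Write, API-Call, ...)
        processed += 1
        activity.heartbeat(processed)  # Fortschritt melden, nützlich bei langen Batches
    return {"processed": processed}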
Performance Comparison:
Scenario: 10,000 Items
Approach 1: Sequential
└─ 10,000 activities x 1s = 10,000s (~3 hours)
Approach 2: Unbounded Parallel
└─ 10,000 activities spawned
└─ Worker overload, Temporal Server pressure
Approach 3: Batched (100 items/batch, 10 parallel)
└─ 100 batches, 10 parallel
└─ ~100s total time
9.3.4 MapReduce Pattern
Full MapReduce Workflow:
from typing import Any, Callable, List
from temporalio import workflow
import asyncio
@workflow.defn
class MapReduceWorkflow:
"""MapReduce Pattern für verteilte Verarbeitung"""
@workflow.run
async def run(
self,
dataset: List[Any],
map_activity: str,
reduce_activity: str,
chunk_size: int = 100,
max_parallel: int = 10
) -> Any:
"""
Map-Reduce Execution:
1. Split dataset in chunks
2. Map: Process chunks parallel
3. Reduce: Aggregate results
"""
# ========== MAP PHASE ==========
workflow.logger.info(f"MAP: Processing {len(dataset)} items")
# Split in chunks
chunks = [
dataset[i:i + chunk_size]
for i in range(0, len(dataset), chunk_size)
]
# Map: Process all chunks parallel (mit limit)
map_results = []
for chunk_group in self._chunk_list(chunks, max_parallel):
results = await asyncio.gather(*[
workflow.execute_activity(
map_activity,
args=[chunk],
start_to_close_timeout=timedelta(minutes=5),
retry_policy=RetryPolicy(maximum_attempts=3)
)
for chunk in chunk_group
])
map_results.extend(results)
workflow.logger.info(f"MAP complete: {len(map_results)} results")
# ========== REDUCE PHASE ==========
workflow.logger.info(f"REDUCE: Aggregating {len(map_results)} results")
# Reduce: Aggregate all map results
final_result = await workflow.execute_activity(
reduce_activity,
args=[map_results],
start_to_close_timeout=timedelta(minutes=10),
retry_policy=RetryPolicy(maximum_attempts=3)
)
workflow.logger.info("REDUCE complete")
return final_result
def _chunk_list(self, lst: List, chunk_size: int) -> List[List]:
return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]
# Activities für MapReduce
@activity.defn
async def map_activity(chunk: List[dict]) -> dict:
"""Map function: Process chunk"""
total = 0
for item in chunk:
# Process item
total += item.get("value", 0)
return {"chunk_sum": total, "count": len(chunk)}
@activity.defn
async def reduce_activity(map_results: List[dict]) -> dict:
"""Reduce function: Aggregate"""
grand_total = sum(r["chunk_sum"] for r in map_results)
total_count = sum(r["count"] for r in map_results)
return {
"total": grand_total,
"count": total_count,
"average": grand_total / total_count if total_count > 0 else 0
}
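Gestartet werden könnte der MapReduceWorkflow clientseitig z.B. so; Task Queue und Workflow-ID sind frei gewählt, die Activity-Namen entsprechen den Funktionsnamen der @activity.defn-Definitionen oben:
from temporalio.client import Client

async def start_map_reduce(dataset: list[dict]) -> dict:
    client = await Client.connect("localhost:7233")
    return await client.execute_workflow(
        MapReduceWorkflow.run,
        args=[dataset, "map_activity", "reduce_activity", 100, 10],
        id="mapreduce-demo",
        task_queue="mapreduce-queue",  # Annahme: ein Worker registriert Workflow und Activities
    )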
9.4 Rate Limiting und Throttling
9.4.1 Warum Rate Limiting?
Externe API Limits:
Third-Party API:
├─ 100 requests/minute
├─ 1000 requests/hour
└─ 10,000 requests/day
Ohne Rate Limiting → Exceeding limits → API errors → Workflow failures
9.4.2 Worker-Level Rate Limiting
Global Rate Limit über Worker Configuration:
from temporalio.worker import Worker
worker = Worker(
client,
task_queue="api-calls",
workflows=[APIWorkflow],
activities=[call_external_api],
# Max concurrent activity executions
max_concurrent_activities=10,
# Worker-weites Rate Limit: Activities pro Sekunde
max_activities_per_second=5.0,
)
Limitation: Gilt pro Worker. Bei Scale-out auf mehrere Worker multipliziert sich das effektive Limit entsprechend.
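Soll das Limit über alle Worker einer Task Queue hinweg gelten, bietet das Python SDK zusätzlich ein Task-Queue-weites Rate Limit an; hier eine Skizze mit den Workflow- und Activity-Namen aus dem Beispiel oben:
from temporalio.client import Client
from temporalio.worker import Worker

async def run_rate_limited_worker() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="api-calls",
        workflows=[APIWorkflow],
        activities=[call_external_api],
        # Serverseitig durchgesetzt: gilt zusammengenommen über ALLE Worker dieser Task Queue
        max_task_queue_activities_per_second=10.0,
    )
    await worker.run()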
9.4.3 Activity-Level Rate Limiting mit Semaphore
Workflow-Managed Semaphore:
import asyncio
from temporalio import workflow
@workflow.defn
class RateLimitedWorkflow:
"""Workflow mit Activity Rate Limiting"""
def __init__(self) -> None:
# Semaphore: Max 5 concurrent API calls
self._api_semaphore = asyncio.Semaphore(5)
@workflow.run
async def run(self, requests: List[dict]) -> List[dict]:
"""Process requests mit rate limit"""
async def call_with_limit(request: dict):
async with self._api_semaphore:
return await workflow.execute_activity(
call_external_api,
args=[request],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
maximum_attempts=3,
initial_interval=timedelta(seconds=2)
)
)
# All requests, aber max 5 concurrent
results = await asyncio.gather(*[
call_with_limit(req)
for req in requests
])
return results
9.4.4 Token Bucket Rate Limiter
Advanced Pattern für präzise Rate Limits:
from dataclasses import dataclass
from datetime import timedelta
from temporalio import workflow
import asyncio
@dataclass
class TokenBucket:
"""Token Bucket für Rate Limiting"""
capacity: int # Max tokens
refill_rate: float # Tokens pro Sekunde
tokens: float # Current tokens
last_refill: float # Last refill timestamp
def consume(self, tokens: int, current_time: float) -> bool:
"""Attempt to consume tokens"""
# Refill tokens basierend auf elapsed time
elapsed = current_time - self.last_refill
self.tokens = min(
self.capacity,
self.tokens + (elapsed * self.refill_rate)
)
self.last_refill = current_time
# Check if enough tokens
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
def time_until_available(self, tokens: int) -> float:
"""Sekunden bis genug Tokens verfügbar"""
if self.tokens >= tokens:
return 0.0
needed = tokens - self.tokens
return needed / self.refill_rate
@workflow.defn
class TokenBucketWorkflow:
"""Rate Limiting mit Token Bucket"""
def __init__(self) -> None:
# 100 requests/minute = 1.67 requests/second
self._bucket = TokenBucket(
capacity=100,
refill_rate=100 / 60, # 1.67/s
tokens=100,
last_refill=workflow.time_ns() / 1e9
)
@workflow.run
async def run(self, requests: List[dict]) -> List[dict]:
"""Process mit token bucket rate limit"""
results = []
for request in requests:
# Wait for token availability
await self._wait_for_token()
# Execute activity
result = await workflow.execute_activity(
call_api,
args=[request],
start_to_close_timeout=timedelta(seconds=30)
)
results.append(result)
return results
async def _wait_for_token(self) -> None:
"""Wait until token available"""
while True:
current_time = workflow.time_ns() / 1e9
if self._bucket.consume(1, current_time):
# Token consumed
return
# Wait until token available
wait_time = self._bucket.time_until_available(1)
await asyncio.sleep(wait_time)
9.4.5 Sliding Window Rate Limiter
Pattern für zeitbasierte Limits (z.B. 1000/hour):
from collections import deque
from temporalio import workflow
import asyncio
@workflow.defn
class SlidingWindowWorkflow:
"""Rate Limiting mit Sliding Window"""
def __init__(self) -> None:
# Track requests in sliding window
self._request_timestamps: deque = deque()
self._max_requests = 1000
self._window_seconds = 3600 # 1 hour
@workflow.run
async def run(self, requests: List[dict]) -> List[dict]:
results = []
for request in requests:
# Wait for slot in window
await self._wait_for_window_slot()
# Execute
result = await workflow.execute_activity(
call_api,
args=[request],
start_to_close_timeout=timedelta(seconds=30)
)
results.append(result)
return results
async def _wait_for_window_slot(self) -> None:
"""Wait until request slot available in window"""
while True:
current_time = workflow.time_ns() / 1e9
cutoff_time = current_time - self._window_seconds
# Remove expired timestamps
while self._request_timestamps and self._request_timestamps[0] < cutoff_time:
self._request_timestamps.popleft()
# Check if slot available
if len(self._request_timestamps) < self._max_requests:
self._request_timestamps.append(current_time)
return
# Wait until oldest request expires
oldest = self._request_timestamps[0]
wait_until = oldest + self._window_seconds
wait_seconds = wait_until - current_time
if wait_seconds > 0:
await asyncio.sleep(wait_seconds)
9.5 Graceful Degradation
9.5.1 Fallback Pattern
Konzept: Bei Service-Ausfall auf Fallback-Mechanismus zurückfallen.
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError, ApplicationError
import asyncio
@workflow.defn
class GracefulDegradationWorkflow:
"""Workflow mit Fallback-Mechanismen"""
@workflow.run
async def run(self, user_id: str) -> dict:
"""
Get user recommendations:
1. Try ML-based recommendations (primary)
2. Fallback to rule-based (secondary)
3. Fallback to popular items (tertiary)
"""
# Try primary: ML Recommendations
try:
workflow.logger.info("Attempting ML recommendations")
recommendations = await workflow.execute_activity(
get_ml_recommendations,
args=[user_id],
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(
maximum_attempts=2,
initial_interval=timedelta(seconds=1)
)
)
return {
"recommendations": recommendations,
"source": "ml",
"degraded": False
}
except ActivityError as e:
workflow.logger.warning(f"ML recommendations failed: {e}")
# Fallback 1: Rule-Based
try:
workflow.logger.info("Falling back to rule-based recommendations")
recommendations = await workflow.execute_activity(
get_rule_based_recommendations,
args=[user_id],
start_to_close_timeout=timedelta(seconds=5),
retry_policy=RetryPolicy(maximum_attempts=2)
)
return {
"recommendations": recommendations,
"source": "rules",
"degraded": True
}
except ActivityError as e:
workflow.logger.warning(f"Rule-based recommendations failed: {e}")
# Fallback 2: Popular Items (always works)
workflow.logger.info("Falling back to popular items")
recommendations = await workflow.execute_activity(
get_popular_items,
start_to_close_timeout=timedelta(seconds=3),
retry_policy=RetryPolicy(maximum_attempts=3)
)
return {
"recommendations": recommendations,
"source": "popular",
"degraded": True
}
9.5.2 Circuit Breaker mit Fallback
Integration von Circuit Breaker + Fallback:
from enum import Enum
from dataclasses import dataclass
from typing import Optional
from temporalio import workflow
class CircuitState(Enum):
CLOSED = "closed"
OPEN = "open"
HALF_OPEN = "half_open"
@dataclass
class CircuitBreakerConfig:
failure_threshold: int = 5
timeout_seconds: int = 60
half_open_attempts: int = 2
@workflow.defn
class CircuitBreakerFallbackWorkflow:
"""Circuit Breaker mit automatischem Fallback"""
def __init__(self) -> None:
self._circuit_state = CircuitState.CLOSED
self._failure_count = 0
self._last_failure_time: Optional[float] = None
self._config = CircuitBreakerConfig()
@workflow.run
async def run(self, requests: List[dict]) -> List[dict]:
results = []
for request in requests:
result = await self._execute_with_circuit_breaker(request)
results.append(result)
return results
async def _execute_with_circuit_breaker(self, request: dict) -> dict:
"""Execute mit Circuit Breaker Protection + Fallback"""
# Check circuit state
await self._update_circuit_state()
if self._circuit_state == CircuitState.OPEN:
# Circuit OPEN → Immediate fallback
workflow.logger.warning("Circuit OPEN - using fallback")
return await self._execute_fallback(request)
# Try primary service
try:
result = await workflow.execute_activity(
call_primary_service,
args=[request],
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(maximum_attempts=1) # No retries
)
# Success → Reset circuit
await self._on_success()
return result
except ActivityError as e:
# Failure → Update circuit
await self._on_failure()
workflow.logger.warning(f"Primary service failed: {e}")
# Use fallback
return await self._execute_fallback(request)
async def _execute_fallback(self, request: dict) -> dict:
"""Fallback execution"""
return await workflow.execute_activity(
call_fallback_service,
args=[request],
start_to_close_timeout=timedelta(seconds=5),
retry_policy=RetryPolicy(maximum_attempts=3)
)
async def _update_circuit_state(self) -> None:
"""Update circuit state basierend auf Timeouts"""
if self._circuit_state == CircuitState.OPEN:
current_time = workflow.time_ns() / 1e9
if self._last_failure_time:
elapsed = current_time - self._last_failure_time
if elapsed > self._config.timeout_seconds:
# Timeout elapsed → Try half-open
self._circuit_state = CircuitState.HALF_OPEN
workflow.logger.info("Circuit HALF_OPEN - testing recovery")
async def _on_success(self) -> None:
"""Handle successful call"""
if self._circuit_state == CircuitState.HALF_OPEN:
# Success in half-open → Close circuit
self._circuit_state = CircuitState.CLOSED
self._failure_count = 0
workflow.logger.info("Circuit CLOSED - service recovered")
elif self._circuit_state == CircuitState.CLOSED:
# Reset failure count
self._failure_count = 0
async def _on_failure(self) -> None:
"""Handle failed call"""
self._failure_count += 1
self._last_failure_time = workflow.time_ns() / 1e9
if self._circuit_state == CircuitState.HALF_OPEN:
# Failure in half-open → Reopen
self._circuit_state = CircuitState.OPEN
workflow.logger.warning("Circuit OPEN - service still failing")
elif self._failure_count >= self._config.failure_threshold:
# Too many failures → Open circuit
self._circuit_state = CircuitState.OPEN
workflow.logger.warning(
f"Circuit OPEN after {self._failure_count} failures"
)
9.6 State Management Patterns
9.6.1 Large Payload Problem
Problem: Workflow State > 2 MB → Performance Issues
# ❌ FALSCH: Large state in workflow
@workflow.defn
class BadWorkflow:
def __init__(self) -> None:
self._large_dataset: List[dict] = [] # Can grow to 10 MB!
@workflow.run
async def run(self) -> None:
# Load large data
self._large_dataset = await workflow.execute_activity(...)
# State wird bei jedem Workflow Task übertragen → slow!
Temporal Limits:
- Payload Size Limit: 2 MB (default, konfigurierbar bis 4 MB)
- History Size Limit: 50 MB (empfohlen)
- Performance: State wird bei jedem Task serialisiert/deserialisiert
9.6.2 External Storage Pattern
Lösung: Large Data in externes Storage, nur Reference in Workflow.
from dataclasses import dataclass
from typing import Optional
from temporalio import workflow
@dataclass
class DataReference:
"""Reference zu externem Storage"""
storage_key: str
size_bytes: int
checksum: str
storage_type: str = "s3" # s3, gcs, blob, etc.
@workflow.defn
class ExternalStorageWorkflow:
"""Workflow mit External Storage Pattern"""
def __init__(self) -> None:
self._data_ref: Optional[DataReference] = None
@workflow.run
async def run(self, dataset_id: str) -> dict:
"""Process large dataset via external storage"""
# Step 1: Load data und store externally
workflow.logger.info(f"Loading dataset {dataset_id}")
self._data_ref = await workflow.execute_activity(
load_and_store_dataset,
args=[dataset_id],
start_to_close_timeout=timedelta(minutes=10),
heartbeat_timeout=timedelta(seconds=30)
)
workflow.logger.info(
f"Dataset stored: {self._data_ref.storage_key} "
f"({self._data_ref.size_bytes} bytes)"
)
# Step 2: Process via reference
result = await workflow.execute_activity(
process_dataset_from_storage,
args=[self._data_ref],
start_to_close_timeout=timedelta(minutes=30),
heartbeat_timeout=timedelta(minutes=1)
)
# Step 3: Cleanup external storage
await workflow.execute_activity(
cleanup_storage,
args=[self._data_ref.storage_key],
start_to_close_timeout=timedelta(minutes=5)
)
return result
# Activities
@activity.defn
async def load_and_store_dataset(dataset_id: str) -> DataReference:
"""Load large dataset und store in S3"""
# Load from database
data = await database.load_dataset(dataset_id)
# Store in S3
storage_key = f"datasets/{dataset_id}/{uuid.uuid4()}.json"
await s3_client.put_object(
Bucket="workflow-data",
Key=storage_key,
Body=json.dumps(data)
)
return DataReference(
storage_key=storage_key,
size_bytes=len(json.dumps(data)),
checksum=hashlib.md5(json.dumps(data).encode()).hexdigest(),
storage_type="s3"
)
@activity.defn
async def process_dataset_from_storage(ref: DataReference) -> dict:
"""Process dataset from external storage"""
# Load from S3
response = await s3_client.get_object(
Bucket="workflow-data",
Key=ref.storage_key
)
data = json.loads(response['Body'].read())
# Process
result = process_data(data)
return result
9.6.3 Compression Pattern
Alternative: State komprimieren bei Storage.
import gzip
import base64
import json
from datetime import timedelta
from typing import Any, Optional
from temporalio import workflow
@workflow.defn
class CompressedStateWorkflow:
"""Workflow mit State Compression"""
def __init__(self) -> None:
self._compressed_state: Optional[str] = None
@workflow.run
async def run(self) -> dict:
# Load large state
large_state = await workflow.execute_activity(
load_large_state,
start_to_close_timeout=timedelta(minutes=5)
)
# Compress state
self._compressed_state = self._compress_state(large_state)
workflow.logger.info(
f"State compressed: "
f"{len(json.dumps(large_state))} → {len(self._compressed_state)} bytes"
)
# Later: Decompress when needed
state = self._decompress_state(self._compressed_state)
return {"status": "complete"}
def _compress_state(self, state: Any) -> str:
"""Compress state für storage"""
json_bytes = json.dumps(state).encode('utf-8')
compressed = gzip.compress(json_bytes)
return base64.b64encode(compressed).decode('ascii')
def _decompress_state(self, compressed: str) -> Any:
"""Decompress state"""
compressed_bytes = base64.b64decode(compressed.encode('ascii'))
json_bytes = gzip.decompress(compressed_bytes)
return json.loads(json_bytes.decode('utf-8'))
9.6.4 Incremental State Updates
Pattern: Nur Änderungen tracken statt kompletten State.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set
from datetime import timedelta
from temporalio import workflow
@dataclass
class IncrementalState:
"""State mit incremental updates"""
processed_ids: Set[str] = field(default_factory=set)
failed_ids: Set[str] = field(default_factory=set)
metadata: Dict[str, Any] = field(default_factory=dict)
# Track only changes
_added_since_checkpoint: List[str] = field(default_factory=list)
_checkpoint_size: int = 0
def add_processed(self, item_id: str):
"""Add processed item"""
if item_id not in self.processed_ids:
self.processed_ids.add(item_id)
self._added_since_checkpoint.append(item_id)
def should_checkpoint(self, threshold: int = 1000) -> bool:
"""Check if checkpoint needed"""
return len(self._added_since_checkpoint) >= threshold
@workflow.defn
class IncrementalStateWorkflow:
"""Workflow mit incremental state updates"""
def __init__(self) -> None:
self._state = IncrementalState()
@workflow.run
async def run(self, items: List[str]) -> dict:
for item in items:
# Process item
await workflow.execute_activity(
process_item,
args=[item],
start_to_close_timeout=timedelta(seconds=30)
)
self._state.add_processed(item)
# Checkpoint state periodically
if self._state.should_checkpoint():
await self._checkpoint_state()
return {
"processed": len(self._state.processed_ids),
"failed": len(self._state.failed_ids)
}
async def _checkpoint_state(self):
"""Persist state checkpoint"""
await workflow.execute_activity(
save_checkpoint,
args=[{
"processed_ids": list(self._state.processed_ids),
"failed_ids": list(self._state.failed_ids),
"metadata": self._state.metadata
}],
start_to_close_timeout=timedelta(seconds=30)
)
# Reset incremental tracking
self._state._added_since_checkpoint.clear()
workflow.logger.info(
f"Checkpoint saved: {len(self._state.processed_ids)} processed"
)
9.7 Human-in-the-Loop Patterns
9.7.1 Manual Approval Pattern
Use Case: Kritische Workflows brauchen menschliche Genehmigung.
from temporalio import workflow
import asyncio
from datetime import timedelta
from typing import Optional
@workflow.defn
class ApprovalWorkflow:
"""Workflow mit Manual Approval"""
def __init__(self) -> None:
self._approval_granted: bool = False
self._rejection_reason: Optional[str] = None
@workflow.run
async def run(self, request: dict) -> dict:
"""Execute mit approval requirement"""
# Step 1: Validation
workflow.logger.info("Validating request")
validation = await workflow.execute_activity(
validate_request,
args=[request],
start_to_close_timeout=timedelta(seconds=30)
)
if not validation.is_valid:
return {"status": "rejected", "reason": "validation_failed"}
# Step 2: Request Approval
workflow.logger.info("Requesting manual approval")
await workflow.execute_activity(
send_approval_request,
args=[request, workflow.info().workflow_id],
start_to_close_timeout=timedelta(seconds=30)
)
# Step 3: Wait for approval (max 7 days)
try:
await workflow.wait_condition(
lambda: self._approval_granted or self._rejection_reason is not None,
timeout=timedelta(days=7)
)
except asyncio.TimeoutError:
workflow.logger.warning("Approval timeout - auto-rejecting")
return {"status": "timeout", "reason": "no_approval_within_7_days"}
# Check approval result
if self._rejection_reason:
workflow.logger.info(f"Request rejected: {self._rejection_reason}")
return {"status": "rejected", "reason": self._rejection_reason}
# Step 4: Execute approved action
workflow.logger.info("Approval granted - executing")
result = await workflow.execute_activity(
execute_approved_action,
args=[request],
start_to_close_timeout=timedelta(minutes=30)
)
return {"status": "approved", "result": result}
@workflow.signal
async def approve(self) -> None:
"""Signal: Approve request"""
self._approval_granted = True
workflow.logger.info("Approval signal received")
@workflow.signal
async def reject(self, reason: str) -> None:
"""Signal: Reject request"""
self._rejection_reason = reason
workflow.logger.info(f"Rejection signal received: {reason}")
@workflow.query
def get_status(self) -> dict:
"""Query: Current approval status"""
return {
"approved": self._approval_granted,
"rejected": self._rejection_reason is not None,
"rejection_reason": self._rejection_reason,
"waiting": not self._approval_granted and not self._rejection_reason
}
Client-seitige Approval:
from temporalio.client import Client
async def approve_workflow(workflow_id: str):
"""Approve workflow extern"""
client = await Client.connect("localhost:7233")
handle = client.get_workflow_handle(workflow_id)
# Send approval signal
await handle.signal(ApprovalWorkflow.approve)
print(f"Workflow {workflow_id} approved")
async def reject_workflow(workflow_id: str, reason: str):
"""Reject workflow extern"""
client = await Client.connect("localhost:7233")
handle = client.get_workflow_handle(workflow_id)
# Send rejection signal
await handle.signal(ApprovalWorkflow.reject, reason)
print(f"Workflow {workflow_id} rejected: {reason}")
9.7.2 Multi-Step Approval Chain
Pattern: Mehrstufige Genehmigungskette.
from enum import Enum
from dataclasses import dataclass
from typing import List, Optional
class ApprovalLevel(Enum):
MANAGER = "manager"
DIRECTOR = "director"
VP = "vp"
CEO = "ceo"
@dataclass
class ApprovalStep:
level: ApprovalLevel
approved_by: Optional[str] = None
approved_at: Optional[float] = None
rejected: bool = False
rejection_reason: Optional[str] = None
@workflow.defn
class MultiStepApprovalWorkflow:
"""Workflow mit Multi-Level Approval Chain"""
def __init__(self) -> None:
self._approval_chain: List[ApprovalStep] = []
self._current_step = 0
@workflow.run
async def run(self, request: dict, amount: float) -> dict:
"""Execute mit approval chain basierend auf amount"""
# Determine approval chain basierend auf amount
if amount < 10000:
levels = [ApprovalLevel.MANAGER]
elif amount < 100000:
levels = [ApprovalLevel.MANAGER, ApprovalLevel.DIRECTOR]
elif amount < 1000000:
levels = [ApprovalLevel.MANAGER, ApprovalLevel.DIRECTOR, ApprovalLevel.VP]
else:
levels = [ApprovalLevel.MANAGER, ApprovalLevel.DIRECTOR, ApprovalLevel.VP, ApprovalLevel.CEO]
# Initialize approval chain
self._approval_chain = [ApprovalStep(level=level) for level in levels]
workflow.logger.info(
f"Approval chain: {[step.level.value for step in self._approval_chain]}"
)
# Process each approval step
for i, step in enumerate(self._approval_chain):
self._current_step = i
workflow.logger.info(f"Requesting {step.level.value} approval")
# Notify approver
await workflow.execute_activity(
notify_approver,
args=[step.level.value, request, workflow.info().workflow_id],
start_to_close_timeout=timedelta(seconds=30)
)
# Wait for approval (24 hours timeout per level)
try:
await workflow.wait_condition(
lambda: step.approved_by is not None or step.rejected,
timeout=timedelta(hours=24)
)
except asyncio.TimeoutError:
workflow.logger.warning(f"{step.level.value} approval timeout")
return {
"status": "timeout",
"level": step.level.value,
"reason": "approval_timeout"
}
# Check result
if step.rejected:
workflow.logger.info(
f"Rejected at {step.level.value}: {step.rejection_reason}"
)
return {
"status": "rejected",
"level": step.level.value,
"reason": step.rejection_reason
}
workflow.logger.info(
f"{step.level.value} approved by {step.approved_by}"
)
# All approvals granted → Execute
workflow.logger.info("All approvals granted - executing")
result = await workflow.execute_activity(
execute_approved_action,
args=[request],
start_to_close_timeout=timedelta(minutes=30)
)
return {
"status": "approved",
"approval_chain": [
{
"level": step.level.value,
"approved_by": step.approved_by,
"approved_at": step.approved_at
}
for step in self._approval_chain
],
"result": result
}
@workflow.signal
async def approve_step(self, level: str, approver: str) -> None:
"""Approve current step"""
current_step = self._approval_chain[self._current_step]
if current_step.level.value == level:
current_step.approved_by = approver
current_step.approved_at = workflow.time_ns() / 1e9
workflow.logger.info(f"Step approved: {level} by {approver}")
@workflow.signal
async def reject_step(self, level: str, reason: str) -> None:
"""Reject current step"""
current_step = self._approval_chain[self._current_step]
if current_step.level.value == level:
current_step.rejected = True
current_step.rejection_reason = reason
workflow.logger.info(f"Step rejected: {level} - {reason}")
@workflow.query
def get_approval_status(self) -> dict:
"""Query approval status"""
return {
"current_step": self._current_step,
"total_steps": len(self._approval_chain),
"steps": [
{
"level": step.level.value,
"approved": step.approved_by is not None,
"rejected": step.rejected,
"pending": step.approved_by is None and not step.rejected
}
for step in self._approval_chain
]
}
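Clientseitig wird eine Stufe genehmigt, indem das Signal mit beiden Argumenten gesendet wird; Level-String und Approver-Name sind hier beispielhaft:
from temporalio.client import Client

async def approve_director_step(workflow_id: str) -> None:
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle(workflow_id)
    # Mehrere Signal-Argumente werden als args-Liste übergeben
    await handle.signal(
        MultiStepApprovalWorkflow.approve_step,
        args=["director", "alice@example.com"],
    )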
9.8 Zusammenfassung
Kernkonzepte:
- Continue-As-New: Unbegrenzte Workflow-Laufzeit durch History-Reset
- Child Workflows: Event History Isolation und unabhängiger Lifecycle
- Parallel Execution: Effiziente Batch-Verarbeitung mit asyncio.gather()
- Rate Limiting: Token Bucket, Sliding Window, Semaphore-Patterns
- Graceful Degradation: Fallback-Mechanismen und Circuit Breaker
- State Management: External Storage, Compression, Incremental Updates
- Human-in-the-Loop: Manual Approvals und Multi-Level Chains
Best Practices Checkliste:
- ✅ Continue-As-New bei >10,000 Events oder Temporal Suggestion
- ✅ Child Workflows für Event History Isolation
- ✅ Batching bei >1000 parallelen Activities
- ✅ Rate Limiting für externe APIs implementieren
- ✅ Fallback-Mechanismen für kritische Services
- ✅ State < 2 MB halten, sonst External Storage
- ✅ Approval Workflows mit Timeout
- ✅ Monitoring für alle Advanced Patterns
Wann welches Pattern?
| Scenario | Pattern |
|---|---|
| Workflow > 30 Tage | Continue-As-New |
| >1000 Sub-Tasks | Child Workflows |
| Batch Processing | Parallel Execution + Batching |
| Externe API mit Limits | Rate Limiting (Token Bucket) |
| Service-Ausfälle möglich | Graceful Degradation |
| State > 2 MB | External Storage |
| Manuelle Genehmigung | Human-in-the-Loop |
Im nächsten Kapitel (Teil IV: Betrieb) werden wir uns mit Production Deployments, Monitoring und Operations beschäftigen - wie Sie Temporal-Systeme in Production betreiben.
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 10: Produktions-Deployment
Code-Beispiele für dieses Kapitel: examples/part-03/chapter-09/
Kapitel 10: Deployment und Production Best Practices
Einleitung
Temporal in der Entwicklung zum Laufen zu bringen ist einfach – temporal server start-dev und los geht’s. Aber der Sprung in die Production ist eine andere Herausforderung. Sie müssen über High Availability, Zero-Downtime Deployments, Kapazitätsplanung, Disaster Recovery und vieles mehr nachdenken.
Dieses Kapitel behandelt alles, was Sie wissen müssen, um Temporal sicher und zuverlässig in Production zu betreiben. Von Deployment-Strategien über Worker-Management bis hin zu Temporal Server-Konfiguration.
Das Grundproblem
Scenario: Sie haben 50,000 laufende Workflows in Production. Ein neues Feature muss deployed werden. Aber:
- Workflows laufen über Wochen/Monate
- Worker dürfen nicht einfach beendet werden (laufende Activities!)
- Code-Änderungen müssen abwärtskompatibel sein
- Zero Downtime ist Pflicht
- Rollback muss möglich sein
Ohne Best Practices:
# FALSCH: Workers einfach neu starten
kubectl delete pods -l app=temporal-worker # ❌ Kills running activities!
kubectl apply -f new-worker.yaml
Ergebnis:
- Activities werden abgebrochen
- Workflows müssen retries durchführen
- Potentieller Datenverlust
- Service-Degradation
Mit Best Practices:
# RICHTIG: Graceful shutdown + Rolling deployment
kubectl set image deployment/temporal-worker worker=v2.0.0 # ✅
# Kubernetes terminiert Pods graceful
# Workers beenden aktuelle Tasks
# Neue Workers starten parallel
Lernziele
Nach diesem Kapitel können Sie:
- Verschiedene Deployment-Strategien (Blue-Green, Canary, Rolling) anwenden
- Workers graceful shutdown und zero-downtime deployments durchführen
- Temporal Server selbst hosten oder Cloud nutzen
- High Availability und Disaster Recovery implementieren
- Capacity Planning durchführen
- Production Checklisten anwenden
- Monitoring und Alerting aufsetzen (Details in Kapitel 11)
- Security Best Practices implementieren (Details in Kapitel 13)
10.1 Worker Deployment Strategies
10.1.1 Graceful Shutdown
Warum wichtig?
Workers führen Activities aus, die Minuten oder Stunden dauern können. Ein abruptes Beenden würde:
- Laufende Activities abbrechen
- Externe State inkonsistent lassen
- Unnötige Retries auslösen
Lifecycle eines Workers:
stateDiagram-v2
[*] --> Starting: Start Worker
Starting --> Running: Connected to Temporal
Running --> Draining: SIGTERM received
Draining --> Stopped: All tasks completed
Stopped --> [*]
note right of Draining
- Accepts no new tasks
- Completes running tasks
- Timeout: 30s - 5min
end note
Python Implementation:
"""
Graceful Worker Shutdown
"""
import asyncio
import signal
from datetime import timedelta
from temporalio.client import Client
from temporalio.worker import Worker
import logging
logger = logging.getLogger(__name__)
class GracefulWorker:
"""Worker with graceful shutdown support"""
def __init__(
self,
client: Client,
task_queue: str,
workflows: list,
activities: list,
shutdown_timeout: float = 300.0 # 5 minutes
):
self.client = client
self.task_queue = task_queue
self.workflows = workflows
self.activities = activities
self.shutdown_timeout = shutdown_timeout
self.worker: Worker | None = None
self._shutdown_event = asyncio.Event()
async def run(self):
"""Run worker with graceful shutdown handling"""
# Setup signal handlers
loop = asyncio.get_running_loop()
def signal_handler(sig):
logger.info(f"Received signal {sig}, initiating graceful shutdown...")
self._shutdown_event.set()
# Handle SIGTERM (Kubernetes pod termination)
loop.add_signal_handler(signal.SIGTERM, lambda: signal_handler("SIGTERM"))
# Handle SIGINT (Ctrl+C)
loop.add_signal_handler(signal.SIGINT, lambda: signal_handler("SIGINT"))
# Create and start worker
logger.info(f"Starting worker on task queue: {self.task_queue}")
async with Worker(
self.client,
task_queue=self.task_queue,
workflows=self.workflows,
activities=self.activities,
# Important: Enable graceful shutdown
graceful_shutdown_timeout=timedelta(seconds=self.shutdown_timeout)
) as self.worker:
logger.info("✓ Worker started and polling for tasks")
# Wait for shutdown signal
await self._shutdown_event.wait()
logger.info("Shutdown signal received, draining tasks...")
logger.info(f"Waiting up to {self.shutdown_timeout}s for tasks to complete")
# Worker will automatically:
# 1. Stop accepting new tasks
# 2. Wait for running tasks to complete
# 3. Timeout after graceful_shutdown_timeout
logger.info("✓ Worker stopped gracefully")
# Usage
async def main():
client = await Client.connect("localhost:7233")
worker = GracefulWorker(
client=client,
task_queue="production-queue",
workflows=[MyWorkflow],
activities=[my_activity],
shutdown_timeout=300.0
)
await worker.run()
if __name__ == "__main__":
asyncio.run(main())
Kubernetes Deployment mit Graceful Shutdown:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: temporal-worker
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
template:
metadata:
labels:
app: temporal-worker
version: v2.0.0
spec:
containers:
- name: worker
image: myregistry/temporal-worker:v2.0.0
# Resource limits
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
# Graceful shutdown configuration
lifecycle:
preStop:
exec:
# Optional: Custom pre-stop hook
# Worker already handles SIGTERM gracefully
command: ["/bin/sh", "-c", "sleep 5"]
# Health checks
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
# Termination grace period (must be > graceful_shutdown_timeout!)
terminationGracePeriodSeconds: 360 # 6 minutes
Best Practices:
✅ DO:
- Set graceful_shutdown_timeout > longest expected activity duration
- Set Kubernetes terminationGracePeriodSeconds > graceful_shutdown_timeout + buffer
- Log shutdown progress for observability
- Monitor drain duration metrics
- Test graceful shutdown in staging
❌ DON’T:
- Use SIGKILL for routine shutdowns
- Set timeout too short (activities will be aborted)
- Ignore health check failures
- Skip testing shutdown behavior
10.1.2 Rolling Deployment
Pattern: Schrittweises Ersetzen alter Workers durch neue.
Vorteile:
- ✅ Zero Downtime
- ✅ Automatisches Rollback bei Fehlern
- ✅ Kapazität bleibt konstant
- ✅ Standard in Kubernetes
Nachteile:
- ⚠️ Zwei Versionen laufen parallel
- ⚠️ Code muss backward-compatible sein
- ⚠️ Langsamer als Blue-Green
Flow:
graph TD
A[3 Workers v1.0] --> B[2 Workers v1.0<br/>1 Worker v2.0]
B --> C[1 Worker v1.0<br/>2 Workers v2.0]
C --> D[3 Workers v2.0]
style A fill:#e1f5ff
style D fill:#d4f1d4
Kubernetes RollingUpdate:
apiVersion: apps/v1
kind: Deployment
metadata:
name: temporal-worker
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 2 # Max 2 workers down at once
maxSurge: 2 # Max 2 extra workers during rollout
template:
spec:
containers:
- name: worker
image: myregistry/temporal-worker:v2.0.0
Deployment Process:
# 1. Update image
kubectl set image deployment/temporal-worker \
worker=myregistry/temporal-worker:v2.0.0
# 2. Monitor rollout
kubectl rollout status deployment/temporal-worker
# Output:
# Waiting for deployment "temporal-worker" rollout to finish: 2 out of 10 new replicas have been updated...
# Waiting for deployment "temporal-worker" rollout to finish: 5 out of 10 new replicas have been updated...
# deployment "temporal-worker" successfully rolled out
# 3. If problems occur, rollback
kubectl rollout undo deployment/temporal-worker
Health Check during Rollout:
"""
Health check endpoint for Kubernetes probes
"""
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from temporalio.worker import Worker
app = FastAPI()
worker: Worker | None = None
@app.get("/health")
async def health():
"""Liveness probe: Is the process alive?"""
return {"status": "ok"}
@app.get("/ready")
async def ready():
"""Readiness probe: Is worker ready to accept tasks?"""
if worker is None or not worker.is_running:
return {"status": "not_ready"}, 503
return {"status": "ready"}
# Run alongside worker
async def run_worker_with_health_check():
import uvicorn
# Start health check server
config = uvicorn.Config(app, host="0.0.0.0", port=8080)
server = uvicorn.Server(config)
# Run both concurrently
await asyncio.gather(
server.serve(),
run_worker() # Your worker logic
)
10.1.3 Blue-Green Deployment
Pattern: Zwei identische Environments (Blue = alt, Green = neu). Traffic wird komplett umgeschaltet.
Vorteile:
- ✅ Instant Rollback (switch back)
- ✅ Beide Versionen getestet vor Switch
- ✅ Zero Downtime
- ✅ Volle Kontrolle über Cutover
Nachteile:
- ⚠️ Doppelte Ressourcen während Deployment
- ⚠️ Komplexere Infrastruktur
- ⚠️ Database Schema muss kompatibel sein
Flow:
graph LR
A[Traffic 100%] --> B[Blue Workers v1.0]
C[Green Workers v2.0<br/>Deployed & Tested] -.-> D[Switch Traffic]
D --> E[Traffic 100%]
E --> F[Green Workers v2.0]
G[Blue Workers v1.0] -.-> H[Keep for Rollback]
style B fill:#e1f5ff
style F fill:#d4f1d4
style G fill:#ffe1e1
Implementation with Worker Versioning:
"""
Blue-Green Deployment mit Worker Versioning (Build IDs)
"""
from temporalio.client import Client
from temporalio.worker import Worker
# Hinweis: Die unten verwendeten Build-ID-Operationen (hier als BuildIdOperation bezeichnet)
# müssen ebenfalls aus temporalio.client importiert werden; der genaue Klassenname
# hängt von der SDK-Version ab.
async def deploy_green_workers():
"""Deploy new GREEN workers with new Build ID"""
client = await Client.connect("localhost:7233")
# Green workers mit neuem Build ID
worker = Worker(
client,
task_queue="production-queue",
workflows=[MyWorkflowV2], # New version
activities=[my_activity_v2],
build_id="v2.0.0", # GREEN Build ID
use_worker_versioning=True
)
await worker.run()
async def cutover_to_green():
"""Switch traffic from BLUE to GREEN"""
client = await Client.connect("localhost:7233")
# Make v2.0.0 the default for new workflows
await client.update_worker_build_id_compatibility(
task_queue="production-queue",
operation=BuildIdOperation.add_new_default("v2.0.0")
)
print("✓ Traffic switched to GREEN (v2.0.0)")
print(" - New workflows → v2.0.0")
print(" - Running workflows → continue on v1.0.0")
async def rollback_to_blue():
"""Rollback to BLUE version"""
client = await Client.connect("localhost:7233")
# Revert to v1.0.0
await client.update_worker_build_id_compatibility(
task_queue="production-queue",
operation=BuildIdOperation.promote_set_by_build_id("v1.0.0")
)
print("✓ Rolled back to BLUE (v1.0.0)")
Kubernetes Setup:
# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
name: temporal-worker-blue
spec:
replicas: 5
template:
metadata:
labels:
app: temporal-worker
color: blue
version: v1.0.0
spec:
containers:
- name: worker
image: myregistry/temporal-worker:v1.0.0
env:
- name: BUILD_ID
value: "v1.0.0"
---
# Green deployment (new version, ready for cutover)
apiVersion: apps/v1
kind: Deployment
metadata:
name: temporal-worker-green
spec:
replicas: 5
template:
metadata:
labels:
app: temporal-worker
color: green
version: v2.0.0
spec:
containers:
- name: worker
image: myregistry/temporal-worker:v2.0.0
env:
- name: BUILD_ID
value: "v2.0.0"
Deployment Procedure:
# 1. Deploy GREEN alongside BLUE
kubectl apply -f worker-green.yaml
# 2. Verify GREEN health
kubectl get pods -l color=green
kubectl logs -l color=green --tail=100
# 3. Run smoke tests on GREEN
python scripts/smoke_test.py --build-id v2.0.0
# 4. Cutover traffic to GREEN
python scripts/cutover.py --to green
# 5. Monitor GREEN for issues
# ... wait 1-2 hours ...
# 6. If all good, decommission BLUE
kubectl delete deployment temporal-worker-blue
# 7. If issues, instant rollback
python scripts/cutover.py --to blue
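Das im Ablauf referenzierte Skript scripts/cutover.py ist hier nicht abgedruckt; eine denkbare Minimalversion auf Basis der oben definierten Funktionen cutover_to_green() und rollback_to_blue() (Annahme: beide liegen in einem Modul namens deploy_blue_green):
"""
Skizze für scripts/cutover.py
"""
import argparse
import asyncio

from deploy_blue_green import cutover_to_green, rollback_to_blue  # Annahme: Modulname

def main():
    parser = argparse.ArgumentParser(description="Blue-Green Cutover für Temporal Worker")
    parser.add_argument("--to", choices=["green", "blue"], required=True)
    args = parser.parse_args()
    if args.to == "green":
        asyncio.run(cutover_to_green())   # neue Workflows laufen danach auf v2.0.0
    else:
        asyncio.run(rollback_to_blue())   # zurück auf v1.0.0

if __name__ == "__main__":
    main()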
10.1.4 Canary Deployment
Pattern: Neue Version erhält zunächst nur einen kleinen Prozentsatz des Traffics (z.B. 5%), der dann schrittweise erhöht wird.
Vorteile:
- ✅ Minimal Risk (nur 5% betroffen)
- ✅ Frühe Fehler-Erkennung
- ✅ Gradual Rollout
- ✅ A/B Testing möglich
Nachteile:
- ⚠️ Komplexere Observability
- ⚠️ Längerer Deployment-Prozess
- ⚠️ Requires Traffic Splitting
Flow:
graph TD
A[100% v1.0] --> B[95% v1.0 + 5% v2.0<br/>Canary]
B --> C{Metrics OK?}
C -->|Yes| D[80% v1.0 + 20% v2.0]
C -->|No| E[Rollback to 100% v1.0]
D --> F[50% v1.0 + 50% v2.0]
F --> G[100% v2.0]
style E fill:#ffe1e1
style G fill:#d4f1d4
Implementation mit Worker Versioning:
"""
Canary Deployment mit schrittweisem Rollout
"""
from temporalio.client import Client
import asyncio
async def canary_rollout():
"""Gradual canary rollout: 5% → 20% → 50% → 100%"""
client = await Client.connect("localhost:7233")
stages = [
{"canary_pct": 5, "wait_minutes": 30},
{"canary_pct": 20, "wait_minutes": 60},
{"canary_pct": 50, "wait_minutes": 120},
{"canary_pct": 100, "wait_minutes": 0},
]
for stage in stages:
pct = stage["canary_pct"]
wait = stage["wait_minutes"]
print(f"\n🚀 Stage: {pct}% canary traffic to v2.0.0")
# Adjust worker replicas based on percentage
canary_replicas = max(1, int(10 * pct / 100))
stable_replicas = 10 - canary_replicas
# Scale workers (using kubectl or k8s API)
await scale_workers("blue", stable_replicas)
await scale_workers("green", canary_replicas)
print(f" - Blue (v1.0.0): {stable_replicas} replicas")
print(f" - Green (v2.0.0): {canary_replicas} replicas")
if wait > 0:
print(f"⏳ Monitoring for {wait} minutes...")
await asyncio.sleep(wait * 60)
# Check metrics
metrics = await check_canary_metrics()
if not metrics["healthy"]:
print("❌ Canary metrics unhealthy, rolling back!")
await rollback()
return
print("✅ Canary metrics healthy, continuing rollout")
print("\n🎉 Canary rollout completed successfully!")
print(" 100% traffic now on v2.0.0")
async def check_canary_metrics() -> dict:
"""Check if canary version is healthy"""
# Check:
# - Error rate
# - Latency p99
# - Success rate
# - Custom business metrics
return {
"healthy": True,
"error_rate": 0.01,
"latency_p99": 450,
"success_rate": 99.9
}
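Die im Rollout verwendete Hilfsfunktion scale_workers() ist oben nicht definiert; eine mögliche Umsetzung über kubectl (Annahme: die Deployments heißen temporal-worker-blue und temporal-worker-green):
import asyncio

async def scale_workers(color: str, replicas: int) -> None:
    """Skaliert das blaue bzw. grüne Worker-Deployment per kubectl (Skizze)."""
    proc = await asyncio.create_subprocess_exec(
        "kubectl", "scale",
        f"deployment/temporal-worker-{color}",
        f"--replicas={replicas}",
    )
    if await proc.wait() != 0:
        raise RuntimeError(f"kubectl scale für '{color}' fehlgeschlagen")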
Kubernetes + Argo Rollouts:
# Canary with Argo Rollouts (advanced)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: temporal-worker
spec:
replicas: 10
strategy:
canary:
steps:
- setWeight: 5 # 5% canary
- pause: {duration: 30m}
- setWeight: 20 # 20% canary
- pause: {duration: 1h}
- setWeight: 50 # 50% canary
- pause: {duration: 2h}
- setWeight: 100 # Full rollout
# Auto-rollback on metrics failure
analysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: temporal-worker
template:
spec:
containers:
- name: worker
image: myregistry/temporal-worker:v2.0.0
10.1.5 Deployment Strategy Decision Matrix
| Faktor | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Rollback Speed | Slow | Instant | Fast |
| Resource Cost | Low | High (2x) | Medium |
| Risk | Medium | Low | Very Low |
| Best For | - Routine updates - Small teams | - Critical releases - Need instant rollback | - High-risk changes - A/B testing |
| Temporal Feature | Standard | Worker Versioning | Worker Versioning |
Empfehlung:
def choose_deployment_strategy(change_type: str) -> str:
"""Decision tree for deployment strategy"""
if change_type == "hotfix":
return "rolling" # Fast, simple
elif change_type == "major_release":
return "blue-green" # Safe, instant rollback
elif change_type == "experimental_feature":
return "canary" # Gradual, low risk
elif change_type == "routine_update":
return "rolling" # Standard, cost-effective
else:
return "canary" # When in doubt, go safe
10.2 Temporal Server Deployment
10.2.1 Temporal Cloud vs Self-Hosted
Decision Matrix:
| Faktor | Temporal Cloud | Self-Hosted |
|---|---|---|
| Setup Time | Minutes | Days/Weeks |
| Operational Overhead | None | High |
| Cost | Pay-per-use | Infrastructure + Team |
| Control | Limited | Full |
| Compliance | SOC2, HIPAA | Your responsibility |
| Customization | Limited | Unlimited |
| Scaling | Automatic | Manual |
| Best For | - Startups - Focus on business logic - Fast time-to-market | - Enterprise - Strict compliance - Full control needs |
Temporal Cloud:
"""
Connecting to Temporal Cloud
"""
from temporalio.client import Client, TLSConfig
async def connect_to_cloud():
"""Connect to Temporal Cloud"""
client = await Client.connect(
# Your Temporal Cloud namespace
target_host="my-namespace.tmprl.cloud:7233",
# Namespace
namespace="my-namespace.account-id",
# TLS configuration (required for Cloud)
tls=TLSConfig(
client_cert=open("client-cert.pem", "rb").read(),
client_private_key=open("client-key.pem", "rb").read(),
)
)
return client
# Usage
client = await connect_to_cloud()
Pros:
- ✅ No infrastructure management
- ✅ Automatic scaling
- ✅ Built-in monitoring
- ✅ Multi-region support
- ✅ 99.99% SLA
Cons:
- ❌ Less control over configuration
- ❌ Pay-per-action pricing
- ❌ Vendor lock-in
10.2.2 Self-Hosted: Docker Compose
For: Development, small deployments
# docker-compose.yml
version: '3.8'
services:
# PostgreSQL (persistence)
postgresql:
image: postgres:13
environment:
POSTGRES_PASSWORD: temporal
POSTGRES_USER: temporal
volumes:
- postgres-data:/var/lib/postgresql/data
networks:
- temporal-network
# Temporal Server
temporal:
image: temporalio/auto-setup:latest
depends_on:
- postgresql
environment:
- DB=postgresql
- DB_PORT=5432
- POSTGRES_USER=temporal
- POSTGRES_PWD=temporal
- POSTGRES_SEEDS=postgresql
- DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development-sql.yaml
ports:
- "7233:7233" # gRPC
- "8233:8233" # HTTP
volumes:
- ./dynamicconfig:/etc/temporal/config/dynamicconfig
networks:
- temporal-network
# Temporal Web UI
temporal-ui:
image: temporalio/ui:latest
depends_on:
- temporal
environment:
- TEMPORAL_ADDRESS=temporal:7233
ports:
- "8080:8080"
networks:
- temporal-network
volumes:
postgres-data:
networks:
temporal-network:
driver: bridge
Start:
docker-compose up -d
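Ob das lokale Setup erreichbar ist, lässt sich anschließend mit einem kurzen Python-Skript prüfen (Skizze):
"""
Skizze: Verbindung zum lokalen Temporal Server testen.
"""
import asyncio
from temporalio.client import Client

async def main():
    client = await Client.connect("localhost:7233")
    print(f"Verbunden, Namespace: {client.namespace}")
    # Vorhandene Workflows auflisten (bei frischem Setup leer)
    async for wf in client.list_workflows():
        print(f"  {wf.id} ({wf.status})")

asyncio.run(main())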
Pros:
- ✅ Simple setup
- ✅ Good for dev/test
- ✅ All-in-one
Cons:
- ❌ Not production-grade
- ❌ Single point of failure
- ❌ No HA
10.2.3 Self-Hosted: Kubernetes (Production)
For: Production, high availability
Architecture:
graph TB
subgraph "Temporal Cluster"
Frontend[Frontend Service<br/>gRPC API]
History[History Service<br/>Workflow Orchestration]
Matching[Matching Service<br/>Task Queue Management]
Worker_Service[Worker Service<br/>Background Jobs]
end
subgraph "Persistence"
DB[(PostgreSQL/Cassandra<br/>Workflow State)]
ES[(Elasticsearch<br/>Visibility)]
end
subgraph "Workers"
W1[Worker Pod 1]
W2[Worker Pod 2]
W3[Worker Pod N]
end
Frontend --> DB
History --> DB
History --> ES
Matching --> DB
Worker_Service --> DB
W1 --> Frontend
W2 --> Frontend
W3 --> Frontend
Helm Chart Deployment:
# 1. Add Temporal Helm repo
helm repo add temporalio https://go.temporal.io/helm-charts
helm repo update
# 2. Create namespace
kubectl create namespace temporal
# 3. Install with custom values
helm install temporal temporalio/temporal \
--namespace temporal \
--values temporal-values.yaml
temporal-values.yaml:
# Temporal Server configuration for production
# High Availability: Multiple replicas
server:
replicaCount: 3
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
# Frontend service
frontend:
replicaCount: 3
service:
type: LoadBalancer
port: 7233
# History service (most critical)
history:
replicaCount: 5
resources:
requests:
cpu: 2000m
memory: 4Gi
# Matching service
matching:
replicaCount: 3
# Worker service
worker:
replicaCount: 2
# PostgreSQL (persistence)
postgresql:
enabled: true
persistence:
enabled: true
size: 100Gi
storageClass: "fast-ssd"
# HA setup
replication:
enabled: true
readReplicas: 2
resources:
requests:
cpu: 2000m
memory: 8Gi
# Elasticsearch (visibility)
elasticsearch:
enabled: true
replicas: 3
minimumMasterNodes: 2
volumeClaimTemplate:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
# Prometheus metrics
prometheus:
enabled: true
# Grafana dashboards
grafana:
enabled: true
Verify Installation:
# Check pods
kubectl get pods -n temporal
# Expected output:
# NAME READY STATUS RESTARTS AGE
# temporal-frontend-xxxxx 1/1 Running 0 5m
# temporal-history-xxxxx 1/1 Running 0 5m
# temporal-matching-xxxxx 1/1 Running 0 5m
# temporal-worker-xxxxx 1/1 Running 0 5m
# temporal-postgresql-0 1/1 Running 0 5m
# temporal-elasticsearch-0 1/1 Running 0 5m
# Check services
kubectl get svc -n temporal
# Port-forward to access UI
kubectl port-forward -n temporal svc/temporal-frontend 7233:7233
kubectl port-forward -n temporal svc/temporal-web 8080:8080
10.2.4 High Availability Setup
Multi-Region Deployment:
graph TB
subgraph "Region US-East"
US_LB[Load Balancer]
US_T1[Temporal Cluster]
US_DB[(PostgreSQL Primary)]
end
subgraph "Region EU-West"
EU_LB[Load Balancer]
EU_T1[Temporal Cluster]
EU_DB[(PostgreSQL Replica)]
end
subgraph "Region AP-South"
AP_LB[Load Balancer]
AP_T1[Temporal Cluster]
AP_DB[(PostgreSQL Replica)]
end
US_DB -.Replication.-> EU_DB
US_DB -.Replication.-> AP_DB
Global_DNS[Global DNS/CDN] --> US_LB
Global_DNS --> EU_LB
Global_DNS --> AP_LB
HA Checklist:
✅ Infrastructure:
- Multiple availability zones
- Database replication (PostgreSQL streaming)
- Load balancer health checks
- Auto-scaling groups
- Network redundancy
✅ Temporal Server:
- Frontend: ≥3 replicas
- History: ≥5 replicas (most critical)
- Matching: ≥3 replicas
- Worker: ≥2 replicas
✅ Database:
- PostgreSQL with streaming replication
- Automated backups (daily)
- Point-in-time recovery enabled
- Separate disk for WAL logs
- Connection pooling (PgBouncer)
✅ Monitoring:
- Prometheus + Grafana
- Alert on service degradation
- Dashboard for all services
- Log aggregation (ELK/Loki)
10.2.5 Disaster Recovery
Backup Strategy:
#!/bin/bash
# backup-temporal-db.sh
# Automated PostgreSQL backup script
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/temporal"
DB_NAME="temporal"
DB_USER="temporal"
# Full backup
pg_dump -U $DB_USER -d $DB_NAME -F c -b -v \
-f $BACKUP_DIR/temporal_$TIMESTAMP.dump
# Compress
gzip $BACKUP_DIR/temporal_$TIMESTAMP.dump
# Upload to S3
aws s3 cp $BACKUP_DIR/temporal_$TIMESTAMP.dump.gz \
s3://my-temporal-backups/daily/
# Cleanup old backups (keep 30 days)
find $BACKUP_DIR -name "*.dump.gz" -mtime +30 -delete
echo "Backup completed: temporal_$TIMESTAMP.dump.gz"
Cron Schedule:
# Daily backup at 2 AM
0 2 * * * /scripts/backup-temporal-db.sh >> /var/log/temporal-backup.log 2>&1
# Hourly incremental backup (WAL archiving)
0 * * * * /scripts/archive-wal.sh >> /var/log/wal-archive.log 2>&1
Restore Procedure:
#!/bin/bash
# restore-temporal-db.sh
BACKUP_FILE=$1
# Download from S3
aws s3 cp s3://my-temporal-backups/daily/$BACKUP_FILE .
# Decompress
gunzip $BACKUP_FILE
# Restore
pg_restore -U temporal -d temporal -c -v ${BACKUP_FILE%.gz}
echo "Restore completed from $BACKUP_FILE"
DR Runbook:
Disaster Recovery Runbook
Scenario 1: Database Corruption
1. Stop Temporal services
kubectl scale deployment temporal-frontend --replicas=0 -n temporal
kubectl scale deployment temporal-history --replicas=0 -n temporal
kubectl scale deployment temporal-matching --replicas=0 -n temporal
2. Restore from latest backup
./restore-temporal-db.sh temporal_20250118_020000.dump.gz
3. Verify database integrity
psql -U temporal -d temporal -c "SELECT COUNT(*) FROM executions;"
4. Restart services
kubectl scale deployment temporal-frontend --replicas=3 -n temporal
kubectl scale deployment temporal-history --replicas=5 -n temporal
kubectl scale deployment temporal-matching --replicas=3 -n temporal
5. Verify workflows resuming
temporal workflow list
Scenario 2: Complete Region Failure
1. Switch DNS to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch file://failover.json
2. Promote replica to primary
kubectl exec -it postgresql-replica-0 -n temporal -- \
  pg_ctl promote
3. Scale up DR services
kubectl scale deployment temporal-frontend --replicas=3 -n temporal-dr
kubectl scale deployment temporal-history --replicas=5 -n temporal-dr
4. Update worker connections
# Workers automatically reconnect to new endpoint via DNS
RTO/RPO Targets
- RTO (Recovery Time Objective): 15 minutes
- RPO (Recovery Point Objective): 1 hour (last backup)
10.3 Capacity Planning
10.3.1 Worker Sizing
Factors:
1. Workflow Throughput
- Workflows/second
- Average workflow duration
- Concurrent workflow limit
2. Activity Characteristics
- Average duration
- CPU/Memory usage
- External dependencies (API rate limits)
3. Task Queue Backlog
- Acceptable lag
- Peak vs average load
Formula:
"""
Worker Capacity Calculator
"""
from dataclasses import dataclass
from typing import List
@dataclass
class WorkloadProfile:
"""Characterize your workload"""
workflows_per_second: float
avg_workflow_duration_sec: float
activities_per_workflow: int
avg_activity_duration_sec: float
activity_cpu_cores: float = 0.1
activity_memory_mb: float = 256
def calculate_required_workers(profile: WorkloadProfile) -> dict:
"""Calculate required worker capacity"""
# Concurrent workflows
concurrent_workflows = profile.workflows_per_second * profile.avg_workflow_duration_sec
# Concurrent activities
concurrent_activities = (
concurrent_workflows *
profile.activities_per_workflow
)
# Worker slots (assuming 100 slots per worker)
slots_per_worker = 100
required_workers = max(1, int(concurrent_activities / slots_per_worker) + 1)
# Resource requirements
total_cpu = concurrent_activities * profile.activity_cpu_cores
total_memory_gb = (concurrent_activities * profile.activity_memory_mb) / 1024
return {
"concurrent_workflows": int(concurrent_workflows),
"concurrent_activities": int(concurrent_activities),
"required_workers": required_workers,
"total_cpu_cores": round(total_cpu, 2),
"total_memory_gb": round(total_memory_gb, 2),
"cpu_per_worker": round(total_cpu / required_workers, 2),
"memory_per_worker_gb": round(total_memory_gb / required_workers, 2)
}
# Example
profile = WorkloadProfile(
workflows_per_second=10,
avg_workflow_duration_sec=300, # 5 minutes
activities_per_workflow=5,
avg_activity_duration_sec=10,
activity_cpu_cores=0.1,
activity_memory_mb=256
)
result = calculate_required_workers(profile)
print("Capacity Planning Results:")
print(f" Concurrent Workflows: {result['concurrent_workflows']}")
print(f" Concurrent Activities: {result['concurrent_activities']}")
print(f" Required Workers: {result['required_workers']}")
print(f" Total CPU: {result['total_cpu_cores']} cores")
print(f" Total Memory: {result['total_memory_gb']} GB")
print(f" Per Worker: {result['cpu_per_worker']} CPU, {result['memory_per_worker_gb']} GB RAM")
Output:
Capacity Planning Results:
Concurrent Workflows: 3000
Concurrent Activities: 15000
Required Workers: 151
Total CPU: 1500.0 cores
Total Memory: 3750.0 GB
Per Worker: 9.93 CPU, 24.83 GB RAM
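Die im Kalkulator angenommenen 100 Slots pro Worker entsprechen im Python SDK dem Worker-Parameter max_concurrent_activities. Eine kurze Skizze (Client, Workflow und Activity sind Platzhalter):
from temporalio.worker import Worker

# Skizze: Worker mit 100 Activity-Slots, passend zu slots_per_worker = 100
worker = Worker(
    client,                               # Platzhalter: verbundener Client
    task_queue="order-processing",
    workflows=[OrderWorkflow],            # Platzhalter
    activities=[process_payment],         # Platzhalter
    max_concurrent_activities=100,
    max_concurrent_workflow_tasks=100,
)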
10.3.2 Horizontal Pod Autoscaling
Kubernetes HPA:
# hpa-worker.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: temporal-worker-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: temporal-worker
minReplicas: 5
maxReplicas: 50
metrics:
# Scale based on CPU
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scale based on Memory
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# Custom metric: Task queue backlog
- type: Pods
pods:
metric:
name: temporal_task_queue_backlog
target:
type: AverageValue
averageValue: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5min before scaling down
policies:
- type: Percent
value: 50 # Max 50% pods removed at once
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100 # Max 100% pods added at once
periodSeconds: 15
- type: Pods
value: 5 # Max 5 pods added at once
periodSeconds: 15
selectPolicy: Max # Use most aggressive policy
Custom Metrics (Prometheus Adapter):
# prometheus-adapter-config.yaml
rules:
- seriesQuery: 'temporal_task_queue_backlog'
resources:
template: <<.Resource>>
name:
matches: "^(.*)$"
as: "temporal_task_queue_backlog"
metricsQuery: 'avg(temporal_task_queue_backlog{queue="production-queue"})'
10.3.3 Database Sizing
PostgreSQL Sizing Guidelines:
| Workflows | Storage | CPU | RAM | IOPS |
|---|---|---|---|---|
| 1M active | 100 GB | 4 cores | 16 GB | 3,000 |
| 10M active | 500 GB | 8 cores | 32 GB | 10,000 |
| 100M active | 2 TB | 16 cores | 64 GB | 30,000 |
Storage Growth Estimation:
def estimate_storage_growth(
workflows_per_day: int,
avg_events_per_workflow: int,
avg_event_size_bytes: int = 1024,
retention_days: int = 90
) -> dict:
"""Estimate PostgreSQL storage requirements"""
# Total workflows in retention window
total_workflows = workflows_per_day * retention_days
# Events
total_events = total_workflows * avg_events_per_workflow
# Storage (with overhead)
raw_storage_gb = (total_events * avg_event_size_bytes) / (1024**3)
storage_with_overhead_gb = raw_storage_gb * 1.5 # 50% overhead for indexes, etc.
# Growth per day
daily_growth_gb = (workflows_per_day * avg_events_per_workflow * avg_event_size_bytes) / (1024**3)
return {
"total_workflows": total_workflows,
"total_events": total_events,
"storage_required_gb": round(storage_with_overhead_gb, 2),
"daily_growth_gb": round(daily_growth_gb, 2)
}
# Example
result = estimate_storage_growth(
workflows_per_day=100000,
avg_events_per_workflow=50,
retention_days=90
)
print(f"Storage required: {result['storage_required_gb']} GB")
print(f"Daily growth: {result['daily_growth_gb']} GB")
10.4 Production Checklist
10.4.1 Pre-Deployment
Code:
- All tests passing (unit, integration, replay)
- Workflow versioning implemented (patching or Build IDs)
- Error handling and retries configured
- Logging at appropriate levels
- No secrets in code (use Secret Manager)
- Code reviewed and approved
Infrastructure:
- Temporal Server deployed (Cloud or self-hosted)
- Database configured with replication
- Backups automated and tested
- Monitoring and alerting setup
- Resource limits configured
- Network policies applied
Security:
- TLS enabled for all connections
- mTLS configured (if required)
- RBAC/authorization configured
- Secrets encrypted at rest
- Audit logging enabled
- Vulnerability scanning completed
Operations:
- Runbooks created (incident response, DR)
- On-call rotation scheduled
- Escalation paths defined
- SLOs/SLAs documented
10.4.2 Deployment
- Deploy in off-peak hours
- Use deployment strategy (Rolling/Blue-Green/Canary)
- Monitor metrics in real-time
- Validate with smoke tests
- Communicate to stakeholders
10.4.3 Post-Deployment
- Verify all workers healthy
- Check task queue backlog
- Monitor error rates
- Review logs for warnings
- Confirm workflows completing successfully
- Update documentation
- Retrospective (lessons learned)
10.5 Zusammenfassung
Wichtigste Konzepte
1. Graceful Shutdown
- Workers müssen laufende Activities abschließen
- graceful_shutdown_timeout > längste Activity-Dauer
- Kubernetes terminationGracePeriodSeconds entsprechend setzen
2. Deployment Strategies
- Rolling: Standard, kostengünstig, moderate Risk
- Blue-Green: Instant Rollback, höhere Kosten
- Canary: Minimales Risiko, schrittweiser Rollout
3. Temporal Server
- Cloud: Einfach, managed, pay-per-use
- Self-Hosted: Volle Kontrolle, höherer Aufwand
- HA Setup: Multi-AZ, Replikation, Load Balancing
4. Capacity Planning
- Worker-Sizing basierend auf Workload-Profil
- Horizontal Pod Autoscaling für elastische Kapazität
- Database-Sizing für Storage und IOPS
5. Production Readiness
- Comprehensive Checklisten
- Automated Backups & DR
- Monitoring & Alerting (Kapitel 11)
Best Practices
✅ DO:
- Implement graceful shutdown
- Use deployment strategies (not ad-hoc restarts)
- Automate capacity planning
- Test DR procedures regularly
- Monitor all the things (Kapitel 11)
❌ DON’T:
- Kill workers abruptly (SIGKILL)
- Deploy without versioning
- Skip capacity planning
- Ignore backup testing
- Deploy without monitoring
Nächste Schritte
- Kapitel 11: Monitoring und Observability – Wie Sie Production-Workflows überwachen
- Kapitel 12: Testing Strategies – Comprehensive testing für Temporal
- Kapitel 13: Best Practices und Anti-Muster – Production-ready Temporal-Anwendungen
Weiterführende Ressourcen
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 11: Monitoring und Observability
Code-Beispiele für dieses Kapitel: examples/part-04/chapter-10/
Praxis-Tipp: Beginnen Sie mit Temporal Cloud für schnellen Start. Wenn Sie spezifische Compliance- oder Kosten-Anforderungen haben, evaluieren Sie Self-Hosted. Unabhängig davon: Implementieren Sie von Anfang an graceful shutdown und Deployment-Strategien!
Kapitel 11: Monitoring und Observability
Einleitung
Sie haben Temporal in Production deployed, Workers laufen, Workflows werden ausgeführt. Alles scheint gut zu funktionieren. Bis plötzlich:
- Workflows verzögern sich ohne erkennbaren Grund
- Activities schlagen häufiger fehl als erwartet
- Task Queues füllen sich auf
- Die Business-Logik funktioniert nicht mehr wie gewünscht
Ohne Monitoring sind Sie blind. Sie merken Probleme erst, wenn Kunden sich beschweren. Sie haben keine Ahnung, wo das Problem liegt. Debugging wird zum Rätselraten.
Mit richtigem Monitoring und Observability sehen Sie:
- Wie viele Workflows gerade laufen
- Wo Bottlenecks sind
- Welche Activities am längsten dauern
- Ob Worker überlastet sind
- Wann Probleme begannen und warum
Temporal bietet umfassende Observability-Features, aber sie müssen richtig konfiguriert und genutzt werden.
Das Grundproblem
Scenario: Sie betreiben einen Order Processing Service mit Temporal:
@workflow.defn
class OrderWorkflow:
async def run(self, order_id: str) -> str:
# 10+ Activities: payment, inventory, shipping, notifications, etc.
payment = await workflow.execute_activity(process_payment, ...)
inventory = await workflow.execute_activity(check_inventory, ...)
shipping = await workflow.execute_activity(create_shipment, ...)
# ... more activities
Plötzlich: Kunden berichten, dass Orders langsamer verarbeitet werden. Von 2 Minuten auf 10+ Minuten.
Ohne Monitoring:
❓ Welche Activity ist langsam?
❓ Ist es ein spezifischer Worker?
❓ Ist die Datenbank überlastet?
❓ Sind externe APIs langsam?
❓ Gibt es ein Deployment-Problem?
→ Stunden mit Guesswork verbringen
→ Logs manuell durchsuchen
→ Code instrumentieren und neu deployen
Mit Monitoring & Observability:
✓ Grafana Dashboard öffnen
✓ "process_payment" Activity latency: 9 Minuten (normal: 30s)
✓ Trace zeigt: Payment API antwortet nicht
✓ Logs zeigen: Connection timeouts zu payment.api.com
✓ Alert wurde bereits ausgelöst
→ Problem in 2 Minuten identifiziert
→ Payment Service Team kontaktieren
→ Fallback-Lösung aktivieren
Die drei Säulen der Observability
1. Metrics (Was passiert?)
- Workflow execution rate
- Activity success/failure rates
- Queue depth
- Worker utilization
- Latency percentiles (p50, p95, p99)
2. Logs (Warum passiert es?)
- Structured logging in Workflows/Activities
- Correlation mit Workflow/Activity IDs
- Error messages und stack traces
- Business-relevante Events
3. Traces (Wie fließen Requests?)
- End-to-end Workflow execution traces
- Activity spans
- Distributed tracing über Service-Grenzen
- Bottleneck-Identifikation
Lernziele
Nach diesem Kapitel können Sie:
- SDK Metrics mit Prometheus exportieren und monitoren
- Temporal Cloud/Server Metrics nutzen
- Grafana Dashboards für Temporal erstellen und nutzen
- OpenTelemetry für Distributed Tracing integrieren
- Strukturierte Logs mit Correlation implementieren
- SLO-basiertes Alerting für kritische Workflows aufsetzen
- Debugging mit Observability-Tools durchführen
11.1 SDK Metrics mit Prometheus
11.1.1 Warum SDK Metrics?
Temporal bietet zwei Arten von Metrics:
| Metric Source | Perspektive | Was wird gemessen? |
|---|---|---|
| SDK Metrics | Client/Worker | Ihre Application-Performance |
| Server Metrics | Temporal Service | Temporal Infrastructure Health |
Für Application Monitoring → SDK Metrics sind die Source of Truth!
SDK Metrics zeigen:
- Activity execution time aus Sicht Ihrer Worker
- Workflow execution success rate Ihrer Workflows
- Task Queue lag Ihrer Queues
- Worker resource usage Ihrer Deployments
11.1.2 Prometheus Setup für Python SDK
Schritt 1: Dependencies
# requirements.txt
temporalio>=1.5.0
prometheus-client>=0.19.0
Schritt 2: Prometheus Exporter in Worker
"""
Worker mit Prometheus Metrics Export
Chapter: 11 - Monitoring und Observability
"""
import asyncio
import logging
from temporalio.client import Client
from temporalio.worker import Worker
from temporalio.runtime import Runtime, TelemetryConfig, PrometheusConfig
logger = logging.getLogger(__name__)
class MonitoredWorker:
"""Worker mit Prometheus Metrics"""
def __init__(
self,
temporal_host: str,
task_queue: str,
workflows: list,
activities: list,
metrics_port: int = 9090
):
self.temporal_host = temporal_host
self.task_queue = task_queue
self.workflows = workflows
self.activities = activities
self.metrics_port = metrics_port
    async def run(self):
        """Start worker mit Prometheus metrics export"""
        # 1. Runtime mit Prometheus-Exporter erstellen.
        #    Der SDK-Core stellt die Metrics selbst unter /metrics bereit,
        #    ein separater prometheus_client-HTTP-Server ist dafür nicht nötig.
        runtime = self._create_runtime_with_metrics()
        # 2. Temporal Client mit dieser Runtime verbinden
        client = await Client.connect(
            self.temporal_host,
            runtime=runtime
        )
        logger.info(f"✓ Prometheus metrics exposed on :{self.metrics_port}/metrics")
        # 3. Worker starten
        async with Worker(
            client,
            task_queue=self.task_queue,
            workflows=self.workflows,
            activities=self.activities
        ):
            logger.info(f"✓ Worker started on queue: {self.task_queue}")
            await asyncio.Event().wait()  # Run forever
    def _create_runtime_with_metrics(self) -> Runtime:
        """Runtime mit Prometheus Metrics konfigurieren"""
        return Runtime(
            telemetry=TelemetryConfig(
                metrics=PrometheusConfig(
                    # Der SDK-Core öffnet unter dieser Adresse einen eigenen /metrics-Endpoint
                    bind_address=f"0.0.0.0:{self.metrics_port}"
                )
            )
        )
# Verwendung
if __name__ == "__main__":
from my_workflows import OrderWorkflow
from my_activities import process_payment, check_inventory
worker = MonitoredWorker(
temporal_host="localhost:7233",
task_queue="order-processing",
workflows=[OrderWorkflow],
activities=[process_payment, check_inventory],
metrics_port=9090
)
asyncio.run(worker.run())
Schritt 3: Metrics abrufen
# Metrics endpoint testen
curl http://localhost:9090/metrics
# Ausgabe (Beispiel):
# temporal_workflow_task_execution_count{namespace="default",task_queue="order-processing"} 142
# temporal_activity_execution_count{activity_type="process_payment"} 89
# temporal_activity_execution_latency_seconds_sum{activity_type="process_payment"} 45.2
# temporal_worker_task_slots_available{task_queue="order-processing"} 98
# ...
11.1.3 Prometheus Scrape Configuration
prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# Temporal Workers
- job_name: 'temporal-workers'
static_configs:
- targets:
- 'worker-1:9090'
- 'worker-2:9090'
- 'worker-3:9090'
# Labels für besseres Filtering
relabel_configs:
- source_labels: [__address__]
regex: 'worker-(\d+):.*'
target_label: 'worker_id'
replacement: '$1'
# Temporal Server (self-hosted)
- job_name: 'temporal-server'
static_configs:
- targets:
- 'temporal-frontend:9090'
- 'temporal-history:9090'
- 'temporal-matching:9090'
- 'temporal-worker:9090'
# Temporal Cloud (via Prometheus API)
- job_name: 'temporal-cloud'
scheme: https
static_configs:
- targets:
- 'cloud-metrics.temporal.io'
authorization:
credentials: '<YOUR_TEMPORAL_CLOUD_API_KEY>'
params:
namespace: ['your-namespace.account']
Kubernetes Service Discovery (fortgeschritten):
scrape_configs:
- job_name: 'temporal-workers-k8s'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Nur Pods mit Label app=temporal-worker
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: temporal-worker
# Port 9090 targeten
- source_labels: [__meta_kubernetes_pod_ip]
action: replace
target_label: __address__
replacement: '$1:9090'
# Labels hinzufügen
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
11.1.4 Wichtige SDK Metrics
Workflow Metrics:
# Workflow Execution Rate
rate(temporal_workflow_task_execution_count[5m])
# Workflow Success Rate
rate(temporal_workflow_completed_count{status="completed"}[5m])
/
rate(temporal_workflow_completed_count[5m])
# Workflow Latency (p95)
histogram_quantile(0.95,
rate(temporal_workflow_execution_latency_seconds_bucket[5m])
)
Activity Metrics:
# Activity Execution Rate by Type
sum by (activity_type) (rate(temporal_activity_execution_count[5m]))
# Activity Failure Rate
sum by (activity_type) (rate(temporal_activity_execution_failed_count[5m]))
# Activity Latency by Type (p95)
histogram_quantile(0.95,
  sum by (le, activity_type) (rate(temporal_activity_execution_latency_seconds_bucket[5m]))
)
# Slowest Activities (Top 5)
topk(5,
avg(rate(temporal_activity_execution_latency_seconds_sum[5m]))
by (activity_type)
)
Worker Metrics:
# Task Slots Used vs Available
temporal_worker_task_slots_used / temporal_worker_task_slots_available
# Task Queue Lag (Backlog)
temporal_task_queue_lag_seconds
# Worker Poll Success Rate
rate(temporal_worker_poll_success_count[5m])
/
rate(temporal_worker_poll_count[5m])
11.1.5 Custom Business Metrics
Problem: SDK Metrics zeigen technische Metriken, aber nicht Ihre Business KPIs.
Lösung: Custom Metrics in Activities exportieren.
"""
Custom Business Metrics in Activities
"""
import time
from temporalio import activity
from prometheus_client import Counter, Histogram, Gauge
# Custom Metrics
orders_processed = Counter(
'orders_processed_total',
'Total orders processed',
['status', 'payment_method']
)
order_value = Histogram(
'order_value_usd',
'Order value in USD',
buckets=[10, 50, 100, 500, 1000, 5000]
)
payment_latency = Histogram(
'payment_processing_seconds',
'Payment processing time',
['payment_provider']
)
active_orders = Gauge(
'active_orders',
'Currently processing orders'
)
@activity.defn
async def process_order(order_id: str, amount: float, payment_method: str) -> str:
"""Process order mit custom metrics"""
# Gauge: Active orders erhöhen
active_orders.inc()
try:
# Business-Logic
start = time.time()
payment_result = await process_payment(amount, payment_method)
latency = time.time() - start
# Metrics erfassen
payment_latency.labels(payment_provider=payment_method).observe(latency)
order_value.observe(amount)
orders_processed.labels(
status='completed',
payment_method=payment_method
).inc()
return f"Order {order_id} completed"
except Exception as e:
orders_processed.labels(
status='failed',
payment_method=payment_method
).inc()
raise
finally:
# Gauge: Active orders reduzieren
active_orders.dec()
PromQL Queries für Business Metrics:
# Revenue per Hour
sum(rate(order_value_usd_sum[1h]))
# Orders per Minute by Payment Method
sum(rate(orders_processed_total[1m])) by (payment_method)
# Payment Provider Performance (p95)
histogram_quantile(0.95,
  sum by (le, payment_provider) (rate(payment_processing_seconds_bucket[5m]))
)
# Success Rate by Payment Method
sum(rate(orders_processed_total{status="completed"}[5m])) by (payment_method)
/
sum(rate(orders_processed_total[5m])) by (payment_method)
11.2 Grafana Dashboards
11.2.1 Grafana Setup
Docker Compose Setup (Development):
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9091:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
depends_on:
- prometheus
volumes:
prometheus-data:
grafana-data:
Grafana Datasource Provisioning:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
11.2.2 Community Dashboards
Temporal stellt Community Grafana Dashboards bereit:
Installation:
# Dashboard JSON herunterladen
curl -O https://raw.githubusercontent.com/temporalio/dashboards/main/grafana/temporal-sdk.json
# In Grafana importieren:
# Dashboards > Import > Upload JSON file
Verfügbare Dashboards:
1. Temporal SDK Overview
- Workflow execution rates
- Activity success/failure rates
- Worker health
- Task queue metrics
2. Temporal Server
- Service health (Frontend, History, Matching, Worker)
- Request rates und latency
- Database performance
- Resource usage
3. Temporal Cloud
- Namespace metrics
- API request rates
- Workflow execution trends
- Billing metrics
11.2.3 Custom Dashboard erstellen
Panel 1: Workflow Execution Rate
{
"title": "Workflow Execution Rate",
"targets": [{
"expr": "rate(temporal_workflow_task_execution_count{namespace=\"$namespace\"}[5m])",
"legendFormat": "{{task_queue}}"
}],
"type": "graph"
}
Panel 2: Activity Latency Heatmap
{
"title": "Activity Latency Distribution",
"targets": [{
"expr": "rate(temporal_activity_execution_latency_seconds_bucket{activity_type=\"$activity\"}[5m])",
"format": "heatmap"
}],
"type": "heatmap",
"yAxis": { "format": "s" }
}
Panel 3: Worker Task Slots
{
"title": "Worker Task Slots",
"targets": [
{
"expr": "temporal_worker_task_slots_available",
"legendFormat": "Available - {{worker_id}}"
},
{
"expr": "temporal_worker_task_slots_used",
"legendFormat": "Used - {{worker_id}}"
}
],
"type": "graph",
"stack": true
}
Panel 4: Top Slowest Activities
{
"title": "Top 10 Slowest Activities",
"targets": [{
"expr": "topk(10, avg(rate(temporal_activity_execution_latency_seconds_sum[5m])) by (activity_type))",
"legendFormat": "{{activity_type}}",
"instant": true
}],
"type": "table"
}
Complete Dashboard Example (kompakt):
{
"dashboard": {
"title": "Temporal - Order Processing",
"timezone": "browser",
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"query": "label_values(temporal_workflow_task_execution_count, namespace)"
},
{
"name": "task_queue",
"type": "query",
"query": "label_values(temporal_workflow_task_execution_count{namespace=\"$namespace\"}, task_queue)"
}
]
},
"panels": [
{
"title": "Workflow Execution Rate",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [{
"expr": "rate(temporal_workflow_task_execution_count{namespace=\"$namespace\",task_queue=\"$task_queue\"}[5m])"
}]
},
{
"title": "Activity Success Rate",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [{
"expr": "rate(temporal_activity_execution_count{status=\"completed\"}[5m]) / rate(temporal_activity_execution_count[5m])"
}]
},
{
"title": "Task Queue Lag",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [{
"expr": "temporal_task_queue_lag_seconds{task_queue=\"$task_queue\"}"
}]
}
]
}
}
11.2.4 Alerting in Grafana
Alert 1: High Workflow Failure Rate
# Alert Definition
- alert: HighWorkflowFailureRate
expr: |
(
rate(temporal_workflow_completed_count{status="failed"}[5m])
/
rate(temporal_workflow_completed_count[5m])
) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High workflow failure rate"
description: "{{ $value | humanizePercentage }} of workflows are failing"
Alert 2: Task Queue Backlog
- alert: TaskQueueBacklog
expr: temporal_task_queue_lag_seconds > 300
for: 10m
labels:
severity: critical
annotations:
summary: "Task queue has significant backlog"
description: "Task queue {{ $labels.task_queue }} has {{ $value }}s lag"
Alert 3: Worker Unavailable
- alert: WorkerUnavailable
expr: up{job="temporal-workers"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Worker is down"
description: "Worker {{ $labels.instance }} is not responding"
Alert 4: Activity Latency Spike
- alert: ActivityLatencySpike
expr: |
histogram_quantile(0.95,
rate(temporal_activity_execution_latency_seconds_bucket[5m])
) > 60
for: 5m
labels:
severity: warning
activity_type: "{{ $labels.activity_type }}"
annotations:
summary: "Activity latency is high"
description: "p95 latency for {{ $labels.activity_type }}: {{ $value }}s"
11.3 OpenTelemetry Integration
11.3.1 Warum OpenTelemetry?
Prometheus + Grafana geben Ihnen Metrics. Aber für Distributed Tracing brauchen Sie mehr:
- End-to-End Traces: Verfolgen Sie einen Request durch Ihr gesamtes System
- Spans: Sehen Sie, wie lange jede Activity dauert
- Context Propagation: Korrelieren Sie Logs, Metrics und Traces
- Service Dependencies: Visualisieren Sie, wie Services miteinander kommunizieren
Use Case: Ein Workflow ruft 5 verschiedene Microservices auf. Welcher Service verursacht die Latenz?
HTTP Request → API Gateway → Order Workflow
├─> Payment Service (500ms)
├─> Inventory Service (200ms)
├─> Shipping Service (3000ms) ← BOTTLENECK!
├─> Email Service (100ms)
└─> Analytics Service (50ms)
Mit OpenTelemetry sehen Sie diese gesamte Kette als einen zusammenhängenden Trace.
11.3.2 OpenTelemetry Setup
Dependencies:
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp \
    "temporalio[opentelemetry]"
Tracer Setup:
"""
OpenTelemetry Integration für Temporal
Chapter: 11 - Monitoring und Observability
"""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from temporalio.client import Client
from temporalio.worker import Worker
from temporalio import workflow, activity
import asyncio
from datetime import timedelta
def setup_telemetry(service_name: str):
"""Setup OpenTelemetry Tracing"""
# Resource: Identifiziert diesen Service
resource = Resource.create({
"service.name": service_name,
"service.version": "1.0.0",
"deployment.environment": "production"
})
# Tracer Provider
provider = TracerProvider(resource=resource)
# OTLP Exporter (zu Tempo, Jaeger, etc.)
otlp_exporter = OTLPSpanExporter(
endpoint="http://tempo:4317",
insecure=True
)
# Batch Processor (für Performance)
span_processor = BatchSpanProcessor(otlp_exporter)
provider.add_span_processor(span_processor)
# Global Tracer setzen
trace.set_tracer_provider(provider)
return trace.get_tracer(service_name)
# Tracer erstellen
tracer = setup_telemetry("order-service")
@activity.defn
async def process_payment(order_id: str, amount: float) -> dict:
"""Activity mit manual tracing"""
# Span für diese Activity
with tracer.start_as_current_span("process_payment") as span:
# Span Attributes (Metadata)
span.set_attribute("order_id", order_id)
span.set_attribute("amount", amount)
span.set_attribute("activity.type", "process_payment")
# External Service Call tracen
with tracer.start_as_current_span("call_payment_api") as api_span:
api_span.set_attribute("http.method", "POST")
api_span.set_attribute("http.url", "https://payment.api/charge")
# Simulierter API Call
await asyncio.sleep(0.5)
api_span.set_attribute("http.status_code", 200)
# Span Status
span.set_status(trace.StatusCode.OK)
return {
"success": True,
"transaction_id": f"txn_{order_id}"
}
@workflow.defn
class OrderWorkflow:
"""Workflow mit Tracing"""
@workflow.run
async def run(self, order_id: str) -> dict:
# Workflow-Context als Span
# (automatisch durch Temporal SDK + OpenTelemetry Instrumentation)
workflow.logger.info(f"Processing order {order_id}")
# Activities werden automatisch als Child Spans getrackt
payment = await workflow.execute_activity(
process_payment,
args=[order_id, 99.99],
start_to_close_timeout=timedelta(seconds=30)
)
# Weitere Activities...
return {"status": "completed", "payment": payment}
11.3.3 Automatic Instrumentation
Einfachere Alternative: der TracingInterceptor aus dem Temporal SDK (temporalio.contrib.opentelemetry):
from temporalio.contrib.opentelemetry import TracingInterceptor
# Interceptor beim Client registrieren: Workflows und Activities
# werden dadurch automatisch als Spans getrackt
client = await Client.connect(
    "localhost:7233",
    interceptors=[TracingInterceptor()]
)
Was wird automatisch getrackt:
- Workflow Start/Complete
- Activity Execution
- Task Queue Operations
- Signals/Queries
- Child Workflows
11.3.4 Tempo + Grafana Setup
Docker Compose:
version: '3.8'
services:
tempo:
image: grafana/tempo:latest
command: ["-config.file=/etc/tempo.yaml"]
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "3200:3200" # Tempo Query Frontend
volumes:
- ./tempo.yaml:/etc/tempo.yaml
- tempo-data:/var/tempo
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
volumes:
- ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
volumes:
tempo-data:
tempo.yaml:
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318"
storage:
trace:
backend: local
local:
path: /var/tempo/traces
query_frontend:
search:
enabled: true
grafana-datasources.yaml:
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
isDefault: false
11.3.5 Trace Visualisierung
In Grafana Explore:
1. Data Source: Tempo
2. Query: trace_id = "abc123..."
3. Visualisierung:
OrderWorkflow [========== 5.2s ==========]
├─ process_payment [=== 0.5s ===]
│ └─ call_payment_api [== 0.48s ==]
├─ check_inventory [= 0.2s =]
├─ create_shipment [======== 3.0s ========] ← SLOW!
├─ send_confirmation_email [= 0.1s =]
└─ update_analytics [= 0.05s =]
Trace Search Queries:
# Alle Traces für einen Workflow
service.name="order-service" && workflow.type="OrderWorkflow"
# Langsame Traces (> 5s)
service.name="order-service" && duration > 5s
# Fehlerhafte Traces
status=error
# Traces für bestimmte Order
order_id="order-12345"
11.3.6 Correlation: Metrics + Logs + Traces
Das Problem: Metrics zeigen ein Problem, aber Sie brauchen Details.
Lösung: Exemplars + Trace IDs in Logs
Prometheus Exemplars:
import time
from prometheus_client import Histogram
from opentelemetry import trace
from temporalio import activity
# Histogram mit Exemplar Support
activity_latency = Histogram(
'activity_execution_seconds',
'Activity execution time'
)
@activity.defn
async def my_activity():
start = time.time()
# ... Activity Logic ...
latency = time.time() - start
# Metric + Trace ID als Exemplar
current_span = trace.get_current_span()
trace_id = current_span.get_span_context().trace_id
activity_latency.observe(
latency,
exemplar={'trace_id': format(trace_id, '032x')}
)
In Grafana: Click auf Metric Point → Jump zu Trace!
Structured Logging mit Trace Context:
import logging
from opentelemetry import trace
from temporalio import activity
logger = logging.getLogger(__name__)
@activity.defn
async def my_activity(order_id: str):
# Trace Context extrahieren
span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
span_id = format(span.get_span_context().span_id, '016x')
# Structured Log mit Trace IDs
logger.info(
"Processing order",
extra={
"order_id": order_id,
"trace_id": trace_id,
"span_id": span_id,
"workflow_id": activity.info().workflow_id,
"activity_type": activity.info().activity_type
}
)
Log Output (JSON):
{
"timestamp": "2025-01-19T10:30:45Z",
"level": "INFO",
"message": "Processing order",
"order_id": "order-12345",
"trace_id": "a1b2c3d4e5f6...",
"span_id": "789abc...",
"workflow_id": "order-workflow-12345",
"activity_type": "process_payment"
}
In Grafana Loki: Search for trace_id="a1b2c3d4e5f6..." → Alle Logs für diesen Trace!
11.4 Logging Best Practices
11.4.1 Structured Logging Setup
Warum Structured Logging?
Unstructured (schlecht):
logger.info(f"Order {order_id} completed in {duration}s")
Structured (gut):
logger.info("Order completed", extra={
"order_id": order_id,
"duration_seconds": duration,
"status": "success"
})
Vorteile:
- Suchbar nach Feldern
- Aggregierbar
- Maschinenlesbar
- Integriert mit Observability Tools
Python Setup mit structlog:
import structlog
from temporalio import activity, workflow
# Structlog konfigurieren
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer()
],
logger_factory=structlog.PrintLoggerFactory(),
)
logger = structlog.get_logger()
@activity.defn
async def process_order(order_id: str):
"""Activity mit strukturiertem Logging"""
# Workflow Context hinzufügen
log = logger.bind(
workflow_id=activity.info().workflow_id,
activity_id=activity.info().activity_id,
activity_type="process_order",
order_id=order_id
)
log.info("activity_started")
try:
# Business Logic
result = await do_something(order_id)
log.info(
"activity_completed",
result=result,
duration_ms=123
)
return result
except Exception as e:
log.error(
"activity_failed",
error=str(e),
error_type=type(e).__name__
)
raise
Log Output:
{
"timestamp": "2025-01-19T10:30:45.123456Z",
"level": "info",
"event": "activity_started",
"workflow_id": "order-workflow-abc",
"activity_id": "activity-xyz",
"activity_type": "process_order",
"order_id": "order-12345"
}
{
"timestamp": "2025-01-19T10:30:45.345678Z",
"level": "info",
"event": "activity_completed",
"workflow_id": "order-workflow-abc",
"activity_id": "activity-xyz",
"result": "success",
"duration_ms": 123,
"order_id": "order-12345"
}
11.4.2 Temporal Logger Integration
Temporal SDK Logger nutzen:
from temporalio import workflow, activity
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self):
# Temporal Workflow Logger (automatisch mit Context)
workflow.logger.info(
"Workflow started",
extra={"custom_field": "value"}
)
# Logging ist replay-safe!
# Logs werden nur bei echter Execution ausgegeben
@activity.defn
async def my_activity():
# Temporal Activity Logger (automatisch mit Context)
activity.logger.info(
"Activity started",
extra={"custom_field": "value"}
)
Automatischer Context:
Temporal Logger fügen automatisch hinzu:
- workflow_id
- workflow_type
- run_id
- activity_id
- activity_type
- namespace
- task_queue
11.4.3 Log Aggregation mit Loki
Loki Setup:
# docker-compose.yml
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
- loki-data:/loki
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
- ./promtail-config.yaml:/etc/promtail/config.yml
command: -config.file=/etc/promtail/config.yml
loki-config.yaml:
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/cache
filesystem:
directory: /loki/chunks
promtail-config.yaml:
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: temporal-workers
static_configs:
- targets:
- localhost
labels:
job: temporal-workers
__path__: /var/log/temporal-worker/*.log
# JSON Log Parsing
pipeline_stages:
- json:
expressions:
timestamp: timestamp
level: level
message: event
workflow_id: workflow_id
activity_type: activity_type
- labels:
level:
workflow_id:
activity_type:
- timestamp:
source: timestamp
format: RFC3339
LogQL Queries in Grafana:
# Alle Logs für einen Workflow
{job="temporal-workers"} | json | workflow_id="order-workflow-abc"
# Fehler-Logs
{job="temporal-workers"} | json | level="error"
# Langsame Activities (> 5s)
{job="temporal-workers"}
| json
| duration_ms > 5000
| line_format "{{.activity_type}}: {{.duration_ms}}ms"
# Rate von Errors
rate({job="temporal-workers"} | json | level="error" [5m])
# Top Activities nach Count
topk(10,
sum by (activity_type) (
count_over_time({job="temporal-workers"} | json [1h])
)
)
11.4.4 Best Practices
DO:
- ✅ Strukturierte Logs (JSON)
- ✅ Correlation IDs (workflow_id, trace_id)
- ✅ Log Level appropriate nutzen (DEBUG, INFO, WARN, ERROR)
- ✅ Performance-relevante Metrics loggen
- ✅ Business Events loggen
- ✅ Fehler mit Context loggen
DON’T:
- ❌ Sensitive Daten loggen (Passwords, PII, Credit Cards)
- ❌ Zu viel loggen (Performance-Impact)
- ❌ Unstrukturierte Logs
- ❌ Logging in Workflows ohne Replay-Safety
Replay-Safe Logging:
@workflow.defn
class MyWorkflow:
@workflow.run
async def run(self):
# FALSCH: Logging ohne Replay-Check
print(f"Workflow started at {datetime.now()}") # ❌ Non-deterministic!
# RICHTIG: Temporal Logger (replay-safe)
workflow.logger.info("Workflow started") # ✅ Nur bei echter Execution
Sensitive Data redaktieren:
import re
def redact_sensitive(data: dict) -> dict:
"""Redact sensitive fields"""
sensitive_fields = ['password', 'credit_card', 'ssn', 'api_key']
redacted = data.copy()
for key in redacted:
if any(field in key.lower() for field in sensitive_fields):
redacted[key] = "***REDACTED***"
return redacted
@activity.defn
async def process_payment(payment_data: dict):
# Log mit redaktierten Daten
activity.logger.info(
"Processing payment",
extra=redact_sensitive(payment_data)
)
11.5 SLO-basiertes Alerting
11.5.1 Was sind SLIs, SLOs, SLAs?
SLI (Service Level Indicator): Messgröße für Service-Qualität
- Beispiel: “99.5% der Workflows werden erfolgreich abgeschlossen”
SLO (Service Level Objective): Ziel für SLI
- Beispiel: “SLO: 99.9% Workflow Success Rate”
SLA (Service Level Agreement): Vertragliche Vereinbarung
- Beispiel: “SLA: 99.5% Uptime mit finanziellen Konsequenzen”
Verhältnis: SLI ≤ SLO ≤ SLA
11.5.2 SLIs für Temporal Workflows
Request Success Rate (wichtigster SLI):
# Workflow Success Rate
sum(rate(temporal_workflow_completed_count{status="completed"}[5m]))
/
sum(rate(temporal_workflow_completed_count[5m]))
Latency (p50, p95, p99):
# Workflow p95 Latency
histogram_quantile(0.95,
rate(temporal_workflow_execution_latency_seconds_bucket[5m])
)
Availability:
# Worker Availability
avg(up{job="temporal-workers"})
Beispiel SLOs:
| SLI | SLO | Messung |
|---|---|---|
| Workflow Success Rate | ≥ 99.9% | Last 30d |
| Order Workflow p95 Latency | ≤ 5s | Last 1h |
| Worker Availability | ≥ 99.5% | Last 30d |
| Task Queue Lag | ≤ 30s | Last 5m |
11.5.3 Error Budget
Konzept: Wie viel “Failure” ist erlaubt?
Berechnung:
Error Budget = 100% - SLO
Beispiel:
SLO: 99.9% Success Rate
Error Budget: 0.1% = 1 von 1000 Requests darf fehlschlagen
Bei 1M Workflows/Monat:
Error Budget = 1M * 0.001 = 1,000 erlaubte Failures
Error Budget Tracking:
# Verbleibender Error Budget (30d window)
(
1 - (
sum(increase(temporal_workflow_completed_count{status="completed"}[30d]))
/
sum(increase(temporal_workflow_completed_count[30d]))
)
) / 0.001 # 0.001 = Error Budget für 99.9% SLO
Interpretation:
Result = 0.5 → 50% Error Budget verbraucht ✅
Result = 0.9 → 90% Error Budget verbraucht ⚠️
Result = 1.2 → 120% Error Budget verbraucht ❌ SLO missed!
11.5.4 Multi-Window Multi-Burn-Rate Alerts
Problem mit einfachen Alerts:
# Zu simpel
- alert: HighErrorRate
expr: error_rate > 0.01
for: 5m
Probleme:
- Flapping bei kurzen Spikes
- Langsame Reaktion bei echten Problemen
- Keine Unterscheidung: Kurzer Spike vs. anhaltender Ausfall
Lösung: Multi-Window Alerts (aus Google SRE Workbook)
Konzept:
| Severity | Burn Rate | Short Window | Long Window | Alert |
|---|---|---|---|---|
| Critical | 14.4x | 1h | 5m | Page immediately |
| High | 6x | 6h | 30m | Page during business hours |
| Medium | 3x | 1d | 2h | Ticket |
| Low | 1x | 3d | 6h | No alert |
Implementation:
groups:
- name: temporal_slo_alerts
rules:
# Critical: 14.4x burn rate (1h budget in 5m)
- alert: WorkflowSLOCritical
expr: |
(
(1 - (
sum(rate(temporal_workflow_completed_count{status="completed"}[1h]))
/
sum(rate(temporal_workflow_completed_count[1h]))
)) > (14.4 * 0.001)
)
and
(
(1 - (
sum(rate(temporal_workflow_completed_count{status="completed"}[5m]))
/
sum(rate(temporal_workflow_completed_count[5m]))
)) > (14.4 * 0.001)
)
labels:
severity: critical
annotations:
summary: "Critical: Workflow SLO burn rate too high"
description: "Error budget will be exhausted in < 2 days at current rate"
# High: 6x burn rate
- alert: WorkflowSLOHigh
expr: |
(
(1 - (
sum(rate(temporal_workflow_completed_count{status="completed"}[6h]))
/
sum(rate(temporal_workflow_completed_count[6h]))
)) > (6 * 0.001)
)
and
(
(1 - (
sum(rate(temporal_workflow_completed_count{status="completed"}[30m]))
/
sum(rate(temporal_workflow_completed_count[30m]))
)) > (6 * 0.001)
)
labels:
severity: warning
annotations:
summary: "High: Workflow SLO burn rate elevated"
description: "Error budget will be exhausted in < 5 days at current rate"
# Medium: 3x burn rate
- alert: WorkflowSLOMedium
expr: |
(
(1 - (
sum(rate(temporal_workflow_completed_count{status="completed"}[1d]))
/
sum(rate(temporal_workflow_completed_count[1d]))
)) > (3 * 0.001)
)
and
(
(1 - (
sum(rate(temporal_workflow_completed_count{status="completed"}[2h]))
/
sum(rate(temporal_workflow_completed_count[2h]))
)) > (3 * 0.001)
)
labels:
severity: info
annotations:
summary: "Medium: Workflow SLO burn rate concerning"
description: "Error budget will be exhausted in < 10 days at current rate"
11.5.5 Activity-Specific SLOs
Nicht alle Activities sind gleich wichtig!
Beispiel:
# Critical Activity: Payment Processing
- alert: PaymentActivitySLOBreach
expr: |
(
sum(rate(temporal_activity_execution_count{
activity_type="process_payment",
status="completed"
}[5m]))
/
sum(rate(temporal_activity_execution_count{
activity_type="process_payment"
}[5m]))
) < 0.999 # 99.9% SLO
for: 5m
labels:
severity: critical
activity: process_payment
annotations:
summary: "Payment activity SLO breach"
description: "Success rate: {{ $value | humanizePercentage }}"
# Low-Priority Activity: Analytics Update
- alert: AnalyticsActivitySLOBreach
expr: |
(
sum(rate(temporal_activity_execution_count{
activity_type="update_analytics",
status="completed"
}[30m]))
/
sum(rate(temporal_activity_execution_count{
activity_type="update_analytics"
}[30m]))
) < 0.95 # 95% SLO (relaxed)
for: 30m
labels:
severity: warning
activity: update_analytics
annotations:
summary: "Analytics activity degraded"
11.5.6 Alertmanager Configuration
alertmanager.yml:
global:
slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
receiver: 'default'
group_by: ['alertname', 'severity']
group_wait: 10s
group_interval: 5m
repeat_interval: 4h
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: pagerduty
continue: true
# Critical alerts → Slack #alerts
- match:
severity: critical
receiver: slack-critical
# Warnings → Slack #monitoring
- match:
severity: warning
receiver: slack-monitoring
# Info → Slack #monitoring (low priority)
- match:
severity: info
receiver: slack-monitoring
group_wait: 5m
repeat_interval: 12h
receivers:
- name: 'default'
slack_configs:
- channel: '#monitoring'
title: 'Temporal Alert'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'slack-critical'
slack_configs:
- channel: '#alerts'
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'danger'
- name: 'slack-monitoring'
slack_configs:
- channel: '#monitoring'
title: '⚠️ {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'warning'
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
11.6 Temporal Cloud Observability
11.6.1 Cloud Metrics Zugriff
Temporal Cloud bietet zwei Metrics Endpoints:
- Prometheus Endpoint (Scraping):
https://cloud-metrics.temporal.io/prometheus/<account-id>/<namespace>
- PromQL Endpoint (Querying):
https://cloud-metrics.temporal.io/api/v1/query
Authentication:
# API Key generieren (Temporal Cloud UI)
# Settings > Integrations > Prometheus
# Metrics abrufen
curl -H "Authorization: Bearer <API_KEY>" \
https://cloud-metrics.temporal.io/prometheus/<account-id>/<namespace>/metrics
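Die Metriken lassen sich auch programmatisch abfragen. Die folgende Skizze nutzt den oben genannten PromQL-Endpoint und nimmt an, dass er der üblichen Prometheus-HTTP-API folgt (Parameter query, Bearer-Auth wie im curl-Beispiel); Platzhalter wie <API_KEY> müssen ersetzt werden:
# Skizze: PromQL-Query gegen den Temporal Cloud Metrics Endpoint
# (Annahme: Standard-Prometheus-HTTP-API; Authentifizierung per Bearer-API-Key)
import requests

def query_cloud_metrics(promql: str, api_key: str) -> dict:
    """Führt eine einzelne PromQL-Query gegen den Cloud-Endpoint aus."""
    response = requests.get(
        "https://cloud-metrics.temporal.io/api/v1/query",
        params={"query": promql},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Beispiel: Workflow-Start-Rate der letzten 5 Minuten
# print(query_cloud_metrics('rate(temporal_cloud_v0_workflow_started[5m])', "<API_KEY>"))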
11.6.2 Prometheus Scrape Config
scrape_configs:
- job_name: 'temporal-cloud'
scheme: https
static_configs:
- targets:
- 'cloud-metrics.temporal.io'
authorization:
credentials: '<YOUR_API_KEY>'
params:
account: ['<account-id>']
namespace: ['<namespace>']
scrape_interval: 60s # Cloud Metrics: Max 1/minute
11.6.3 Verfügbare Cloud Metrics
Namespace Metrics:
# Workflow Execution Rate
temporal_cloud_v0_workflow_started
# Workflow Success/Failure
temporal_cloud_v0_workflow_success
temporal_cloud_v0_workflow_failed
# Active Workflows
temporal_cloud_v0_workflow_running
# Task Queue Depth
temporal_cloud_v0_task_queue_depth{task_queue="order-processing"}
Resource Metrics:
# Actions per Second (Billing)
temporal_cloud_v0_resource_actions_count
# Storage Usage
temporal_cloud_v0_resource_storage_bytes
11.6.4 Grafana Dashboard für Cloud
Cloud-specific Dashboard:
{
"title": "Temporal Cloud Overview",
"panels": [
{
"title": "Workflow Start Rate",
"targets": [{
"expr": "rate(temporal_cloud_v0_workflow_started[5m])",
"legendFormat": "{{namespace}}"
}]
},
{
"title": "Workflow Success Rate",
"targets": [{
"expr": "rate(temporal_cloud_v0_workflow_success[5m]) / rate(temporal_cloud_v0_workflow_started[5m])",
"legendFormat": "Success Rate"
}]
},
{
"title": "Active Workflows",
"targets": [{
"expr": "temporal_cloud_v0_workflow_running",
"legendFormat": "{{workflow_type}}"
}]
},
{
"title": "Actions per Second (Billing)",
"targets": [{
"expr": "rate(temporal_cloud_v0_resource_actions_count[5m])",
"legendFormat": "Actions/s"
}]
}
]
}
11.6.5 SDK Metrics vs. Cloud Metrics
Wichtig: Verwenden Sie die richtige Metrik-Quelle!
| Use Case | Source | Warum |
|---|---|---|
| “Wie lange dauert meine Activity?” | SDK Metrics | Misst aus Worker-Sicht |
| “Wie viele Workflows sind aktiv?” | Cloud Metrics | Server-seitige Sicht |
| “Ist mein Worker überlastet?” | SDK Metrics | Worker-spezifisch |
| “Task Queue Backlog?” | Cloud Metrics | Server-seitiger Zustand |
| “Billing/Cost?” | Cloud Metrics | Nur Cloud kennt Actions |
Best Practice: Beide kombinieren!
# Workflow End-to-End Latency (Cloud)
temporal_cloud_v0_workflow_execution_time
# Activity Latency within Workflow (SDK)
temporal_activity_execution_latency_seconds{activity_type="process_payment"}
11.7 Debugging mit Observability
11.7.1 Problem → Metrics → Traces → Logs
Workflow: Von groß zu klein
1. Metrics: "Payment workflows sind langsam (p95: 30s statt 5s)"
↓
2. Traces: "process_payment Activity dauert 25s"
↓
3. Logs: "Connection timeout zu payment.api.com"
↓
4. Root Cause: Payment API ist down
Grafana Workflow:
1. Öffne Dashboard "Temporal - Orders"
2. Panel "Activity Latency" zeigt Spike
3. Click auf Spike → "View Traces"
4. Trace zeigt: "process_payment span: 25s"
5. Click auf Span → "View Logs"
6. Log: "ERROR: connection timeout after 20s"
11.7.2 Temporal Web UI Integration
Web UI: https://cloud.temporal.io oder http://localhost:8080
Features:
- Workflow Execution History
- Event Timeline
- Pending Activities
- Stack Traces
- Retry History
Von Grafana zu Web UI:
Grafana Alert: "Workflow order-workflow-abc failed"
↓
Annotation Link: https://cloud.temporal.io/namespaces/default/workflows/order-workflow-abc
↓
Web UI: Zeigt komplette Workflow History
Grafana Annotation Setup:
import time

import requests
from temporalio import activity
def send_workflow_annotation(workflow_id: str, message: str):
"""Send Grafana annotation for workflow event"""
requests.post(
'http://grafana:3000/api/annotations',
json={
'text': message,
'tags': ['temporal', 'workflow', workflow_id],
'time': int(time.time() * 1000), # Unix timestamp ms
},
headers={
'Authorization': 'Bearer <GRAFANA_API_KEY>',
'Content-Type': 'application/json'
}
)
@activity.defn
async def critical_activity():
workflow_id = activity.info().workflow_id
try:
result = await do_something()
send_workflow_annotation(
workflow_id,
f"✓ Critical activity completed"
)
return result
except Exception as e:
send_workflow_annotation(
workflow_id,
f"❌ Critical activity failed: {e}"
)
raise
11.7.3 Correlation Queries
Problem: Metrics/Traces/Logs sind isoliert.
Lösung: Queries mit Correlation IDs.
Find all data for a workflow:
# 1. Prometheus: Get workflow start time
workflow_start_time=$(
promtool query instant \
'temporal_workflow_started_time{workflow_id="order-abc"}'
)
# 2. Tempo: Find traces for workflow
curl -G http://tempo:3200/api/search \
--data-urlencode 'q={workflow_id="order-abc"}'
# 3. Loki: Find logs for workflow
curl -G http://loki:3100/loki/api/v1/query_range \
--data-urlencode 'query={job="workers"} | json | workflow_id="order-abc"' \
--data-urlencode "start=$workflow_start_time"
In Grafana Explore (einfacher):
1. Data Source: Prometheus
2. Query: temporal_workflow_started{workflow_id="order-abc"}
3. Click auf Datapoint → "View in Tempo"
4. Trace öffnet sich → Click auf Span → "View in Loki"
5. Logs erscheinen für diesen Span
11.7.4 Common Debugging Scenarios
Scenario 1: “Workflows are slow”
1. Check: Workflow p95 latency metric
→ Which workflow type is slow?
2. Check: Activity latency breakdown
→ Which activity is the bottleneck?
3. Check: Traces for slow workflow instances
→ Is it always slow or intermittent?
4. Check: Logs for slow activity executions
→ What error/timeout is occurring?
5. Check: External service metrics
→ Is downstream service degraded?
Scenario 2: “High failure rate”
1. Check: Workflow failure rate by type
→ Which workflow is failing?
2. Check: Activity failure rate
→ Which activity is failing?
3. Check: Error logs
→ What error messages appear?
4. Check: Temporal Web UI
→ Look at failed workflow history
5. Check: Deployment timeline
→ Did failure start after deployment?
Scenario 3: “Task queue is backing up”
1. Check: Task queue lag metric
→ How large is the backlog?
2. Check: Worker availability
→ Are workers up?
3. Check: Worker task slots
→ Are workers saturated?
4. Check: Activity execution rate
→ Is processing rate dropping?
5. Check: Worker logs
→ Are workers crashing/restarting?
11.8 Zusammenfassung
Was Sie gelernt haben
SDK Metrics:
- ✅ Prometheus Export aus Python Workers konfigurieren
- ✅ Wichtige Metrics: Workflow/Activity Rate, Latency, Success Rate
- ✅ Custom Business Metrics in Activities
- ✅ Prometheus Scraping für Kubernetes
Grafana:
- ✅ Community Dashboards installieren
- ✅ Custom Dashboards erstellen
- ✅ PromQL Queries für Temporal Metrics
- ✅ Alerting Rules definieren
OpenTelemetry:
- ✅ Distributed Tracing Setup
- ✅ Automatic Instrumentation für Workflows
- ✅ Manual Spans in Activities
- ✅ Tempo Integration
- ✅ Correlation: Metrics + Traces + Logs
Logging:
- ✅ Structured Logging mit structlog
- ✅ Temporal Logger mit Auto-Context
- ✅ Loki für Log Aggregation
- ✅ LogQL Queries
- ✅ Replay-Safe Logging
SLO-basiertes Alerting:
- ✅ SLI/SLO/SLA Konzepte
- ✅ Error Budget Tracking
- ✅ Multi-Window Multi-Burn-Rate Alerts
- ✅ Activity-specific SLOs
- ✅ Alertmanager Configuration
Temporal Cloud:
- ✅ Cloud Metrics API
- ✅ Prometheus Scraping
- ✅ SDK vs. Cloud Metrics
- ✅ Billing Metrics
Debugging:
- ✅ Von Metrics zu Traces zu Logs
- ✅ Temporal Web UI Integration
- ✅ Correlation Queries
- ✅ Common Debugging Scenarios
Production Checklist
Monitoring Setup:
- SDK Metrics Export konfiguriert
- Prometheus scraping Workers
- Grafana Dashboards deployed
- Alerting Rules definiert
- Alertmanager konfiguriert (Slack/PagerDuty)
- On-Call Rotation definiert
Observability:
- Structured Logging implementiert
- Log Aggregation (Loki/ELK) läuft
- OpenTelemetry Tracing aktiviert
- Trace Backend (Tempo/Jaeger) deployed
- Correlation IDs in allen Logs
SLOs:
- SLIs für kritische Workflows definiert
- SLOs festgelegt (99.9%? 99.5%?)
- Error Budget Dashboard erstellt
- Multi-Burn-Rate Alerts konfiguriert
- Activity-specific SLOs dokumentiert
Dashboards:
- Workflow Overview Dashboard
- Worker Health Dashboard
- Activity Performance Dashboard
- Business Metrics Dashboard
- SLO Tracking Dashboard
Alerts:
- High Workflow Failure Rate
- Task Queue Backlog
- Worker Unavailable
- Activity Latency Spike
- SLO Burn Rate Critical
- Error Budget Exhausted
Häufige Fehler
❌ Zu wenig monitoren
Problem: Nur Server-Metrics, keine SDK Metrics
Folge: Keine Sicht auf Ihre Application-Performance
✅ Richtig:
Beide monitoren: Server + SDK Metrics
SDK Metrics = Source of Truth für Application Performance
❌ Nur Metrics, keine Traces
Problem: Wissen, dass es langsam ist, aber nicht wo
Folge: Debugging dauert Stunden
✅ Richtig:
Metrics → Traces → Logs Pipeline
Correlation IDs überall
❌ Alert Fatigue
Problem: 100 Alerts pro Tag
Folge: Wichtige Alerts werden ignoriert
✅ Richtig:
SLO-basiertes Alerting
Multi-Burn-Rate Alerts (weniger False Positives)
Alert nur auf SLO-Verletzungen
❌ Keine Correlation
Problem: Metrics, Logs, Traces sind isoliert
Folge: Müssen manuell korrelieren
✅ Richtig:
Exemplars in Metrics
Trace IDs in Logs
Grafana-Integration
Best Practices
1. Metriken hierarchisch organisieren
   System Metrics (Server CPU, Memory) → Temporal Metrics (Workflows, Activities) → Business Metrics (Orders, Revenue)
2. Alerts nach Severity gruppieren
   Critical → Page immediately (SLO breach); Warning → Page during business hours; Info → Ticket for next sprint
3. Dashboards für Rollen
   Executive: Business KPIs (Orders/hour, Revenue); Engineering: Technical Metrics (Latency, Error Rate); SRE: Operational (Worker Health, Queue Depth); On-Call: Incident Response (Recent Alerts, Anomalies)
4. Retention Policies
   Metrics: 30 days high-res, 1 year downsampled; Logs: 7 days full, 30 days search indices; Traces: 7 days (Sampling: 10% Background, 100% Errors)
5. Cost Optimization
   - Sampling für Traces verwenden (nicht jeder Request)
   - Alte Metrics downsamplen
   - Logs komprimieren
   - Cloud Metrics API effizient nutzen (max. 1 Request/Minute)
Weiterführende Ressourcen
Temporal Docs:
Grafana:
OpenTelemetry:
SRE:
Nächste Schritte
Sie können jetzt Production-ready Monitoring aufsetzen! Aber Observability ist nur ein Teil des Betriebsalltags.
Weiter geht’s mit:
- Kapitel 12: Testing Strategies – Wie Sie Workflows umfassend testen
- Kapitel 13: Best Practices und Anti-Muster – Production-ready Temporal-Anwendungen
- Kapitel 14-15: Kochbuch – Konkrete Patterns und Rezepte für häufige Use Cases
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 12: Testing Strategies
Code-Beispiele für dieses Kapitel: examples/part-04/chapter-11/
💡 Tipp: Monitoring ist nicht “set and forget”. Überprüfen Sie Ihre Dashboards und Alerts regelmäßig:
- Monatlich: SLO Review (wurden die SLOs eingehalten?)
- Quartalsweise: Alert Review (zu viele False Positives?)
- Nach Incidents: Post-Mortem → Alerts und Dashboards aktualisieren
Kapitel 12: Testing Strategies
Einleitung
Sie haben einen komplexen Workflow implementiert, der mehrere External Services orchestriert, komplizierte Retry-Logik hat und über Tage hinweg laufen kann. Alles funktioniert lokal. Sie deployen in Production – und plötzlich:
- Ein Edge Case bricht den Workflow
- Eine kürzlich geänderte Activity verhält sich anders als erwartet
- Ein Refactoring führt zu Non-Determinismus-Fehlern
- Ein Workflow, der Tage dauert, kann nicht schnell getestet werden
Ohne Testing-Strategie sind Sie:
- Unsicher bei jedem Deployment
- Abhängig von manuellen Tests in Production
- Blind gegenüber Breaking Changes
- Langsam beim Debugging
Mit einer robusten Testing-Strategie haben Sie:
- Vertrauen in Ihre Changes
- Schnelles Feedback (Sekunden statt Tage)
- Automatische Regression-Detection
- Sichere Workflow-Evolution
Temporal bietet leistungsstarke Testing-Tools, die speziell für durable, long-running Workflows entwickelt wurden. Dieses Kapitel zeigt Ihnen, wie Sie sie effektiv nutzen.
Das Grundproblem
Scenario: Sie entwickeln einen Order Processing Workflow:
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class OrderWorkflow:
    def __init__(self) -> None:
        self.approved = False

    @workflow.signal
    def approve(self) -> None:
        """Manuelle Freigabe (siehe Integration-Test in 12.2.2)"""
        self.approved = True

    @workflow.run
    async def run(self, order_id: str) -> str:
# Payment (mit Retry-Logik)
payment = await workflow.execute_activity(
process_payment,
order_id,
start_to_close_timeout=timedelta(minutes=5),
retry_policy=RetryPolicy(maximum_attempts=3)
)
# Inventory (kann lange dauern)
await workflow.execute_activity(
reserve_inventory,
order_id,
start_to_close_timeout=timedelta(hours=24)
)
# Warte auf manuelle Approval (via Signal)
await workflow.wait_condition(lambda: self.approved)
# Shipping
tracking = await workflow.execute_activity(
create_shipment,
order_id,
start_to_close_timeout=timedelta(hours=1)
)
return tracking
Ohne Testing-Framework:
❌ Test dauert 24+ Stunden (wegen inventory timeout)
❌ Manuelle Approval muss simuliert werden
❌ External Services müssen verfügbar sein
❌ Retry-Logik schwer zu testen
❌ Workflow-Evolution kann nicht validiert werden
→ Tests werden nicht geschrieben
→ Bugs landen in Production
→ Debugging dauert Stunden
Mit Temporal Testing:
✓ Test läuft in Sekunden (time-skipping)
✓ Activities werden gemockt
✓ Signals können simuliert werden
✓ Retry-Verhalten ist testbar
✓ Workflow History kann replayed werden
→ Comprehensive Test Suite
→ Bugs werden vor Deployment gefunden
→ Sichere Refactorings
Lernziele
Nach diesem Kapitel können Sie:
- Unit Tests für Activities und Workflows schreiben
- Integration Tests mit WorkflowEnvironment implementieren
- Time-Skipping für Tests mit langen Timeouts nutzen
- Activities mocken für isolierte Workflow-Tests
- Replay Tests für Workflow-Evolution durchführen
- pytest Fixtures für Test-Isolation aufsetzen
- CI/CD Integration mit automatisierten Tests
- Production Histories in Tests verwenden
12.1 Unit Testing: Activities in Isolation
Der einfachste Test-Ansatz: Activities direkt aufrufen, ohne Worker oder Workflow.
12.1.1 Warum Activity Unit Tests?
Vorteile:
- ⚡ Schnell (keine Temporal-Infrastruktur nötig)
- 🎯 Fokussiert (nur Business-Logik)
- 🔄 Einfach zu debuggen
- 📊 Hohe Code Coverage
Best Practice: 80% Unit Tests, 15% Integration Tests, 5% E2E Tests
12.1.2 Activity Unit Test Beispiel
# activities.py
from temporalio import activity
from dataclasses import dataclass
import httpx
@dataclass
class PaymentRequest:
order_id: str
amount: float
@dataclass
class PaymentResult:
success: bool
transaction_id: str
@activity.defn
async def process_payment(request: PaymentRequest) -> PaymentResult:
"""Process payment via external API"""
activity.logger.info(f"Processing payment for {request.order_id}")
# Call external payment API
async with httpx.AsyncClient() as client:
response = await client.post(
"https://payment.api.com/charge",
json={
"order_id": request.order_id,
"amount": request.amount
},
timeout=30.0
)
response.raise_for_status()
data = response.json()
return PaymentResult(
success=data["status"] == "success",
transaction_id=data["transaction_id"]
)
Test (ohne Temporal):
# tests/test_activities.py
import httpx
import pytest
from unittest.mock import AsyncMock, patch
from activities import process_payment, PaymentRequest, PaymentResult
@pytest.mark.asyncio
async def test_process_payment_success():
"""Test successful payment processing"""
# Mock httpx client
mock_response = AsyncMock()
mock_response.json.return_value = {
"status": "success",
"transaction_id": "txn_12345"
}
with patch("httpx.AsyncClient") as mock_client:
mock_client.return_value.__aenter__.return_value.post = AsyncMock(
return_value=mock_response
)
# Call activity directly (no Temporal needed!)
result = await process_payment(
PaymentRequest(order_id="order-001", amount=99.99)
)
# Assert
assert result.success is True
assert result.transaction_id == "txn_12345"
@pytest.mark.asyncio
async def test_process_payment_failure():
"""Test payment processing failure"""
with patch("httpx.AsyncClient") as mock_client:
# Simulate API error
mock_client.return_value.__aenter__.return_value.post = AsyncMock(
side_effect=httpx.HTTPStatusError(
"Payment failed",
request=AsyncMock(),
response=AsyncMock(status_code=400)
)
)
# Expect activity to raise
with pytest.raises(httpx.HTTPStatusError):
await process_payment(
PaymentRequest(order_id="order-002", amount=199.99)
)
Vorteile:
- ✅ Keine Temporal Server nötig
- ✅ Tests laufen in Millisekunden
- ✅ External API wird gemockt
- ✅ Error Cases sind testbar
12.2 Integration Testing mit WorkflowEnvironment
Integration Tests testen Workflows UND Activities zusammen, mit einem in-memory Temporal Server.
12.2.1 WorkflowEnvironment Setup
# tests/test_workflows.py
import pytest
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker
from workflows import OrderWorkflow
from activities import process_payment, reserve_inventory, create_shipment
@pytest.fixture
async def workflow_env():
"""Fixture: Temporal test environment"""
async with await WorkflowEnvironment.start_time_skipping() as env:
yield env
@pytest.fixture
async def worker(workflow_env):
"""Fixture: Worker mit Workflows und Activities"""
async with Worker(
workflow_env.client,
task_queue="test-queue",
workflows=[OrderWorkflow],
activities=[process_payment, reserve_inventory, create_shipment]
):
yield
Wichtig: start_time_skipping() aktiviert automatisches Time-Skipping!
12.2.2 Workflow Integration Test
@pytest.mark.asyncio
async def test_order_workflow_success(workflow_env, worker):
"""Test successful order workflow execution"""
# Start workflow
handle = await workflow_env.client.start_workflow(
OrderWorkflow.run,
"order-test-001",
id="test-order-001",
task_queue="test-queue"
)
# Send approval signal (simulating manual step)
await handle.signal(OrderWorkflow.approve)
# Wait for result
result = await handle.result()
# Assert
assert result.startswith("TRACKING-")
Was passiert hier?
- workflow_env startet einen in-memory Temporal Server
- worker registriert Workflows und Activities
- Workflow wird gestartet
- Signal wird gesendet (simuliert manuellen Schritt)
- Ergebnis wird validiert
Time-Skipping: 24-Stunden Timeout dauert nur Sekunden!
12.3 Time-Skipping: Tage in Sekunden testen
12.3.1 Das Problem: Lange Timeouts
import asyncio
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class NotificationWorkflow:
    @workflow.run
    async def run(self, user_id: str):
# Send initial notification
await workflow.execute_activity(
send_email,
user_id,
start_to_close_timeout=timedelta(minutes=5)
)
# Wait 3 days
await asyncio.sleep(timedelta(days=3).total_seconds())
# Send reminder
await workflow.execute_activity(
send_reminder,
user_id,
start_to_close_timeout=timedelta(minutes=5)
)
Ohne Time-Skipping: Test dauert 3 Tage 😱
Mit Time-Skipping: Test dauert Sekunden ⚡
12.3.2 Time-Skipping in Action
@pytest.mark.asyncio
async def test_notification_workflow_with_delay(workflow_env, worker):
"""Test workflow with 3-day sleep (executes in seconds!)"""
# Start workflow
handle = await workflow_env.client.start_workflow(
NotificationWorkflow.run,
"user-123",
id="test-notification",
task_queue="test-queue"
)
# Wait for completion (time is automatically skipped!)
await handle.result()
# Verify both activities were called
history = await handle.fetch_history()
activity_events = [
e for e in history.events
if e.event_type == "ACTIVITY_TASK_SCHEDULED"
]
assert len(activity_events) == 2 # send_email + send_reminder
Wie funktioniert Time-Skipping?
- WorkflowEnvironment erkennt, dass keine Activities laufen
- Zeit wird automatisch vorwärts gespult bis zum nächsten Event
- asyncio.sleep(timedelta(days=3)) wird instant übersprungen
- Test läuft in < 1 Sekunde
12.3.3 Manuelles Time-Skipping
@pytest.mark.asyncio
async def test_manual_time_skip(workflow_env):
"""Manually control time skipping"""
# Start workflow
handle = await workflow_env.client.start_workflow(
NotificationWorkflow.run,
"user-456",
id="test-manual-skip",
task_queue="test-queue"
)
# Manually skip time
await workflow_env.sleep(timedelta(days=3))
# Check workflow state via query
state = await handle.query("get_state")
assert state == "reminder_sent"
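Der Test oben setzt zweierlei voraus: einen Worker, der NotificationWorkflow auf der Task Queue registriert, und eine get_state-Query im Workflow. Eine minimale Skizze, wie sich der Workflow aus 12.3.1 dafür erweitern ließe (Zustandsnamen wie reminder_sent sind hypothetisch an den Test angelehnt):
# Skizze: NotificationWorkflow mit get_state-Query
# (Annahme: Activities "send_email"/"send_reminder" aus 12.3.1 sind beim Worker registriert)
import asyncio
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class NotificationWorkflow:
    def __init__(self) -> None:
        self._state = "started"

    @workflow.run
    async def run(self, user_id: str) -> None:
        await workflow.execute_activity(
            "send_email",  # Referenz per Activity-Name
            user_id,
            start_to_close_timeout=timedelta(minutes=5),
        )
        self._state = "email_sent"
        await asyncio.sleep(timedelta(days=3).total_seconds())  # im Test übersprungen
        await workflow.execute_activity(
            "send_reminder",
            user_id,
            start_to_close_timeout=timedelta(minutes=5),
        )
        self._state = "reminder_sent"

    @workflow.query
    def get_state(self) -> str:
        """Read-only Query für den Test in 12.3.3"""
        return self._state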
12.4 Mocking Activities
Problem: Activities rufen externe Services auf (Datenbanken, APIs, etc.). Im Test wollen wir diese nicht aufrufen.
12.4.1 Activity Mocking mit Mock-Implementierung
# activities.py (production code)
@activity.defn
async def send_email(user_id: str, subject: str, body: str):
"""Send email via SendGrid"""
async with httpx.AsyncClient() as client:
response = await client.post(
"https://api.sendgrid.com/v3/mail/send",
json={
"to": f"user-{user_id}@example.com",
"subject": subject,
"body": body
},
headers={"Authorization": f"Bearer {SENDGRID_API_KEY}"}
)
response.raise_for_status()
# tests/mocks.py (test code)
@activity.defn(name="send_email") # Same name as production activity!
async def mock_send_email(user_id: str, subject: str, body: str):
"""Mock email sending (no external call)"""
activity.logger.info(f"MOCK: Sending email to user {user_id}")
# No actual API call - just return success
return None
Test mit Mock:
from tests.mocks import mock_send_email
@pytest.mark.asyncio
async def test_with_mock_activity(workflow_env):
"""Test workflow with mocked activity"""
# Worker uses MOCK activity instead of production one
async with Worker(
workflow_env.client,
task_queue="test-queue",
workflows=[NotificationWorkflow],
activities=[mock_send_email] # Mock statt Production!
):
handle = await workflow_env.client.start_workflow(
NotificationWorkflow.run,
"user-789",
id="test-with-mock",
task_queue="test-queue"
)
await handle.result()
# Verify workflow completed without calling SendGrid
Vorteile:
- ✅ Keine external dependencies
- ✅ Tests laufen offline
- ✅ Schneller (keine Network Latency)
- ✅ Deterministisch (keine Flakiness)
12.4.2 Conditional Mocking (Production vs Test)
# config.py
import os
IS_TEST = os.getenv("TESTING", "false") == "true"
# activities.py
@activity.defn
async def send_email(user_id: str, subject: str, body: str):
if IS_TEST:
activity.logger.info(f"TEST MODE: Would send email to {user_id}")
return
# Production code
async with httpx.AsyncClient() as client:
# ... real API call
pass
Nachteile dieses Ansatzes:
- ⚠️ Vermischt Production und Test-Code
- ⚠️ Schwieriger zu maintainen
- ✅ Besser: Separate Mock-Implementierungen (siehe oben)
12.5 Replay Testing: Workflow-Evolution validieren
Replay Testing ist Temporals Killer-Feature für sichere Workflow-Evolution.
12.5.1 Was ist Replay Testing?
Konzept:
- Workflow wird ausgeführt → History wird aufgezeichnet
- Workflow-Code wird geändert
- Replay: Alte History wird mit neuem Code replayed
- Validierung: Prüfen, ob neuer Code deterministisch ist
Use Case: Sie deployen eine neue Workflow-Version. Replay Testing stellt sicher, dass alte, noch laufende Workflows nicht brechen.
12.5.2 Replay Test Setup
# tests/test_replay.py
from temporalio.worker import Replayer
from temporalio.client import WorkflowHistory
from workflows import OrderWorkflowV1, OrderWorkflowV2
@pytest.mark.asyncio
async def test_workflow_v2_replays_v1_history():
"""Test that v2 workflow can replay v1 history"""
# 1. Execute v1 workflow and capture history
async with await WorkflowEnvironment.start_time_skipping() as env:
async with Worker(
env.client,
task_queue="test-queue",
workflows=[OrderWorkflowV1],
activities=[process_payment]
):
handle = await env.client.start_workflow(
OrderWorkflowV1.run,
"order-replay-test",
id="replay-test",
task_queue="test-queue"
)
await handle.result()
# Capture workflow history
history = await handle.fetch_history()
# 2. Create Replayer with v2 workflow
    replayer = Replayer(
        workflows=[OrderWorkflowV2]  # Replay führt nur Workflow-Code aus; Activities werden nicht benötigt
    )
# 3. Replay v1 history with v2 code
try:
await replayer.replay_workflow(history)
print("✅ Replay successful - v2 is compatible!")
except Exception as e:
pytest.fail(f"❌ Replay failed - non-determinism detected: {e}")
12.5.3 Breaking Change Detection
Scenario: Sie ändern Activity-Reihenfolge (Breaking Change!)
# workflows.py (v1)
@workflow.defn
class OrderWorkflowV1:
async def run(self, order_id: str):
payment = await workflow.execute_activity(process_payment, ...)
inventory = await workflow.execute_activity(reserve_inventory, ...)
return "done"
# workflows.py (v2 - BREAKING!)
@workflow.defn
class OrderWorkflowV2:
async def run(self, order_id: str):
# WRONG: Changed order!
inventory = await workflow.execute_activity(reserve_inventory, ...)
payment = await workflow.execute_activity(process_payment, ...)
return "done"
Replay Test fängt das ab:
❌ Replay failed - non-determinism detected:
Expected ActivityScheduled(process_payment)
Got ActivityScheduled(reserve_inventory)
Lösung: Verwende workflow.patched() (siehe Kapitel 9)
@workflow.defn
class OrderWorkflowV2Fixed:
async def run(self, order_id: str):
if workflow.patched("swap-order-v2"):
# New order
inventory = await workflow.execute_activity(reserve_inventory, ...)
payment = await workflow.execute_activity(process_payment, ...)
else:
# Old order (for replay)
payment = await workflow.execute_activity(process_payment, ...)
inventory = await workflow.execute_activity(reserve_inventory, ...)
return "done"
12.5.4 Production History Replay
Best Practice: Replay echte Production Histories in CI/CD!
# tests/test_production_replay.py
import json
from pathlib import Path
@pytest.mark.asyncio
async def test_replay_production_histories():
"""Replay 100 most recent production histories"""
# Load histories from exported JSON files
history_dir = Path("tests/fixtures/production_histories")
    replayer = Replayer(
        workflows=[OrderWorkflowV2]  # Activities werden beim Replay nicht ausgeführt
    )
for history_file in history_dir.glob("*.json"):
with open(history_file) as f:
history_data = json.load(f)
workflow_id = history_file.stem
history = WorkflowHistory.from_json(workflow_id, history_data)
try:
await replayer.replay_workflow(history)
print(f"✅ {workflow_id} replayed successfully")
except Exception as e:
pytest.fail(f"❌ {workflow_id} failed: {e}")
Workflow Histories exportieren:
# Export history for a workflow
temporal workflow show \
--workflow-id order-12345 \
--output json > tests/fixtures/production_histories/order-12345.json
# Batch export (last 100 workflows)
temporal workflow list \
--query 'WorkflowType="OrderWorkflow"' \
--limit 100 \
--fields WorkflowId \
  | xargs -I {} sh -c 'temporal workflow show --workflow-id "$1" --output json > "$1.json"' _ {}
12.6 pytest Fixtures für Test-Isolation
Problem: Tests beeinflussen sich gegenseitig, wenn sie Workflows mit denselben IDs starten.
Lösung: pytest Fixtures + eindeutige Workflow IDs
12.6.1 Wiederverwendbare Fixtures
# tests/conftest.py (shared fixtures)
import pytest
import uuid
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker
from workflows import OrderWorkflow
from activities import process_payment, reserve_inventory
@pytest.fixture
async def temporal_env():
"""Fixture: Temporal test environment (time-skipping)"""
async with await WorkflowEnvironment.start_time_skipping() as env:
yield env
@pytest.fixture
async def worker(temporal_env):
"""Fixture: Worker with all workflows/activities"""
async with Worker(
temporal_env.client,
task_queue="test-queue",
workflows=[OrderWorkflow],
activities=[process_payment, reserve_inventory]
):
yield
@pytest.fixture
def unique_workflow_id():
"""Fixture: Generate unique workflow ID for each test"""
return f"test-{uuid.uuid4()}"
12.6.2 Test Isolation
# tests/test_order_workflow.py
import pytest
@pytest.mark.asyncio
async def test_order_success(temporal_env, worker, unique_workflow_id):
"""Test successful order (isolated via unique ID)"""
handle = await temporal_env.client.start_workflow(
OrderWorkflow.run,
"order-001",
id=unique_workflow_id, # Unique ID!
task_queue="test-queue"
)
result = await handle.result()
assert result == "ORDER_COMPLETED"
@pytest.mark.asyncio
async def test_order_payment_failure(temporal_env, worker, unique_workflow_id):
"""Test order with payment failure (isolated)"""
handle = await temporal_env.client.start_workflow(
OrderWorkflow.run,
"order-002",
id=unique_workflow_id, # Different unique ID!
task_queue="test-queue"
)
# Expect workflow to fail
with pytest.raises(Exception, match="Payment failed"):
await handle.result()
Vorteile:
- ✅ Keine Test-Interferenz
- ✅ Tests können parallel laufen
- ✅ Deterministisch (kein Flakiness)
12.6.3 Parametrisierte Tests
@pytest.mark.parametrize("order_id,expected_status", [
("order-001", "COMPLETED"),
("order-002", "PAYMENT_FAILED"),
("order-003", "INVENTORY_UNAVAILABLE"),
])
@pytest.mark.asyncio
async def test_order_scenarios(
temporal_env,
worker,
unique_workflow_id,
order_id,
expected_status
):
"""Test multiple order scenarios"""
handle = await temporal_env.client.start_workflow(
OrderWorkflow.run,
order_id,
id=unique_workflow_id,
task_queue="test-queue"
)
result = await handle.result()
assert result["status"] == expected_status
12.7 CI/CD Integration
12.7.1 pytest in CI/CD Pipeline
GitHub Actions Beispiel:
# .github/workflows/test.yml
name: Temporal Tests
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install pytest pytest-asyncio
- name: Run unit tests
run: pytest tests/test_activities.py -v
- name: Run integration tests
run: pytest tests/test_workflows.py -v
- name: Run replay tests
run: pytest tests/test_replay.py -v
- name: Generate coverage report
run: |
pip install pytest-cov
pytest --cov=workflows --cov=activities --cov-report=xml
- name: Upload coverage
uses: codecov/codecov-action@v3
12.7.2 Test-Organisation
tests/
├── conftest.py # Shared fixtures
├── test_activities.py # Unit tests (fast)
├── test_workflows.py # Integration tests (slower)
├── test_replay.py # Replay tests (critical)
├── fixtures/
│ └── production_histories/ # Exported workflow histories
│ ├── order-12345.json
│ └── order-67890.json
└── mocks/
└── mock_activities.py # Mock implementations
pytest Marker für Test-Kategorien:
# tests/test_workflows.py
import pytest
@pytest.mark.unit
@pytest.mark.asyncio
async def test_activity_directly():
"""Fast unit test"""
result = await process_payment(...)
assert result.success
@pytest.mark.integration
@pytest.mark.asyncio
async def test_workflow_with_worker(temporal_env, worker):
"""Slower integration test"""
handle = await temporal_env.client.start_workflow(...)
await handle.result()
@pytest.mark.replay
@pytest.mark.asyncio
async def test_replay_production_history():
"""Critical replay test"""
replayer = Replayer(...)
await replayer.replay_workflow(history)
Selektives Ausführen:
# Nur Unit Tests (schnell)
pytest -m unit
# Nur Integration Tests
pytest -m integration
# Nur Replay Tests (vor Deployment!)
pytest -m replay
# Alle Tests
pytest
12.7.3 Pre-Commit Hook für Replay Tests
# .git/hooks/pre-commit
#!/bin/bash
echo "Running replay tests before commit..."
pytest tests/test_replay.py -v
if [ $? -ne 0 ]; then
echo "❌ Replay tests failed! Commit blocked."
exit 1
fi
echo "✅ Replay tests passed!"
12.8 Advanced: Testing mit echten Temporal Server
Use Case: End-to-End Tests mit realem Temporal Server (nicht in-memory).
12.8.1 Temporal Dev Server in CI
# .github/workflows/e2e.yml
jobs:
e2e-test:
runs-on: ubuntu-latest
services:
temporal:
image: temporalio/auto-setup:latest
ports:
- 7233:7233
env:
TEMPORAL_ADDRESS: localhost:7233
steps:
- uses: actions/checkout@v3
- name: Wait for Temporal
run: |
timeout 60 bash -c 'until nc -z localhost 7233; do sleep 1; done'
- name: Run E2E tests
run: pytest tests/e2e/ -v
env:
TEMPORAL_ADDRESS: localhost:7233
12.8.2 E2E Test mit realem Server
# tests/e2e/test_order_e2e.py
import pytest
from temporalio.client import Client
from temporalio.worker import Worker
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_order_workflow_e2e():
"""E2E test with real Temporal server"""
# Connect to real Temporal server
client = await Client.connect("localhost:7233")
# Start real worker
async with Worker(
client,
task_queue="e2e-queue",
workflows=[OrderWorkflow],
activities=[process_payment, reserve_inventory]
):
# Execute workflow
handle = await client.start_workflow(
OrderWorkflow.run,
"order-e2e-001",
id="e2e-test-001",
task_queue="e2e-queue"
)
result = await handle.result()
assert result == "ORDER_COMPLETED"
# Verify via Temporal UI (optional)
history = await handle.fetch_history()
assert len(history.events) > 0
12.9 Testing Best Practices
12.9.1 Test-Pyramide für Temporal
/\
/ \ E2E Tests (5%)
/____\ - Real Temporal Server
/ \ - All services integrated
/________\ Integration Tests (15%)
/ \ - WorkflowEnvironment
/____________\ - Time-skipping
/ \ - Mocked activities
/________________\ Unit Tests (80%)
- Direct activity calls
- Fast, isolated
12.9.2 Checkliste: Was testen?
Workflows:
- ✅ Happy Path (erfolgreiches Durchlaufen)
- ✅ Error Cases (Activity Failures, Timeouts)
- ✅ Signal Handling (korrekte Reaktion auf Signals)
- ✅ Query Responses (richtige State-Rückgabe)
- ✅ Retry Behavior (Retries funktionieren wie erwartet)
- ✅ Long-running Scenarios (mit Time-Skipping)
- ✅ Replay Compatibility (nach Code-Änderungen)
Activities:
- ✅ Business Logic (korrekte Berechnung/Verarbeitung)
- ✅ Error Handling (Exceptions werden richtig geworfen)
- ✅ Edge Cases (null, empty, extreme values)
- ✅ External API Mocking (keine echten Calls im Test)
Workflow Evolution:
- ✅ Replay Tests (alte Histories mit neuem Code)
- ✅ Patching Scenarios (workflow.patched() funktioniert)
- ✅ Breaking Change Detection (Non-Determinismus)
12.9.3 Common Testing Mistakes
| Fehler | Problem | Lösung |
|---|---|---|
| Keine Replay Tests | Breaking Changes in Production | Replay Tests in CI/CD |
| Tests dauern zu lang | Keine Time-Skipping-Nutzung | start_time_skipping() |
| Flaky Tests | Shared Workflow IDs | Unique IDs pro Test |
| Nur Happy Path | Bugs in Error Cases | Edge Cases testen |
| External Calls im Test | Langsam, flaky, Kosten | Activities mocken |
| Keine Production History | Ungetestete Edge Cases | Production Histories exportieren |
12.9.4 Performance-Optimierung
# SLOW: Neues Environment pro Test
@pytest.mark.asyncio
async def test_workflow_1():
async with await WorkflowEnvironment.start_time_skipping() as env:
# Test...
pass
@pytest.mark.asyncio
async def test_workflow_2():
async with await WorkflowEnvironment.start_time_skipping() as env:
# Test...
pass
# FAST: Shared environment via fixture (session scope)
@pytest.fixture(scope="session")
async def shared_env():
async with await WorkflowEnvironment.start_time_skipping() as env:
yield env
@pytest.mark.asyncio
async def test_workflow_1(shared_env):
# Test... (uses same environment)
pass
@pytest.mark.asyncio
async def test_workflow_2(shared_env):
# Test... (uses same environment)
pass
Speedup: 10x schneller bei vielen Tests!
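Hinweis: Session-scoped async Fixtures funktionieren nur, wenn auch der Event Loop über die gesamte Session lebt. Wie das konfiguriert wird, hängt von der pytest-asyncio-Version ab; die folgende Skizze zeigt die Variante für ältere Versionen, bei denen das event_loop-Fixture überschrieben wird (neuere Versionen steuern den Loop-Scope über die Plugin-Konfiguration):
# tests/conftest.py
# Skizze (Annahme: ältere pytest-asyncio-Version mit überschreibbarem event_loop-Fixture)
import asyncio
import pytest

@pytest.fixture(scope="session")
def event_loop():
    """Session-weiter Event Loop für session-scoped async Fixtures."""
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()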
12.10 Zusammenfassung
Testing Strategy Checklist
Development:
- Unit Tests für alle Activities
- Integration Tests für kritische Workflows
- Replay Tests für Workflow-Versionen
- Mocks für externe Services
- Time-Skipping für lange Workflows
CI/CD:
- pytest in GitHub Actions / GitLab CI
- Replay Tests vor jedem Deployment
- Production History Replay (wöchentlich)
- Test Coverage Tracking (>80%)
- Pre-Commit Hooks für Replay Tests
Production:
- Workflow Histories regelmäßig exportieren
- Replay Tests mit Production Histories
- Monitoring für Test-Failures in CI
- Rollback-Plan bei Breaking Changes
Häufige Fehler
❌ FEHLER 1: Keine Replay Tests
# Deployment ohne Replay Testing
# → Breaking Changes landen in Production
✅ RICHTIG:
@pytest.mark.asyncio
async def test_replay_before_deploy():
replayer = Replayer(workflows=[WorkflowV2])
await replayer.replay_workflow(production_history)
❌ FEHLER 2: Tests dauern ewig
# Warten auf echte Timeouts
await asyncio.sleep(3600) # 1 Stunde
✅ RICHTIG:
# Time-Skipping nutzen
async with await WorkflowEnvironment.start_time_skipping() as env:
# 1 Stunde wird instant übersprungen
❌ FEHLER 3: Flaky Tests
# Feste Workflow ID
id="test-workflow" # Mehrere Tests kollidieren!
✅ RICHTIG:
# Unique ID pro Test
id=f"test-{uuid.uuid4()}"
Best Practices
- 80/15/5 Regel: 80% Unit, 15% Integration, 5% E2E
- Time-Skipping immer nutzen für Integration Tests
- Replay Tests in CI/CD vor jedem Deployment
- Production Histories regelmäßig exportieren und testen
- Activities mocken für schnelle, deterministische Tests
- Unique Workflow IDs für Test-Isolation
- pytest Fixtures für Wiederverwendbarkeit
- Test-Marker für selektives Ausführen
Testing Anti-Patterns
| Anti-Pattern | Warum schlecht? | Alternative |
|---|---|---|
| Nur manuelle Tests | Langsam, fehleranfällig | Automatisierte pytest Suite |
| Keine Mocks | Tests brauchen externe Services | Mock Activities |
| Feste Workflow IDs | Tests beeinflussen sich | Unique IDs via uuid |
| Warten auf echte Zeit | Tests dauern Stunden/Tage | Time-Skipping |
| Kein Replay Testing | Breaking Changes unentdeckt | Replay in CI/CD |
| Nur Happy Path | Bugs in Edge Cases | Error Cases testen |
Nächste Schritte
Nach diesem Kapitel sollten Sie:
1. Test Suite aufsetzen:
   mkdir tests
   touch tests/conftest.py tests/test_activities.py tests/test_workflows.py
2. pytest konfigurieren (pytest.ini):
   [pytest]
   asyncio_mode = auto
   markers =
       unit: Unit tests
       integration: Integration tests
       replay: Replay tests
3. CI/CD Pipeline erweitern (.github/workflows/test.yml):
   - name: Run tests
     run: pytest -v --cov
4. Production History Export automatisieren:
   # Wöchentlicher Cron Job
   temporal workflow list --limit 100 | xargs -I {} temporal workflow show ...
Ressourcen
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 13: Best Practices und Anti-Muster
Code-Beispiele für dieses Kapitel: examples/part-04/chapter-12/
Kapitel 13: Best Practices und Anti-Muster
Einleitung
Sie haben die Grundlagen von Temporal gelernt, Workflows geschrieben, Testing implementiert und Monitoring aufgesetzt. Ihr System läuft in Production. Doch dann kommt der Moment:
- Ein Workflow bricht plötzlich mit Non-Determinismus-Fehlern ab
- Die Event History überschreitet 50.000 Events und der Workflow wird terminiert
- Worker können die Last nicht bewältigen, obwohl genug Ressourcen verfügbar sind
- Ein vermeintlich kleines Refactoring führt zu Production-Incidents
- Code-Reviews dauern Stunden, weil niemand die Workflow-Struktur versteht
Diese Probleme sind vermeidbar – wenn Sie von Anfang an bewährten Patterns folgen und häufige Anti-Patterns vermeiden.
Dieses Kapitel destilliert Jahre an Production-Erfahrung aus der Temporal-Community in konkrete, umsetzbare Guidelines. Sie lernen was funktioniert, was nicht funktioniert, und warum.
Das Grundproblem
Scenario: Ein Team entwickelt einen E-Commerce Workflow. Nach einigen Monaten in Production:
# ❌ ANTI-PATTERN: Alles in einem gigantischen Workflow
@workflow.defn
class MonolithWorkflow:
"""Ein 3000-Zeilen Monster-Workflow"""
def __init__(self):
self.orders = [] # ❌ Unbegrenzte Liste
self.user_sessions = {} # ❌ Wächst endlos
self.cache = {} # ❌ Memory Leak
@workflow.run
async def run(self, user_id: str):
# ❌ Non-deterministic!
if random.random() > 0.5:
discount = 0.1
# ❌ Business Logic im Workflow
price = self.calculate_complex_pricing(...)
# ❌ Externe API direkt aufrufen
async with httpx.AsyncClient() as client:
response = await client.post("https://payment.api/charge")
# ❌ Workflow läuft ewig ohne Continue-As-New
while True:
order = await workflow.wait_condition(lambda: len(self.orders) > 0)
# Process order...
# Event History wächst ins Unendliche
# ❌ Map-Iteration (random order!)
for session_id, session in self.user_sessions.items():
await self.process_session(session)
Konsequenzen nach 6 Monaten:
❌ Event History: 75.000 Events → Workflow terminiert
❌ Non-Determinismus bei Replay → 30% der Workflows brechen ab
❌ Worker Overload → Schedule-To-Start > 10 Minuten
❌ Deployment dauert 6 Stunden → Rollback bei jedem Change
❌ Debugging unmöglich → Team ist frustriert
Mit Best Practices:
# ✅ BEST PRACTICE: Clean, maintainable, production-ready
@dataclass
class OrderInput:
"""Single object input pattern"""
user_id: str
cart_items: List[str]
discount_code: Optional[str] = None
@workflow.defn
class OrderWorkflow:
"""Focused workflow: Orchestrate, don't implement"""
@workflow.run
async def run(self, input: OrderInput) -> OrderResult:
# ✅ Deterministic: All randomness in activities
discount = await workflow.execute_activity(
calculate_discount,
input.discount_code,
start_to_close_timeout=timedelta(seconds=30)
)
# ✅ Business logic in activities
payment = await workflow.execute_activity(
process_payment,
PaymentInput(user_id=input.user_id, discount=discount),
start_to_close_timeout=timedelta(minutes=5),
retry_policy=RetryPolicy(maximum_attempts=3)
)
# ✅ External calls in activities
tracking = await workflow.execute_activity(
create_shipment,
payment.order_id,
start_to_close_timeout=timedelta(hours=1)
)
return OrderResult(
order_id=payment.order_id,
tracking_number=tracking
)
Resultat:
✓ Event History: ~20 Events pro Workflow
✓ 100% Replay Success Rate
✓ Schedule-To-Start: <100ms
✓ Zero-Downtime Deployments
✓ Debugging in Minuten statt Stunden
Lernziele
Nach diesem Kapitel können Sie:
- Best Practices für Workflow-Design, Code-Organisation und Worker-Konfiguration anwenden
- Anti-Patterns erkennen und vermeiden, bevor sie Production-Probleme verursachen
- Determinismus garantieren durch korrekte Pattern-Anwendung
- Performance optimieren durch Worker-Tuning und Event History Management
- Code-Organisation strukturieren für Wartbarkeit und Skalierbarkeit
- Production-Ready Workflows schreiben, die jahrelang laufen
- Code Reviews durchführen mit klarer Checkliste
- Refactorings sicher vornehmen ohne Breaking Changes
13.1 Workflow Design Best Practices
Orchestration vs. Implementation
Regel: Workflows orchestrieren, Activities implementieren.
# ❌ ANTI-PATTERN: Business Logic im Workflow
@workflow.defn
class PricingWorkflowBad:
@workflow.run
async def run(self, product_id: str) -> float:
# ❌ Complex logic in workflow (non-testable, non-deterministic risk)
base_price = 100.0
# ❌ Time-based logic (non-deterministic!)
current_hour = datetime.now().hour
if current_hour >= 18:
base_price *= 1.2 # Evening surge pricing
# ❌ Heavy computation
for i in range(1000000):
base_price += math.sin(i) * 0.0001
return base_price
# ✅ BEST PRACTICE: Orchestration only
@workflow.defn
class PricingWorkflowGood:
@workflow.run
async def run(self, product_id: str) -> float:
# ✅ Delegate to activity
price = await workflow.execute_activity(
calculate_price,
product_id,
start_to_close_timeout=timedelta(seconds=30)
)
return price
# ✅ Logic in activity
@activity.defn
async def calculate_price(product_id: str) -> float:
"""Complex pricing logic isolated in activity"""
base_price = await fetch_base_price(product_id)
# Time-based logic OK in activity
current_hour = datetime.now().hour
if current_hour >= 18:
base_price *= 1.2
# Heavy computation OK in activity
for i in range(1000000):
base_price += math.sin(i) * 0.0001
return base_price
Warum?
- ✅ Workflows bleiben deterministisch
- ✅ Activities sind unit-testbar
- ✅ Retry-Logik funktioniert korrekt
- ✅ Workflow History bleibt klein
Single Object Input Pattern
Regel: Ein Input-Objekt statt mehrere Parameter.
# ❌ ANTI-PATTERN: Multiple primitive arguments
@workflow.defn
class OrderWorkflowBad:
@workflow.run
async def run(
self,
user_id: str,
product_id: str,
quantity: int,
discount: float,
shipping_address: str,
billing_address: str,
gift_wrap: bool,
express_shipping: bool
) -> str:
# ❌ Signature-Änderungen brechen alte Workflows
# ❌ Schwer zu lesen
# ❌ Keine Validierung
...
# ✅ BEST PRACTICE: Single dataclass input
from dataclasses import dataclass
from typing import Optional
@dataclass
class OrderInput:
"""Order workflow input (versioned)"""
user_id: str
product_id: str
quantity: int
shipping_address: str
# Optional fields für Evolution
discount: Optional[float] = None
billing_address: Optional[str] = None
gift_wrap: bool = False
express_shipping: bool = False
def __post_init__(self):
# ✅ Validation at input
if self.quantity <= 0:
raise ValueError("Quantity must be positive")
@workflow.defn
class OrderWorkflowGood:
@workflow.run
async def run(self, input: OrderInput) -> OrderResult:
# ✅ Neue Felder hinzufügen ist safe
# ✅ Validierung ist gekapselt
# ✅ Lesbar und wartbar
...
Vorteile:
- ✅ Einfacher zu erweitern (neue optionale Felder)
- ✅ Bessere Validierung
- ✅ Lesbarerer Code
- ✅ Type-Safety
Continue-As-New für Long-Running Workflows
Regel: Verwenden Sie Continue-As-New, wenn Event History groß wird.
# ❌ ANTI-PATTERN: Endlos-Workflow ohne Continue-As-New
@workflow.defn
class UserSessionWorkflowBad:
def __init__(self):
self.events = [] # ❌ Wächst unbegrenzt
@workflow.run
async def run(self, user_id: str):
while True: # ❌ Läuft ewig
event = await workflow.wait_condition(
lambda: len(self.pending_events) > 0
)
self.events.append(event) # ❌ Event History explodiert
# Nach 1 Jahr: 50.000+ Events
# → Workflow wird terminiert!
# ✅ BEST PRACTICE: Continue-As-New mit Limit
@workflow.defn
class UserSessionWorkflowGood:
def __init__(self):
        self.pending_events = []  # wird z.B. über Signale befüllt (siehe wait_condition unten)
self.processed_count = 0
@workflow.run
async def run(self, user_id: str, total_processed: int = 0):
while True:
# ✅ Check history size regularly
info = workflow.info()
if info.get_current_history_length() > 1000:
workflow.logger.info(
f"History size: {info.get_current_history_length()}, "
"continuing as new"
)
# ✅ Continue with fresh history
workflow.continue_as_new(
user_id,
                    total_processed + self.processed_count,  # Workflow-Argumente positional übergeben
)
event = await workflow.wait_condition(
lambda: len(self.pending_events) > 0
)
await workflow.execute_activity(
process_event,
event,
start_to_close_timeout=timedelta(seconds=30)
)
self.processed_count += 1
Wann Continue-As-New verwenden:
- Event History > 1.000 Events
- Workflow läuft > 1 Jahr
- State wächst unbegrenzt
- Workflow ist ein “Entity Workflow” (z.B. User Session, Shopping Cart)
Limits:
- ⚠️ Workflow terminiert automatisch bei 50.000 Events
- ⚠️ Workflow terminiert bei 50 MB History Size
13.2 Determinismus Best Practices
Alles Non-Deterministische in Activities
Regel: Workflows müssen deterministisch sein. Alles andere → Activity.
# ❌ ANTI-PATTERN: Non-deterministic workflow code
@workflow.defn
class FraudCheckWorkflowBad:
@workflow.run
async def run(self, transaction_id: str) -> bool:
# ❌ random() ist non-deterministic!
risk_score = random.random()
# ❌ datetime.now() ist non-deterministic!
if datetime.now().hour > 22:
risk_score += 0.3
# ❌ UUID generation non-deterministic!
audit_id = str(uuid.uuid4())
# ❌ Map iteration order non-deterministic!
checks = {"ip": check_ip, "device": check_device}
for check_name, check_fn in checks.items(): # ❌ Random order!
await check_fn()
return risk_score < 0.5
# ✅ BEST PRACTICE: Deterministic workflow
@workflow.defn
class FraudCheckWorkflowGood:
@workflow.run
async def run(self, transaction_id: str) -> bool:
# ✅ Random logic in activity
risk_score = await workflow.execute_activity(
calculate_risk_score,
transaction_id,
start_to_close_timeout=timedelta(seconds=30)
)
# ✅ Time-based logic in activity
time_modifier = await workflow.execute_activity(
get_time_based_modifier,
start_to_close_timeout=timedelta(seconds=5)
)
# ✅ UUID generation in activity
audit_id = await workflow.execute_activity(
generate_audit_id,
start_to_close_timeout=timedelta(seconds=5)
)
# ✅ Deterministic iteration order
check_names = sorted(["ip", "device", "location"]) # ✅ Sorted!
for check_name in check_names:
result = await workflow.execute_activity(
run_fraud_check,
FraudCheckInput(transaction_id, check_name),
start_to_close_timeout=timedelta(seconds=30)
)
return risk_score + time_modifier < 0.5
# ✅ Non-deterministic logic in activities
@activity.defn
async def calculate_risk_score(transaction_id: str) -> float:
"""Random logic OK in activity"""
return random.random()
@activity.defn
async def get_time_based_modifier() -> float:
"""Time-based logic OK in activity"""
if datetime.now().hour > 22:
return 0.3
return 0.0
@activity.defn
async def generate_audit_id() -> str:
"""UUID generation OK in activity"""
return str(uuid.uuid4())
Non-Deterministische Operationen:
| Operation | Wo? | Warum? |
|---|---|---|
random.random() | ❌ Workflow | Replay generiert anderen Wert |
datetime.now() | ❌ Workflow | Replay hat andere Zeit |
uuid.uuid4() | ❌ Workflow | Replay generiert andere UUID |
time.time() | ❌ Workflow | Replay hat andere Timestamp |
dict.items() iteration | ❌ Workflow | Order ist non-deterministic in Python <3.7 |
set iteration | ❌ Workflow | Order ist non-deterministic |
| External API calls | ❌ Workflow | Response kann sich ändern |
| File I/O | ❌ Workflow | Datei-Inhalt kann sich ändern |
| Database queries | ❌ Workflow | Daten können sich ändern |
✅ Alle diese Operationen sind OK in Activities!
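Ergänzend dazu bietet das Python SDK deterministische Helfer an, falls Zeit, Zufall oder UUIDs doch direkt im Workflow benötigt werden. Eine kurze Skizze (bitte die Verfügbarkeit gegen Ihre SDK-Version prüfen):
# Skizze: deterministische Helfer des Python SDK direkt im Workflow nutzen
from temporalio import workflow

@workflow.defn
class DeterministicHelpersWorkflow:
    @workflow.run
    async def run(self) -> str:
        now = workflow.now()          # deterministische "aktuelle" Zeit aus der History
        rnd = workflow.random()       # deterministisch geseedeter random.Random
        audit_id = workflow.uuid4()   # deterministisch erzeugte UUID
        workflow.logger.info(f"now={now}, sample={rnd.random()}, audit_id={audit_id}")
        return str(audit_id)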
Workflow-Code-Order nie ändern
Regel: Activity-Aufrufe dürfen nicht umgeordnet werden.
# v1: Original Workflow
@workflow.defn
class OnboardingWorkflowV1:
@workflow.run
async def run(self, user_id: str):
# Step 1: Validate
await workflow.execute_activity(
validate_user,
user_id,
start_to_close_timeout=timedelta(seconds=30)
)
# Step 2: Create account
await workflow.execute_activity(
create_account,
user_id,
start_to_close_timeout=timedelta(seconds=30)
)
# ❌ v2-bad: Reihenfolge geändert (NON-DETERMINISTIC!)
@workflow.defn
class OnboardingWorkflowV2Bad:
@workflow.run
async def run(self, user_id: str):
# ❌ FEHLER: Reihenfolge geändert!
# Step 1: Create account (war vorher Step 2)
await workflow.execute_activity(
create_account, # ❌ Replay erwartet validate_user!
user_id,
start_to_close_timeout=timedelta(seconds=30)
)
# Step 2: Validate (war vorher Step 1)
await workflow.execute_activity(
validate_user, # ❌ Replay erwartet create_account!
user_id,
start_to_close_timeout=timedelta(seconds=30)
)
Was passiert bei Replay:
History Event: ActivityTaskScheduled(activity_name="validate_user")
Replayed Code: workflow.execute_activity(create_account, ...)
❌ ERROR: Non-deterministic workflow!
Expected: validate_user
Got: create_account
# ✅ v2-good: Mit workflow.patched() ist Order-Änderung safe
@workflow.defn
class OnboardingWorkflowV2Good:
@workflow.run
async def run(self, user_id: str):
if workflow.patched("reorder-validation-v2"):
# ✅ NEW CODE PATH: Neue Reihenfolge
await workflow.execute_activity(create_account, ...)
await workflow.execute_activity(validate_user, ...)
else:
# ✅ OLD CODE PATH: Alte Reihenfolge für Replay
await workflow.execute_activity(validate_user, ...)
await workflow.execute_activity(create_account, ...)
13.3 State Management Best Practices
Vermeiden Sie große Workflow-State
Regel: Workflow-State klein halten. Große Daten in Activities oder extern speichern.
# ❌ ANTI-PATTERN: Große Daten im Workflow State
@workflow.defn
class DataProcessingWorkflowBad:
def __init__(self):
self.processed_records = [] # ❌ Wächst unbegrenzt!
self.results = {} # ❌ Kann riesig werden!
@workflow.run
async def run(self, dataset_id: str):
# ❌ 1 Million Records in Memory
records = await workflow.execute_activity(
fetch_all_records, # Returns 1M records
dataset_id,
start_to_close_timeout=timedelta(minutes=10)
)
for record in records:
result = await workflow.execute_activity(
process_record,
record,
start_to_close_timeout=timedelta(seconds=30)
)
self.processed_records.append(record) # ❌ State explodiert!
self.results[record.id] = result # ❌ Speichert alles!
# Event History: 50 MB+ → Workflow terminiert!
# ✅ BEST PRACTICE: Minimaler State, externe Speicherung
@workflow.defn
class DataProcessingWorkflowGood:
def __init__(self):
self.processed_count = 0 # ✅ Nur Counter
self.batch_id = None # ✅ Nur ID
@workflow.run
async def run(self, dataset_id: str):
# ✅ Activity gibt nur Batch-ID zurück (nicht die Daten!)
self.batch_id = await workflow.execute_activity(
create_processing_batch,
dataset_id,
start_to_close_timeout=timedelta(minutes=1)
)
# ✅ Activity returned nur Count
total_records = await workflow.execute_activity(
get_record_count,
self.batch_id,
start_to_close_timeout=timedelta(seconds=30)
)
# ✅ Process in batches
batch_size = 1000
for offset in range(0, total_records, batch_size):
# ✅ Activity verarbeitet Batch und speichert extern
processed = await workflow.execute_activity(
process_batch,
ProcessBatchInput(self.batch_id, offset, batch_size),
start_to_close_timeout=timedelta(minutes=5)
)
self.processed_count += processed # ✅ Nur Counter im State
# ✅ Final result aus externer DB
return await workflow.execute_activity(
finalize_batch,
self.batch_id,
start_to_close_timeout=timedelta(minutes=1)
)
# ✅ Activities speichern große Daten extern
@activity.defn
async def process_batch(input: ProcessBatchInput) -> int:
"""Process batch and store results in external DB"""
records = fetch_records_from_db(input.batch_id, input.offset, input.limit)
results = []
for record in records:
result = process_record(record)
results.append(result)
# ✅ Store in external database (S3, PostgreSQL, etc.)
store_results_in_db(input.batch_id, results)
return len(results) # ✅ Return only count, not data
Best Practices:
- ✅ Speichern Sie IDs, nicht Daten
- ✅ Verwenden Sie Counters statt Listen
- ✅ Große Daten in Activities → S3, DB, Redis
- ✅ Workflow State < 1 KB ideal
Query Handlers sind Read-Only
Regel: Queries dürfen niemals State mutieren.
# ❌ ANTI-PATTERN: Query mutiert State
@workflow.defn
class OrderWorkflowBad:
def __init__(self):
self.status = "pending"
self.view_count = 0
@workflow.query
def get_status(self) -> str:
self.view_count += 1 # ❌ MUTATION in Query!
return self.status # ❌ Non-deterministic!
Warum ist das schlimm?
- Queries werden nicht in History gespeichert
- Bei Replay werden Queries nicht ausgeführt
- State ist nach Replay anders als vor Replay
- → Non-Determinismus!
# ✅ BEST PRACTICE: Read-only Queries
@workflow.defn
class OrderWorkflowGood:
def __init__(self):
self.status = "pending"
self.view_count = 0 # Tracked via Signal instead
@workflow.query
def get_status(self) -> str:
"""Read-only query"""
return self.status # ✅ No mutation
@workflow.signal
def track_view(self):
"""Use Signal for mutations"""
self.view_count += 1 # ✅ Signal ist in History
13.4 Code-Organisation Best Practices
Struktur: Workflows, Activities, Worker getrennt
Regel: Klare Trennung zwischen Workflows, Activities und Worker.
# ❌ ANTI-PATTERN: Alles in einer Datei
my_project/
└── main.py # 5000 Zeilen: Workflows, Activities, Worker, Client, alles!
# ✅ BEST PRACTICE: Modulare Struktur
my_project/
├── workflows/
│ ├── __init__.py
│ ├── order_workflow.py # ✅ Ein Workflow pro File
│ ├── payment_workflow.py
│ └── shipping_workflow.py
│
├── activities/
│ ├── __init__.py
│ ├── order_activities.py # ✅ Activities grouped by domain
│ ├── payment_activities.py
│ ├── shipping_activities.py
│ └── shared_activities.py # ✅ Shared utilities
│
├── models/
│ ├── __init__.py
│ ├── order_models.py # ✅ Dataclasses für Inputs/Outputs
│ └── payment_models.py
│
├── workers/
│ ├── __init__.py
│ ├── order_worker.py # ✅ Worker per domain
│ └── payment_worker.py
│
├── client/
│ └── temporal_client.py # ✅ Client-Setup
│
└── tests/
├── test_workflows/
├── test_activities/
└── test_integration/
Beispiel: Order Workflow strukturiert
# workflows/order_workflow.py
from models.order_models import OrderInput, OrderResult
from activities.order_activities import validate_order, process_payment
@workflow.defn
class OrderWorkflow:
"""Order processing workflow"""
@workflow.run
async def run(self, input: OrderInput) -> OrderResult:
# Clean orchestration only
...
# activities/order_activities.py
@activity.defn
async def validate_order(input: OrderInput) -> bool:
"""Validate order data"""
...
@activity.defn
async def process_payment(order_id: str) -> PaymentResult:
"""Process payment"""
...
# models/order_models.py
@dataclass
class OrderInput:
"""Order workflow input"""
order_id: str
user_id: str
items: List[OrderItem]
@dataclass
class OrderResult:
"""Order workflow result"""
order_id: str
status: str
tracking_number: str
# workers/order_worker.py
async def main():
"""Order worker entrypoint"""
client = await create_temporal_client()
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[validate_order, process_payment]
)
await worker.run()
Vorteile:
- ✅ Testbarkeit: Jede Komponente isoliert testbar
- ✅ Wartbarkeit: Klare Zuständigkeiten
- ✅ Code Reviews: Kleinere, fokussierte Files
- ✅ Onboarding: Neue Entwickler finden sich schnell zurecht
Worker pro Domain/Use Case
Regel: Separate Workers für verschiedene Domains.
# ❌ ANTI-PATTERN: Ein Monolith-Worker für alles
async def main():
worker = Worker(
client,
task_queue="everything-queue", # ❌ Alle Workflows auf einer Queue
workflows=[
OrderWorkflow,
PaymentWorkflow,
ShippingWorkflow,
UserWorkflow,
NotificationWorkflow,
ReportWorkflow,
# ... 50+ Workflows
],
activities=[
# ... 200+ Activities
]
)
# ❌ Probleme:
# - Kann nicht unabhängig skaliert werden
# - Deployment ist All-or-Nothing
# - Ein Bug betrifft alle Workflows
# ✅ BEST PRACTICE: Worker pro Domain
# workers/order_worker.py
async def run_order_worker():
"""Dedicated worker for order workflows"""
client = await create_temporal_client()
worker = Worker(
client,
task_queue="order-queue", # ✅ Dedicated queue
workflows=[OrderWorkflow],
activities=[
validate_order,
process_payment,
reserve_inventory,
create_shipment
]
)
await worker.run()
# workers/notification_worker.py
async def run_notification_worker():
"""Dedicated worker for notifications"""
client = await create_temporal_client()
worker = Worker(
client,
task_queue="notification-queue", # ✅ Dedicated queue
workflows=[NotificationWorkflow],
activities=[
send_email,
send_sms,
send_push_notification
]
)
await worker.run()
Deployment:
# kubernetes/order-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-worker
spec:
replicas: 5 # ✅ Skaliert unabhängig
template:
spec:
containers:
- name: order-worker
        image: myapp/order-worker:v2.3.0 # ✅ Unabhängige Versionen
---
# kubernetes/notification-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: notification-worker
spec:
replicas: 10 # ✅ Mehr Replicas für hohe Last
template:
spec:
containers:
- name: notification-worker
image: myapp/notification-worker:v1.5.0 # ✅ Andere Version OK
Vorteile:
- ✅ Unabhängige Skalierung
- ✅ Unabhängige Deployments
- ✅ Blast Radius Isolation
- ✅ Team Autonomie
13.5 Worker Configuration Best Practices
Immer mehr als ein Worker
Regel: Production braucht mindestens 2 Workers pro Queue.
# ❌ ANTI-PATTERN: Single Worker in Production
# ❌ Single Point of Failure!
# Wenn dieser Worker crashed:
# → Alle Tasks bleiben liegen
# → Schedule-To-Start explodiert
# → Workflows timeout
docker run my-worker:latest # ❌ Nur 1 Instance
# ✅ BEST PRACTICE: Multiple Workers für HA
# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-worker
spec:
replicas: 3 # ✅ Minimum 3 für High Availability
template:
spec:
containers:
- name: worker
image: my-worker:latest
env:
- name: TEMPORAL_TASK_QUEUE
value: "order-queue"
# ✅ Resource Limits
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
# ✅ Health Checks
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
# ✅ Graceful Shutdown
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
Warum mehrere Workers?
- ✅ High Availability: Worker-Crash betrifft nur Teil der Kapazität
- ✅ Rolling Updates: Zero-Downtime Deployments
- ✅ Load Balancing: Temporal verteilt automatisch
- ✅ Redundanz: Hardware-Failure resilient
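Der preStop-Hook aus dem Manifest wirkt allerdings nur, wenn der Worker-Prozess SIGTERM auch sauber behandelt. Eine mögliche Umsetzung auf Python-Seite (Skizze; create_temporal_client, OrderWorkflow und die Activities sind die Namen aus den Beispielen oben):
import asyncio
import signal

from temporalio.worker import Worker

async def main():
    client = await create_temporal_client()
    worker = Worker(
        client,
        task_queue="order-queue",
        workflows=[OrderWorkflow],
        activities=[process_payment, create_shipment],
    )

    # Auf SIGTERM (Kubernetes-Shutdown) reagieren
    stop = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)

    run_task = asyncio.create_task(worker.run())
    await stop.wait()
    await worker.shutdown()  # Laufende Tasks sauber abschließen
    await run_task

asyncio.run(main())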
Worker Tuning
Regel: Tunen Sie Worker basierend auf Schedule-To-Start Metrics.
# ❌ ANTI-PATTERN: Default Settings in Production
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[process_payment, create_shipment]
# ❌ Keine Tuning-Parameter
# → Worker kann überlastet werden
# → Oder underutilized sein
)
# ✅ BEST PRACTICE: Getunter Worker
from temporalio.worker import Worker, WorkerConfig
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[process_payment, create_shipment],
# ✅ Max concurrent Workflow Tasks
max_concurrent_workflow_tasks=100, # Default: 100
# ✅ Max concurrent Activity Tasks
max_concurrent_activities=50, # Default: 100
# ✅ Max concurrent Local Activities
max_concurrent_local_activities=200, # Default: 200
# ✅ Workflow Cache Size
max_cached_workflows=500, # Default: 600
# ✅ Sticky Queue Schedule-To-Start Timeout
sticky_queue_schedule_to_start_timeout=timedelta(seconds=5)
)
Tuning Guidelines:
| Metric | Wert | Aktion |
|---|---|---|
| Schedule-To-Start > 1s | Steigend | ❌ Mehr Workers oder max_concurrent erhöhen |
| Schedule-To-Start < 100ms | Konstant | ✅ Optimal |
| Worker CPU > 80% | Konstant | ❌ Weniger Concurrency oder mehr Workers |
| Worker Memory > 80% | Steigend | ❌ max_cached_workflows reduzieren |
Monitoring-basiertes Tuning:
# workers/tuned_worker.py
import os
# ✅ Environment-based tuning
MAX_WORKFLOW_TASKS = int(os.getenv("MAX_WORKFLOW_TASKS", "100"))
MAX_ACTIVITIES = int(os.getenv("MAX_ACTIVITIES", "50"))
async def main():
client = await create_temporal_client()
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[process_payment],
max_concurrent_workflow_tasks=MAX_WORKFLOW_TASKS,
max_concurrent_activities=MAX_ACTIVITIES
)
logging.info(
f"Starting worker with "
f"max_workflow_tasks={MAX_WORKFLOW_TASKS}, "
f"max_activities={MAX_ACTIVITIES}"
)
await worker.run()
# Deployment mit tuning
kubectl set env deployment/order-worker \
MAX_WORKFLOW_TASKS=200 \
MAX_ACTIVITIES=100
# ✅ Live-Tuning ohne Code-Change!
13.6 Performance Best Practices
Sandbox Performance Optimization
Regel: Reichen Sie deterministische Module per Pass-Through an der Workflow-Sandbox vorbei, um die Performance zu verbessern.
# ❌ ANTI-PATTERN: Langsamer Sandbox (alles wird gesandboxed)
from temporalio.worker import Worker
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[process_payment]
# ❌ Alle Module werden gesandboxed
# → Pydantic Models sind sehr langsam
)
# ✅ BEST PRACTICE: Optimierter Sandbox
from temporalio.worker import Worker
from temporalio.worker.workflow_sandbox import SandboxedWorkflowRunner, SandboxRestrictions
# ✅ Pass-through für deterministische Module
passthrough_modules = [
"pydantic", # ✅ Pydantic ist deterministisch
"dataclasses", # ✅ Dataclasses sind deterministisch
"models", # ✅ Unsere eigenen Models
"workflows.order_models", # ✅ Order-spezifische Models
]
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[process_payment],
# ✅ Custom Sandbox Configuration
workflow_runner=SandboxedWorkflowRunner(
restrictions=SandboxRestrictions.default.with_passthrough_modules(
*passthrough_modules
)
)
)
# ✅ Resultat: 5-10x schnellerer Workflow-Start!
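Ergänzend dazu lassen sich deterministische Imports auch direkt an der Import-Stelle am Sandbox-Reimport vorbeischleusen – das Python SDK stellt dafür workflow.unsafe.imports_passed_through() bereit (Skizze; Modulnamen wie oben angenommen):
# workflows/order_workflow.py
from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    # Diese Module werden nicht für jeden Workflow-Lauf neu in der Sandbox geladen
    import pydantic
    from models.order_models import OrderInput, OrderResult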
Event History Size Monitoring
Regel: Monitoren Sie History Size und reagieren Sie frühzeitig.
# ✅ BEST PRACTICE: History Size Monitoring im Workflow
@workflow.defn
class LongRunningWorkflow:
@workflow.run
async def run(self, input: JobInput):
processed = 0
for item in input.items:
# ✅ Regelmäßig History Size checken
info = workflow.info()
history_length = info.get_current_history_length()
if history_length > 8000: # ✅ Warning bei 8k (limit: 50k)
workflow.logger.warning(
f"History size: {history_length} events, "
"approaching limit (50k). Consider Continue-As-New."
)
if history_length > 10000: # ✅ Continue-As-New bei 10k
workflow.logger.info(
f"History size: {history_length}, continuing as new"
)
workflow.continue_as_new(
JobInput(
items=input.items[processed:],
total_processed=input.total_processed + processed
)
)
result = await workflow.execute_activity(
process_item,
item,
start_to_close_timeout=timedelta(seconds=30)
)
processed += 1
Prometheus Metrics:
# workers/metrics.py
from prometheus_client import Histogram, Counter
workflow_history_size = Histogram(
'temporal_workflow_history_size',
'Workflow history event count',
buckets=[10, 50, 100, 500, 1000, 5000, 10000, 50000]
)
continue_as_new_counter = Counter(
'temporal_continue_as_new_total',
'Continue-As-New executions'
)
# Im Workflow
workflow_history_size.observe(history_length)
if history_length > 10000:
continue_as_new_counter.inc()
workflow.continue_as_new(...)
13.7 Anti-Pattern Katalog
1. SDK Over-Wrapping
Anti-Pattern: Temporal SDK zu stark wrappen.
# ❌ ANTI-PATTERN: Zu starkes Wrapping versteckt Features
class MyTemporalWrapper:
"""❌ Versteckt wichtige Temporal-Features"""
def __init__(self, namespace: str):
# ❌ Versteckt Client-Konfiguration
self.client = Client.connect(namespace)
async def run_workflow(self, name: str, data: dict):
# ❌ Kein Zugriff auf:
# - Workflow ID customization
# - Retry Policies
# - Timeouts
# - Signals/Queries
return await self.client.execute_workflow(name, data)
# ❌ SDK-Updates sind schwierig
# ❌ Team kennt Temporal nicht wirklich
# ❌ Features wie Schedules, Updates nicht nutzbar
# ✅ BEST PRACTICE: Dünner Helper, voller SDK-Zugriff
# helpers/temporal_helpers.py
async def create_temporal_client(
namespace: str = "default"
) -> Client:
"""Thin helper for client creation"""
return await Client.connect(
f"localhost:7233",
namespace=namespace,
# ✅ Weitere Config durchreichbar
)
# Application code: Voller SDK-Zugriff
async def main():
client = await create_temporal_client()
# ✅ Direkter SDK-Zugriff für alle Features
handle = await client.start_workflow(
OrderWorkflow.run,
order_input,
id=f"order-{order_id}",
task_queue="order-queue",
retry_policy=RetryPolicy(maximum_attempts=3),
execution_timeout=timedelta(days=7)
)
# ✅ Signals
await handle.signal(OrderWorkflow.approve)
# ✅ Queries
status = await handle.query(OrderWorkflow.get_status)
2. Local Activities ohne Idempotenz
Anti-Pattern: Local Activities verwenden ohne Idempotenz-Keys.
# ❌ ANTI-PATTERN: Non-Idempotent Local Activity
@workflow.defn
class PaymentWorkflow:
@workflow.run
async def run(self, amount: float):
# ❌ Local Activity (kann mehrfach ausgeführt werden!)
await workflow.execute_local_activity(
charge_credit_card,
amount,
start_to_close_timeout=timedelta(seconds=5)
)
# ❌ Bei Retry: Kunde wird doppelt belastet!
@activity.defn
async def charge_credit_card(amount: float):
"""❌ Nicht idempotent!"""
# Charge without idempotency key
await payment_api.charge(amount) # ❌ Kann mehrfach passieren!
Was passiert:
1. Local Activity startet: charge_credit_card(100.0)
2. Payment API wird aufgerufen: $100 charged
3. Worker crashed vor Activity-Completion
4. Workflow replay: Local Activity wird NOCHMAL ausgeführt
5. Payment API wird NOCHMAL aufgerufen: $100 charged AGAIN
6. Kunde wurde $200 belastet statt $100!
# ✅ BEST PRACTICE: Idempotente Local Activity ODER Regular Activity
# Option 1: Idempotent Local Activity
@activity.defn
async def charge_credit_card_idempotent(
amount: float,
idempotency_key: str # ✅ Idempotency Key!
):
"""✅ Idempotent mit Key"""
await payment_api.charge(
amount,
idempotency_key=idempotency_key # ✅ API merkt Duplikate
)
@workflow.defn
class PaymentWorkflow:
@workflow.run
async def run(self, payment_id: str, amount: float):
# ✅ Unique Key basierend auf Workflow
idempotency_key = f"{workflow.info().workflow_id}-payment"
await workflow.execute_local_activity(
charge_credit_card_idempotent,
args=[amount, idempotency_key],
start_to_close_timeout=timedelta(seconds=5)
)
# Option 2: Regular Activity (recommended!)
@workflow.defn
class PaymentWorkflow:
@workflow.run
async def run(self, amount: float):
        # ✅ Regular Activity: Ergebnis landet in der Event History – beim Replay keine erneute Ausführung (bei Retries bleibt Idempotenz trotzdem wichtig)
await workflow.execute_activity(
charge_credit_card,
amount,
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=3)
)
Regel: Verwenden Sie Regular Activities als Default. Local Activities nur für:
- Sehr schnelle Operationen (<1s)
- Read-Only Operationen
- Operations mit eingebauter Idempotenz
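Ein Beispiel für einen nach diesen Kriterien unkritischen Einsatz – ein sehr schneller, rein lesender Lookup (get_feature_flag ist hier nur eine angenommene Activity):
# Schneller Read-only-Lookup als Local Activity
flag_enabled = await workflow.execute_local_activity(
    get_feature_flag,
    "new-checkout-flow",
    start_to_close_timeout=timedelta(seconds=2),
)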
3. Workers Side-by-Side mit Application Code
Anti-Pattern: Workers im gleichen Process wie Application Code deployen.
# ❌ ANTI-PATTERN: Worker + Web Server im gleichen Process
# main.py
from fastapi import FastAPI
from temporalio.worker import Worker
app = FastAPI()
@app.get("/orders/{order_id}")
async def get_order(order_id: str):
"""Web API endpoint"""
...
async def main():
# ❌ Worker und Web Server im gleichen Process!
client = await create_temporal_client()
# Start Worker (blocking!)
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[process_payment]
)
# ❌ Probleme:
# - Worker blockiert Web Server (oder umgekehrt)
# - Resource Contention (CPU/Memory)
# - Deployment ist gekoppelt
# - Scaling ist gekoppelt
# - Ein Crash betrifft beides
await worker.run()
# ✅ BEST PRACTICE: Separate Processes
# web_server.py (separate deployment)
from fastapi import FastAPI
app = FastAPI()
@app.get("/orders/{order_id}")
async def get_order(order_id: str):
"""Web API endpoint"""
client = await create_temporal_client()
handle = client.get_workflow_handle(order_id)
status = await handle.query(OrderWorkflow.get_status)
return {"status": status}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
# worker.py (separate deployment)
from temporalio.worker import Worker
async def main():
"""Dedicated worker process"""
client = await create_temporal_client()
worker = Worker(
client,
task_queue="order-queue",
workflows=[OrderWorkflow],
activities=[process_payment]
)
await worker.run()
if __name__ == "__main__":
asyncio.run(main())
Separate Deployments:
# kubernetes/web-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
spec:
replicas: 10 # ✅ Viele Replicas für Web Traffic
template:
spec:
containers:
- name: web
image: myapp/web:latest
command: ["python", "web_server.py"]
resources:
requests:
cpu: "200m" # ✅ Wenig CPU für Web
memory: "256Mi"
---
# kubernetes/worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: temporal-worker
spec:
replicas: 3 # ✅ Weniger Replicas, aber mehr Ressourcen
template:
spec:
containers:
- name: worker
image: myapp/worker:latest
command: ["python", "worker.py"]
resources:
requests:
cpu: "1000m" # ✅ Mehr CPU für Worker
memory: "2Gi" # ✅ Mehr Memory für Workflow Caching
13.8 Production Readiness Checklist
Code-Ebene
✅ Workflows orchestrieren nur, implementieren nicht
✅ Single Object Input Pattern für alle Workflows
✅ Alle non-deterministic Operationen in Activities
✅ Continue-As-New für long-running Workflows
✅ History Size Monitoring implementiert
✅ Query Handlers sind read-only
✅ Replay Tests in CI/CD
✅ Comprehensive Unit Tests für Activities
✅ Integration Tests mit WorkflowEnvironment
Deployment-Ebene
✅ Minimum 3 Worker Replicas pro Queue
✅ Workers separiert von Application Code
✅ Resource Limits definiert (CPU/Memory)
✅ Health Checks konfiguriert
✅ Graceful Shutdown implementiert
✅ Worker pro Domain/Use Case
✅ Worker Tuning basierend auf Metrics
✅ Rolling Update Strategy konfiguriert
Monitoring-Ebene
✅ Schedule-To-Start Metrics
✅ Workflow Success/Failure Rate
✅ Activity Duration & Error Rate
✅ Event History Size Tracking
✅ Worker CPU/Memory Monitoring
✅ Continue-As-New Rate
✅ Alerts konfiguriert (PagerDuty/Slack)
Testing-Ebene
✅ Replay Tests für jede Workflow-Version
✅ Unit Tests für jede Activity
✅ Integration Tests für Happy Path
✅ Integration Tests für Error Cases
✅ Production History Replay in CI
✅ Load Testing für Worker Capacity
✅ Chaos Engineering Tests (Worker Failures)
13.9 Code Review Checkliste
Verwenden Sie diese Checkliste bei Code Reviews:
Workflow Code Review
✅ Workflow orchestriert nur (keine Business Logic)?
✅ Single Object Input statt multiple Parameters?
✅ Keine non-deterministic Operationen (random, datetime.now, etc.)?
✅ Keine Activity-Reihenfolge geändert ohne workflow.patched()?
✅ Continue-As-New für long-running Workflows?
✅ History Size Monitoring vorhanden?
✅ Workflow State klein (<1 KB)?
✅ Query Handlers sind read-only?
✅ Replay Tests hinzugefügt?
Activity Code Review
✅ Activity ist idempotent?
✅ Activity hat Retry-Logic (oder RetryPolicy)?
✅ Activity hat Timeout definiert?
✅ Activity ist unit-testbar?
✅ Externe Calls haben Circuit Breaker?
✅ Activity loggt Errors mit Context?
✅ Activity gibt strukturiertes Result zurück (nicht primitives)?
Worker Code Review
✅ Worker hat max_concurrent_* konfiguriert?
✅ Worker hat Health Check Endpoint?
✅ Worker hat Graceful Shutdown?
✅ Worker ist unabhängig deploybar?
✅ Worker hat Resource Limits?
✅ Worker hat Monitoring/Metrics?
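Für den Punkt „Health Check Endpoint" genügt oft schon ein minimaler HTTP-Server neben dem Worker – eine Skizze mit Bordmitteln der Standardbibliothek:
# workers/health.py
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

def start_health_server(port: int = 8080) -> None:
    """Startet den Health-Endpoint in einem Daemon-Thread neben dem Worker."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

# Im Worker-Entrypoint: start_health_server() vor worker.run() aufrufen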
13.10 Zusammenfassung
Top 10 Best Practices
- Orchestration, nicht Implementation: Workflows orchestrieren, Activities implementieren
- Single Object Input: Ein Dataclass-Input statt viele Parameter
- Determinismus: Alles Non-Deterministische in Activities
- Continue-As-New: Bei >1.000 Events oder long-running Workflows
- Minimaler State: IDs speichern, nicht Daten
- Code-Organisation: Workflows, Activities, Workers getrennt
- Multiple Workers: Minimum 3 Replicas in Production
- Worker Tuning: Basierend auf Schedule-To-Start Metrics
- Replay Testing: Jede Workflow-Änderung testen
- Monitoring: Schedule-To-Start, Success Rate, History Size
Top 10 Anti-Patterns
- Non-Determinismus: random(), datetime.now(), uuid.uuid4() im Workflow
- Activity-Reihenfolge ändern: Ohne workflow.patched()
- Große Event History: >10.000 Events ohne Continue-As-New
- Großer Workflow State: Listen/Dicts statt IDs
- Query Mutation: State in Query Handler ändern
- SDK Over-Wrapping: Temporal SDK zu stark abstrahieren
- Local Activities ohne Idempotenz: Duplikate werden nicht verhindert
- Single Worker: Kein Failover, kein Rolling Update
- Workers mit App Code: Resource Contention, gekoppeltes Deployment
- Fehlende Tests: Keine Replay Tests, keine Integration Tests
Quick Reference: Was ist OK wo?
| Operation | Workflow | Activity | Warum |
|---|---|---|---|
| random.random() | ❌ | ✅ | Non-deterministic |
| datetime.now() | ❌ | ✅ | Non-deterministic |
| uuid.uuid4() | ❌ | ✅ | Non-deterministic |
| External API Call | ❌ | ✅ | Non-deterministic |
| Database Query | ❌ | ✅ | Non-deterministic |
| File I/O | ❌ | ✅ | Non-deterministic |
| Heavy Computation | ❌ | ✅ | Should be retryable |
| workflow.sleep() | ✅ | ❌ | Deterministic timer |
| workflow.execute_activity() | ✅ | ❌ | Workflow orchestration |
| State Management | ✅ (minimal) | ❌ | Workflow owns state |
| Logging | ✅ | ✅ | Both OK |
Nächste Schritte
Sie haben jetzt:
- ✅ Best Practices für Production-Ready Workflows
- ✅ Anti-Patterns Katalog zur Vermeidung häufiger Fehler
- ✅ Code-Organisation Patterns für Wartbarkeit
- ✅ Worker-Tuning Guidelines für Performance
- ✅ Production Readiness Checkliste
In Teil V (Kochbuch) werden wir konkrete Rezepte für häufige Use Cases sehen:
- E-Commerce Order Processing
- Payment Processing with Retries
- Long-Running Approval Workflows
- Scheduled Cleanup Jobs
- Fan-Out/Fan-In Patterns
- Saga Pattern Implementation
⬆ Zurück zum Inhaltsverzeichnis
Nächstes Kapitel: Kapitel 14: Muster-Rezepte (Human-in-Loop, Cron, Order Fulfillment)
Code-Beispiele für dieses Kapitel: examples/part-04/chapter-13/
Kapitel 14: Muster-Rezepte (Human-in-Loop, Cron, Order Fulfillment)
In diesem Kapitel untersuchen wir drei bewährte Workflow-Muster, die in der Praxis häufig vorkommen und für die Temporal besonders gut geeignet ist. Diese Muster zeigen die Stärken von Temporal bei der Orchestrierung komplexer Geschäftsprozesse.
14.1 Überblick: Warum Muster-Rezepte?
Während wir in den vorherigen Kapiteln die Grundlagen von Temporal kennengelernt haben, geht es nun darum, wie man häufige Geschäftsszenarien elegant und robust implementiert. Die drei Muster, die wir behandeln werden, repräsentieren typische Herausforderungen in verteilten Systemen:
- Human-in-the-Loop: Prozesse, die menschliche Eingaben oder Genehmigungen erfordern
- Cron/Scheduling: Zeitgesteuerte, wiederkehrende Aufgaben
- Order Fulfillment (Saga): Verteilte Transaktionen über mehrere Services hinweg
14.2 Human-in-the-Loop Pattern
Das Problem
Viele Geschäftsprozesse erfordern an bestimmten Punkten menschliche Entscheidungen oder Eingaben:
- Genehmigung von Urlaubsanträgen
- Überprüfung von Hintergrundüberprüfungen (Background Checks)
- Freigabe von Zahlungen über einem bestimmten Betrag
- Klärung von Mehrdeutigkeiten in automatisierten Prozessen
Die Herausforderung besteht darin, dass diese menschlichen Interaktionen unvorhersehbar lange dauern können – von Minuten bis zu Tagen oder sogar Wochen.
Die Temporal-Lösung
Temporal ermöglicht es Workflows, auf menschliche Eingaben zu warten, ohne Ressourcen zu blockieren. Der Workflow kann für Stunden oder Tage “schlafen” und wird genau dort fortgesetzt, wo er gestoppt hat, sobald die Eingabe erfolgt.
Wichtige Konzepte:
- Signals: Ermöglichen es, Daten in einen laufenden Workflow zu senden
- Queries: Erlauben das Abfragen des aktuellen Workflow-Status
- Timers: Können als Timeout für zu lange Wartezeiten dienen
Implementierungsbeispiel: Genehmigungsprozess
import asyncio
from temporalio import workflow
from datetime import timedelta
@workflow.defn
class ApprovalWorkflow:
def __init__(self):
self.approved = False
self.rejection_reason = None
@workflow.run
async def run(self, request_data: dict) -> str:
# 1. Sende Benachrichtigung an Genehmiger
await workflow.execute_activity(
send_approval_notification,
request_data,
start_to_close_timeout=timedelta(seconds=30)
)
# 2. Warte auf Genehmigung mit Timeout von 7 Tagen
try:
await workflow.wait_condition(
lambda: self.approved or self.rejection_reason,
timeout=timedelta(days=7)
)
except asyncio.TimeoutError:
# Automatische Eskalation nach 7 Tagen
await workflow.execute_activity(
escalate_to_manager,
request_data,
start_to_close_timeout=timedelta(seconds=30)
)
# Warte weitere 3 Tage auf Manager
await workflow.wait_condition(
lambda: self.approved or self.rejection_reason,
timeout=timedelta(days=3)
)
# 3. Verarbeite das Ergebnis
if self.approved:
await workflow.execute_activity(
process_approval,
request_data,
start_to_close_timeout=timedelta(minutes=5)
)
return "approved"
else:
await workflow.execute_activity(
notify_rejection,
args=[request_data, self.rejection_reason],
start_to_close_timeout=timedelta(seconds=30)
)
return f"rejected: {self.rejection_reason}"
@workflow.signal
async def approve(self):
"""Signal zum Genehmigen des Antrags"""
self.approved = True
@workflow.signal
async def reject(self, reason: str):
"""Signal zum Ablehnen des Antrags"""
self.rejection_reason = reason
@workflow.query
def get_status(self) -> dict:
"""Abfrage des aktuellen Status"""
return {
"approved": self.approved,
"rejected": self.rejection_reason is not None,
"pending": not self.approved and not self.rejection_reason
}
Verwendung des Workflows:
# Workflow starten
handle = await client.start_workflow(
ApprovalWorkflow.run,
request_data,
id="approval-12345",
task_queue="approval-tasks"
)
# Status abfragen (jederzeit möglich)
status = await handle.query(ApprovalWorkflow.get_status)
print(f"Current status: {status}")
# Genehmigung senden (kann Tage später erfolgen)
await handle.signal(ApprovalWorkflow.approve)
# Auf Ergebnis warten
result = await handle.result()
Best Practices
- Timeouts verwenden: Implementiere immer Timeouts und Eskalationsmechanismen
- Status abfragbar machen: Nutze Queries, damit Benutzer den Status jederzeit prüfen können
- Benachrichtigungen senden: Informiere Menschen aktiv über ausstehende Aktionen
- Idempotenz beachten: Signals können mehrfach gesendet werden – handle dies entsprechend
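Der letzte Punkt lässt sich direkt im Signal-Handler absichern – eine kleine Skizze auf Basis des ApprovalWorkflow von oben:
@workflow.signal
async def approve(self):
    """Signal zum Genehmigen – idempotent: weitere Signale ändern nichts mehr"""
    if self.approved or self.rejection_reason:
        workflow.logger.info("Approve-Signal ignoriert: Entscheidung liegt bereits vor")
        return
    self.approved = True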
14.3 Cron und Scheduling Pattern
Warum nicht einfach Cron?
Traditionelle Cron-Jobs haben mehrere Probleme:
- Keine Visibilität in den Ausführungsstatus
- Keine einfache Möglichkeit, Jobs zu pausieren oder zu stoppen
- Schwierig zu testen und zu überwachen
- Keine Garantie für genau eine Ausführung (at-least-once, aber nicht exactly-once)
- Kein eingebautes Retry-Verhalten
Temporal Schedules: Die bessere Alternative
Temporal Schedules bieten:
- Vollständige Kontrolle: Start, Stop, Pause, Update zur Laufzeit
- Observability: Einsicht in alle vergangenen und zukünftigen Ausführungen
- Backfill: Nachträgliches Ausführen verpasster Runs
- Overlap-Policies: Kontrolliere, was passiert, wenn ein Workflow noch läuft, während der nächste starten soll
Schedule-Optionen
1. Cron-Style Scheduling:
from temporalio.client import (
    Client,
    Schedule,
    ScheduleActionStartWorkflow,
    ScheduleIntervalSpec,
    ScheduleOverlapPolicy,
    SchedulePolicy,
    ScheduleSpec,
)
from datetime import datetime, timedelta
async def create_cron_schedule():
client = await Client.connect("localhost:7233")
await client.create_schedule(
id="daily-report-schedule",
schedule=Schedule(
action=ScheduleActionStartWorkflow(
workflow_type=GenerateReportWorkflow,
args=["daily"],
id=f"daily-report-{datetime.now().strftime('%Y%m%d')}",
task_queue="reporting"
),
spec=ScheduleSpec(
# Jeden Tag um 6 Uhr morgens UTC
cron_expressions=["0 6 * * *"],
),
# Was tun bei Überlappungen?
policy=SchedulePolicy(
overlap=ScheduleOverlapPolicy.SKIP, # Überspringe, wenn noch läuft
)
)
)
Cron-Format in Temporal:
┌───────────── Minute (0-59)
│ ┌───────────── Stunde (0-23)
│ │ ┌───────────── Tag des Monats (1-31)
│ │ │ ┌───────────── Monat (1-12)
│ │ │ │ ┌───────────── Tag der Woche (0-6, Sonntag = 0)
│ │ │ │ │
* * * * *
Beispiele:
- 0 9 * * 1-5: Werktags um 9 Uhr
- */15 * * * *: Alle 15 Minuten
- 0 0 1 * *: Am ersten Tag jeden Monats um Mitternacht
2. Interval-basiertes Scheduling:
await client.create_schedule(
id="health-check-schedule",
schedule=Schedule(
action=ScheduleActionStartWorkflow(
workflow_type=HealthCheckWorkflow,
task_queue="monitoring"
),
spec=ScheduleSpec(
# Alle 5 Minuten
intervals=[ScheduleIntervalSpec(
every=timedelta(minutes=5)
)],
)
)
)
Overlap-Policies
Was passiert, wenn ein Workflow noch läuft, während der nächste geplant ist?
from temporalio.client import SchedulePolicy, ScheduleOverlapPolicy
# SKIP: Überspringe die neue Ausführung
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.SKIP)
# BUFFER_ONE: Stelle maximal eine weitere Ausführung in die Warteschlange
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.BUFFER_ONE)
# BUFFER_ALL: Puffere alle Ausführungen (Vorsicht: kann zu Stau führen!)
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.BUFFER_ALL)
# CANCEL_OTHER: Breche den laufenden Workflow ab und starte neu
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.CANCEL_OTHER)
# ALLOW_ALL: Erlaube parallele Ausführungen
policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.ALLOW_ALL)
Schedule-Management
# Schedule abrufen
schedule_handle = client.get_schedule_handle("daily-report-schedule")
# Beschreibung abrufen
description = await schedule_handle.describe()
print(f"Next 5 runs: {description.info.next_action_times[:5]}")
# Pausieren
await schedule_handle.pause(note="Maintenance window")
# Wieder aktivieren
await schedule_handle.unpause(note="Maintenance complete")
# Einmalig manuell auslösen
await schedule_handle.trigger(overlap=ScheduleOverlapPolicy.ALLOW_ALL)
# Backfill: Verpasste Ausführungen nachholen
await schedule_handle.backfill(
    ScheduleBackfill(
        start_at=datetime(2024, 1, 1),
        end_at=datetime(2024, 1, 31),
        overlap=ScheduleOverlapPolicy.ALLOW_ALL
    )
)
# Schedule löschen
await schedule_handle.delete()
Workflow-Implementierung für Schedules
@workflow.defn
class DataSyncWorkflow:
@workflow.run
async def run(self) -> dict:
# Workflow weiß, ob er via Schedule gestartet wurde
info = workflow.info()
workflow.logger.info(
f"Running scheduled sync. Attempt: {info.attempt}"
)
# Normale Workflow-Logik
records = await workflow.execute_activity(
fetch_new_records,
start_to_close_timeout=timedelta(minutes=10)
)
await workflow.execute_activity(
sync_to_database,
records,
start_to_close_timeout=timedelta(minutes=5)
)
return {
"synced_records": len(records),
"timestamp": datetime.now().isoformat()
}
Best Practices für Scheduling
- Idempotenz: Schedules können Workflows mehrfach starten – stelle sicher, dass deine Logik idempotent ist
- Monitoring: Nutze Temporal UI, um verpasste oder fehlgeschlagene Runs zu überwachen
- Overlap-Policy wählen: Überlege genau, was bei Überlappungen passieren soll
- Zeitzone beachten: Cron-Ausdrücke werden standardmäßig in UTC interpretiert
- Workflow-IDs: Verwende dynamische Workflow-IDs mit Zeitstempel, um Duplikate zu vermeiden
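Der erste Punkt (Idempotenz) lässt sich z.B. umsetzen, indem die Workflow-ID des Schedule-Runs als Idempotenz-Schlüssel an die Activities weitergereicht wird – eine kleine Skizze (charge_daily_fees ist hier nur eine angenommene Activity):
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class DailyBillingWorkflow:
    @workflow.run
    async def run(self) -> None:
        # Temporal hängt bei Schedule-Starts den Zeitstempel des Runs an die Workflow-ID an –
        # sie eignet sich damit als stabiler Idempotenz-Schlüssel für Activities.
        idempotency_key = workflow.info().workflow_id
        await workflow.execute_activity(
            charge_daily_fees,  # hypothetische Activity
            idempotency_key,
            start_to_close_timeout=timedelta(minutes=5),
        )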
14.4 Order Fulfillment mit dem Saga Pattern
Das Problem: Verteilte Transaktionen
Stellen wir uns einen E-Commerce-Bestellprozess vor, der mehrere Services involviert:
- Inventory Service: Prüfe Verfügbarkeit und reserviere Artikel
- Payment Service: Belaste Kreditkarte
- Shipping Service: Erstelle Versandetikett und beauftrage Versand
- Notification Service: Sende Bestätigungsmail
Was passiert, wenn Schritt 3 fehlschlägt, nachdem wir bereits Schritt 1 und 2 ausgeführt haben? Wir müssen:
- Die Kreditkartenbelastung rückgängig machen (Schritt 2)
- Die Inventar-Reservierung aufheben (Schritt 1)
Dies ist das klassische Problem verteilter Transaktionen: Entweder alle Schritte erfolgreich, oder keiner.
Das Saga Pattern
Ein Saga ist eine Sequenz von lokalen Transaktionen, wobei jede Transaktion eine Kompensation (Rückgängigmachung) hat. Falls ein Schritt fehlschlägt, werden alle vorherigen Schritte durch ihre Kompensationen rückgängig gemacht.
Zwei Hauptkomponenten:
- Forward-Recovery: Die normalen Schritte vorwärts
- Backward-Recovery (Compensations): Die Rückgängigmachung bei Fehler
Temporal vereinfacht Sagas
Ohne Temporal müsstest du:
- Selbst den Fortschritt tracken (Event Sourcing)
- Retry-Logik implementieren
- State Management über Services hinweg
- Crash-Recovery-Mechanismen bauen
Mit Temporal bekommst du all das kostenlos. Du musst nur die Kompensationen definieren.
Implementierung: Order Fulfillment
from temporalio import workflow, activity
from temporalio.common import RetryPolicy
from datetime import timedelta
from dataclasses import dataclass
from typing import Optional
@dataclass
class OrderInfo:
order_id: str
customer_id: str
items: list[dict]
total_amount: float
idempotency_key: str # Wichtig für Idempotenz!
@dataclass
class SagaCompensation:
activity_name: str
args: list
class Saga:
"""Helper-Klasse zum Verwalten von Kompensationen"""
def __init__(self):
self.compensations: list[SagaCompensation] = []
def add_compensation(self, activity_name: str, *args):
"""Füge eine Kompensation hinzu"""
self.compensations.append(
SagaCompensation(activity_name, list(args))
)
async def compensate(self):
"""Führe alle Kompensationen in umgekehrter Reihenfolge aus"""
# LIFO: Last In, First Out
for comp in reversed(self.compensations):
try:
await workflow.execute_activity(
comp.activity_name,
args=comp.args,
start_to_close_timeout=timedelta(minutes=5),
                    retry_policy=RetryPolicy(
maximum_attempts=5,
initial_interval=timedelta(seconds=1),
maximum_interval=timedelta(seconds=60)
)
)
workflow.logger.info(f"Compensated: {comp.activity_name}")
except Exception as e:
workflow.logger.error(
f"Compensation failed for {comp.activity_name}: {e}"
)
# In Produktion: Dead Letter Queue, Alerting, etc.
@workflow.defn
class OrderFulfillmentWorkflow:
@workflow.run
async def run(self, order: OrderInfo) -> dict:
saga = Saga()
try:
# Schritt 1: Inventar prüfen und reservieren
inventory_reserved = await workflow.execute_activity(
reserve_inventory,
order,
start_to_close_timeout=timedelta(minutes=2)
)
# Kompensation hinzufügen: Reservierung aufheben
saga.add_compensation(
"release_inventory",
order.order_id,
order.idempotency_key
)
workflow.logger.info(f"Inventory reserved: {inventory_reserved}")
# Schritt 2: Zahlung durchführen
payment_result = await workflow.execute_activity(
charge_payment,
order,
start_to_close_timeout=timedelta(minutes=5)
)
# Kompensation hinzufügen: Zahlung erstatten
saga.add_compensation(
"refund_payment",
payment_result["transaction_id"],
order.total_amount,
order.idempotency_key
)
workflow.logger.info(f"Payment charged: {payment_result}")
# Schritt 3: Versand erstellen
shipping_result = await workflow.execute_activity(
create_shipment,
order,
start_to_close_timeout=timedelta(minutes=3)
)
# Kompensation hinzufügen: Versand stornieren
saga.add_compensation(
"cancel_shipment",
shipping_result["shipment_id"],
order.idempotency_key
)
workflow.logger.info(f"Shipment created: {shipping_result}")
# Schritt 4: Bestätigung senden (keine Kompensation nötig)
await workflow.execute_activity(
send_confirmation_email,
order,
start_to_close_timeout=timedelta(seconds=30)
)
# Erfolg!
return {
"status": "fulfilled",
"order_id": order.order_id,
"tracking_number": shipping_result["tracking_number"]
}
except Exception as e:
workflow.logger.error(f"Order fulfillment failed: {e}")
# Kompensiere alle bisherigen Schritte
await saga.compensate()
# Sende Fehlerbenachrichtigung
await workflow.execute_activity(
send_error_notification,
args=[order, str(e)],
start_to_close_timeout=timedelta(seconds=30)
)
return {
"status": "failed",
"order_id": order.order_id,
"error": str(e)
}
Activity-Implementierungen
# Activities mit Idempotenz
@activity.defn
async def reserve_inventory(order: OrderInfo) -> bool:
"""Reserviere Artikel im Inventar"""
# Verwende idempotency_key, um Duplikate zu vermeiden
response = await inventory_service.reserve(
items=order.items,
order_id=order.order_id,
idempotency_key=order.idempotency_key
)
return response.success
@activity.defn
async def release_inventory(order_id: str, idempotency_key: str):
"""Kompensation: Gib Inventar-Reservierung frei"""
await inventory_service.release(
order_id=order_id,
idempotency_key=f"{idempotency_key}-release"
)
@activity.defn
async def charge_payment(order: OrderInfo) -> dict:
"""Belaste Zahlungsmittel"""
# Viele Payment-APIs akzeptieren bereits idempotency_keys
response = await payment_service.charge(
customer_id=order.customer_id,
amount=order.total_amount,
idempotency_key=order.idempotency_key
)
return {
"transaction_id": response.transaction_id,
"status": response.status
}
@activity.defn
async def refund_payment(
transaction_id: str,
amount: float,
idempotency_key: str
):
"""Kompensation: Erstatte Zahlung"""
await payment_service.refund(
transaction_id=transaction_id,
amount=amount,
idempotency_key=f"{idempotency_key}-refund"
)
@activity.defn
async def create_shipment(order: OrderInfo) -> dict:
"""Erstelle Versandetikett"""
response = await shipping_service.create_shipment(
order=order,
idempotency_key=order.idempotency_key
)
return {
"shipment_id": response.shipment_id,
"tracking_number": response.tracking_number
}
@activity.defn
async def cancel_shipment(shipment_id: str, idempotency_key: str):
"""Kompensation: Storniere Versand"""
await shipping_service.cancel(
shipment_id=shipment_id,
idempotency_key=f"{idempotency_key}-cancel"
)
@activity.defn
async def send_confirmation_email(order: OrderInfo):
"""Sende Bestätigungs-E-Mail"""
await email_service.send(
to=order.customer_id,
template="order_confirmation",
data=order
)
@activity.defn
async def send_error_notification(order: OrderInfo, error: str):
"""Sende Fehler-Benachrichtigung"""
await email_service.send(
to=order.customer_id,
template="order_failed",
data={"order": order, "error": error}
)
Kritisches Konzept: Idempotenz
Da Temporal Activities automatisch wiederholt, müssen alle Activities idempotent sein:
# Schlechtes Beispiel: Nicht idempotent
async def charge_payment_bad(customer_id: str, amount: float):
# Könnte bei Retry mehrfach belasten!
return payment_api.charge(customer_id, amount)
# Gutes Beispiel: Idempotent mit Key
async def charge_payment_good(
customer_id: str,
amount: float,
idempotency_key: str
):
# Payment-API prüft den Key und führt nur einmal aus
return payment_api.charge(
customer_id,
amount,
idempotency_key=idempotency_key
)
Best Practices für Idempotenz:
- Idempotenz-Keys verwenden: UUIDs oder zusammengesetzte Keys (z.B. {order_id}-{operation})
- API-Unterstützung nutzen: Viele APIs (Stripe, PayPal, etc.) akzeptieren bereits Idempotenz-Keys
- Datenbank-Constraints: Unique-Constraints auf Keys in der Datenbank
- State-Checks: Prüfe vor Ausführung, ob Operation bereits durchgeführt wurde
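Der State-Check aus dem letzten Punkt kann z.B. so aussehen – payment_repository ist hier ein hypothetischer Persistenz-Helper mit Unique-Constraint auf dem Idempotenz-Key:
@activity.defn
async def charge_payment_with_state_check(order: OrderInfo) -> dict:
    """Belaste Zahlungsmittel nur, wenn der Key noch nicht verarbeitet wurde"""
    existing = await payment_repository.find_by_idempotency_key(order.idempotency_key)
    if existing is not None:
        return existing  # Bereits durchgeführt: gespeichertes Ergebnis zurückgeben

    response = await payment_service.charge(
        customer_id=order.customer_id,
        amount=order.total_amount,
        idempotency_key=order.idempotency_key,
    )
    result = {"transaction_id": response.transaction_id, "status": response.status}
    await payment_repository.save(order.idempotency_key, result)
    return result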
Erweiterte Saga-Techniken
Parallele Kompensationen:
async def compensate_parallel(self):
"""Führe Kompensationen parallel aus für bessere Performance"""
tasks = []
for comp in reversed(self.compensations):
task = workflow.execute_activity(
comp.activity_name,
args=comp.args,
start_to_close_timeout=timedelta(minutes=5)
)
tasks.append(task)
# Warte auf alle Kompensationen
results = await asyncio.gather(*tasks, return_exceptions=True)
for i, result in enumerate(results):
if isinstance(result, Exception):
workflow.logger.error(
f"Compensation failed: {self.compensations[i].activity_name}"
)
Teilweise Kompensation:
class Saga:
    def __init__(self):
        self.compensations: list[SagaCompensation] = []
        # Checkpoint-Name -> Anzahl der Kompensationen zu diesem Zeitpunkt
        self.checkpoints: dict[str, int] = {}

    def add_checkpoint(self, name: str):
        """Setze einen Checkpoint für teilweise Kompensation"""
        self.checkpoints[name] = len(self.compensations)

    async def compensate_to_checkpoint(self, checkpoint_name: str):
        """Kompensiere nur die Schritte, die nach dem Checkpoint hinzugekommen sind"""
        checkpoint_pos = self.checkpoints[checkpoint_name]
        for comp in reversed(self.compensations[checkpoint_pos:]):
            await workflow.execute_activity(...)
Wann Sagas verwenden?
Geeignet für:
- E-Commerce Order Processing
- Reisebuchungen (Flug + Hotel + Mietwagen)
- Finanzielle Transaktionen über mehrere Konten
- Multi-Service Workflows mit “Alles-oder-Nichts”-Semantik
Nicht geeignet für:
- Einfache, nicht-transaktionale Workflows
- Workflows ohne Notwendigkeit für Rollback
- Szenarien, wo echte ACID-Transaktionen möglich sind
14.5 Zusammenfassung
In diesem Kapitel haben wir drei essenzielle Workflow-Muster kennengelernt:
Human-in-the-Loop
- Problem: Workflows benötigen menschliche Eingaben mit unvorhersehbarer Dauer
- Lösung: Signals zum Senden von Eingaben, Queries zum Abfragen des Status, Timers für Timeouts
- Key Takeaway: Temporal-Workflows können problemlos Tage oder Wochen auf Input warten
Cron/Scheduling
- Problem: Traditionelle Cron-Jobs sind schwer zu überwachen und zu steuern
- Lösung: Temporal Schedules mit voller Kontrolle, Observability und Overlap-Policies
- Key Takeaway: Schedules sind Cron-Jobs mit Superkräften
Order Fulfillment (Saga Pattern)
- Problem: Verteilte Transaktionen über mehrere Services ohne echte ACID-Garantien
- Lösung: Saga Pattern mit Kompensationen für Rollback, Temporal übernimmt State-Management
- Key Takeaway: Idempotenz ist kritisch, Temporal macht Sagas einfach
Gemeinsame Prinzipien
Alle drei Muster profitieren von Temporals Kernstärken:
- Durability: State wird automatisch persistiert
- Reliability: Automatische Retries und Fehlerbehandlung
- Observability: Vollständige Einsicht in Workflow-Ausführungen
- Scalability: Workflows können über lange Zeiträume laufen
Im nächsten Kapitel widmen wir uns erweiterten Rezepten – AI Agents, Serverless-Integration und Polyglot-Workflows – und sehen, wie sich diese Grundmuster in modernen Architekturen kombinieren lassen.
Kapitel 15: Erweiterte Rezepte (AI Agents, Lambda, Polyglot)
In diesem Kapitel behandeln wir drei fortgeschrittene Anwendungsfälle, die zeigen, wie Temporal in modernen, heterogenen Architekturen eingesetzt wird. Diese Rezepte demonstrieren die Flexibilität und Erweiterbarkeit der Plattform.
15.1 Überblick: Die Evolution von Temporal
Während Kapitel 14 klassische Workflow-Muster behandelte, konzentriert sich dieses Kapitel auf neuere, spezialisierte Anwendungsfälle:
- AI Agents: Orchestrierung von KI-Agenten mit LLMs und langlebigen Konversationen
- Serverless Integration: Kombination von Temporal mit AWS Lambda und anderen FaaS-Plattformen
- Polyglot Workflows: Mehrsprachige Workflows über verschiedene SDKs hinweg
Diese Muster repräsentieren den aktuellen Stand der Temporal-Nutzung in der Industrie (Stand 2024/2025).
15.2 AI Agents mit Temporal
15.2.1 Warum Temporal für AI Agents?
Die Entwicklung von AI-Agenten bringt spezifische Herausforderungen mit sich:
- Langlebige Konversationen: Gespräche können über Stunden oder Tage verlaufen
- Zustandsverwaltung: Kontext, Ziele und bisherige Interaktionen müssen persistent gespeichert werden
- Fehlertoleranz: LLM-APIs können fehlschlagen, Rate-Limits erreichen oder inkonsistente Antworten liefern
- Human-in-the-Loop: Menschen müssen in kritischen Momenten eingreifen können
- Tool-Orchestrierung: Agenten rufen verschiedene externe Tools auf
Temporal bietet für all diese Herausforderungen native Lösungen:
graph TB
subgraph "AI Agent Architecture mit Temporal"
WF[Workflow: Agent Orchestrator]
subgraph "Activities"
LLM[LLM API Call]
TOOL1[Tool: Database Query]
TOOL2[Tool: Web Search]
TOOL3[Tool: File Analysis]
HUMAN[Human Intervention]
end
STATE[(Workflow State:<br/>- Conversation History<br/>- Agent Goal<br/>- Tool Results)]
WF --> LLM
WF --> TOOL1
WF --> TOOL2
WF --> TOOL3
WF --> HUMAN
WF -.stores.-> STATE
end
style WF fill:#e1f5ff
style LLM fill:#ffe1f5
style STATE fill:#fff4e1
15.2.2 Real-World Adoption
Unternehmen, die Temporal für AI Agents nutzen (Stand 2024):
- Lindy, Dust, ZoomInfo: AI Agents mit State-Durability
- Descript & Neosync: Datenpipelines und GPU-Ressourcen-Koordination
- OpenAI Integration: Temporal hat eine offizielle Integration mit dem OpenAI Agents SDK (Public Preview, Python SDK)
15.2.3 Grundlegendes AI Agent Pattern
import json
from temporalio import workflow, activity
from temporalio.common import RetryPolicy
from datetime import timedelta
from dataclasses import dataclass, field
from typing import List, Optional
import openai
@dataclass
class Message:
role: str # "system", "user", "assistant", "tool"
content: str
name: Optional[str] = None # Tool name
tool_call_id: Optional[str] = None
@dataclass
class AgentState:
goal: str
conversation_history: List[Message] = field(default_factory=list)
tools_used: List[str] = field(default_factory=list)
completed: bool = False
result: Optional[str] = None
# Activities: Non-deterministische LLM und Tool Calls
@activity.defn
async def call_llm(messages: List[Message], tools: List[dict]) -> dict:
"""
Ruft LLM API auf (OpenAI, Claude, etc.).
Vollständig non-deterministisch - perfekt für Activity.
"""
activity.logger.info(f"Calling LLM with {len(messages)} messages")
try:
        client = openai.AsyncOpenAI()  # liest den API-Key aus OPENAI_API_KEY
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": m.role, "content": m.content} for m in messages],
            tools=tools,
            temperature=0.7,
        )
        choice = response.choices[0]
        return {
            "content": choice.message.content,
            # Tool-Calls als dicts zurückgeben, damit sie JSON-serialisierbar sind
            "tool_calls": [tc.model_dump() for tc in (choice.message.tool_calls or [])],
            "finish_reason": choice.finish_reason
        }
except Exception as e:
activity.logger.error(f"LLM API error: {e}")
raise
@activity.defn
async def execute_tool(tool_name: str, arguments: dict) -> str:
"""
Führt Tool-Aufrufe aus (Database, APIs, File System, etc.).
"""
activity.logger.info(f"Executing tool: {tool_name}")
if tool_name == "search_database":
# Simuliere Datenbanksuche
query = arguments.get("query")
results = await database_search(query)
return f"Found {len(results)} results: {results}"
elif tool_name == "web_search":
# Web-Suche
query = arguments.get("query")
results = await web_search_api(query)
return f"Web search results: {results}"
elif tool_name == "read_file":
# Datei lesen
filepath = arguments.get("filepath")
content = await read_file_async(filepath)
return content
else:
raise ValueError(f"Unknown tool: {tool_name}")
@activity.defn
async def request_human_input(question: str, context: dict) -> str:
"""
Fordert menschliche Eingabe an (via UI, Email, Slack, etc.).
"""
activity.logger.info(f"Requesting human input: {question}")
# In Production: Sende Notification via Slack/Email
# und warte auf Webhook/API Call zurück
notification_result = await send_notification(
channel="slack",
message=f"AI Agent needs your input: {question}",
context=context
)
    # Platzhalter – in der Praxis wartet der Workflow anschließend per Signal auf die echte Antwort
return notification_result
# Workflow: Deterministische Orchestrierung
@workflow.defn
class AIAgentWorkflow:
"""
Orchestriert einen AI Agent mit Tools und optionalem Human-in-the-Loop.
Der Workflow ist deterministisch, aber die LLM-Calls und Tools sind
non-deterministisch (daher als Activities implementiert).
"""
def __init__(self) -> None:
self.state = AgentState(goal="")
self.human_input_received = None
self.max_iterations = 20 # Verhindere infinite loops
@workflow.run
async def run(self, goal: str, initial_context: str = "") -> AgentState:
"""
Führe Agent aus bis Ziel erreicht oder max_iterations.
Args:
goal: Das zu erreichende Ziel des Agents
initial_context: Optionaler initialer Kontext
"""
self.state.goal = goal
# System Message
system_msg = Message(
role="system",
content=f"""You are a helpful AI assistant. Your goal is: {goal}
You have access to the following tools:
- search_database: Search internal database
- web_search: Search the web
- read_file: Read a file from the file system
- request_human_help: Ask a human for help
When you have achieved the goal, respond with "GOAL_ACHIEVED: [result]"."""
)
self.state.conversation_history.append(system_msg)
# Initial User Message
if initial_context:
user_msg = Message(role="user", content=initial_context)
self.state.conversation_history.append(user_msg)
# Available Tools
tools = [
{
"type": "function",
"function": {
"name": "search_database",
"description": "Search the internal database",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "read_file",
"description": "Read a file",
"parameters": {
"type": "object",
"properties": {
"filepath": {"type": "string", "description": "Path to file"}
},
"required": ["filepath"]
}
}
},
{
"type": "function",
"function": {
"name": "request_human_help",
"description": "Ask a human for help",
"parameters": {
"type": "object",
"properties": {
"question": {"type": "string", "description": "Question for human"}
},
"required": ["question"]
}
}
}
]
# Agent Loop
for iteration in range(self.max_iterations):
workflow.logger.info(f"Agent iteration {iteration + 1}/{self.max_iterations}")
# Call LLM
llm_response = await workflow.execute_activity(
call_llm,
args=[self.state.conversation_history, tools],
start_to_close_timeout=timedelta(seconds=60),
                retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=1),
maximum_interval=timedelta(seconds=30),
maximum_attempts=5,
non_retryable_error_types=["InvalidRequestError"]
)
)
# Prüfe, ob Ziel erreicht
if llm_response.get("content") and "GOAL_ACHIEVED:" in llm_response["content"]:
self.state.completed = True
self.state.result = llm_response["content"].replace("GOAL_ACHIEVED:", "").strip()
# Füge finale Antwort zur History hinzu
self.state.conversation_history.append(
Message(role="assistant", content=llm_response["content"])
)
workflow.logger.info(f"Goal achieved: {self.state.result}")
return self.state
# Verarbeite Tool Calls
if llm_response.get("tool_calls"):
for tool_call in llm_response["tool_calls"]:
tool_name = tool_call["function"]["name"]
                    tool_args = json.loads(tool_call["function"]["arguments"] or "{}")  # arguments kommt als JSON-String
workflow.logger.info(f"Executing tool: {tool_name}")
self.state.tools_used.append(tool_name)
# Spezialbehandlung: Human Input
if tool_name == "request_human_help":
# Warte auf menschliche Eingabe via Signal
question = tool_args.get("question")
workflow.logger.info(f"Waiting for human input: {question}")
# Sende Benachrichtigung (Fire-and-Forget Activity)
await workflow.execute_activity(
request_human_input,
args=[question, {"goal": self.state.goal}],
start_to_close_timeout=timedelta(seconds=30)
)
# Warte auf Signal (kann Stunden/Tage dauern!)
await workflow.wait_condition(
lambda: self.human_input_received is not None,
timeout=timedelta(hours=24)
)
tool_result = self.human_input_received
self.human_input_received = None # Reset für nächstes Mal
else:
# Normale Tool Execution
tool_result = await workflow.execute_activity(
execute_tool,
args=[tool_name, tool_args],
start_to_close_timeout=timedelta(minutes=5),
                            retry_policy=RetryPolicy(
initial_interval=timedelta(seconds=2),
maximum_attempts=3
)
)
# Füge Tool-Result zur Conversation History hinzu
self.state.conversation_history.append(
Message(
role="tool",
name=tool_name,
content=str(tool_result),
tool_call_id=tool_call["id"]
)
)
# Füge LLM Response zur History hinzu (wenn kein Tool Call)
elif llm_response.get("content"):
self.state.conversation_history.append(
Message(role="assistant", content=llm_response["content"])
)
# Max iterations erreicht
workflow.logger.warning("Max iterations reached without achieving goal")
self.state.completed = False
self.state.result = "Max iterations reached"
return self.state
@workflow.signal
async def provide_human_input(self, input_text: str):
"""Signal: Menschliche Eingabe bereitstellen."""
workflow.logger.info(f"Received human input: {input_text}")
self.human_input_received = input_text
@workflow.signal
async def add_user_message(self, message: str):
"""Signal: Neue User-Message hinzufügen (für Multi-Turn)."""
self.state.conversation_history.append(
Message(role="user", content=message)
)
@workflow.query
def get_state(self) -> AgentState:
"""Query: Aktueller Agent State."""
return self.state
@workflow.query
def get_conversation_history(self) -> List[Message]:
"""Query: Conversation History."""
return self.state.conversation_history
@workflow.query
def get_tools_used(self) -> List[str]:
"""Query: Welche Tools wurden verwendet?"""
return self.state.tools_used
15.2.4 Client: Agent starten und überwachen
from temporalio.client import Client
async def run_ai_agent():
"""Starte AI Agent und überwache Progress."""
client = await Client.connect("localhost:7233")
# Starte Agent Workflow
handle = await client.start_workflow(
AIAgentWorkflow.run,
args=[
"Analyze the sales data from Q4 2024 and create a summary report",
"Please focus on trends and outliers."
],
id=f"ai-agent-{uuid.uuid4()}",
task_queue="ai-agents"
)
print(f"Started AI Agent: {handle.id}")
# Überwache Progress
while True:
state = await handle.query(AIAgentWorkflow.get_state)
print(f"\nAgent Status:")
print(f" Completed: {state.completed}")
print(f" Tools used: {', '.join(state.tools_used)}")
print(f" Conversation length: {len(state.conversation_history)} messages")
if state.completed:
print(f"\n✅ Goal achieved!")
print(f"Result: {state.result}")
break
await asyncio.sleep(5)
# Hole finale Conversation History
history = await handle.query(AIAgentWorkflow.get_conversation_history)
print("\n=== Conversation History ===")
for msg in history:
print(f"{msg.role}: {msg.content[:100]}...")
result = await handle.result()
return result
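Damit der Agent läuft, braucht es außerdem einen Worker, der Workflow und Activities auf der Task Queue ai-agents registriert – eine Skizze:
from temporalio.client import Client
from temporalio.worker import Worker

async def run_agent_worker():
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="ai-agents",
        workflows=[AIAgentWorkflow],
        activities=[call_llm, execute_tool, request_human_input],
    )
    await worker.run()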
15.2.5 Multi-Agent Orchestration
Für komplexere Szenarien können mehrere Agents koordiniert werden:
@workflow.defn
class MultiAgentCoordinatorWorkflow:
"""
Koordiniert mehrere spezialisierte AI Agents.
Beispiel: Research Agent → Analysis Agent → Report Agent
"""
@workflow.run
async def run(self, task: str) -> dict:
workflow.logger.info(f"Multi-Agent Coordinator started for: {task}")
# Agent 1: Research Agent
research_handle = await workflow.start_child_workflow(
AIAgentWorkflow.run,
args=[
f"Research the following topic: {task}",
"Collect relevant data from database and web."
],
id=f"research-agent-{workflow.info().workflow_id}",
task_queue="ai-agents"
)
research_result = await research_handle
# Agent 2: Analysis Agent
analysis_handle = await workflow.start_child_workflow(
AIAgentWorkflow.run,
args=[
"Analyze the following research data and identify key insights",
f"Research data: {research_result.result}"
],
id=f"analysis-agent-{workflow.info().workflow_id}",
task_queue="ai-agents"
)
analysis_result = await analysis_handle
# Agent 3: Report Agent
report_handle = await workflow.start_child_workflow(
AIAgentWorkflow.run,
args=[
"Create a professional report based on the analysis",
f"Analysis: {analysis_result.result}"
],
id=f"report-agent-{workflow.info().workflow_id}",
task_queue="ai-agents"
)
report_result = await report_handle
return {
"task": task,
"research": research_result.result,
"analysis": analysis_result.result,
"report": report_result.result,
"total_tools_used": (
len(research_result.tools_used) +
len(analysis_result.tools_used) +
len(report_result.tools_used)
)
}
15.2.6 Best Practices für AI Agents
1. LLM Calls immer als Activities
# ✅ Richtig: LLM Call als Activity
@activity.defn
async def call_llm(prompt: str) -> str:
return await openai.complete(prompt)
# ❌ Falsch: LLM Call direkt im Workflow
@workflow.defn
class BadWorkflow:
@workflow.run
async def run(self):
# NICHT deterministisch! Workflow wird fehlschlagen beim Replay
result = await openai.complete("Hello")
2. Retry Policies für LLM APIs
# LLMs können Rate-Limits haben oder temporär fehlschlagen
llm_response = await workflow.execute_activity(
call_llm,
prompt,
start_to_close_timeout=timedelta(seconds=60),
    retry_policy=RetryPolicy(  # RetryPolicy aus temporalio.common
initial_interval=timedelta(seconds=1),
backoff_coefficient=2.0,
maximum_interval=timedelta(seconds=60),
maximum_attempts=5,
# Nicht wiederholen bei Invalid Request
non_retryable_error_types=["InvalidRequestError", "AuthenticationError"]
)
)
3. Conversation History Management
# Begrenze History-Größe für lange Konversationen
def truncate_history(messages: List[Message], max_tokens: int = 4000) -> List[Message]:
"""Behalte nur neueste Messages innerhalb Token-Limit."""
# Behalte immer System Message
system_msgs = [m for m in messages if m.role == "system"]
other_msgs = [m for m in messages if m.role != "system"]
# Schneide älteste Messages ab
# (In Production: Token Counting nutzen)
return system_msgs + other_msgs[-50:] # Letzte 50 Messages
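Im Agent-Loop kann die Funktion dann vor jedem LLM-Call angewendet werden – reines Listen-Slicing ist deterministisch und damit im Workflow unproblematisch:
# Vor dem LLM-Call die Conversation History kürzen
self.state.conversation_history = truncate_history(self.state.conversation_history)
llm_response = await workflow.execute_activity(
    call_llm,
    args=[self.state.conversation_history, tools],
    start_to_close_timeout=timedelta(seconds=60),
)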
4. Timeouts für Human-in-the-Loop
try:
await workflow.wait_condition(
lambda: self.human_input_received is not None,
timeout=timedelta(hours=24)
)
except asyncio.TimeoutError:
# Automatische Eskalation oder Fallback
workflow.logger.warning("Human input timeout - using fallback")
self.human_input_received = "TIMEOUT: Proceeding without human input"
15.3 Serverless Integration (AWS Lambda & Co.)
15.3.1 Das Serverless-Dilemma
Temporal und Serverless haben unterschiedliche Ausführungsmodelle:
| Aspekt | Temporal Worker | AWS Lambda |
|---|---|---|
| Ausführung | Long-running Prozess | Kurzlebige Invocations (max 15 Min) |
| State | In-Memory | Stateless |
| Infrastruktur | VM, Container (persistent) | On-Demand |
| Kosten | Basierend auf Laufzeit | Pay-per-Invocation |
Kernproblem: Temporal Worker benötigen lange laufende Compute-Infrastruktur, während Lambda/Serverless kurzlebig und stateless ist.
Aber: Temporal kann trotzdem genutzt werden, um Serverless-Funktionen zu orchestrieren!
15.3.2 Integration Pattern 1: SQS + Lambda + Temporal
graph LR
S3[S3 Upload] --> SQS[SQS Queue]
SQS --> Lambda[Lambda Function]
Lambda -->|Start Workflow| Temporal[Temporal Service]
Temporal --> Worker[Temporal Worker<br/>ECS/EC2]
Worker -->|Invoke| Lambda2[Lambda Activities]
style Lambda fill:#ff9900
style Lambda2 fill:#ff9900
style Temporal fill:#ffd700
style Worker fill:#e1f5ff
Architecture:
- S3 Upload triggert SQS Message
- Lambda Function startet Temporal Workflow
- Temporal Worker (auf ECS/EC2) führt Workflow aus
- Workflow ruft Lambda-Funktionen als Activities auf
Implementierung:
# Lambda Function: Workflow Starter
import asyncio
import json
from temporalio.client import Client

async def start_workflows(event) -> None:
    """Startet Temporal Workflows basierend auf SQS Messages."""
    # Connect zu Temporal (einmal pro Invocation)
    client = await Client.connect("temporal.example.com:7233")
    # Parse SQS Event
    for record in event['Records']:
        body = json.loads(record['body'])
        s3_key = body['Records'][0]['s3']['object']['key']
        # Starte Workflow
        handle = await client.start_workflow(
            DataProcessingWorkflow.run,
            args=[s3_key],
            id=f"process-{s3_key}",
            task_queue="data-processing"
        )
        print(f"Started workflow: {handle.id}")

def lambda_handler(event, context):
    """AWS Lambda Entrypoint – die Python-Runtime awaited keine async Handler."""
    asyncio.run(start_workflows(event))
    return {
        'statusCode': 200,
        'body': json.dumps('Workflows started')
    }
# Temporal Worker (auf ECS/EC2): Ruft Lambda als Activity auf
import boto3
import json
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy
lambda_client = boto3.client('lambda')
@activity.defn
async def invoke_lambda_activity(function_name: str, payload: dict) -> dict:
"""
Activity: Ruft AWS Lambda Function auf.
"""
activity.logger.info(f"Invoking Lambda: {function_name}")
try:
response = lambda_client.invoke(
FunctionName=function_name,
InvocationType='RequestResponse', # Synchron
Payload=json.dumps(payload)
)
result = json.loads(response['Payload'].read())
activity.logger.info(f"Lambda response: {result}")
return result
except Exception as e:
activity.logger.error(f"Lambda invocation failed: {e}")
raise
@workflow.defn
class DataProcessingWorkflow:
"""
Workflow: Orchestriert mehrere Lambda Functions.
"""
@workflow.run
async def run(self, s3_key: str) -> dict:
workflow.logger.info(f"Processing S3 file: {s3_key}")
# Step 1: Lambda für Data Extraction
extraction_result = await workflow.execute_activity(
invoke_lambda_activity,
args=[
"data-extraction-function",
{"s3_key": s3_key}
],
start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(
maximum_attempts=3,
initial_interval=timedelta(seconds=5),
)
)
# Step 2: Lambda für Data Transformation
transform_result = await workflow.execute_activity(
invoke_lambda_activity,
args=[
"data-transform-function",
{"data": extraction_result}
],
start_to_close_timeout=timedelta(minutes=5)
)
# Step 3: Lambda für Data Loading
load_result = await workflow.execute_activity(
invoke_lambda_activity,
args=[
"data-load-function",
{"data": transform_result}
],
start_to_close_timeout=timedelta(minutes=5)
)
return {
"s3_key": s3_key,
"records_processed": load_result.get("count"),
"status": "completed"
}
15.3.3 Integration Pattern 2: Step Functions Alternative
Temporal kann als robustere Alternative zu AWS Step Functions dienen:
| Feature | AWS Step Functions | Temporal |
|---|---|---|
| Sprache | JSON (ASL) | Python, Go, Java, TypeScript, etc. |
| Debugging | Schwierig | Native IDE Support |
| Testing | Komplex | Unit Tests möglich |
| Versionierung | Limitiert | Native Code-Versionierung |
| Local Dev | Schwierig (Localstack) | Temporal Dev Server |
| Vendor Lock-In | AWS only | Cloud-agnostisch |
| Kosten | Pro State Transition | Selbst gehostet oder Cloud |
Migration von Step Functions zu Temporal:
# Vorher: Step Functions (JSON ASL)
"""
{
"StartAt": "ProcessData",
"States": {
"ProcessData": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:function:process",
"Next": "TransformData"
},
"TransformData": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:function:transform",
"Next": "LoadData"
},
"LoadData": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:function:load",
"End": true
}
}
}
"""
# Nachher: Temporal Workflow (Python)
@workflow.defn
class ETLWorkflow:
@workflow.run
async def run(self, input_data: dict) -> dict:
# Step 1: Process
processed = await workflow.execute_activity(
process_data,
input_data,
start_to_close_timeout=timedelta(minutes=5)
)
# Step 2: Transform
transformed = await workflow.execute_activity(
transform_data,
processed,
start_to_close_timeout=timedelta(minutes=5)
)
# Step 3: Load
result = await workflow.execute_activity(
load_data,
transformed,
start_to_close_timeout=timedelta(minutes=5)
)
return result
15.3.4 Deployment-Strategien für Worker
Option 1: AWS ECS (Fargate oder EC2)
# ecs-task-definition.json
{
"family": "temporal-worker",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "temporal-worker",
"image": "myorg/temporal-worker:latest",
"environment": [
{
"name": "TEMPORAL_ADDRESS",
"value": "temporal.example.com:7233"
},
{
"name": "TASK_QUEUE",
"value": "data-processing"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/temporal-worker",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
}
Option 2: Kubernetes (EKS)
# temporal-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: temporal-worker
spec:
replicas: 3
selector:
matchLabels:
app: temporal-worker
template:
metadata:
labels:
app: temporal-worker
spec:
containers:
- name: worker
image: myorg/temporal-worker:latest
env:
- name: TEMPORAL_ADDRESS
value: "temporal.example.com:7233"
- name: TASK_QUEUE
value: "data-processing"
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
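Beide Deployment-Varianten übergeben TEMPORAL_ADDRESS und TASK_QUEUE als Umgebungsvariablen. Ein minimaler Worker-Entry-Point, der diese auswertet, könnte als Skizze so aussehen (Modulnamen workflows/activities sind Annahmen über die Projektstruktur):
# worker_main.py – hypothetischer Entry Point des Worker-Containers
import asyncio
import os

from temporalio.client import Client
from temporalio.worker import Worker

from workflows import DataProcessingWorkflow      # Annahme: eigene Modulstruktur
from activities import invoke_lambda_activity

async def main() -> None:
    client = await Client.connect(os.environ.get("TEMPORAL_ADDRESS", "localhost:7233"))
    worker = Worker(
        client,
        task_queue=os.environ.get("TASK_QUEUE", "data-processing"),
        workflows=[DataProcessingWorkflow],
        activities=[invoke_lambda_activity],
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())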
15.3.5 Cost Optimization
Hybrid Approach: Worker auf Reserved Instances + Lambda für Burst
@workflow.defn
class HybridWorkflow:
"""
Nutzt reguläre Activities für Standard-Tasks,
Lambda für CPU-intensive Burst-Workloads.
"""
@workflow.run
async def run(self, data: dict) -> dict:
# Standard Processing auf ECS Worker
normalized = await workflow.execute_activity(
normalize_data,
data,
start_to_close_timeout=timedelta(minutes=2)
)
# CPU-intensive Task auf Lambda (burst capacity)
if data.get("requires_heavy_processing"):
processed = await workflow.execute_activity(
invoke_lambda_activity,
args=["heavy-processing-function", normalized],
start_to_close_timeout=timedelta(minutes=10)
)
else:
processed = normalized
# Finale Speicherung auf ECS Worker
result = await workflow.execute_activity(
save_to_database,
processed,
start_to_close_timeout=timedelta(minutes=1)
)
return result
15.4 Polyglot Workflows
15.4.1 Warum Polyglot?
In der Realität nutzen Teams unterschiedliche Sprachen für unterschiedliche Aufgaben:
- Python: Data Science, ML, Scripting
- Go: High-Performance Services, Infrastructure
- TypeScript/Node.js: Frontend-Integration, APIs
- Java: Enterprise Applications, Legacy Systems
Temporal ermöglicht es, diese Sprachen in einem Workflow zu kombinieren!
15.4.2 Architektur-Prinzipien
graph TB
Client[Client: TypeScript]
subgraph "Temporal Service"
TS[Temporal Server]
end
subgraph "Workflow: Python"
WF[Workflow Definition<br/>Python]
end
subgraph "Activities"
ACT1[Activity: Python<br/>Data Processing]
ACT2[Activity: Go<br/>Image Processing]
ACT3[Activity: Java<br/>Legacy Integration]
ACT4[Activity: TypeScript<br/>API Calls]
end
subgraph "Workers"
W1[Worker: Python<br/>Task Queue: python-tasks]
W2[Worker: Go<br/>Task Queue: go-tasks]
W3[Worker: Java<br/>Task Queue: java-tasks]
W4[Worker: TypeScript<br/>Task Queue: ts-tasks]
end
Client -->|Start Workflow| TS
TS <--> WF
WF -->|Execute Activity| ACT1
WF -->|Execute Activity| ACT2
WF -->|Execute Activity| ACT3
WF -->|Execute Activity| ACT4
ACT1 -.-> W1
ACT2 -.-> W2
ACT3 -.-> W3
ACT4 -.-> W4
style WF fill:#e1f5ff
style W1 fill:#ffe1e1
style W2 fill:#e1ffe1
style W3 fill:#ffe1ff
style W4 fill:#ffffe1
Wichtige Regel:
- ✅ Ein Workflow wird in einer Sprache geschrieben
- ✅ Activities können in verschiedenen Sprachen sein
- ❌ Ein Workflow kann nicht mehrere Sprachen mischen
15.4.3 Beispiel: Media Processing Pipeline
Workflow: Python (Orchestration)
# workflow.py (Python Worker)
from temporalio import workflow
from datetime import timedelta
@workflow.defn
class MediaProcessingWorkflow:
"""
Polyglot Workflow: Orchestriert Activities in Python, Go, TypeScript.
"""
@workflow.run
async def run(self, video_url: str) -> dict:
workflow.logger.info(f"Processing video: {video_url}")
# Activity 1: Download Video (Python)
# Task Queue: python-tasks
downloaded_path = await workflow.execute_activity(
"download_video", # Activity Name (String-based)
video_url,
task_queue="python-tasks",
start_to_close_timeout=timedelta(minutes=10)
)
# Activity 2: Extract Frames (Go - High Performance)
# Task Queue: go-tasks
frames = await workflow.execute_activity(
"extract_frames",
downloaded_path,
task_queue="go-tasks",
start_to_close_timeout=timedelta(minutes=5)
)
# Activity 3: AI Analysis (Python - ML Libraries)
# Task Queue: python-tasks
analysis_result = await workflow.execute_activity(
"analyze_frames",
frames,
task_queue="python-tasks",
start_to_close_timeout=timedelta(minutes=15)
)
# Activity 4: Generate Thumbnail (Go - Image Processing)
# Task Queue: go-tasks
thumbnail_url = await workflow.execute_activity(
"generate_thumbnail",
frames[0],
task_queue="go-tasks",
start_to_close_timeout=timedelta(minutes=2)
)
# Activity 5: Store Metadata (TypeScript - API Integration)
# Task Queue: ts-tasks
metadata_id = await workflow.execute_activity(
"store_metadata",
args=[{
"video_url": video_url,
"analysis": analysis_result,
"thumbnail": thumbnail_url
}],
task_queue="ts-tasks",
start_to_close_timeout=timedelta(minutes=1)
)
return {
"video_url": video_url,
"thumbnail_url": thumbnail_url,
"analysis": analysis_result,
"metadata_id": metadata_id
}
Activity 1: Python (Download & ML)
# activities_python.py (Python Worker)
import asyncio

import httpx
import tensorflow as tf
from temporalio import activity
@activity.defn
async def download_video(url: str) -> str:
"""Download video from URL."""
activity.logger.info(f"Downloading video: {url}")
async with httpx.AsyncClient() as client:
response = await client.get(url)
filepath = f"/tmp/video_{activity.info().workflow_id}.mp4"
with open(filepath, "wb") as f:
f.write(response.content)
return filepath
@activity.defn
async def analyze_frames(frames: list[str]) -> dict:
"""Analyze frames using ML model (Python/TensorFlow)."""
activity.logger.info(f"Analyzing {len(frames)} frames")
# Load ML Model
model = tf.keras.models.load_model("/models/video_classifier.h5")
results = []
for frame_path in frames:
image = tf.keras.preprocessing.image.load_img(frame_path)
image_array = tf.keras.preprocessing.image.img_to_array(image)
        # model.predict erwartet einen Batch, daher Batch-Dimension ergänzen
        prediction = model.predict(tf.expand_dims(image_array, axis=0))
results.append(prediction.tolist())
return {
"frames_analyzed": len(frames),
"predictions": results
}
# Worker
async def main():
from temporalio.client import Client
from temporalio.worker import Worker
client = await Client.connect("localhost:7233")
worker = Worker(
client,
task_queue="python-tasks",
workflows=[], # Nur Activities auf diesem Worker
activities=[download_video, analyze_frames]
)
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
Activity 2: Go (High-Performance Image Processing)
// activities_go.go (Go Worker)
package main
import (
"context"
"fmt"
"os/exec"
"go.temporal.io/sdk/activity"
"go.temporal.io/sdk/client"
"go.temporal.io/sdk/worker"
)
// ExtractFrames extracts frames from video using FFmpeg
func ExtractFrames(ctx context.Context, videoPath string) ([]string, error) {
logger := activity.GetLogger(ctx)
logger.Info("Extracting frames", "video", videoPath)
// FFmpeg command: Extract 1 frame per second
outputPattern := "/tmp/frame_%04d.jpg"
cmd := exec.Command(
"ffmpeg",
"-i", videoPath,
"-vf", "fps=1",
outputPattern,
)
if err := cmd.Run(); err != nil {
return nil, fmt.Errorf("ffmpeg failed: %w", err)
}
// Return list of generated frame paths
frames := []string{
"/tmp/frame_0001.jpg",
"/tmp/frame_0002.jpg",
// ... would actually scan directory
}
logger.Info("Extracted frames", "count", len(frames))
return frames, nil
}
// GenerateThumbnail creates a thumbnail from image
func GenerateThumbnail(ctx context.Context, imagePath string) (string, error) {
logger := activity.GetLogger(ctx)
logger.Info("Generating thumbnail", "image", imagePath)
thumbnailPath := "/tmp/thumbnail.jpg"
// ImageMagick: Resize to 300x300
cmd := exec.Command(
"convert",
imagePath,
"-resize", "300x300",
thumbnailPath,
)
if err := cmd.Run(); err != nil {
return "", fmt.Errorf("thumbnail generation failed: %w", err)
}
	// Upload to S3 (simplified)
	s3Url := uploadToS3(thumbnailPath)
	return s3Url, nil
}

// uploadToS3 ist hier nur ein Platzhalter für den eigentlichen Upload (z.B. via AWS SDK for Go).
func uploadToS3(path string) string {
	return "https://example-bucket.s3.amazonaws.com/" + path
}
func main() {
c, err := client.Dial(client.Options{
HostPort: "localhost:7233",
})
if err != nil {
panic(err)
}
defer c.Close()
w := worker.New(c, "go-tasks", worker.Options{})
// Register Activities
w.RegisterActivity(ExtractFrames)
w.RegisterActivity(GenerateThumbnail)
if err := w.Run(worker.InterruptCh()); err != nil {
panic(err)
}
}
Activity 3: TypeScript (API Integration)
// activities_typescript.ts (TypeScript Worker)
import { log } from '@temporalio/activity';
interface MetadataInput {
video_url: string;
analysis: any;
thumbnail: string;
}
/**
* Store metadata in external API
*/
export async function storeMetadata(
metadata: MetadataInput
): Promise<string> {
log.info('Storing metadata', { videoUrl: metadata.video_url });
// Call external API
const response = await fetch('https://api.example.com/videos', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
url: metadata.video_url,
thumbnailUrl: metadata.thumbnail,
analysis: metadata.analysis,
processedAt: new Date().toISOString(),
}),
});
if (!response.ok) {
throw new Error(`API call failed: ${response.statusText}`);
}
const result = await response.json();
log.info('Metadata stored', { id: result.id });
return result.id;
}
// Worker
import { Worker } from '@temporalio/worker';
async function run() {
  const worker = await Worker.create({
    // Nur Activities – der Workflow selbst läuft auf dem Python Worker
    activities: {
      storeMetadata,
    },
    taskQueue: 'ts-tasks',
  });
await worker.run();
}
run().catch((err) => {
console.error(err);
process.exit(1);
});
15.4.4 Data Serialization zwischen Sprachen
Temporal konvertiert automatisch zwischen Sprachen:
# Python → Go
await workflow.execute_activity(
    "extract_frames",
    "/tmp/video.mp4",  # Python string → Go string
    task_queue="go-tasks",
    start_to_close_timeout=timedelta(minutes=5),
)

# Python → TypeScript
await workflow.execute_activity(
    "store_metadata",
    {  # Python dict → TypeScript object
        "video_url": "https://...",
        "analysis": {"score": 0.95}
    },
    task_queue="ts-tasks",
    start_to_close_timeout=timedelta(minutes=1),
)
Unterstützte Typen (Automatic Conversion):
- Primitives: int, float, string, bool
- Collections: list, dict, array, object
- Custom Types: Dataclasses, Structs, Interfaces (als JSON)
Komplexe Typen:
# Python
from dataclasses import dataclass
@dataclass
class VideoMetadata:
url: str
duration_seconds: int
resolution: dict
tags: list[str]
# Temporal serialisiert automatisch zu JSON
metadata = VideoMetadata(
url="https://...",
duration_seconds=120,
resolution={"width": 1920, "height": 1080},
tags=["tutorial", "python"]
)
# Go empfängt als Struct
"""
type VideoMetadata struct {
URL string `json:"url"`
DurationSeconds int `json:"duration_seconds"`
Resolution struct {
Width int `json:"width"`
Height int `json:"height"`
} `json:"resolution"`
Tags []string `json:"tags"`
}
"""
15.4.5 Workflow Starter in verschiedenen Sprachen
# Python Client
from temporalio.client import Client
client = await Client.connect("localhost:7233")
handle = await client.start_workflow(
"MediaProcessingWorkflow", # Workflow Name (String)
"https://example.com/video.mp4",
id="video-123",
task_queue="python-tasks" # Workflow läuft auf Python Worker
)
// TypeScript Client
import { Client, Connection } from '@temporalio/client';

const connection = await Connection.connect({ address: 'localhost:7233' });
const client = new Client({ connection });
const handle = await client.workflow.start('MediaProcessingWorkflow', {
args: ['https://example.com/video.mp4'],
workflowId: 'video-123',
taskQueue: 'python-tasks',
});
// Go Client
import (
    "context"

    "go.temporal.io/sdk/client"
)
c, _ := client.Dial(client.Options{})
defer c.Close()
options := client.StartWorkflowOptions{
ID: "video-123",
TaskQueue: "python-tasks",
}
we, _ := c.ExecuteWorkflow(
context.Background(),
options,
"MediaProcessingWorkflow",
"https://example.com/video.mp4",
)
15.4.6 Best Practices für Polyglot
1. Task Queue Naming Convention
# Sprache im Task Queue Namen
task_queue = f"{language}-{service}-tasks"
# Beispiele:
"python-ml-tasks"
"go-image-processing-tasks"
"typescript-api-tasks"
"java-legacy-integration-tasks"
2. Activity Namen als Strings
# ✅ Verwende String-Namen für Cross-Language
await workflow.execute_activity(
"extract_frames", # String name
video_path,
task_queue="go-tasks"
)
# ❌ Funktionsreferenzen funktionieren nur innerhalb einer Sprache
await workflow.execute_activity(
extract_frames, # Function reference
video_path
)
3. Schema Validation
# Nutze Pydantic für Schema-Validierung
from pydantic import BaseModel
class VideoProcessingInput(BaseModel):
video_url: str
resolution: dict
tags: list[str]
@workflow.defn
class MediaWorkflow:
@workflow.run
async def run(self, input_dict: dict) -> dict:
# Validiere Input
input_data = VideoProcessingInput(**input_dict)
# Arbeite mit validiertem Input
        result = await workflow.execute_activity(
            "process_video",
            input_data.model_dump(),  # Serialize zu dict (Pydantic v2; in v1: .dict())
            task_queue="go-tasks",
            start_to_close_timeout=timedelta(minutes=5),
        )
return result
4. Deployment Coordination
# docker-compose.yaml für Multi-Language Development
version: '3.8'
services:
temporal:
image: temporalio/auto-setup:latest
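    # Hinweis: temporalio/auto-setup benötigt zusätzlich eine Datenbank-Konfiguration
    # (siehe offizielles temporalio/docker-compose Setup); für rein lokale Entwicklung
    # genügt alternativ `temporal server start-dev`.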
ports:
- "7233:7233"
python-worker:
build: ./python-worker
environment:
- TEMPORAL_ADDRESS=temporal:7233
- TASK_QUEUE=python-tasks
depends_on:
- temporal
go-worker:
build: ./go-worker
environment:
- TEMPORAL_ADDRESS=temporal:7233
- TASK_QUEUE=go-tasks
depends_on:
- temporal
typescript-worker:
build: ./typescript-worker
environment:
- TEMPORAL_ADDRESS=temporal:7233
- TASK_QUEUE=ts-tasks
depends_on:
- temporal
15.5 Zusammenfassung
In diesem Kapitel haben wir drei fortgeschrittene Temporal-Patterns kennengelernt:
AI Agents mit Temporal
Kernkonzepte:
- LLM Calls als Activities (non-deterministisch)
- Langlebige Konversationen mit State Management
- Tool-Orchestrierung und Human-in-the-Loop
- Multi-Agent Coordination mit Child Workflows
Vorteile:
- ✅ State persistiert automatisch über Stunden/Tage
- ✅ Retry Policies für fehleranfällige LLM APIs
- ✅ Vollständige Observability der Agent-Aktionen
- ✅ Einfache Integration von Tools und menschlicher Intervention
Real-World Adoption:
- OpenAI Agents SDK Integration (2025)
- Genutzt von Lindy, Dust, ZoomInfo
Serverless Integration
Kernkonzepte:
- Temporal Worker auf ECS/EKS (long-running)
- Lambda Functions als Activities invoken
- SQS + Lambda als Workflow-Trigger
- Alternative zu AWS Step Functions
Deployment-Optionen:
- ECS Fargate: Serverless Workers
- EKS: Kubernetes-basierte Workers
- Hybrid: Worker auf Reserved Instances + Lambda für Burst
Vorteile:
- ✅ Cloud-agnostisch (vs. Step Functions)
- ✅ Echte Programmiersprachen (vs. JSON ASL)
- ✅ Besseres Debugging und Testing
- ✅ Cost Optimization durch Hybrid-Ansatz
Polyglot Workflows
Kernkonzepte:
- Ein Workflow = Eine Sprache
- Activities in verschiedenen Sprachen
- Task Queues pro Sprache/Service
- Automatische Serialisierung zwischen SDKs
Unterstützte Sprachen:
- Python, Go, Java, TypeScript, .NET, PHP, Ruby
Vorteile:
- ✅ Nutze beste Sprache für jede Aufgabe
- ✅ Integration von Legacy-Systemen
- ✅ Team-Autonomie (jedes Team nutzt eigene Sprache)
- ✅ Einfache Daten-Konvertierung
graph TB
Start[Erweiterte Temporal Patterns]
AI[AI Agents]
Lambda[Serverless/Lambda]
Polyglot[Polyglot Workflows]
Start --> AI
Start --> Lambda
Start --> Polyglot
AI --> AI1[LLM Orchestration]
AI --> AI2[Tool Integration]
AI --> AI3[Multi-Agent Systems]
Lambda --> L1[Worker auf ECS/EKS]
Lambda --> L2[Lambda als Activities]
Lambda --> L3[Step Functions Alternative]
Polyglot --> P1[Cross-Language Activities]
Polyglot --> P2[Task Queue per Language]
Polyglot --> P3[Automatic Serialization]
AI1 --> Production[Production-Ready Advanced Workflows]
AI2 --> Production
AI3 --> Production
L1 --> Production
L2 --> Production
L3 --> Production
P1 --> Production
P2 --> Production
P3 --> Production
style AI fill:#e1f5ff
style Lambda fill:#ff9900,color:#fff
style Polyglot fill:#90EE90
style Production fill:#ffd700
Gemeinsame Themen
Alle drei Patterns profitieren von Temporals Kernstärken:
- State Durability: Workflows können unterbrochen und wiederaufgenommen werden
- Retry Policies: Automatische Wiederholung bei Fehlern
- Observability: Vollständige Event History für Debugging
- Scalability: Horizontal skalierbare Worker
- Flexibility: Anpassbar an verschiedene Architekturen
Damit ist der inhaltliche Teil des Buches abgeschlossen; Testing-Strategien für komplexe Workflows wurden bereits in Kapitel 12 behandelt.
⬆ Zurück zum Inhaltsverzeichnis
Vorheriges Kapitel: Kapitel 14: Muster-Rezepte (Human-in-Loop, Cron, Order Fulfillment)
Weiterführende Ressourcen:
- 📚 Temporal for AI Documentation
- 🐙 GitHub: temporal-ai-agent Demo
- 🐙 GitHub: temporal-polyglot Examples
- 📰 Temporal Blog: AI Agents
- 💬 Community: Lambda Integration
Praktische Übung: Implementieren Sie einen AI Agent mit Tool-Calls oder eine Polyglot-Pipeline mit mindestens zwei verschiedenen Sprachen!
Ressourcen und Referenzen
Hier finden Sie eine kuratierte Liste von Ressourcen, die Ihnen beim Lernen und Arbeiten mit Temporal helfen.
Offizielle Temporal-Ressourcen
Dokumentation
- Temporal Documentation: https://docs.temporal.io/
- Temporal Python SDK Documentation: https://docs.temporal.io/develop/python
- Temporal TypeScript SDK Documentation: https://docs.temporal.io/develop/typescript
Community und Support
- Temporal Community Forum: https://community.temporal.io/
- Temporal GitHub: https://github.com/temporalio
- Temporal Slack: https://temporal.io/slack
Lernmaterialien
- Temporal Blog: https://temporal.io/blog
- Temporal YouTube Channel: https://www.youtube.com/c/Temporalio
- Temporal Academy: https://learn.temporal.io/
Python-spezifische Ressourcen
- temporalio/sdk-python: https://github.com/temporalio/sdk-python
- Python Samples Repository: https://github.com/temporalio/samples-python
- uv Package Manager: https://github.com/astral-sh/uv
Dieses Buch
- GitHub Repository: https://github.com/TheCodeEngine/temporal-durable-execution-mastery
- Issues und Feedback: https://github.com/TheCodeEngine/temporal-durable-execution-mastery/issues
Hinweis: Links werden regelmäßig aktualisiert. Bei Problemen erstellen Sie bitte ein Issue auf GitHub.