FlowForge — Workflow Automation Platform
Distributed workflow automation engine with DAG-based pipeline execution, queue-driven workers, retry mechanisms, and event-driven architecture — inspired by n8n and Zapier.
Domain Knowledge
What problem this project solves
Workflow automation sits at the intersection of distributed systems and application integration. The core problem is reliable execution: if a workflow has 10 steps and step 5 fails, the entire workflow shouldn't crash — it should retry, log, and eventually move to a dead-letter queue for investigation. FlowForge treats every workflow as a Directed Acyclic Graph (DAG) and executes nodes based on topological ordering.
Architecture
How the system is structured
FlowForge uses an event-driven architecture. When a workflow triggers, the engine parses the DAG, resolves dependencies via topological sort, and enqueues each executable node into a Redis-backed job queue. Workers pick up jobs, execute the corresponding action (HTTP request, email, database query, etc.), and emit completion events that trigger downstream nodes. Failed jobs retry with exponential backoff (1min → 5min → 30min) before moving to a dead-letter queue for manual investigation.
Data Model
Schema design and data flow
Workflows are stored as DAG definitions (nodes + edges) in PostgreSQL. Node types include Trigger, Action, Condition, and Wait. Each execution creates an immutable run record with per-node status (pending, running, succeeded, failed, skipped). Queue state lives in Redis for performance, with periodic snapshots to Postgres for durability.
Key Challenges
Hardest problems encountered
Ensuring exactly-once execution semantics was the hardest part — workers can crash after completing an action but before recording success. Solved with idempotency keys on each job: if a worker picks up a job with the same idempotency key, it checks if the action was already performed before re-executing. Another challenge was DAG cycle detection — workflows must be validated as acyclic before execution. Implemented Kahn's algorithm for topological sort with cycle detection.
Scaling Strategy
How the system grows
Workers scale horizontally — add more containers to process more workflows. The queue (BullMQ/Redis) is the bottleneck, so Redis runs in cluster mode for high-throughput deployments. Workflow definitions are cached in Redis after first load. Long-running workflows checkpoint progress to Postgres so they can resume after worker crashes.
Security
Defense-in-depth approach
Each workflow runs in a sandboxed execution context. Action credentials are encrypted at rest using AES-256. Webhook triggers validate signatures. Rate limiting per workspace prevents runaway workflows from consuming all queue capacity.
Failure Handling
Resilience and recovery
Three-tier retry: immediate retry for transient failures, exponential backoff (1m/5m/30m) for intermittent failures, dead-letter queue for permanent failures. DLQ alerts trigger manual review. Failed webhook deliveries retry with idempotency to prevent duplicate processing.
Observability
Monitoring and debugging
Every workflow execution emits structured logs with a run ID. Per-node timing metrics help identify bottlenecks. Dashboard shows active runs, queue depth, failure rate, and average execution time. Failed nodes surface the exact error and input context for debugging.
Trade-offs
Engineering decisions and alternatives
Redis-backed queues were chosen over Kafka for simplicity (FlowForge doesn't need Kafka's log retention or replay capabilities). PostgreSQL stores workflow definitions because they're relational and change infrequently. BullMQ was chosen over RabbitMQ for its Redis-native integration and job scheduling features.
Architecture Decisions
Key choices and what was rejected
Senior-Level Topics
Concepts this project explores