Vishal.dev
Back
Backend

FlowForge — Workflow Automation Platform

Distributed workflow automation engine with DAG-based pipeline execution, queue-driven workers, retry mechanisms, and event-driven architecture — inspired by n8n and Zapier.

TypeScriptNode.jsRedisBullMQPostgreSQLDockerREST API
<100ms
Node execution overhead
Worker scale
99.5%
Delivery reliability
3
Retry attempts before DLQ

Domain Knowledge

What problem this project solves

Workflow automation sits at the intersection of distributed systems and application integration. The core problem is reliable execution: if a workflow has 10 steps and step 5 fails, the entire workflow shouldn't crash — it should retry, log, and eventually move to a dead-letter queue for investigation. FlowForge treats every workflow as a Directed Acyclic Graph (DAG) and executes nodes based on topological ordering.

Architecture

How the system is structured

FlowForge uses an event-driven architecture. When a workflow triggers, the engine parses the DAG, resolves dependencies via topological sort, and enqueues each executable node into a Redis-backed job queue. Workers pick up jobs, execute the corresponding action (HTTP request, email, database query, etc.), and emit completion events that trigger downstream nodes. Failed jobs retry with exponential backoff (1min → 5min → 30min) before moving to a dead-letter queue for manual investigation.

Data Model

Schema design and data flow

Workflows are stored as DAG definitions (nodes + edges) in PostgreSQL. Node types include Trigger, Action, Condition, and Wait. Each execution creates an immutable run record with per-node status (pending, running, succeeded, failed, skipped). Queue state lives in Redis for performance, with periodic snapshots to Postgres for durability.

Key Challenges

Hardest problems encountered

Ensuring exactly-once execution semantics was the hardest part — workers can crash after completing an action but before recording success. Solved with idempotency keys on each job: if a worker picks up a job with the same idempotency key, it checks if the action was already performed before re-executing. Another challenge was DAG cycle detection — workflows must be validated as acyclic before execution. Implemented Kahn's algorithm for topological sort with cycle detection.

Scaling Strategy

How the system grows

Workers scale horizontally — add more containers to process more workflows. The queue (BullMQ/Redis) is the bottleneck, so Redis runs in cluster mode for high-throughput deployments. Workflow definitions are cached in Redis after first load. Long-running workflows checkpoint progress to Postgres so they can resume after worker crashes.

Security

Defense-in-depth approach

Each workflow runs in a sandboxed execution context. Action credentials are encrypted at rest using AES-256. Webhook triggers validate signatures. Rate limiting per workspace prevents runaway workflows from consuming all queue capacity.

Failure Handling

Resilience and recovery

Three-tier retry: immediate retry for transient failures, exponential backoff (1m/5m/30m) for intermittent failures, dead-letter queue for permanent failures. DLQ alerts trigger manual review. Failed webhook deliveries retry with idempotency to prevent duplicate processing.

Observability

Monitoring and debugging

Every workflow execution emits structured logs with a run ID. Per-node timing metrics help identify bottlenecks. Dashboard shows active runs, queue depth, failure rate, and average execution time. Failed nodes surface the exact error and input context for debugging.

Trade-offs

Engineering decisions and alternatives

Redis-backed queues were chosen over Kafka for simplicity (FlowForge doesn't need Kafka's log retention or replay capabilities). PostgreSQL stores workflow definitions because they're relational and change infrequently. BullMQ was chosen over RabbitMQ for its Redis-native integration and job scheduling features.

Architecture Decisions

Key choices and what was rejected

Decision
Chosen
Rejected
Execution model
DAG with topological sort
Linear step execution
Queue system
BullMQ (Redis)
Kafka (overkill for this use case)
Retry strategy
Exponential backoff + DLQ
Infinite retry (blocks queue)
Idempotency
Idempotency keys per job
At-most-once (data loss risk)

Senior-Level Topics

Concepts this project explores

DAG ExecutionTopological SortingMessage BrokersDistributed WorkersExactly-Once ProcessingDead Letter QueuesEvent-Driven ArchitectureIdempotency