tool · Apache Software Foundation · applied

Apache Airflow

The batch orchestrator that moved 44 background jobs off the paiddaily.io API event loop. Self-hosted, DAG-based, Python-native. Runs every data pipeline, price refresh, and scoring job on a cron — so the API serves pure reads from Postgres with zero blocking I/O.

Updated May 27, 2026

Apache Airflow is the scheduler that keeps the API fast. Every background job that used to block the paiddaily.io event loop now runs as an Airflow DAG on a cron. This page covers how it fits the GL stack, not the full Airflow ecosystem — official docs at airflow.apache.org own the reference.

What it is

A platform for programmatically authoring, scheduling, and monitoring workflows. Workflows are defined as Python DAGs (directed acyclic graphs) — each node is a task, each edge is a dependency. Apache 2.0 license. Self-hosted on Docker Compose or Kubernetes; managed offerings from Astronomer and GCP Cloud Composer exist but aren't used here.

The pitch: Airflow is the right tool when the workload is batch, the schedule is cron-shaped, and the orchestration logic is "run these steps in order on a timer." It's not a streaming engine, not a real-time event bus, not a task queue. It's a scheduler that knows about retries, dependencies, backfills, and failure callbacks.

Core concepts

DAG — a Python file defining a workflow as a directed acyclic graph. Tasks run in dependency order.
Task — a single unit of work. Usually a Python function decorated with @task.
Operator — a pre-built task template (PythonOperator, BashOperator, etc.). The @task decorator is syntactic sugar for PythonOperator.
Schedule — cron expression or preset (@daily, @hourly). Defines when the DAG runs.
XCom — cross-communication between tasks. Return values from @task functions are automatically pushed/pulled as XComs.
Connection — a stored credential reference (database DSN, API key). Managed through the Airflow UI or env vars.
Callback — function called on task/DAG success or failure. Used for alerting.
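
The DAG, Task, and XCom ideas above can be modeled without Airflow installed. The sketch below uses Python's standard-library graphlib to run three hypothetical tasks in dependency order and hand each task the return values of its upstream tasks, roughly the way XCom does. The task names and the price values are made up for illustration; this is a toy model of the concept, not how Airflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks. Each receives a dict of upstream results,
# loosely mimicking how XCom hands return values downstream.
def fetch_prices(upstream):
    return {"ETH": 3000.0, "BTC": 60000.0}  # illustrative values only

def score_markets(upstream):
    prices = upstream["fetch_prices"]
    return {symbol: price > 10000 for symbol, price in prices.items()}

def write_results(upstream):
    return len(upstream["score_markets"])  # pretend row count written

TASKS = {
    "fetch_prices": fetch_prices,
    "score_markets": score_markets,
    "write_results": write_results,
}

# Edges: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "fetch_prices": set(),
    "score_markets": {"fetch_prices"},
    "write_results": {"score_markets"},
}

def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies[name]}
        results[name] = tasks[name](upstream)
    return order, results

order, results = run_dag(TASKS, DEPENDENCIES)
```

Adding an edge from write_results back to fetch_prices would make TopologicalSorter raise a CycleError, which is the "acyclic" part of the definition.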

The GL deployment shape

Self-hosted on Docker Compose alongside the paiddaily.io stack. Single-node setup: webserver (UI on port 8080), scheduler, triggerer, and Postgres metadata database. DAGs live in airflow/dags/, shared code in airflow/include/gl_paiddaily_common/.

44 DAGs total as of 2026-05-27. Three patterns:

Direct domain calls (16 DAGs) — import and call the same async domain functions the API uses, via a run_domain() sync wrapper that manages the asyncpg pool lifecycle.
Inline HTTP/RPC (11 DAGs) — fetch data from external APIs (Boros, Pendle, DeFiLlama), transform, and write to Postgres via stored procs.
Hybrid (remaining) — mix of both patterns depending on the pipeline shape.

How to integrate

The paiddaily.io pattern for adding a new scheduled job:

Write the domain function in features/. Pure async Python, takes a db pool, returns a result. Test it in isolation.
Create a DAG in airflow/dags/. Import _path for sys.path setup. Use @dag and @task decorators. Call run_domain("features.module.path", "function_name") from the task.
Set the schedule. Pass a cron expression as schedule=. Think about ordering: if this DAG reads data that another DAG produces, offset its cron so it fires after the upstream run normally finishes, and set start_date to match.
Wire failure alerting. Pass on_failure_callback=telegram_on_failure in default_args or DAG-level kwargs.
Remove the old in-process job. Delete the APScheduler / background-loop code from the API. The API should only serve reads.
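
Step 4 relies on a failure callback. Airflow calls an on_failure_callback with a single context dict that includes the task instance and the raised exception. The GL implementation of telegram_on_failure is not shown on this page, so the sketch below is a hypothetical stand-in: it only formats the alert text and hands it to an injectable send function (print by default) instead of calling Telegram. The DAG and task ids in the simulated context are illustrative.

```python
from types import SimpleNamespace

def format_failure_message(context):
    """Build a short alert string from an Airflow-style callback context."""
    ti = context["task_instance"]
    exc = context.get("exception")
    return (
        f"Airflow task failed: {ti.dag_id}.{ti.task_id} "
        f"(try {ti.try_number}): {exc!r}"
    )

def telegram_on_failure(context, send=print):
    # Hypothetical stand-in: a real version would post `message`
    # to a Telegram chat instead of calling `send`.
    message = format_failure_message(context)
    send(message)
    return message

# Simulated context, shaped like what Airflow passes to the callback.
fake_context = {
    "task_instance": SimpleNamespace(
        dag_id="veaero_lock_sync", task_id="sync_lock", try_number=1
    ),
    "exception": RuntimeError("pool timeout"),
}

sent = []
telegram_on_failure(fake_context, send=sent.append)
```

Airflow only passes the context argument, so the send parameter keeps its default in production and exists mainly to make the callback testable.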

The `domain_runner` pattern

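from gl_paiddaily_common.domain_runner import run_domain

# Arguments after the function name are passed through to the async function.
result = run_domain(
    "features.aerodrome.jobs.veaero_lock_sync",  # module path
    "sync_lock",                                 # async function name
    wallet_lc,                                   # defined earlier by the calling task
)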

run_domain initializes an asyncpg pool, calls the async function, tears down. This lets DAGs call the exact same code the API was calling — no HTTP round-trip, no separate client, no API auth.
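
The real run_domain lives in gl_paiddaily_common and is not reproduced here. A minimal sketch of the lifecycle described above, under stated assumptions: the domain function takes the pool as its first argument, FakePool stands in for an asyncpg pool, and create_pool, pool_factory, and the demo_jobs module are hypothetical names introduced for illustration.

```python
import asyncio
import importlib
import sys
import types

class FakePool:
    """Stand-in for an asyncpg pool; records whether it was closed."""
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

async def create_pool():
    # Assumption: the real helper would build an asyncpg pool here.
    return FakePool()

def run_domain(module_path, function_name, *args,
               pool_factory=create_pool, **kwargs):
    """Import an async domain function by name and run it synchronously."""
    func = getattr(importlib.import_module(module_path), function_name)

    async def _runner():
        pool = await pool_factory()          # set up
        try:
            return await func(pool, *args, **kwargs)
        finally:
            await pool.close()               # tear down, even on failure

    return asyncio.run(_runner())

# Demo: register a throwaway module so the import-by-name path is exercised.
demo = types.ModuleType("demo_jobs")

async def sync_lock(pool, wallet_lc):
    return {"wallet": wallet_lc, "pool_open_during_call": not pool.closed}

demo.sync_lock = sync_lock
sys.modules["demo_jobs"] = demo

result = run_domain("demo_jobs", "sync_lock", "0xabc")
```

Resolving the function from strings keeps the DAG file free of top-level imports of domain code, and asyncio.run gives each task invocation its own event loop and pool.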

Gotchas

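DAG files are re-parsed continuously, not just at deploy time. The DAG processor re-reads each file on a short interval (min_file_process_interval, 30 seconds by default), so heavy imports at module level slow down scheduling. Keep DAG files thin: import the real logic inside the @task function body, not at the top of the file.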
run_domain() creates a new connection pool per task invocation. Fine for cron-cadence jobs (minutes/hours). Would be wasteful for sub-second scheduling — but that's not what Airflow is for.
XCom serializes to JSON by default. Don't pass large DataFrames or binary blobs between tasks via return values. Write intermediate results to the database or object store.
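Catchup schedules a run for every missed interval. With catchup enabled (the default in Airflow 2.x), a newly deployed DAG gets one run per interval between start_date and now. Set catchup=False on DAGs that don't need historical backfills, or you'll get a flood of runs on first deploy.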
The webserver is not hardened for public exposure. Keep it behind a VPN or localhost. The GL setup binds to 127.0.0.1:8080.
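
To make the catchup gotcha concrete, the sketch below counts how many runs a DAG would be owed if it were deployed with catchup enabled. The start_date is hypothetical, and missed_runs is a back-of-envelope helper for fixed intervals, not Airflow's timetable logic.

```python
from datetime import datetime, timedelta, timezone

def missed_runs(start_date, now, interval):
    """Count complete schedule intervals between start_date and now."""
    if now <= start_date:
        return 0
    return (now - start_date) // interval

# Hypothetical: DAG written with start_date of 2026-01-01, deployed 2026-05-27.
start = datetime(2026, 1, 1, tzinfo=timezone.utc)
now = datetime(2026, 5, 27, tzinfo=timezone.utc)

daily = missed_runs(start, now, timedelta(days=1))    # 146
hourly = missed_runs(start, now, timedelta(hours=1))  # 3504
```

An hourly DAG with that start_date would queue thousands of runs on first deploy, which is why catchup=False or a recent start_date matters.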

Risks

Operational surface. Self-hosted Airflow is a real service to maintain — scheduler, webserver, metadata DB, log rotation. More moving parts than a cron job in a shell script. Worth it at 44 DAGs; questionable at 3.
Python-only DAGs. The workflow definition language is Python. If the team writes Go or TypeScript, Airflow is a context switch. Not an issue for the GL stack (Python backend), but worth noting.
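Scheduler is a single point of failure. Single-node setup means scheduler downtime = no DAG runs. Acceptable for a trading dashboard; not acceptable for a billing pipeline. HA means running more than one scheduler against the shared metadata DB, usually with CeleryExecutor or KubernetesExecutor so tasks run on separate workers. More infra either way.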

Neo4j — Airflow DAGs write pipeline results to Postgres; agents read canonical state from Neo4j. Two systems, two jobs.
DSPy — the Pendle market research DAG runs a DSPy module inside run_domain(). Airflow schedules the call; DSPy does the thinking.