etl-principles

Best practices usually start to pay off once the team grows and there are multiple data sources, calculation processes and users. Following them curbs the urge to make ad-hoc changes in order to ‘solve it quickly to get it going’, changes that eventually tangle everything.

Load Data Incrementally

  • Extract data incrementally at regular intervals
  • Airflow makes this easy by scheduling DAG runs on a regular time cadence (see the sketch below)
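
A minimal sketch of incremental extraction, assuming Airflow 2.x; the extract_orders.py script, dag_id and dates are hypothetical. The DAG runs daily and tells the script to pull only the rows that fall inside the scheduled interval, using the data_interval_start / data_interval_end template variables.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="incremental_extract",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull only the rows belonging to this run's interval, not the full table.
    extract = BashOperator(
        task_id="extract_orders",
        bash_command=(
            "python extract_orders.py "
            "--from '{{ data_interval_start }}' "
            "--to '{{ data_interval_end }}'"
        ),
    )
```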

Process Historic Data

  • A new workflow usually also needs older data; without a backfill strategy, ad-hoc workarounds are needed to retrieve it (see the sketch below)
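
One way to cover history without one-off scripts, assuming Airflow 2.x: set catchup=True so Airflow creates a run for every interval since start_date and the regular incremental logic fills in the past as well. The dag_id, script and dates below are made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_with_history",
    start_date=datetime(2023, 1, 1),   # how far back history should be built
    schedule_interval="@daily",
    catchup=True,                      # schedule every missed interval as well
) as dag:
    load = BashOperator(
        task_id="load_orders",
        bash_command="python load_orders.py --date '{{ ds }}'",
    )

# Specific ranges can also be rebuilt on demand from the CLI, e.g.:
#   airflow dags backfill -s 2023-01-01 -e 2023-03-31 orders_with_history
```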

Partition Ingested Data

  • Partitioning data at ingestion allows parallel DAG runs that do not contend for the same write locks (see the sketch below)
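
A sketch of date-partitioned ingestion, assuming Airflow 2.x; the /data/raw/orders layout is invented. Each run writes only into its own ds=<date> directory, so concurrent or re-run DAG runs never touch the same files.

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_partition(ds, **_):
    # Every run gets its own date-stamped directory; parallel runs for
    # different dates therefore never compete for the same files or rows.
    target_dir = f"/data/raw/orders/ds={ds}"
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, "orders.csv"), "w") as fh:
        fh.write(f"rows extracted for {ds}\n")   # placeholder for the real extract

with DAG(
    dag_id="partitioned_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest_partition)
```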

Enforce the Idempotency Constraint

  • Runs with the same parameters should have the same outcome on different days
  • The exception is when the process itself changes; then the outcome may legitimately differ
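
A small illustration of the delete-then-write pattern that keeps a load idempotent; the staging path is made up. Rerunning it for the same date replaces the partition instead of appending to it, so the end state is identical no matter how often or when the run happens.

```python
import shutil
from pathlib import Path

def write_partition(ds, rows):
    partition = Path(f"/data/staging/orders/ds={ds}")
    # Drop whatever a previous attempt left behind, then write fresh output;
    # a second run with the same inputs produces exactly the same partition.
    if partition.exists():
        shutil.rmtree(partition)
    partition.mkdir(parents=True)
    (partition / "orders.csv").write_text("\n".join(rows) + "\n")

write_partition("2024-01-01", ["1,widget", "2,gadget"])
```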

Enforce Deterministic Properties

  • For a given set of inputs the output is always the same; common ways a function becomes non-deterministic (see the sketch after this list):
      ◦ Using external state within the function
      ◦ Operating in time-sensitive ways
      ◦ Relying on the order of input variables
      ◦ Implementation issues inside the function (e.g. relying on dictionary order)
      ◦ Improper exception handling and post-exception behavior
      ◦ Intermediate commits and unexpected conditions
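
A toy example of the time-sensitivity case and a deterministic rewrite: the reference time comes in as an input (for instance the DAG run's logical date) instead of being read inside the function.

```python
from datetime import datetime, timezone

# Non-deterministic: rerunning the task later stamps the same rows with a
# different time, so the output drifts between runs.
def tag_rows_nondeterministic(rows):
    now = datetime.now(timezone.utc)
    return [{**row, "processed_at": now.isoformat()} for row in rows]

# Deterministic: the same inputs always give the same output, because the
# reference time is passed in (e.g. the run's logical date).
def tag_rows(rows, logical_date):
    return [{**row, "processed_at": logical_date.isoformat()} for row in rows]
```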

Execute Conditionally

  • Tasks can be configured to run only after other tasks have succeeded, or under other conditions via trigger rules (see the sketch below)
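
A sketch of conditional execution with trigger rules, assuming Airflow 2.3+ (for EmptyOperator); task names are placeholders. transform keeps the default all_success rule and only runs if extract succeeded, while cleanup uses all_done and runs whatever happened upstream.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="conditional_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")  # default: all upstream succeeded
    cleanup = EmptyOperator(
        task_id="cleanup",
        trigger_rule=TriggerRule.ALL_DONE,          # run even if upstream failed
    )
    extract >> transform >> cleanup
```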

Code the Workflow as well as the Applications

  • Control both the workflow and the underlying application in code
  • Dynamically trigger and parameterize DAGs from within another DAG (see the sketch below)
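
One common way to drive a DAG from another DAG, assuming Airflow 2.x, is TriggerDagRunOperator: a controller DAG decides in workflow code which DAG to start and with which configuration, and the triggered DAG runs the underlying application logic. The orders_load target and its conf keys are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="controller",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The workflow layer decides what to run and with which parameters;
    # the triggered DAG contains the application logic itself.
    trigger_load = TriggerDagRunOperator(
        task_id="trigger_orders_load",
        trigger_dag_id="orders_load",
        conf={"source": "orders", "full_refresh": False},
    )
```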

Reset Data Between Tasks

  • What might seem inefficient is actually intentional: tasks should not read each other's temporary files
  • Task instances of the same DAG may execute on different workers, so they cannot rely on shared temporary data (see the sketch below)
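
A sketch of handing results between tasks without local temp files, assuming Airflow 2.x: a small value travels through XCom, while anything large would go to shared storage such as S3 or HDFS instead of the worker's disk. Task ids and the value are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(ti, **_):
    # Do not write to /tmp and expect the next task to find it: the next
    # task instance may run on a different worker.
    ti.xcom_push(key="row_count", value=42)

def report(ti, **_):
    count = ti.xcom_pull(task_ids="extract", key="row_count")
    print(f"extracted {count} rows")

with DAG(
    dag_id="no_local_temp_files",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_report = PythonOperator(task_id="report", python_callable=report)
    t_extract >> t_report
```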

Understand SLAs and Alerts

  • SLAs can be used to detect long-running tasks
  • Airflow can send an email notification when an SLA is missed (see the sketch below)
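
A sketch of both mechanisms, assuming Airflow 2.x with a working SMTP/email setup: the task carries an sla, the DAG gets an sla_miss_callback for custom alerting, and email_on_failure covers outright failures. The address and callback body are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the scheduler when a task misses its SLA; wire up Slack,
    # PagerDuty, etc. here. Printing is only a stand-in.
    print(f"SLA missed in {dag.dag_id}: {task_list}")

with DAG(
    dag_id="sla_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
    default_args={"email": ["data-team@example.com"], "email_on_failure": True},
) as dag:
    BashOperator(
        task_id="nightly_load",
        bash_command="python load.py --date '{{ ds }}'",
        sla=timedelta(hours=2),   # flag the task if it has not succeeded within 2 hours
    )
```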