etl-principles

Best practices usually start to pay off once the team grows and there are multiple data sources, calculation processes and users. Following them curbs the urge to make ad-hoc changes in order to ‘solve it quickly to get it going’, changes that eventually tangle everything.

Load Data Incrementally

  • Extract data incrementally at regular intervals
  • Airflow makes this easy by scheduling DAG runs on a regular time cadence (see the sketch below)
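
A minimal sketch of incremental extraction, assuming Airflow 2.x; the extract_orders.py script, dag_id and dates are hypothetical. The DAG runs daily and tells the script to pull only the rows that fall inside the scheduled interval, using the data_interval_start / data_interval_end template variables.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="incremental_extract",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull only the rows belonging to this run's interval, not the full table.
    extract = BashOperator(
        task_id="extract_orders",
        bash_command=(
            "python extract_orders.py "
            "--from '{{ data_interval_start }}' "
            "--to '{{ data_interval_end }}'"
        ),
    )
```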

Process Historic Data

  • A new workflow usually also needs older data; without a backfill strategy, ad-hoc workarounds are needed to retrieve it (see the sketch below)
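
One way to cover history without one-off scripts, assuming Airflow 2.x: set catchup=True so Airflow creates a run for every interval since start_date and the regular incremental logic fills in the past as well. The dag_id, script and dates below are made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_with_history",
    start_date=datetime(2023, 1, 1),   # how far back history should be built
    schedule_interval="@daily",
    catchup=True,                      # schedule every missed interval as well
) as dag:
    load = BashOperator(
        task_id="load_orders",
        bash_command="python load_orders.py --date '{{ ds }}'",
    )

# Specific ranges can also be rebuilt on demand from the CLI, e.g.:
#   airflow dags backfill -s 2023-01-01 -e 2023-03-31 orders_with_history
```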

Partition Ingested Data

  • Partitioning data at ingestion allows parallel DAG runs that do not contend for the same write locks (see the sketch below)
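
A sketch of date-partitioned ingestion, assuming Airflow 2.x; the /data/raw/orders layout is invented. Each run writes only into its own ds=<date> directory, so concurrent or re-run DAG runs never touch the same files.

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_partition(ds, **_):
    # Every run gets its own date-stamped directory; parallel runs for
    # different dates therefore never compete for the same files or rows.
    target_dir = f"/data/raw/orders/ds={ds}"
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, "orders.csv"), "w") as fh:
        fh.write(f"rows extracted for {ds}\n")   # placeholder for the real extract

with DAG(
    dag_id="partitioned_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest_partition)
```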

Enforce the Idempotency Constraint

  • Runs with the same parameters should have the same outcome on different days
  • The exception is when the process itself changes; then the outcome may legitimately differ
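
A small illustration of the delete-then-write pattern that keeps a load idempotent; the staging path is made up. Rerunning it for the same date replaces the partition instead of appending to it, so the end state is identical no matter how often or when the run happens.

```python
import shutil
from pathlib import Path

def write_partition(ds, rows):
    partition = Path(f"/data/staging/orders/ds={ds}")
    # Drop whatever a previous attempt left behind, then write fresh output;
    # a second run with the same inputs produces exactly the same partition.
    if partition.exists():
        shutil.rmtree(partition)
    partition.mkdir(parents=True)
    (partition / "orders.csv").write_text("\n".join(rows) + "\n")

write_partition("2024-01-01", ["1,widget", "2,gadget"])
```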

Enforce Deterministic Properties

  • For a given set of inputs the output is always the same; common ways a function becomes non-deterministic (see the sketch after this list):
      ◦ Using external state within the function
      ◦ Operating in time-sensitive ways
      ◦ Relying on the order of input variables
      ◦ Implementation issues inside the function (e.g. relying on dictionary order)
      ◦ Improper exception handling and post-exception behavior
      ◦ Intermediate commits and unexpected conditions
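
A toy example of the time-sensitivity case and a deterministic rewrite: the reference time comes in as an input (for instance the DAG run's logical date) instead of being read inside the function.

```python
from datetime import datetime, timezone

# Non-deterministic: rerunning the task later stamps the same rows with a
# different time, so the output drifts between runs.
def tag_rows_nondeterministic(rows):
    now = datetime.now(timezone.utc)
    return [{**row, "processed_at": now.isoformat()} for row in rows]

# Deterministic: the same inputs always give the same output, because the
# reference time is passed in (e.g. the run's logical date).
def tag_rows(rows, logical_date):
    return [{**row, "processed_at": logical_date.isoformat()} for row in rows]
```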

Execute Conditionally

  • Tasks can be configured to run only after other tasks have succeeded, or under other conditions via trigger rules (see the sketch below)
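
A sketch of conditional execution with trigger rules, assuming Airflow 2.3+ (for EmptyOperator); task names are placeholders. transform keeps the default all_success rule and only runs if extract succeeded, while cleanup uses all_done and runs whatever happened upstream.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="conditional_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")  # default: all upstream succeeded
    cleanup = EmptyOperator(
        task_id="cleanup",
        trigger_rule=TriggerRule.ALL_DONE,          # run even if upstream failed
    )
    extract >> transform >> cleanup
```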

Code the Workflow as well as the Applications

  • Control both the workflow and the underlying application in code
  • Dynamically trigger and parameterize DAGs from within another DAG (see the sketch below)
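
One common way to drive a DAG from another DAG, assuming Airflow 2.x, is TriggerDagRunOperator: a controller DAG decides in workflow code which DAG to start and with which configuration, and the triggered DAG runs the underlying application logic. The orders_load target and its conf keys are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="controller",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The workflow layer decides what to run and with which parameters;
    # the triggered DAG contains the application logic itself.
    trigger_load = TriggerDagRunOperator(
        task_id="trigger_orders_load",
        trigger_dag_id="orders_load",
        conf={"source": "orders", "full_refresh": False},
    )
```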

Reset Data Between Tasks

  • What might seem inefficient is actually intentional: tasks should not read each other's temporary files
  • Task instances of the same DAG may execute on different workers, so they cannot rely on shared temporary data (see the sketch below)
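
A sketch of handing results between tasks without local temp files, assuming Airflow 2.x: a small value travels through XCom, while anything large would go to shared storage such as S3 or HDFS instead of the worker's disk. Task ids and the value are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(ti, **_):
    # Do not write to /tmp and expect the next task to find it: the next
    # task instance may run on a different worker.
    ti.xcom_push(key="row_count", value=42)

def report(ti, **_):
    count = ti.xcom_pull(task_ids="extract", key="row_count")
    print(f"extracted {count} rows")

with DAG(
    dag_id="no_local_temp_files",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_report = PythonOperator(task_id="report", python_callable=report)
    t_extract >> t_report
```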

Understand SLAs and Alerts

  • SLAs can be used to detect long-running tasks
  • Airflow can send an email notification when an SLA is missed (see the sketch below)
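
A sketch of both mechanisms, assuming Airflow 2.x with a working SMTP/email setup: the task carries an sla, the DAG gets an sla_miss_callback for custom alerting, and email_on_failure covers outright failures. The address and callback body are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the scheduler when a task misses its SLA; wire up Slack,
    # PagerDuty, etc. here. Printing is only a stand-in.
    print(f"SLA missed in {dag.dag_id}: {task_list}")

with DAG(
    dag_id="sla_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
    default_args={"email": ["data-team@example.com"], "email_on_failure": True},
) as dag:
    BashOperator(
        task_id="nightly_load",
        bash_command="python load.py --date '{{ ds }}'",
        sla=timedelta(hours=2),   # flag the task if it has not succeeded within 2 hours
    )
```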