Functional Data Engineering

Metadata:

#article #data-science
Source: Functional Data Engineering
Modern paradigm for batched data processing
ETL (extract, transform and load) is a time-consuming, brittle and often unrewarding process
Functional programming paradigm can bring clarity to the process
- A style of building the structure and elements of computer programs that treats computation as the evaluation of mathematical functions and avoid changing-state and mutable data
- A declarative programming paradigm, meaning programming is done with expressions or declarations, instead of statements
- Functional code outputs code that only depend on the arguments that are passed to the function
  - Calling the same function twice with the same argument will produce the same result each time
- In contrast to procedures that depend on a local or global state
Reproducibility and replicability are the key characteristics we are after
- In order to do this, we want immutable data along with versioned logic
Write only pure tasks
- These are deterministic and idempotent
- No side-effects to things outside the scopes of the task
- When tasks inevitably fail, they can be re-run without any concerns of double-counting or overwriting unintentionally
- Instead of returning something like a pure-function - it overwrites a partition of the data which is akin to the immutable object that a typical pure function would return
- Should only target a single output
Table partitions as immutable objects
- Don't use DML operations like UPDATE, APPEND, DELETE
- A pure task should fully overwrite a partition as its output
Use dimension snapshots to handle slowly changing dimensions
Use separate event time and processing time to handle late arriving facts