data-science-cheat-sheet
Inspired by this architecture diagram: https://python-data-science.readthedocs.io/en/latest/_images/architecture.png
MLOps flow
- Structure: DESIGN + MODEL DEVELOPMENT + OPERATIONS
- Design: requirements engineering, use case prioritization, data availability check
- Model development: data engineering, model engineering, testing and validation
- Operations: model deployment, CI/CD pipelines, monitoring and alerting
Model Design
- Exploratory Data Analysis
- Descriptive Statistics
- Distributions
- Boxplots - summarize the data with the median, the inter-quartile range, and whiskers
- The box spans the 25th to the 75th percentile; this span is the inter-quartile range (IQR)
- The whiskers extend to Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
- Points outside the whiskers are flagged as outliers
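The whisker rule above can be sketched in a few lines of NumPy (the helper name `iqr_outliers` is illustrative, not from the source):

```python
import numpy as np

def iqr_outliers(x):
    """Flag points outside the boxplot whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lower) | (x > upper)]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # 100 sits far outside the whiskers
outliers = iqr_outliers(data)
```
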
- Correlations
- Data Preparation
- Feature Preprocessing
- Fill missing values or remove the column entirely if too many values are missing
- Impute with the mean or median, or interpolate (linear or other methods)
- Remove outliers to ensure robustness of sensitive models
- As identified in box plots
- Encode features from string into integer
- sine/cosine transformation to maintain cyclic relationships
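A minimal sketch of the sine/cosine transformation for a cyclic feature such as hour of day (the function name `encode_cyclic` is illustrative):

```python
import numpy as np

def encode_cyclic(values, period):
    """Map a cyclic feature onto the unit circle so that values near the
    wrap-around point (e.g. hour 23 and hour 0) end up close together."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hour_sin, hour_cos = encode_cyclic([0, 6, 12, 23], period=24)
```

With a plain integer encoding, hours 0 and 23 look maximally far apart; on the circle they are neighbors.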
- Split into training and testing sets to detect overfitting to the training data and to avoid data leakage
- Feature Transformation techniques to bring the feature's distribution closer to a normal distribution
- Feature Scaling techniques to bring features onto a common scale for model convergence
- Feature Engineering to create powerful features that are more informative for the model
- Class balancing to help models predict rare occurrences
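A leakage-free scaling sketch in NumPy: split first, then fit the scaling statistics on the training rows only and reuse them for the test rows (the data here is synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features

# Split first, then compute scaling statistics on the training part only,
# so no information from the test rows leaks into preprocessing.
train, test = X[:80], X[80:]
mu, sigma = train.mean(axis=0), train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse the *training* statistics
```
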
- Model Selection
- Supervised Learning
- Regression
- Linear Regression Model
- Polynomial regression model
- Quantile Regression
- Tree Regressors
- Classification
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Recommendation Systems
- Model Fit
- Loss Functions
- L1 loss function (sum of the absolute errors)
- L2 loss function (sum of the squared errors)
- Pinball loss function
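The three loss functions above can be written directly in NumPy; note the pinball (quantile) loss reduces to half the absolute error at tau = 0.5:

```python
import numpy as np

def l1_loss(y, yhat):
    """Sum of the absolute errors."""
    return np.abs(y - yhat).sum()

def l2_loss(y, yhat):
    """Sum of the squared errors."""
    return ((y - yhat) ** 2).sum()

def pinball_loss(y, yhat, tau):
    """Asymmetric loss minimized by the tau-th conditional quantile."""
    diff = y - yhat
    return np.maximum(tau * diff, (tau - 1) * diff).mean()
```
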
- Hyperparameter tuning changes model performance; combined with cross-validation techniques, it lets us find the best set of hyperparameters
- Grid Search
- Random Search
- Bayesian Optimization
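A small grid-search sketch: tune the ridge penalty of a closed-form ridge regression with k-fold cross-validation, all in NumPy (the data, the grid, and the helper names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

def ridge_fit(X, y, alpha):
    """Closed-form ridge: w = (X'X + alpha*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, alpha, k=5):
    """Mean squared error averaged over k cross-validation folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False  # hold out this fold
        w = ridge_fit(X[mask], y[mask], alpha)
        errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return float(np.mean(errs))

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha = min(grid, key=lambda a: cv_mse(X, y, a))
```

Random search works the same way with `alpha` sampled from a distribution instead of enumerated.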
- Model Evaluation
- Commonly used error metrics (e.g. MAE, RMSE) to evaluate regression-type models
- Bias-variance tradeoff
- The best predictive model generalizes well, i.e. predicts accurately on new, previously unseen data
- High bias can give acceptable but overly general performance -> under-fit
- High variance can give low errors on existing data but not necessarily on new data -> over-fit
- Regularization Techniques
- L1 regularization (Lasso Regression) penalizes the absolute values of the coefficients, shrinking some exactly to zero
- L2 regularization (Ridge Regression) penalizes the squared magnitudes of the coefficients, shrinking them all toward zero
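The shrinkage effect of L2 regularization can be seen directly: as the penalty strength grows, the norm of the ridge coefficient vector shrinks (synthetic data, illustrative helper name):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -1.5, 2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution; alpha = 0 recovers ordinary least squares."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Larger alpha -> stronger penalty -> smaller coefficient norm
norms = [np.linalg.norm(ridge_fit(X, y, a)) for a in (0.0, 1.0, 100.0)]
```
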
- Model fit vs complexity
- AIC
- BIC
- SHAP (Shapley values) for per-feature contributions to predictions
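For a least-squares fit with Gaussian errors, AIC and BIC reduce to simple formulas in the residual sum of squares; lower is better, and BIC's ln(n) penalty favors smaller models on large samples (these helpers are a sketch, not from the source):

```python
import numpy as np

def aic(rss, n, k):
    """Akaike information criterion for a Gaussian least-squares fit:
    n*ln(RSS/n) + 2k, where k is the number of fitted parameters."""
    return n * np.log(rss / n) + 2 * k

def bic(rss, n, k):
    """Bayesian information criterion: the complexity penalty grows with
    ln(n) instead of the constant 2, punishing extra parameters harder."""
    return n * np.log(rss / n) + k * np.log(n)
```
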
- Model Score
- Score on unseen data as the true evaluation of the model
- Serialization
- Pickle the fitted data-preparation steps together with the fitted model parameters as one artifact
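A minimal serialization sketch with the standard-library pickle module: bundle the fitted preprocessing statistics and model weights into one artifact so inference reproduces training-time preparation exactly (the artifact layout is an illustrative assumption):

```python
import pickle
import numpy as np

# One artifact holding everything needed at inference time
artifact = {
    "scaler": {"mean": np.array([50.0, 10.0]), "std": np.array([10.0, 2.0])},
    "weights": np.array([0.3, -1.2]),
}

blob = pickle.dumps(artifact)       # in practice: pickle.dump(artifact, open(path, "wb"))
restored = pickle.loads(blob)
```
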
Inference testing
- Using sample data to make conclusions about a population parameter
- Frequentist Inference Testing
- Bayesian Inference Testing
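A minimal frequentist sketch: a 95% normal-approximation confidence interval for a population mean from a sample (the helper name and z = 1.96 cutoff are standard but assumed here):

```python
import numpy as np

def mean_confidence_interval(sample, z=1.96):
    """95% normal-approximation confidence interval for the population mean."""
    sample = np.asarray(sample, dtype=float)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean
    return m - z * se, m + z * se

lo, hi = mean_confidence_interval([4.8, 5.1, 5.0, 4.9, 5.2])
```
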
Causal Inference
- Establishing cause-and-effect relationships. It aims to determine whether a change in one variable (the cause or treatment) leads to a change in another variable (the effect or outcome)
- AB Testing
- Difference-in-differences
- Fit a regression of the observed metric on group, period, and their interaction; the interaction coefficient estimates the treatment effect
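The difference-in-differences regression can be sketched with NumPy on synthetic data where the true effect is known: regress the outcome on a group indicator, a before/after indicator, and their interaction; the interaction coefficient is the DiD estimate (all data here is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
treated = rng.integers(0, 2, n)   # group indicator (0 = control, 1 = treated)
post = rng.integers(0, 2, n)      # period indicator (0 = before, 1 = after)
true_effect = 2.0

# Outcome = baseline + group gap + common time trend + effect on treated-post cells
y = (1.0 + 0.5 * treated + 1.5 * post + true_effect * treated * post
     + rng.normal(scale=0.3, size=n))

# y ~ 1 + treated + post + treated:post; the interaction term picks up
# the difference-in-differences estimate of the treatment effect.
X = np.column_stack([np.ones(n), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
did_estimate = coef[3]
```
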