2021-11-15

  • ☀️Daily Log:

    • Hurdle Models #data-science
      • Linear models assume some level of normality in the response variable being predicted
        • If the response variable is skewed and doesn’t follow a normal distribution, we correct the skew with transformations like log, sqrt, and Box-Cox power
      • However, in some instances there is a clearly multi-modal distribution with lots of values at zero
        • Transformations only change the scale of the variable; they don’t change the shape of the distribution
        • This is a good indication that the data come from two or more distinct underlying data-generating processes
      • Common applications: insurance claims (where most people don’t claim)
        • Also applies to us, where most site-days don’t experience waiting
      • First model - a binomial classifier trained and tested on all the data
      • Second model - a regressor trained only on the samples with a positive (non-zero) response, but used to make predictions on all the test data
      • Implementation (see the sketch below)
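        • A minimal sketch of the two-stage setup, assuming scikit-learn gradient-boosted models and synthetic zero-inflated data; the estimator choices, features, and data-generation details are illustrative assumptions, not from this note.

```python
# Hurdle-model sketch: stage 1 classifies zero vs. non-zero,
# stage 2 regresses the magnitude using only the positive rows.
# Data and model choices below are assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 4))

# Simulate a zero-inflated response: most rows are exactly zero,
# the remainder follow a skewed positive distribution.
is_positive = rng.random(n) < 0.2
y = np.where(is_positive, rng.lognormal(mean=1.0, sigma=0.5, size=n), 0.0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First model: binomial classifier trained on all the data (zero vs. non-zero).
clf = GradientBoostingClassifier().fit(X_train, (y_train > 0).astype(int))

# Second model: regressor trained only on rows with a positive response...
mask = y_train > 0
reg = GradientBoostingRegressor().fit(X_train[mask], y_train[mask])

# ...but used to predict on all the test data, combined with the classifier:
# E[y | x] = P(y > 0 | x) * E[y | y > 0, x]
p_positive = clf.predict_proba(X_test)[:, 1]
y_pred = p_positive * reg.predict(X_test)
```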

    • Classifying Histograms #data-science
      • Use hierarchical clustering or DBSCAN (see the sketch below)
        • They are better than k-means because they work with arbitrary distance measures such as Jensen-Shannon divergence
          • Which is designed to capture similarity of distributions
      • Hierarchical clustering builds a hierarchy of clusters
        • Agglomerative - a bottom-up approach
        • Divisive - a top-down approach
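      • A minimal sketch of clustering histograms on a precomputed Jensen-Shannon distance matrix with scikit-learn; the synthetic histograms, cluster count, and DBSCAN eps are assumptions for illustration.

```python
# Cluster histograms using pairwise Jensen-Shannon distances with
# agglomerative clustering and DBSCAN. The histogram data is synthetic.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)

# Each row is a normalized histogram (a discrete probability distribution).
hists = rng.dirichlet(alpha=np.ones(20), size=50)

# Pairwise Jensen-Shannon distance (square root of JS divergence; a metric).
n = len(hists)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = jensenshannon(hists[i], hists[j])

# Agglomerative (bottom-up) hierarchical clustering on the precomputed matrix.
# Note: older scikit-learn versions name this parameter `affinity`.
agg = AgglomerativeClustering(n_clusters=3, metric="precomputed", linkage="average")
agg_labels = agg.fit_predict(dist)

# DBSCAN also accepts a precomputed distance matrix; eps here is a guess.
db = DBSCAN(eps=0.2, min_samples=3, metric="precomputed")
db_labels = db.fit_predict(dist)
```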

  • Retrospective::

  • Daily Stoic::