2021-11-15

☀️Daily Log:
- Hurdle Models #data-science
  - Linear models assume roughly normal errors, which in practice means the response variable being predicted should not be heavily skewed
    - If the response is skewed, we can correct it with transformations such as log, sqrt, or Box-Cox power transforms
  - However, in some cases the response is clearly multi-modal, with a large spike of values at exactly zero
    - A transformation only rescales the variable; it cannot split up or remove the spike at zero
    - This is a good indication that the data come from two or more distinct underlying data-generating processes
  - Common applications: insurance claims (where most people don't claim)
    - Also applies to us, where most site-days don't experience waiting
  - First model - a binary classifier trained and tested on all the data, predicting whether the response is zero or non-zero
  - Second model - a regressor trained only on the positive (non-zero) samples but used to make predictions on all the test data
  - Implementation
    - You can build two separate models
    - Or extend scikit-learn's `BaseEstimator` (with `RegressorMixin`) so the combined model can be passed to grid search and evaluation functions
    - https://geoffruddock.com/building-a-hurdle-regression-estimator-in-scikit-learn/
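    - A minimal sketch of the second option, not the linked post's code. Assumptions: the class name `HurdleRegressor`, the default sub-models (`LogisticRegression`, `LinearRegression`), the synthetic data, and the combination rule (classifier's 0/1 prediction times the regressor's prediction) are all illustrative choices.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression, LogisticRegression


class HurdleRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical two-stage hurdle model: a zero/non-zero gate, then a regressor."""

    def __init__(self, classifier=None, regressor=None):
        # Store params untouched so get_params/clone (and grid search) work.
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        positive = y > 0
        # Fall back to simple defaults (an assumption) when no sub-model is given.
        clf = self.classifier if self.classifier is not None else LogisticRegression()
        reg = self.regressor if self.regressor is not None else LinearRegression()
        # Stage 1: classifier sees all rows, target is zero vs non-zero.
        self.classifier_ = clone(clf).fit(X, positive.astype(int))
        # Stage 2: regressor sees only the non-zero rows.
        self.regressor_ = clone(reg).fit(X[positive], y[positive])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Regressor predicts for every row; the classifier zeroes out predicted zeros.
        return self.classifier_.predict(X) * self.regressor_.predict(X)


# Synthetic zero-inflated data: zero unless the first feature clears a threshold.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] > 0.5, np.exp(1 + 0.5 * X[:, 1]), 0.0)

model = HurdleRegressor().fit(X, y)
preds = model.predict(X)
```

    - Because it subclasses `BaseEstimator`, the same object should drop into `GridSearchCV` or `cross_val_score`; multiplying by `predict_proba` instead of the hard 0/1 prediction is an alternative that gives an expected value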
- Classifying Histograms #data-science
  - Use [[hierarchical clustering]] or [[DBSCAN]]
    - They suit this better than [[k-means]] because they work with arbitrary distance measures such as Jensen-Shannon divergence
    - Which is designed to capture similarity of distributions
  - [[hierarchical clustering]] builds a hierarchy of clusters
    - Agglomerative - a bottom-up approach
    - Divisive - a top-down approach
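  - A minimal sketch of agglomerative clustering on histograms with Jensen-Shannon distance, using SciPy. Assumptions: the synthetic samples, the bin edges, average linkage, and cutting the tree into two clusters are illustrative choices only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
bins = np.linspace(-4, 9, 21)  # shared bin edges so histograms are comparable

# Ten synthetic histograms: five drawn around 0, five drawn around 5.
samples = [rng.normal(loc=0.0, size=200) for _ in range(5)]
samples += [rng.normal(loc=5.0, size=200) for _ in range(5)]
hists = np.array([np.histogram(s, bins=bins)[0] for s in samples], dtype=float)
hists /= hists.sum(axis=1, keepdims=True)  # normalise counts to probabilities

# Pairwise Jensen-Shannon distance (square root of the JS divergence).
dists = pdist(hists, metric="jensenshannon")

# Agglomerative (bottom-up) clustering on the condensed distance matrix.
tree = linkage(dists, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
```

  - The same pairwise distances, converted to a square matrix with `scipy.spatial.distance.squareform`, could be passed to scikit-learn's `DBSCAN(metric="precomputed")`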
Retrospective::
- One week ago: [[November 8th, 2021]]
- One month ago: [[October 15th, 2021]]
- One quarter ago: [[August 15th, 2021]]
- One year ago: [[November 15th, 2020]]
Daily Stoic::
