2021-11-15
-
☀️Daily Log:
- Hurdle Models #data-science
- Linear models assume some level of normality (strictly in the residuals, but in practice a heavily skewed response variable is the usual warning sign)
- If the response variable is skewed and does not look normal, we correct the skew with transformations such as log, sqrt, or a Box-Cox power transform
- However, in some cases the distribution is clearly multi-modal, with a large share of values at exactly zero
- A transformation only changes the scale of the variable; a spike at zero is still a spike after any monotone transform
- This is a good indication that the data come from two or more distinct underlying data-generating processes
- Common applications: insurance claims (where most people don't claim)
- Also applies to us, where most site-days don't experience waiting
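- First model - a binary classifier (zero vs. non-zero response) trained and tested on all the data
- Second model - a regressor trained only on the samples with a non-zero response, but used to make predictions on all the test data
- Final prediction - typically the regressor's output where the classifier predicts non-zero, and zero otherwise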
- Implementation
- You can build two separate models
- Or extend scikit-learn's `sklearn.base.BaseEstimator` (together with `RegressorMixin`) so that the combined model can be passed to grid search and evaluation functions - https://geoffruddock.com/building-a-hurdle-regression-estimator-in-scikit-learn/
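A minimal sketch of the two-model idea above, wrapped as a single scikit-learn estimator. The class name `HurdleRegressor`, the default sub-models (`LogisticRegression`, `LinearRegression`), and the synthetic data are all assumptions for illustration, not the linked post's exact implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression, LogisticRegression


class HurdleRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical two-part model: a classifier gates a regressor's output."""

    def __init__(self, classifier=None, regressor=None):
        # Store constructor arguments unchanged so get_params/clone work.
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        nonzero = y > 0
        # First model: binary classifier on all rows (zero vs. non-zero).
        base_clf = self.classifier if self.classifier is not None else LogisticRegression()
        self.classifier_ = clone(base_clf).fit(X, nonzero.astype(int))
        # Second model: regressor fitted on the non-zero rows only.
        base_reg = self.regressor if self.regressor is not None else LinearRegression()
        self.regressor_ = clone(base_reg).fit(X[nonzero], y[nonzero])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # The regressor predicts for every row; the 0/1 classifier output
        # zeroes out rows that are predicted not to clear the hurdle.
        return self.classifier_.predict(X) * self.regressor_.predict(X)


# Synthetic zero-inflated data (assumption): about a quarter of rows are non-zero.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
is_nonzero = X[:, 0] > 0.5
y = np.where(is_nonzero, 5.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500), 0.0)

model = HurdleRegressor().fit(X, y)
pred = model.predict(X)
print("share of zero predictions:", np.mean(pred == 0))
```

Because the constructor only stores its arguments, the estimator should be usable with `clone` and with search utilities such as `GridSearchCV`; nested parameters like `regressor__fit_intercept` become available once explicit sub-models are passed in.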
- Classifying Histograms #data-science
- Use [[hierarchical clustering]] or [[DBSCAN]]
- They suit this better than [[k-means]] because they work with arbitrary distance measures such as Jensen-Shannon divergence
- Jensen-Shannon divergence is designed to capture the similarity of probability distributions
- [[hierarchical clustering]] which builds a hierarchy of clusters
- Agglomerative - a bottom-up approach
- Divisive - a top-down approach
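A rough sketch of the histogram-clustering idea, using agglomerative (bottom-up) hierarchical clustering from SciPy with the Jensen-Shannon distance. The synthetic histograms, the bin edges, the `average` linkage, and the choice of two clusters are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
bins = np.linspace(0, 10, 21)


def make_hist(center):
    """Normalised histogram of samples drawn around `center` (toy data)."""
    samples = rng.normal(loc=center, scale=0.5, size=500)
    counts, _ = np.histogram(samples, bins=bins)
    return counts / counts.sum()


# Six toy histograms: three centred near 2, three centred near 8.
histograms = np.array([make_hist(c) for c in [2, 2, 2, 8, 8, 8]])

# SciPy's "jensenshannon" metric is the Jensen-Shannon distance,
# i.e. the square root of the Jensen-Shannon divergence.
distances = pdist(histograms, metric="jensenshannon")

# Agglomerative clustering on the precomputed pairwise distances.
tree = linkage(distances, method="average")

# Cut the hierarchy into at most two flat clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

The same condensed distance vector can be turned into a square matrix with `scipy.spatial.distance.squareform` if a [[DBSCAN]] implementation that accepts precomputed distances is preferred.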
-
Retrospective::
- One week ago: [[November 8th, 2021]]
- One month ago: [[October 15th, 2021]]
- One quarter ago: [[August 15th, 2021]]
- One year ago: [[November 15th, 2020]]
-
Daily Stoic::