GenAI, Deep Learning and Computer Vision
Google engineers offered 28 actionable tests for #machinelearning systems.
Introducing "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (2017).
If #ml #training is like compilation, then ML testing should cover both #data and code.
7 model tests
1️⃣ Review model specs and keep them under version control. This makes training auditable and improves reproducibility.
2️⃣ Ensure offline model loss correlates with actual user engagement.
3️⃣ Tune all hyperparameters. Grid search, Bayesian methods, whatever you use, tune all of them (see the first sketch after this list).
4️⃣ Measure the impact of model staleness. The age-versus-quality curve shows how much staleness is tolerable.
5️⃣ Test against a simpler model regularly to confirm the benefit of more sophisticated techniques (sketch below).
6️⃣ Check that model quality holds across data segments, e.g. user country, movie genre, etc. (sketch below).
7️⃣ Test for model inclusiveness by checking quality across protected dimensions, or enrich under-represented categories.
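A minimal sketch of 3️⃣, assuming scikit-learn and a toy dataset; the estimator and grid are illustrative, not from the paper:

```python
# Illustrative only: tune *every* hyperparameter, not a hand-picked subset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [2, 3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid, cv=3, scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```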
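For 5️⃣, a sketch comparing a "production" model against trivially simple baselines on the same split (all three models here are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# If the sophisticated model barely beats the baselines,
# its extra complexity is not paying for itself.
candidates = [
    ("majority-class", DummyClassifier(strategy="most_frequent")),
    ("logistic", LogisticRegression(max_iter=1000)),
    ("production-stand-in", GradientBoostingClassifier(random_state=0)),
]
for name, model in candidates:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```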
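And for 6️⃣, the same metric computed per slice rather than as one global number (the `country` column, labels, and scores are made up):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Held-out examples with model scores attached; columns are hypothetical.
df = pd.DataFrame({
    "country": ["US", "US", "FR", "FR", "JP", "JP"],
    "label":   [1, 0, 1, 0, 1, 1],
    "score":   [0.9, 0.2, 0.6, 0.7, 0.8, 0.4],
})

for country, group in df.groupby("country"):
    if group["label"].nunique() < 2:
        print(f"{country}: single-class slice, AUC undefined")
        continue
    auc = roc_auc_score(group["label"], group["score"])
    print(f"{country}: AUC = {auc:.2f}")  # a weak slice hides in the global average
```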
7 data tests
1️⃣ Capture feature expectations in a schema built from data statistics plus domain knowledge (see the first sketch after this list).
2️⃣ Keep only beneficial features, e.g. by training a set of models each with one feature removed (sketch below).
3️⃣ Avoid costly features. Cost includes running time and RAM, as well as upstream work and instability.
4️⃣ Adhere to feature requirements. If certain features can't be used, enforce that programmatically.
5️⃣ Set privacy controls. Budget enough time for any new feature that depends on sensitive data.
6️⃣ Add new features quickly. If this conflicts with 5️⃣, privacy comes first.
7️⃣ Test the code behind every input feature. Bugs do exist in feature-creation code (sketch below).
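A minimal sketch of data test 1️⃣, assuming pandas; the schema contents and ranges are invented for illustration:

```python
import pandas as pd

# Hypothetical schema: ranges come from historical stats + domain knowledge.
SCHEMA = {
    "age":    {"dtype": "int64",   "min": 0,   "max": 120, "nullable": False},
    "rating": {"dtype": "float64", "min": 0.0, "max": 5.0, "nullable": True},
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations for this batch of data."""
    errors = []
    for col, spec in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != spec["dtype"]:
            errors.append(f"{col}: dtype {df[col].dtype} != {spec['dtype']}")
        if not spec["nullable"] and df[col].isna().any():
            errors.append(f"{col}: unexpected nulls")
        if df[col].min() < spec["min"] or df[col].max() > spec["max"]:
            errors.append(f"{col}: values outside [{spec['min']}, {spec['max']}]")
    return errors

batch = pd.DataFrame({"age": [25, 31], "rating": [4.5, None]})
print(validate(batch))  # [] means the batch meets expectations
```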
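For data test 2️⃣, a leave-one-feature-out ablation sketch (synthetic data and a simple model stand in for the real pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)  # remove feature i
    score = cross_val_score(LogisticRegression(max_iter=1000), X_drop, y, cv=5).mean()
    # A feature whose removal barely hurts (or helps) is a candidate to cut.
    print(f"feature {i}: Δaccuracy = {score - baseline:+.4f}")
```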
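And for data test 7️⃣, plain unit tests over feature-creation code, here with pytest; the `bucketize` helper is hypothetical:

```python
import pytest

def bucketize(age: int) -> str:
    """Hypothetical feature: map raw age to a categorical bucket."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

def test_boundaries():
    assert bucketize(17) == "minor"
    assert bucketize(18) == "adult"
    assert bucketize(64) == "adult"
    assert bucketize(65) == "senior"

def test_rejects_negative_age():
    with pytest.raises(ValueError):
        bucketize(-1)
```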
See the 7 infrastructure and 7 monitoring tests in the paper.
They interviewed 36 teams across Google and found:
👉 Using a checklist helps avoid mistakes (as surgeons do).
👉 Data dependencies lead to outsourced responsibility: other teams' validation may not cover your use case.
👉 A good framework promotes integration testing, which is not yet widely adopted.
👉 Assess the assessment to better assess your system.
https://research.google.com/pubs/archive/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf