Testing deployed ML services differs from testing traditional software. Unlike software with deterministic outcomes, ML systems produce probabilistic outputs and can return different results for the same input over time.
At TinyData, we build tools that help ensure the safety of production ML systems. To demonstrate this, I will walk through the approach we took to testing four commercial ML systems for gender bias. By generating datasets for black-box testing, we were able to find large categories of images that produce gender-labelling errors. We will then discuss the workflow for turning these error-producing datasets into training data that improves the systems.
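The black-box workflow above can be sketched in a few lines: query the system under test over a generated dataset, then group misclassifications by generation category to surface systematic failures rather than isolated mistakes. This is a minimal illustration, not TinyData's actual tooling; `classify_image`, the dataset layout, and the label names are all hypothetical stand-ins for the real API under test.

```python
# Minimal sketch of black-box bias testing for a gender-labelling
# service. classify_image and the dataset fields are hypothetical --
# substitute the deployed model's real API and data.
from collections import Counter


def classify_image(image):
    # Placeholder for the black-box service under test; in practice
    # this would call the deployed model over the network.
    return image["predicted"]


def find_error_categories(dataset, min_errors=2):
    """Group misclassified images by generation category so that
    systematic failures stand out from one-off mistakes."""
    errors = Counter()
    for image in dataset:
        if classify_image(image) != image["label"]:
            errors[image["category"]] += 1
    # Categories with repeated errors are candidates for targeted
    # retraining data.
    return {cat: n for cat, n in errors.items() if n >= min_errors}


# Synthetic example: each generated image is tagged with the
# category it was generated from.
dataset = [
    {"category": "short_hair", "label": "female", "predicted": "male"},
    {"category": "short_hair", "label": "female", "predicted": "male"},
    {"category": "long_hair", "label": "male", "predicted": "male"},
]
print(find_error_categories(dataset))  # → {'short_hair': 2}
```

The error-producing categories returned here are exactly the slices of generated data worth feeding back into training, which is the workflow the rest of this piece discusses.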