Building and curating representative datasets during model research and development is critical to delivering an ML system whose accuracy meets the project's requirements. After the model is deployed, monitoring accuracy and other statistical metrics in order to adjust and improve it is a natural next step in the workflow. Models working with unstructured language data may experience data shift, resulting in unpredictable and non-representative inference. With open-source APIs and commercial or open-source annotation tools, annotation can be operationalized and the analyst workload reduced. In this talk I will cover the process of generating datasets and using them for real-time precision/recall splits, with the goal of detecting data shifts away from the in-sample space to prioritize future data collection and model retraining.
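As a rough illustration of the monitoring idea described above, the sketch below tracks precision and recall over a rolling window of annotated predictions and flags possible drift when either metric drops below a threshold. The class, window size, and threshold are illustrative assumptions, not details from the talk.

```python
from collections import deque

class DriftMonitor:
    """Rolling precision/recall over annotated predictions.
    Flags possible data shift when either metric falls below a
    threshold. Names and defaults here are illustrative only."""

    def __init__(self, window: int = 500, threshold: float = 0.8):
        # Each entry is a (predicted, actual) boolean pair for one sample.
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted: bool, actual: bool) -> None:
        self.window.append((predicted, actual))

    def precision_recall(self) -> tuple[float, float]:
        tp = sum(1 for p, a in self.window if p and a)
        fp = sum(1 for p, a in self.window if p and not a)
        fn = sum(1 for p, a in self.window if not p and a)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        return precision, recall

    def drifted(self) -> bool:
        precision, recall = self.precision_recall()
        return precision < self.threshold or recall < self.threshold
```

In practice the annotated pairs would come from the human-in-the-loop annotation tooling mentioned above, and a `drifted()` signal would trigger targeted data collection or retraining rather than an immediate alert.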
Ivan is a data scientist at Teleskope, focused on building scalable models for detecting PII/PHI/Secrets and other compliance-related entities within customers' clouds. Prior to joining Teleskope, Ivan was an ML engineer at Forge.AI, a Boston-based shop working on information extraction, content extraction, and other NLP tasks. In his free time he enjoys making things from raw materials, including ceramics and cooking.