Open source active learning toolkit for computer vision

New Products | January 26, 2023

By Rich Pell

AI-assisted platform provider Encord has released Encord Active, a free, open-source, industry-agnostic toolkit designed to help machine learning (ML) engineers and data scientists understand and improve the quality of their training data and boost model performance. The company says the all-in-one toolkit lets ML teams find failure modes in their models, prioritize high-value data for labeling, and drive smart data curation.

For many use cases, says the company, such as self-driving cars and diagnostic medical models, AI suffers from a “production gap” between successful proof-of-concept models and models capable of running “in the wild.” Proof-of-concept models perform well in research environments but struggle to make predictions accurately and consistently in real-world scenarios due to issues of model robustness and reliability.

Encord Active is designed to enable ML engineers to bridge this gap using active learning for investigating the quality of their data, labels, and model performance. Active learning is a process for training models in which the model asks for data that can help improve its performance.
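
As a rough illustration of the idea, the sketch below ranks unlabeled images by how uncertain a model is about them and picks the most uncertain ones for labeling. It shows generic entropy-based uncertainty sampling, one common active learning strategy, and is not Encord Active's own code or API. The function names, image IDs, and predicted probabilities are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` unlabeled items the model is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs: class probabilities for four unlabeled images.
unlabeled_predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident prediction
    "img_002": [0.40, 0.35, 0.25],  # uncertain
    "img_003": [0.70, 0.20, 0.10],  # fairly confident
    "img_004": [0.34, 0.33, 0.33],  # close to a coin toss
}

to_label = select_for_labeling(unlabeled_predictions, budget=2)
print(to_label)
```

In a full loop, the selected images would be labeled, added to the training set, and the model retrained before the next round of selection.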

While this approach has gained traction as a theory among researchers, start-ups, and enterprises, says the company, smaller AI companies have not yet been able to implement usable active learning techniques. Encord Active is designed to allow companies of all sizes to move from theory to implementation by providing a new methodology based on “quality metrics”: computed indexes that are added on top of users’ data, labels, and models and are based on human-explainable concepts.

Current active learning methods rely on ML engineers building their own tools and creating their own versions of quality metrics, which makes the process time-consuming and expensive, says the company. Encord Active removes that work by automatically computing an assortment of pre-built quality metrics across the data, labels, and model predictions.

“As many ML engineers know, the performance of all models depends on the quality of their training data,” says Eric Landau, Co-Founder and CEO at Encord. “Encord Active is first and foremost a framework built to help machine learning engineers understand and improve their data quality iteratively and effectively. We want to contribute to the progression of the computer vision space as much as possible, so making Encord Active open source was a no-brainer.”

The quality metrics approach focuses on the automatic calculation of characteristics of images, labels, model predictions, and metadata. ML teams are then presented with a breakdown of their data, label distribution, and model performance by each metric. These insights allow them to:

Find unknown failure modes in their datasets.
Inspect whether their dataset is balanced across the different metrics and balance their dataset based on the quality metrics prior to labeling or training a model.
Identify potential outliers in their dataset that can then be removed if they are unnecessary for the use case.
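
To make the quality-metric idea concrete, here is a minimal sketch that computes one human-explainable metric, mean image brightness, and flags images whose score sits far from the rest of the dataset. It is an assumption-laden illustration rather than Encord Active's implementation: the metric, the median-based outlier rule, the 3.5 threshold, and the tiny 2x2 "images" are all chosen for the example.

```python
from statistics import mean, median


def brightness(image):
    """Mean pixel intensity of a grayscale image given as rows of 0-255 values."""
    return mean(pixel for row in image for pixel in row)


def find_outliers(scores, threshold=3.5):
    """Flag ids whose score is far from the median (modified z-score, MAD-based)."""
    med = median(scores.values())
    mad = median(abs(s - med) for s in scores.values())
    if mad == 0:
        return []  # all scores (nearly) identical, nothing stands out
    return [
        item_id
        for item_id, s in scores.items()
        if 0.6745 * abs(s - med) / mad > threshold
    ]


# Hypothetical dataset: four normally exposed images and one almost black frame.
images = {
    "img_a": [[120, 130], [125, 135]],
    "img_b": [[110, 140], [115, 125]],
    "img_c": [[128, 132], [122, 126]],
    "img_d": [[118, 122], [130, 134]],
    "img_e": [[2, 5], [3, 6]],
}

scores = {item_id: brightness(img) for item_id, img in images.items()}
outliers = find_outliers(scores)
print(outliers)
```

The same pattern extends to other per-image or per-label scores, such as blur or bounding-box area: compute the metric once, then sort, bucket, or filter the dataset by it.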

Encord Active, says the company, is also the first tool to provide actionable end-to-end active learning workflows to create an environment where models can continuously learn and improve, similar to how humans do. Within the Encord ecosystem, users can not only find valuable data to label and find label errors to re-label but also complete the workflow cycle to fix these issues.

Encord Active is available on GitHub.

Encord