Few-shot learning in production

Introduction


Given the large number of models that excel at zero-shot classification, recognizing common objects such as dogs, cars, and stop signs can be considered a largely solved problem. Identifying less common or rare objects, however, is still an active area of research. This is the scenario where large, manually annotated datasets are simply not available. In these cases it is unrealistic to expect people to take on the laborious task of assembling large image datasets, so a solution that relies on only a few illustrative examples is needed. A prime example is healthcare, where professionals may need to classify image scans of rare diseases; here, large datasets are scarce, expensive, and complex to create.

Before diving in, a few definitions might be helpful.

Zero-shot, one-shot, and few-shot learning are techniques that allow a machine learning model to make predictions for new classes with limited labeled data. The choice of technique depends on the specific problem and the amount of labeled data available for new categories or labels (classes).

  • Zero-shot learning: No labeled data is available for the new classes. The algorithm makes predictions about new classes by using prior knowledge about the relationships between classes it already knows.
  • One-shot learning: Only one labeled example is available for the new class. The algorithm makes predictions based on that single instance.
  • Few-shot learning: The goal is to make predictions for new classes based on a few examples of labeled data.

Few-shot learning, an approach focused on learning from only a few examples, is designed for situations where labeled data is scarce and difficult to generate. Training a decent image classifier often requires a large amount of training data, especially for classical convolutional neural networks. You can imagine how difficult the problem becomes when there are only a handful of labeled images (typically fewer than five) available for training.

With the advent of vision-language models (VLMs), large models that combine visual and textual data, few-shot classification has become far more feasible. These models have learned features and their variations from large amounts of Internet data, along with the connections between visual features and textual descriptions. This makes VLMs an ideal basis for fine-tuning, or otherwise leveraging, to perform downstream classification tasks when only a small amount of labeled data is provided. Deploying such a system effectively makes a few-shot classification solution much less expensive and more attractive to our customers.
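As a concrete illustration of why VLMs are a good starting point, here is a minimal zero-shot classification sketch using a publicly available CLIP checkpoint through the Hugging Face transformers library. The model name, candidate labels, and image path are placeholders; this is not the team's production code, just a sketch of the underlying idea.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP model and its preprocessing pipeline.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate labels expressed as natural-language prompts (illustrative).
labels = ["a photo of a dog", "a photo of a car", "a photo of a stop sign"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into a probability per label.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

No task-specific training is involved here; the labels are matched purely through CLIP's learned image-text alignment, which is what the few-shot methods below build on.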

We partnered with University of Toronto Engineering Science (Machine Intelligence) students during the Fall 2023 semester to take the first steps toward developing a few-shot learning system.

Adapting to new instances

Although VLMs achieve very impressive results on standard benchmarks, they generally only perform well in unseen domains with further training. One approach is to fine-tune the model on new examples. Full fine-tuning involves retraining all parameters of a pretrained model on a new task-specific dataset. Although this method can achieve strong performance, it has a few drawbacks: it requires considerable computational resources and time, and it may lead to overfitting if the task-specific dataset is small, leaving the model unable to generalize to unseen data.

The adapter method, popularized by the original CLIP-Adapter for the CLIP model, was developed to mitigate these problems. Unlike full fine-tuning, the adapter method adjusts only a small number of parameters. Small adapter modules are inserted into the model architecture and trained while the original model parameters remain frozen. This approach significantly reduces the computational cost and the risk of overfitting associated with full fine-tuning, while still allowing the model to be adapted efficiently to new tasks.
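To make the idea concrete, below is a minimal PyTorch sketch of a CLIP-Adapter-style module. The feature dimension, bottleneck reduction, and residual ratio are illustrative assumptions rather than values from the team's system; only this small module would be trained, while the CLIP backbone stays frozen.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Small bottleneck MLP applied to frozen CLIP image features (CLIP-Adapter style).

    Only these parameters are trained; the CLIP backbone remains frozen.
    """

    def __init__(self, dim: int = 512, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # mixing ratio between adapted and original features
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        adapted = self.fc(image_features)
        # Residual blend: keep most of the frozen backbone's features,
        # nudged by the small trained adapter.
        return self.alpha * adapted + (1 - self.alpha) * image_features
```

Because the trainable part is just two small linear layers, fitting it on a handful of labeled images is fast and far less prone to overfitting than retraining the whole model.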

The Tip-Adapter is a more advanced method that further improves on the CLIP-Adapter. Tip-Adapter provides a training-free framework for a few-shot learning system, meaning no fine-tuning is required (a variant that does use additional fine-tuning is even more effective than the CLIP-Adapter). The system leverages a key-value (KV) cache in which CLIP embeddings of the few-shot examples are the keys and their corresponding labels are the values. It can easily be extended into a scalable service for a variety of image classification tasks.
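A rough sketch of the training-free cache logic, following the Tip-Adapter formulation, might look like the following. The hyperparameters alpha and beta and the tensor shapes are illustrative assumptions, not values from the delivered system.

```python
import torch
import torch.nn.functional as F


def build_cache(support_features: torch.Tensor, support_labels: torch.Tensor, num_classes: int):
    """Build the key-value cache from the few labeled examples.

    Keys: L2-normalized CLIP embeddings of the support images, shape (K, D).
    Values: one-hot labels for those images, shape (K, C).
    """
    keys = F.normalize(support_features, dim=-1)
    values = F.one_hot(support_labels, num_classes).float()
    return keys, values


def tip_adapter_logits(query_features, keys, values, clip_text_weights,
                       alpha: float = 1.0, beta: float = 5.5):
    """Combine cache-based logits with CLIP's zero-shot logits (no training)."""
    q = F.normalize(query_features, dim=-1)          # (N, D)
    affinity = q @ keys.t()                          # similarity to cached keys, (N, K)
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ values   # (N, C)
    clip_logits = 100.0 * q @ clip_text_weights      # zero-shot logits from text prompts, (N, C)
    return clip_logits + alpha * cache_logits
```

Because "training" only means encoding the few support images and storing their embeddings and labels, adding a new class amounts to appending rows to the cache.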

Scaling up to production

Working with us, the University of Toronto Engineering Science team designed a system that can be deployed as a container using FastAPI, Redis, and Docker. Out of the box, it can support up to 10 million uniquely trained class instances. Notably, with the adapter method, the time required for fine-tuning drops to the order of 10 seconds.
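As a hypothetical sketch of how such a containerized service could look, the snippet below wires a FastAPI endpoint to a Redis-backed cache. The endpoint path, Redis key names, embedding dimension, and the embed_image helper are assumptions for illustration and do not necessarily mirror the team's repository.

```python
import io

import numpy as np
import redis
from fastapi import FastAPI, UploadFile
from PIL import Image

from encoder import embed_image  # hypothetical helper wrapping the CLIP image encoder

app = FastAPI()
store = redis.Redis(host="localhost", port=6379)


@app.post("/classify/{task_id}")
async def classify(task_id: str, file: UploadFile):
    image = Image.open(io.BytesIO(await file.read()))
    features = embed_image(image)  # normalized CLIP embedding, shape (512,)

    # Keys/values for this task were written to Redis when the few-shot
    # examples were registered ("trained") for this task_id.
    keys = np.frombuffer(store.get(f"{task_id}:keys"), dtype=np.float32).reshape(-1, 512)
    values = np.frombuffer(store.get(f"{task_id}:values"), dtype=np.float32).reshape(keys.shape[0], -1)

    # Training-free cache lookup, as in the Tip-Adapter sketch above.
    affinity = features @ keys.T
    logits = np.exp(-5.5 * (1.0 - affinity)) @ values
    return {"prediction": int(logits.argmax())}
```

Keeping the per-task cache in Redis is what makes registering a new task cheap: "fine-tuning" reduces to encoding a handful of images and writing two arrays.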

Their final deliverable can be found in this GitHub repository.

What's coming?

The team has identified a few directions:

  1. Different base models: CLIP has many variants and is certainly not the only VLM available. Switching models, however, may involve a trade-off between model size (and thus inference cost) and accuracy.
  2. Data augmentation: Techniques like cropping, rotation, and recoloring can help artificially increase the number of training examples (see the sketch after this list).
  3. Promising prospects from large language models (LLMs): LLMs have zero-shot capabilities (no additional training) and emerging few-shot capabilities. Can LLMs be used more widely in few-shot production systems? Only time will tell.
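For the augmentation direction, a simple pipeline with torchvision could look like the following; this is an illustrative sketch, not part of the delivered system, and the crop size, rotation range, and jitter strengths are arbitrary assumptions.

```python
from torchvision import transforms

# Illustrative augmentation pipeline to multiply a handful of labeled examples.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random cropping
    transforms.RandomRotation(degrees=15),                  # small rotations
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),        # recoloring
    transforms.RandomHorizontalFlip(),
])

# Applying `augment` several times to each of the few labeled images yields
# additional, slightly varied support examples for the cache or adapter.
```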

The UofT team consists of Arthur Allshire, Chase McDougal, Christopher Mountain, Ritok Singh, Sameer Bharatiya, and Vatsal Bagri.

