
Task vectors for vision

Task vectors are a way to encode a task into a general-purpose model without fine-tuning it. The general, high-level idea is as follows:

  • Prompt the model a few times with demonstrations of the same task
  • Store the model's activations during these prompts, and select the most useful (e.g. mean) activations as the task vector
  • Apply the task vector at inference time to make the model perform the specific task, without needing additional examples to steer the model's behaviour

In its simplest form, this means giving the model multiple prompts of the form 'France → Paris', 'The Netherlands → Amsterdam', 'United Kingdom → London', and then giving it 'Spain →' at inference time, expecting just the output 'Madrid' rather than "¡España! ¿Qué necesitas saber? 😊" or some other random text from the LLM's stochastic generation process.

The benefits of task vectors are the following:

  1. You don’t need to fine-tune a model to behave in a particular way
  2. You don’t need to use additional context describing the task at inference time, which reduces latency

In this post, we will implement task vectors from scratch, first for LLMs, and then apply the idea to Vision-Language Models (VLMs). One interesting finding: if we use text-only descriptions when extracting the task vector, this transfers to good image-text performance at inference time (!). In other words, task vectors are cross-modal: a task vector created from text-only input can be used at inference time in an image context.

Weight task vectors for model editing

Ilharco et al. originally coined the term 'task vector'. In that work, a task vector is described as 'a direction in weight space that corresponds to a particular task'. The main discovery was that if you take a model, fine-tune it on a specific task, and then subtract the original weights from the fine-tuned weights, you end up with a vector that describes the task. Though that may seem unsurprising, the resulting weight differences are informative and can be used in interesting ways (a minimal code sketch follows below):

  • Subtracting task vectors from the model weights can remove undesirable behavior. For example, we can fine-tune a language model to behave in a particular toxic way, and then subtract the resulting task vector from the original weights. The resulting model will be less toxic than the original model.
  • Combining task vectors can be competitive with fine-tuning. For example, we can take the task vectors of two different image classification tasks, add both to the original model weights, and obtain performance similar to that of two separately fine-tuned models.
  • Using task analogies can help to improve performance when data is scarce. For example, we might have little data for the class 'lion indoors', but a lot of data for the class 'dog indoors'. Using task vectors as $\hat{\tau}_{\text{lion indoors}} = \tau_{\text{lion outdoors}} + (\tau_{\text{dog indoors}} - \tau_{\text{dog outdoors}})$ can help to improve the performance of the 'lion indoors' class.

    Task vectors from analogous tasks can help to improve performance when data is scarce.

The idea of combining task vectors this way reminds me strongly of the idea of Model soups, which showed that averaging the weights of multiple models fine-tuned with different hyperparameters can improve accuracy and robustness.
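
To make this weight-space arithmetic concrete, here is a minimal PyTorch-style sketch; the state dicts, the combined-task usage example and the scaling factor alpha are placeholders rather than the exact recipe from Ilharco et al.:

def build_task_vector(pretrained_state, finetuned_state):
    # Task vector = fine-tuned weights minus pre-trained weights, per parameter.
    return {name: finetuned_state[name] - pretrained_state[name]
            for name in pretrained_state}

def apply_task_vector(pretrained_state, task_vector, alpha=1.0):
    # alpha > 0 adds the task; alpha < 0 negates it (e.g. to reduce toxicity).
    return {name: pretrained_state[name] + alpha * task_vector[name]
            for name in pretrained_state}

# Hypothetical usage: combine two tasks by summing their task vectors.
# tau_a = build_task_vector(base.state_dict(), finetuned_a.state_dict())
# tau_b = build_task_vector(base.state_dict(), finetuned_b.state_dict())
# merged = apply_task_vector(base.state_dict(),
#                            {k: tau_a[k] + tau_b[k] for k in tau_a}, alpha=0.5)
# base.load_state_dict(merged)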

Task vectors and in-context learning

The above shows that task vectors exist and that we can obtain them by fine-tuning a model on a specific task and subtracting the original model weights from the fine-tuned model. In 2023, task vectors were used to better understand in-context learning (ICL): what makes ICL work, and how does a model use the context demonstrations internally? As it turns out, models use ICL demonstrations to create task vectors. This idea was published (on the same day!) by Todd et al. and Hendel et al. in October 2023.

Input, task and output separation

Let's describe the idea of in-context learning and task vectors more formally. Our goal when constructing task vectors with ICL is to find out whether a model uses a learning algorithm A to map a set of demonstrations S into a task vector θ, independently of the input query x. A model that uses a task vector then computes its output by applying θ to the query x through a function f.

The assumption underlying this framework is that a model uses part of its layers to encode the task based on the demonstrations, and other parts of the network for encoding the query input and output. If this assumption holds, we can find specific task vectors - if not, all layers contribute a bit to the input and a bit to the task and the solution, and we won't be able to find specific vectors that correspond to the task we give to the model. Now, when trying to prove that there is this separation, we run into two problems:

  1. The layers that correspond to the learning algorithm A also have access to the input x. This creates an unwanted dependence of the task vector θ on the query x.
  2. The layers that correspond to applying the task vector to the query (f) have direct access to our demonstrations S, and can therefore do more than what we're trying to prove: that f computes the output from x and θ alone, without using S.

To solve this, the authors in Hendel et al. (2023) do the following:

  • They introduce a dummy query x', and calculate θ using this dummy query to ensure independence from the actual input query x. Specifically, they run the model on the demonstrations S followed by x', take the representation of the separator token → at the L-th layer, and store this representation as θ, which encodes the task information from S. Here, → is the character mapping the task input to the output in the example demonstrations, as in e.g. "France → Paris".
  • To ensure that the second part of the process (f) applies θ to the query x without directly accessing S, they "inject" θ at the L-th layer during a forward pass of the model with the original model weights. This injection replaces the intermediate representation at the L-th layer with θ, ensuring that subsequent layers rely on θ and x to compute the output, without access to S. By doing this, we enforce a clear separation between the task-learning phase (A) and the task-application phase (f).

To find a robust task vector, we average the task vector across a set of examples.
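
In code, this averaging step is just a mean over the stored activations. A minimal sketch, with hypothetical shapes and random tensors standing in for the extracted activations (the actual extraction is shown in the from-scratch implementation later in this post):

import torch

hidden_size = 2048  # placeholder hidden dimension
# One task vector per demonstration set, each taken at the same layer at the
# final "→" token; random tensors stand in for the real activations here.
vectors_per_run = [torch.randn(1, hidden_size) for _ in range(5)]
theta = torch.stack(vectors_per_run).mean(dim=0)  # averaged, more robust task vector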

Finding the right layer for our task vector

Now, we still need to verify our assumption that different parts of the network perform the different roles outlined above. To do this, we iterate over the model's layers, use each layer in turn as the place where we extract and then inject the task vector for a given dataset, and plot the resulting task accuracy per layer. The result of this process is shown below:

Accuracy per layer used as the task-vector layer.

We see a clear pattern for the three models evaluated: early-to-middle layers are the best task vector layers.

Results

When we do the above and use the optimal layer for our task vector, we get performance similar to standard in-context learning: we reach about 80-90% of the performance of ICL. In the plot below, Hypothesis is the task-vector approach, Regular refers to regular in-context learning where each prompt contains the example demonstrations, and Baseline refers to the scenario without any in-context learning - just a single input prompt.

Task vector accuracy comparison

Task vector interpretation

To dive deeper into what the task vectors mean, we can decode the model's output at a task vector layer. That is, we can take the hidden state at that layer, project it through the model's unembedding matrix (the linear layer that maps hidden states to vocabulary logits), and look at the top token predictions at that point. This approach is called logit lens and was published in August 2020 here. Interestingly, if we do this for our task vectors, some of the top tokens make sense and map to task descriptions rather than output predictions. For example, when we ask our model to map a country to its capital (France → Paris), the model's top tokens at the task vector contain words such as capital, central and cities. Importantly, these words never appeared in the context.

Task vector top tokens
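
As a rough illustration of logit lens, the sketch below projects an intermediate hidden state of a Hugging Face Llama-style model through its final norm and unembedding layer; the checkpoint name, layer index and prompt are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder: any Llama-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "France → Paris\nSpain →"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer = 8  # placeholder: a middle layer, where task information tends to live
hidden = out.hidden_states[layer][:, -1, :]        # hidden state of the last token
logits = model.lm_head(model.model.norm(hidden))   # final norm + unembedding ("logit lens")
top_ids = logits.topk(5, dim=-1).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))    # inspect the top predicted tokens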

Task vectors for vision

Finding Visual Task Vectors (ICML 2024) applies the idea of task vectors to the vision domain. The authors find that task vectors also exist in vision models such as MAE-VQGAN. This model can be prompted visually with examples, and works on a range of computer vision tasks such as image inpainting, colorization, edge detection or segmentation (see below).

mae_vqgan

Different from previous work, the authors start from the insight that task vectors are similar to "intermediate activations that are invariant to change within a task, but have high variance across different tasks". With this insight, we can quickly compute the 'taskness' of different activations as follows:

$$\rho_{\text{token}}(i) = \frac{\sum_{e=1}^{d} \operatorname{Var}\left(h_{\text{all}}^{i}[e]\right)}{\frac{1}{n} \sum_{j=1}^{n} \sum_{e=1}^{d} \operatorname{Var}\left(h_{\text{task}}^{j}[e]\right)}$$

Here, Var is the variance, h[e] denotes the e-th element of a vector h, and n is the number of tasks. We're essentially looking for high variance across different tasks, and low variance within a specific task.
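
As a rough sketch of how such a score could be computed, assume we have collected the activations of one token position (or attention head) into a tensor of shape (n_tasks, n_examples, d); the names, shapes and data below are hypothetical:

import torch

def taskness_score(acts):
    # acts: (n_tasks, n_examples, d) activations for one token position / attention head.
    # Numerator: variance over all examples from all tasks pooled together, summed over d.
    pooled = acts.reshape(-1, acts.shape[-1])
    cross_task_var = pooled.var(dim=0).sum()
    # Denominator: within-task variance summed over d, averaged over the n tasks.
    within_task_var = acts.var(dim=1).sum(dim=-1).mean()
    return cross_task_var / within_task_var

# Hypothetical example: 4 tasks, 32 examples each, 256-dimensional activations.
score = taskness_score(torch.randn(4, 32, 256))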

visualising_task_vectors

Scoring 'taskness' across attention blocks (x-axis) and attention heads (y-axis) according to ρ_token. High scores indicate high variance across tasks and low variance within a task, as visualised in the different subplots.

However, a high score does not by itself mean that an actual task vector has been found: low variance within a task can also simply reflect the visual similarity of that task's examples. In addition (as we will see later), MAE-VQGAN encodes the task across multiple layers and heads, rather than in a single, isolated activation.

On top of this, the way a vision task is encoded across tokens is more complex. MAE-VQGAN does not process image tokens sequentially, and as a result, multiple tokens might hold task vectors. The language-model methods described above restricted their search space to the output activations of the last token in the prompt sequence (→). Because this is not possible with vision demonstrations, which are just stitched images containing inputs and outputs, we need a different mechanism to find the task vectors. The authors propose the following (a rough sketch of the first two steps follows after the list):

  1. Compute the mean attention head outputs for each token position across a set of task examples.
  2. Replace attention head outputs at selected positions with these pre-calculated means, using a new query as input.
  3. Use the REINFORCE algorithm to optimise task vector injection positions, guiding the model towards the desired task.
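
Here is a rough sketch of steps 1 and 2 using a forward hook on a generic attention module; the module, shapes and selected_positions are placeholders, and the REINFORCE search of step 3 is omitted:

def make_patching_hook(mean_head_outputs, positions):
    # mean_head_outputs: (seq_len, n_heads, head_dim), averaged over task examples (step 1).
    # positions: set of (token_index, head_index) pairs; in the paper these are found
    # with REINFORCE (step 3), here they are assumed to be given.
    def hook(module, inputs, output):
        # Assumes the hooked module returns per-head outputs of shape
        # (batch, seq_len, n_heads, head_dim); real models may need reshaping
        # from (batch, seq_len, hidden_size) first.
        patched = output.clone()
        for token_idx, head_idx in positions:
            patched[:, token_idx, head_idx] = mean_head_outputs[token_idx, head_idx]
        return patched
    return hook

# Hypothetical usage on a new query (step 2):
# hook = attention_module.register_forward_hook(
#     make_patching_hook(mean_head_outputs, selected_positions)
# )
# prediction = model(query_image)
# hook.remove()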

Interestingly, finding task vectors this way and injecting them at inference time works better than standard visual ICL in the evaluated setting. This was not the case for the text-only task vectors, which reached only about 80-90% of standard ICL performance. Other insights from the paper:

  • Both the encoder and the decoder contribute to the encoding of the task.
  • Providing examples at inference time in addition to the task vectors does not further improve performance.
  • When applying the same procedure to Llama 2 7B, the approach outperforms standard 10-shot ICL on 2/3 tasks and gets close to 10-shot ICL performance on the third. So the approach seems to generalize across different models and tasks, and improves over the text-only task vector approach described above. However, this comes at a computational cost: REINFORCE is used to find the optimal injection positions, which is much more expensive than simply taking the last-token activation.

Multimodal Task Vectors

Task Vectors are Cross-Modal finds that task vectors are cross-modal as well. That means that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. For example, we can encode the country "France" as text or as an image of the French flag, and the resulting task vector will be similar.

cross_modal_task_vectors

Somewhat surprisingly, the authors even find that task vectors created with only text outperform task vectors created with images when images are used at inference time. The authors speculate that this is because image ICL requires an additional step to understand the task compared to text ICL. Following a similar logic, the authors show that task vectors can also be defined via brief instructions (instead of demonstrations) and patched onto image queries. Combining instruction-based vectors with image-based vectors improves performance over using image-based vectors alone.

To be frank, the reported accuracy on the tasks is very low (in the range of 20-60% for many tasks), which surprises me somewhat given the relative simplicity of the datasets used, and makes me wonder how well the task vectors are actually working. A comparison with e.g. LoRA fine-tuning is missing, and so is a clear description of how exactly the task vector is computed.

task vector VLM results

Interpreting VLM task understanding

What is interesting in the paper, however, is a deeper dive into the interpretation of the task vectors. Again using logit lens, the paper shows that token representations in VLMs undergo three distinct phases: input, task and answer, similar to the text-only task vectors described above.

task vector logit lens for vlms

Each line corresponds to the probability that the last token representation decodes to a pre-defined input, task or answer vector.

task vector output transforms across layers

Output transforms of the Country-Capital task for three different layers, for text and image ICL. Middle layers often decode to task summaries.

Task Vectors from scratch

I created a simple implementation of task vectors in PyTorch. Let's go through the code step by step. In this blog post, I will focus on the essential parts; the full implementation can be found here. In this experiment we use Llama 3 1B, so we focus on text-only task vectors.

Task vector extraction

Given a simple task, we can first create the in-context learning prompt with our demonstrations. Given a dataset (a simple dictionary containing single word inputs as the keys and the corresponding outputs as the values), we can create a few-shot prompt as follows:

import random

SEPARATOR = "→"

def get_icl_prompt(
    data: dict, nb_shots: int
) -> tuple[list[str], dict]:
    # +1 for including the dummy query
    sampled_items = random.sample(list(data.items()), nb_shots + 1)
    prompts = []
    for task_input, target_output in sampled_items[:-1]:
        prompts.append(f"{task_input}{SEPARATOR}{target_output}")
    # Last item is the dummy query: task input plus the separator
    prompts.append(sampled_items[-1][0] + SEPARATOR)
    return prompts, dict(sampled_items)
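
For example, with a small country → capital dictionary (purely illustrative data), the helper produces prompts like this:

data = {"France": "Paris", "Spain": "Madrid", "Italy": "Rome", "Germany": "Berlin"}
prompts, sampled = get_icl_prompt(data, nb_shots=2)
print(prompts)
# Possible output (the sampling is random):
# ['France→Paris', 'Italy→Rome', 'Spain→']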

All we do here is sample a few items from the dataset and create a prompt with the task input and the target output. The last item in the prompt is the dummy query: a task input plus the separator. We can then get our task vectors by extracting the activations of the model at the desired layer. First, we define a simple hook that stores the activations of the model at the desired layer for the last token of the prompt (→):

def hook_extract(module, inputs, outputs):
    # outputs[0] holds the layer's hidden states (batch, seq_len, hidden_size);
    # keep only the representation of the last token (the trailing "→").
    task_vectors.append(outputs[0][:, -1, :])

Then, we can extract the task vectors by running the model with the few-shot prompt. Here, we iterate over each layer so that we can find the optimal layer for our task vectors:

for layer_index, layer in enumerate(model.model.layers):
    task_vectors = []  # Reset task vectors for each layer
    extract_hook = layer.register_forward_hook(hook_extract)
    infer(model, tokenizer, prompt="\n".join(few_shot_prompts), device=device)
    extract_hook.remove()

task_vectors now contains the activation of the model at this layer for the final → token of the prompt (the separator after the dummy query). Our inference function is a simple function that tokenizes the prompt, runs the model, and returns the decoded output:

def infer(model, tokenizer, prompt, device):
    inputs = tokenizer(
        prompt, return_tensors="pt", padding=True, return_token_type_ids=False
    ).to(device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Decode only the newly generated token (last token of the last sequence).
    return tokenizer.decode(output_ids[-1][-1], skip_special_tokens=True)

We generate a single token output here to keep things simple, similar to what is done in the papers mentioned above (it would actually be interesting to test task vectors on more complicated tasks. Do they still perform as well?).

Now that we have our task vectors, we can inject them at inference time. We do this by defining a hook that replaces the activations of the last token at the desired layer with the task vector:

def hook_inject(module, inputs, outputs, task_vectors):
    # Overwrite the last-token representation with the stored task vector.
    outputs[0][:, -1, :] = task_vectors
    return outputs

# ... inside our layer loop (requires `from functools import partial`)
# Inject the task vector at this layer
inject_hook = layer.register_forward_hook(partial(hook_inject, task_vectors=task_vectors[0]))
for input_item, target_output in data_iterator(task_data, exclude_keys):
    # single input item, no in-context demonstrations
    answer = infer(model, tokenizer, prompt=f"\n{input_item}", device=device)
    # ... store answer, compute accuracy etc.
inject_hook.remove()

And that's essentially it! We can now compute the accuracy of our task vectors and compare it to standard in-context learning. When we do this (see the full script here), we find that the task vectors reach about 80-90% of the performance of standard in-context learning, similar to the results in the paper.
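
For reference, the regular in-context learning baseline in this comparison can be computed with the same infer helper by keeping the demonstrations in every prompt. A minimal sketch, reusing data_iterator, task_data, exclude_keys and few_shot_prompts from the snippets above, with simple exact-match scoring:

# Regular in-context learning baseline: demonstrations included in every prompt.
correct, total = 0, 0
for input_item, target_output in data_iterator(task_data, exclude_keys):
    prompt = "\n".join(few_shot_prompts[:-1] + [f"{input_item}{SEPARATOR}"])
    answer = infer(model, tokenizer, prompt=prompt, device=device)
    correct += int(answer.strip() == target_output)
    total += 1
print(f"Regular ICL accuracy: {correct / total:.2%}")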

Conclusion

  • Task vectors can be found in both text and vision models, and are cross-modal.
  • They can be used as replacements for in-context learning, and in some cases, even outperform standard in-context learning.
  • Interpreting task vectors with logit lens shows that they indeed encode task information.
  • How exactly to find task vectors is still an open question, particularly for vision-language models.