Task vectors are a way to encode tasks into a general-purpose model without fine-tuning. The general, high-level idea is as follows:
In its simplest form, this means giving the model multiple prompts of the form 'France → Paris', 'The Netherlands → Amsterdam', 'United Kingdom → London', and then giving it 'Spain →' at inference time, expecting just the output 'Madrid' instead of '¡España! ¿Qué necesitas saber? 😊' or some other random text produced by the stochastic generation process of the LLM.
The benefits of task vectors are the following:
In this post, we will implement task vectors from scratch, first for LLMs, and then apply them to Vision-Language Models (VLMs). One interesting finding we will discover is that if we use text-only descriptions when extracting the task vector, this transfers to good image-text performance at inference time (!). In other words, task vectors are cross-modal: a task vector created from text-only input can be used at inference time in an image context.
Ilharco et al. originally coined the term 'task vector'. In that work, a task vector is described as 'a direction in weight space that corresponds to a particular task'. The main discovery here was that if you take a model, fine-tune it on a specific task, and then subtract the original weights from that fine-tuned model, you end up with vectors that describe the task. Though that may seem unsurprising, the resulting weights are informative and can be used in interesting ways:
Task vectors from analogous tasks can help to improve performance when data is scarce.
The idea of combining task vectors this way reminds me strongly of the idea of Model soups, which showed that averaging the weights of multiple models fine-tuned with different hyperparameters can improve accuracy and robustness.
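As a minimal sketch of this weight-space arithmetic (the helper names below are mine, not from the paper), a task vector is simply the element-wise difference between two state dicts, and applying it is an element-wise, optionally scaled, addition:

def compute_task_vector(pretrained_state: dict, finetuned_state: dict) -> dict:
    # Weight-space task vector: fine-tuned weights minus pre-trained weights.
    return {name: finetuned_state[name] - pretrained_state[name]
            for name in pretrained_state}

def apply_task_vector(pretrained_state: dict, task_vector: dict, scale: float = 1.0) -> dict:
    # Add a (scaled) task vector to the pre-trained weights. A negative scale
    # steers the model away from the task; summing several vectors combines tasks.
    return {name: pretrained_state[name] + scale * task_vector[name]
            for name in pretrained_state}

Both functions operate on the dictionaries returned by model.state_dict() in PyTorch.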
The above shows that task vectors exist and that we can obtain them by fine-tuning a model on a specific task and then subtracting the original model weights from the fine-tuned model. In 2023, task vectors were used to better understand in-context learning (ICL). What makes ICL work - how does a model use the context demonstrations internally? As it turns out, models use ICL demonstrations to create task vectors. This idea was first published (on the same day!) by Todd et al. and Hendel et al. in October 2023.
Let's describe the idea of in-context learning and task vectors more formally. Our goal when constructing task vectors with ICL is to find out whether a model maps a set of demonstrations $S$, via a learning algorithm $\mathcal{A}$, into a task vector $\theta = \mathcal{A}(S)$ that is independent of our input query $x$. A model that uses a task vector is then an application of $\theta$ to the query $x$, defined as a function $f(x; \theta)$.
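Written out (a sketch of the formulation in Hendel et al.; see the paper for the precise setup), the hypothesis is that running the transformer $T$ on the full prompt is well approximated by this two-step process:

$$T([S, x]) \approx f(x; \theta), \qquad \theta = \mathcal{A}(S).$$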
The assumption underlying this framework is that a model uses part of its layers to encode the task based on the demonstrations, and other parts of the network for encoding the query input and output. If this assumption holds, we can find specific task vectors - if not, all layers contribute a bit to the input and a bit to the task and the solution, and we won't be able to find specific vectors that correspond to the task we give to the model. Now, when trying to prove that there is this separation, we run into two problems:
To solve this, the authors in Hendel et al. (2023) do the following:
To find a robust task vector, we average the task vector across a set of examples.
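As a minimal sketch (the variable names and the hidden size are illustrative placeholders), averaging is just a mean over per-run vectors:

import torch

# Per-run task vectors (shape [hidden_dim]), each extracted from an
# independently sampled set of demonstrations; the random tensors here
# stand in for real activations.
per_run_task_vectors = [torch.randn(2048) for _ in range(10)]
robust_task_vector = torch.stack(per_run_task_vectors).mean(dim=0)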
Now, we still need to prove our assumption that different parts of the network
perform the different tasks we outlined above. To do this, we iterate over each model layer
and use each as the layer where we extract and then inject our task vector for a given dataset.
Then, we plot the accuracy of the task when we do this for each layer. The result of this process
is shown below:
We see that there is a clear pattern here for the three models evaluated: middle to early layers are
the best task vector layers.
When we do the above, and use the optimal layer as our task vector, we get a performance
that is similar to standard in-context learning: we reach about 80-90% of the performance of ICL.
In the plot below, Hypothesis is the task-vector approach,
and Regular refers to regular in-context learning where each prompt contains the example demonstrations.
Baseline refers to the scenario where we do not have any in-context learning - just a single input prompt.
To dive deeper into what the task vectors mean, we can decode the model's output at a task vector layer.
That is, we can apply the model's unembedding - the linear layer that maps hidden states to vocabulary-sized logits - to an intermediate layer's output,
and then see what the top token predictions are at that point. This approach is called logit lens and was
published in August 2020 here.
Interestingly, if we do this for our task vectors, some of the top tokens make sense and map to task descriptions rather than
output predictions. For example, when we ask our model to map a country to a capital (France → Paris),
the model's top tokens at the task vector contain words such as capital, central, cities.
Importantly, these words never appeared in the context.
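A minimal sketch of logit lens for a Hugging Face Llama-style model (the attribute names model.model.norm and model.lm_head follow that implementation and may differ for other architectures):

import torch

@torch.no_grad()
def logit_lens_top_tokens(model, tokenizer, hidden_state, top_k=5):
    # hidden_state: intermediate activation of shape [batch, hidden_dim].
    # Apply the model's final norm and unembedding (LM head) to read off the
    # most likely tokens at this point in the network.
    logits = model.lm_head(model.model.norm(hidden_state))
    top_ids = logits.topk(top_k, dim=-1).indices[0].tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

Applied to the task-vector activation from the country-capital example, this is exactly the kind of probe that surfaces tokens like 'capital'.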
Finding Visual Task Vectors (ICML 2024) applies the idea of
task vectors to the vision domain. The authors find that task vectors
also exist in vision models, such as MAE-VQGAN.
This model can be visually prompted with examples, and works on a range of computer vision tasks such as
image inpainting, colorization, edge detection or segmentation (see below).
They take a different perspective from previous work, starting from the insight that task vectors are similar to "intermediate activations that are invariant to change within a task,
but have high variance across different tasks". With this insight,
we can quickly compute the 'taskness' of different activations as follows:
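The exact scoring function is not reproduced here, but a variance-ratio score of the following form captures the idea (this is my sketch, not necessarily the paper's formulation):

$$s_i = \frac{\operatorname{Var}_{t \in \{1, \dots, T\}}\left(\bar{z}_i^{(t)}\right)}{\frac{1}{T}\sum_{t=1}^{T}\operatorname{Var}_{j}\left(z_i^{(t,j)}\right)}, \qquad \bar{z}_i^{(t)} = \operatorname{mean}_j\, z_i^{(t,j)},$$

where $z^{(t,j)}$ is the activation at a given position (layer, head, token) for the $j$-th example of task $t$.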
Here, $\operatorname{Var}$ is the variance, $z_i$ denotes the $i$-th element of an activation vector $z$, and $T$ is the number of tasks. We're essentially looking for high variance across different tasks, and low variance within a specific task.
Scoring 'taskness' across attention blocks (x-axis) and attention heads (y-axis) according to this score. High scores indicate high variance across tasks and low variance within a task, as visualised in the different subplots.
However, computing the above does not mean that an actual task vector is found, as the similarity for a specific task can also correspond to the visual similarity of the task at hand. In addition (as we will see later), MAE-VQGAN encodes the task across multiple layers and heads, so it is not confined to a single, isolated activation.
On top of this comes the more complex question of which tokens encode the vision task. MAE-VQGAN does not process image tokens sequentially, and as a result, multiple tokens might hold task vectors. The language methods described above restricted their search space to the output activations of the last token in the prompt sequence (the '→' separator in our earlier example). Because this is not possible with vision demonstrations, which are just stitched images containing inputs and outputs, we need a different mechanism to find the task vectors. The authors propose the following:
Interestingly, using this approach to find task vectors, and then injecting these at inference time works better than standard visual ICL in the evaluated setting. This was not the case for the text-only task vectors, which were about 80-90% as good as standard ICL. Other insights from the paper:
The paper Task Vectors are Cross-Modal finds that task vectors are cross-modal as well.
That means that conceptually similar tasks are mapped to similar task-vector representations, regardless of how they are specified.
For example, we can encode the country "France" as text or as an image of the French flag, and the task vector will be similar.
Somewhat surprisingly, the authors even find
that task vectors created from text only outperform task vectors created from images, when images are used
at inference time. The authors speculate that this is because image ICL requires an additional
step to understand the task compared to text ICL. Following a similar logic, the authors show that task vectors can also be defined via
brief instructions (instead of demonstrations) and patched onto image queries. Combining instruction-based vectors
with image-based vectors improves performance over only using image-based vectors.
To be frank, the reported accuracy on the tasks is very low (in the range of 20-60% for many tasks),
which surprises me somewhat given the relative simplicity of the datasets used, and makes me wonder how well the task vectors are actually working.
A comparison with e.g. LoRA fine-tuning is missing, and so is a clear description of how exactly the task vector is computed.
What is interesting in the paper, however, is a deeper dive into the interpretation of the task vectors.
Again using logit lens, the paper also shows that tokens in VLMs undergo three distinct phases: input, task and answer,
similar to the text-only task vectors described above.
Each line corresponds to the probability that the last token representation decodes to a pre-defined input, task or answer vector.
Output transforms of the Country-Capital task for three different layers, for text and image ICL. Middle layers often decode to task summaries.
I created a simple implementation of task vectors in PyTorch. Let's go through the code step by step. In this blog post, I will focus on the essential parts; the full implementation can be found here. In this experiment, we use Llama 3 1B, so we focus just on the text-only task vectors.
Given a simple task, we can first create the in-context learning prompt with our demonstrations. Given a dataset (a simple dictionary containing single word inputs as the keys and the corresponding outputs as the values), we can create a few-shot prompt as follows:
import random

SEPARATOR = "→"

def get_icl_prompt(
    data: dict, nb_shots: int
) -> tuple[list[str], dict]:
    # +1 for including the dummy query
    sampled_items = random.sample(list(data.items()), nb_shots + 1)
    prompts = []
    for task_input, target_output in sampled_items[:-1]:
        prompts.append(f"{task_input}{SEPARATOR}{target_output}")
    # Last item is the dummy query: task input plus the separator
    prompts.append(sampled_items[-1][0] + SEPARATOR)
    return prompts, dict(sampled_items)
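For example, with a small illustrative country-capital dictionary (the datasets in the full script are larger), the generated prompt could look like this:

task_data = {
    "France": "Paris",
    "The Netherlands": "Amsterdam",
    "United Kingdom": "London",
    "Spain": "Madrid",
}

few_shot_prompts, sampled = get_icl_prompt(task_data, nb_shots=3)
print("\n".join(few_shot_prompts))
# e.g. (the sampling order is random):
# France→Paris
# United Kingdom→London
# The Netherlands→Amsterdam
# Spain→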
All we do here is sample a few tasks from the dataset, and create a prompt with the task input and the target output. The last item in the prompt is the dummy query, which is the task input plus the separator. We can then get our task vectors by extracting the activations of the model at the desired layer. First, we define a simple hook that stores the activations of the model at the desired layer for the last token of the prompt (the trailing '→' separator):
def hook_extract(module, inputs, outputs):
    # Store the activation of the last token of the prompt at this layer
    task_vectors.append(outputs[0][:, -1, :])
Then, we can extract the task vectors by running the model with the few-shot prompt. Here, we iterate over each layer so that we can find the optimal layer for our task vectors:
for layer_index, layer in enumerate(model.model.layers):
    task_vectors = []  # Reset task vectors for each layer
    extract_hook = layer.register_forward_hook(hook_extract)
    infer(model, tokenizer, prompt="\n".join(few_shot_prompts), device=device)
    extract_hook.remove()
The task_vectors list
now contains the activation at the specified layer for the final token of the prompt (the '→' after the dummy query) - this is our task vector candidate for that layer.
Our inference function is a simple function that tokenizes the prompt, runs the model, and returns the decoded output:
def infer(model, tokenizer, prompt, device):
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, return_token_type_ids=False).to(device)
    output_ids = model.generate(
        **inputs, max_new_tokens=1, do_sample=False, num_return_sequences=1, pad_token_id=tokenizer.pad_token_id
    )
    return tokenizer.decode(output_ids[-1][-1], skip_special_tokens=True)
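As a quick sanity check (illustrative; the exact output depends on the model and tokenizer), a plain in-context prompt should be completed with a single greedy token:

# No task-vector injection yet, just regular ICL in the prompt.
print(infer(model, tokenizer, prompt="France→Paris\nSpain→", device=device))
# expected: "Madrid" (or the first token of it)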
We generate a single token output here to keep things simple, similar to what is done in the papers mentioned above (it would actually be interesting to test task vectors on more complicated tasks. Do they still perform as well?).
Now that we have our task vectors, we can inject them at inference time. We do this by defining a hook that replaces the activations of the last token at the desired layer with the task vector:
from functools import partial

def hook_inject(module, inputs, outputs, task_vectors):
    # Replace the activation of the last token with the extracted task vector
    outputs[0][:, -1, :] = task_vectors
    return outputs
# ... inside our layer loop
# Inject task vectors
inject_hook = layer.register_forward_hook(partial(hook_inject, task_vectors=task_vectors[0]))
for input_item, target_output in data_iterator(task_data, exclude_keys):
    # single input item, no in-context demonstrations
    answer = infer(model, tokenizer, prompt=f"\n{input_item}", device=device)
    # ... store answer, compute accuracy etc.
inject_hook.remove()
And that's essentially it! We can now compute the accuracy of our task vectors, and compare them to standard in-context learning. When we do this (see the full script here), we find that the task vectors perform at about 80-90% of the performance of standard in-context learning, similar to the results in the paper.