Attributing Distributed LLMs with Petals

What is Petals?

Petals is a framework that enables the use of large language models without high-end GPUs by distributing training and inference across multiple machines. With Petals, you can pool compute resources with other people over the Internet and run large language models such as LLaMA, Guanaco, or BLOOM right from your desktop computer or Google Colab. See the official tutorial and the paper introducing Petals for more details.

Visualization of the Tuned Lens approach from Belrose et al. (2023)

Since Petals allows gradient computations to take place on multiple machines and is mostly compatible with the Hugging Face Transformers library, it can be used alongside inseq to attribute large language models such as LLaMA 65B or BLOOM 175B. This tutorial shows how to load an LLM from Petals and use it to attribute a generated sequence.

Attributing LLMs with Petals

First, we need to install petals and inseq with pip install inseq petals. Then, we can load an LLM from petals and attribute it with inseq. For this tutorial, we will use the LLaMA 65B model, which can be loaded as follows:

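import inseq
from petals import AutoDistributedModelForCausalLM

model_name = "enoch/llama-65b-hf"
model = AutoDistributedModelForCausalLM.from_pretrained(model_name).cuda()
# Wrap the distributed model for attribution with the input_x_gradient method
inseq_model = inseq.load_model(model, "input_x_gradient", tokenizer=model_name)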

We can now test a prompt of interest to see whether the model would provide the correct response:

from transformers import AutoTokenizer

prompt = (
    "Option 1: Take a 50 minute bus, then a half hour train, and finally a 10 minute bike ride.\n"
    "Option 2: Take a 10 minute bus, then an hour train, and finally a 30 minute bike ride.\n"
    "Which of the options above is faster to get to work?\n"
    "Answer: Option "
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)
inputs = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

# Only 1 token is generated
outputs = model.generate(inputs, max_new_tokens=1)
print(tokenizer.decode(outputs[0]))

#>>> [...] The answer is Option 1
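As a quick check of the ground truth for this prompt, the travel times of the two options can be summed by hand. The snippet below is only an illustrative sketch: the durations are transcribed from the prompt (reading "a half hour" as 30 minutes and "an hour" as 60 minutes), and the variable names are our own.

```python
# Travel times in minutes, transcribed from the prompt
# ("a half hour" = 30 minutes, "an hour" = 60 minutes).
options = {
    "Option 1": [50, 30, 10],
    "Option 2": [10, 60, 30],
}

# Total travel time per option, and the option with the smallest total
totals = {name: sum(legs) for name, legs in options.items()}
fastest = min(totals, key=totals.get)

print(totals)   # {'Option 1': 90, 'Option 2': 100}
print(fastest)  # Option 1
```

Option 1 takes 90 minutes against 100 minutes for Option 2, so "1" is indeed the expected continuation.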

We can see that the model correctly predicts Option 1 to be the faster option. Now, we can use inseq to attribute the model’s prediction and understand which features played a relevant role in determining its answer. Using the advanced features of the inseq library, we can easily perform a contrastive attribution with contrast_prob_diff_fn(), taking the probability difference between 1 and 2 as the target for gradient attribution (see our tutorial for more details).

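out = inseq_model.attribute(
    prompt,
    prompt + "1",
    # Attribute the probability difference between "1" and the contrastive "2"
    attributed_fn="contrast_prob_diff",
    contrast_targets=prompt + "2",
    step_scores=["contrast_prob_diff", "probability"],
)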

# Attributing with input_x_gradient...: 100%|██████████| 80/80 [00:37<00:00, 37.55s/it]

Our attribution results are now stored in the out variable, and have exactly the same format as the ones obtained with any other Hugging Face decoder-only model. We can now visualize the attribution results using the show() method, specifying the aggregation of our choice. Here we will use the sum of input_x_gradient scores across all 8192 dimensions of model input embeddings, without normalizing them to sum to 1:

out.show(aggregator="sum", normalize=False)
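To make the aggregation described above more concrete, here is a minimal sketch on toy data. It does not call inseq: the embeddings and gradients are made-up values with 4 dimensions instead of 8192, the helper names are hypothetical, and the normalization shown (dividing by the sum of absolute values) is only one possible choice that may differ from what inseq does internally.

```python
# Toy values only: 3 tokens with 4 embedding dimensions each
# (the real model has 8192 dimensions per token).
embeddings = [
    [0.5, -1.0, 2.0, 0.0],
    [1.0, 1.0, -1.0, 0.5],
    [0.0, 2.0, 0.5, -0.5],
]
gradients = [
    [0.2, 0.1, 0.5, 0.3],
    [-0.4, 0.2, 0.1, 0.0],
    [0.3, 0.5, -0.2, 0.4],
]


def input_x_gradient_sum(embs, grads):
    """Multiply inputs and gradients elementwise, then sum over the embedding dimension."""
    return [
        sum(e * g for e, g in zip(e_row, g_row))
        for e_row, g_row in zip(embs, grads)
    ]


def normalize_to_unit_sum(scores):
    """Hypothetical normalization: rescale so that absolute values sum to 1."""
    total = sum(abs(s) for s in scores)
    return [s / total for s in scores]


raw_scores = input_x_gradient_sum(embeddings, gradients)  # one score per token
normalized_scores = normalize_to_unit_sum(raw_scores)

print(raw_scores)         # approximately [1.0, -0.3, 0.7]
print(normalized_scores)  # approximately [0.5, -0.15, 0.35]
```

Skipping the normalization keeps the raw magnitudes of the scores, which makes them comparable across different attribution runs rather than only within a single sequence.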

From the results, we can observe that the model generally upweights minutes tokens, while attribution scores for hour are less clear-cut. We can also observe that the model predicts Option 1 with a probability of ~53% (probability), which is roughly 8% higher than that of the contrastive Option 2 (contrast_prob_diff). In light of this, we could hypothesize that the attributions are not very informative because of the model’s relatively low confidence in its prediction.
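The relation between the two step scores can be illustrated with a small sketch. It assumes that contrast_prob_diff is the difference between the next-token probability of the target ("1") and that of the contrastive alternative ("2"). The logits and the three-token vocabulary below are hypothetical and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits over a toy 3-token vocabulary
vocab = ["1", "2", "<other>"]
logits = [2.0, 1.8, -1.0]

probs = dict(zip(vocab, softmax(logits)))

# Probability of the target token, and its margin over the contrastive token
probability = probs["1"]
contrast_prob_diff = probs["1"] - probs["2"]

print(f"probability: {probability:.3f}")
print(f"contrast_prob_diff: {contrast_prob_diff:.3f}")
```

If the 8% reported above is read as an absolute difference, the model would assign roughly 45% to "2", so the two options are close to evenly matched.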


While most methods relying on prediction should work normally with petals, methods requiring access to model internals such as attention are not currently supported.