..
    Copyright 2023 The Inseq Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

#######################################################################################################################
Attributing Distributed LLMs with Petals
#######################################################################################################################

What is Petals?
-------------------------------------

`Petals `__ is a framework that makes it possible to use large language models without high-end GPUs by distributing training and inference across many machines. With Petals, you can pool compute resources with other people over the Internet and run large language models such as LLaMA, Guanaco, or BLOOM right from your desktop computer or Google Colab. See the `official tutorial `__ and the `paper `__ showcasing ``petals`` for more details.

.. image:: https://camo.githubusercontent.com/58732a64488a9be928e25f3e60e3692b989ffe212ac86cb4902d8df20a042b03/68747470733a2f2f692e696d6775722e636f6d2f525459463379572e706e67
    :align: center
    :width: 800
    :alt: Illustration of the Petals distributed inference framework

Since ``petals`` allows gradient computations to take place on multiple machines and is mostly compatible with the Hugging Face Transformers library, it can be used alongside ``inseq`` to attribute very large models such as LLaMA 65B or BLOOM 175B. This tutorial shows how to load an LLM from ``petals`` and use it to attribute a generated sequence.

Attributing LLMs with Petals
-------------------------------------

First, we need to install ``petals`` and ``inseq`` with ``pip install inseq petals``. Then, we can load an LLM from ``petals`` and attribute it with ``inseq``. For this tutorial, we will use the LLaMA 65B model, which can be loaded as follows:

.. code-block:: python

    from petals import AutoDistributedModelForCausalLM

    model_name = "enoch/llama-65b-hf"
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name).cuda()

We can now test a prompt of interest to see whether the model provides the correct response:

.. code-block:: python

    from transformers import AutoTokenizer

    prompt = (
        "Option 1: Take a 50 minute bus, then a half hour train, and finally a 10 minute bike ride.\n"
        "Option 2: Take a 10 minute bus, then an hour train, and finally a 30 minute bike ride.\n"
        "Which of the options above is faster to get to work?\n"
        "Answer: Option "
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

    # Only 1 token is generated
    outputs = model.generate(inputs, max_new_tokens=1)
    print(tokenizer.decode(outputs[0]))
    #>>> [...] The answer is Option 1

We can see that the model correctly predicts Option 1 to be the faster option (90 minutes in total, against 100 minutes for Option 2). Now, we can use ``inseq`` to attribute the model's prediction and understand which features played a relevant role in determining the model's answer. Exploiting the advanced features of the ``inseq`` library, we can easily perform a contrastive attribution that uses the probability difference between ``1`` and ``2``, computed by :func:`~inseq.attr.step_functions.contrast_prob_diff_fn`, as the target for gradient attribution (see our `tutorial `__ for more details).

.. code-block:: python

    import inseq

    # Wrap the Petals model for attribution. Assumes `model` and `tokenizer`
    # from the snippets above; the method matches the progress bar below.
    inseq_model = inseq.load_model(model, "input_x_gradient", tokenizer=tokenizer)

    out = inseq_model.attribute(
        prompt,
        prompt + "1",
        attributed_fn="contrast_prob_diff",
        contrast_targets=prompt + "2",
        step_scores=["contrast_prob_diff", "probability"],
    )
    # Attributing with input_x_gradient...: 100%|██████████| 80/80 [00:37<00:00, 37.55s/it]

Our attribution results are now stored in the ``out`` variable, and have exactly the same format as the ones obtained with any other Hugging Face decoder-only model. We can now visualize the attribution results using the :meth:`~inseq.FeatureAttributionOutput.show` method, specifying the aggregation of our choice. Here we will use the sum of ``input_x_gradient`` scores across all 8192 dimensions of the model's input embeddings, without normalizing them to sum to 1:

.. code-block:: python

    out.show(aggregator="sum", normalize=False)

.. raw:: html

From the results, we can observe that the model generally upweights ``minutes`` tokens, while attribution scores for ``hour`` are less clear-cut. We can also observe that the model predicts Option 1 with a probability of ~53% (``probability``), which is roughly 8 percentage points higher than the probability of the contrastive Option 2 (``contrast_prob_diff``). In light of this, we could hypothesize that the attributions are not very informative because the model has relatively low confidence in its prediction.

.. warning::

    While most methods relying on model predictions should work normally with ``petals``, methods requiring access to model internals, such as ``attention``, are not currently supported.
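
As a rough illustration of what the ``contrast_prob_diff`` step score discussed above measures, the sketch below computes the probability of a target token minus the probability of a contrastive token from a handful of made-up next-token logits. The helper names (``softmax``, ``contrast_prob_diff``) and the numbers in ``toy_logits`` are hypothetical, and this is not how ``inseq`` implements the step function internally; it only mirrors the definition of the score as a difference between two probabilities.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - shift) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def contrast_prob_diff(logits, target, contrast):
    """Probability of `target` minus probability of `contrast` at one step."""
    probs = softmax(logits)
    return probs[target] - probs[contrast]


# Hypothetical next-token logits, invented for illustration only.
toy_logits = {"1": 2.0, "2": 1.8, "A": 0.0}

probs = softmax(toy_logits)
diff = contrast_prob_diff(toy_logits, target="1", contrast="2")
print(f"p(1) = {probs['1']:.3f}, p(2) = {probs['2']:.3f}, diff = {diff:.3f}")
```

Read this way, a target probability of about 0.53 with a difference of about 0.08 would put the contrastive option at roughly 0.45, which is why the two options can be considered close competitors for the model.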