# unlimiformer paper - summary aggregation test

## prompt

```python
instruction = """Create a clear informative overview and explanation of the topic using the below paragraphs:

In this paper, we present a general approach to extending pre-trained models to unlimited input lengths without adding additional learning weights. We show that our approach works well on datasets longer than the maximum input for these models. For example, a dataset with a maximum input length of 16384 tokens can be extended to a maximum length of 350K tokens. We also demonstrate that our method is able to summarize even 350K token-long input sequences from BookSum.

In this paper, we describe the search step reformulation of attention. The search step uses a single storage of hidden states for space efficiency. We construct a total of two sets of datastores where L and H are the keys and values stored in each set of stores. L is the amount of storage required to retrieve the encoded tokens. H is the hidden states per head. This allows retrieval augmentation at both time and space. Instead of using a single set of decoder layers, we use a retrieval augmentation system that allows us to simultaneously store multiple sets of tokens across two different sets of storage. For example, we could store all tokens in one set of storage and retrieve them all in the same set of tokens. This would be very similar to the Memorization Transformers approach. However, instead of storing the tokens in a single memory layer, we store them in a set of multiple storage layers. This way, we don't have to store them all at once. This is why we call this reformulation "attention reformulation" rather than "attention formula." We also call it "retrieval augmentation" because it uses the same number of storage layers as the original transformer attention formula. This means that we can store the tokens across multiple storage systems without having to store every token in a separate storage system. It's not like we're trying to do something new or different. We just want to make sure that everything is working as well as possible.

In this paper, we introduce the concept of "unlimiformer," which is a machine learning technique that retrieves key information from a data store in one layer and applies it to a large set of datasets. We use the example of BookSum, where we find that Unlimiform outperforms all other training methods on the same dataset. We also find that using Unlimform in conjunction with a pre-trained model improves both the performance and the robustness of the training method.

This paper describes a method that can be used to improve the performance of unsupervised classification tasks. Specifically, it shows that unsupervised classification can be improved by using a combination of sparse and fast random-encoder training. It also shows how this technique can be extended to other tasks, such as sequence generation."""
```
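For reference, here is a minimal sketch of how this instruction could be fed to any of the checkpoints compared below via the `transformers` text2text pipeline. The checkpoint name and generation kwargs simply mirror the JSON configs listed under "model outputs"; this is an illustrative sketch, not the exact notebook code.

```python
# Minimal sketch, not the original notebook code: run `instruction` through one
# of the checkpoints compared below using the shared beam-search settings.
from transformers import pipeline

summarizer = pipeline(
    "text2text-generation",
    model="pszemraj/bart-base-dolly_hhrlhf-text2text-v3",  # any checkpoint from this log
)

result = summarizer(
    instruction,
    max_new_tokens=256,
    min_new_tokens=32,
    num_beams=8,
    early_stopping=True,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=5,
    repetition_penalty=1.3,
)
print(result[0]["generated_text"])
```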
## model outputs

### BART0

```
('Unlimiformer is a machine-learning technique that retrieves data from one '
 'layer of a data store and applies it to multiple layers of storage.')
CPU times: user 33 s, sys: 337 ms, total: 33.3 s
Wall time: 17.1 s
```

### bart0-base params

```json
{
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "encoder_no_repeat_ngram_size": 5,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "max_new_tokens": 256,
  "min_new_tokens": 32,
  "no_repeat_ngram_size": 3,
  "num_beams": 8,
  "pad_token_id": 1,
  "repetition_penalty": 1.3,
  "transformers_version": "4.29.2"
}
```

output

```
('Unlimiformer (unsupervised classification) is a machine-learning technique '
 'that retrieves data from a data store and applies it to large set of '
 'datasets')
CPU times: user 8.93 s, sys: 64.1 ms, total: 8.99 s
Wall time: 4.5 s
```

### bart0-base-pp

```json
{
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "encoder_no_repeat_ngram_size": 5,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "max_new_tokens": 256,
  "min_new_tokens": 32,
  "no_repeat_ngram_size": 3,
  "num_beams": 8,
  "pad_token_id": 1,
  "repetition_penalty": 1.3,
  "transformers_version": "4.29.2"
}
```

output

```
('Unsupervised classification (unsupervised categorization), applied to a '
 'large number of data sets, and the use of randomencoder training to improve '
 'its performance.')
CPU times: user 10.5 s, sys: 7.19 ms, total: 10.5 s
Wall time: 5.23 s
```

---

### pszemraj/bart-base-dolly_hhrlhf-text2text-v3

```json
{
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "encoder_no_repeat_ngram_size": 5,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "length_penalty": 1.2,
  "max_new_tokens": 256,
  "min_new_tokens": 32,
  "no_repeat_ngram_size": 3,
  "num_beams": 8,
  "pad_token_id": 1,
  "repetition_penalty": 1.3,
  "transformers_version": "4.29.2"
}
```

output

```
('Here is a brief overview of the topic:\n'
 'In this paper we describe a general approach to extend pre-trained model to '
 'unlimited input lengths, without adding any learning weights.\n'
 'The search step reformulation is an approach that allows us to store '
 'multiple sets of hidden states across two different storage systems.\n'
 'This method is able to retrieve all tokens in a single set of stores without '
 'having to store them all in a separate memory layer.\n'
 'In addition, we show that our method can be used to perform unsupervised '
 'classification by using sparse and fast randomencoder training.\n'
 'We also show that this technique can be applied to other tasks such as '
 'sequence generation')
CPU times: user 27.5 s, sys: 71.9 ms, total: 27.6 s
Wall time: 13.8 s
```
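All of the runs in this log share the same core beam-search settings; only `length_penalty`, `forced_bos_token_id`, and `early_stopping` differ between configs. A hedged sketch of how those shared settings could be expressed once as a reusable `GenerationConfig` (values copied from the JSON blocks above; the checkpoint name is just one of those listed here, not a prescribed choice):

```python
# Sketch only: the shared decoding settings from the JSON config blocks,
# expressed as a reusable GenerationConfig. Per-run overrides (e.g. the
# length_penalty used by some runs) can be passed directly to generate().
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=256,
    min_new_tokens=32,
    num_beams=8,
    early_stopping=True,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=5,
    repetition_penalty=1.3,
)

model_name = "pszemraj/bart-base-dolly_hhrlhf-text2text-v3"  # any checkpoint in this log
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer(instruction, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, generation_config=gen_config, length_penalty=1.2)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```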
### bart-large-mnli-dolly_hhrlhf-text2text-r2-bs - **using length penalty**

```json
{
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "encoder_no_repeat_ngram_size": 5,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "length_penalty": 1.1,
  "max_new_tokens": 256,
  "min_new_tokens": 32,
  "no_repeat_ngram_size": 3,
  "num_beams": 8,
  "pad_token_id": 1,
  "repetition_penalty": 1.3,
  "transformers_version": "4.29.2"
}
```

out

```
('An overview and explanation of this topic using the following paragraphs:\n'
 '1. In this paper, we show that a general approach to extend pre-trained '
 'models can be extended to unlimited input lengths by adding additional '
 'learning weights without adding additional weights.\n'
 '2. We demonstrate that our method works well on datasets that are longer '
 'than the maximum inputs for these models.\n'
 '3. We use the dataset from BookSum as an example.\n'
 '4. We introduce the concept of unlimiformer, a machine learning technique '
 'which retrieves key data from a data store and applies it to large sets of '
 'datasets.\n'
 '5. Using Unlimiform in conjunction with a model improves both performance '
 'and robustness of the model.\n'
 '6. It also shows that this technique can be applied to other tasks such as '
 'sequence generation and unsupervised classification.')
CPU times: user 1min 51s, sys: 749 ms, total: 1min 52s
Wall time: 56.2 s
```

### bart-large-mnli-dolly_hhrlhf-text2text-r2-bs - **without length penalty**

```json
{
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "encoder_no_repeat_ngram_size": 5,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "max_new_tokens": 256,
  "min_new_tokens": 32,
  "no_repeat_ngram_size": 3,
  "num_beams": 8,
  "pad_token_id": 1,
  "repetition_penalty": 1.3,
  "transformers_version": "4.29.2"
}
```

out

```
('An overview and explanation of this topic using the following paragraphs:\n'
 '1. In this paper, we show that a general approach to extend pre-trained '
 'models can be extended to unlimited input lengths by adding additional '
 'learning weights without adding additional weights.\n'
 '2. We demonstrate that our method works well on datasets that are longer '
 'than the maximum inputs for these models.\n'
 '3. We use the dataset from BookSum as an example.\n'
 '4. We introduce the concept of unlimiformer, a machine learning technique '
 'which retrieves key data from a data store and applies it to large sets of '
 'datasets.\n'
 '5. Using Unlimiform in conjunction with a model improves the performance and '
 'robustness of the model.')
CPU times: user 1min 43s, sys: 586 ms, total: 1min 44s
Wall time: 52.6 s
```
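To close the loop on the comparison above, a hedged sketch of re-running the same `instruction` with and without `length_penalty=1.1`. The checkpoint id is the short name from the headings (assumed here to resolve to a local path or hub repo), and the timing/printing loop is illustrative rather than the original notebook code.

```python
# Sketch: compare decoding with and without the length penalty used above.
# All other settings are copied from the two JSON blocks for this checkpoint.
import time
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "bart-large-mnli-dolly_hhrlhf-text2text-r2-bs"  # assumed local path or hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
inputs = tokenizer(instruction, return_tensors="pt", truncation=True)

shared = dict(
    max_new_tokens=256,
    min_new_tokens=32,
    num_beams=8,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=5,
    repetition_penalty=1.3,
)

for length_penalty in (1.1, None):  # with / without the length penalty
    kwargs = dict(shared, length_penalty=length_penalty) if length_penalty else dict(shared)
    start = time.perf_counter()
    output_ids = model.generate(**inputs, **kwargs)
    print(f"length_penalty={length_penalty}: {time.perf_counter() - start:.1f}s")
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

As the two runs above suggest, the penalty mainly affects how long the beam search is willing to let the summary grow (the 1.1 run keeps the sixth bullet); the wall-clock difference between the two settings is small.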