Hey Shaan! This is Ayush from India. I am doing an MTech in Artificial Intelligence at IISc Bangalore, and my MTech project will focus on increasing the context length of large language models. I liked your post; I want to go through the code and would like to have a small discussion with you.
I have sent you a request on LinkedIn. My email IDs are singhayush9084@gmail.com / ayushsingh@iisc.ac.in. I will be waiting to hear from you.
There's something I'm missing. If the original model is designed to take only 2048 tokens, then regardless of the position encoding used it can still only attend to a maximum of 2048 tokens. That's the size of the transformer input. How do you expand this on an already-trained model?
This is how you can do it in HuggingFace: https://www.linkedin.com/posts/gante_scaling-llama-and-gptneox-to-8k-input-context-activity-7085545793050320896-8OKi
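For concreteness, the approach in the linked post boils down to the rope_scaling option in recent versions of HuggingFace transformers (roughly 4.31+). A minimal sketch of that idea; the checkpoint name and scaling factor below are just placeholders, not taken from the post:

# Extend a RoPE-based model's usable context by interpolating positions
# via transformers' `rope_scaling` config option.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder: any RoPE-based checkpoint trained on 2048 tokens
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # "linear" interpolation squeezes 4x more positions into the trained RoPE range,
    # so the 2048-token model can accept inputs up to ~8192 tokens at inference time.
    rope_scaling={"type": "linear", "factor": 4.0},
)

No retraining happens here; the position encoding is rescaled at load time, which is why quality usually improves further with a bit of fine-tuning at the longer length.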
Hi, thanks for replying. That link is just another position-embedding scaling. What I don't get is that the maximum context window size of a transformer is fixed. It's a hyperparameter of the transformer used; it's hard-coded into the architecture.
Regardless of what you do to the position embedding, the transformer can only attend to that fixed number of tokens. So, for example, if the prompt is larger than the context window, only part of the prompt can fit into the context window and the rest will be truncated.
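To make concrete what I mean by hard-coded: the limit sits in the checkpoint's config, and for models with learned absolute position embeddings it is literally the size of the embedding table. A quick sketch using GPT-2 purely as an illustration (not the model from the post):

# Inspect the context-length hyperparameter stored in a checkpoint's config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gpt2")
# GPT-2 has a learned position-embedding table with `n_positions` rows,
# so any token beyond that index has no position embedding to look up.
print(config.n_positions)  # 1024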