Deploying Large Language Models: vLLM and Quantization

A step-by-step guide on how to speed up large language models

Towards Data Science

Deployment of Large Language Models (LLMs)

We live in a remarkable time of Large Language Models like ChatGPT, GPT-4, and Claude, which can perform many impressive tasks. In practically every field, from education and healthcare to arts and business, Large Language Models are being used to deliver services more efficiently. Over the past year, many excellent open-source Large Language Models, such as Llama, Mistral, Falcon, and Gemma, have been released. These open-source LLMs are available for everyone to use, but deploying them can be very challenging: they can be slow and require a lot of GPU compute power to run in real time. Different tools and approaches have been created to simplify the deployment of Large Language Models.

Many deployment tools have been created for serving LLMs with faster inference, such as vLLM, CTranslate2, TensorRT-LLM, and llama.cpp. Quantization techniques are also used to optimize GPUs for loading very large language models. In this article, I will explain how to deploy Large Language Models with vLLM and quantization.

Latency and Throughput

Some of the major factors that affect the speed performance of a Large Language Model are the GPU hardware requirements and the model size. The larger the model, the more GPU compute power is required to run it. Common benchmark metrics used to measure the speed performance of a Large Language Model are latency and throughput.

Latency: This is the time required for a Large Language Model to generate a response. It is usually measured in seconds or milliseconds.

Throughput: This is the number of tokens generated per second or millisecond by a Large Language Model.

Install Required Packages

Below are the two packages required for running a Large Language Model: Hugging Face transformers and accelerate.

pip3 install transformers
pip3 install accelerate

What is Phi-2?

Phi-2 is a state-of-the-art foundation model from Microsoft with 2.7 billion parameters. It was pre-trained on a variety of data sources, ranging from code to textbooks. Learn more about Phi-2 here.

Benchmarking LLM Latency and Throughput with Hugging Face Transformers
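The embedded benchmark code from the original article is not reproduced here, so below is a minimal sketch of what it describes, assuming the microsoft/phi-2 checkpoint from the Hugging Face Hub and an arbitrary maximum generation length. The line numbers in the breakdown further down refer to the original gist, not to this sketch.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Phi-2 and its tokenizer (the checkpoint name is an assumption).
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

prompt = "Generate a python code that accepts a list of numbers and returns the sum."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Latency: total time taken to generate the full response.
start = time.time()
output = model.generate(**inputs, max_length=150)
latency = time.time() - start

# Throughput: total number of tokens in the output divided by the latency.
throughput = len(output[0]) / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(tokenizer.decode(output[0], skip_special_tokens=True))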

Generated Output

Latency: 2.739394464492798 seconds
Throughput: 32.36171766303386 tokens/second
Generate a python code that accepts a list of numbers and returns the sum. [1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

print(sum_list([1, 2, 3, 4, 5]))

Step By Step Code Breakdown

Line 6–10: Loaded the Phi-2 model and tokenized the prompt “Generate a python code that accepts a list of numbers and returns the sum.”

Line 12–18: Generated a response from the model and obtained the latency by measuring the time required to generate the response.

Line 21–23: Obtained the total number of tokens in the generated response and divided it by the latency to calculate the throughput.

The model was run on an A1000 (16GB GPU); it achieved a latency of 2.7 seconds and a throughput of 32 tokens/second.

Deployment of A Large Language Model with vLLM

vLLM is an open-source LLM library for serving Large Language Models at low latency and high throughput.

How vLLM works

The transformer is the building block of Large Language Models. The transformer network uses a mechanism called attention, which the network uses to study and understand the context of words. The attention mechanism consists of a set of matrix computations known as attention keys and values. The memory used by the interaction of these attention keys and values affects the speed of the model. vLLM introduced a new attention mechanism called PagedAttention that efficiently manages the allocation of memory for the transformer’s attention keys and values during token generation. The memory efficiency of vLLM has proven very useful for running Large Language Models at low latency and high throughput.

This is a high-level explanation of how vLLM works. To learn more in-depth technical details, visit the vLLM documentation.

Install vLLM

pip3 install vllm==0.3.3

Run Phi-2 with vLLM
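As above, the original embedded code is not reproduced here; this is a minimal sketch of running Phi-2 with vLLM’s offline inference API, with the checkpoint name and sampling settings as assumptions.

import time
from vllm import LLM, SamplingParams

# Load Phi-2 with vLLM (checkpoint name and sampling settings are assumptions).
llm = LLM(model="microsoft/phi-2")
sampling_params = SamplingParams(temperature=0.0, max_tokens=150)

prompt = "Generate a python code that accepts a list of numbers and returns the sum."

# Latency: total time taken by llm.generate for this prompt.
start = time.time()
outputs = llm.generate([prompt], sampling_params)
latency = time.time() - start

# Throughput: number of generated tokens divided by the latency.
generated = outputs[0].outputs[0]
throughput = len(generated.token_ids) / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(generated.text)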

Generated Output

Latency: 1.218436622619629 seconds
Throughput: 63.15334836428132 tokens/second
[1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

numbers = [1, 2, 3, 4, 5]
print(sum_list(numbers))

Step By Step Code Breakdown

Line 1–3: Imported required packages from vLLM for running Phi-2.

Line 5–8: Loaded Phi-2 with vLLM, defined the prompt, and set important parameters for running the model.

Line 10–16: Generated the model’s response using llm.generate and computed the latency.

Line 19–21: Obtained the total number of tokens generated in the response and divided it by the latency to get the throughput.

Line 23–24: Obtained the generated text.

I ran Phi-2 with vLLM on the same prompt, “Generate a python code that accepts a list of numbers and returns the sum.” On the same GPU, an A1000 (16GB GPU), vLLM produces a latency of 1.2 seconds and a throughput of 63 tokens/second, compared to Hugging Face transformers’ latency of 2.85 seconds and throughput of 32 tokens/second. Running a Large Language Model with vLLM produces the same accurate result as Hugging Face, with much lower latency and higher throughput.

Note: The metrics (latency and throughput) I obtained are estimated benchmarks of vLLM performance. Model generation speed depends on many factors, such as the length of the input prompt and the size of the GPU. According to the official vLLM report, running an LLM on a powerful GPU like the A100 in a production setting with vLLM achieves 24x higher throughput than Hugging Face Transformers.

Benchmarking Latency and Throughput in Real Time

The way I calculated the latency and throughput for running Phi-2 is experimental, and I did it to explain how vLLM accelerates a Large Language Model’s performance. In real-world use cases of LLMs, such as a chat-based system where the model streams each token as it is generated, measuring latency and throughput is more complex.

A chat-based system is based on streaming output tokens. Some of the major factors that affect the LLM metrics are Time to First Token (the time required for a model to generate the first token), Time Per Output Token (the time spent per generated output token), the input sequence length, the expected output, the total number of expected output tokens, and the model size. In a chat-based system, the latency is usually a combination of the Time to First Token and the Time Per Output Token multiplied by the total number of expected output tokens.
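As a rough illustration of that formula, here is a small calculation with made-up numbers (the timings below are assumptions, not measurements):

# Illustrative numbers only.
time_to_first_token = 0.5      # seconds before the first token is streamed
time_per_output_token = 0.03   # seconds for each subsequent token
expected_output_tokens = 200   # total tokens the response is expected to contain

# Latency ≈ Time to First Token + Time Per Output Token * total expected output tokens
latency = time_to_first_token + time_per_output_token * expected_output_tokens
print(f"Estimated latency: {latency:.2f} seconds")  # 6.50 seconds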

The longer the input sequence passed into a model, the slower the response. Some of the approaches used for running LLMs in real time involve batching users’ input requests or prompts so that inference is performed on them concurrently, which helps improve throughput. In general, using a powerful GPU and serving LLMs with efficient tools like vLLM improves both latency and throughput in real time.

Run the vLLM deployment on Google Colab

Quantization of Large Language Models

Quantization is the conversion of a machine learning model from a higher precision to a lower precision by shrinking the model’s weights into smaller bits, usually 8-bit or 4-bit. Deployment tools like vLLM are very useful for inference serving of Large Language Models at very low latency and high throughput. We are able to run Phi-2 with Hugging Face and vLLM conveniently on the T4 GPU on Google Colab because it is a smaller LLM with 2.7 billion parameters. A 7-billion-parameter model like Mistral 7B, by contrast, cannot be run on Colab with either Hugging Face or vLLM. Quantization is best suited for managing the GPU hardware requirements of Large Language Models. When GPU availability is limited and we need to run a very large Language Model, quantization is the best approach to load LLMs on constrained devices.

BitsandBytes

It is a Python library with custom quantization functions for shrinking a model’s weights into lower bits (8-bit and 4-bit).

Install BitsandBytes

pip3 install bitsandbytes

Quantization of Mistral 7B Model

Mistral 7B, a 7-billion-parameter model from MistralAI, is one of the best state-of-the-art open-source Large Language Models. I will go through a step-by-step process of running Mistral 7B with different quantization techniques that can be run on the T4 GPU on Google Colab.

Quantization with 8-bit Precision: This is the conversion of a machine learning model’s weights to 8-bit precision. BitsandBytes has been integrated with Hugging Face transformers so that a language model can be loaded with the usual Hugging Face code, with only minor modifications for quantization, as shown in the sketch below.
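The original gist is not reproduced here; this is a minimal sketch of 8-bit loading along the lines the breakdown below describes, with the Mistral checkpoint name as an assumption.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantization config: load the model's weights in 8-bit precision.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

# Load the quantized model; device_map lets bitsandbytes place it on the GPU.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)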

Line 1: Imported the packages needed for running the model, including the BitsAndBytesConfig class.

Line 3–4: Defined the quantization config and set the parameter load_in_8bit to True to load the model’s weights in 8-bit precision.

Line 7–9: Passed the quantization config into the function for loading the model, and set the parameter device_map so that bitsandbytes automatically allocates appropriate GPU memory for loading the model. Finally, loaded the tokenizer.

Quantization with 4-bit Precision: This is the conversion of a machine learning model’s weights to 4-bit precision.

The code for loading Mistral 7B in 4-bit precision is similar to that for 8-bit precision, apart from a few changes (see the sketch after this list):

  • Changed load_in_8bit to load_in_4bit.
  • A new parameter, bnb_4bit_compute_dtype, is introduced into the BitsAndBytesConfig to perform the model’s computation in bfloat16. bfloat16 is a computation data type that speeds up inference, and it can be used with both 4-bit and 8-bit precision.
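A minimal sketch of the changed configuration (only the quantization settings differ from the 8-bit sketch above):

import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config with bfloat16 compute.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit instead of 8-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # run computation in bfloat16
)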

NF4 (4-bit Normal Float) and Double Quantization

NF4 (4-bit NormalFloat) from QLoRA is an optimal quantization approach that yields better results than standard 4-bit quantization. It is combined with double quantization, in which quantization happens twice: the quantization constants from the first stage are themselves quantized in a second stage, yielding optimal float-range values for the model’s weights. According to the QLoRA paper, NF4 with double quantization does not suffer a drop in accuracy performance. Read more in-depth technical details about NF4 and double quantization in the QLoRA paper.
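The original gist is not reproduced here; below is a minimal sketch of NF4 with double quantization along the lines the breakdown describes, again assuming the same Mistral checkpoint.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization with double quantization and bfloat16 compute.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute for faster inference
)

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)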

Line 4–9: Extra parameters were set in the BitsAndBytesConfig:

  • load_in_4bit: Loading the model in 4-bit precision is set to True.
  • bnb_4bit_quant_type: The type of quantization is set to nf4.
  • bnb_4bit_use_double_quant: Double quantization is set to True.
  • bnb_4bit_compute_dtype: The bfloat16 computation data type is used for faster inference.

Line 11–13: Loaded the model’s weights and tokenizer.

Full Code for Model Quantization
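The full gist is not reproduced here; the sketch below ties the pieces together, assuming the same Mistral checkpoint, an NF4 config, and a Mistral-style [INST] prompt.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

# NF4 double quantization config, as in the section above.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prompt in the Mistral instruction format, followed by generation.
prompt = "[INST] What is Natural Language Processing? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(output[0], skip_special_tokens=True))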

Generated Output

[INST] What is Natural Language Processing? [/INST] Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and
computer science that deals with the interaction between computers and human language. Its main goal is to read, decipher,
understand, and make sense of human language in a valuable way. It can be used for various tasks such as speech recognition,
text-to-speech synthesis, sentiment analysis, machine translation, part-of-speech tagging, named entity recognition,
summarization, and question-answering systems. NLP technology allows machines to recognize, understand,
and respond to human language in a more natural and intuitive way, making interactions more accessible and efficient.

Quantization is a very good approach for optimizing very Large Language Models to run on smaller GPUs, and it can be applied to any model, such as Llama 70B, Falcon 40B, and MPT-30B. According to the LLM.int8() paper, very Large Language Models suffer less from accuracy drops when quantized compared to smaller ones. Quantization is best applied to very Large Language Models, and it does not work as well for smaller models because of the loss in accuracy performance.

Run Mistral 7B Quantization on Google Colab

Conclusion

In this article, I provided a step-by-step approach to measuring the speed performance of a Large Language Model, explained how vLLM works, and showed how it can be used to improve the latency and throughput of a Large Language Model. Finally, I explained quantization and how it is used to load Large Language Models on small-scale GPUs.

Reach out to me via:

Email: olafenwaayoola@gmail.com

Linkedin: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

