Batch Manager
The Batch Manager in TensorRT-LLM is the component that handles and processes multiple requests simultaneously to maximise GPU utilisation.
It allows new requests to be added and completed requests to be returned dynamically during the processing loop, reducing waiting times and improving overall performance.
The Batch Manager enables in-flight batching of requests, also known as continuous batching or iteration-level batching.
The goal is to reduce wait times in queues, eliminate the need for padding requests, and achieve higher GPU utilisation.
The Batch Manager exposes hooks for the user to define how TensorRT-LLM reads in new requests and returns completed requests.
In-flight batching refers to the ability to process newly arrived requests and return completed requests at each iteration of the token generation loop. This technique is essential for reducing wait times and maximising GPU utilisation.
A basic explanation of the process
Dynamic Request Incorporation
The Batch Manager allows for the inclusion of newly arrived requests and the return of newly completed requests at each iteration of the token generation loop.
This dynamic incorporation of requests enables the Batch Manager to efficiently utilise the available computational resources and minimise idle time.
Callback-based API
The Batch Manager exposes a callback-based API that allows clients to interact with it using two mandatory callbacks: GetInferenceRequestsCallback and SendResponseCallback.
GetInferenceRequestsCallback is used by the client to pass new requests to the Batch Manager, while SendResponseCallback is used by the Batch Manager to deliver completed responses back to the client.
Optional callbacks, such as PollStopSignalCallback for interrupting requests and ReturnBatchManagerStatsCallback for reporting execution statistics, provide additional flexibility and control.
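As a rough illustration of this callback pattern (not the library's exact signatures), a client might wrap its own request queue and response handler in two std::function objects. The InferenceRequest and NamedTensor types, the parameter lists, and the Client structure below are assumptions made for the sketch:

```cpp
#include <cstdint>
#include <functional>
#include <list>
#include <memory>
#include <string>

// Hypothetical stand-ins for the Batch Manager's request/response types;
// the real library defines richer versions of these.
struct InferenceRequest { uint64_t requestId; /* input ids, sampling params, ... */ };
struct NamedTensor { std::string name; /* tensor payload omitted */ };

// Assumed callback shapes for the sketch: pull up to N new requests,
// push back a (possibly partial) response identified by request ID.
using GetInferenceRequestsCallback =
    std::function<std::list<std::shared_ptr<InferenceRequest>>(int32_t maxNumRequests)>;
using SendResponseCallback =
    std::function<void(uint64_t requestId, std::list<NamedTensor> const& outputs,
                       bool isFinal, std::string const& errMsg)>;

struct Client
{
    std::list<std::shared_ptr<InferenceRequest>> pending;  // filled by the serving layer

    // Invoked by the Batch Manager at every generation iteration.
    std::list<std::shared_ptr<InferenceRequest>> getNewRequests(int32_t maxNumRequests)
    {
        std::list<std::shared_ptr<InferenceRequest>> batch;
        while (!pending.empty() && static_cast<int32_t>(batch.size()) < maxNumRequests)
        {
            batch.push_back(pending.front());
            pending.pop_front();
        }
        return batch;
    }

    // Invoked by the Batch Manager whenever output is ready for a request.
    void handleResponse(uint64_t /*id*/, std::list<NamedTensor> const& /*outputs*/,
                        bool /*isFinal*/, std::string const& /*errMsg*/)
    {
        // Forward generated tokens to the caller; release bookkeeping on the final response.
    }
};

// Wiring (lambdas capture the client):
// GetInferenceRequestsCallback getCb = [&client](int32_t n) { return client.getNewRequests(n); };
// SendResponseCallback sendCb = [&client](uint64_t id, std::list<NamedTensor> const& out,
//                                         bool fin, std::string const& err)
//     { client.handleResponse(id, out, fin, err); };
```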
Request Identification and Tracking
Each request passed to the Batch Manager is associated with a unique 64-bit request ID.
The request ID is used to identify and track requests throughout their lifecycle, from submission to completion.
The Batch Manager ensures that request IDs are not reused for active requests, preventing conflicts and ensuring proper request management.
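A minimal sketch of how a client might mint unique 64-bit request IDs and track which requests are still active; the tracking structure is an illustration on the client side, not part of the library:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_set>

class RequestTracker
{
public:
    // Mint a new 64-bit ID that does not collide with any active request.
    uint64_t nextId()
    {
        uint64_t id = mNextId.fetch_add(1);
        std::lock_guard<std::mutex> lock(mMutex);
        mActive.insert(id);
        return id;
    }

    // Called when the final response for a request has been delivered.
    void markCompleted(uint64_t id)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mActive.erase(id);
    }

    bool isActive(uint64_t id) const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mActive.count(id) > 0;
    }

private:
    std::atomic<uint64_t> mNextId{1};
    mutable std::mutex mMutex;
    std::unordered_set<uint64_t> mActive;
};
```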
Flexible Batching Schemes
The Batch Manager supports different batching schemes, including V1, InflightBatching, and InflightFusedBatching.
V1 refers to the traditional batching scheme where requests in a batch run in lockstep until the full generation for all of them is complete, with padding applied to ensure equal sequence lengths.
InflightBatching incorporates newly arrived requests dynamically into the batch under execution and returns requests as soon as their end conditions are met, without any padding.
InflightFusedBatching improves upon InflightBatching by leveraging additional operation fusion opportunities for superior performance.
Scheduling Policies
The Batch Manager provides different scheduling policies to control how requests are selected for execution in each iteration of the generation loop.
The MAX_UTILIZATION policy aims to maximise GPU throughput by packing as many requests as the underlying TRT engine can support in each iteration, potentially requiring requests to be paused and restarted based on KV cache memory availability.
The GUARANTEED_NO_EVICT policy uses KV cache more conservatively, ensuring that once a request is started, it will run to completion without eviction.
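A hedged sketch of how the batching scheme and scheduling policy choices might be expressed when configuring the manager; the enum and struct names below are illustrative placeholders rather than the library's exact types:

```cpp
#include <cstdint>

// Illustrative enums mirroring the batching schemes and scheduling policies
// described above; the real library exposes its own definitions.
enum class BatchingScheme { V1, InflightBatching, InflightFusedBatching };
enum class SchedulerPolicy { MAX_UTILIZATION, GUARANTEED_NO_EVICT };

struct BatchManagerConfig
{
    BatchingScheme scheme = BatchingScheme::InflightFusedBatching;  // best fusion opportunities
    SchedulerPolicy policy = SchedulerPolicy::GUARANTEED_NO_EVICT;  // never evict a started request
    int32_t maxBatchSize = 64;                                      // upper bound per iteration
};

int main()
{
    BatchManagerConfig config;
    // Trade predictability for throughput: MAX_UTILIZATION packs more requests
    // per iteration but may pause/restart them when KV cache memory runs low.
    config.policy = SchedulerPolicy::MAX_UTILIZATION;
    return 0;
}
```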
KV Cache Management
The Batch Manager includes configuration options for managing the key-value (KV) cache, which is crucial for efficient memory utilisation during inference.
The kvCacheConfig parameter allows specifying the maximum number of tokens reserved for KV cache across all requests (maxTokens) and the maximum fraction of GPU memory that can be used for KV cache (freeGpuMemoryFraction).
Additional options, such as enableBlockReuse for reusing previously computed KV cache blocks across requests, provide further optimisation opportunities.
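As a sketch, these options can be grouped in a small configuration struct; the field names follow the documentation above, but the struct itself and the chosen values are illustrative assumptions:

```cpp
#include <cstdint>
#include <optional>

// Sketch of a KV cache configuration holding the options discussed above.
struct KvCacheConfig
{
    std::optional<int32_t> maxTokens;            // cap on tokens reserved for KV cache across all requests
    std::optional<float> freeGpuMemoryFraction;  // max fraction of free GPU memory usable for KV cache
    bool enableBlockReuse = false;               // reuse previously computed KV cache blocks across requests
};

int main()
{
    KvCacheConfig kvCacheConfig;
    kvCacheConfig.freeGpuMemoryFraction = 0.9f;  // leave some free GPU memory for other allocations
    kvCacheConfig.enableBlockReuse = true;       // helps when many requests share a common prompt prefix
    return 0;
}
```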
Response Content
The responses returned by the Batch Manager through the SendResponseCallback include various tensors related to the request, such as output IDs, sequence lengths, context logits, generation logits, log probabilities, and cumulative log probabilities.
The shape and availability of these tensors depend on the configuration of the TensorRT-LLM engine, such as the presence of the gather_context_logits, gather_generation_logits, or gather_all_token_logits flags during engine build.
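As an illustration of consuming such a response, a client might look up tensors by name and treat the logits tensors as optional, since their presence depends on the engine build flags. The tensor type and the tensor names used here are assumptions for the sketch:

```cpp
#include <list>
#include <optional>
#include <string>

// Placeholder for a named output tensor delivered through SendResponseCallback.
struct NamedTensor { std::string name; /* shape and data omitted */ };

// Find a tensor by name; returns nullopt when the engine was not built to emit it.
std::optional<NamedTensor> findTensor(std::list<NamedTensor> const& outputs, std::string const& name)
{
    for (auto const& t : outputs)
    {
        if (t.name == name) return t;
    }
    return std::nullopt;
}

void consumeResponse(std::list<NamedTensor> const& outputs)
{
    auto outputIds = findTensor(outputs, "output_ids");          // always expected
    auto contextLogits = findTensor(outputs, "context_logits");  // only if gather_context_logits was set
    // ... decode outputIds, optionally post-process the logits if present ...
}
```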
Multi-GPU Execution
The Batch Manager supports multi-GPU execution using either tensor or pipeline parallelism.
When running on multiple GPUs, each GPU rank runs its own instance of the Batch Manager, and care must be taken to ensure that all ranks see the same inputs at each iteration of the generation loop.
Techniques like MPI broadcast can be used in the GetInferenceRequestsCallback to synchronise the set of requests across all ranks.
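One hedged way to implement such a broadcast is to have rank 0 poll the client and broadcast a serialised copy of the new requests to the other ranks inside the callback. The serialize/deserialize helpers and the callback shape below are assumptions for the sketch:

```cpp
#include <mpi.h>

#include <cstdint>
#include <list>
#include <memory>
#include <vector>

struct InferenceRequest;  // the Batch Manager's request type (opaque here)

// Hypothetical helpers: convert a list of requests to/from a byte buffer.
std::vector<char> serialize(std::list<std::shared_ptr<InferenceRequest>> const& reqs);
std::list<std::shared_ptr<InferenceRequest>> deserialize(std::vector<char> const& buf);

// Sketch of a GetInferenceRequestsCallback body that keeps all ranks in sync:
// only rank 0 talks to the client, every rank receives the same request set.
std::list<std::shared_ptr<InferenceRequest>> getRequestsAllRanks(
    std::list<std::shared_ptr<InferenceRequest>> (*pollClient)())
{
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<char> buffer;
    if (rank == 0)
    {
        buffer = serialize(pollClient());
    }

    // Broadcast the payload size first, then the payload itself.
    int64_t size = static_cast<int64_t>(buffer.size());
    MPI_Bcast(&size, 1, MPI_INT64_T, 0, MPI_COMM_WORLD);
    buffer.resize(static_cast<size_t>(size));
    MPI_Bcast(buffer.data(), static_cast<int>(size), MPI_CHAR, 0, MPI_COMM_WORLD);

    return deserialize(buffer);  // every rank now sees the identical request set
}
```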
Integration with Triton Inference Server
TensorRT-LLM provides a Triton Inference Server C++ backend that includes the necessary mechanisms to serve models using in-flight batching with the Batch Manager.
The Triton backend serves as a good starting example of how to implement in-flight batching using the TensorRT-LLM Batch Manager in a production environment.
Flexibility and Customization
The Batch Manager offers configuration options to tune performance for specific requirements, including the maximum batch size, the scheduling policy, and memory management settings.
Integration
The Batch Manager can be easily integrated into existing systems or frameworks.
It provides a straightforward API for submitting requests and retrieving results.
Batch Manager API
The Batch Manager API allows a software component (referred to as the client) to interact with the Batch Manager.
The API includes mandatory and optional callbacks that serve various functions. These callbacks are invoked at regular intervals during the generation loop.
Get and Send Callbacks
Two essential callbacks are the GetInferenceRequestsCallback and SendResponseCallback.
The GetInferenceRequestsCallback is used to pass new requests to the Batch Manager, and it returns a list of requests to be processed.
The SendResponseCallback is responsible for delivering responses to the client, including output tensors and error messages if applicable. The last response for a request is marked with a boolean flag.
Responses from SendResponseCallback are stored in a list of shared pointers to tensor objects, containing output IDs, sequence length, context logits, generation logits, log probabilities, and cumulative log probabilities.
Summary:
GetInferenceRequestsCallback: retrieves a list of new requests to be processed.
SendResponseCallback: delivers responses to the client.
Request Interruption
The Batch Manager allows users to stop the execution of requests that are currently in-flight.
The set of request IDs to be stopped can be provided to the Batch Manager through the PollStopSignalCallback callback. This feature provides control over request execution.
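A minimal sketch of the stop-signal pattern, assuming the callback simply returns the set of request IDs the client has flagged for cancellation (the callback shape and the helper class are assumptions):

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_set>

// Assumed shape of the stop-signal callback: return the IDs to interrupt.
using PollStopSignalCallback = std::function<std::unordered_set<uint64_t>()>;

class StopSignals
{
public:
    // Called by the serving layer when a user cancels a request.
    void requestStop(uint64_t requestId)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mToStop.insert(requestId);
    }

    // Polled by the Batch Manager each iteration; drains the pending set.
    std::unordered_set<uint64_t> poll()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        std::unordered_set<uint64_t> out;
        out.swap(mToStop);
        return out;
    }

private:
    std::mutex mMutex;
    std::unordered_set<uint64_t> mToStop;
};

// Wiring: PollStopSignalCallback stopCb = [&signals]() { return signals.poll(); };
```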
Statistics Reporting
The Batch Manager can report execution statistics when provided with the ReturnBatchManagerStatsCallback callback.
The statistics are packaged as a JSON string and include information such as timestamps, iteration counters, and the number of active requests. This feature helps in monitoring and optimising the batch processing.
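As a sketch, the statistics callback can simply forward the JSON string to whatever logging or metrics pipeline the client uses; the callback shape and the example JSON keys below are assumptions:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Assumed shape of the stats callback: the Batch Manager hands over one JSON
// string per iteration with timestamps, iteration counters, active request counts, etc.
using ReturnBatchManagerStatsCallback = std::function<void(std::string const&)>;

int main()
{
    ReturnBatchManagerStatsCallback statsCb = [](std::string const& statsJson)
    {
        // In production this would feed a metrics system; here we just log it.
        std::cout << "batch_manager_stats " << statsJson << std::endl;
    };

    // Example of what an invocation by the Batch Manager might look like.
    statsCb(R"({"timestamp":"...","iteration_counter":42,"active_request_count":3})");
    return 0;
}
```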
The Batch Manager in TensorRT-LLM is a sophisticated component designed to optimise the processing of inference requests, particularly useful for high-throughput scenarios.
Scheduler Policy
The scheduler policy (MAX_UTILIZATION or GUARANTEED_NO_EVICT) dictates how requests are scheduled, balancing maximising GPU utilisation against ensuring sufficient memory allocation.
Multi-GPU Execution
When running on multiple GPUs using tensor or pipeline parallelism, each GPU rank runs its own instance of GptManager.
The number of visible GPUs can be controlled using the CUDA_VISIBLE_DEVICES environment variable.
All ranks must see the same inputs at each iteration of the generation loop, which can be ensured using an MPI broadcast in GetInferenceRequestsCallback.
Summary
In summary, the Batch Manager in TensorRT-LLM enables efficient in-flight batching of requests, allowing for dynamic inclusion and completion of requests during the token generation loop.
It provides a flexible API with callbacks for request handling, response delivery, and statistics reporting.
The Batch Manager integrates seamlessly with the Triton Inference Server and supports multi-GPU execution using tensor or pipeline parallelism.