trt-llm build command

The trtllm-build command-line tool is designed to provide a convenient way to build TensorRT engines for large language models using the TensorRT-LLM library.

It encapsulates the necessary configuration options and build process into a single command that can be executed from the command line.

The tool is structured in two main parts

The build.py file: This file contains the actual implementation of the trtllm-build command.

It defines the main function, which serves as the entry point for the command-line tool.

The main function parses the command-line arguments, creates the necessary configurations, and invokes the parallel_build function to build the engines.
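
For orientation, here is a heavily simplified, hypothetical sketch of that entry-point flow. It is not the verbatim TensorRT-LLM source: the argument names follow the options discussed on this page, and parallel_build is reduced to a stub standing in for the real engine-building logic.

import argparse

def parallel_build(args):
    # Stand-in for the real engine-building routine described on this page;
    # in TensorRT-LLM it builds the engines and writes them to the output directory
    print(f"Building engines from {args.checkpoint_dir} into {args.output_dir}")

def main():
    # Argument names follow the options discussed on this page;
    # the real build.py defines many more
    parser = argparse.ArgumentParser(prog='trtllm-build')
    parser.add_argument('--checkpoint_dir')
    parser.add_argument('--model_config')
    parser.add_argument('--max_batch_size', type=int)
    parser.add_argument('--max_input_len', type=int)
    parser.add_argument('--output_dir')
    args = parser.parse_args()

    # Hand the parsed configuration to the build routine
    parallel_build(args)

if __name__ == '__main__':
    main()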

The setup.py file: This file is responsible for creating the trtllm-build command during the package installation process.

It includes an entry_points parameter that maps the trtllm-build command to the main function in build.py.

When the TensorRT-LLM package is installed using python setup.py install or pip install, the entry_points are processed, and the trtllm-build command is created and installed in the system's executable path or the virtual environment's bin directory.
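
To make the packaging side concrete, the relevant portion of setup.py looks roughly like the sketch below. The exact module path and surrounding metadata are assumptions and may differ between releases.

from setuptools import setup, find_packages

setup(
    name='tensorrt_llm',
    packages=find_packages(),
    # Maps the console command name to a Python function: on installation,
    # pip/setuptools generate a small `trtllm-build` executable that imports
    # the module and calls its main() function
    entry_points={
        'console_scripts': [
            'trtllm-build=tensorrt_llm.commands.build:main',
        ],
    },
)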

To use the trtllm-build command-line tool, you need to have the TensorRT-LLM package installed.

Once installed, you can execute the trtllm-build command from the command line, passing the necessary arguments to configure the build process.

The available arguments include specifying the checkpoint directory, model configuration, build configuration, maximum batch size, maximum input length, and various other options.

For example, you can run the command like this:

trtllm-build --checkpoint_dir /path/to/checkpoint --model_config /path/to/model_config.json --max_batch_size 8 --max_input_len 512 --output_dir /path/to/output

This command will build the TensorRT engines using the specified checkpoint directory, model configuration, maximum batch size of 8, maximum input length of 512, and save the generated engines in the specified output directory.
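
Because the tool is a standard command-line program, you can list every option supported by your installed version by running:

trtllm-build --help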

Building a Front End

Since the trtllm-build command is a command-line tool, you can create a graphical user interface (GUI) or a web-based front-end that allows users to input the necessary configuration options and generates the corresponding command-line arguments.

For example, you can create a simple web form that prompts the user to enter the checkpoint directory, model configuration file, and other relevant options.

When the user submits the form, your front-end can generate the appropriate command-line arguments and execute the trtllm-build command behind the scenes.

Here's a simple example using Python and the Flask web framework:

from flask import Flask, render_template, request
import subprocess

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def build_form():
    if request.method == 'POST':
        # Read the build options submitted through the HTML form
        checkpoint_dir = request.form['checkpoint_dir']
        model_config = request.form['model_config']
        max_batch_size = request.form['max_batch_size']
        max_input_len = request.form['max_input_len']
        output_dir = request.form['output_dir']

        # Pass the arguments as a list rather than a single shell string,
        # so that user input is never interpreted by the shell
        command = [
            "trtllm-build",
            "--checkpoint_dir", checkpoint_dir,
            "--model_config", model_config,
            "--max_batch_size", max_batch_size,
            "--max_input_len", max_input_len,
            "--output_dir", output_dir,
        ]
        result = subprocess.run(command)

        if result.returncode != 0:
            return "Build failed - check the server logs for details."
        return "Build completed!"

    # Renders templates/build_form.html, which must define the form fields above
    return render_template('build_form.html')

if __name__ == '__main__':
    app.run()

In this example, the Flask app renders an HTML form (build_form.html) that allows the user to input the necessary configuration options.

When the form is submitted, the app retrieves the user input, generates the corresponding trtllm-build command, and executes it using the subprocess module.
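
To try the sketch locally, save the code as app.py, create a templates/build_form.html file containing text inputs named checkpoint_dir, model_config, max_batch_size, max_input_len and output_dir, and start the Flask development server with:

python app.py

The form will then be available at http://127.0.0.1:5000/ (Flask's default development address).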

This is just a simple example to illustrate the concept. You can further enhance the front-end by adding more options, validation, error handling, and a more user-friendly interface.

By creating a front-end, you can provide a more intuitive and user-friendly way for users to interact with the trtllm-build command-line tool, abstracting away the complexities of the command-line arguments.
