Compiling LLama Models

Below is a list of the command-line arguments used when converting Llama checkpoints for compilation into TensorRT-LLM engines (example invocations follow the list):

  1. --model_dir: Specifies the directory where the pre-trained model is stored.

  2. --meta_ckpt_dir: Specifies the directory where the meta checkpoint is stored.

  3. --tp_size: Sets the N-way tensor parallelism size.

  4. --pp_size: Sets the N-way pipeline parallelism size.

  5. --dtype: Determines the data type for the model weights, with options including float32, bfloat16, and float16.

  6. --vocab_size: Specifies the vocabulary size of the model.

  7. --n_positions: Sets the number of positions for embeddings.

  8. --n_layer: Specifies the number of layers in the model.

  9. --n_head: Sets the number of attention heads in the model.

  10. --n_kv_head: Specifies the number of key-value heads, if different from n_head.

  11. --n_embd: Sets the dimensionality of embeddings.

  12. --inter_size: Specifies the size of the intermediate layer in the transformer.

  13. --rms_norm_eps: Sets the epsilon value for RMS normalization.

  14. --use_weight_only: Enables quantization of weights only, without affecting activations.

  15. --disable_weight_only_quant_plugin: Disables the plugin implementation for weight-only quantization, using the out-of-the-box implementation instead.

  16. --weight_only_precision: Defines the precision for weight-only quantization, with options including int8, int4, int4_awq, and int4_gptq.

  17. --smoothquant: Activates SmoothQuant quantization with a specified alpha parameter.

  18. --per_channel: Uses a different static scaling factor for each channel of the GEMM result during quantization.

  19. --per_token: Chooses a custom scaling factor for each token at runtime during quantization.

  20. --int8_kv_cache: Enables INT8 quantization for the key-value cache.

  21. --ammo_quant_ckpt_path: Path to a quantized model checkpoint in .npz format.

  22. --per_group: Chooses a custom scaling factor for each group at runtime, specifically for GPTQ/AWQ quantization.

  23. --quantize_lm_head: Quantizes the language model head weights as well when using INT4_AWQ.

  24. --enable_fp8: Uses FP8 linear layers for the attention QKV/dense projections and the MLP.

  25. --fp8_kv_cache: Chooses FP8 quantization for the key-value cache.

  26. --load_by_shard: Enables loading a pre-trained model shard-by-shard.

  27. --hidden_act: Specifies the hidden activation function.

  28. --rotary_base: Sets the base value for rotary embeddings.

  29. --rotary_scaling: Specifies the type and factor for rotary scaling.

  30. --group_size: Sets the group size used in GPTQ/AWQ quantization.

  31. --storage-type: Specifies the storage type, with options including fp32 and fp16.

  32. --dataset-cache-dir: Sets the cache directory to load the Hugging Face dataset.

  33. --load-model-on-cpu: Forces the model to load on the CPU.

  34. --convert-model-on-cpu: Forces the model conversion to occur on the CPU.

  35. --use_parallel_embedding: Enables embedding parallelism.

  36. --embedding_sharding_dim: Specifies the dimension along which to shard the embedding lookup table.

  37. --use_embedding_sharing: Attempts to reduce the engine size by sharing the embedding lookup table between layers.

  38. --use_prompt_tuning: Enables prompt tuning.

  39. --output_dir: Specifies the directory to save the converted model checkpoint.

  40. --workers: Sets the number of workers for converting the checkpoint in parallel.

  41. --moe_num_experts: Specifies the number of experts for MOE layers.

  42. --moe_top_k: Sets the top_k value for MOE layers.

  43. --moe_tp_mode: Determines how to distribute experts in tensor parallelism.

  44. --moe_renorm_mode: Controls renormalization after gate logits for MOE.

  45. --use_fused_mlp: Enables horizontal fusion in GatedMLP.

  46. --enable_pos_shift: Enables position shift for the streaming LLM method.

  47. --dense_context_fmha: Enables dense FMHA in the context phase instead of sliding-window attention.

  48. --hf_lora_dir: Specifies the directory for a LoRA model.
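
To make these flags concrete, here is a minimal sketch of two example invocations. The script name (convert_checkpoint.py from the Llama example folder covered earlier), the model paths and the output directories are illustrative placeholders; only the flags themselves come from the list above.

```bash
# Hypothetical paths - replace with your own checkpoint and output locations.

# FP16 conversion of a Hugging Face Llama 2 7B checkpoint with 2-way tensor parallelism
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_2gpu_fp16 \
    --dtype float16 \
    --tp_size 2

# Single-GPU conversion with INT8 weight-only quantization and an INT8 KV cache
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_1gpu_int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --int8_kv_cache
```

The converted checkpoint written to --output_dir is then passed to trtllm-build (see the earlier build sections) to compile the final TensorRT engine.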
