run.py for inference

The run.py file is a script for running inference with a pre-built TensorRT engine.

It takes various command-line arguments to configure the inference process and generates output based on the provided input.

The key arguments passed to the script, and what each one controls:

  1. --max_output_len: The maximum length of the generated output sequence.

  2. --max_attention_window_size: The attention window size that controls the sliding window attention or cyclic KV cache behavior.

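  3. --sink_token_length: The number of sink tokens, that is, the leading tokens that are always kept in the KV cache when a limited attention window is used.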

  4. --log_level: The logging level for the script.

  5. --engine_dir: The directory containing the pre-built TensorRT engine.

  6. --use_py_session: A flag that selects the Python runtime session instead of the C++ session.

  7. --input_text: The input text to be used for generation.

  8. --input_file: An alternative to --input_text, allowing input to be read from a CSV or Numpy file.

  9. --max_input_length: The maximum length of the input sequence.

  10. --output_csv, --output_npy: Files to store the tokenized output in CSV or Numpy format.

  11. --output_logits_npy: File to store the generation logits in Numpy format (only when num_beams==1).

  12. --output_log_probs_npy, --output_cum_log_probs_npy: Files to store the log probabilities and cumulative log probabilities in Numpy format.

  13. --tokenizer_dir, --tokenizer_type, --vocab_file: Configuration for the tokenizer.

  14. --num_beams: The number of beams; a value greater than 1 enables beam search.

  15. --temperature, --top_k, --top_p, --length_penalty, --repetition_penalty, --presence_penalty, --frequency_penalty: Parameters for controlling the generation process.

  16. --early_stopping: Whether to use early stopping during beam search.

  17. --debug_mode: Whether to turn on debug mode.

  18. --streaming, --streaming_interval: Configuration for streaming mode.

  19. --prompt_table_path, --prompt_tasks: Configuration for prompt tuning.

  20. --lora_dir, --lora_task_uids, --lora_ckpt_source: Configuration for LoRA (Low-Rank Adaptation).

  21. --num_prepend_vtokens: The number of default virtual tokens to prepend to each sentence.

  22. --run_profiling: Whether to run profiling iterations to measure inference latencies.

  23. --medusa_choices: Configuration for Medusa decoding.
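
To make the list above more concrete, here is a minimal sketch of how a small subset of these arguments could be declared with argparse. This is not the actual source of run.py: the types, defaults, and required flags shown here are assumptions chosen for illustration, and the helper names build_parser and parse_and_validate are hypothetical. The only rule it enforces is the one stated above, that --output_logits_npy applies only when num_beams is 1.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the arguments listed above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of a run.py-style CLI")
    # Types and defaults below are assumptions, not taken from run.py itself.
    parser.add_argument("--engine_dir", type=str, required=True)
    parser.add_argument("--tokenizer_dir", type=str, default=None)
    parser.add_argument("--max_output_len", type=int, required=True)
    parser.add_argument("--input_text", type=str, nargs="+", default=None)
    parser.add_argument("--input_file", type=str, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--output_logits_npy", type=str, default=None)
    parser.add_argument("--use_py_session", action="store_true")
    return parser


def parse_and_validate(argv):
    """Parse argv and reject generation logits output combined with beam search."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.output_logits_npy is not None and args.num_beams != 1:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--output_logits_npy requires --num_beams 1")
    return args
```

Calling parse_and_validate(["--engine_dir", "engines", "--max_output_len", "50"]) returns a namespace with num_beams equal to 1 and use_py_session equal to False under these assumed defaults.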

Relationship to the LLaMA engine:

  • The run.py script is designed to perform inference using a pre-built TensorRT engine, such as the rank0.engine file generated from the LLaMA model.

  • The --engine_dir argument specifies the directory containing the TensorRT engine file, which is loaded by the script for inference.

  • The config.json file contains the configuration details of the LLaMA model used to build the TensorRT engine. It includes information such as the model architecture, data types, vocabulary size, and other hyperparameters.

  • The run.py script uses the configuration from config.json to set up the appropriate tokenizer, input parsing, and output processing based on the LLaMA model's specifications.

  • The script also supports optional features such as LoRA, prompt tuning, and Medusa decoding, each configured through its respective command-line arguments.
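
As a rough illustration of how an engine directory fits together, the sketch below loads config.json and lists the engine files found next to it. It assumes that config.json sits in the same directory as the rank engine files and that engine files follow the rank0.engine naming pattern mentioned above; the function name inspect_engine_dir is hypothetical and is not part of run.py. No particular keys inside config.json are assumed.

```python
import json
from pathlib import Path


def inspect_engine_dir(engine_dir):
    """Return the parsed config.json and the sorted engine file names in engine_dir."""
    engine_path = Path(engine_dir)
    # Assumption: config.json is stored alongside the engine files.
    with open(engine_path / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: engine files are named rank<N>.engine, e.g. rank0.engine.
    engines = sorted(p.name for p in engine_path.glob("rank*.engine"))
    return config, engines
```

Printing the returned dictionary is a quick way to check which model architecture and data types an engine was built with before pointing run.py at the directory.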

In summary, run.py is a generic inference script that works with any pre-built TensorRT engine, including the LLaMA engine. It reads the engine file (for example rank0.engine) and the accompanying config.json, then generates output from the provided input according to the specified arguments.
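
For reference, a typical invocation could be assembled as follows. This is only a sketch: the directory names and prompt are placeholders, the helper build_run_command is hypothetical, and only flags from the argument list above are used.

```python
import shlex
import sys


def build_run_command(engine_dir, tokenizer_dir, input_text, max_output_len):
    """Build an argv list for run.py using flags from the argument list above."""
    return [
        sys.executable, "run.py",
        "--engine_dir", engine_dir,
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", input_text,
        "--max_output_len", str(max_output_len),
    ]


# Placeholder paths and prompt; substitute your own engine and tokenizer directories.
cmd = build_run_command("./llama_engine", "./llama_tokenizer", "Hello, world", 50)
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True) from the directory that contains run.py.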
