# NVCC: The NVIDIA CUDA Compiler

NVCC, which stands for <mark style="color:blue;">**NVIDIA CUDA Compiler**</mark>, is a proprietary compiler from NVIDIA that <mark style="color:yellow;">compiles CUDA C/C++ code for execution on CUDA-enabled GPUs</mark> (Graphics Processing Units).

NVCC acts as a compiler driver, controlling the compilation flow and linking process, while delegating the actual code generation to other tools like the host compiler and the CUDA backend compiler.

#### <mark style="color:green;">CUDA Programming Model</mark>

CUDA follows a heterogeneous programming model: the <mark style="color:yellow;">host code runs on the CPU</mark>, while the <mark style="color:yellow;">device code</mark>, also known as kernels, <mark style="color:yellow;">runs on the GPU</mark>.

The host code is responsible for allocating memory on the device, transferring data between host and device, and launching kernels on the GPU. Kernels are C++ functions marked with the `__global__` keyword, indicating that they are callable from the host and execute on the device.
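A minimal sketch of this host/device split (hypothetical names; error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: a kernel, marked __global__, that runs on the GPU.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    int host[n] = {0};

    // Host code: allocate device memory, copy data in, launch the kernel.
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    addOne<<<1, n>>>(dev, n);   // kernel launch from the host

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[0] = %d\n", host[0]);
    return 0;
}
```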

#### <mark style="color:green;">NVCC Workflow</mark>

NVCC processes CUDA source files (typically with a .cu extension) and separates the device code from the host code.

It then compiles the device code using the CUDA backend compiler, which generates a PTX (Parallel Thread Execution) assembly file or a cubin (CUDA binary) object file.

The host code is modified to include the necessary CUDA runtime function calls and is then passed to a standard C++ compiler for compilation.

#### <mark style="color:green;">Supported Host Compilers</mark>

NVCC relies on a host compiler for preprocessing, parsing, and code generation of the host code.

It supports various host compilers such as GCC, Clang, and Microsoft Visual C++ (MSVC) on different platforms. The specific host compiler used can be specified using the -ccbin option followed by the path to the compiler executable.

#### <mark style="color:green;">CUDA Compilation Trajectory</mark>

The CUDA compilation trajectory involves several stages:

1. <mark style="color:purple;">Preprocessing:</mark> The CUDA source files are preprocessed to handle includes, macros, and conditional compilation.
2. <mark style="color:purple;">Compilation:</mark>
   * Device code is compiled to PTX assembly or cubin object files.
   * Host code is modified and compiled using the host compiler.
3. <mark style="color:purple;">Linking:</mark>
   * Device object files are linked together using <mark style="color:blue;">**nvlink**</mark>.
   * The resulting device code is embedded into the host object files.
   * Host object files are linked using the host linker to create an executable.
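Assuming a single source file (the name `vec.cu` is hypothetical), the stages above can be driven individually, or all intermediate files can be inspected in one shot with `--keep`:

```shell
# Retain every intermediate file (preprocessed source, PTX, cubin, host stubs):
nvcc --keep -c vec.cu -o vec.o

# Or drive the stages explicitly:
nvcc -ptx vec.cu -o vec.ptx      # device code -> PTX assembly
nvcc -cubin vec.cu -o vec.cubin  # device code -> architecture-specific cubin
nvcc -c vec.cu -o vec.o          # embed device code into a host object file
nvcc vec.o -o vec                # host link into the final executable
```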

#### <mark style="color:green;">NVCC Compiler Options</mark>

NVCC provides a wide range of compiler options to control the compilation process.

Some key options include:

* --gpu-architecture (-arch): Specifies the target virtual GPU architecture (e.g., compute\_80 for NVIDIA Ampere).
* --gpu-code (-code): Specifies the real GPU architecture(s) to generate code for (e.g., sm\_80 for NVIDIA Ampere).
* -rdc=true: Enables relocatable device code, allowing separate compilation and linking of device code.
* -dc: Compiles each input file into an object file containing relocatable device code (shorthand for -rdc=true -c).
* -Xcompiler: Passes options directly to the host compiler.
* -Xlinker: Passes options directly to the host linker.
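For example (file names are hypothetical), these options combine as follows:

```shell
# Target the Ampere virtual architecture and generate sm_80 binary code:
nvcc -arch=compute_80 -code=sm_80 -c kernel.cu -o kernel.o

# Pass -Wall to the host compiler and an -rpath to the host linker:
nvcc -Xcompiler -Wall -Xlinker -rpath,/usr/local/cuda/lib64 main.cu -o app

# Select a specific host compiler:
nvcc -ccbin /usr/bin/g++-12 main.cu -o app
```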

#### <mark style="color:green;">Separate Compilation and Linking</mark>

NVCC supports separate compilation and linking of device code.

This allows device code to be split across multiple files and linked together using <mark style="color:blue;">**nvlink**</mark>.

To enable separate compilation, the -rdc=true option is used to generate relocatable device code.

The compiled objects can then be linked using nvlink, and the resulting device code is embedded into the host executable.
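A minimal two-file sketch (file names are hypothetical):

```shell
# Compile each file to an object containing relocatable device code:
nvcc -dc a.cu -o a.o
nvcc -dc b.cu -o b.o

# nvcc performs the device link (via nvlink) and the host link in one step:
nvcc a.o b.o -o app

# Equivalently, with an explicit device-link step and the host linker:
nvcc -dlink a.o b.o -o dlink.o
g++ a.o b.o dlink.o -L/usr/local/cuda/lib64 -lcudart -o app
```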

#### <mark style="color:green;">Optimisations</mark>

NVCC provides various optimisation options to improve the performance of CUDA code. Some notable optimisations include:

* -O3: Enables aggressive optimisations in the host compiler (device-side optimisation is controlled separately, e.g. via -Xptxas -O).
* -ftz=true: Flushes denormal single-precision values to zero.
* -prec-div=true|false: Controls whether single-precision division is IEEE-compliant or faster but less precise.
* -use\_fast\_math: Enables fast math optimisations (implies -ftz=true, -prec-div=false, -prec-sqrt=false, and -fmad=true).
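As a sketch (the file name is hypothetical), the shorthand and its spelled-out equivalent:

```shell
# Fast-math shorthand:
nvcc -O3 -use_fast_math kernel.cu -o app

# Equivalent individual flags:
nvcc -O3 -ftz=true -prec-div=false -prec-sqrt=false -fmad=true kernel.cu -o app
```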

#### <mark style="color:green;">Code Generation</mark>

NVCC generates device code in two forms:

<mark style="color:purple;">PTX assembly</mark> and <mark style="color:purple;">cubin object files</mark>.

PTX is a low-level virtual machine and instruction set architecture that provides a stable interface for CUDA code across different GPU architectures.

PTX code is compiled to binary code by the CUDA driver when the application is loaded, allowing for portability and forward compatibility.

Cubin, on the other hand, is a pre-compiled binary format specific to a particular GPU architecture.

#### <mark style="color:green;">Virtual Architectures and Just-in-Time Compilation</mark>

To enable forward compatibility and optimisation for specific GPU architectures, NVCC introduces the concept of <mark style="color:yellow;">virtual architectures</mark>.

Virtual architectures (compute\_\*) define a set of features and capabilities that are common across a range of physical architectures (sm\_\*).

NVCC compiles device code to a virtual architecture, which is then translated to binary code for the specific physical architecture by the CUDA driver through Just-in-Time (JIT) compilation.

This allows CUDA applications to run on newer GPU architectures without recompilation.
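Both forms can be embedded in one fat binary with `-gencode` (the file name is hypothetical): the sm\_80 cubin is used directly on Ampere, while newer GPUs JIT-compile the embedded compute\_80 PTX at load time:

```shell
# Embed sm_80 binary code plus compute_80 PTX in a single fat binary:
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     kernel.cu -o app
```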

#### <mark style="color:green;">Debugging and Profiling</mark>

NVCC provides options for debugging and profiling CUDA code.

The -g option enables host-side debug symbols, and the -G option enables device-side debugging, allowing source-level debugging of kernels with tools like cuda-gdb.

The -lineinfo option generates line number information for device code, enabling source-correlated profiling and performance analysis using tools such as NVIDIA Nsight Compute (the successor to the NVIDIA Visual Profiler).
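A sketch of typical debug and profiling builds (file names are hypothetical):

```shell
# Host (-g) and device (-G) debug info, for use with cuda-gdb:
nvcc -g -G kernel.cu -o app_debug

# Line info only: keeps optimisations while enabling source-level profiling:
nvcc -O3 -lineinfo kernel.cu -o app_prof
```

Note that -G disables most device-side optimisations, so it is usually reserved for debug builds.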

#### <mark style="color:green;">Conclusion</mark>

NVCC is a powerful compiler that simplifies the process of compiling and linking CUDA C/C++ code for execution on NVIDIA GPUs.

It handles the intricate details of separating device code from host code, compiling device code to PTX or cubin, and linking everything together into a final executable.

With its wide range of compiler options, optimizations, and support for separate compilation and linking, NVCC provides developers with the tools necessary to write efficient and high-performance CUDA applications.

