The CUDA development environment relies on tight integration with the host development environment, including the host compiler and C runtime libraries, and is therefore only supported on Ubuntu versions that have been qualified for the CUDA Toolkit release.
The material below provides instructions on how to ensure the NVIDIA drivers will be compatible with the host system.
Compatibility between CUDA 12.3 and the host development environment
This table lists the kernel version, default GCC (GNU Compiler Collection) version, and GLIBC (GNU C Library) version for the Ubuntu LTS (Long-Term Support) release qualified for CUDA 12.3.

Distribution        Kernel      Default GCC    GLIBC
Ubuntu 22.04 LTS    5.15.0-43   11.2           2.35
Check Kernel Compatibility
To check the kernel version of your Ubuntu 22.04 system, you can use the uname command in the terminal.
The uname command with different options provides various system information, including the kernel version. Here's how you can do it:
Run the following command and press Enter:
uname -r
Output on a typical Ubuntu 22.04 virtual machine
5.15.0-105-generic
As you can see, the kernel release here is 5.15, which matches the kernel series listed in the compatibility table above.
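If you want to script this check, the release string returned by uname -r can be compared against a minimum kernel series in a few lines of Python. This is a minimal sketch; the (5, 15) minimum is taken from the table above.

```python
import re

# Minimum kernel series for CUDA 12.3 on Ubuntu 22.04 (from the table above).
MIN_KERNEL = (5, 15)

def kernel_series(release):
    """Extract (major, minor) from a `uname -r` string like '5.15.0-105-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2)))

def kernel_is_compatible(release, minimum=MIN_KERNEL):
    """True when the kernel series is at least the qualified one."""
    return kernel_series(release) >= minimum

print(kernel_is_compatible("5.15.0-105-generic"))  # True
```

The same function can be fed platform.release() from the standard library to test the machine you are actually running on.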
What is a kernel?
A kernel is the core component of an operating system (OS).
It acts as a bridge between applications and the actual data processing done at the hardware level.
The kernel's responsibilities include managing the system's resources and allowing multiple programs to run and use these resources efficiently. Here are some key aspects of a kernel:
Resource Management
The kernel manages hardware resources like the CPU, memory, and disk space. It allocates resources to various processes, ensuring that each process receives enough resources to function effectively while maintaining overall system efficiency.
Process Management
It handles the creation, scheduling, and termination of processes. The kernel decides which processes should run when and for how long, a process known as scheduling. This is critical in multi-tasking environments where multiple processes require CPU attention.
Memory Management
The kernel controls how memory is allocated to various processes and manages memory access, ensuring that each process has access to the memory it needs without interfering with other processes. It also manages virtual memory, allowing the system to use disk space as an extension of RAM.
Device Management
It acts as an intermediary between the hardware and software of a computer. For instance, when a program needs to read a file from a disk, it requests this service from the kernel, which then communicates with the disk drive’s hardware to read the data.
Security and Access Control
The kernel enforces access control policies, preventing unauthorised access to the system and its resources. It manages user permissions and ensures that processes have the required privileges to execute their tasks.
System Calls
These are the mechanisms through which user-space applications interact with the kernel. For example, when an application needs to open a file, it makes a system call, which is handled by the kernel.
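The file-opening example above can be made concrete: Python's os module exposes these syscall wrappers almost directly, and each call below corresponds to one system call that the kernel handles.

```python
import os
import tempfile

# os.open/os.write/os.read are thin wrappers over the open(2), write(2) and
# read(2) system calls; the kernel performs the actual file I/O on our behalf.
path = os.path.join(tempfile.gettempdir(), "syscall_demo.txt")

fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)  # open(2)
os.write(fd, b"hello from user space")                            # write(2)
os.close(fd)                                                      # close(2)

fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 100)                                           # read(2)
os.close(fd)
os.remove(path)                                                   # unlink(2)

print(data)  # b'hello from user space'
```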
Types of Kernels
Monolithic Kernels: These kernels include various services like the filesystem, device drivers, network interfaces, etc., within one large kernel. Example: Linux.
Microkernels: These kernels focus on minimal functionality, providing only basic services like process and memory management. Other components like device drivers are run in user space. Example: Minix.
Hybrid Kernels: These are a mix of monolithic and microkernel architectures. Example: Windows NT kernel.
Examples of Kernels
Linux Kernel: Used in Linux distributions.
Windows NT Kernel: Used in various versions of Microsoft Windows.
XNU Kernel: Used in macOS and iOS.
Check GNU Compiler Compatibility
NVIDIA CUDA Libraries work in conjunction with GCC (GNU Compiler Collection) on Linux systems.
GCC is commonly used for compiling the host (CPU) part of the code, while CUDA tools like nvcc (NVIDIA CUDA Compiler) are used for compiling the device (GPU) part of the code.
To check your GCC version, run:
gcc --version
On a typical Ubuntu 22.04 system the first line reports version 11.4. This version should work with CUDA 12.3, which requires at least GCC 11.2; GCC is considered backward compatible, so version 11.4 should be fine.
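If you automate this check, the version can be pulled out of the first line of gcc --version output. A small sketch follows; the Ubuntu package suffix in the sample string is illustrative, not taken from your machine.

```python
import re

def gcc_version(first_line):
    """Parse the trailing version from the first line of `gcc --version`,
    e.g. 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0' -> (11, 4, 0)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\s*$", first_line.strip())
    if match is None:
        raise ValueError(f"unrecognised gcc version line: {first_line!r}")
    return tuple(int(part) for part in match.groups())

def gcc_meets_minimum(first_line, minimum=(11, 2)):
    """CUDA 12.3 on Ubuntu 22.04 expects at least GCC 11.2 (see table above)."""
    return gcc_version(first_line)[:2] >= minimum

print(gcc_meets_minimum("gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"))  # True
```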
What is GCC and why is the version important?
GCC is a collection of compilers for various programming languages.
Although it started primarily for C (hence the original name GNU C Compiler), it now supports C++, Objective-C, Fortran, Ada, Go, and D.
Cross-Platform Compatibility
GCC can be used on many different types of operating systems and hardware architectures. This cross-platform capability makes it a versatile tool for developers who work in diverse environments.
Optimization and Portability
GCC offers a wide range of options for code optimization, making it possible to tune performance for specific hardware or application requirements. It also emphasizes portability, enabling developers to compile their code on one machine and run it on another without modification.
Standard Compliance
GCC strives to adhere closely to various programming language standards, including those for C and C++. This compliance ensures that code written and compiled with GCC is compatible with other compilers following the same standards.
Debugging and Error Reporting
GCC is known for its helpful debugging features and detailed error reporting, which are invaluable for developers in identifying and fixing code issues.
Integration with Development Tools
GCC easily integrates with various development tools and environments. It's commonly used in combination with IDEs, debuggers, and other tools, forming a complete development ecosystem.
Check GLIBC Compatibility
The GNU C Library, commonly known as glibc, is an important component of GNU systems and Linux distributions.
GLIBC is the GNU Project's implementation of the C standard library. It provides the system's core libraries. This includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.
To check the GLIBC version:
ldd --version
The first line of the output will show the version number. For example:
ldd (Ubuntu GLIBC 2.35-0ubuntu3.7) 2.35
Compare this with the GLIBC version in the table above.
The GLIBC version of 2.35 matches the version required for the NVIDIA CUDA Toolkit.
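To script this comparison, the version number can be taken either from the ldd banner line or from Python's platform module. This is a sketch; the banner string below mirrors the example output shown above.

```python
import platform

def glibc_from_ldd(first_line):
    """Extract the version from ldd's first output line, e.g.
    'ldd (Ubuntu GLIBC 2.35-0ubuntu3.7) 2.35' -> '2.35'."""
    return first_line.strip().rsplit(" ", 1)[-1]

print(glibc_from_ldd("ldd (Ubuntu GLIBC 2.35-0ubuntu3.7) 2.35"))  # 2.35

# platform.libc_ver() reports the C library the running interpreter is
# linked against, e.g. ('glibc', '2.35') on Ubuntu 22.04.
print(platform.libc_ver())
```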
What is GLIBC?
GLIBC is the GNU Project's implementation of the C standard library.
It provides the system's core libraries. This includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.
Compatibility: It's designed to be compatible with the POSIX standard, the Single UNIX Specification, and several other open standards, while also extending them in various ways.
System Calls and Kernel: glibc serves as a wrapper for system calls to the Linux kernel and other essential functions. This means that most applications on a Linux system depend on glibc to interact with the underlying kernel.
Portability: It's used in systems that range from embedded systems to servers and supercomputers, providing a consistent and reliable base across various hardware architectures.
Importance in Development
When developing software for Linux, it's crucial to know the version of glibc your application will be running against, as different versions may have different features and behaviours.
For applications intended to run on multiple Linux distributions, understanding glibc compatibility is key to ensuring broad compatibility.
Having confirmed the NVIDIA CUDA Toolkit's compatibility requirements for the host, the next step is to check that the installations themselves have been successful.
Process for checking that installations have been successful
First, check your Ubuntu version. Ensure it matches Ubuntu 22.04, which is our designated Linux operating system:
lsb_release -a
Then, verify that your system is based on the x86_64 architecture. Run:
uname -m
The output should be:
x86_64
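This check is also easy to automate: platform.machine() returns the same string as uname -m. A small sketch:

```python
import platform

def is_x86_64(machine):
    """True for the 64-bit x86 architecture; 'amd64' is an alternative
    spelling some systems use for the same architecture."""
    return machine.lower() in {"x86_64", "amd64"}

print(platform.machine())            # e.g. x86_64
print(is_x86_64(platform.machine()))
```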
To check if your system has a CUDA-capable NVIDIA GPU, run:
nvidia-smi
You should see an output like this, which details the NVIDIA Drivers installed and the CUDA Version.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 On | 00000000:00:05.0 Off | 0 |
| N/A 36C P0 56W / 400W| 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1314 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
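When scripting checks around nvidia-smi, the driver version can be parsed out of the banner line. The sample string below reuses the output shown above; parsing real output would read the command's stdout instead.

```python
import re

def driver_version(smi_output):
    """Pull the driver version out of nvidia-smi's banner, or None if absent."""
    match = re.search(r"Driver Version:\s*([\d.]+)", smi_output)
    return match.group(1) if match else None

banner = "| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02 |"
print(driver_version(banner))  # 530.30.02
```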
A full analysis
If you would like a full printout of your system features all at once, you can combine the checks above into a single command in the terminal.
The output will provide all the information necessary to check system compatibility.
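The exact combined command is not reproduced here, but as a rough, stdlib-only equivalent, Python's platform module can collect much of the same information. This is a sketch, not the original command.

```python
import platform

# Gather roughly the facts analysed below: architecture, kernel, hostname, libc.
uname = platform.uname()
info = {
    "machine": uname.machine,          # e.g. x86_64
    "kernel_name": uname.system,       # e.g. Linux
    "kernel_release": uname.release,   # e.g. 5.4.0-167-generic
    "kernel_version": uname.version,   # build string, e.g. '#184-Ubuntu SMP ...'
    "hostname": uname.node,
    "python_libc": platform.libc_ver(),
}
for key, value in info.items():
    print(f"{key}: {value}")
```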
Typical analysis of the output from an A100 80GB instance
Machine Architecture: x86_64
Your system uses the 64-bit version of the x86 architecture. This is a standard architecture for modern desktops and servers, supporting more memory and larger data sizes compared to 32-bit systems.
Kernel Details
Kernel Name: Linux, indicating that your operating system is based on the Linux kernel.
Kernel Release: 5.4.0-167-generic. This specifies the version of the Linux kernel you are running. 'Generic' here implies a standard kernel version that is versatile for various hardware setups.
Kernel Version: #184-Ubuntu SMP. This shows a specific build of the kernel, compiled with Symmetric Multi-Processing (SMP) support, allowing efficient use of multi-core processors. The timestamp shows the build date.
Hostname: ps1rgbvhl
This is the network identifier for your machine, used to distinguish it in a network environment.
Operating System: GNU/Linux
This indicates that you're using a GNU/Linux distribution, a combination of the Linux kernel with GNU software.
Detailed Kernel Version
This reiterates your kernel version and build details. It also mentions the GCC version used for building the kernel (9.4.0), which affects compatibility with certain software.
CPU Information: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
The system is powered by an Intel Xeon Gold 6342 processor, which is a high-performance, server-grade CPU. The 2.80 GHz frequency indicates its base clock speed.
Memory Information: MemTotal: 92679772 kB
The system has a substantial amount of RAM (roughly 95 GB; note that the 'kB' figures in /proc/meminfo are kibibytes). This is a significant size, suitable for memory-intensive applications and multitasking.
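The conversion from the /proc/meminfo figure can be checked directly; despite the 'kB' label, /proc/meminfo reports in 1024-byte kibibytes.

```python
MEM_TOTAL_KB = 92_679_772  # MemTotal value from /proc/meminfo above

bytes_total = MEM_TOTAL_KB * 1024   # 'kB' in /proc/meminfo means KiB
gigabytes = bytes_total / 10**9     # decimal gigabytes
gibibytes = bytes_total / 2**30     # binary gibibytes

print(f"{gigabytes:.2f} GB")   # 94.90 GB
print(f"{gibibytes:.2f} GiB")  # 88.39 GiB
```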
Ubuntu Distribution Information
Distributor ID: Ubuntu. This shows the Linux distribution you're using.
Description: Ubuntu 20.04.6 LTS, indicating the specific version and that it's a Long-Term Support (LTS) release.
Release: 20.04, the version number.
Codename: focal, the internal codename for this Ubuntu release.
NVCC Version
The output details the version of the NVIDIA CUDA Compiler (NVCC) as 12.1, built in February 2023. NVCC is a key component for compiling CUDA code, essential for developing applications that leverage NVIDIA GPUs for parallel processing tasks.
In summary, the output paints a picture of a powerful, 64-bit Linux system with a high-performance CPU and a significant amount of RAM, running an LTS version of Ubuntu.
The presence of the NVCC with CUDA version 12.1 indicates readiness for CUDA-based development, particularly in fields like data science, machine learning, or any computationally intensive tasks that can benefit from GPU acceleration.
Installation of .NET SDK - required for Polyglot Notebooks
Installation of .NET
.NET is a free, open-source, and cross-platform framework developed by Microsoft.
It is used for building various types of applications, including web applications, desktop applications, cloud-based services, and more. .NET provides a rich set of libraries and tools for developers to create robust and scalable software solutions.
Add the Microsoft package repository
Installing with APT can be done with a few commands. Before you install .NET, run the following commands to add the Microsoft package signing key to your list of trusted keys and add the package repository.
The .NET SDK allows you to develop apps with .NET. If you install the .NET SDK, you don't need to install the corresponding runtime. To install the .NET SDK, run the following commands:
The ASP.NET Core Runtime allows you to run apps that were made with .NET that didn't provide the runtime. The following commands install the ASP.NET Core Runtime, which is the most compatible runtime for .NET. In your terminal, run the following commands:
As an alternative to the ASP.NET Core Runtime, you can install the .NET Runtime, which doesn't include ASP.NET Core support: replace aspnetcore-runtime-8.0 in the previous command with dotnet-runtime-8.0:
sudo apt-get install -y dotnet-runtime-8.0
If you want to change the version of CUDA being used in your environment
The Conda installation for CUDA is an efficient way to install and manage the CUDA Toolkit, especially when working with Python environments.
Conda Overview
Conda can facilitate the installation of the CUDA Toolkit.
Installing CUDA Using Conda
Basic installation command: conda install cuda -c nvidia.
This command installs all components of the CUDA Toolkit.
Uninstalling CUDA Using Conda
Uninstallation command: conda remove cuda.
It removes the CUDA Toolkit installed via Conda.
Special Tip: After uninstallation, check for any residual files or dependencies that might need manual removal.
Installing Previous CUDA Releases
Install specific versions using: conda install cuda -c nvidia/label/cuda-<version>.
Replace <version> with the desired CUDA version (e.g., 11.3.0).
Special Tip: Installing previous versions can be crucial for compatibility with certain applications or libraries. Always check version compatibility with your project requirements.
Practical Example: Installing CUDA Toolkit:
Create a virtual environment
conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc: Installs the NVIDIA CUDA Compiler (nvcc) from the specified NVIDIA channel on Conda. This is aligned with the CUDA version 11.8.0, ensuring compatibility with the specific version of PyTorch being used.
Additional Tools Installation (Optional)
conda install -c anaconda cmake: Installs CMake, a cross-platform tool for managing the build process of software using a compiler-independent method.
conda install -c conda-forge lit: Installs 'lit', a tool for executing LLVM's integrated test suites.
Installing PyTorch and Related Libraries
The pip install command is used to install specific versions of PyTorch (torch), along with its companion libraries torchvision and torchaudio. The --index-url option specifies the PyTorch wheel index for CUDA 11.8, ensuring that the installed PyTorch version is compatible with CUDA 11.8.
These commands add a new PPA (Personal Package Archive) for Ubuntu toolchain tests and install GCC 11 and G++ 11. These are needed for building certain components that require C++ compilation, particularly for deepspeed, a deep learning optimization library.
Checking to see whether the revised version of CUDA is installed
CUDA in Conda Environments
When you create a Conda environment and install a specific version of CUDA (like 11.8 in your case), you are installing CUDA toolkit libraries that are compatible with that version within that environment.
This installation does not change the system-wide CUDA version, nor does it affect what nvidia-smi displays.
The Conda environment's CUDA version is used by the programs and processes running within that environment. It's independent of the system-wide CUDA installation.
Verifying CUDA Version in Conda Environment
To check the CUDA version in your Conda environment, you should not rely on nvidia-smi. Instead, you can check the version of the CUDA toolkit you have installed in your environment. This can typically be done by checking the version of specific CUDA toolkit packages installed in the environment, like cudatoolkit.
You can use a command like conda list cudatoolkit within your Conda environment to see the installed version of the CUDA toolkit in that environment.
Compatibility
It's important to ensure that the CUDA toolkit version within your Conda environment is compatible with the version supported by your NVIDIA driver (as indicated by nvidia-smi). If the toolkit version in your environment is higher than the driver's supported version, you may encounter compatibility issues.
In summary, nvidia-smi shows the maximum CUDA version supported by your GPU's driver, not the version used in your current Conda environment. To check the CUDA version in a Conda environment, use Conda-specific commands to list the installed packages and their versions.
Another way of putting it:
CUDA Driver Version: The version reported by nvidia-smi is the CUDA driver version installed on your system, which is 12.3 in your case. This is the version of the driver software that allows your operating system to communicate with the NVIDIA GPU.
CUDA Toolkit Version in PyTorch: When you install PyTorch with a specific CUDA toolkit version (like cu118 for CUDA 11.8), it refers to the version of the CUDA toolkit libraries that PyTorch uses for GPU acceleration. PyTorch packages these libraries with itself, so it does not rely on the system-wide CUDA toolkit installation.
Compatibility: The key point is compatibility. Your system's CUDA driver version (12.3) is newer than, and compatible with, the CUDA toolkit version used by PyTorch (11.8). Generally, a newer driver version can support older toolkit versions without issues.
Functionality Check: As long as torch.cuda.is_available() returns True, it indicates that PyTorch is able to interact with your GPU using its bundled CUDA libraries, and you should be able to run CUDA-accelerated PyTorch operations on your GPUs.
In summary, your setup is fine for running PyTorch with GPU support. The difference in the CUDA driver and toolkit versions is normal and typically not a problem as long as the driver version is equal to or newer than the toolkit version required by PyTorch.
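The rule of thumb above, that the driver's CUDA version must be at least the toolkit version, can be expressed as a small helper. The version strings are the ones discussed in this section.

```python
def parse_version(version):
    """'12.3' -> (12, 3)"""
    return tuple(int(part) for part in version.split("."))

def toolkit_supported(driver_cuda, toolkit_cuda):
    """A driver is expected to support any toolkit at or below its own
    CUDA version (newer drivers run older toolkits)."""
    return parse_version(driver_cuda) >= parse_version(toolkit_cuda)

print(toolkit_supported("12.3", "11.8"))  # True: PyTorch cu118 on a 12.3 driver
print(toolkit_supported("11.4", "12.1"))  # False: toolkit newer than driver
```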
Test Compatibility
Below are some scripts you can create to test for compatibility.
These scripts will test that both your CPU and GPU are correctly processing CUDA code, and will also check that there are no compatibility issues between the installed GCC version and the CUDA Toolkit version you are using.
Compatibility Test Scripts
To test the compatibility of your GCC version with the CUDA Toolkit version installed, you can use a simple CUDA program.
Below is a basic script for a CUDA program that performs a simple operation on the GPU. This script will help you verify that your setup is correctly configured for CUDA development.
First, create a simple CUDA program. Let's call it test_cuda.cu:
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

// Kernel function to add two vectors
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void) {
    int N = 1 << 25;  // 33.6M elements
    float *x, *y;
    cudaEvent_t start, stop;

    // Allocate unified memory accessible from both CPU and GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf("Time taken: %f ms\n", milliseconds);

    // Every element should now be 3.0f
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
Next, create a shell script to compile and run this CUDA program.
Name this script test_cuda_compatibility.sh:
#!/bin/bash

# Define the CUDA file
cuda_file="test_cuda.cu"

# Define the output executable
output_executable="test_cuda_executable"

# Compile the CUDA program
nvcc $cuda_file -o $output_executable

# Check if the compilation was successful
if [ $? -eq 0 ]; then
    echo "Compilation successful. Running the CUDA program..."
    ./$output_executable
else
    echo "Compilation failed."
fi
This script compiles the test_cuda.cu file using nvcc, the NVIDIA CUDA compiler, and then runs the compiled executable if the compilation is successful.
How to Use the Script:
Save the CUDA program code in a file named test_cuda.cu.
Save the shell script in a file named test_cuda_compatibility.sh.
Make the shell script executable:
chmod +x test_cuda_compatibility.sh
Run the script:
./test_cuda_compatibility.sh
If everything is set up correctly, the script will compile the CUDA program and run it, resulting in output from both the CPU and GPU.
If there are compatibility issues between GCC and the CUDA Toolkit, the script will likely fail during compilation, and you'll see error messages indicating what went wrong.
Remember: Compatibility between the GCC version and the CUDA Toolkit is crucial. Make sure the GCC version you choose is compatible with your CUDA Toolkit version.
Where are you now?
We have now created a deep learning development environment optimised for NVIDIA GPUs, with compatibility across key components.
We have so far:
- Installed the CUDA Toolkit and NVIDIA drivers
- Set up the NVIDIA Container Toolkit to allow access to NVIDIA Docker containers
- Ensured host compatibility by verifying that components such as GCC (GNU Compiler Collection) and GLIBC (GNU C Library) are compatible with the CUDA version
- Created a compatibility check script to detect compatibility issues
With these components in place, your environment is tailored for deep learning development.
It supports the development and execution of deep learning models, leveraging the computational power of GPUs for training and inference tasks.