newsched

The GNU Radio 4.0 Runtime Proof Of Concept

CUDA Custom Buffers

We want to explore creating our own custom buffers, and the CUDA programming model (assuming you have an NVIDIA GPU) makes for a convenient way to demonstrate zero-copy in and out of the GPU.

Installation Caveats

If the CUDA drivers and toolkit have been installed using apt-get, meson needs a bit of help to find the modules. When calling meson setup ..., set the CUDA_ROOT environment variable to /usr, e.g. CUDA_ROOT=/usr meson setup ...

If you want the latest CUDA toolkit and install it from tar files, there is probably an easier way to get this working, but … on Ubuntu 20.04, in order to get CUDA installed locally on my machine, I had to:

  1. Download the CUDA runfile installer here
  2. Uninstall all nvidia-* packages using apt remove
  3. Blacklist the nouveau driver (see the snippet after this list)
  4. Reboot the machine - now that the nvidia packages are removed and nouveau is blacklisted, it can’t boot into GNOME
  5. In GRUB, boot to the 5.4 kernel with Recovery Mode
  6. Enable Networking
  7. Install the runfile
  8. Reboot back into GRUB 5.4 kernel
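
For step 3, blacklisting nouveau follows the standard procedure from NVIDIA’s installation guide: add a modprobe blacklist file and rebuild the initramfs (the file path shown is the usual Ubuntu location):

    # /etc/modprobe.d/blacklist-nouveau.conf
    blacklist nouveau
    options nouveau modeset=0

then run sudo update-initramfs -u before rebooting.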

This was all done with CUDA toolkit 11.0 and the 450 driver, so it might be the case that another combination works with the newer kernel, but this is what I had to go through.

Build Options

First, we add an option in meson_options.txt to enable or disable compilation of CUDA-related code from the command line.
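
A minimal sketch of what that entry might look like, assuming the enable_cuda name implied by get_option('enable_cuda') below (the type and default value here are assumptions):

    option('enable_cuda', type : 'boolean', value : false,
           description : 'Enable CUDA-related blocks and buffers')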

Next, we add the detection of the CUDA compiler

cuda_available = add_languages('cuda', required : false)

and a dependency that will be used throughout the rest of the meson.build files

cuda_dep = dependency('cuda', version : '>=10.1', required : cuda_available and get_option('enable_cuda'))
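
Downstream meson.build files can then gate their CUDA targets on this dependency. A hypothetical example (the target and source file names are illustrative):

    if cuda_dep.found()
        buffer_cuda_lib = library('buffer_cuda', 'buffer_cuda.cu',
                                  dependencies : [cuda_dep])
    endif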

Custom Buffers

We implement two types of CUDA custom buffers:

  1. buffer_cuda.cu - Establishes separate host and device memory; its post_write method initiates the transfer between the two where applicable.
  2. buffer_cuda_pinned.cu - Creates pinned host memory and performs no extra H2D or D2H memcpys. This is for the type of machine that has a GPU integrated with the CPU, such as the Jetson.

Note: both of these currently implement simple double-copy circular buffers, but this logic should eventually be changed.

The buffer_cuda_sm.cu buffer class uses the single-mapped buffer abstraction, but we are still working out some of the kinks.

buffer_cuda_pinned.cu is almost exactly the same as simplebuffer.h, just with a CUDA host-allocated pinned buffer instead of normal CPU memory.
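
The essential difference is the allocation call. A minimal sketch (not the actual newsched class) of what pinned allocation looks like:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Page-locked (pinned) host memory: the GPU can DMA to and from it
    // directly, which is what lets the pinned buffer skip explicit
    // H2D/D2H memcpys.
    struct pinned_storage {
        void* ptr = nullptr;
        explicit pinned_storage(size_t num_bytes)
        {
            cudaMallocHost(&ptr, num_bytes); // instead of malloc/new
        }
        ~pinned_storage() { cudaFreeHost(ptr); }
    };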

buffer_cuda.cu is more interesting, especially in its post_write method.

In buffer_cuda.h, we have defined some convenience macros to wrap the buffer creation arguments:

#define CUDA_BUFFER_ARGS_H2D buffer_cuda_properties::make(buffer_cuda_type::H2D)
#define CUDA_BUFFER_ARGS_D2H buffer_cuda_properties::make(buffer_cuda_type::D2H)
#define CUDA_BUFFER_ARGS_D2D buffer_cuda_properties::make(buffer_cuda_type::D2D)

which makes it easier to call set_custom_buffer on the connected edge. Now, taking a look at the post_write method, we see that after we have written to the host or device buffer (whichever was returned from write_info, depending on where the buffer sits in the chain), we take action to initiate an H2D, D2D, or D2H transfer (with additional complication from the double-copy circular buffer):

    if (_buffer_type == cuda_buffer_type::H2D) {
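        // H2D: stage the newly written samples from the host buffer into device memory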
        cudaMemcpy(&_device_buffer[wi1],
                   &_host_buffer[wi1],
                   bytes_written,
                   cudaMemcpyHostToDevice);

        // memcpy(&_host_buffer[wi2], &_host_buffer[wi1], num_bytes_1);
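        // keep the second copy of the double-copy circular buffer consistent (wraps at _buf_size)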
        cudaMemcpy(&_device_buffer[wi2],
                   &_device_buffer[wi1],
                   num_bytes_1,
                   cudaMemcpyDeviceToDevice);
        if (num_bytes_2) {
            // memcpy(&_host_buffer[0], &_host_buffer[_buf_size], num_bytes_2);
            cudaMemcpy(&_device_buffer[0],
                       &_device_buffer[_buf_size],
                       num_bytes_2,
                       cudaMemcpyDeviceToDevice);
        }
    } else if (_buffer_type == cuda_buffer_type::D2H) {
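        // D2H: copy the newly written samples from device memory back to the host buffer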
        cudaMemcpy(&_host_buffer[wi1],
                   &_device_buffer[wi1],
                   bytes_written,
                   cudaMemcpyDeviceToHost);

        memcpy(&_host_buffer[wi2], &_host_buffer[wi1], num_bytes_1);

        if (num_bytes_2) {
            memcpy(&_host_buffer[0], &_host_buffer[_buf_size], num_bytes_2);
        }
    } else // D2D
    {
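        // D2D: samples already reside on the device; only mirror within the circular buffer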
        cudaMemcpy(&_device_buffer[wi2],
                   &_device_buffer[wi1],
                   num_bytes_1,
                   cudaMemcpyDeviceToDevice);
        if (num_bytes_2)
            cudaMemcpy(&_device_buffer[0],
                       &_device_buffer[_buf_size],
                       num_bytes_2,
                       cudaMemcpyDeviceToDevice);
    }
    // advance the write pointer
    _write_index += bytes_written;
    if (_write_index >= _buf_size) {
        _write_index -= _buf_size;
    }
}

If the buffer is instantiated as H2D, upstream writes land in the host buffer and post_write stages the newly written samples onto the device.

If the buffer is instantiated as D2H, upstream writes land in the device buffer and post_write copies them back to the host.

If the buffer is instantiated as D2D, the samples stay in device memory and only the circular buffer mirroring is performed.

In any of these cases, the block implementation assumes that the sample buffers provided in the work_input and work_output structs are device memory, and it can launch kernels on them directly.
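
As an illustration of what that assumption enables, here is a minimal CUDA sketch: given device pointers, a block can launch a kernel on them directly with no staging copies. The kernel and the surrounding function are hypothetical, not the actual newsched block API:

    __global__ void copy_kernel(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // hypothetical work body: in/out are device pointers handed over by the
    // custom buffer, so the kernel consumes them without any cudaMemcpy staging
    void do_work(const float* in, float* out, int nitems)
    {
        const int threads = 256;
        const int blocks = (nitems + threads - 1) / threads;
        copy_kernel<<<blocks, threads>>>(in, out, nitems);
        cudaDeviceSynchronize();
    }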

QA Test

A single QA test for the CUDA copy block is implemented; it simply checks that the samples sent into the chain of copy blocks come out the other side intact.

The only things of note are:

  1. QA tests for CUDA are put into a separate folder so they can easily be enabled/disabled via the CUDA option/dependency
  2. Blocks are instantiated from the cuda block module
     auto copy1 = cuda::copy::make(1024);
     auto copy2 = cuda::copy::make(1024);
    
  3. CUDA buffers are manually specified using the set_custom_buffer method
    fg->connect(src, 0, copy1, 0)->set_custom_buffer(CUDA_BUFFER_ARGS_H2D);
    fg->connect(copy1, 0, copy2, 0)->set_custom_buffer(CUDA_BUFFER_ARGS_D2D);
    fg->connect(copy2, 0, snk1, 0)->set_custom_buffer(CUDA_BUFFER_ARGS_D2H);
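
To round out the picture, the remainder of such a test just runs the graph and compares the sink output to the source input. A rough sketch, where the vector source/sink block names and the start/wait calls are assumptions rather than the verbatim newsched API:

    // hypothetical source/sink and run calls around the connections above
    auto src = blocks::vector_source_f::make(input_data);
    auto snk1 = blocks::vector_sink_f::make();

    fg->start();
    fg->wait();

    // the chain of copy blocks should be transparent end to end
    EXPECT_EQ(snk1->data(), input_data);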