Kernel library#

Metal accelerated kernels

This library provides low-level operations accelerated with the Metal framework. Include it as follows:

#include <metalchat/kernel.h>

Hardware accelerator#

class hardware_accelerator#

Hardware accelerator is an abstraction of the kernel execution pipeline.

The accelerator is responsible for the whole lifecycle of Metal kernels: creating kernels from a library, executing and scheduling kernels, and allocating tensors in GPU memory.

The hardware accelerator can be copied. Modifications of the allocator are propagated to all copies of the hardware accelerator.
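
The copy semantics described above can be pictured with a small stand-alone model. This is only a sketch, assuming the copies share their state through a reference-counted pointer; `accelerator_model` and its string-valued allocator are hypothetical names used for illustration and are not part of the library.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for hardware_accelerator: all copies point at the
// same shared state, so a change made through one copy is seen by the others.
class accelerator_model {
public:
    accelerator_model() : state_(std::make_shared<state>()) {}

    void set_allocator(std::string name) { state_->allocator_name = std::move(name); }
    const std::string& get_allocator() const { return state_->allocator_name; }

private:
    struct state {
        std::string allocator_name = "default";
    };
    std::shared_ptr<state> state_;
};
```

In this model, calling `set_allocator` on a copy changes what `get_allocator` returns on the original as well.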

Public Types

using allocator_type = polymorphic_hardware_allocator<void>#: A type of the hardware memory allocator used to either allocate or transfer memory of tensors within a running kernel thread.

Public Functions

hardware_accelerator(const std::filesystem::path &path, std::size_t thread_capacity = 64)#

Create hardware accelerator from the kernel (shader) library.

You can create a new hardware accelerator in the following way:

auto gpu = hardware_accelerator("metalchat.metallib");

Parameters:

path – Specifies a location of the compiled Metal shaders library.
thread_capacity – Specifies the size of the command buffer. Commands are executed in batches of thread_capacity size. Kernels are not scheduled until the buffer is filled with the configured number of kernels or execution is explicitly triggered (usually by calling the future_tensor::get method).
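
The batching rule described for thread_capacity can be sketched without any GPU code. The model below is an assumption-laden illustration: `command_buffer_model`, `push`, `flush` and `pending` are hypothetical names, and commands run inline rather than on a device.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of a command buffer with a fixed capacity. Commands are
// held back until the buffer is full or flush() is called explicitly.
class command_buffer_model {
public:
    explicit command_buffer_model(std::size_t capacity) : capacity_(capacity) {}

    void push(std::function<void()> command) {
        pending_.push_back(std::move(command));
        if (pending_.size() >= capacity_) {
            flush();
        }
    }

    // Explicit trigger, comparable to waiting on a result.
    void flush() {
        std::vector<std::function<void()>> batch;
        batch.swap(pending_);
        for (auto& command : batch) {
            command();
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t capacity_;
    std::vector<std::function<void()>> pending_;
};
```

With a capacity of 3, the first two pushed commands stay pending; the third push fills the buffer and runs the whole batch.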

explicit hardware_accelerator(std::size_t thread_capacity = 64)#

Create hardware accelerator from within a bundle.

When the library is distributed as a bundle, the shader library can be loaded from the bundle. This constructor looks up the distribution bundle and loads the shader library named metalchat.metallib.

std::string name() const#: Get name of the hardware accelerator.

metal::shared_device get_metal_device()#: Return a shared pointer to the underlying Metal Device.

allocator_type get_allocator() const#

Return an allocator associated with the current thread.

Use the hardware_accelerator::set_allocator method to set a new allocator for the currently running thread.

void set_allocator(allocator_type alloc)#

Set allocator to the current thread.

The hardware accelerator uses a polymorphic allocator so that the implementation can be changed during kernel queue scheduling. The allocator is essentially used to transfer all tensors allocated outside of GPU memory into GPU memory.

Note

You can explore a variety of different allocators in Allocator library.

template<hardware_allocator_t<void> Allocator> inline void set_allocator(Allocator &&alloc)#: Set allocator to the current thread.

const basic_kernel &load(const std::string &name)#

Load the kernel from the kernel library.

The accelerator caches kernels, so each kernel is loaded only once, on the first call. A kernel returned from this method is attached to a metalchat::recursive_kernel_thread and can be used to create a kernel task.

Example:

using namespace metalchat;

auto gpu = hardware_accelerator();
auto kernel = gpu.load<float, 16>("hadamard");

auto output = future_tensor(empty<float>({32}, gpu));
auto input1 = future_tensor(rand<float>({32}, gpu));
auto input2 = future_tensor(rand<float>({32}, gpu));

// Schedule a kernel task with 2 thread groups of 16 threads each.
auto task = kernel_task(kernel, dim3(32), dim3(16));

// This kernel expects output tensor as the first argument.
auto packaged_task = task.bind_front(output, input1, input2);
auto result = future_tensor(output, std::move(packaged_task));

// Block the current thread, until the result is ready.
result.get();
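
The caching behaviour of load can be pictured with a plain map keyed by kernel name. This is a minimal sketch, assuming a name-to-kernel lookup table; `kernel_model`, `kernel_cache_model` and the `loads` counter are hypothetical and only illustrate that the lookup happens once per name.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a compiled kernel.
struct kernel_model {
    std::string name;
};

// Hypothetical cache: the expensive lookup runs only on the first request
// for a given name; later requests return the stored entry.
class kernel_cache_model {
public:
    const kernel_model& load(const std::string& name) {
        auto it = kernels_.find(name);
        if (it == kernels_.end()) {
            ++loads_;  // Counts how often the "library" is actually consulted.
            it = kernels_.emplace(name, kernel_model{name}).first;
        }
        return it->second;
    }

    std::size_t loads() const { return loads_; }

private:
    std::unordered_map<std::string, kernel_model> kernels_;
    std::size_t loads_ = 0;
};
```

Returning a reference is safe in this model because std::unordered_map keeps references to its elements valid when new entries are inserted.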

const basic_kernel &load(const std::string &name, const std::string &type)#

Load the kernel from the kernel library.

This is a convenience method that appends the type to the kernel name, producing name_type, so users do not need to format the kernel name manually.

template<typename T> inline const basic_kernel &load(const std::string_view &name)#

Load the kernel from the kernel library.

This is a convenience method that loads kernels with names in the following format: {name}_{data_type}.

template<typename T1, typename T2, typename ...TN> inline const basic_kernel &load(const std::string_view &name)#

Load the kernel from the kernel library.

This is a convenience method that loads kernels with names in the following format: {name}_{data_type1}_{data_type2}_....

template<typename T> inline const basic_kernel &load(const std::string_view &name, std::size_t block_size)#

Load the kernel from the kernel library.

This is a convenience method that loads kernels with names in the following format: {name}_{block_size}_{data_type}.
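
The naming formats used by the load overloads can be reproduced with a few string helpers. The sketch below is illustrative only: `type_suffix` and `kernel_name` are hypothetical, and the suffixes the real shader library uses for each data type may differ from the ones assumed here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical type-to-suffix mapping; the suffixes used by the real shader
// library may differ.
template <typename T> struct type_suffix;
template <> struct type_suffix<float> { static constexpr const char* value = "float"; };
template <> struct type_suffix<int> { static constexpr const char* value = "int"; };

// {name}_{data_type}
template <typename T>
std::string kernel_name(const std::string& name) {
    return name + "_" + type_suffix<T>::value;
}

// {name}_{data_type1}_{data_type2}_...
template <typename T1, typename T2, typename... TN>
std::string kernel_name(const std::string& name) {
    return kernel_name<T2, TN...>(name + "_" + type_suffix<T1>::value);
}

// {name}_{block_size}_{data_type}
template <typename T>
std::string kernel_name(const std::string& name, std::size_t block_size) {
    return name + "_" + std::to_string(block_size) + "_" + type_suffix<T>::value;
}
```

Under these assumptions, `kernel_name<float>("hadamard", 16)` yields `hadamard_16_float`.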

Basic kernel#

class basic_kernel#

Binary kernel wrapper#

template<typename T> class binary_kernel_wrapper#

Kernel task#

struct dim3#

The type used to specify the dimensions of the GPU compute grid (thread group). When defining a variable of type dim3, any value left unspecified is initialized to 1.

Public Members

const std::size_t x#: X value of a 3-dimensional vector.

const std::size_t y#: Y value of a 3-dimensional vector.

const std::size_t z#: Z value of a 3-dimensional vector.
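
The defaulting rule can be illustrated with a stand-alone type. This is a sketch under the assumption that the defaults come from constructor arguments; `dim3_model` and its `numel` helper are hypothetical and not part of the library.

```cpp
#include <cstddef>

// Hypothetical re-creation of a dim3-like type: components that are not
// given explicitly default to 1.
struct dim3_model {
    const std::size_t x;
    const std::size_t y;
    const std::size_t z;

    constexpr dim3_model(std::size_t x_, std::size_t y_ = 1, std::size_t z_ = 1)
    : x(x_), y(y_), z(z_) {}

    // Total number of elements covered by the three dimensions.
    constexpr std::size_t numel() const { return x * y * z; }
};
```

Here `dim3_model(32)` describes a 32x1x1 extent, and `dim3_model(16, 16)` a 16x16x1 extent.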

template<immutable_tensor... Args> class kernel_task#

A class that wraps a Metal kernel task, arguments for the task, and scheduler parameters.

Tasks are executed asynchronously on a hardware_accelerator. This implies that before scheduling a task execution all specified arguments must be bound to the task using either kernel_task::bind_front or bind_back methods.

Most commonly, tasks are used as asynchronously invocable instances for future_tensor, so that the operation producing a tensor's result can be awaited asynchronously.

Warning

Usually there is no need to create a kernel task manually, as a kernel usually creates one for you and passes the required arguments correctly. You can explore the collection of available kernels in Metal accelerated kernels.

Warning

When a kernel task is used with future_tensor, consider moving the ownership of the task to the future tensor with the respective constructor and std::move, so that memory held by the dependent tensors (kernel task arguments) is released on kernel completion.

Public Functions

kernel_task(const kernel_task &task) noexcept = default#: The copy constructor of the kernel_task.

inline kernel_task(basic_kernel kernel, dim3 grid, dim3 thread, Args... args)#

Creates a new kernel task with the specified kernel function and hardware grid configuration.

auto accelerator = hardware_accelerator();
auto kernel = accelerator.load<float, 16>("hadamard");

// Create a kernel task with a 4x4 layout of thread groups (16 in total), each of size 16x16x1.
auto task = kernel_task(kernel, dim3(64, 64), dim3(16, 16));

Parameters:

kernel – a kernel function instance.
grid – total size of 3-dimensional GPU compute grid.
thread – a size of 3-dimensional GPU compute thread group.
args – optional kernel arguments.
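
Since grid is the total size and thread is the size of one thread group, the number of thread groups follows from dividing one by the other per dimension. The helper below is a sketch, assuming the division rounds up when the grid is not a multiple of the thread group size (the library may instead require exact multiples); `extent3`, `groups_along` and `thread_groups` are hypothetical names.

```cpp
#include <cstddef>

struct extent3 {  // hypothetical stand-in for dim3
    std::size_t x = 1, y = 1, z = 1;
};

// Number of thread groups needed along one dimension (rounded up).
constexpr std::size_t groups_along(std::size_t grid, std::size_t thread) {
    return (grid + thread - 1) / thread;
}

// Total number of thread groups for a grid and a thread group size.
constexpr std::size_t thread_groups(extent3 grid, extent3 thread) {
    return groups_along(grid.x, thread.x) * groups_along(grid.y, thread.y)
           * groups_along(grid.z, thread.z);
}
```

For a 64x64 grid with 16x16 thread groups this gives 4 groups along each of x and y; for a grid of 32 with groups of 16 it gives 2.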

inline std::shared_future<void> operator()()#: Schedules execution of the stored kernel. The method returns a shared future that can be used to await task completion.

Warning

The function call operator can be called only once for each kernel_task.

inline std::shared_future<void> operator()(std::function<void()> callback)#

Schedules execution of the stored kernel. The method returns a shared future that can be used to await task completion.

This method accepts an arbitrary callback function that is executed before the returned future object is released.

Warning

The function call operator can be called only once for each kernel_task.
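
The one-shot contract and the callback ordering can be modelled with std::promise. This is a minimal sketch, not the library's implementation: `one_shot_task_model` is hypothetical, the work runs inline instead of on an accelerator, and throwing std::runtime_error on a second call is an assumption made for the illustration.

```cpp
#include <functional>
#include <future>
#include <stdexcept>
#include <utility>

// Hypothetical one-shot task: the call operator may be used only once, and
// the optional callback runs before the returned future becomes ready.
class one_shot_task_model {
public:
    explicit one_shot_task_model(std::function<void()> work) : work_(std::move(work)) {}

    std::shared_future<void> operator()(std::function<void()> callback = nullptr) {
        if (invoked_) {
            throw std::runtime_error("task has already been invoked");
        }
        invoked_ = true;

        std::promise<void> promise;
        std::shared_future<void> future = promise.get_future().share();

        work_();  // Executed inline here; a real accelerator defers this.
        if (callback) {
            callback();  // Runs before the future is released to waiters.
        }
        promise.set_value();
        return future;
    }

private:
    std::function<void()> work_;
    bool invoked_ = false;
};
```

Anyone waiting on the returned future therefore observes the effects of both the work and the callback.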

inline void encode(hardware_function_encoder encoder)#

Encode the kernel and all its arguments with the specified encoder.

The encoding process implies setup of kernel arguments (tensors), data offsets, and kernel dependencies (outputs from other kernels).

Note

This method is called by a kernel_thread when the kernel is scheduled for execution through one of the overloaded function call operators, kernel_task::operator()(std::function<void()>) or kernel_task::operator()(), so there is no need to call it manually.

inline void make_ready_at_thread_exit()#

Immediately schedules execution of the kernel task by a hardware accelerator.

The accelerator keeps a queue of tasks and executes them in batches: once a batch is assembled, the accelerator starts processing it. Calling this method changes that behaviour, so that processing starts right away for all tasks in the buffer.

Throws std::runtime_error when called for a task that has not been invoked (pushed to a command buffer) with either kernel_task::operator()() or kernel_task::operator()(std::function<void()>).

template<immutable_tensor... FrontArgs> inline kernel_task<FrontArgs..., Args...> bind_front(FrontArgs... front_args)#

Returns a new kernel task with bound arguments at positions starting from the beginning of the task arguments sequence.

The kernel task expects all arguments to be tensors, since the kernel should be encodable to the hardware kernel queue.

Note

The bound arguments are shallow copies of the tensors: the tensor layout (sizes, strides, offsets) is preserved, but the data might be modified through another tensor that shares the same underlying contiguous container.

Template Parameters:

FrontArgs – a sequence of argument types to bind.

Parameters:

front_args – a sequence of arguments to bind.
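
What a shallow copy means for bound arguments can be shown with a handle that shares its storage. The sketch assumes storage is held through a reference-counted pointer; `tensor_model` and `make_tensor` are hypothetical names and omit strides for brevity.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical tensor handle: copying it copies the layout fields and the
// pointer, but not the elements the pointer refers to.
struct tensor_model {
    std::shared_ptr<std::vector<float>> storage;
    std::size_t offset = 0;
    std::size_t size = 0;

    float& at(std::size_t i) const { return (*storage)[offset + i]; }
};

// Builds a handle over freshly allocated, zero-initialized storage.
inline tensor_model make_tensor(std::size_t size) {
    return tensor_model{std::make_shared<std::vector<float>>(size, 0.0f), 0, size};
}
```

A write through the original handle is visible through any copy bound earlier, while each copy keeps its own offset and size.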

template<immutable_tensor... BackArgs> inline kernel_task<Args..., BackArgs...> bind_back(BackArgs... back_args)#

Returns a new kernel task with bound arguments appended to the end of the task arguments sequence.

The kernel task expects all arguments to be tensors, since the kernel should be encodable to the hardware kernel queue.

Template Parameters:

BackArgs – a sequence of argument types to bind.

Parameters:

back_args – a sequence of arguments to bind.
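
The way bind_front and bind_back reshape the argument list can be mirrored with std::tuple. The model below is a sketch that tracks only argument order; `task_model` is hypothetical and accepts any copyable type, whereas the real kernel task expects tensors.

```cpp
#include <tuple>
#include <utility>

// Hypothetical task that only tracks its argument list.
template <typename... Args>
struct task_model {
    std::tuple<Args...> args;

    // New task with the given arguments placed before the existing ones.
    template <typename... Front>
    task_model<Front..., Args...> bind_front(Front... front) const {
        return {std::tuple_cat(std::make_tuple(front...), args)};
    }

    // New task with the given arguments placed after the existing ones.
    template <typename... Back>
    task_model<Args..., Back...> bind_back(Back... back) const {
        return {std::tuple_cat(args, std::make_tuple(back...))};
    }
};
```

Each call returns a new task type whose parameter pack reflects the new order, and the task it was called on is left unchanged.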

inline std::string name() const#: Returns the name of the kernel.

Kernel thread#

class kernel_thread#

Public Functions

bool joinable() const#

Checks if the kernel_thread object identifies an active thread of execution.

Specifically, returns true if the kernel thread is not committed and there are open slots available to encode new functions.
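
The joinable condition combines two checks, which the following model makes explicit. It is only a sketch: `kernel_thread_model`, `encode` and `commit` are hypothetical, and the slot count is assumed to be a fixed capacity.

```cpp
#include <cstddef>

// Hypothetical thread model: joinable while it is not committed and still
// has free slots for new functions.
class kernel_thread_model {
public:
    explicit kernel_thread_model(std::size_t capacity) : capacity_(capacity) {}

    bool joinable() const { return !committed_ && size_ < capacity_; }

    // Returns false when the thread can no longer accept functions.
    bool encode() {
        if (!joinable()) {
            return false;
        }
        ++size_;
        return true;
    }

    void commit() { committed_ = true; }

private:
    std::size_t capacity_;
    std::size_t size_ = 0;
    bool committed_ = false;
};
```

In this model a thread stops being joinable either when its last slot is taken or when it is committed, whichever happens first.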

Recursive kernel thread#

class recursive_kernel_thread#