Metal accelerated kernels#

Arithmetic kernels#

template<typename T> class add#

template<typename T> class sub#

template<typename T> class div#

Divides each element of input1 by the corresponding element of input2.

auto input1 = tensor<float>({{3.0, 6.0, 9.0}});
auto input2 = tensor<float>({{1.0, 2.0, 3.0}});

auto accelerator = hardware_accelerator();
auto div = kernel::div<float>(accelerator);

auto output = div(input1, input2);
std::cout << output.get() << std::endl;
// out:
// [[3.0, 3.0, 3.0]], sizes=(1, 3)

Note

The kernel performs true division and does not support type promotion, so both inputs must have the same element type T.

Public Functions

inline div(hardware_accelerator &gpu)#: The kernel constructor.

template<immutable_tensor_t<T> Input1, immutable_tensor_t<T> Input2> inline auto operator()(Input1 input1, Input2 input2)#

Invokes the kernel.

Parameters:

input1 – the dividend
input2 – the divisor

Returns:

a future_tensor with the result.

template<immutable_tensor_t<T> Input1, immutable_tensor1_t<T> Input2> inline auto operator()(Input1 input1, Input2 input2)#

Invokes the kernel, broadcasting the 1-dimensional divisor along the last dimension of the dividend.

Parameters:

input1 – the dividend
input2 – the divisor

Returns:

a future_tensor with the result.
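
The broadcasting overload can be pictured with a plain CPU reference. The sketch below is not the Metal kernel and does not use the metalchat API; the function name `div_broadcast_reference` and the row-major `std::vector` layout are assumptions made for illustration. It assumes that the 1-dimensional divisor has the same length as the last dimension of the dividend and is reused for every row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference only (hypothetical helper, not part of metalchat).
// `input1` is a row-major (rows x cols) matrix, `input2` is a vector of
// length `cols` that is broadcast over every row of `input1`.
std::vector<float>
div_broadcast_reference(
    const std::vector<float>& input1,
    std::size_t rows,
    std::size_t cols,
    const std::vector<float>& input2
)
{
    assert(input1.size() == rows * cols);
    assert(input2.size() == cols);

    std::vector<float> output(input1.size());
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // The same divisor element is reused for column j of every row.
            output[i * cols + j] = input1[i * cols + j] / input2[j];
        }
    }
    return output;
}
```

With a dividend of `{{3, 6, 9}, {2, 4, 6}}` and a divisor of `{1, 2, 3}`, this reference yields `{{3, 3, 3}, {2, 2, 2}}`.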

template<typename T> class cumsum#

template<typename T> class hadamard#

template<typename T> class scalar_mul#

Comparison kernels#

template<typename T> class gt#

template<typename T> class sort#

Batched matrix multiplication#

template<typename T, std::size_t BlockSize = 8> class bmm#

Copying kernels#

template<typename T> class clone#

Creates a copy of a tensor.

The Metal kernel implementation supports copying only 2-dimensional tensors. Since every dimension beyond the first one (a vector) is simply a batch dimension, all of them can be collapsed into a single batch dimension.

The tensor produced by the future operation is also 2-dimensional. To retain the original dimensionality, the caller must keep the original output tensor or adjust the shape of the resulting tensor as needed.

Note

The operation is executed asynchronously on the GPU, so the output tensor must be allocated in GPU memory beforehand.
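
The collapsing of batch dimensions described above can be sketched as a small shape computation. This is an illustration only: `collapse_to_2d` is a hypothetical helper, not a metalchat function, and it assumes that the last dimension is the one kept as the vector length while all leading dimensions are multiplied into the batch size.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper, not part of metalchat: collapses an N-dimensional
// shape into (batch, last). The last dimension is kept as the vector
// length (an assumption); all leading dimensions are folded into `batch`.
std::pair<std::size_t, std::size_t>
collapse_to_2d(const std::vector<std::size_t>& sizes)
{
    assert(!sizes.empty());

    const std::size_t last = sizes.back();
    const std::size_t batch = std::accumulate(
        sizes.begin(), sizes.end() - 1, std::size_t{1}, std::multiplies<std::size_t>()
    );
    return {batch, last};
}
```

Under this assumption, a tensor of sizes `(2, 3, 4)` is viewed as `(6, 4)` for the copy, and the caller would reshape the result back to `(2, 3, 4)` if the original dimensionality is needed.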

Public Functions

inline clone(hardware_accelerator &accelerator)#: The kernel constructor.

template<immutable_tensor_t<T> Input, immutable_hardware_tensor_t<T> Output> inline auto operator()(Input input, Output output)#

Invokes the kernel.

Parameters:

input – a tensor to clone data from.
output – a tensor to clone data to.

Returns:

a future_tensor with the data copied from an input tensor.

template<immutable_tensor_t<T> Input> inline auto operator()(Input input)#

Creates an output tensor like the input and invokes the kernel.

Parameters:

input – a tensor to clone the data from.

Returns:

a future_tensor with the data copied from an input tensor.

template<typename T> class gather#

Gathers values given the index tensor.

auto T = tensor<float>({{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}});
auto index = tensor<int32_t>({{0, 0}, {1, 0}});

auto accelerator = hardware_accelerator();
auto gather = kernel::gather<float>(accelerator);

auto output = gather(T, index);
std::cout << output.get() << std::endl;
// out:
// [[1.0, 1.0],
//  [5.0, 4.0]], sizes=(2, 2)

Note

The current implementation treats all tensors as 2-dimensional with dimension 0 as a batch dimension, and gathers elements only along dimension 0.
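
A CPU reference that reproduces the example output above may help when reading the index semantics. The sketch is an interpretation inferred from that example, not the Metal kernel: each row of the index tensor selects elements from the same row of the input tensor. The name `gather_reference` and the row-major `std::vector` layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference only (hypothetical helper, not part of metalchat).
// `input` is row-major (rows x input_cols), `index` is row-major
// (rows x index_cols). For every row i: output[i][j] = input[i][index[i][j]].
std::vector<float>
gather_reference(
    const std::vector<float>& input,
    std::size_t rows,
    std::size_t input_cols,
    const std::vector<std::int32_t>& index,
    std::size_t index_cols
)
{
    assert(input.size() == rows * input_cols);
    assert(index.size() == rows * index_cols);

    std::vector<float> output(index.size());
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < index_cols; ++j) {
            // A negative index wraps to a huge value and fails the check below.
            const auto k = static_cast<std::size_t>(index[i * index_cols + j]);
            assert(k < input_cols);
            output[i * index_cols + j] = input[i * input_cols + k];
        }
    }
    return output;
}
```

Called with the input `{{1, 2, 3}, {4, 5, 6}}` and the index `{{0, 0}, {1, 0}}`, the reference returns `{{1, 1}, {5, 4}}`, matching the kernel example.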

Public Functions

inline gather(hardware_accelerator &gpu)#: The kernel constructor.

template<immutable_tensor_t<T> Input, immutable_tensor_t<int32_t> Index> inline auto operator()(Input input, Index index)#

Invokes the kernel.

Parameters:

input – an input tensor to gather values from.
index – an index tensor that specifies locations of elements within input tensor.

Returns:

a future_tensor with the elements gathered from an input tensor.

template<typename T> class roll#

Rolls the tensor along the given dimension. Elements shifted beyond the last position are reintroduced at the first position. The tensor is always flattened before rolling and then restored to its original shape.
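
The wrap-around behaviour on a flattened tensor can be sketched on the CPU as follows. This is not the Metal kernel: `roll_flat_reference` is a hypothetical name, the sketch operates on an already flattened buffer and ignores the `dim` argument, and treating a negative shift as a roll towards the first position is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference only (hypothetical helper, not part of metalchat).
// Rolls a flattened buffer: the element at position i moves to
// position (i + shift) mod n.
std::vector<float>
roll_flat_reference(const std::vector<float>& input, std::int32_t shift)
{
    std::vector<float> output(input.size());
    if (input.empty()) {
        return output;
    }

    const auto n = static_cast<std::int64_t>(input.size());
    // Normalise the shift into [0, n), so that negative shifts also work
    // (assumed behaviour, not confirmed by the kernel documentation).
    const std::int64_t s = ((shift % n) + n) % n;

    for (std::int64_t i = 0; i < n; ++i) {
        output[static_cast<std::size_t>((i + s) % n)] = input[static_cast<std::size_t>(i)];
    }
    return output;
}
```

For the buffer `{1, 2, 3, 4, 5}` and a shift of 2, the reference returns `{4, 5, 1, 2, 3}`: the last two elements wrap around to the front.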

Public Functions

inline roll(hardware_accelerator &accelerator)#: The kernel constructor.

template<immutable_tensor_t<T> Input> inline auto operator()(Input input, int32_t shift, std::size_t dim)#

Invokes the kernel.

Parameters:

input – an input tensor.
shift – the number of places by which the elements of the tensor are shifted.
dim – an axis along which to roll.

Returns:

a future_tensor with elements rolled along the specified dimension.

template<immutable_tensor_t<T> Input, immutable_tensor_t<T> Output> inline auto operator()(Input input, Output output, int32_t shift, std::size_t dim)#

Invokes the kernel.

Parameters:

input – an input tensor.
output – an output tensor.
shift – the number of places by which the elements of the tensor are shifted.
dim – an axis along which to roll.

Returns:

a future_tensor with elements rolled along the specified dimension.

template<typename T> class scatter#

Writes values into the tensor at the specified indices.

Warning

When indices are not unique, the behaviour is non-deterministic.

Public Functions

inline scatter(hardware_accelerator &gpu)#: The kernel constructor.

template<immutable_tensor_t<T> Output, immutable_tensor_t<bool> Mask> inline auto operator()(Output output, Mask mask, T value)#

Invokes the kernel and writes a single value to the output tensor at every position where the specified boolean mask is true.

auto T = tensor<float>({{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}});
auto M = tensor<bool>({{true, false, false}, {false, true, true}});

auto accelerator = hardware_accelerator();
auto scatter = kernel::scatter<float>(accelerator);

auto output = scatter(T, M, 9.0);
std::cout << output.get() << std::endl;
// out:
// [[9.0, 2.0, 3.0],
//  [4.0, 9.0, 9.0]], sizes=(2, 3)

Parameters:

output – a tensor to write data to.
mask – a boolean mask tensor (must have the same size as the output tensor).
value – a value to write.

Returns:

a future_tensor with the kernel operation result.
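
The masked write can be expressed as a short CPU reference. The sketch is not the Metal kernel; `masked_fill_reference` is a hypothetical name, and the function takes the output buffer by value and returns the modified copy, whereas the kernel writes to the tensor it is given.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference only (hypothetical helper, not part of metalchat).
// Writes `value` into every position of `output` where `mask` is true
// and leaves all other positions unchanged.
std::vector<float>
masked_fill_reference(std::vector<float> output, const std::vector<bool>& mask, float value)
{
    assert(output.size() == mask.size());

    for (std::size_t i = 0; i < output.size(); ++i) {
        if (mask[i]) {
            output[i] = value;
        }
    }
    return output;
}
```

With the flattened data and mask from the example above, the reference produces `{9, 2, 3, 4, 9, 9}`.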

Sparse kernels#

template<typename T> class embedding#

template<typename T> class rope#

template<typename T> class rope_freqs#

Non-linear activation kernels#

template<typename T> class rmsnorm#

template<typename T> class silu#: Applies the Sigmoid Linear Unit (SiLU) function, element-wise.
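
For reference, SiLU is defined as silu(x) = x * sigmoid(x) = x / (1 + exp(-x)). The CPU sketch below applies that formula element-wise; it is not the Metal kernel, and `silu_reference` is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference only (hypothetical helper, not part of metalchat).
// Applies silu(x) = x / (1 + exp(-x)) to every element of the input.
std::vector<float>
silu_reference(const std::vector<float>& input)
{
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        const float x = input[i];
        output[i] = x / (1.0f + std::exp(-x));
    }
    assert(output.size() == input.size());
    return output;
}
```

The function is zero at x = 0, approaches x for large positive inputs, and approaches zero for large negative inputs.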

template<typename T> class softmax#