String tokenization#

These are the building blocks of the tokenization mechanism, which chops an input character sequence into grouped units of characters (tokens). The library can be used as follows:

#include <metalchat/text.h>

using namespace metalchat::text;

Byte pair encoder#

template<typename RegularExpression> class byte_pair_encoder#

Token encoder that splits an arbitrary UTF-8 encoded string into a sequence of tokens that can be used to run inference with a transformer language model. The approach and the implementation are inspired by tiktoken.

Constructors of this class require a token map, given either as a path or as an input stream. Such a map is distributed together with the model (for Llama, for example, it is called tokenizer.model). When the provided file does not exist or has an invalid format, the constructor raises an exception.

Here is an example of a tokenizer model: the first column holds a base64-encoded token, and the second column holds its token identifier (of type byte_pair_encoder::index_type):

4LmM4LiB4Lij 0
zrbOsQ== 1
IOuNlOyasQ== 2
2YjZhNin2Ko= 3

Consider the following basic example:

using namespace metalchat::text;

byte_pair_encoder<text::regexp> tokenizer("tokenizer.model");
auto tokens = tokenizer.encode("This is a test sentence.");
auto string = tokenizer.decode(tokens.begin(), tokens.end());

std::cout << string << std::endl;
// output: This is a test sentence.

Public Types

using index_type = int32_t#: Type used to indicate the position of a token in the model (token dictionary).

Public Functions

byte_pair_encoder(const byte_pair_encoder&) = default#: The byte_pair_encoder copy constructor.

inline byte_pair_encoder(std::istream &is, const std::string &token_regex)#

Create an instance of a byte-pair encoder using a base64-encoded token map.

This constructor reads the token map from the specified input stream line by line and decodes the base64-encoded tokens.

Parameters:

is – An input stream containing the tokenizer model.
token_regex – A regular expression to split the input string into tokens.
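The line-by-line reading described above can be sketched without the library. This is a minimal illustration, assuming the two-column format shown earlier (base64 token, whitespace, integer identifier). The helpers `base64_decode` and `load_token_map` are hypothetical names introduced here and are not part of metalchat.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical helper (not a metalchat function): decode standard base64 into raw bytes.
inline std::string base64_decode(const std::string& input)
{
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string output;
    std::uint32_t buffer = 0; // accumulates 6-bit groups
    int bits = 0;             // number of valid low bits in the buffer

    for (char c : input) {
        if (c == '=') {
            break; // padding marks the end of the payload
        }
        const auto pos = alphabet.find(c);
        if (pos == std::string::npos) {
            throw std::invalid_argument("invalid base64 character");
        }
        buffer = (buffer << 6) | static_cast<std::uint32_t>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            output.push_back(static_cast<char>((buffer >> bits) & 0xFF));
        }
    }
    return output;
}

// Hypothetical helper: read "<base64 token> <id>" lines into a token map.
inline std::unordered_map<std::string, std::int32_t> load_token_map(std::istream& is)
{
    std::unordered_map<std::string, std::int32_t> tokens;
    std::string line;

    while (std::getline(is, line)) {
        if (line.empty()) {
            continue; // tolerate blank lines (an assumption of this sketch)
        }
        std::istringstream fields(line);
        std::string encoded;
        std::int32_t id = 0;
        if (!(fields >> encoded >> id)) {
            throw std::runtime_error("malformed token map line: " + line);
        }
        tokens.emplace(base64_decode(encoded), id);
    }
    return tokens;
}
```

Passing a `std::istringstream` that holds the four sample lines to `load_token_map` yields a map with four entries whose keys are the raw UTF-8 bytes of each token.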

inline byte_pair_encoder(const std::filesystem::path &p, const std::string &token_regex)#

Create an instance of a byte-pair encoder using a base64-encoded token map.

This constructor lets you specify a custom token regular expression that best fits the target language model.

Parameters:

p – A path to the tokenizer model.
token_regex – A regular expression to split the input string into tokens.

inline byte_pair_encoder(const char *path, const std::string &token_regex)#

Convenience constructor that interprets the path argument as a path to the tokenizer model.

Parameters:

path – A path to the tokenizer model.
token_regex – A regular expression to split the input string into tokens.

inline void insert(const std::string &value, index_type key, tokenkind kind = token::regular)#

Insert a new token-pair into the encoder.

Parameters:

value – A string representation of a token.
key – Target encoding of a token (a position in the token embedding).
kind – A type of the token, used for special token binding.

inline void insert_back(const std::string &value, tokenkind kind = token::regular)#

Insert a new token by binding it to the last position (in the token embedding).

Parameters:

value – A string representation of a token.
kind – A type of the token, used for special token binding.
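The difference between insert and insert_back can be illustrated with a toy container. This is a sketch under assumptions: `toy_vocabulary` is a hypothetical stand-in rather than the metalchat class, it omits the kind argument, and it assumes that "the last position" equals the current number of tokens (dense, zero-based indices).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the encoder's token storage (not the metalchat class).
class toy_vocabulary {
public:
    using index_type = std::int32_t;

    // Bind `value` to an explicit position `key`.
    void insert(const std::string& value, index_type key)
    {
        if (by_value_.count(value) != 0 || by_key_.count(key) != 0) {
            throw std::invalid_argument("token or position is already bound");
        }
        by_value_.emplace(value, key);
        by_key_.emplace(key, value);
    }

    // Bind `value` to the position right after the current last one.
    // Assumption of this sketch: positions are dense and start at zero.
    void insert_back(const std::string& value)
    {
        insert(value, static_cast<index_type>(size()));
    }

    std::size_t size() const { return by_value_.size(); }

    index_type at(const std::string& value) const { return by_value_.at(value); }

private:
    std::unordered_map<std::string, index_type> by_value_;
    std::unordered_map<index_type, std::string> by_key_;
};
```

With two tokens already inserted at positions 0 and 1, calling `insert_back` binds the new token to position 2 in this sketch.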

inline std::size_t size() const#: Returns the number of all available tokens in the encoder.

template<std::output_iterator<index_type> OutputIt> inline void encode(const std::string &s, OutputIt output) const#

Encode the provided string into tokens.

This method iteratively splits the string into tokens and writes the index of each token to the output iterator. When a token is not present in the token dictionary, it is divided into byte pairs, and the index of each byte pair is written to the output instead.
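The byte-pair fallback can be sketched in the style of tiktoken's merge loop. This is an illustration, not the metalchat implementation: `bpe_encode` and `rank_map` are hypothetical names, and the sketch assumes that every single byte of the input has an entry in the map and that a lower index means an earlier merge.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical alias: maps a byte sequence to its index in the token dictionary.
using rank_map = std::unordered_map<std::string, std::int32_t>;

// Split `piece` into bytes, then repeatedly merge the adjacent pair with the
// lowest index until no adjacent pair is present in the dictionary.
template <typename OutputIt>
void bpe_encode(const std::string& piece, const rank_map& ranks, OutputIt output)
{
    std::vector<std::string> parts;
    for (char c : piece) {
        parts.emplace_back(1, c); // every byte starts as its own part
    }

    while (parts.size() > 1) {
        std::size_t best = parts.size(); // sentinel: no mergeable pair found
        std::int32_t best_rank = std::numeric_limits<std::int32_t>::max();

        for (std::size_t i = 0; i + 1 < parts.size(); ++i) {
            const auto it = ranks.find(parts[i] + parts[i + 1]);
            if (it != ranks.end() && it->second < best_rank) {
                best = i;
                best_rank = it->second;
            }
        }
        if (best == parts.size()) {
            break;
        }
        parts[best] += parts[best + 1];
        parts.erase(parts.begin() + static_cast<std::ptrdiff_t>(best + 1));
    }

    for (const auto& part : parts) {
        *output++ = ranks.at(part); // throws std::out_of_range for an unknown byte
    }
}
```

The output iterator parameter mirrors the documented signature, so the function can be called with `std::back_inserter` on a `std::vector<std::int32_t>`.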

inline index_type encode(tokenkind kind) const#

Encode a special token.

The method returns the position of a special token within the tokenizer model. When the token is of the token::regular kind, the method raises an exception. Regular token encoding is available through the encode(const std::string&, OutputIt) const method.

template<std::output_iterator<index_type> OutputIt> inline void encode(tokenkind kind, OutputIt output) const#

Encode a special token.

The method encodes the provided special token and writes the result to the output iterator.

inline const std::string decode(index_type id) const#

Decode a single position-encoded token to the string representation.

The method first attempts to find the token in the model token map, then queries the special tokens. If the token is not found, the method raises an exception.
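The lookup order described above can be sketched with two plain tables. `decode_token`, `token_table`, and the exception type are assumptions made for illustration. The documentation does not state which exception the library throws.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical alias: maps a token position to its string representation.
using token_table = std::unordered_map<std::int32_t, std::string>;

// Look the id up among regular tokens first, then among special tokens.
inline std::string decode_token(std::int32_t id,
                                const token_table& regular,
                                const token_table& special)
{
    const auto in_regular = regular.find(id);
    if (in_regular != regular.end()) {
        return in_regular->second;
    }
    const auto in_special = special.find(id);
    if (in_special != special.end()) {
        return in_special->second;
    }
    // The exception type is an assumption of this sketch.
    throw std::out_of_range("unknown token id " + std::to_string(id));
}
```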

template<std::forward_iterator ForwardIt, std::output_iterator<std::string> OutputIt> inline void decode(ForwardIt first, ForwardIt last, OutputIt output) const#

Iteratively decode a sequence of position-encoded tokens.

The result of decoding is sequentially written to the specified output iterator. If one of the tokens cannot be decoded, an exception is raised. All tokens successfully decoded before the exception was thrown are left in the output.

template<std::forward_iterator ForwardIt> inline std::string decode(ForwardIt first, ForwardIt last) const#

Iteratively decode a sequence of position-encoded tokens.

All decoded tokens are concatenated into the resulting string.

using metalchat::text::tokenkind = int32_t#

Specifies the kind of a token.

Tokens are used to transform a natural-language sentence into a vector of integers that map it to the embedding space of the respective language model. Some kinds of tokens instruct the model to behave in a specific way.

Regular expression#

The regular expression implementation in the C++ standard library does not support the Perl syntax used in the tiktoken library. This implementation uses the Perl Compatible Regular Expressions (PCRE) library to make Perl syntax available.

class regexp#

class regexp_iterator#

Regular expression iterator.

This iterator provides a convenient interface to access match group data. Every match is considered an element of the backing container, and the iterator returns matches sequentially until the last match.

Public Functions

regexp_iterator &operator++()#: Advance the iterator to the next regular expression match.

value_type operator*()#

Return the current match of the regular expression.

The method throws std::runtime_error when it is called on a terminated iterator.

bool operator!=(const regexp_iterator&)#

Compares two regular expression iterators.

For simplicity, the implementation is naive and only compares whether the iterators have reached the end.

regexp_iterator()#: Initialize the end-of-match-group iterator.

regexp_iterator(const regexp &regex, const std::string &input)#: Initializes the iterator, stores the address of regexp in a data member, and finds the first match in the input string to initialize the match group members.
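The iterator protocol described in this section can be sketched with the standard library. Note the substitution: `toy_match_iterator` is a hypothetical class that uses `std::regex` as a stand-in for the PCRE-backed regexp, so it does not support Perl syntax. It also assumes the pattern never matches an empty string, and that the regex and the input outlive the iterator.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for regexp_iterator, built on std::regex instead of PCRE.
class toy_match_iterator {
public:
    using value_type = std::string;

    // End-of-matches sentinel.
    toy_match_iterator() = default;

    // Store the addresses of the regex and the input, then find the first match.
    toy_match_iterator(const std::regex& regex, const std::string& input)
    : regex_(&regex), input_(&input), position_(input.cbegin()), done_(false)
    {
        advance();
    }

    toy_match_iterator& operator++()
    {
        advance();
        return *this;
    }

    value_type operator*() const
    {
        if (done_) {
            throw std::runtime_error("dereferencing a terminated iterator");
        }
        return match_.str();
    }

    // Naive comparison: only the "reached the end" state is compared.
    bool operator!=(const toy_match_iterator& other) const
    {
        return done_ != other.done_;
    }

private:
    void advance()
    {
        if (done_) {
            return;
        }
        if (!std::regex_search(position_, input_->cend(), match_, *regex_)
            || match_.length(0) == 0) {
            // No further match, or an empty match that would not advance
            // the position: terminate to guarantee progress.
            done_ = true;
            return;
        }
        position_ = match_[0].second; // continue right after the current match
    }

    const std::regex* regex_ = nullptr;
    const std::string* input_ = nullptr;
    std::string::const_iterator position_{};
    std::smatch match_;
    bool done_ = true;
};
```

A loop of the form `for (toy_match_iterator it(regex, input), end; it != end; ++it)` then visits each match in order, which is the usage pattern the naive `operator!=` is designed for.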