String tokenization#

These are the building blocks of the tokenization mechanism, which chops an input character sequence into grouped units of characters (tokens). The library can be used as follows:

#include <metalchat/text.h>

using namespace metalchat::text;

Byte pair encoder#

template<typename CharT, typename RegularExpression = unicode_regexp<CharT>> class byte_pair_encoder#

A token encoder that splits an arbitrary UTF-8 encoded string into a sequence of tokens that can be used to run inference with a transformer language model. Both the approach and the implementation are inspired by tiktoken.

Constructors of this class require a path to a token map. Such a map is distributed together with the model (for example, with Llama, where it is called tokenizer.model). When the provided file does not exist or has an invalid format, the constructor throws an exception.

Here is an example of a tokenizer model. The first column holds a base64-encoded token, and the second column holds the token identifier (of type byte_pair_encoder::index_type):

4LmM4LiB4Lij 0
zrbOsQ== 1
IOuNlOyasQ== 2
2YjZhNin2Ko= 3
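
To make the file format above concrete, here is a minimal sketch of how such a token map could be read. It is not the library's implementation: the helper names decode_base64 and load_token_map are hypothetical, and the 32-bit identifier type is an assumption standing in for byte_pair_encoder::index_type.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical helper: decodes standard base64 (RFC 4648 alphabet, '=' padding).
std::string decode_base64(const std::string& input)
{
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string output;
    std::uint32_t buffer = 0;
    int bits = 0;

    for (char c : input) {
        if (c == '=') {
            break; // padding marks the end of the payload
        }
        const auto pos = alphabet.find(c);
        if (pos == std::string::npos) {
            throw std::invalid_argument("invalid base64 character");
        }
        // Accumulate 6 bits per character and emit a byte once 8 are available.
        buffer = (buffer << 6) | static_cast<std::uint32_t>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            output.push_back(static_cast<char>((buffer >> bits) & 0xFF));
        }
    }
    return output;
}

// Hypothetical loader: expects one "<base64 token> <id>" pair per line.
// The id type (std::int32_t) is an assumption made for this sketch.
std::unordered_map<std::string, std::int32_t> load_token_map(std::istream& is)
{
    std::unordered_map<std::string, std::int32_t> tokens;
    std::string line;
    while (std::getline(is, line)) {
        if (line.empty()) {
            continue;
        }
        std::istringstream fields(line);
        std::string encoded;
        std::int32_t id = 0;
        if (!(fields >> encoded >> id)) {
            throw std::runtime_error("malformed token map line: " + line);
        }
        tokens.emplace(decode_base64(encoded), id);
    }
    return tokens;
}
```

With this sketch, the line "zrbOsQ== 1" from the example maps the UTF-8 bytes CE B6 CE B1 (the string "ζα") to identifier 1.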

Consider the following basic example:

using namespace metalchat::text;

using Tokenizer = byte_pair_encoder<char>;
using TokenizerTraits = tokenizer_traits<Tokenizer>;

Tokenizer tokenizer("tokenizer.model");
auto tokens = TokenizerTraits::encode(tokenizer, "This is a test sentence.");
auto string = TokenizerTraits::decode(tokenizer, tokens.begin(), tokens.end());

std::cout << string << std::endl;
// output: This is a test sentence.

Public Types

using char_type = CharT #: Type used to indicate position of the token in the model (token dictionary).

Public Functions

byte_pair_encoder(const byte_pair_encoder&) = default#: The byte_pair_encoder copy constructor.

inline byte_pair_encoder(std::basic_istream<CharT> &is, const string_type &token_regex)#

Create an instance of a byte-pair encoder using a base64-encoded token map.

This constructor reads the token map from the specified input stream line by line and decodes the base64-encoded tokens.

Parameters:

is – An input stream containing the tokenizer model.
token_regex – A regular expression to split the input string into tokens.

inline byte_pair_encoder(const std::filesystem::path &p, const string_type &token_regex)#

Create an instance of a byte-pair encoder using a base64-encoded token map.

This constructor lets you specify a custom token regular expression that best fits the target language model.

Parameters:

p – A path to the tokenizer model.
token_regex – A regular expression to split the input string into tokens.

inline byte_pair_encoder(const char *path, const string_type &token_regex)#

Convenience constructor that interprets the path argument as a path to the tokenizer model.

Parameters:

path – A path to the tokenizer model.
token_regex – A regular expression to split the input string into tokens.

inline void insert(const string_type &value, index_type id)#

Insert a new token-pair into the encoder.

Parameters:

value – A string representation of a token.
id – Target encoding of a token (a position in the token embedding).

inline void insert_back(const string_type &value)#

Insert a new token by binding it to the last position (in the token embedding).

Parameters:

value – A string representation of a token.

inline std::size_t size() const#: Returns the number of all available tokens in the encoder.

template<std::output_iterator<index_type> OutputIt> inline OutputIt encode(const string_type &s, OutputIt output) const#

Encode the provided string into tokens.

This method iteratively splits the string into tokens and appends the index of each token to the provided output iterator. When a token is not present in the token dictionary, it is divided into byte pairs, and the index of each byte pair is appended to the output instead.
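
The fallback for unknown tokens can be illustrated with a small sketch. It follows the lowest-rank-first merge strategy used by tiktoken and assumes that a token's identifier doubles as its merge rank; whether byte_pair_encoder works exactly this way is not confirmed by this page. The names encode_piece and vocabulary are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using index_type = std::int32_t; // assumed identifier type for this sketch
using vocabulary = std::unordered_map<std::string, index_type>;

// Hypothetical helper: encodes one pre-split piece of text.
// Throws std::out_of_range if a single byte is missing from the vocabulary.
std::vector<index_type> encode_piece(const vocabulary& vocab, const std::string& piece)
{
    // Fast path: the whole piece is a known token.
    if (auto it = vocab.find(piece); it != vocab.end()) {
        return {it->second};
    }

    // Start from single bytes.
    std::vector<std::string> parts;
    for (char c : piece) {
        parts.emplace_back(1, c);
    }

    // Repeatedly merge the adjacent pair with the lowest rank (smallest id).
    while (parts.size() > 1) {
        std::size_t best = parts.size();
        index_type best_rank = std::numeric_limits<index_type>::max();
        for (std::size_t i = 0; i + 1 < parts.size(); ++i) {
            auto it = vocab.find(parts[i] + parts[i + 1]);
            if (it != vocab.end() && it->second < best_rank) {
                best_rank = it->second;
                best = i;
            }
        }
        if (best == parts.size()) {
            break; // no adjacent pair is a known token
        }
        parts[best] += parts[best + 1];
        parts.erase(parts.begin() + static_cast<std::ptrdiff_t>(best) + 1);
    }

    std::vector<index_type> ids;
    for (const auto& part : parts) {
        ids.push_back(vocab.at(part));
    }
    return ids;
}
```

For a toy vocabulary {a: 0, b: 1, c: 2, ab: 3, bc: 4, abc: 5}, the piece "cab" is not a token on its own, so it is split into bytes and merged into "c" and "ab", which yields the identifiers 2 and 3.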

template<std::output_iterator<string_type> OutputIt> inline OutputIt decode(index_type id, OutputIt output) const#

Decode a single position-encoded token to the string representation.

The method first attempts to find the token in the model token map, then queries the control tokens. If the token is not found, the method throws an exception.
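
The two-step lookup can be sketched as follows. The function decode_token and the reverse_map alias are hypothetical, and std::out_of_range is an assumption: this page does not state which exception type the library throws.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

using index_type = std::int32_t; // assumed identifier type for this sketch
using reverse_map = std::unordered_map<index_type, std::string>;

// Hypothetical helper: looks up regular tokens first, then control tokens.
std::string decode_token(
    const reverse_map& tokens, const reverse_map& control_tokens, index_type id
)
{
    if (auto it = tokens.find(id); it != tokens.end()) {
        return it->second;
    }
    if (auto it = control_tokens.find(id); it != control_tokens.end()) {
        return it->second;
    }
    // Assumed exception type; the library may throw something else.
    throw std::out_of_range("unknown token id: " + std::to_string(id));
}
```

Checking the regular token map first keeps the common case to a single lookup, since control tokens are comparatively rare in decoded output.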

Sentence Piece#

class sentence_piece#

A tokenizer that applies byte-pair encoding directly to Unicode text.

Public Functions

sentence_piece(const sentence_piece&) = default#: The sentence_piece copy constructor.

inline sentence_piece()#: The sentence_piece default constructor.

inline void insert(const string_type &value, index_type id)#

Insert a new token-pair into the encoder.

Parameters:

value – A string representation of a token.
id – Target encoding of a token (a position in the token embedding).

inline void insert_back(const string_type &value)#

Insert a new token by binding it to the last position (in the token embedding).

Parameters:

value – A string representation of a token.

inline std::size_t size() const#: Returns the number of all available tokens in the encoder.

template<std::output_iterator<index_type> OutputIt> inline OutputIt encode(const string_type &s, OutputIt output) const#

Encode the provided string into tokens.

The method replaces all whitespace characters with special Unicode symbols and then encodes the whole sequence using byte-pair encoding.

template<std::output_iterator<string_type> OutputIt> inline OutputIt decode(index_type id, OutputIt output) const#

Decode a single position-encoded token to the string representation.

The method replaces all whitespace-replacement Unicode code points with the code point of a regular white space.
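
The whitespace substitution performed by encode and reversed by decode can be sketched as below. This page does not name the replacement symbol, so the sketch assumes U+2581 (LOWER ONE EIGHTH BLOCK), the marker used by the SentencePiece project, and handles only the ASCII space for brevity. The names space_marker, escape_spaces and unescape_spaces are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumption: U+2581 encoded as UTF-8, as used by the SentencePiece project.
const std::string space_marker = "\xE2\x96\x81";

// Hypothetical helper: replaces every ASCII space with the marker.
std::string escape_spaces(const std::string& text)
{
    std::string out;
    for (char c : text) {
        if (c == ' ') {
            out += space_marker;
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Hypothetical helper: replaces every marker with a regular ASCII space.
std::string unescape_spaces(const std::string& text)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        if (text.compare(pos, space_marker.size(), space_marker) == 0) {
            out.push_back(' ');
            pos += space_marker.size();
        } else {
            out.push_back(text[pos]);
            ++pos;
        }
    }
    return out;
}
```

Because the marker is a multi-byte UTF-8 sequence, the reverse step compares whole byte sequences rather than single characters.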

Regular expression#

The regular expression implementation in the C++ standard library does not support the Perl syntax used in the tiktoken library. This implementation uses the Perl Compatible Regular Expressions (PCRE) library to make Perl syntax available.

class regexp#: Subclassed by metalchat::text::unicode_regexp< char >

class regexp_iterator#

Regular expression iterator.

This iterator provides a convenient interface for accessing match group data. Every match is treated as an element of the backing container, and the iterator returns matches sequentially until the last one.

Public Functions

regexp_iterator &operator++()#: Advance the iterator to the next regular expression match.

value_type operator*()#

Return the current match of the regular expression.

The method throws std::runtime_error when it is called on a terminated iterator.

bool operator!=(const regexp_iterator&)#

Compares two regular expression iterators.

The implementation is intentionally naive for simplicity: it only compares the ends of the iterators.

regexp_iterator()#: Initialize the end-of-match-group iterator.

regexp_iterator(const regexp &regex, const std::string &input)#: Initializes the iterator, stores the address of the regexp in a data member, and finds the first match in the input string to initialize the match group members.
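
The begin/end iteration pattern described for regexp_iterator mirrors the one used by std::sregex_iterator in the standard library. The sketch below uses std::regex purely as a stand-in to show that pattern; it is not the PCRE-backed regexp class documented here, and the helper find_all is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of `re` in `input`, in order.
// std::regex is a stand-in here and does not support Perl syntax.
std::vector<std::string> find_all(const std::regex& re, const std::string& input)
{
    std::vector<std::string> matches;
    const std::sregex_iterator end; // default-constructed iterator marks the end
    for (std::sregex_iterator it(input.begin(), input.end(), re); it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}
```

A default-constructed iterator acts as the end sentinel, so the loop only needs operator!=, operator++ and dereferencing, which are the operations listed above for regexp_iterator.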