MySQL 8.0.40
Source Code Documentation
mb_wc.h File Reference

Definitions of mb_wc (multibyte to wide character, i.e., effectively “parse a UTF-8 character”) functions for UTF-8 (both three- and four-byte).

#include "m_ctype.h"
#include "my_compiler.h"
#include "my_config.h"

Classes

struct  Mb_wc_utf8mb3
 Functor that converts a UTF-8 multibyte sequence (up to three bytes) to a wide character.
 
struct  Mb_wc_utf8mb4
 Functor that converts a UTF-8 multibyte sequence (up to four bytes) to a wide character.
 
class  Mb_wc_through_function_pointer
 Functor that uses a function pointer to convert a multibyte sequence to a wide character.
 

Functions

template<bool RANGE_CHECK, bool SUPPORT_MB4>
static int my_mb_wc_utf8_prototype (my_wc_t *pwc, const uchar *s, const uchar *e)
 
static int my_mb_wc_utf8mb3 (my_wc_t *pwc, const uchar *s, const uchar *e)
 Parses a single UTF-8 character from a byte string.
 
static int my_mb_wc_utf8mb4 (my_wc_t *pwc, const uchar *s, const uchar *e)
 Parses a single UTF-8 character from a byte string.
 
template<bool RANGE_CHECK, bool SUPPORT_MB4>
static ALWAYS_INLINE int my_mb_wc_utf8_prototype (my_wc_t *pwc, const uchar *s, const uchar *e)
 
int my_mb_wc_utf8mb3_thunk (const CHARSET_INFO *cs, my_wc_t *pwc, const uchar *s, const uchar *e)
 A thunk to be able to use my_mb_wc_utf8mb3 in MY_CHARSET_HANDLER structs.
 
int my_mb_wc_utf8mb4_thunk (const CHARSET_INFO *cs, my_wc_t *pwc, const uchar *s, const uchar *e)
 A thunk to be able to use my_mb_wc_utf8mb4 in MY_CHARSET_HANDLER structs.
 

Detailed Description

Definitions of mb_wc (multibyte to wide character, i.e., effectively “parse a UTF-8 character”) functions for UTF-8 (both three- and four-byte).

These are available as inline functions, as C-style thunks so that they can fit into MY_CHARSET_HANDLER, and as functors.

The functors exist so that you can specialize a class on them and get them inlined instead of having to call them through the function pointer in MY_CHARSET_HANDLER. mb_wc is in itself so cheap (the most common case is just a single byte load and a predictable compare) that the call overhead in a tight loop is significant, and these routines tend to take up a lot of CPU time when sorting. Typically, at the outermost level, you'd simply compare cs->cset->mb_wc with my_mb_wc_{utf8mb3,utf8mb4}_thunk and, if it matches, instantiate your function with the corresponding class. If it doesn't match, you can use Mb_wc_through_function_pointer, which calls through the function pointer as usual. (It caches the function pointer for you, which is typically faster than looking it up every time; the compiler cannot always figure out on its own that it doesn't change.)
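The dispatch described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not MySQL code: ToyCharset, toy_mb_wc, toy_mb_wc_thunk, ToyMbWc, ToyMbWcThroughFunctionPointer and count_chars are hypothetical stand-ins for CHARSET_INFO / MY_CHARSET_HANDLER, the inline parser, its thunk and the Mb_wc_* functors, and the toy decoder handles single-byte characters only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; the real types live in m_ctype.h and mb_wc.h.
using wc_t = unsigned long;
using byte_t = unsigned char;
struct ToyCharset;
using mb_wc_fn = int (*)(const ToyCharset *, wc_t *, const byte_t *,
                         const byte_t *);
struct ToyCharset {
  mb_wc_fn mb_wc;  // plays the role of cs->cset->mb_wc
};

// Toy inline decoder: one byte per character, 0 on end of input or failure.
static inline int toy_mb_wc(wc_t *pwc, const byte_t *s, const byte_t *e) {
  if (s >= e || *s >= 0x80) return 0;
  *pwc = *s;
  return 1;
}

// C-style thunk so the decoder fits the function-pointer slot.
int toy_mb_wc_thunk(const ToyCharset *, wc_t *pwc, const byte_t *s,
                    const byte_t *e) {
  return toy_mb_wc(pwc, s, e);
}

// Stateless functor: the call can be inlined into the loop that uses it.
struct ToyMbWc {
  int operator()(wc_t *pwc, const byte_t *s, const byte_t *e) const {
    return toy_mb_wc(pwc, s, e);
  }
};

// Fallback functor: caches the function pointer once, then calls through it.
class ToyMbWcThroughFunctionPointer {
 public:
  explicit ToyMbWcThroughFunctionPointer(const ToyCharset *cs)
      : m_fn(cs->mb_wc), m_cs(cs) {}
  int operator()(wc_t *pwc, const byte_t *s, const byte_t *e) const {
    return m_fn(m_cs, pwc, s, e);
  }

 private:
  mb_wc_fn m_fn;
  const ToyCharset *m_cs;
};

// The hot loop is a template over the functor type, which is passed by value.
template <class Mb_wc>
static std::size_t count_chars(Mb_wc mb_wc, const byte_t *s, const byte_t *e) {
  std::size_t n = 0;
  wc_t wc;
  int len;
  while ((len = mb_wc(&wc, s, e)) > 0) {
    s += len;
    ++n;
  }
  return n;
}

// Outermost level: compare the pointer once, then pick the instantiation.
std::size_t count_chars_dispatch(const ToyCharset *cs, const byte_t *s,
                                 const byte_t *e) {
  if (cs->mb_wc == toy_mb_wc_thunk) return count_chars(ToyMbWc(), s, e);
  return count_chars(ToyMbWcThroughFunctionPointer(cs), s, e);
}
```

The pointer comparison happens once per call to count_chars_dispatch, so its cost is amortized over the whole string, while the per-character call inside the loop is either inlined or goes through the cached pointer.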

The Mb_wc_* classes should be passed by value, not by reference, since they are never larger than two pointers (and usually carry no state at all).
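The size argument can be checked at compile time. The sketch below uses hypothetical stand-in functors (ToyStatelessMbWc and ToyPointerMbWc are not the real Mb_wc_* classes) and assumes a platform where function pointers and object pointers have the same size, which holds on common targets.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins, not the real Mb_wc_* classes.
using wc_t = unsigned long;
using byte_t = unsigned char;
using decode_fn = int (*)(const void *, wc_t *, const byte_t *,
                          const byte_t *);

// Stateless functor: no data members, so there is nothing to copy.
struct ToyStatelessMbWc {
  int operator()(wc_t *pwc, const byte_t *s, const byte_t *e) const {
    if (s >= e) return 0;
    *pwc = *s;  // toy decoder: one byte per character
    return 1;
  }
};

// Pointer-carrying functor: a cached function pointer plus a context pointer.
struct ToyPointerMbWc {
  decode_fn fn;
  const void *ctx;
  int operator()(wc_t *pwc, const byte_t *s, const byte_t *e) const {
    return fn(ctx, pwc, s, e);
  }
};

static_assert(std::is_empty<ToyStatelessMbWc>::value,
              "stateless functor has no data members");
// Assumption: function pointers are as wide as void* on this platform.
static_assert(sizeof(ToyPointerMbWc) <= 2 * sizeof(void *),
              "pointer-carrying functor fits in two pointers");

// Taking the functor by value lets it travel in registers, or vanish
// entirely after inlining, instead of being reached through a reference.
template <class Mb_wc>
int first_char(Mb_wc mb_wc, const byte_t *s, const byte_t *e, wc_t *out) {
  return mb_wc(out, s, e);
}
```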

Function Documentation

◆ my_mb_wc_utf8_prototype() [1/2]

template<bool RANGE_CHECK, bool SUPPORT_MB4>
static int my_mb_wc_utf8_prototype (my_wc_t *pwc, const uchar *s, const uchar *e)
static

◆ my_mb_wc_utf8_prototype() [2/2]

template<bool RANGE_CHECK, bool SUPPORT_MB4>
static ALWAYS_INLINE int my_mb_wc_utf8_prototype (my_wc_t *pwc, const uchar *s, const uchar *e)
static

◆ my_mb_wc_utf8mb3()

static int my_mb_wc_utf8mb3 (my_wc_t *pwc, const uchar *s, const uchar *e)
inline static

Parses a single UTF-8 character from a byte string.

Parameters
[out] pwc  the parsed character, if any
s  the string to read from
e  the end of the string; will not read past this
Returns
the number of bytes read from s, or a value <= 0 for failure (see m_ctype.h)

◆ my_mb_wc_utf8mb3_thunk()

int my_mb_wc_utf8mb3_thunk (const CHARSET_INFO *cs, my_wc_t *pwc, const uchar *s, const uchar *e)

A thunk to be able to use my_mb_wc_utf8mb3 in MY_CHARSET_HANDLER structs.

Parameters
cs  Unused.
[out] pwc  The parsed character, if any.
s  The string to read from.
e  The end of the string; will not read past this.
Returns
The number of bytes read from s, or a value <= 0 for failure (see m_ctype.h).

◆ my_mb_wc_utf8mb4()

static ALWAYS_INLINE int my_mb_wc_utf8mb4 (my_wc_t *pwc, const uchar *s, const uchar *e)
static

Parses a single UTF-8 character from a byte string.

The difference between this and my_mb_wc_utf8mb3 is that this function can also handle four-byte UTF-8 characters.

Parameters
[out] pwc  the parsed character, if any
s  the string to read from
e  the end of the string; will not read past this
Returns
the number of bytes read from s, or a value <= 0 for failure (see m_ctype.h)
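To make the three-byte versus four-byte distinction and the return convention concrete, here is a simplified decoder with the same (pwc, s, e) shape. It is an independent sketch, not the MySQL implementation: toy_utf8_decode and the failure codes kIllegalSequence and kTooShort are hypothetical (the real codes are defined in m_ctype.h), and only the SUPPORT_MB4 template parameter name is borrowed from the prototype listed above.

```cpp
#include <cassert>

using wc_t = unsigned long;
using byte_t = unsigned char;

// Hypothetical failure codes; both satisfy the "<= 0 means failure" contract.
constexpr int kIllegalSequence = 0;
constexpr int kTooShort = -1;

// Decodes one character from [s, e). Returns the number of bytes consumed,
// or a value <= 0 on failure. Never reads at or past e.
template <bool SUPPORT_MB4>
int toy_utf8_decode(wc_t *pwc, const byte_t *s, const byte_t *e) {
  if (s >= e) return kTooShort;
  auto is_cont = [](byte_t b) { return (b & 0xC0) == 0x80; };
  const byte_t c = s[0];
  if (c < 0x80) {  // the common case: a single ASCII byte
    *pwc = c;
    return 1;
  }
  if (c < 0xC2) return kIllegalSequence;  // stray continuation or overlong lead
  if (c < 0xE0) {                         // two-byte sequence
    if (e - s < 2) return kTooShort;
    if (!is_cont(s[1])) return kIllegalSequence;
    *pwc = (static_cast<wc_t>(c & 0x1F) << 6) | (s[1] & 0x3F);
    return 2;
  }
  if (c < 0xF0) {  // three-byte sequence
    if (e - s < 3) return kTooShort;
    if (!is_cont(s[1]) || !is_cont(s[2])) return kIllegalSequence;
    const wc_t wc = (static_cast<wc_t>(c & 0x0F) << 12) |
                    (static_cast<wc_t>(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    // Reject overlong encodings and UTF-16 surrogate code points.
    if (wc < 0x800 || (wc >= 0xD800 && wc <= 0xDFFF)) return kIllegalSequence;
    *pwc = wc;
    return 3;
  }
  if (SUPPORT_MB4 && c < 0xF5) {  // four-byte sequence, mb4 variant only
    if (e - s < 4) return kTooShort;
    if (!is_cont(s[1]) || !is_cont(s[2]) || !is_cont(s[3]))
      return kIllegalSequence;
    const wc_t wc = (static_cast<wc_t>(c & 0x07) << 18) |
                    (static_cast<wc_t>(s[1] & 0x3F) << 12) |
                    (static_cast<wc_t>(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    if (wc < 0x10000 || wc > 0x10FFFF) return kIllegalSequence;
    *pwc = wc;
    return 4;
  }
  return kIllegalSequence;
}
```

With this sketch, the bytes F0 9F 98 80 (U+1F600) decode to a four-byte character when SUPPORT_MB4 is true and are reported as a failure when it is false, while three-byte characters such as E2 82 AC (U+20AC) decode identically in both variants.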

◆ my_mb_wc_utf8mb4_thunk()

int my_mb_wc_utf8mb4_thunk (const CHARSET_INFO *cs, my_wc_t *pwc, const uchar *s, const uchar *e)

A thunk to be able to use my_mb_wc_utf8mb4 in MY_CHARSET_HANDLER structs.

Parameters
cs  Unused.
[out] pwc  The parsed character, if any.
s  The string to read from.
e  The end of the string; will not read past this.
Returns
The number of bytes read from s, or a value <= 0 for failure (see m_ctype.h).