Definitions of mb_wc (multibyte to wide character, i.e., effectively “parse a UTF-8 character”) functions for UTF-8 (both three- and four-byte).

#include "m_ctype.h"
#include "my_compiler.h"
#include "my_config.h"

Classes
struct	Mb_wc_utf8mb3
	Functor that converts a UTF-8 multibyte sequence (up to three bytes) to a wide character.

struct	Mb_wc_utf8mb4
	Functor that converts a UTF-8 multibyte sequence (up to four bytes) to a wide character.

class	Mb_wc_through_function_pointer
	Functor that uses a function pointer to convert a multibyte sequence to a wide character.

Functions
template<bool RANGE_CHECK, bool SUPPORT_MB4>
static int	my_mb_wc_utf8_prototype (my_wc_t *pwc, const uchar *s, const uchar *e)

static int	my_mb_wc_utf8mb3 (my_wc_t *pwc, const uchar *s, const uchar *e)
	Parses a single UTF-8 character from a byte string.

static int	my_mb_wc_utf8mb4 (my_wc_t *pwc, const uchar *s, const uchar *e)
	Parses a single UTF-8 character from a byte string.

template<bool RANGE_CHECK, bool SUPPORT_MB4>
static ALWAYS_INLINE int	my_mb_wc_utf8_prototype (my_wc_t *pwc, const uchar *s, const uchar *e)

int	my_mb_wc_utf8mb3_thunk (const CHARSET_INFO *cs, my_wc_t *pwc, const uchar *s, const uchar *e)
	A thunk to be able to use my_mb_wc_utf8mb3 in MY_CHARSET_HANDLER structs.

int	my_mb_wc_utf8mb4_thunk (const CHARSET_INFO *cs, my_wc_t *pwc, const uchar *s, const uchar *e)
	A thunk to be able to use my_mb_wc_utf8mb4 in MY_CHARSET_HANDLER structs.

Detailed Description

Definitions of mb_wc (multibyte to wide character, i.e., effectively “parse a UTF-8 character”) functions for UTF-8 (both three- and four-byte).

These are available in three forms: as inline functions, as C-style thunks so that they can fit into MY_CHARSET_HANDLER, and as functors.

The functors exist so that you can specialize a class on them and get the conversion inlined, instead of calling it through the function pointer in MY_CHARSET_HANDLER. mb_wc is in itself so cheap (the most common case is a single byte load and a predictable compare) that the call overhead in a tight loop is significant, and these routines tend to take up a lot of CPU time when sorting. Typically, at the outermost level, you would compare cs->cset->mb_wc with my_mb_wc_{utf8mb3,utf8mb4}_thunk and, if one of them matches, instantiate your function with the corresponding functor class. If neither matches, you can use Mb_wc_through_function_pointer, which calls through the function pointer as usual. (It caches the function pointer for you, which is typically faster than looking it up every time; the compiler cannot always figure out on its own that it does not change.)

The Mb_wc_* classes should be passed by value, not by reference, since they are never larger than two pointers (and usually hold no data at all).
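The dispatch pattern described above can be sketched roughly as follows. This is a self-contained illustration, not the MySQL code: DemoCharset, demo_mb_wc, demo_mb_wc_thunk, DemoMbWc, DemoMbWcThroughPointer, count_chars and count_chars_dispatch are all hypothetical stand-ins for CHARSET_INFO, the my_mb_wc_* functions, and the Mb_wc_* classes, and the toy decoder only understands ASCII.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the real types declared in m_ctype.h.
using uchar = unsigned char;
using wc_t = unsigned long;
struct DemoCharset;
using mb_wc_fn = int (*)(const DemoCharset *, wc_t *, const uchar *,
                         const uchar *);
struct DemoCharset {
  mb_wc_fn mb_wc;
};

// Toy inline decoder (ASCII only): returns bytes read, or <= 0 on failure.
static inline int demo_mb_wc(wc_t *pwc, const uchar *s, const uchar *e) {
  if (s >= e || *s >= 0x80) return 0;
  *pwc = *s;
  return 1;
}

// C-style thunk with the handler signature; the charset argument is unused.
int demo_mb_wc_thunk(const DemoCharset *, wc_t *pwc, const uchar *s,
                     const uchar *e) {
  return demo_mb_wc(pwc, s, e);
}

// Stateless functor: the call can be inlined into the template below.
struct DemoMbWc {
  int operator()(wc_t *pwc, const uchar *s, const uchar *e) const {
    return demo_mb_wc(pwc, s, e);
  }
};

// Fallback functor: caches the function pointer and the charset.
class DemoMbWcThroughPointer {
 public:
  explicit DemoMbWcThroughPointer(const DemoCharset *cs)
      : m_fn(cs->mb_wc), m_cs(cs) {}
  int operator()(wc_t *pwc, const uchar *s, const uchar *e) const {
    return m_fn(m_cs, pwc, s, e);
  }

 private:
  mb_wc_fn m_fn;
  const DemoCharset *m_cs;
};

// Two pointers at most, so passing by value is cheap.
static_assert(sizeof(DemoMbWcThroughPointer) <=
                  sizeof(mb_wc_fn) + sizeof(void *),
              "functor should stay small");

// The hot loop is a template on the functor type, taken by value.
template <class Mb_wc>
std::size_t count_chars(Mb_wc mb_wc, const uchar *s, const uchar *e) {
  assert(s <= e);
  std::size_t n = 0;
  wc_t wc;
  for (int len; s < e && (len = mb_wc(&wc, s, e)) > 0; s += len) ++n;
  return n;
}

// Outermost level: compare the function pointer once, then pick a functor.
std::size_t count_chars_dispatch(const DemoCharset *cs, const uchar *s,
                                 const uchar *e) {
  if (cs->mb_wc == demo_mb_wc_thunk) return count_chars(DemoMbWc(), s, e);
  return count_chars(DemoMbWcThroughPointer(cs), s, e);
}
```

The pointer comparison happens once per call to count_chars_dispatch, so the loop body itself contains either an inlinable call or an indirect call through a pointer that is already cached.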

Function Documentation

◆ my_mb_wc_utf8_prototype() [1/2]

template<bool RANGE_CHECK, bool SUPPORT_MB4>

static int my_mb_wc_utf8_prototype	(	my_wc_t *	pwc,
		const uchar *	s,
		const uchar *	e
	)

static

◆ my_mb_wc_utf8_prototype() [2/2]

template<bool RANGE_CHECK, bool SUPPORT_MB4>

static ALWAYS_INLINE int my_mb_wc_utf8_prototype	(	my_wc_t *	pwc,
		const uchar *	s,
		const uchar *	e
	)

static

◆ my_mb_wc_utf8mb3()

static int my_mb_wc_utf8mb3	(	my_wc_t *	pwc,
		const uchar *	s,
		const uchar *	e
	)

inline static

Parses a single UTF-8 character from a byte string.

Parameters

[out]	pwc	the parsed character, if any
	s	the string to read from
	e	the end of the string; will not read past this

Returns: the number of bytes read from s, or a value <= 0 for failure (see m_ctype.h)
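To make the contract concrete, here is a simplified sketch of a decoder with the same shape: it writes the parsed character to pwc, never reads at or past e, and returns the number of bytes consumed or a value <= 0. The function demo_mb_wc_utf8mb3 and the constants kIllegalSequence and kTooShort are hypothetical; the real failure codes are defined in m_ctype.h, and the validation in the real function may differ in detail.

```cpp
#include <cassert>

using uchar = unsigned char;
using wc_t = unsigned long;

// Placeholder failure codes for this sketch only (real ones: m_ctype.h).
constexpr int kIllegalSequence = 0;
constexpr int kTooShort = -1;

static inline int demo_mb_wc_utf8mb3(wc_t *pwc, const uchar *s,
                                     const uchar *e) {
  assert(pwc != nullptr);
  if (s >= e) return kTooShort;

  const uchar c = s[0];
  if (c < 0x80) {  // Common case: one ASCII byte.
    *pwc = c;
    return 1;
  }
  if (c < 0xc2) return kIllegalSequence;  // Stray continuation or overlong.

  if (c < 0xe0) {  // Two-byte sequence.
    if (e - s < 2) return kTooShort;
    if ((s[1] & 0xc0) != 0x80) return kIllegalSequence;
    *pwc = (static_cast<wc_t>(c & 0x1f) << 6) | (s[1] & 0x3f);
    return 2;
  }

  if (c < 0xf0) {  // Three-byte sequence.
    if (e - s < 3) return kTooShort;
    if ((s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80)
      return kIllegalSequence;
    const wc_t wc = (static_cast<wc_t>(c & 0x0f) << 12) |
                    (static_cast<wc_t>(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
    if (wc < 0x800) return kIllegalSequence;  // Overlong encoding.
    *pwc = wc;
    return 3;
  }

  return kIllegalSequence;  // Four-byte lead bytes are not handled here.
}
```

A caller would typically advance s by the returned length while it is positive and stop on the first value <= 0.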

◆ my_mb_wc_utf8mb3_thunk()

int my_mb_wc_utf8mb3_thunk	(	const CHARSET_INFO *	cs,
		my_wc_t *	pwc,
		const uchar *	s,
		const uchar *	e
	)

A thunk to be able to use my_mb_wc_utf8mb3 in MY_CHARSET_HANDLER structs.

Parameters

cs	Unused.
pwc	[output] The parsed character, if any.
s	The string to read from.
e	The end of the string; will not read past this.

Returns: The number of bytes read from s, or a value <= 0 for failure (see m_ctype.h).

◆ my_mb_wc_utf8mb4()

static ALWAYS_INLINE int my_mb_wc_utf8mb4	(	my_wc_t *	pwc,
		const uchar *	s,
		const uchar *	e
	)

static

Parses a single UTF-8 character from a byte string.

The difference between this and my_mb_wc_utf8mb3 is that this function also can handle four-byte UTF-8 characters.

Parameters

[out]	pwc	the parsed character, if any
	s	the string to read from
	e	the end of the string; will not read past this

Returns: the number of bytes read from s, or a value <= 0 for failure (see m_ctype.h)
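The extra case that a four-byte-capable parser has to handle can be sketched as below. This is an illustration of standard UTF-8 four-byte decoding only, not the MySQL implementation; demo_mb_wc_four_byte and its return codes (0 for an illegal sequence, -1 for a truncated one) are hypothetical.

```cpp
#include <cassert>

using uchar = unsigned char;
using wc_t = unsigned long;

// Decodes one four-byte UTF-8 sequence (code points U+10000..U+10FFFF).
// Returns 4 on success, 0 for an illegal sequence, -1 if input is too short.
static inline int demo_mb_wc_four_byte(wc_t *pwc, const uchar *s,
                                       const uchar *e) {
  assert(pwc != nullptr);
  if (s >= e) return -1;
  if ((s[0] & 0xf8) != 0xf0) return 0;  // Not a four-byte lead byte.
  if (e - s < 4) return -1;             // Never read past e.
  if ((s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80 ||
      (s[3] & 0xc0) != 0x80)
    return 0;  // Every trailing byte must look like 10xxxxxx.

  const wc_t wc = (static_cast<wc_t>(s[0] & 0x07) << 18) |
                  (static_cast<wc_t>(s[1] & 0x3f) << 12) |
                  (static_cast<wc_t>(s[2] & 0x3f) << 6) | (s[3] & 0x3f);
  if (wc < 0x10000 || wc > 0x10ffff) return 0;  // Overlong or out of range.
  *pwc = wc;
  return 4;
}
```

For example, the bytes F0 9F 98 80 decode to U+1F600, a character outside the range that a three-byte sequence can represent.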

◆ my_mb_wc_utf8mb4_thunk()

int my_mb_wc_utf8mb4_thunk	(	const CHARSET_INFO *	cs,
		my_wc_t *	pwc,
		const uchar *	s,
		const uchar *	e
	)

A thunk to be able to use my_mb_wc_utf8mb4 in MY_CHARSET_HANDLER structs.

Parameters

cs	Unused.
pwc	[output] The parsed character, if any.
s	The string to read from.
e	The end of the string; will not read past this.

Returns: The number of bytes read from s, or a value <= 0 for failure (see m_ctype.h).
