Introduction to Unicode and UTF-8 - 谢先斌's Blog

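The Unicode Standard is maintained by the Unicode Consortium. It catalogs and encodes most of the world's writing systems so that computers can process and display text using a single, universal character set.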

Introduction

Unicode

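Unicode is an industry standard in information technology; at its core it is a table that assigns a number to every character. Version 15.1, released in September 2023, contains 149,813 characters.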

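Unicode is not itself a byte-level character encoding. It is a character set: a reference table that gives every character a unique number, without prescribing how that number is stored.
The Unicode code space runs from 0x0000 to 0x10FFFF, which gives 1,114,112 possible values. The number assigned to a character is called its code point.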
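As a small illustration of the character/code point mapping, the sketch below uses only the Python built-ins ord and chr. The sample characters are arbitrary choices for the example.

```python
# Map characters to their code points and back using built-ins only.
samples = ["A", "\u00e9", "中", "\U0001F600"]

code_points = [ord(ch) for ch in samples]  # character -> code point (an int)

for ch, cp in zip(samples, code_points):
    assert chr(cp) == ch                   # code point -> character round-trips
    print(f"U+{cp:04X} (decimal {cp})")

# The code space 0x0000..0x10FFFF holds this many values in total.
TOTAL_CODE_POINTS = 0x10FFFF + 1
print(TOTAL_CODE_POINTS)
```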

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that can represent any character in the Unicode Standard.

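UTF-8 is in effect a storage format: an encoding form designed to save space and improve efficiency when text is stored or transmitted.
UTF-8 encodes each character in 1 to 4 bytes (8 bits per byte). The original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to 4 bytes so that it matches the Unicode range, which ends at U+10FFFF.
UTF-8 achieves its variable length through marker bits at the start of each byte. A single-byte character starts with a 0 bit. In a multi-byte sequence, the first byte starts with n consecutive 1 bits followed by a 0 bit, where n is the total number of bytes in the sequence, and every following byte starts with 10.
- Each byte in a UTF-8 byte sequence consists of two parts (reference):
  - marker bits (the most significant bits): a sequence of zero to four 1 bits followed by a 0 bit
  - payload bits, which carry the bits of the code point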
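The marker-bit rule can be sketched with a few bit masks. The function name sequence_length is made up for this example, and the check looks only at the marker bits: it does not reject leading bytes such as 0xC0 or 0xF5 that match a pattern but never occur in valid UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a leading byte's marker bits.

    Raises ValueError for continuation bytes (10xxxxxx) and for bytes
    with five or more leading 1 bits.
    """
    if first_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx
        return 4
    raise ValueError(f"not a leading byte: 0x{first_byte:02x}")


first = "中".encode("utf-8")[0]   # 0xe4 == 0b11100100
print(sequence_length(first))
```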

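Bytes count	Payload bits	Code point range	Byte pattern
1	7	U+0000 ~ U+007F	0xxxxxxx
2	11	U+0080 ~ U+07FF	110xxxxx 10xxxxxx
3	16	U+0800 ~ U+FFFF	1110xxxx 10xxxxxx 10xxxxxx
4	21	U+10000 ~ U+10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5	26	U+200000 ~ U+3FFFFFF	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6	31	U+4000000 ~ U+7FFFFFFF	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 4-byte pattern has room for values up to U+1FFFFF, but Unicode ends at U+10FFFF. The 5-byte and 6-byte forms belong to the original UTF-8 design and are not valid under RFC 3629.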
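To make the table concrete, here is a hand-written encoder for a single code point, limited to the 1 to 4 byte forms. It is a teaching sketch (the name encode_code_point is an assumption of this example); real code should simply call str.encode('utf-8'), which the last lines use as a cross-check.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8 by filling the table's bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against Python's built-in encoder.
for ch in ["A", "\u00e9", "中", "\U0001F600"]:
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
print(encode_code_point(0x4E2D))
```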

Decimal value ranges for each byte pattern:
- Byte 0xxxxxxx: 0 - 127
- Byte 10xxxxxx: 128 - 191
- Byte 110xxxxx: 192 - 223
- Byte 1110xxxx: 224 - 239
- Byte 11110xxx: 240 - 247
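One practical use of these ranges: continuation bytes (128 to 191) never start a character, so counting the bytes outside that range gives the number of characters in valid UTF-8 data. A minimal sketch, assuming the input is already valid UTF-8 (count_characters is a name invented for this example):

```python
def count_characters(data: bytes) -> int:
    """Count characters in valid UTF-8 by skipping continuation bytes."""
    return sum(1 for b in data if not 128 <= b <= 191)


encoded = "Hello, 世界".encode("utf-8")
print(len(encoded), count_characters(encoded))
```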
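Notes on byte counts
- U+0000 ~ U+007F uses 1 byte. These are the decimal values 0~127, 128 characters in total, fully compatible with the ASCII table
  - An ASCII byte has a leading 0 bit and the remaining 7 bits can each be 0 or 1, giving 2^7 = 128 characters. They include [a-zA-Z0-9] and some punctuation; only 95 are printable and the rest are control characters
- U+0080 ~ U+07FF uses 2 bytes, decimal 128~2047. This range mainly covers Latin extensions, Greek, Cyrillic, Armenian, Hebrew, Arabic, and Syriac
- U+0800 ~ U+FFFF uses 3 bytes. Most Chinese characters fall in this range
  - The main CJK Unified Ideographs block spans U+4E00 to U+9FFF, 20,992 code points in total
  - A Chinese character in this block takes 3 bytes; the rarer ideographs in the supplementary extension blocks (from U+20000) take 4
- U+10000 ~ U+10FFFF uses 4 bytes
- The 5-byte (U+200000 ~ U+3FFFFFF) and 6-byte (U+4000000 ~ U+7FFFFFFF) forms of the original design are no longer valid UTF-8, because Unicode ends at U+10FFFF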
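The byte counts above are easy to confirm from Python. The helper utf8_width is a name chosen for this sketch; the last sample, U+20000, is an ideograph outside the U+4E00 to U+9FFF block and shows that not every Chinese character fits in 3 bytes.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in ["a", "\u00e9", "中", "\U00020000"]:
    print(f"U+{ord(ch):04X} -> {utf8_width(ch)} byte(s)")
```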
Concepts
- Encode: convert Unicode code points into bytes that conform to UTF-8
  - encode text to bytes
- Decode: convert a sequence of UTF-8 encoded bytes back into Unicode code points
  - decode bytes to text
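Decoding is the reverse of filling the bit patterns: strip the marker bits and concatenate the payload bits. The sketch below decodes only the first character of a byte string. It is deliberately simplified (it does not reject overlong forms or surrogates), and decode_first is a name made up for the example.

```python
def decode_first(data: bytes):
    """Decode the first code point; return (code_point, bytes_consumed)."""
    b0 = data[0]
    if b0 < 0x80:                      # 0xxxxxxx: the byte is the code point
        return b0, 1
    if b0 & 0xE0 == 0xC0:              # 110xxxxx: 5 payload bits
        length, cp = 2, b0 & 0x1F
    elif b0 & 0xF0 == 0xE0:            # 1110xxxx: 4 payload bits
        length, cp = 3, b0 & 0x0F
    elif b0 & 0xF8 == 0xF0:            # 11110xxx: 3 payload bits
        length, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b & 0xC0 != 0x80:           # every later byte must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)    # append 6 payload bits
    return cp, length


cp, used = decode_first("中文".encode("utf-8"))
print(hex(cp), used)
```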
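Other notes
- The Internet Engineering Task Force (IETF) requires all Internet protocols to support UTF-8
- UTF-16 uses 2 or 4 bytes per character (code points above U+FFFF need a surrogate pair) and UTF-32 always uses 4 bytes. Both can represent every Unicode code point, but they waste space on common characters such as ASCII text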
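The size trade-off between the three encodings can be measured directly. This sketch uses the little-endian codec names so that no byte order mark is added to the counts; encoded_sizes is a helper name assumed for the example.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ["hello", "中文", "\U0001F600"]:
    print(repr(sample), encoded_sizes(sample))
```

ASCII text is smallest in UTF-8, BMP Chinese text is smallest in UTF-16, and a character above U+FFFF takes 4 bytes in all three.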

Extensions

"Escapes" here means escape sequences: ways of writing a character indirectly, such as \xHH or \uXXXX.

utf8 and utf8mb4 in MySQL

MySQL's utf8 is not real UTF-8. It stores at most 3 bytes per character, so characters that need 4 bytes (such as emoji) cause errors. Real UTF-8 must support sequences of up to 4 bytes; in MySQL that is utf8mb4.
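Characters that need 4 bytes in UTF-8 are exactly those with code points above U+FFFF, so an application can check a string before writing it to a 3-byte utf8 column. The sketch below does not talk to MySQL at all; needs_utf8mb4 is a hypothetical helper name.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character needs 4 bytes in UTF-8 (code point > U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("中文"))               # BMP characters, 3 bytes each
print(needs_utf8mb4("ok \U0001F600"))      # contains a 4-byte character
```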

Removing non-UTF-8 characters in Golang

Unicode encoding in Golang

package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// RemoveNonUTF8Chars removes all invalid UTF-8 bytes from the string
func RemoveNonUTF8Chars(s string) string {
	var builder strings.Builder
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		if r == utf8.RuneError && size == 1 {
			// Invalid UTF-8 byte: skip it
			s = s[1:]
		} else {
			// Valid UTF-8 character: append it to the result
			builder.WriteString(s[:size])
			s = s[size:]
		}
	}
	return builder.String()
}

func main() {
	input := "Hello, 世界! \x80\x81\x82" // contains the invalid UTF-8 bytes \x80\x81\x82
	output := RemoveNonUTF8Chars(input)
	fmt.Println(output) // prints: Hello, 世界!
}

Golang json encoding rewrites non-UTF-8 characters

package main

import (
	"encoding/json"
	"fmt"
)

type Bar struct {
	Name string `json:"name"`
}

func main() {
	// Build a byte slice that is not valid UTF-8
	bs := []byte{78, 79, 241}
	fmt.Println(bs)
	fmt.Println(string(bs))

	// Example: a struct field holding invalid UTF-8
	name1 := []byte{0x80, 0x61, 0x62, 0x63}
	bar1 := &Bar{
		Name: string(name1),
	}

	bar1bs, err := json.Marshal(bar1)
	if err != nil {
		panic(err)
	}

	var bar2 Bar
	err = json.Unmarshal(bar1bs, &bar2)
	if err != nil {
		panic(err)
	}

	bar2bs, err := json.Marshal(bar2)
	if err != nil {
		panic(err)
	}

	fmt.Printf("bar1 name: %s, bs: %#v\n", bar1.Name, bar1bs)
	fmt.Printf("bar2 name: %s, bs: %#v\n", bar2.Name, bar2bs)
}

Output:

[78 79 241]
NO�
bar1 name: �abc, bs: []byte{0x7b, 0x22, 0x6e, 0x61, 0x6d, 0x65, 0x22, 0x3a, 0x22, 0x5c, 0x75, 0x66, 0x66, 0x66, 0x64, 0x61, 0x62, 0x63, 0x22, 0x7d}
bar2 name: �abc, bs: []byte{0x7b, 0x22, 0x6e, 0x61, 0x6d, 0x65, 0x22, 0x3a, 0x22, 0xef, 0xbf, 0xbd, 0x61, 0x62, 0x63, 0x22, 0x7d}

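In Go, json.Marshal coerces string values to valid UTF-8 by replacing each invalid byte with the Unicode replacement character U+FFFD. In the first output it appears as the escape \ufffd; after a round trip through json.Unmarshal the string holds a real U+FFFD, which json.Marshal writes as the bytes 0xef 0xbf 0xbd. The original invalid byte is not preserved.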
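For comparison, a rough Python analogue of the same experiment. This is not what Go does internally: a Python str cannot hold invalid bytes, so the replacement has to be requested explicitly with errors='replace' at decode time. The variable names are choices made for this sketch.

```python
import json

raw = bytes([0x80, 0x61, 0x62, 0x63])           # 0x80 is not valid UTF-8 here

# Decoding with errors="replace" substitutes U+FFFD for the bad byte.
name = raw.decode("utf-8", errors="replace")

# By default json.dumps escapes non-ASCII characters as \uXXXX.
escaped = json.dumps({"name": name})
print(escaped)

# With ensure_ascii=False the character is written as-is (3 bytes in UTF-8).
unescaped_bytes = json.dumps({"name": name}, ensure_ascii=False).encode("utf-8")
print(unescaped_bytes)
```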

Python2

Python2 Specific Encodings

python2
>>> s = 'hello'
>>> s
'hello'
>>> [ord(i) for i in s]
[104, 101, 108, 108, 111]

# In UTF-8, each of these Chinese characters takes three bytes
>>> s = '中文'
>>> s
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(s)
6
>>> [i for i in s]
['\xe4', '\xb8', '\xad', '\xe6', '\x96', '\x87']
>>> s.decode('utf-8')  # decode to Unicode
u'\u4e2d\u6587'
>>> s.decode('utf-8').encode('utf-8')  # encode to UTF-8
'\xe4\xb8\xad\xe6\x96\x87'
>>> print(s.decode('utf-8').encode('utf-8'))
中文

# Get the first character
>>> s.decode('utf-8')[0]
u'\u4e2d'
>>> s.decode('utf-8')[0].encode('utf-8')
'\xe4\xb8\xad'
>>> print(s.decode('utf-8')[0].encode('utf-8'))
中

Python3

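The encodings package implements the individual codecs
The codecs module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling lookup process
- Encodings and Unicode: the supported encoding names are also listed there
Python3 Specific Encodings
- raw_unicode_escape: Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.
- unicode_escape: Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.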
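The difference between the two codecs shows up on Latin-1 characters and on backslashes. A short sketch with a sample string chosen for this example (é, 中, then a literal backslash followed by the letter n):

```python
text = "\u00e9中\\n"   # 4 characters: é, 中, backslash, n

escaped = text.encode("unicode_escape")
raw_escaped = text.encode("raw_unicode_escape")

print(escaped)       # é becomes \xe9 as text, and the backslash is doubled
print(raw_escaped)   # é becomes the single Latin-1 byte 0xe9, backslash kept

# unicode_escape round-trips; raw_unicode_escape is ambiguous for backslashes.
assert escaped.decode("unicode_escape") == text
```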
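How Python 3 displays strings
- repr() shows non-printable characters below U+0100 as \xHH escapes, for example the control characters in [0, 31] and [127, 160]; printable characters are shown as themselves
  - The \xHH form has two hex digits, so the largest value it can express is \xff (255)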

# chr: Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.
>>> chr(0)
'\x00'
>>> chr(160)
'\xa0'

# Format characters up to 0x7F as \uXXXX escapes
>>> s = '\x1babcd'
>>> s2 = ['\\u{:04x}'.format(ord(c)) for c in s if ord(c) <= 0x7F]
>>> s2
['\\u001b', '\\u0061', '\\u0062', '\\u0063', '\\u0064']
# Escape only the non-printable control characters (<= 0x1F)
>>> s2 = ['\\u{:04x}'.format(ord(c)) for c in s if ord(c) <= 0x1F]
>>> s2
['\\u001b']

\u is followed by exactly 4 hexadecimal digits (0-9, A-F) and covers code points up to U+FFFF
\U is followed by exactly 8 hexadecimal digits and is needed for code points above U+FFFF

>>> chr(65535)
'\uffff'
>>> chr(65536)
'𐀀'
>>> chr(222222)  # \U 示例
'\U0003640e'

# The following three forms denote the same value
>>> ord('\x80')
128
>>> ord('\u0080')
128
>>> ord('\U00000080')
128
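The choice between \x, \u and \U depends only on how large the code point is. The sketch below always picks the shortest of the three forms; escape_code_point is a hypothetical helper, and unlike repr() it escapes every character, including printable ones.

```python
def escape_code_point(cp: int) -> str:
    """Return the shortest \\x, \\u or \\U escape for a code point."""
    if cp <= 0xFF:
        return f"\\x{cp:02x}"      # 2 hex digits
    if cp <= 0xFFFF:
        return f"\\u{cp:04x}"      # 4 hex digits
    return f"\\U{cp:08x}"          # 8 hex digits


for cp in (0x1B, 0x4E2D, 222222):
    print(escape_code_point(cp))
```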

Comparing English and Chinese text

python3
>>> s = 'hello'
>>> s
'hello'
>>> [ord(i) for i in s]
[104, 101, 108, 108, 111]

>>> s = '中文'
>>> s
'中文'
>>> s[0]
'中'
>>> len(s)
2
>>> s.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> len(s.encode('utf-8'))
6
>>> [i for i in s.encode('utf-8')]
[228, 184, 173, 230, 150, 135]
>>> s.encode('utf-8').decode('utf-8')
'中文'

# Encode in unicode_escape format
>>> s.encode('unicode_escape')
b'\\u4e2d\\u6587'
# Decode
>>> s.encode('unicode_escape').decode('unicode_escape')
'中文'

>>> help(s.encode)
encode(encoding='utf-8', errors='strict') method of builtins.str instance
    Encode the string using the codec registered for encoding.

    encoding
      The encoding in which to encode the string.
    errors
      The error handling scheme to use for encoding errors.
      The default is 'strict' meaning that encoding errors raise a
      UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
      'xmlcharrefreplace' as well as any other name registered with
      codecs.register_error that can handle UnicodeEncodeErrors.

>>> help(s.encode().decode)
decode(encoding='utf-8', errors='strict') method of builtins.bytes instance
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.
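The errors parameter described in the help text decides what happens to data that cannot be converted. The sketch below runs the same invalid input as the Go example (three stray bytes after valid text) through 'strict', 'ignore' and 'replace', and shows 'xmlcharrefreplace' on the encoding side.

```python
# "Hello, 世界! " in UTF-8, followed by three bytes that are not valid UTF-8.
data = b"Hello, \xe4\xb8\x96\xe7\x95\x8c! \x80\x81\x82"

# strict (the default): invalid input raises UnicodeDecodeError.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("utf-8", errors="ignore")     # drop the bad bytes
replaced = data.decode("utf-8", errors="replace")   # U+FFFD per bad byte

# Encoding side: characters that ASCII cannot represent become XML references.
xml_refs = "中文".encode("ascii", errors="xmlcharrefreplace")

print(strict_failed, repr(ignored), repr(replaced), xml_refs)
```

Decoding with errors='ignore' gives the same result as the RemoveNonUTF8Chars function in the Go section.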
