A Guide to Using Regular Expressions in Go


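Author: CKeen (程序员CKeen)
Blog: http://ckeen.cn

Keep doing valuable things over the long term! Accumulate, keep growing, and think at a higher level! I hope to make coding a lifelong hobby 😄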

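I. Introduction to regular expressions in Go

Regular expressions are mainly used to search text and to test whether a string matches a pattern. In a crawler they can extract the data you want from a page; they are also handy for validating input, for example checking whether the mobile number a user registers with is well formed. This article covers the pattern syntax that Go supports and the functions provided by Go's regexp package.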

II. Regular expression syntax in Go

1. Matching syntax

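Syntax	Description
^	Matches the start of the text (or of a line, when the m flag is set)
$	Matches the end of the text (or of a line, when the m flag is set)
.	Matches any single character except \n (including \n when the s flag is set)
\	Escape character
\d	Matches a digit, 0-9
\D	Matches a non-digit
\s	Matches a whitespace character: space, \t, \n, \f, \r
\S	Matches a non-whitespace character
\w	Matches a word character: a-z, A-Z, 0-9, _
\W	Matches a non-word character
\A	Matches the start of the text
\z	Matches the end of the text
\b	Matches at an ASCII word boundary: \w on one side and \W, \A, or \z on the other
\B	Matches where there is no ASCII word boundary
|	Matches either the expression on its left or the one on its right
*	Matches 0 or more, greedy (as many as possible)
+	Matches 1 or more, greedy (as many as possible)
?	Matches 0 or 1, preferring 1
`{n}`	Matches exactly n
`{n,}`	Matches at least n, greedy (as many as possible)
`{n,m}`	Matches at least n and at most m, greedy (as many as possible)
*?	Matches 0 or more, non-greedy (as few as possible)
+?	Matches 1 or more, non-greedy (as few as possible)
??	Matches 0 or 1, non-greedy (preferring 0)
`{n,}?`	Matches at least n, non-greedy (as few as possible)
`{n,m}?`	Matches at least n and at most m, non-greedy (as few as possible)
[...]	Character class: matches any one character in the set
A-Z, a-z	Character range inside a class: any one letter in the range, e.g. [a-z]
0-9	Digit range inside a class: any one digit from 0 to 9, e.g. [0-9]
[^...]	Negated character class: matches any one character that is not in the set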
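
The difference between the greedy and non-greedy quantifiers in the table is easiest to see side by side. The following is a minimal sketch; the helper name firstMatch and the sample strings are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch is a hypothetical helper: it returns the leftmost match
// of pattern in s, or "" if there is none.
func firstMatch(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

func main() {
	s := "<b>one</b><b>two</b>"
	fmt.Println(firstMatch(`<b>.*</b>`, s))  // greedy: <b>one</b><b>two</b>
	fmt.Println(firstMatch(`<b>.*?</b>`, s)) // non-greedy: <b>one</b>

	fmt.Println(firstMatch(`\d{2,4}`, "123456"))  // greedy: 1234
	fmt.Println(firstMatch(`\d{2,4}?`, "123456")) // non-greedy: 12
}
```

Both forms start at the same leftmost position; the quantifier only decides how far the match extends from there.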

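2. Grouping

Syntax	Description
(...)	Capturing group: the parenthesized expression is captured as a group
`(?P<name>xxx)`	Named capturing group: gives the group that matches xxx a name, which SubexpNames returns later
(?:xxx)	Non-capturing group
(?flags:xxx)	Non-capturing group with flags set: the flags apply only to xxx inside the group. The flags are listed in the table below. For example, `foo(?i:d)` matches both food and fooD

Flags:

Flag	Description
`i`	Case-insensitive
`m`	Multi-line mode: ^ and $ also match at the start and end of each line
`s`	Lets . match \n
`U`	Ungreedy: swaps the meaning of x* and x*?, x+ and x+?, and so on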
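
As a rough illustration of the i, s, and m flags, the sketch below checks a few patterns with and without each flag. The helper matches is hypothetical and compiles the pattern on every call, which is acceptable for a demo but not for hot code.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical helper: it reports whether s contains
// a match of pattern.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	fmt.Println(matches(`foo(?i:d)`, "fooD")) // true: only the d is case-insensitive
	fmt.Println(matches(`foo(?i:d)`, "FOOD")) // false: foo must still be lowercase
	fmt.Println(matches(`(?i)food`, "FOOD"))  // true: the flag covers the whole pattern

	fmt.Println(matches(`a.b`, "a\nb"))     // false: . does not match \n by default
	fmt.Println(matches(`(?s)a.b`, "a\nb")) // true

	fmt.Println(matches(`^b$`, "a\nb"))     // false: ^ and $ anchor to the whole text
	fmt.Println(matches(`(?m)^b$`, "a\nb")) // true: ^ and $ anchor to each line
}
```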

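3. Escape sequences

Syntax	Description
\a	Bell character
\f	Form feed
\t	Tab
\n	Newline
\r	Carriage return
\v	Vertical tab
\*	Matches a literal * in the text
\+	Matches a literal + in the text
\?	Matches a literal ? in the text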
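
When the text to search for comes from user input, escaping every metacharacter by hand is error-prone. The regexp package also provides QuoteMeta, which is not covered elsewhere in this article; the sketch below assumes a hypothetical helper containsLiteral built on top of it.

```go
package main

import (
	"fmt"
	"regexp"
)

// containsLiteral is a hypothetical helper: it reports whether text
// contains needle, treating every character of needle as plain text.
func containsLiteral(text, needle string) bool {
	re := regexp.MustCompile(regexp.QuoteMeta(needle))
	return re.MatchString(text)
}

func main() {
	fmt.Println(regexp.QuoteMeta("1+1=2?")) // 1\+1=2\?

	text := "is 1+1=2? yes"
	fmt.Println(containsLiteral(text, "1+1=2?")) // true

	// Without escaping, + and ? act as quantifiers, so the pattern
	// looks for two or more 1s followed by = and does not match.
	fmt.Println(regexp.MustCompile(`1+1=2?`).MatchString(text)) // false
}
```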

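III. Using Go's regexp package

Go's regexp package provides two sets of functions: package-level functions, and methods on the Regexp type. The package-level functions are thin wrappers around the Regexp methods. Here is the source of Match: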

// Match reports whether the byte slice b
// contains any match of the regular expression pattern.
// More complicated queries need to use Compile and the full Regexp interface.
func Match(pattern string, b []byte) (matched bool, err error) {
	re, err := Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.Match(b), nil
}

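As you can see, it calls Compile to obtain a Regexp and then calls that object's Match method. For that reason, the rest of this article covers only the methods of Regexp.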
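
Because the package-level functions compile the pattern on every call, a pattern that is used repeatedly is usually compiled once and reused. The following sketch shows both styles; the names fiveDigits and countFiveDigit are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// fiveDigits is compiled once, at package initialization, and reused.
var fiveDigits = regexp.MustCompile(`^\d{5}$`)

// countFiveDigit is a hypothetical helper: it counts the inputs that
// consist of exactly five digits.
func countFiveDigit(inputs []string) int {
	n := 0
	for _, s := range inputs {
		if fiveDigits.MatchString(s) {
			n++
		}
	}
	return n
}

func main() {
	// Package-level function: convenient for a one-off check,
	// but it compiles the pattern each time it is called.
	ok, err := regexp.MatchString(`^\d{5}$`, "12345")
	fmt.Println(ok, err) // true <nil>

	fmt.Println(countFiveDigit([]string{"12345", "1234", "abcde", "00000"})) // 2
}
```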

1. Creating a Regexp

Go provides two functions for creating one:

func Compile(expr string) (*Regexp, error)
func MustCompile(str string) *Regexp

Compile returns an error for an invalid pattern, while MustCompile panics instead. Here we use Compile to create a Regexp:

re, err := regexp.Compile(`\d{5}`)
if err != nil {
	log.Fatalf("create re failure, err:%v\n", err)
}
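
To see when Compile actually fails, the sketch below feeds it one valid pattern and three invalid ones. Go's syntax follows RE2, so lookahead and backreferences are rejected at compile time. The helper compileErr is hypothetical, and the exact error text is deliberately not shown because it is not part of this article's source.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileErr is a hypothetical helper: it returns the error from
// compiling pattern, or nil if the pattern is valid.
func compileErr(pattern string) error {
	_, err := regexp.Compile(pattern)
	return err
}

func main() {
	patterns := []string{
		`\d{5}`,  // valid
		`(\d{5}`, // invalid: missing closing parenthesis
		`a(?=b)`, // invalid: lookahead is not supported
		`(a)\1`,  // invalid: backreferences are not supported
	}
	for _, p := range patterns {
		if err := compileErr(p); err != nil {
			fmt.Printf("%q is invalid: %v\n", p, err)
		} else {
			fmt.Printf("%q is valid\n", p)
		}
	}
}
```

A common convention is MustCompile for patterns written as constants in the source, where a mistake should fail loudly at startup, and Compile for patterns that arrive at run time.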

2. Matching text with Match

Match tests whether a text matches the pattern. It comes in three variants:

 func (re *Regexp) Match(b []byte) bool
 func (re *Regexp) MatchReader(r io.RuneReader) bool
 func (re *Regexp) MatchString(s string) bool

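The three methods match against a byte slice, a RuneReader, and a string respectively. They return true as soon as any part of the input matches the pattern.

Let's match mobile numbers that start with 139:

re, err := regexp.Compile(`^139\d{8}$`)
if err != nil {
	log.Fatalf("create re failure, err:%v\n", err)
}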

matched1 := re.MatchString("13912345678")
fmt.Println("matched1 success:", matched1)
matched2 := re.MatchString("113912345678")
fmt.Println("matched2 success:", matched2)

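// Output:
// matched1 success: true
// matched2 success: false

Here ^ and $ anchor the pattern to the whole input, which also fixes its length. Without them, the match succeeds whenever the input contains 139 followed by 8 more digits anywhere inside it.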
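
The effect of dropping the anchors can be checked directly. This sketch compares an anchored and an unanchored version of the same pattern and also exercises the byte-slice and RuneReader variants; the function names isMobile139 and containsMobile139 are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	anchored   = regexp.MustCompile(`^139\d{8}$`)
	unanchored = regexp.MustCompile(`139\d{8}`)
)

// isMobile139 reports whether s is exactly 139 followed by 8 digits.
func isMobile139(s string) bool { return anchored.MatchString(s) }

// containsMobile139 reports whether s contains 139 followed by 8 digits.
func containsMobile139(s string) bool { return unanchored.MatchString(s) }

func main() {
	s := "113912345678"
	fmt.Println(isMobile139(s))       // false: the extra leading 1 is rejected
	fmt.Println(containsMobile139(s)) // true: 13912345678 appears inside s

	// The same check against a byte slice and an io.RuneReader.
	fmt.Println(anchored.Match([]byte("13912345678")))                   // true
	fmt.Println(anchored.MatchReader(strings.NewReader("13912345678"))) // true
}
```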

3. Searching text with the Find family

The Find family searches a text for matches. It comes in two flavors: Find returns as soon as the first match is found, and FindAll collects every match before returning. Each flavor has methods that return the matched content and methods that return the start and end indexes of the match.

Find methods:

func (re *Regexp) Find(b []byte) []byte
func (re *Regexp) FindString(s string) string
func (re *Regexp) FindStringIndex(s string) (loc []int)

FindAll methods:

func (re *Regexp) FindAll(b []byte, n int) [][]byte
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
func (re *Regexp) FindAllString(s string, n int) []string
func (re *Regexp) FindAllStringIndex(s string, n int) [][]int

The second parameter n of the FindAll methods is the maximum number of matches to return; n < 0 returns all matches.
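
A small sketch of how n limits the result, and of what the Index variants return. The helper findMobiles and the ASCII sample text are made up for illustration; the indexes are byte offsets into the input.

```go
package main

import (
	"fmt"
	"regexp"
)

var mobile = regexp.MustCompile(`1\d{10}`)

// findMobiles is a hypothetical helper: it returns at most n matches,
// or all of them when n < 0.
func findMobiles(text string, n int) []string {
	return mobile.FindAllString(text, n)
}

func main() {
	text := "a 13800008888 b 15999990000 c 13700001111"

	fmt.Println(findMobiles(text, -1)) // [13800008888 15999990000 13700001111]
	fmt.Println(findMobiles(text, 2))  // [13800008888 15999990000]

	// Start (inclusive) and end (exclusive) byte offsets of the first match.
	fmt.Println(mobile.FindStringIndex(text)) // [2 13]

	// No match yields a nil slice.
	fmt.Println(findMobiles("no numbers here", -1) == nil) // true
}
```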

searchText := "客服电话：13800008888<br/>销售电话：15999990000<br/>QQ:1000000<br/>邮箱1000000@qq.com<br/>地址：xxxx"
re2 := regexp.MustCompile(`1\d{10}`)
mobile := re2.FindString(searchText)
fmt.Println("find mobile:", mobile)

findStrArr := re2.FindAllString(searchText, -1)
fmt.Println("find all string, strings:", findStrArr)

// Output:
// find mobile: 13800008888
// find all string, strings: [13800008888 15999990000]

4. Splitting text with Split

func (re *Regexp) Split(s string, n int) []string

Split cuts the text at every match of the pattern and returns the pieces in between. Example:

searchText := "客服电话：13800008888<br/>销售电话：15999990000<br/>QQ:1000000<br/>邮箱1000000@qq.com<br/>地址：xxxx"
re3 := regexp.MustCompile(`<br/>`)
splitStrArr := re3.Split(searchText, -1)
fmt.Println("split search text, result:", splitStrArr)

// Output:
// split search text, result: [客服电话：13800008888 销售电话：15999990000 QQ:1000000 邮箱1000000@qq.com 地址：xxxx]
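
Split becomes more useful than strings.Split when the separator itself varies. The sketch below splits on a comma or semicolon with optional surrounding whitespace and shows how the n parameter caps the number of pieces; the helper splitFields and the sample input are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// sep matches a comma or semicolon with optional whitespace around it.
var sep = regexp.MustCompile(`\s*[,;]\s*`)

// splitFields is a hypothetical helper: it splits s on sep into at most
// n pieces (all pieces when n < 0).
func splitFields(s string, n int) []string {
	return sep.Split(s, n)
}

func main() {
	s := "a , b;c,  d"
	fmt.Printf("%q\n", splitFields(s, -1)) // ["a" "b" "c" "d"]
	fmt.Printf("%q\n", splitFields(s, 2))  // ["a" "b;c,  d"]: the last piece is the unsplit rest
	fmt.Printf("%q\n", splitFields(s, 0))  // []: n == 0 returns nil
}
```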

5. Replacing text with the Replace family

func (re *Regexp) ReplaceAll(src, repl []byte) []byte
func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
func (re *Regexp) ReplaceAllString(src, repl string) string
func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

These methods replace every match of the pattern in the text and return the result; the input itself is left unchanged. Example:

searchText := "客服电话：13800008888<br/>销售电话：15999990000<br/>QQ:1000000<br/>邮箱1000000@qq.com<br/>地址：xxxx"

re4 := regexp.MustCompile(`@`)
result := re4.ReplaceAllString(searchText, "#")
fmt.Println("replace str, result:", result)

// Output:
// replace str, result: 客服电话：13800008888<br/>销售电话：15999990000<br/>QQ:1000000<br/>邮箱1000000#qq.com<br/>地址：xxxx
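
The variants differ in how they treat the replacement: ReplaceAllString expands $1-style group references, the Literal variants insert the replacement verbatim, and the Func variants compute it per match. The sketch below illustrates each; the helper maskMobiles and the sample strings are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var mobile = regexp.MustCompile(`(\d{3})\d{4}(\d{4})`)

// maskMobiles is a hypothetical helper: it hides the middle four digits
// of every 11-digit run in s. ${1} and ${2} refer to the capturing groups.
func maskMobiles(s string) string {
	return mobile.ReplaceAllString(s, "${1}****${2}")
}

func main() {
	fmt.Println(maskMobiles("tel: 13800008888, 15999990000")) // tel: 138****8888, 159****0000

	at := regexp.MustCompile(`@`)
	// $1 refers to a group that does not exist, so it expands to nothing.
	fmt.Println(at.ReplaceAllString("1000000@qq.com", "$1")) // 1000000qq.com
	// The Literal variant inserts the replacement text as-is.
	fmt.Println(at.ReplaceAllLiteralString("1000000@qq.com", "$1")) // 1000000$1qq.com

	// The Func variant calls a function for each match.
	word := regexp.MustCompile(`[a-z]+`)
	fmt.Println(word.ReplaceAllStringFunc("qq.com", strings.ToUpper)) // QQ.COM
}
```

Writing ${1} rather than $1 avoids surprises when the reference is followed by a letter, digit, or underscore, since $1x would be read as a group named 1x.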

6. Matching subgroups with Submatch

// All matches, each with its submatches
func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int

// First match only, with its submatches
func (re *Regexp) FindStringSubmatch(s string) []string
func (re *Regexp) FindStringSubmatchIndex(s string) []int
func (re *Regexp) FindSubmatch(b []byte) [][]byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int

Grouping splits a match into subgroups according to the parentheses in the pattern, and the Submatch methods return what each subgroup matched. The result is a slice of the form [xxx, subxxx1, subxxx2], where the first element xxx is the full match and subxxx1 and subxxx2 are the subgroup matches. You can also name a group with (?P<name>xxx) and read the names back with SubexpNames(). Example:

searchText := "客服电话：13800008888<br/>销售电话：15999990000<br/>QQ:1000000<br/>邮箱1000000@qq.com<br/>地址：xxxx"
re7 := regexp.MustCompile(`客服电话：(?P<servicePhone>\d{11})<br/>销售电话：(?P<marketPhone>\d{11})<br/>`)
match := re7.FindStringSubmatch(searchText)
groupNames := re7.SubexpNames()
fmt.Printf("%v, %v, %d, %d\n", match, groupNames, len(match), len(groupNames))
// Output:
// [客服电话：13800008888<br/>销售电话：15999990000<br/> 13800008888 15999990000], [ servicePhone marketPhone], 3, 3

re8 := regexp.MustCompile(`.*?(\d{11})<br/>`)
match1 := re8.FindAllStringSubmatch(searchText, -1)
fmt.Printf("%v, %d\n", match1, len(match1))
// Output:
// [[客服电话：13800008888<br/> 13800008888] [销售电话：15999990000<br/> 15999990000]], 2
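
The slices returned by FindStringSubmatch and SubexpNames line up index by index, so they can be zipped into a map keyed by group name. The following is one possible sketch; the helper namedGroups and the ASCII sample input are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var phones = regexp.MustCompile(`service:(?P<servicePhone>\d{11});sales:(?P<marketPhone>\d{11})`)

// namedGroups is a hypothetical helper: it returns the named submatches
// of the first match of re in s, or nil if there is no match.
func namedGroups(re *regexp.Regexp, s string) map[string]string {
	match := re.FindStringSubmatch(s)
	if match == nil {
		return nil
	}
	groups := make(map[string]string)
	for i, name := range re.SubexpNames() {
		// Index 0 is the full match; unnamed groups have an empty name.
		if i != 0 && name != "" {
			groups[name] = match[i]
		}
	}
	return groups
}

func main() {
	g := namedGroups(phones, "service:13800008888;sales:15999990000")
	fmt.Println(g["servicePhone"]) // 13800008888
	fmt.Println(g["marketPhone"])  // 15999990000

	fmt.Println(namedGroups(phones, "no phones") == nil) // true
}
```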

References

https://github.com/google/re2/wiki/Syntax

https://golang.google.cn/pkg/regexp/
