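Regular expression syntax overview (skip if you are already familiar with it)

The Go regular expression syntax can also be displayed by running the following command: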

go doc regexp/syntax

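Single characters

Syntax	Description
.	any character, possibly including '\n' (when flag s = true)
[xyz]	character class (matches any one character in the set)
[^xyz]	negated character class (matches any one character not in the set)
\d	Perl character class (matches any digit)
\D	negated Perl character class (matches any non-digit)
[[:alpha:]]	ASCII character class (matches ASCII letters)
[[:^alpha:]]	negated ASCII character class (matches anything except ASCII letters)
\pN	Unicode character class (one-letter name)
\p{Greek}	Unicode character class
\PN	negated Unicode character class (one-letter name)
\P{Greek}	negated Unicode character class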
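To make the table above concrete, here is a minimal sketch that runs a few of these classes against one sample string. The helper findAll and the sample text are assumptions made for illustration; only the regexp calls come from the standard library.

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll is a hypothetical helper: it compiles pattern and returns
// every non-overlapping match in s.
func findAll(pattern, s string) []string {
	return regexp.MustCompile(pattern).FindAllString(s, -1)
}

func main() {
	s := "Go 1.22, αβγ!"
	fmt.Printf("%q\n", findAll(`\d`, s))           // ["1" "2" "2"]
	fmt.Printf("%q\n", findAll(`[[:alpha:]]+`, s)) // ["Go"] (ASCII letters only)
	fmt.Printf("%q\n", findAll(`\p{Greek}+`, s))   // ["αβγ"]
	fmt.Printf("%q\n", findAll(`\pL+`, s))         // ["Go" "αβγ"] (any Unicode letter)
}
```

Note that [[:alpha:]] stops at ASCII letters, while \pL also covers the Greek letters.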

Composites

Syntax	Description
xy	x followed by y
x|y	x or y (prefer x)

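Repetitions

Syntax	Description
x*	zero or more x, prefer more
x+	one or more x, prefer more
x?	zero or one x, prefer one
x{n,m}	n or n+1 or ... or m x, prefer more
x{n,}	n or more x, prefer more
x{n}	exactly n x
x*?	zero or more x, prefer fewer
x+?	one or more x, prefer fewer
x??	zero or one x, prefer zero
x{n,m}?	n or n+1 or ... or m x, prefer fewer
x{n,}?	n or more x, prefer fewer
x{n}?	exactly n x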
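The difference between "prefer more" and "prefer fewer" is easiest to see side by side. The following is a small sketch; the helper first and the sample inputs are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// first is a hypothetical helper that returns the leftmost match of pattern in s.
func first(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

func main() {
	s := "<b>bold</b>"
	fmt.Printf("%q\n", first(`<.+>`, s))  // "<b>bold</b>" (greedy: prefer more)
	fmt.Printf("%q\n", first(`<.+?>`, s)) // "<b>" (lazy: prefer fewer)

	fmt.Printf("%q\n", first(`a{2,3}`, "aaaa"))  // "aaa"
	fmt.Printf("%q\n", first(`a{2,3}?`, "aaaa")) // "aa"
}
```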

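Grouping

Syntax	Description
(re)	numbered capturing group (submatch)
(?P<name>re)	named and numbered capturing group (submatch)
(?:re)	non-capturing group
(?flags)	set flags within the current group; non-capturing
(?flags:re)	set flags during re; non-capturing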
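As a rough illustration of how the group kinds differ, the sketch below mixes a named group, a non-capturing group, and a plain numbered group in one pattern. The pattern and the variable name dateRe are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe is a hypothetical pattern: a named group, a non-capturing group,
// and a numbered group.
var dateRe = regexp.MustCompile(`(?P<year>\d{4})-(?:\d{2})-(\d{2})`)

func main() {
	m := dateRe.FindStringSubmatch("released 2024-05-17")
	// The non-capturing month group does not appear in the result.
	fmt.Printf("%q\n", m)                    // ["2024-05-17" "2024" "17"]
	fmt.Println(dateRe.NumSubexp())          // 2
	fmt.Printf("%q\n", dateRe.SubexpNames()) // ["" "year" ""]
}
```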

Flags

Flag syntax is xyz (set flags), -xyz (clear flags), or xy-z (set xy and clear z). The flags are:

Flag	Description
i	case-insensitive (default false)
m	multi-line mode: ^ and $ match begin/end of line in addition to begin/end of text (default false)
s	let . match '\n' (default false)
U	ungreedy: swap the meaning of x* and x*?, x+ and x+?, and so on (default false)
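The effect of each flag can be checked by counting matches with and without it. This is a sketch only; the helper count and the inputs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// count is a hypothetical helper: the number of non-overlapping matches of pattern in s.
func count(pattern, s string) int {
	return len(regexp.MustCompile(pattern).FindAllString(s, -1))
}

func main() {
	fmt.Println(count(`go`, "Go GO go"))     // 1
	fmt.Println(count(`(?i)go`, "Go GO go")) // 3 (case-insensitive)

	fmt.Println(count(`a.b`, "a\nb"))     // 0 ('.' does not match '\n' by default)
	fmt.Println(count(`(?s)a.b`, "a\nb")) // 1

	fmt.Println(count(`^\w+$`, "one\ntwo"))     // 0 (^ and $ only match at text boundaries)
	fmt.Println(count(`(?m)^\w+$`, "one\ntwo")) // 2 (one match per line)

	fmt.Println(count(`(?i:a)b`, "Ab AB")) // 1 (the flag applies to the group only)
}
```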

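Empty strings (zero-width assertions)

Syntax	Description
^	at beginning of text, or of line (flag m = true)
$	at end of text (like \z, not Perl's \Z), or of line (flag m = true)
\A	at beginning of text
\b	at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
\B	not at ASCII word boundary
\z	at end of text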
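Since these assertions consume no input, their effect is best seen in the match positions. A minimal sketch follows; the helper indexes and the sample strings are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// indexes is a hypothetical helper that returns the [start end) byte offsets
// of every match of pattern in s.
func indexes(pattern, s string) [][]int {
	return regexp.MustCompile(pattern).FindAllStringIndex(s, -1)
}

func main() {
	s := "cat concat cats"
	fmt.Println(indexes(`\bcat\b`, s)) // [[0 3]] (the standalone word only)
	fmt.Println(indexes(`\Bcat`, s))   // [[7 10]] (the "cat" inside "concat")

	fmt.Println(indexes(`\w+$`, "a\nb"))     // [[2 3]] ($ matches at end of text)
	fmt.Println(indexes(`(?m)\w+$`, "a\nb")) // [[0 1] [2 3]] ($ also matches at end of line)
}
```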

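Escape sequences

Syntax	Description
\a	bell (== \007)
\f	form feed (== \014)
\t	horizontal tab (== \011)
\n	newline (== \012)
\r	carriage return (== \015)
\v	vertical tab (== \013)
\*	literal *, for any punctuation character *
\123	octal character code (up to three digits)
\x7F	hex character code (exactly two digits)
\x{10FFFF}	hex character code
\Q...\E	literal text ... even if ... contains punctuation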

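Character class elements

Syntax	Description
x	single character
A-Z	character range (inclusive)
\d	Perl character class (matches digits 0-9)
[:foo:]	ASCII character class foo
\p{Foo}	Unicode character class Foo
\pF	Unicode character class F (one-letter name)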

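Named character classes as character class elements

Syntax	Description
[\d]	digits (== \d)
[^\d]	not digits (== \D)
[\D]	not digits (== \D)
[^\D]	not not digits (== \d)
[[:name:]]	named ASCII class inside character class (== [:name:])
[^[:name:]]	named ASCII class inside negated character class (== [:^name:])
[\p{Name}]	named Unicode property inside character class (== \p{Name})
[^\p{Name}]	named Unicode property inside negated character class (== \P{Name})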

Perl character classes (all ASCII-only)

Syntax	Description
\d	digits (== [0-9])
\D	not digits (== [^0-9])
\s	whitespace (== [\t\n\f\r ])
\S	not whitespace (== [^\t\n\f\r ])
\w	word characters (== [0-9A-Za-z_])
\W	not word characters (== [^0-9A-Za-z_])
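"ASCII-only" has practical consequences for non-English text. The following sketch contrasts \d and \w with Unicode classes; the helpers and the sample strings (an Arabic-Indic digit and the word "naïve") are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches and first are hypothetical helpers for this example.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func first(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

func main() {
	arabicThree := "\u0663" // ARABIC-INDIC DIGIT THREE
	fmt.Println(matches(`\d`, arabicThree))     // false (\d is [0-9] only)
	fmt.Println(matches(`\p{Nd}`, arabicThree)) // true (Unicode decimal digit)

	word := "na\u00efve" // "naïve"
	fmt.Println(first(`\w+`, word))            // na (stops at the non-ASCII letter)
	fmt.Println(first(`[\pL\p{Nd}_]+`, word)) // naïve
}
```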

ASCII character classes

Syntax	Description
[[:alnum:]]	alphanumeric (== [0-9A-Za-z])
[[:alpha:]]	alphabetic (== [A-Za-z])
[[:ascii:]]	ASCII (== [\x00-\x7F])
[[:cntrl:]]	control (== [\x00-\x1F\x7F])
[[:digit:]]	digits (== [0-9])
[[:graph:]]	graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[[:lower:]]	lower case (== [a-z])
[[:print:]]	printable (== [ -~] == [ [:graph:]])
[[:punct:]]	punctuation (== [!-/:-@[-`{-~])
[[:space:]]	whitespace (== [\t\n\v\f\r ])
[[:upper:]]	upper case (== [A-Z])
[[:word:]]	word characters (== [0-9A-Za-z_])
[[:xdigit:]]	hex digit (== [0-9A-Fa-f])

Regexp

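regexp is the Go standard library package for working with regular expressions. The syntax it accepts is the same general syntax used by Perl, Python, and other languages; more precisely, it is the syntax accepted by RE2 and described at https://golang.org/s/re2syntax, except for \C. The names of the matching methods on Regexp can themselves be described by a regular expression: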

Find(All)?(String)?(Submatch)?(Index)?

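If 'All' is present, the method returns successive non-overlapping matches of the entire expression. These methods take an extra integer argument n: if n >= 0, at most n matches are returned; otherwise all of them are. Without 'All', only the first (leftmost) match is returned.

If 'String' is present, the input is a string and the results are strings; otherwise both are []byte.

If 'Submatch' is present, the result is a slice of submatches: element 0 is the match of the entire expression and element i is the match of the i-th capturing group. Otherwise only the whole match is returned.

If 'Index' is present, matches and submatches are reported as pairs of byte offsets into the input (string or byte slice) instead of as the matched text.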
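The naming rules above can be tried out on a single input. This is a minimal sketch; the pattern emailRe and the sample text are assumptions for illustration, while the method names are those of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailRe is a hypothetical pattern with two capturing groups.
var emailRe = regexp.MustCompile(`(\w+)@(\w+)\.com`)

func main() {
	s := "alice@example.com, bob@test.com"

	fmt.Printf("%q\n", emailRe.FindString(s))         // "alice@example.com"
	fmt.Println(emailRe.FindStringIndex(s))           // [0 17]
	fmt.Printf("%q\n", emailRe.FindStringSubmatch(s)) // ["alice@example.com" "alice" "example"]

	// 'All' variants take n: at most n matches, or all of them when n < 0.
	fmt.Printf("%q\n", emailRe.FindAllString(s, 1))  // ["alice@example.com"]
	fmt.Printf("%q\n", emailRe.FindAllString(s, -1)) // ["alice@example.com" "bob@test.com"]
}
```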

Matching

To check whether some text contains a match for a regular expression, the regexp package provides the following functions:

func Match(pattern string, b []byte) (matched bool, err error)

func MatchReader(pattern string, reader io.RuneReader) (matched bool, err error)

func MatchString(pattern string, s string) (matched bool, err error)

Example

package main

import (
	"regexp"
	"fmt"
)

func checkError(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	pattern := `[0-9]{6,11}`
	tb := []byte("520912345")
	matched, err := regexp.Match(pattern, tb)
	checkError(err)
	fmt.Println(matched) // true
	matched, err = regexp.MatchString(pattern, "1977")
	checkError(err)
	fmt.Println(matched) // false
}

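Precompiling regular expressions

Because one regular expression is often used many times, Go provides compile functions that return a *Regexp. The compiled value can be reused through its methods without parsing the pattern again, and a Regexp is safe for concurrent use by multiple goroutines, except for configuration methods such as Longest().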

func Compile(expr string) (*Regexp, error)

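Compile parses a regular expression and, if successful, returns a Regexp object that can be used to match against text.
When matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses the one that a backtracking search would have found first. This so-called leftmost-first matching is the same semantics that Perl, Python, and other implementations use. POSIX instead specifies leftmost-longest matching, which can be selected by compiling with CompilePOSIX.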

func CompilePOSIX(expr string) (*Regexp, error)

CompilePOSIX is like Compile but restricts the regular expression to POSIX ERE (egrep) syntax and changes the match semantics to leftmost-longest.

func MustCompile(expr string) *Regexp

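MustCompile is like Compile but panics if the expression cannot be parsed. It is commonly used to initialize global *Regexp variables.

func MustCompilePOSIX(expr string) *Regexp

MustCompilePOSIX is like CompilePOSIX but panics if the expression cannot be parsed. It is commonly used to initialize global *Regexp variables.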

Example

package main

import (
	"regexp"
	"fmt"
)

func main() {
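	re := regexp.MustCompile(`a.*?a`)
	text := []byte("aabbaa")
	// leftmost-first: the lazy .*? stops at the first closing 'a'
	found := re.FindAll(text, -1)
	fmt.Printf("%q\n", found) // ["aa" "aa"]
	// leftmost-longest: POSIX ERE has no lazy quantifiers, and the longest match wins
	lre := regexp.MustCompilePOSIX(`a.*?a`)
	found = lre.FindAll(text, -1)
	fmt.Printf("%q\n", found) // ["aabbaa"]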
}

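Note: leftmost-first and leftmost-longest differ only in which match is chosen among those that start at the same leftmost position. Leftmost-longest always takes the longest one, while leftmost-first takes the one a backtracking search would find first, so alternation order and greedy or lazy quantifiers decide the result.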

Finding

regexp provides a convenient family of find methods that locate the parts of a text that match a regular expression.

func (re *Regexp) Find(b []byte) []byte

func (re *Regexp) FindAll(b []byte, n int) [][]byte

func (re *Regexp) FindIndex(b []byte) []int

func (re *Regexp) FindAllIndex(b []byte, n int) [][]int

func (re *Regexp) FindSubmatch(b []byte) [][]byte

func (re *Regexp) FindSubmatchIndex(b []byte) []int

func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte

func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int

func (re *Regexp) FindString(str string) string

func (re *Regexp) FindAllString(str string, n int) []string

func (re *Regexp) FindStringIndex(str string) []int

func (re *Regexp) FindAllStringIndex(str string, n int) [][]int

func (re *Regexp) FindStringSubmatch(str string) []string

func (re *Regexp) FindAllStringSubmatch(str string, n int) [][]string

func (re *Regexp) FindStringSubmatchIndex(str string) []int

func (re *Regexp) FindAllStringSubmatchIndex(str string, n int) [][]int

...

These methods return the matched text, its submatches, and the positions of the matches within the input.

package main

import (
	"regexp"
	"fmt"
)

func main() {
	expr := `(\d{4})-(\d{3})-(\d{4})`
	text := []byte("1231-456-45631245-123-0012")
	re := regexp.MustCompile(expr)
	finded := re.Find(text)
	fmt.Printf("%s\n", finded)
	allfinded := re.FindAll(text, -1)
	fmt.Printf("%s\n", allfinded)
	submatch := re.FindSubmatch(text)
	fmt.Printf("%s\n", submatch)
	allsubmatch := re.FindAllSubmatch(text, -1)
	fmt.Printf("%s\n", allsubmatch)
	index := re.FindIndex(text)
	fmt.Printf("%d\n", index)
	allindex := re.FindAllIndex(text, -1)
	fmt.Printf("%d\n", allindex)
	submatchIndex := re.FindSubmatchIndex(text)
	fmt.Printf("%d\n", submatchIndex)
	allSubmatchIndex := re.FindAllSubmatchIndex(text, -1)
	fmt.Printf("%d\n", allSubmatchIndex)
}

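Whether a piece of text has already been matched is an important notion in regular expression searching: text consumed by one match is not scanned again, so the 'All' methods never return overlapping matches. Expressing overlapping matches in a single pattern is difficult, but the same effect can be achieved by applying the regular expression several times.

The following code shows that the matched part of the text is not scanned again: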

package main

import (
	"fmt"
	"regexp"
)

func main() {
	expr := `\d{3}-\d{4}-\d{3}`
	text := []byte("123-1234-5671-1234-123")
	re := regexp.MustCompile(expr)
	/*
		Actual output: [123-1234-567]
		The result is not [123-1234-567 671-1234-123],
		because "67" was consumed by the first match
	*/
	all := re.FindAll(text, -1)
	fmt.Printf("%s\n", all)
}

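Replacing

A family of methods replaces each match with new text, optionally referring to the capturing groups (both numbered and named) of the match. The source is not modified: the methods build and return a copy with the replacements applied. In the variants without Literal or Func in their name, the replacement is a template interpreted as in Expand: numbered groups are written $0, $1, $2, ... and named groups $name; the braced forms ${0}, ${1}, ${2}, ..., ${name} are also accepted. An unbraced name is taken to be as long as possible, so $10 means ${10} rather than ${1}0, and $1x means ${1x} rather than ${1}x. To insert a literal $, write $$.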

func (re *Regexp) ReplaceAll(src []byte, repl []byte) []byte

func (re *Regexp) ReplaceAllString(src string, repl string) string

func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte

func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

func (re *Regexp) ReplaceAllLiteral(src []byte, repl []byte) []byte

func (re *Regexp) ReplaceAllStringLiteral(src string, repl string) string

...

Example

package main

import (
	"fmt"
	"regexp"
	"bytes"
)

func main() {
	expr := `\b(?P<begin>\w+)\b \b(\w+)\b \b(?P<end>\w+)\b`
	text := []byte("Hello Marco Epsilon")
	re := regexp.MustCompile(expr)
	replace := []byte("$end $2 $begin")
	replaced := re.ReplaceAll(text, replace)
	// Output: Epsilon Marco Hello
	fmt.Printf("%s\n", replaced)
	// pass each whole match to a function that lowercases it; no template expansion
	funced := re.ReplaceAllFunc(text, func(repl []byte) []byte {
		return bytes.ToLower(repl)
	})
	// Output: hello marco epsilon
	fmt.Printf("%s\n", funced)
	// Literal variant: the replacement is inserted verbatim, without template expansion
	literaled := re.ReplaceAllLiteral(text, []byte("$3 $2 $1"))
	// Output: $3 $2 $1
	fmt.Printf("%s\n", literaled)
}
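The template rules for $name are a frequent source of surprises, so here is a short sketch of the main cases. The pattern numRe and the sample strings are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// numRe is a hypothetical pattern with a single numbered group.
var numRe = regexp.MustCompile(`(\d+)`)

func main() {
	// Braces delimit the group name explicitly.
	fmt.Printf("%q\n", numRe.ReplaceAllString("width: 10", "${1}px")) // "width: 10px"
	// Without braces, $1px refers to a group named "1px", which does not
	// exist, so it expands to the empty string.
	fmt.Printf("%q\n", numRe.ReplaceAllString("width: 10", "$1px")) // "width: "
	// $$ produces a literal dollar sign.
	fmt.Printf("%q\n", numRe.ReplaceAllString("cost 5", "$$$1")) // "cost $5"
	// Literal variants do not interpret $ at all.
	fmt.Printf("%q\n", numRe.ReplaceAllLiteralString("cost 5", "$1")) // "cost $1"
	// Func variants receive the matched text and return the replacement.
	bracket := func(m string) string { return "[" + m + "]" }
	fmt.Printf("%q\n", numRe.ReplaceAllStringFunc("cost 5", bracket)) // "cost [5]"
}
```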

Combining

Besides the methods above, there are methods for assembling output from specific groups of a match according to a template:

func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte

func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

package main

import (
	"fmt"
	"regexp"
)

func main() {
	expr := `(?m)(?P<key>\w+):\s(?P<value>\w+)$`
	re := regexp.MustCompile(expr)
	text := []byte(`
		optional1: content1
		optional2: content2
		optional3: content3
	`)
	dst := []byte{}
	template := []byte("$key=$value\n")
	allIndex := re.FindAllSubmatchIndex(text, -1)
	for _, index := range allIndex {
		dst = re.Expand(dst, template, text, index)
	}
	/*
	Output:
		optional1=content1
		optional2=content2
		optional3=content3
	*/
	fmt.Printf("%s", dst)
}

More Functions

func (re *Regexp) LiteralPrefix() (prefix string, completed bool)

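LiteralPrefix returns a literal string that must begin any match of the regular expression, and a boolean that is true if that literal string makes up the entire regular expression.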

package main

import (
	"regexp"
	"fmt"
)

func main() {
	expr := `hello,this is literal,next is not\d{3}`
	re := regexp.MustCompile(expr)
	literal, completed := re.LiteralPrefix()
	// Output:
	// hello,this is literal,next is not
	// false
	fmt.Println(literal)
	fmt.Println(completed)
}

func (re *Regexp) Longest()

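Longest makes future searches prefer the leftmost-longest match instead of the leftmost-first match. Because it modifies the Regexp, it must not be called while other goroutines are using the same Regexp.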

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// leftmost-first mode
	expr := `(?U)a.*b`
	text := []byte("abaaaab")
	re := regexp.MustCompile(expr)
	// Output:
	// [ab aaaab]
	fmt.Printf("%s\n", re.FindAll(text, -1))
	// leftmost-longest mode
	re.Longest()
	// Output:
	// [abaaaab]
	fmt.Printf("%s\n", re.FindAll(text, -1))
}

func (re *Regexp) NumSubexp() int

NumSubexp returns the number of parenthesized subexpressions (capturing groups) in the regular expression.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	expr := `((\d{3})haha\d{2}){4}`
	re := regexp.MustCompile(expr)
	// Output:
	// 2
	fmt.Println(re.NumSubexp())
}

func (re *Regexp) Split(s string, n int) []string

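Split slices s into the substrings that lie between the matches of the regular expression, that is, the parts of s not contained in the result of FindAllString. If the regular expression contains no metacharacters, the result is the same as strings.SplitN. The count n determines how many substrings are returned:

n > 0

At most n substrings are returned; the last one is the unsplit remainder.

n == 0

The result is nil (zero substrings).

n < 0

All substrings are returned.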
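The three cases for n can be checked quickly with a simple separator pattern. This is a sketch; the pattern sepRe and the input are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// sepRe is a hypothetical separator: a comma followed by optional whitespace.
var sepRe = regexp.MustCompile(`,\s*`)

func main() {
	s := "a, b,c"
	fmt.Printf("%q\n", sepRe.Split(s, -1)) // ["a" "b" "c"]
	fmt.Printf("%q\n", sepRe.Split(s, 2))  // ["a" "b,c"]
	fmt.Printf("%q\n", sepRe.Split(s, 1))  // ["a, b,c"]
	fmt.Println(sepRe.Split(s, 0) == nil)  // true
}
```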

package main

import (
	"fmt"
	"regexp"
)

func main() {
	expr := `ab|bc`
	re := regexp.MustCompile(expr)
	text := "cbabcbdebacbcabcbbacb"
	results := re.Split(text, -1)
	for _, value := range results {
		fmt.Println(value)
	}
	fmt.Println("-----------")
	results = re.Split(text, 3)
	for _, value := range results {
		fmt.Println(value)
	}
	// Output:
	/*
	cb
	cbdebac

	cbbacb
	-----------
	cb
	cbdebac
	abcbbacb
	*/
}

func (re *Regexp) String() string

String returns the source text that was used to compile the regular expression.

package main

import (
	"regexp"
	"fmt"
)

func main() {
	expr := `ab|bc\d{2,5}`
	re := regexp.MustCompile(expr)
	fmt.Println(re.String())
	//Output: ab|bc\d{2,5}
}

func (re *Regexp) SubexpNames() []string

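SubexpNames returns a slice m of the names of the parenthesized subexpressions in the regular expression. m[0] is always the empty string: it stands for the whole expression, which cannot be named.

Note: for a subexpression that is not written in the (?P<name>re) form (that is, a group that is only numbered), the corresponding entry is also the empty string, just like the entry for the whole expression, which keeps the result uniform.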

package main

import (
	"regexp"
	"fmt"
)

func main() {
	expr := `(?P<name>[^:]+):\s+(.*),\s+(?P<age>[^:]+):\s+(.*)`
	re := regexp.MustCompile(expr)
	names := re.SubexpNames()
	for _, v := range names {
		fmt.Println(v)
	}
	//Output:
	/*
	
	name

	age

	*/
}