March 25, 2019

Nginx Regular Expressions (Part 1)

WeChat official account: 郑尔多斯


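Preface

Nginx features such as location, server_name, and rewrite make heavy use of regular expressions. Regular expressions make very powerful functionality possible, but they also cause a great deal of confusion when reading the source code. This article concentrates on regular expressions in Nginx, to help us understand Nginx's behavior more thoroughly.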

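Background

Nginx uses PCRE-syntax regular expressions and wraps a few commonly used functions from the PCRE library. By studying these functions, we can thoroughly understand how regular expressions work in Nginx.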

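Compiling a regular expression

Before a regular expression can be used, it must first be compiled into a data structure. That structure is then used for matching and for retrieving other information about the pattern.
PCRE provides two compile functions, pcre_compile() and pcre_compile2(), which behave similarly. Nginx uses the former, so we will look at pcre_compile().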
    
pcre *pcre_compile(
     const char *pattern, 
     int options, 
     const char **errptr, 
     int *erroffset, 
     const unsigned char *tableptr
);

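Parameters:
pattern: the regular expression to compile.
options: option bits that control compilation. Nginx uses only PCRE_CASELESS, which makes matching case-insensitive.
errptr: receives a pointer to a textual error message if compilation fails. If errptr itself is NULL, pcre_compile() returns NULL immediately without compiling.
erroffset: receives the offset in pattern of the character at which the error was detected.
tableptr: a pointer to character tables built by pcre_maketables(), which are used for locale-specific character handling. If it is NULL, PCRE uses its built-in default tables. Nginx passes NULL, so this parameter can be ignored here.

Return value:
A pointer to a pcre structure holding the compiled pattern, or NULL if compilation fails. This pointer can be used to query information about the compiled pattern, and it is also passed to pcre_exec() to perform matching.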
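The errptr and erroffset outputs are easiest to appreciate when they are shown to the user together. The sketch below is a hypothetical helper (format_compile_error is not part of PCRE or Nginx) that places a caret under the offending character of the pattern. It takes the message and offset as plain arguments, so it makes no PCRE calls itself.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the pattern, then a second line with a caret
 * under the character at erroffset, followed by the error message.
 * Returns the value of snprintf(), i.e. the length of the full report. */
int format_compile_error(char *buf, size_t buflen, const char *pattern,
                         const char *errstr, int erroffset)
{
    size_t len = strlen(pattern);

    /* Clamp the offset so the caret never runs past the pattern. */
    if (erroffset < 0 || (size_t) erroffset > len) {
        erroffset = (int) len;
    }

    /* "%*s" with an empty string prints erroffset spaces of padding. */
    return snprintf(buf, buflen, "%s\n%*s^ %s",
                    pattern, erroffset, "", errstr);
}
```

In real code, errstr and erroffset would be the values filled in by a failed pcre_compile() call. Any message text passed in by hand, as in a test, is only a placeholder and not actual PCRE output.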

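Retrieving information about a compiled pattern

The structure returned by compilation carries a lot of information about the pattern, such as details of its capturing groups. The following function retrieves it.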

int pcre_fullinfo(
      const pcre *code, 
      const pcre_extra *extra, 
      int what, 
      void *where
);

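Parameters:
code: the structure returned by pcre_compile() above.
extra: the structure returned by pcre_study(), or NULL if the pattern was not studied.
what: which piece of information to retrieve.
where: points to the variable that receives the result.

Return value:
0 on success, or a negative error code on failure.
Nginx uses this function to retrieve the following information: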

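PCRE_INFO_CAPTURECOUNT: the total number of capturing subpatterns, counting both named and unnamed capturing groups.

PCRE_INFO_NAMECOUNT: the number of named subpatterns only; unnamed subpatterns are not counted.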

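One point deserves explanation: PCRE allows both named capturing groups and unnamed capturing groups (which are identified by number). A name is simply another way to identify a group, and every named capturing group is also assigned a number. PCRE provides convenience functions that fetch a captured substring directly by group name, such as pcre_get_named_substring().
The contents of a named capturing group can also be obtained in two steps:

1. Convert the name of the capturing group to its number.
2. Use that number to retrieve the captured substring.

This involves a name-to-number conversion. PCRE maintains a name-to-number map for each compiled pattern, and we can use it to perform the conversion. The map is described by three items: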

PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE

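The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives the number of entries (that is, the number of named capturing groups), and PCRE_INFO_NAMEENTRYSIZE gives the size of each entry. In both cases the last argument of pcre_fullinfo() is a pointer to an int. The entry size is determined by the longest group name.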

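PCRE_INFO_NAMETABLE returns a pointer to the first entry of the map (a pointer to char). The first two bytes of each entry hold the number of the capturing group, most significant byte first, and the rest of the entry is the group name, terminated by '\0'. Entries are sorted alphabetically by name.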

Here is an example from the official PCRE documentation:

When PCRE_DUPNAMES is set, duplicate names are in order of their parentheses numbers. For example, consider the following pattern (assume PCRE_EXTENDED is set, so white space - including newlines - is ignored):
(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
There are four named subpatterns, so the table has four entries, and each entry in the table is eight bytes long. The table is as follows, with non-printing bytes shows in hexadecimal, and undefined bytes shown as ??:
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
When writing code to extract data from named subpatterns using the name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern.
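To make the entry layout concrete, here is a minimal sketch of the name-to-number step done by hand. The function name name_to_number and the array doc_table are made up for this illustration. doc_table is a hand-written copy of the table from the documentation example above, with the undefined "??" bytes filled with zeros, so the code runs without calling PCRE.

```c
#include <stdio.h>
#include <string.h>

/* Look up a group name in a PCRE name table.
 * table:      first entry of the table (what PCRE_INFO_NAMETABLE yields)
 * count:      number of entries        (PCRE_INFO_NAMECOUNT)
 * entry_size: bytes per entry          (PCRE_INFO_NAMEENTRYSIZE)
 * Returns the group number, or -1 if the name is not in the table. */
int name_to_number(const unsigned char *table, int count, int entry_size,
                   const char *name)
{
    for (int i = 0; i < count; i++) {
        const unsigned char *entry = table + i * entry_size;

        /* Bytes 0 and 1: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];

        /* Bytes 2 onward: the zero-terminated group name. */
        if (strcmp((const char *) entry + 2, name) == 0) {
            return number;
        }
    }
    return -1;
}

/* Hand-written copy of the documentation table: 4 entries of 8 bytes.
 * Padding bytes shown as ?? in the documentation are zero here. */
static const unsigned char doc_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0,
};
```

With this table, name_to_number(doc_table, 4, 8, "month") yields 4. PCRE also ships pcre_get_stringnumber() for the same lookup on a compiled pattern; the loop above only shows what such a lookup has to do with the raw table.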

Example

Here is an example found online; I can no longer find a link to the original:

// gcc pcre_test.c -o pcre_test -L /usr/lib64/ -lpcre
#include <stdio.h>
#include <pcre.h>

int main(void)
{
    pcre        *re;
    const char  *errstr;
    int          erroff;
    int          captures = 0, named_captures = 0, name_size = 0;
    const char  *name;   /* first entry of the name-to-number table */
    const char  *p;
    int          n, i, j;
    const char  *data = "(?<date> (?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d) )";

    printf("%s \n", data);

    /* Compile the pattern; errstr and erroff describe any failure. */
    re = pcre_compile(data, PCRE_CASELESS, &errstr, &erroff, NULL);
    if (re == NULL) {
        printf("compile pcre failed at offset %d: %s\n", erroff, errstr);
        return 1;
    }

    /* Total number of capturing groups, named and unnamed. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_CAPTURECOUNT, &captures);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_CAPTURECOUNT failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf(" captures %d \n", captures);

    /* Number of named capturing groups. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMECOUNT, &named_captures);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMECOUNT failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf("named_captures %d \n", named_captures);

    /* Size in bytes of each entry in the name table. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMEENTRYSIZE, &name_size);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMEENTRYSIZE failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf("name_size %d \n", name_size);

    /* Pointer to the first entry of the name table. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMETABLE, &name);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMETABLE failed %d \n", n);
        pcre_free(re);
        return 1;
    }

    /* Walk the table: two bytes of group number, then the group name. */
    p = name;
    for (j = 0; j < named_captures; j++) {
        for (i = 0; i < 2; i++) {
            printf("%x ", (unsigned char) p[i]);
        }
        printf("%s \n", &p[2]);
        p += name_size;
    }

    pcre_free(re);
    return 0;
}



    

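The output of the program shows that:
There are 5 capturing groups in total.
4 of them are named capturing groups.
Each entry is 8 bytes long. The entry for month is the longest: 2 bytes for the group number, 5 bytes for the name, and 1 byte for the terminating '\0', which adds up to 8.
Every named capturing group is also assigned a number. Named and unnamed groups share a single numbering sequence, ordered by the position of their opening parentheses.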

Matching

We have covered compiling a pattern and retrieving information about it. What remains is the most important part: matching.

int pcre_exec(
    const pcre *code, 
    const pcre_extra *extra,
    const char *subject, 
    int length, 
    int startoffset, 
    int options, 
    int *ovector, 
    int ovecsize
);

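Parameters:
code: the value returned by the compile function.
extra: the value returned by pcre_study(); may be NULL.
subject: the string to match against.
length: the length of subject.
startoffset: the offset in subject at which matching starts.
options: option bits for matching.
ovector: an int array that receives the match offsets.
ovecsize: the number of elements in ovector; it should be a multiple of 3.
Below are some explanations of this function from the PCRE documentation, with my own summary after most excerpts: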

How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured.

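In general, a pattern matches a particular portion of the subject. In addition, parts of the pattern may pick out further substrings of the subject. In other words, if the pattern contains capturing groups, each group may capture part of the subject.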

Captured substrings are returned to the caller via a vector of integer offsets whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes.

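The ovector argument of pcre_exec() receives a series of integer offsets, from which the contents of the capturing groups can be recovered. The number of elements in ovector is given by ovecsize, counted in elements rather than bytes; it must be non-negative and should be a multiple of three.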

The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third of the vector is used as workspace by pcre_exec()while matching capturing subpatterns, and is not available for passing back information. The length passed in ovecsize should always be a multiple of three. If it is not, it is rounded down.

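The first two-thirds of ovector hold the captured substrings (the ones referred to as $1, $2, and so on), with each substring using two integers. The remaining third is used by pcre_exec() as workspace while matching capturing groups and cannot be used to return results.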

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of a pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. The first pair, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.

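After a successful match, each pair of elements, starting from the beginning of ovector and covering at most its first two-thirds, describes one captured substring. The first element of a pair is the offset in subject of the first character of the substring, and the second is the offset of the first character after its end. The first pair, ovector[0] and ovector[1], identifies the portion of subject matched by the entire pattern. The next pair describes the first capturing group, and so on. The return value of pcre_exec() is the highest pair number that was set, plus one. For example, if two substrings were captured, the return value is 3. If the pattern has no capturing groups, a successful match returns 1.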
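The offset pairs translate into substrings with simple pointer arithmetic. The following is a minimal sketch under stated assumptions: copy_capture is a made-up helper name, and the caller supplies the subject, the ovector, and the return value of a successful pcre_exec() call.

```c
#include <stdio.h>
#include <string.h>

/* Copy capture `group` out of `subject` using the offset pairs in `ovector`.
 * `rc` is the value returned by a successful pcre_exec() call.
 * Returns the substring length, or -1 if the group is not available
 * or the buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int group, char *buf, size_t buflen)
{
    if (group < 0 || group >= rc) {
        return -1;                      /* beyond the highest pair set */
    }

    int start = ovector[2 * group];     /* offset of the first character */
    int end = ovector[2 * group + 1];   /* offset just past the last one */

    if (start < 0 || end < start || (size_t) (end - start) >= buflen) {
        return -1;                      /* unset group or buffer too small */
    }

    memcpy(buf, subject + start, (size_t) (end - start));
    buf[end - start] = '\0';
    return end - start;
}
```

As a hand-worked illustration, suppose the subject is "2019-03-25" and a pattern with three capturing groups produced rc = 4 with the offsets {0, 10, 0, 4, 5, 7, 8, 10}. Group 0 is then the whole string and group 2 is "03". PCRE provides pcre_copy_substring() for the same job; the helper above only spells out the arithmetic.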

If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.

If a capturing group matches more than once, the offsets returned are those of the last substring it matched.

If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the function returns a value of zero. In particular, if the substring offsets are not of interest, pcre_exec() may be called with ovector passed as NULL and ovecsize as zero. However, if the pattern contains back references and the ovector is not big enough to remember the related substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector.

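If ovector is too small to hold all the captured substrings, PCRE fills it as far as possible (using at most two-thirds of it) and pcre_exec() returns 0. In particular, if the captured substrings are of no interest, ovector can be passed as NULL and ovecsize as 0.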
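Because a return value of 0 still means the match succeeded, calling code usually has to normalize it before walking the offset pairs. A small sketch of that normalization follows; usable_pairs is a hypothetical name chosen for this illustration.

```c
/* Number of offset pairs that hold valid data after pcre_exec() returns rc.
 * ovecsize is the element count that was passed to pcre_exec(). */
int usable_pairs(int rc, int ovecsize)
{
    if (rc < 0) {
        return 0;               /* no match, or an error occurred */
    }
    if (rc == 0) {
        return ovecsize / 3;    /* vector too small: filled as far as possible */
    }
    return rc;                  /* pairs 0 .. rc-1 were set */
}
```

With an ovector of 30 elements, a return value of 0 means that 10 pairs were filled, namely the whole match plus the first 9 groups, and that any further groups were dropped.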

The pcre_info() function can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3.

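The pcre_info() function can be used to find out how many capturing groups a pattern contains (in practice, pcre_fullinfo() is what is used nowadays). If the pattern has n capturing groups, ovecsize must be at least (n + 1) * 3 to hold the offsets of all n captured substrings in addition to those of the substring matched by the whole pattern.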
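The sizing rule is short enough to write down directly. The sketch below assumes the capture count has already been obtained, for example through PCRE_INFO_CAPTURECOUNT. Both function names are invented for this illustration.

```c
#include <stdlib.h>

/* Smallest ovecsize that holds the whole match plus `captures` groups:
 * (captures + 1) pairs for results, plus one third as workspace. */
int ovector_size(int captures)
{
    return (captures + 1) * 3;
}

/* Allocate a zeroed ovector of that size and report its element count.
 * Returns NULL if the allocation fails; the caller frees the result. */
int *alloc_ovector(int captures, int *ovecsize)
{
    *ovecsize = ovector_size(captures);
    return calloc((size_t) *ovecsize, sizeof(int));
}
```

For the date pattern used earlier, which has 5 capturing groups, this gives an ovector of 18 ints: 12 for the six offset pairs and 6 for workspace.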

It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this happens, both values in the offset pairs corresponding to unused subpatterns are set to -1.

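For example, if "abc" is matched against (a|(z))(bc), pcre_exec() returns 4. The first and third capturing groups match, but the second does not, so both values in the offset pair for the second group are set to -1.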

Offset values that correspond to unused subpatterns at the end of the expression are also set to -1. For example, if the string "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The return from the function is 2, because the highest used capturing subpattern number is 1. However, you can refer to the offsets for the second and third capturing subpatterns if you wish (assuming the vector is large enough, of course).
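Since unset groups are marked with -1 whether they fall in the middle or at the end of the pattern, a single check covers both cases. The sketch below uses a hypothetical helper, group_is_set, and two hand-written ovectors that follow the two documentation examples above. They are not captured from a real pcre_exec() run.

```c
#include <stdio.h>

/* Returns 1 if `group` took part in the match, 0 otherwise.
 * `ovecsize` is the element count that was passed to pcre_exec(). */
int group_is_set(const int *ovector, int ovecsize, int group)
{
    if (group < 0 || group >= ovecsize / 3) {
        return 0;                       /* no offset pair exists for it */
    }
    return ovector[2 * group] >= 0;     /* unset pairs hold -1, -1 */
}

/* "abc" against (a|(z))(bc): return value 4, group 2 unset. */
static const int ov_first[12] = {0, 3, 0, 1, -1, -1, 1, 3};

/* "abc" against (abc)(x(yz)?)?: return value 2, groups 2 and 3 unset. */
static const int ov_second[12] = {0, 3, 0, 3, -1, -1, -1, -1};
```

Checking the offsets instead of comparing the group number with the return value handles the first example correctly: there, group 2 is below the return value of 4 and yet did not match.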

References

PCRE API documentation: http://regexkit.sourceforge.net/Documentation/pcre/pcreapi.html#SEC1
Microsoft documentation on regular expression usage: https://docs.microsoft.com/zh-cn/dotnet/standard/base-types/anchors-in-regular-expressions
