Chapter 16. Fancier Regular Expressions

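Notice

This translation of the chapter is intended only for learning and researching Raku; please support the ebook or print edition.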

You won’t see all the rest of the regular expression syntax in this chapter, but you’ll see the syntax you’ll use the most. There’s much more to patterns, but this should get you most of the way through common problems. With grammars (Chapter 17), the power of even simple patterns will become apparent.

Quantifiers

Quantifiers allow you to repeat a part of a pattern. Perhaps you want to match several of the same letter in a row—an a followed by one or more b’s then another a. You don’t care how many b’s there are as long as there’s at least one of them. The + quantifier matches the immediately preceding part of the pattern one or more times:

my @strings = < Aa Aba Abba Abbba Ababa >;
for @strings {
    put $_, ' ', m/ :i ab+ a / ?? 'Matched!' !! 'Missed!';
}

The first Str here doesn’t match because there isn’t at least one b. All of the others have an a followed by one or more b’s and another a:

Aa Missed!
Aba Matched!
Abba Matched!
Abbba Matched!
Ababa Matched!

A quantifier only applies to the part of the pattern immediately in front of it—that’s the b, not the ab. Group the ab and apply the quantifier to the group (which counts as one thingy):

my @strings = < Aa Aba Abba Abbba Ababa >;
for @strings {
    put $_, ' ', m/ :i [ab]+ a / ?? 'Matched!' !! 'Missed!';
}

Now different Strs match. The ones with repeated b’s don’t match because the quantifier applies to the [ab] group. Only two of the Strs have one or more ab’s followed by an a:

Aa Missed!
Aba Matched!
Abba Missed!
Abbba Missed!
Ababa Matched!
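
To see what each form actually consumes, this minimal sketch stringifies the Match for the same text under both patterns. The variable names are only illustrative:

```raku
my $text = 'Ababa';

my $letter-only = $text ~~ m/ :i ab+ /;     # + repeats only the b
my $whole-group = $text ~~ m/ :i [ab]+ /;   # + repeats the ab group

put "ab+ matched: $letter-only";     # ab+ matched: Ab
put "[ab]+ matched: $whole-group";   # [ab]+ matched: Abab
```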

EXERCISE 16.1 Using butterfly_census.txt (the file you made at the end of Chapter 9), use a regex to count the number of distinct butterfly species whose names have two or more consecutive i’s. Use the + quantifier in your pattern.

Zero or More

The * quantifier is like + but matches zero or more times. This makes that part of the pattern optional. If it matches it can repeat as many times as it likes. Perhaps you want to allow the letter a between b’s. The a’s can be there or not be there:

my @strings = < Aba Abba Abbba Ababa >;
for @strings {
    put $_, ' ', m/ :i ba*b / ?? 'Matched!' !! 'Missed!';
}

The Strs with consecutive b’s match because they have zero a’s between the b’s, but the Str with bab also matches because it has zero or more a’s between them:

Aba Missed!
Abba Matched!
Abbba Matched!
Ababa Matched!

EXERCISE 16.2 Adapt your solution from the previous exercise to find the butterfly species names that have consecutive a’s that may be separated by either n or s.

Greediness

The + and * quantifiers are greedy; they match as much of the text as they can. Sometimes that’s too much. Change the earlier example to match another b after the quantifier. Now there must be at least two b’s in a row:

my @strings = < Aba Abba Abbba Ababa >;
for @strings {
    put $_, ' ', m/ :i ab+ ba / ?? 'Matched!' !! 'Missed!';
}

The first Str doesn’t match because it doesn’t have one or more b’s followed by another b. It’s the same for the last Str. The middle two Strs have enough b’s to satisfy both parts of the pattern:

Aba Missed!
Abba Matched!
Abbba Matched!
Ababa Missed!

But think about how this works inside the matcher. When it sees the b+ it matches as many b’s as it can. In Abbba, the b+ starts by matching bbb. The b+ part of the pattern is satisfied. The matcher moves on to the next part of the pattern, which is another b. The text doesn’t have any leftover b’s to satisfy that part because the greedy quantifier matched them all.

The match doesn’t fail because of another tactic the matcher can use: it can backtrack on the quantifier that just matched to force it to give up some of the text. The b+ needs one or more b’s. Whether it matched two or three doesn’t matter, because either satisfies that. Backing up one position in the text leaves a b for the next part to match. Once it backs up it tries the next part of the pattern.
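
One way to observe the give-back is to capture the quantified part. In this small sketch (the capturing parentheses are added only for illustration), b+ first takes all three b’s and then releases one so the following b can match:

```raku
my $match = 'Abbba' ~~ m/ :i a (b+) ba /;

put "Whole match: $match";            # Whole match: Abbba
put "b+ ended up with: $match[0]";    # b+ ended up with: bb
```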

Zero or One

The ? quantifier matches zero times or once; it makes the preceding part of the pattern optional. In this pattern you can have one or two b’s because you used ? to make one of them optional:

my @strings = < Aba Abba Abbba Ababa >;
for @strings {
    put $_, ' ', m/ :i ab? ba / ?? 'Matched!' !! 'Missed!';
}

Now the first Str can match because the first b can match zero times. The third Str can’t match because there is more than one b and the ? can’t match more than one of them:

Aba Matched!
Abba Matched!
Abbba Missed!
Ababa Matched!

Minimal and Maximal

If you want to match an exact number of times use **. With a single number after it the ** matches exactly that number of times. This matches exactly three b’s:

my @strings = < Aba Abba Abbba Ababa >;
for @strings {
    put $_, ' ', m/ :i ab**3 a / ?? 'Matched!' !! 'Missed!';
}

There’s only one Str that matches:

Aba Missed!
Abba Missed!
Abbba Matched!
Ababa Missed!

You can use a range after the **. The quantified part must match at least the range minimum and will only match as many repetitions as the range maximum:

my @strings = < Aba Abba Abbba Ababa Abbbba >;
for @strings {
    put $_, ' ', m/ :i a b**2..3 a / ?? 'Matched!' !! 'Missed!';
}

Two Strs match—the ones with two or three consecutive b’s:

Aba Missed!
Abba Matched!
Abbba Matched!
Ababa Missed!
Abbbba Missed!
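
The range limits only the quantified part. Abbbba misses above because the pattern also demands an a on each side of the b’s. As a sketch with illustrative variable names, dropping those neighbors lets the range match part of the longer run:

```raku
my $run = 'Abbbba';   # four consecutive b's

my $unbounded = $run ~~ m/ b ** 2..3 /;          # nothing constrains the neighbors
my $bounded   = $run ~~ m/ :i a b ** 2..3 a /;   # an a must sit on each side

put $unbounded ?? "Unbounded matched: $unbounded" !! 'Unbounded missed';
put $bounded   ?? "Bounded matched: $bounded"     !! 'Bounded missed';
# Unbounded matched: bbb
# Bounded missed
```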

An exclusive range works too. Match two or three times by excluding the 1 and 4 endpoints to get the same output:

my @strings = < Aba Abba Abbba Ababa >;
for @strings {
    put $_, ' ', m/ :i ab**1^..^4 a / ?? 'Matched!' !! 'Missed!';
}

EXERCISE 16.3 Output all the lines from the butterfly census file that have four vowels in a row.

EXERCISE 16.4 Output all the lines from the butterfly census file that have exactly four repetitions of an a followed by a nonvowel (such as in Paralasa).

Controlling Quantifiers

Adding a ? after any quantifier makes it match as little as possible—the greedy quantifiers become nongreedy. The modified quantifier stops matching when the next part of the pattern can match.

These two patterns look for an H, some stuff, and then an s. The first one is greedy and matches all the way to the final s. The second one is nongreedy and stops at the first s it encounters. The greedy case matches the entire text but the nongreedy case matches only the first word:

$_ = 'Hamadryas perlicus';

say "Greedy: ",    m/ H .*  s /;  # Greedy: 「Hamadryas perlicus」
say "Nongreedy: ", m/ H .*? s /;  # Nongreedy: 「Hamadryas」

You’ll probably find that you often want to make the quantifiers nongreedy.
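
The same contrast appears with any pair of delimiters. This sketch uses made-up text with square brackets; the names are illustrative only:

```raku
my $text = '[red] and [blue]';

my $greedy    = $text ~~ m/ '[' .+  ']' /;   # runs to the last closing bracket
my $nongreedy = $text ~~ m/ '[' .+? ']' /;   # stops at the first closing bracket

put "Greedy: $greedy";         # Greedy: [red] and [blue]
put "Nongreedy: $nongreedy";   # Nongreedy: [red]
```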

EXERCISE 16.5 Output all the text in the input that appears between underscores. The Butterflies_and_Moths.txt file has some interesting nongreedy matches.

Turning Off Backtracking

The : modifier lets you turn off backtracking by preventing a quantifier from unmatching what it has already matched. In both of these patterns the .+ can match everything to the end of the Str. The first one has to unmatch some of that to allow the rest of the pattern to match. The second one uses .+:, which means it can’t give back any of the text to allow the first s to match, so that match fails:

$_ = 'Hamadryas perlicus';
say "Backtracking: ",
    m/ H .+  s \s perlicus/;  # Backtracking: 「Hamadryas perlicus」
say "Nonbacktracking: ",
    m/ H .+: s \s perlicus/;  # Nonbacktracking: Nil

The : can go immediately after the **. Each tries to match groups of three characters with a def at the end. The first one matches the entire Str because it’s greedy, but then backs up enough to allow def to match. The second one uses **:, so it refuses to unmatch the def and the pattern fails:

$_ = 'abcabcabcdef';
say "Backtracking: ",
    m/ [ ... ] **  3..4 def /;  # 「abcabcabcdef」
say "Nonbacktracking: ",
     m/ [ ... ] **: 3..4 def /;  # Nil

Table 16-1 summarizes the behavior of the different types of quantifiers.

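Quantifier    Example      Meaning
?             b?           Zero or one b
*             b*           Zero or more b’s
+             b+           One or more b’s
** N          b ** 4       Exactly four b’s
** M..N       b ** 2..4    Two to four b’s
** M^..^N     b ** 1^..^5  Two to four b’s, using an exclusive range
??            b??          Zero or one b, nongreedy (prefers zero; an uncommon case)
*?            b*?          Zero or more b’s, nongreedy
+?            b+?          One or more b’s, nongreedy
?:            b?:          Zero or one b, no backtracking
*:            b*:          Zero or more b’s, greedy, no backtracking
+:            b+:          One or more b’s, greedy, no backtracking
**: M..N      b **: 2..4   Two to four b’s, greedy, no backtracking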

Captures

When you group with parentheses instead of square brackets you capture parts of the text:

say 'Hamadryas perlicus' ~~ / (\w+) \s+ (\w+) /;

In the .gist output you see the captures labeled with whole numbers starting from zero. The captures are numbered by their position in their subpattern from left to right:

「Hamadryas perlicus」
 0 => 「Hamadryas」
 1 => 「perlicus」

You can access the captures with postcircumfix indices (but only if the match succeeds). The Match looks like a Positional here but isn’t one; that’s a distinction you don’t need to worry about yet. The output shows the same captures you saw before:

my $match = 'Hamadryas perlicus' ~~ / (\w+) \s+ (\w+) /;

if $match {
    put "Genus: $match[0]";   # Genus: Hamadryas
    put "Species: $match[1]"; # Species: perlicus
}

The special variable $/ already stores the result of the last successful match. You can access elements in it directly:

$_ = 'Hamadryas perlicus';
if / (\w+) \s+ (\w+) / {
    put "Genus: $/[0]";    # Genus: Hamadryas
    put "Species: $/[1]";  # Species: perlicus
};

It gets better. There’s a shorthand to access the captures in $/. The number variables $0 and $1 are actually $/[0] and $/[1] (and this is true for as many captures as you create):

$_ = 'Hamadryas perlicus';
if / (\w+) \s+ (\w+) / {
    put "Genus: $0";   # Genus: Hamadryas
    put "Species: $1"; # Species: perlicus
};

If a match fails then $/ is empty and you don’t see the values from the previous successful match. An unsuccessful match resets $/ to nothing:

my $string = 'Hamadryas perlicus';

my $first-match = $string ~~ m/(perl)(.*)/;
put "0: $0 | 1: $1";  # 0: perl | 1: icus

my $second-match = $string ~~ m/(ruby)(.*)/;
put "0: $0 | 1: $1";  # 0:  | 1: -- nothing in these variables
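
Because a failed match leaves the capture variables empty, a common precaution is to read them only inside a conditional on the match itself. A minimal sketch, where the @report array exists only for illustration:

```raku
my $string = 'Hamadryas perlicus';
my @report;

for 'perl', 'ruby' -> $word {
    # $word is interpolated into the pattern as literal text
    if $string ~~ m/ ($word) (.*) / {
        @report.push: "$word: followed by '$1'";
    }
    else {
        @report.push: "$word: no match, so no captures to read";
    }
}

.put for @report;
# perl: followed by 'icus'
# ruby: no match, so no captures to read
```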

Named Captures

Instead of relying on the numbered captures, you can give them names. These become keys in a Hash in the Match object. Label a capture with a $<LABEL>= in front of the capturing parentheses:

$_ = 'Hamadryas perlicus';
if / $<genus>=(\w+) \s+ $<species>=(\w+) / {
    put "Genus: $/<genus>";      # Genus: Hamadryas
    put "Species: $/<species>";  # Species: perlicus
};

The output is often much easier to understand when you label the captures. It’s also easier to modify the pattern without disrupting later code, since the positions of labels don’t matter.

As before, you can leave off the slash in $/ but only if you use the angle brackets. This looks like Associative indexing even though the Match isn’t an Associative type:

$_ = 'Hamadryas perlicus';
if / $<genus>=(\w+) \s+ $<species>=(\w+) / {
    put "Genus: $<genus>";      # Genus: Hamadryas
    put "Species: $<species>";  # Species: perlicus
};

A label name in a variable works, but in that case you can’t leave off the /:

$_ = 'Hamadryas perlicus';
my $genus-key = 'genus';
my $species-key = 'species';
if / $<genus>=(\w+) \s+ $<species>=(\w+) / {
    put "Genus: $/{$genus-key}";      # Genus: Hamadryas
    put "Species: $/{$species-key}";  # Species: perlicus
};

If you save the result the names are in your Match in the same way they show up in $/:

my $string = 'Hamadryas perlicus';
my $match = $string ~~ m/ $<genus>=(\w+) \s+ $<species>=(\w+) /;

if $match {
    put "Genus: $match<genus>";       # Genus: Hamadryas
    put "Species: $match<species>";   # Species: perlicus
};

You don’t even need to know the names because you can get those from the Match. Calling .pairs returns all the names:

my $string = 'Hamadryas perlicus';
my $match = $string ~~ m/ $<genus>=(\w+) \s+ $<species>=(\w+) /;

put "Keys are:\n\t",
    $match
        .pairs
        .map( { "{.key}: {.value}" } )
        .join( "\n\t" );

The put shows everything without knowing the names in advance:

Keys are:
    species: perlicus
    genus: Hamadryas
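
The order of the pairs is probably not something to rely on (the output above lists species first). If a stable order matters, one option is to sort by key. This sketch assumes the .hash method, which returns the named captures of a Match:

```raku
my $string = 'Hamadryas perlicus';
my $match  = $string ~~ m/ $<genus>=(\w+) \s+ $<species>=(\w+) /;

# Sort the named captures by key before formatting them
my @lines = $match.hash.sort( *.key ).map( { "{.key}: {.value}" } );

.put for @lines;
# genus: Hamadryas
# species: perlicus
```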

When patterns get too complex (say, something that you have to spread over multiple lines) the numbered Match variables will probably proliferate beyond your ability to track them. Names do a much better job of reminding you which capture contains what.

A Capture Tree

Inside capture parentheses you can have additional capture parentheses. Each group gets its own numbering inside the group that contains it:

my $string = 'Hamadryas perlicus';
say $string ~~ m/(perl (<[a..z]>+))/;

The output shows that there are two $0s and one of them is subordinate to the other. The captures are nested so the results are nested:

「perlicus」
 0 => 「perlicus」
  0 => 「icus」

To access the top-level match, use $/[0] or $0. To get the nested matches you access the next level with the appropriate subscript:

my $string = 'Hamadryas perlicus';
$string ~~ m/(perl (<[a..z]>+))/;

# explicit $/
say "Top match: $/[0]";       # Top match: perlicus
say "Inner match: $/[0][0]";  # Inner match: icus

# or skip the $/
say "Top match: $0";          # Top match: perlicus
say "Inner match: $0[0]";     # Inner match: icus

This works for named captures in the same way. The outer captures include the inner text as well as the inner captures:

my $string = 'Hamadryas perlicus';
$string ~~ m/
    $<top> = (perl
        $<inner> = (<[a..z]>+)
        )
    /;

# explicit $/
say "Top match: $/<top>";           # Top match: perlicus
say "Inner match: $/<top><inner>";  # Inner match: icus

# or skip the $/
say "Top match: $<top>";            # Top match: perlicus
say "Inner match: $<top><inner>";   # Inner match: icus

It’s not one or the other. You can mix number variables and labels if that makes sense:

my $string = 'Hamadryas perlicus';
$string ~~ m/
    ( perl $<inner> = (<[a..z]>+) )
    /;

# explicit $/
say "Top match: $/[0]";           # Top match: perlicus
say "Inner match: $/[0]<inner>";  # Inner match: icus

# or skip the $/
say "Top match: $0";            # Top match: perlicus
say "Inner match: $0<inner>";   # Inner match: icus

This nesting makes it very easy to construct your pattern. The numbering is localized to the level you are in. If you add other captures to the pattern they only disturb their level.

EXERCISE 16.6 Extract from the Butterflies_and_Moths.txt file all the scientific names between underscores (such as _Crocallis elinguaria_). Capture the genus and species separately. Which genus has the most species?

Backreferences

The result of a capture is available inside your patterns. You can use that to match something else in the same pattern. Use the Match variables to refer to the part that you want:

my $line = 'abba';
say $line ~~ / a (.) $0 a  /;

The output shows the entire match and the capture:

「abba」
 0 => 「b」

Refer to captures at the same level with the number variables. The $0 and $1 are backreferences to parts of the pattern that have already matched:

my $line = 'abccba';
say $line ~~ / a (.)(.) $1 $0 a  /;

There are only two captures in the output:

「abccba」
 0 => 「b」
 1 => 「c」
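
A typical use of a backreference is finding repeated text without knowing in advance what it is. As a small sketch (the sample word is arbitrary), this looks for any doubled word character:

```raku
my $match = 'Vanessa' ~~ m/ (\w) $0 /;

if $match {
    put "Doubled text: $match";          # Doubled text: ss
    put "Repeated letter: $match[0]";    # Repeated letter: s
}
```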

If the capture is nested you have to do a bit more work. You might think you can subscript the capture variable, but can you see why it fails silently?

my $line = 'abcca';
say $line ~~ / a (.(.)) $0[0] a  /;  # does not match!

Those square brackets are pattern metacharacters and not postcircumfix indexers! You think that you have an element in $0, but it’s really $0 stringified followed by a group that is the literal text 0.

To get around this parsing problem surround the subscript access in $() so the pattern sees it as one thing. There’s one more trick to make it work out. Backreferences are only valid at a sequence point where the match operator has filled in all the details. An empty code block can force that:

my $line = 'abcca';
say  $line ~~ / a (.(.)) {} $($0[0]) a  /;  # matches

Now the $0[0] can match the c:

「abcca」
 0 => 「bc」
  0 => 「c」

Surrounders and Separators

To match something that has prefix and suffix characters, you could type out the pattern in the order it appears in the Str. Here’s an example that matches a word in literal parentheses:

my $line = 'outside (pupa) outside';
say $line ~~ / '(' \w+ ')'  /;         # 「(pupa)」

That’s not the best way to communicate that you want to match something in parentheses, though. The start and end characters aren’t next to each other in the pattern; you have to read ahead then surmise that the parentheses are circumfix parts of the same idea.

Instead, connect the beginning and end patterns with ~, then put the interior pattern after that. This describes something surrounded by parentheses subordinate to the structure:

my $line = 'outside (pupa) outside';
say $line ~~ / '(' ~ ')' \w+ /;

This is automatically nongreedy; it does not grab everything until the last closing parenthesis:

my $line = 'outside (pupa) space (pupa) outside';
say $line ~~ m/ '(' ~ ')' \w+ /; # 「(pupa)」

A global match will still find all the instances:

my $line = 'outside (pupa) space (pupa) outside';
say $line ~~ m:global/ '(' ~ ')' \w+ /; # (「(pupa)」 「(pupa)」)
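
The interior pattern can be a capture, which is handy when you want the content without its surrounding delimiters. A minimal sketch, assuming the same sample line:

```raku
my $line  = 'outside (pupa) outside';
my $match = $line ~~ m/ '(' ~ ')' (\w+) /;

if $match {
    put "With parentheses: $match";    # With parentheses: (pupa)
    put "Interior only: $match[0]";    # Interior only: pupa
}
```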

Going the other way, suppose that you want to match a series of things that are separated by other characters. A line of comma-separated values is such a thing:

my $line = 'Hamadryas,Leptophobia,Vanessa,Gargina';

To match the letters separated by commas, you could match the first group of letters then every subsequent occurrence of a comma and another group of letters:

say $line ~~ / (\w+) [ ',' (\w+) ]+ /;

That works, but it’s annoying because you have to use \w+ twice even though it’s describing the same thing. The % modifies a quantifier so that the pattern on the right comes between each group:

say $line ~~ / (\w+)+ % ',' /;

The output shows that you matched each group of letters:

「Hamadryas,Leptophobia,Vanessa,Gargina」
 0 => 「Hamadryas」
 0 => 「Leptophobia」
 0 => 「Vanessa」
 0 => 「Gargina」
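
Every item is labeled 0 because the quantified capture stores one Match per repetition in $0. As a sketch (the @genera name is illustrative), you can turn those into a list of plain Strs:

```raku
my $line = 'Hamadryas,Leptophobia,Vanessa,Gargina';
my @genera;

if $line ~~ m/ (\w+)+ % ',' / {
    @genera = $0.map( *.Str );   # $0 holds one Match per repetition
}

put @genera.elems;          # 4
put @genera.join(' | ');    # Hamadryas | Leptophobia | Vanessa | Gargina
```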

A double percent allows a trailing separator in the overall match:

my $line = 'Hamadryas,Leptophobia,Vanessa,';
say $line ~~ / (\w+)+ %% ',' /;

Notice that it matches that comma that follows Vanessa but does not create an empty capture after it:

「Hamadryas,Leptophobia,Vanessa,」
 0 => 「Hamadryas」
 0 => 「Leptophobia」
 0 => 「Vanessa」

NOTE

Although you’d think that CSV files should be simple, they aren’t. In the wild all sorts of weird things happen. The Text::CSV module handles all of those tricky bits. Use that instead of doing it yourself.

Assertions

Assertions don’t match text; they require that a certain condition be true at the current position in the text. They match a context instead of characters. Specify these in your pattern to allow the matcher to fail faster. You don’t need to scan the entire text if the pattern should only work at the beginning of the text.

Anchors

An anchor prevents the pattern from floating over the text to find a place where it can start matching. It requires that a pattern match at a particular position. If the pattern doesn’t match at that position the match can immediately fail and save itself the work of scanning the text.

The ^ forces your pattern to match at the absolute beginning of the text. This matches because the Hama comes at the beginning of the text:

say 'Hamadryas perlicus' ~~ / ^ Hama /;  # 「Hama」

Trying to match perl after ^ fails because that pattern is not at the beginning of the text:

say 'Hamadryas perlicus' ~~ / ^ perl /;  # Nil (fails)

Without the anchor the match would drift over the text looking at each position to check for perl. That’s extra work (and probably incorrect) if you know that you want to match at the beginning. Once the match fails at the beginning it’s immediately done.

The $ is the end-of-string anchor and does something similar at the end of the text:

say 'Hamadryas perlicus' ~~ / icus $ /;  # 「icus」

This one doesn’t match because there’s more text after icus:

say 'Hamadryas perlicus navitas' ~~ / icus $ /;  # Nil (fails)

There are anchors for the beginning and end of a line; those positions can differ from the beginning and end of the text. A line ends with a newline and that newline might be in the middle of your multiline text, like in this one (remember that the here doc strips the indentation):

$_ = chomp q:to/END/;   # chomp removes last newline
    Chorinea amazon
    Hamadryas perlicus
    Melanis electron
    END

The beginning-of-line anchor, ^^, matches at the absolute beginning of the text or immediately after any newline. These both work because Chorinea is at the start of the text and the start of the first line:

say m/ ^  Chorinea /;  # 「Chorinea」
say m/ ^^ Chorinea /;  # 「Chorinea」

Likewise, the end-of-line anchor, $$, matches before any newline or at the absolute end of the text. These also both work because electron is at the end of the text and the end of the last line:

say m/ electron $  /;  # 「electron」
say m/ electron $$ /;  # 「electron」

Hamadryas can’t match at the absolute beginning of the text but it can match at the beginning of a line:

say m/ ^  Hamadryas /; # Nil
say m/ ^^ Hamadryas /; # 「Hamadryas」

Similarly, perlicus can’t match at the absolute end of the text but it can match at the end of a line:

say m/ perlicus $  /;  # Nil
say m/ perlicus $$ /;  # 「perlicus」
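
Combined with a global match, the line anchors pick out something from every line. This sketch builds the same three-line text with explicit newlines rather than a here doc; the array names are illustrative:

```raku
my $text = "Chorinea amazon\nHamadryas perlicus\nMelanis electron";

# ^^ anchors at each line start, $$ at each line end
my @first-words = $text.match( / ^^ \w+ /, :global ).map( *.Str );
my @last-words  = $text.match( / \w+ $$ /, :global ).map( *.Str );

put @first-words.join(', ');   # Chorinea, Hamadryas, Melanis
put @last-words.join(', ');    # amazon, perlicus, electron
```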

Conditions

Word boundaries exist when a non-“word” character is next to a “word” character (in either order). Those terms are a bit fuzzy, since you likely think of word characters as the alphabetic characters. They are, however, the ones that match \w, which includes numbers and other things. The beginning and the end of the Str count as nonword characters.

EXERCISE 16.7 Output all the “word” characters that are not alphabetic characters. How many of them are there? The Range 0 .. 0xFFFF and the .chr method should be helpful.

Assert a word boundary with <|w>. Suppose that you want to match the name Hamad. Without a word boundary that would match in Hamadryas, but that’s not what you want. The word boundary keeps it from showing up in the middle of another word:

$_ = 'Hamadryas';
say m/ Hamad /;       # 「Hamad」
say m/ Hamad <|w> /;  # Nil

That second pattern can’t match because Hamadryas has a word character (a letter) following Hamad. The next example matches because a space follows Hamad:

my $name = 'Ali Hamad bin Perliana';
say $name ~~ / Hamad <|w> /;  # 「Hamad」

Word boundaries on each side isolate a word. These matches look for dry as its own word because it has word boundaries on each side. The first one fails because it’s in the middle of a bigger word:

$_ = 'Hamadryas';
say m/ <|w> dry <|w> /;  # Nil

$_ = 'The flower is dry';
say m/ <|w> dry <|w> /;  # 「dry」

Instead of <|w> you can use the << or >> to point to where the nonword characters should be:

$_ = 'The flower is dry';
say m/ << dry >> /;  # 「dry」

The arrows can point either way, but always toward the nonword characters:

$_ = 'a!bang';
say m/ << .+ >> /;   # 「a!bang」   - greedy
say m/ << .+? >> /;  # 「a」        - nongreedy
say m/ >> .+ >> /;   # 「!bang」
say m/ >> .+ << /;   # 「!」

The opposite of a word boundary assertion is <!|w>. That means that both sides of the assertion must be the same type of character—either both word characters or both nonword characters. Now the results are flipped:

$_ = 'Hamadryas';
say m/ <!|w> dry <!|w> /;  # 「dry」

$_ = 'The flower is dry';
say m/ <!|w> dry <!|w> /;  # Nil
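
Word boundaries matter most when you count or extract whole words. As a sketch with made-up text, compare a global match with and without the boundaries:

```raku
my $text = 'dry drying sundry dry';

my @whole-words = $text.match( / << dry >> /, :global );   # dry as its own word
my @anywhere    = $text.match( / dry /, :global );         # dry in any position

put "As its own word: {@whole-words.elems}";   # As its own word: 2
put "Anywhere: {@anywhere.elems}";             # Anywhere: 4
```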

Code Assertions

Code assertions are perhaps the most amazing and powerful part of regular expressions. You can inspect what’s happened so far and use arbitrarily complex code to decide if you accept that. If your code evaluates to True you satisfy the assertion and the pattern can keep matching. Otherwise, your pattern fails.

Your code for the assertion shows up in <?{}>. You can put almost anything you like in there:

'Hamadryas' ~~ m/ <?{ put 'Hello!' }> /;   # Hello!

This matches no characters in Hamadryas but is also not the null pattern (which is not valid). From inside the assertion you get Hello! as output:

put
    'Hamadryas' ~~ m/ <?{ put 'Hello!' }> /
        ?? 'Worked' !! 'Failed';

This first outputs from inside the assertion:

Hello!
Worked

Change the assertion so that False is the last expression:

put
    'Hamadryas' ~~ m/ <?{ put 'Hello!'; False }> /
        ?? 'Worked' !! 'Failed';

You get much more output. As the code assertion fails the match cursor moves along the text and tries again. Each time the code assertion returns False it tries again. It keeps doing that until it gets to the end of the Str:

Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Failed

Here’s something more complex. Suppose you want to match even numbers only. You could create a pattern that looks for an even digit at an end of a Str:

say '538' ~~ m/ ^ \d* <[24680]> $ /;   # 「538」

With a code assertion you don’t care which digits you match as long as they are even. This makes the pattern a bit simpler by showing the complexity as code. Your intent may be clearer this way:

say '538' ~~ m/ ^ (\d+) <?{ $0 %% 2 }> /;

There’s a capture and that text also is divisible by two, so that match succeeds:

「538」
 0 => 「538」
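
The same idea extends to any numeric condition that would be awkward to spell out character by character. This sketch accepts only month numbers; $month and is-month are hypothetical names chosen for the example:

```raku
# Accept digits only when their numeric value is between 1 and 12
my $month = rx/ ^ (\d+) <?{ 1 <= $0 <= 12 }> $ /;

sub is-month ( $text ) { so $text ~~ $month }

for '7', '12', '13', '0' -> $candidate {
    put "$candidate: ", is-month($candidate) ?? 'valid month' !! 'rejected';
}
# 7: valid month
# 12: valid month
# 13: rejected
# 0: rejected
```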

It still works if the characters aren’t the ASCII decimal digits:

say '١٣٨' ~~ m/ ^ (\d+) <?{ $0 %% 2 }> /;

Or even:

say '١٣٨' ~~ m/ ^ (\d+) <?{ $0 %% ٢ }> /;

Matching an IPv4 Address

Consider a pattern to match a dotted-decimal IP address. There are four decimal numbers from 0 to 255, such as 127.0.0.1 (the loopback address). You could write a pattern without an assertion, but you have to figure out how to restrict the range of the number:

my $dotted-decimal = rx/ ^
    [
    || [ <[ 0 1 ]> <[ 0 .. 9 ]> ** 0..2 ]  # 0 to 199
    || [
        2
        [
        || <[ 0 .. 4 ]> <[ 0 .. 9 ]>       # 200 to 249
        || 5 <[ 0 .. 5 ]>                  # 250 to 255
        ]
       ]
    || <[ 0 .. 9 ]> ** 1..2                # 2 to 9 and 20 to 99
    ] ** 4 % '.'
    $
    /;

say '127.0.0.1' ~~ $dotted-decimal;  # 「127.0.0.1」

Matching on text to suss out numerical values means careful handling of each character position. That’s a lot of work and uses a feature you haven’t seen yet (alternations are coming up). You could reduce that to almost nothing with a code assertion that looks at the text you just matched and tells the pattern if you want to accept it:

my $easier = rx/
    ^
    ( <[0..9]>+: <?{ 0 <= $/ <= 255 }> ) ** 4 % '.'
    $
    /;

The assertion is <?{ 0 <= $/ <= 255 }>. That $/ is the Match for only that level of parentheses. This allows you to be sloppy in the pattern for matching digits. You don’t care if you match 4, 5, or 20 digits because the code assertion will check that.

If that code assertion fails after matching digits, you don’t want to give back some of the digits to try again. You know the next thing must be the . between groups of digits. To prevent any backtracking you use the : on that + quantifier. You don’t need this to get the right match, but it means less work before the match ultimately fails.

The % modifies the ** 4 quantifier so a literal . shows up between each of the four groups of digits.

断言是 <?{ 0 <= $/ <= 255 }> 那个 $/ 是只有那个括号级别的Match。这允许你在匹配数字的模式中马虎。你不关心是否匹配 4, 5 或 20 位数字,因为代码断言将检查该数字。

如果代码断言在匹配数字后失败,则你不希望归还一些数字再次尝试。你知道下一件事必须是数字组之间的 . 。要防止任何回溯,请在量词 + 上使用 :。你不需要这个来获得正确的匹配,但它创造的工作量最少,最终失败。

% 修饰 ** 4 量词,所以字面 . 显示在四组数字中的每一组之间。
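To see the assertion accept and reject values, here is a short sketch that runs the same pattern over a few addresses. The sample addresses are made up for illustration, and the pattern is repeated so the snippet stands on its own:

```raku
my $easier = rx/
    ^
    ( <[0..9]>+: <?{ 0 <= $/ <= 255 }> ) ** 4 % '.'
    $
    /;

# Hypothetical sample addresses: two valid, one with an out-of-range
# number, and one with too few groups
for '127.0.0.1', '192.168.2.1', '256.0.0.1', '1.2.3' -> $address {
    put $address, ' ', $address ~~ $easier ?? 'accepted' !! 'rejected';
}
```

The range check lives in one place, so changing the allowed range means editing two numbers instead of reworking character classes.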

Alternations

Sometimes there are several distinct patterns that might match at the same position. An alternation is a way to specify that. There are two ways to do this: it can match the first alternative that succeeds or it can match the longest one.

有时,有几种不同的模式可能在同一位置匹配。交替是一种指定它的方式。有两种方法可以做到这一点:它可以匹配成功的第一个选项,也可以匹配最长的选项。

First Match

If you’ve used regexes in other languages you’re probably used to alternations where the leftmost alternative that can match is the one that wins. Set up this type of alternation with a || between the possibilities:

如果你已经在其他语言中使用了正则表达式,那么你可能会习惯于可以匹配的最左侧备选分支胜出的备选分支。使用 || 在可能的备选分支之间设置此类备选分支:

my $pattern = rx/ abc || xyz || 1234 /;

Either abc, xyz, or 1234 can match:

my @strings = < 1234 xyz abc 789 >;
for @strings {
    put "$_ matches" if $_ ~~ $pattern;
}

The first three Strs match because they have at least one of the alternatives:

前三个字符串匹配,因为他们至少有一个备选分支:

1234 matches
xyz matches
abc matches
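Because the first alternative that can match wins, the order matters when alternatives overlap. A minimal sketch, with a made-up target string, where both alternatives can match at the same position:

```raku
# Same alternatives, different order
say 'butterfly' ~~ m/ butter || butterfly /;    # 「butter」
say 'butterfly' ~~ m/ butterfly || butter /;    # 「butterfly」
```

With this kind of alternation, put the more specific alternatives first if you want them to have a chance.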

The alternation has an interesting feature: you can start it with a || with nothing before it. This is the same pattern and does not create an empty alternative at the beginning:

备选分支有一个有趣的特点:你可以以一个前面什么都没有的 || 开始。这是相同的模式,并且不会在开头创建一个空的备选分支:

my $pattern = rx/ || abc || xyz || 1234 /;

This looks better spread out so each alternative gets its own line. The reformatted pattern starts with || and has a more pleasing parallel structure that allows you to remove lines without disturbing the other alternatives:

这看起来更好地展开,因此每个备选分支单独占一行。重新格式化的模式以 || 开头并且有一个更令人愉悦的并行结构,允许你删除行而不会打扰其他备选分支:

my $pattern = rx/
    || abc
    || xyz
    || 1234
    /;

Instead of placing a || between each alternative, you can put it before a bunch of alternatives. Do that with an Array directly in your pattern:

你可以把 || 放在一堆备选分支之前而不是在每个备选分支之间放置 ||。直接在你的模式中使用数组执行此操作:

my $pattern = rx/ || @(<abc xyz 1234>) /;

An existing variable after the || does the same thing:

|| 之后的现有变量做同样的事情:

my @variable = <abc xyz 1234>;
my $pattern = rx/ || @variable /;

You aren’t interpolating that Array. The pattern uses the current value of the Array when it matches. In this example the Array has 1234 as the last element when you define the pattern. Before you use the pattern you change that last element:

你没有插值该数组。该模式在匹配时使用数组的当前值。在此示例中,数组 在定义模式时将 1234 作为最后一个元素。在使用该模式之前,你需要更改最后一个元素:

my @strings = < 1234 xyz abc 56789 >;
my @variable = <abc xyz 1234>;
my $pattern = rx/ || @variable /;

put "Before:";
for @strings {
    put "\t$_ matches" if $_ ~~ $pattern;
}

# change the array after making the pattern
@variable[*-1] = 789;

put "After:";
for @strings {
    put "\t$_ matches" if $_ ~~ $pattern;
}

The output shows that you matched with the current value of the variable instead of its value when you created the pattern. Different values match after you change the Array:

输出显示你匹配的变量的当前值,而不是匹配该模式创建时的值。更改数组后,匹配到不同的值:

Before:
    1234 matches
    xyz matches
    abc matches
After:
    xyz matches
    abc matches
    56789 matches

EXERCISE 16.8 Output all the lines from the butterfly census file that have the genus Lycaena, Zizeeria, or Hamadryas. How many different species did you find?

练习16.8 输出蝴蝶人口普查文件中属于 Lycaena、Zizeeria 或 Hamadryas 属的所有行。你找到了多少种不同的物种?

Longest Token Matching

Some alternations might have “better” possibilities that could match. Rather than choosing the first specified possibility you can tell the match operator to try all of them, then choose the “best” one. This is generally called longest token matching (LTM), but it finds the best, not longest, match.

LTM alternation uses a single |. In this pattern all of the alternatives can match. The first possibility it could match is the single a. The “best” match is abcd, though. That’s the match you see in the output:

一些备选分支可能具有可以匹配的“更好”的可能性。你可以告诉匹配操作符尝试所有这些,然后选择“最佳”的可能性,而不是选择第一个指定的可能性。这通常称为最长令牌匹配(LTM),但它找到最佳匹配,而不是最长匹配。

LTM 备选分支使用单个 |。在这种模式中,所有备选分支都可以匹配。它可以匹配的第一种可能性是单个 a。不过,“最佳”匹配是 abcd。这是你在输出中看到的匹配:

my $pattern = rx/
    | a
    | ab
    | abcd
    /;

say 'abcd' ~~ $pattern;  # 「abcd」

An Array variable works just like it did in the || examples:

数组变量就像在 || 例子中一样工作:

my @variable = <a ab abcd>;
my $pattern = rx/ | @variable /;

say 'abcd' ~~ $pattern;  # 「abcd」
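One way to convince yourself that the typed order doesn’t decide the winner here is to shuffle the alternatives and compare with ||. This is only a small sketch reusing the same three alternatives:

```raku
# With | the order of the alternatives does not change the result
say 'abcd' ~~ m/ a | ab | abcd /;       # 「abcd」
say 'abcd' ~~ m/ abcd | ab | a /;       # 「abcd」

# With || the first alternative that can match wins
say 'abcd' ~~ m/ a || ab || abcd /;     # 「a」
```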

What makes one possibility better than another? There are some rules that decide this. Better patterns have longer tokens, and that’s where the confusion comes in. It’s not actually about how much text it matches; it’s about the pattern.

This next part will probably be more than you’ll ever want to know. A pattern can have both declarative and procedural elements. In short, some parts of the pattern merely describe some text and other parts force the match operator to do something. The abc is declarative. The {} inline code is procedural: it’s an action.

Consider this example. The longest text that might match is Hamadry. That alternative has the {True} inline code block in it, though. The second alternative is simply Hamad, and that is the one that matches:

是什么让一种可能性比另一种更好?有一些规则可以决定这一点。更好的模式有更长的令牌,这就是困惑的来源。实际上并不是它匹配多少文本;这是关于模式的。

下一部分可能比你想知道的要多。模式可以同时具有声明和过程元素。简而言之,模式的某些部分仅描述一些文本,而其他部分则强制匹配操作符执行某些操作。 abc 是声明性的。 {} 内联代码是一个动作。

请看这个例子。可能匹配的最长文本是 Hamadry。但是,该备选分支中包含 {True} 内联代码块。第二个备选分支只是 Hamad,那是匹配的:

say 'Hamadryas perlicus sixus' ~~ m/
    | Hama{True}dry
    | Hamad
    /;  # 「Hamad」

When the match operator is deciding which one has priority it looks for the pattern that has the longest declarative part. The first one has Hama; the second one has Hamad. That makes the second one the longer token. It’s about the pattern, not the target text. (Ignore that you haven’t read a definition of a token yet.)

Sometimes the two patterns can have the same size tokens, like these two alternatives. One has a character class and the other a literal d. The more specific one (the literal) wins:

当匹配运算符决定哪一个具有优先级时,它会查找具有最长声明部分的模式。第一个有 Hama;第二个有 Hamad。这使得第二个有更长的令牌。它是关于模式,而不是关于目标文本。 (忽略你还没有读过令牌的定义。)

有时这两种模式可以具有相同大小的令牌,就像这两个备选分支一样。一个有字符类,另一个有字面值 d。更具体的一个(字面的)获胜:

$_ = 'Hamadryas perlicus sixus';

say 'Hamadryas perlicus sixus' ~~ m/
    | Hama<[def]>{put "first"}
    | Hamad      {put "second"}
    /;  # 「Hamad」

The code Blocks are only there to show which alternative was “best”:

代码仅用于显示哪种备选分支“最佳”:

second
「Hamad」

Swap the alternatives around to see that it still chooses the more specific one:

改变它,看它仍然选择更具体的一个:

$_ = 'Hamadryas perlicus sixus';

say 'Hamadryas perlicus sixus' ~~ m/
    | Hamad      {put "first"}
    | Hama<[def]>{put "second"}
    /;  # 「Hamad」

Now the first alternative is more specific and it is “best”:

现在第一个备选分支更具体,它是“最好的”:

first
「Hamad」

So what counts as a token? It’s the longest stretch of things that aren’t procedural. As I write this, however, the documentation avoids defining that. It requires deep knowledge of what happens in the guts of the language. It’s a big ugly topic that I’ll now ignore, although the book Mastering Regular Expressions by Jeffrey E.F. Friedl (O’Reilly) will tell you most of what you need to know. Perhaps the confusion will sort itself out by the time you read this.

什么算作 token 呢?这是最长的一些不是程序性的东西。然而,当我写这篇文章时,文档避免了定义它。它需要深入了解语言的内容。虽然 Jeffrey E.F.Friedl(O’Reilly)的“掌握正则表达式”这本书将告诉你大部分你需要知道的东西,但我现在忽略了一个很大的丑陋主题。也许当你读到这篇文章时,这种困惑会自行解决。

All of that is to say that the match operator looks at each | alternative and can choose to do the one it thinks provides the best match. The match operator does not have to do them in the order that you typed them.

所有这一切都是说匹配运算符会查看每个 | 备选分支,并可以选择做它认为提供最佳匹配的那个。匹配运算符不必按你键入的顺序执行它们。

Summary

In this chapter you saw the common regex features that will solve most of your pattern problems. You can repeat parts of a pattern, capture and extract parts of the text, define alternate patterns that can match, and specify conditions within the pattern. There is much more that patterns can do for you. Practice what you’ve read here and delve into the documentation to discover more.

在本章中,你看到了可以解决大多数模式问题的常见正则表达式功能。你可以重复模式的某些部分,捕获和提取文本的某些部分,定义可以匹配的备选分支模式,以及指定模式中的条件。模式可以为你做更多的事情。练习你在这里阅读的内容并深入研究文档以发现更多信息。
