Chapter 15. Regular Expressions

Notice

This translation of the chapter is only for learning and researching Raku. Please support the electronic or print edition.

Regular expressions (or regexes) are patterns that describe a possible set of matching texts. They are a little language of their own, and many characters have a special meaning inside patterns. They may look cryptic at first, but after you learn them you have quite a bit of power.

Forget what you’ve seen about patterns in other languages. The Raku pattern syntax started over. It’s less compact but also more powerful. In some cases it acts a bit differently.

This chapter shows simple patterns that match particular characters or sets of characters. It’s just the start. In Chapter 16 you’ll see fancier patterns and the side effects of matching. In Chapter 17 you’ll take it all to the next level.

The Match Operator

A pattern describes a set of text values. The simple pattern abc describes all the values that have an a next to a b next to a c. The trick then is to decide if a particular value is in the set of matching values. There are no half or partial matches; it matches or it doesn’t.

A pattern inside m/.../ immediately applies itself to the value in $_. If the pattern is in the Str the match operator returns something that evaluates to True in a condition:

$_ = 'Hamadryas';
if m/Hama/ { put 'It matched!'; }
else       { put 'It missed!';  }

That’s a bit verbose. The conditional operator takes care of that:

put m/Hama/ ?? 'It matched!' !! 'It missed!';

You don’t have to match against $_. You can use the smart match to apply it to a different value. That’s the target:

my $genus = 'Hamadryas';
put $genus ~~ m/Hama/ ?? 'It matched!' !! 'It missed!';

That target could be anything, including an Array or Hash. These match a single item:

$genus                ~~ m/Hama/;
@animals[0]           ~~ m/Hama/;
%butterfly<Hamadryas> ~~ m/perlicus/;

But you can also match against multiple items. The object on the left side of the smart match decides how the pattern applies to the object. This matches if any of the elements in @animals matches:

if @animals ~~ m/Hama/ {
    put "Matches at least one animal";
    }

This is the same as matching against a Junction:

if any(@animals) ~~ m/Hama/ {
    put "Matches at least one animal";
    }

The match operator is commonly used in the condition inside a .grep:

my @hama-animals = @animals.grep: /Hama/;
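
As a minimal sketch of matching against several items, assuming a made-up @animals list (the genus names are only sample data), the Junction form and .grep could look like this:

```raku
# Sample data; these names are placeholders for the sketch
my @animals = <Hamadryas Papilio Hamanumida Danaus>;

# True if at least one element matches the pattern
put 'Matches at least one animal' if any(@animals) ~~ /Hama/;

# Keep only the elements that match the pattern
my @hama-animals = @animals.grep: /Hama/;
put @hama-animals.join: ', ';   # Hamadryas, Hamanumida
```

The .grep form is usually the more useful of the two because it tells you which elements matched, not just that something did.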

Match Operator Syntax

The match operator can use alternate delimiters, similar to the quoting mechanism:

m{Hama}
m!Hama!

Whitespace inside the match operator doesn’t matter. It’s not part of the pattern (until you say so, as you’ll see later). All of these are the same, including the last example with vertical whitespace:

m/ Hama /
m{ Hama }
m! Hama !
m/
    Hama
/

You can put spaces between alphabetic characters, but you’ll probably get a warning because Raku wants you to put those together:

m/ Ha ma /

If you want a literal space inside the match operator you can escape it (along with other things you’ll see later):

m/ Ha\ ma /

Quoting whitespace makes it literal too (the space around the quoted whitespace is still insignificant), or you can quote it all together:

m/ Ha ' ' ma /
m/ 'Ha ma' /
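
To see the difference that quoting makes, here is a small sketch that tries the patterns against a target that really does contain a space. The target value and the variable names are made up for the illustration:

```raku
$_ = 'Ha ma';

# Unquoted whitespace is not part of the pattern, so this looks for "Hama"
my $unquoted     = so m/ Hama /;

# A quoted space is literal
my $quoted-space = so m/ Ha ' ' ma /;

# Quoting the whole sequence works too
my $quoted-all   = so m/ 'Ha ma' /;

put "$unquoted $quoted-space $quoted-all";   # False True True
```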

You need to quote or escape any character that’s not alphabetic or a number, even if those characters aren’t “special.” The other unquoted characters may be metacharacters that have special meaning in the pattern language.

Successful Matches

If the match operator succeeds it returns a Match object, which is always a True value. If you put that object it shows you the part of the Str that matched. The say calls .gist and the output is a bit different:

$_ = 'Hamadryas';
my $match = m/Hama/;
put $match; # Hama
say $match; # ｢Hama｣

The output of say gets interesting as the patterns get more complicated. That makes it useful for the regex chapters, and you’ll see more of that here compared to the rest of the book.

If the match does not succeed it returns Nil, which is always False:

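$_ = 'Hamadryas';
# Papilio is not in the Str, so the match fails.
# Check the result directly; assigning Nil to a plain variable resets it to Any.
put (m/Papilio/).^name;    # Nil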

It’s usually a good idea to check the result before you do anything with it:

if my $match = m/Hama/ { # matched
    say $match;
    }

You don’t need the $match variable though. The result of the last match shows up in the special variable $/, which you’ll see more of later:

if m/Hama/ { # matched
    say $/;
    }
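
The Match object in $/ knows more than the text it matched. As a brief sketch, and going slightly beyond what this chapter covers, these Match methods report where the match sits in the target:

```raku
$_ = 'Hamadryas';

if m/dry/ {
    put 'Matched: ', ~$/;           # dry
    put 'From:    ', $/.from;       # 4
    put 'To:      ', $/.to;         # 7 (one past the last matched character)
    put 'Before:  ', $/.prematch;   # Hama
    put 'After:   ', $/.postmatch;  # as
    }
```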

Defining a Pattern

Useful patterns can get quite long and unwieldy. Use rx// to define a pattern (a Regex) for later use. This pattern is not immediately applied to any target. This allows you to define a pattern somewhere that doesn’t distract from what you are doing:

my $genus = 'Hamadryas';
my $pattern = rx/ Hama /; # something much more complicated
$genus ~~ $pattern;

and reuse the pattern wherever you need it:

for lines() -> $line {
    put $line if $line ~~ $pattern;
    }

It’s possible to combine saved patterns into a larger one. This allows you to decompose complicated patterns into smaller, more tractable ones that you can reuse later (which you’ll do extensively in Chapter 17):

my $genus = 'Hamadryas';

my $hama  = rx/Hama/;
my $dryas = rx/dryas/;
my $match = $genus ~~ m/$hama$dryas/;

say $match;

Rather than storing a pattern object in a variable, declare a lexical pattern with regex. This looks like a subroutine because it has a Block, but it’s not code inside; it’s a pattern and uses that slang:

my regex hama { Hama }

Use this in a pattern by surrounding it with angle brackets:

my $genus = 'Hamadryas';
put $genus ~~ m/<hama>/ ?? 'It matched!' !! 'It missed!';

You can define multiple named regexes and use them together:

my regex hama  { Hama }
my regex dryas { dryas }

$_ = 'Hamadryas';
say m/<hama><dryas>/;

Each named regex becomes a submatch. You can see the structure when you output it with say. It shows the overall result and the results of the subpatterns too:

｢Hamadryas｣
 hama => ｢Hama｣
 dryas => ｢dryas｣

Treat the Match object like a Hash (although it isn’t) to get the parts that matched the named regexes. The name of the regex is the “key”:

$_ = 'Hamadryas';
my $result =  m/<hama><dryas>/;

if $result {
    put "First: $result<hama>";
    put "Second: $result<dryas>";
    }
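
The same idea works with a quoted literal between the named regexes. This sketch assumes two hypothetical named regexes, genus and species, and reads the submatches through $/ with the $<name> shortcut instead of a separate variable:

```raku
# Hypothetical named regexes for this sketch
my regex genus   { Hamadryas }
my regex species { laodamia }

$_ = 'Hamadryas laodamia';

if m/ <genus> ' ' <species> / {
    put 'Genus:   ', ~$<genus>;     # Hamadryas
    put 'Species: ', ~$<species>;   # laodamia
    }
```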

Predefined Patterns

Table 15-1 shows several of the predefined patterns that are ready for you to use. You can define your patterns in a library and export them just like you could with subroutines:

# Patterns.pm6
my regex hama is export { Hama }

Load the module and those named regexes are available to your patterns:

use lib <.>;
use Patterns;

$_ = 'Hamadryas';
say m/ <hama> /;

Table 15-1. Predefined patterns

Predefined pattern   What it matches
<alnum>              Alphabetic and digit characters
<alpha>              Alphabetic characters
<ascii>              Any ASCII character
<blank>              Horizontal whitespace
<cntrl>              Control characters
<digit>              Decimal digits
<graph>              <alnum> + <punct>
<ident>              A valid identifier character
<lower>              Lowercase characters
<print>              <graph> + <space>, but without <cntrl>
<punct>              Punctuation and symbols beyond ASCII
<space>              Whitespace
<upper>              Uppercase characters
<|wb>                Word boundary (an assertion rather than a character)
<word>               <alnum> + Unicode marks + connectors, like ‘_’ (extra)
<ws>                 Whitespace (required between word characters, optional otherwise)
<ww>                 Within a word (an assertion rather than a character)
<xdigit>             Hexadecimal digits [0-9A-Fa-f]
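
You use the predefined patterns the same way as your own named regexes. A short sketch with a made-up target value:

```raku
$_ = 'Hamadryas 42';

put 'Uppercase then lowercase: ', ~$/ if m/ <upper> <lower> /;   # Ha
put 'Two digits: ', ~$/               if m/ <digit> <digit> /;   # 42
put 'Has whitespace'                  if m/ <space> /;
put 'No punctuation'                  unless m/ <punct> /;
```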

EXERCISE 15.1 Create a program that uses a regular expression to output all of the matching lines from the files you specify on the command line.

Matching Nonliteral Characters

You don’t have to literally type a character to match it. You might have an easier time specifying its code point or name. You can use the same \x[CODEPOINT] or \c[NAME] that you saw in double-quoted Strs in Chapter 4.

If you specify a name it must be all uppercase.

You could match the initial capital H by name, even though you have to type a literal H in the name:

my $pattern = rx/
     \c[LATIN CAPITAL LETTER H] ama
    /;
$_ = "Hamadryas";

put $pattern ?? 'Matched!' !! 'Missed!';

You can do the same thing with the code point. If you specify a code point use the hexadecimal number (with either case):

my $pattern = rx/
     \x[48] ama
    /;
$_ = "Hamadryas";

put $pattern ?? 'Matched!' !! 'Missed!';

This makes more sense if you want to match a character that’s either hard to type or hard to read. If the Str has the 🐱 character (U+1F431 CAT FACE), you might not be able to distinguish that from 😸 (U+1F638 GRINNING CAT FACE WITH SMILING EYES) without looking very closely. Instead of letting another programmer mistake your intent, you can use the name to save some eyestrain:

my $pattern = rx/
     \c[CAT FACE]  # or \x[1F431]
    /;
$_ = "This is a catface: 🐱";
put $pattern ?? 'Matched!' !! 'Missed!';

Matching Any Character

Patterns have metacharacters that match something other than their literal selves. Some of these are listed in Table 15-2 (and most you won’t see in this chapter). The . matches any character (including a newline). This pattern matches any target that has at least one character:

m/ . /

To match a Str with an a and a c separated by a character, put the dot between them in the pattern. This skips the lines that don’t match that pattern:

for lines() {
    next unless m/a.c/;
    .put
    }
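
To make the behavior of the dot concrete, this sketch tries the same pattern against a few made-up candidates. The dot must match exactly one character, so ac and abbc both miss:

```raku
for <abc a-c aXc ac abbc> -> $candidate {
    my $verdict = $candidate ~~ /a.c/ ?? 'matches' !! 'misses';
    put $candidate, ' ', $verdict;
    }
```

Only the first three candidates match.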

ESCAPING CHARACTERS

Some characters have special meaning in patterns. The colon introduces an adverb and the # starts a comment. To match those as literal characters you need to escape them. A backslash will do:

my $pattern = rx/ \# \: Hama \. /

This means to match a literal backslash, you need to escape that too:

my $pattern = rx/ \# \: Hama \\ /

You can do the same thing with the other pattern metacharacters. To match a literal dot, escape it:

my $pattern = rx/ \. /

The backslash only escapes the character that comes immediately after it. You can’t escape a literal space character, and you can’t escape a character that isn’t special. Table 15-2 shows what you need to escape, even though I haven’t shown you most of those features yet.

Table 15-2. Pattern metacharacters

Metacharacter        Why it’s special
#                    Starts a comment
\                    Escapes the next character or a shortcut
.                    Matches any character
:                    Starts an adverb, or prevents backtracking
( and )              Starts a capture
< and >              Used to create higher-level thingys
[, ], and '          Used for grouping
+, |, &, -, and ^    Set operations
?, *, +, and %       Quantifiers
|                    Alternation
^ and $              Anchors
$                    Starts a variable or named capture
=                    Assigns to named captures

Characters inside quotes are always their literal selves:

my $pattern = rx/ '#:Hama' \\ /

You can’t use the single quotes to escape the backslash since a single backslash will still try to escape the character that comes after it.
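
Here is a small sketch of escaping and quoting side by side, using a made-up target value that contains a #, a colon, and a dot:

```raku
$_ = 'Hama #1: 3.5';

put m/ \# \d \: /   ?? 'matched' !! 'missed';   # matched, escaped # and :
put m/ '#1:' /      ?? 'matched' !! 'missed';   # matched, quoted instead
put m/ 3 \. 5 /     ?? 'matched' !! 'missed';   # matched, literal dot
put m/ 3 \. \. 5 /  ?? 'matched' !! 'missed';   # missed, there is only one dot
```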

MATCHING LITERAL SPACES

You have a tougher time if you want to match literal spaces. You can’t escape a space with \ because unspace isn’t allowed in a pattern. Instead, put quotes around the literal space:

my $pattern = rx/ Hamadryas ' ' laodamia /;

Or put the entire sequence in quotes:

my $pattern = rx/ 'Hamadryas laodamia' /;

Those single quotes can quickly obscure what belongs where; it can be helpful to spread the pattern across lines and note what you are trying to do:

my $pattern = rx/
    Hamadryas    # genus
    ' '            # literal space
    laodamia     # species
    /;

You can make whitespace significant with the :s adverb:

my $pattern = rx:s/ Hamadryas laodamia /;

my $pattern = rx/ :s Hamadryas laodamia /;

The :s is the short form of :sigspace:

my $pattern = rx:sigspace/ Hamadryas laodamia /;

my $pattern = rx/ :sigspace Hamadryas laodamia /;

Notice that this will match Hamadryas laodamia, even though the pattern has whitespace at the beginning and end. The :s turns the whitespace in the pattern into a subrule <.ws>:

$_ = 'Hamadryas laodamia';
my $pattern = rx/ Hamadryas <.ws> laodamia /;
if m/$pattern/ {
    say $/;  # ｢Hamadryas laodamia｣
    }

You can combine adverbs, but they each get their own colon. Order does not matter. This pattern has significant whitespace and is case insensitive:

my $pattern = rx:s:i/ Hamadryas Laodamia /;
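
As a sketch of what the two adverbs do together, this tries one pattern against three made-up targets. Any amount of whitespace between the words is fine, but with :s there has to be some:

```raku
my $pattern = rx:s:i/ hamadryas laodamia /;

for 'Hamadryas laodamia', "HAMADRYAS \t LAODAMIA", 'Hamadryaslaodamia' -> $name {
    put $name ~~ $pattern ?? 'matched' !! 'missed';
    }
```

The first two targets match and the last one misses because nothing separates the two words.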

Matching Types of Characters

So far, you’ve matched literal characters. You typed out the characters you wanted, and escaped them in some cases. There are some sets of characters that are so common they get shortcuts. These start with a backslash followed by a letter that connotes the set of characters. Table 15-3 shows the list of shortcuts.

If you want to match any digit, you can use \d. This matches anything that is a digit, not just the Arabic digits:

/ \d /

Each of these shortcuts comes with a complement. \D matches any nondigit.

Table 15-3. Character class shortcuts

Shortcut   Characters that match
\d         Digits (Unicode property N)
\D         Anything that isn’t a digit
\w         Word characters: letters, digits, or underscores
\W         Anything that isn’t a word character
\s         Any kind of whitespace
\S         Anything that isn’t whitespace
\h         Horizontal whitespace
\H         Anything that isn’t horizontal whitespace
\v         Vertical whitespace
\V         Anything that isn’t vertical whitespace
\t         A tab character (specifically, only U+0009)
\T         Anything that isn’t a tab character
\n         A newline or carriage return/newline pair
\N         Anything that isn’t a newline
EXERCISE 15.2 Write a program that outputs only those lines of input that contain three decimal digits in a row. You wrote most of this program in the previous exercise.

UNICODE PROPERTIES

The Unicode Character Database (UCD) defines the code points and their names and assigns them one or more properties. Each character knows many things about itself, and you can use some of that information to match them. Place the name of the Unicode property in <:...>. That colon must come right after the opening angle bracket. If you wanted to match something that is a letter, you could use the property Letter:

/ <:Letter> /

Instead of matching a property, you can match characters that don’t have that particular property. Put a ! in front of the property name to negate it. This matches characters that aren’t the title-case letters:

/ <:!TitlecaseLetter> /

Each property has a long form, like Letter, and a short form, in this case L. There are other properties, such as Uppercase_Letter and Lu, or Number and N:

/ <:L> /
/ <:N> /

You can match the characters that belong to certain Unicode blocks or scripts:

<:Block('Basic Latin')>
<:Script<Latin>>

Even though you can abbreviate these property names I’ll use the longer names in this book. See the documentation for the other properties.
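
As a rough sketch of property matching, this classifies a few sample characters (chosen only for illustration) with the Letter and Number properties:

```raku
for 'a', 'Σ', '7', '!' -> $char {
    my $kind = $char ~~ / <:Letter> / ?? 'a letter'
            !! $char ~~ / <:Number> / ?? 'a number'
            !!                           'something else';
    put $char, ' is ', $kind;
    }
```

The Greek capital sigma counts as a letter just like the a does, because the property comes from the UCD rather than from ASCII ranges.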

COMBINING PROPERTIES

One property might not be enough to describe what you want to match. To build fancier ones, combine them with character class set operators. These aren’t the same operators you saw in Chapter 14; they’re special to character classes.

The + creates the union of the two properties. Any character that has either property will match:

/ <:Letter + :Number> /
/ <:Open_Punctuation + :Close_Punctuation> /

Subtract one property from another with -. Any character with the first property that doesn’t have the second property will match this. The UCD distinguishes the characters that can start an identifier (in the UCD sense, not the Raku sense) from those that can continue one. The following example matches the characters that can continue an identifier, except for the numbers:

/ <:ID_Continue - :Number> /

You can shorten this to not match a character without a particular property. It looks like you leave off the first part of the subtraction; the - comes right after the opening angle bracket. That implies you’re subtracting from all characters. This matches all the characters that don’t have the Letter property:

/ <-:Letter> /
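
One way to see what a combined property matches is to .comb a sample Str with it and look at what survives. This is only a sketch; the $text value is made up, and .comb shows up again later in the chapter:

```raku
my $text = 'Hama 42!';

# Union: keep the letters and the numbers
put $text.comb( / <:Letter + :Number> / ).join;   # Hama42

# Difference: letters that are not uppercase (Lu is the short form of Uppercase_Letter)
put $text.comb( / <:Letter - :Lu> / ).join;       # ama

# Negation: everything that is not a letter (the result starts with a space)
put $text.comb( / <-:Letter> / ).join;            # ' 42!'
```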

EXERCISE 15.3 Write a program to count all of the characters that match either the Letter or Number properties. What percentage of the code points between 1 and 0xFFFD are either letters or numbers? The .chr method may be handy here.

User-Defined Character Classes

You can define your own character classes. Put the characters that you want to match inside <[...]>. These aren’t the same square brackets that you saw earlier for grouping; these are inside the angle brackets. This character class matches either a, b, or 3:

/ <[ab3]> /

As with everything else so far, this matches one character and that one character can be any of the characters in the character class. This character class matches either case at a single position:

/ <[Hh]> ama /    # also / [ :i h ] ama /

You could specify the hexadecimal value of the code point. The whitespace is insignificant:

/ <[ \x[48] \x[68] ]> ama /

The character name versions work too:

/ <[
    \c[LATIN CAPITAL LETTER H]
    \c[LATIN SMALL LETTER H]
    ]>
/

You can make a long list of characters:

/ <[abcdefghijklmnopqrstuvwxyz]> / # from a to z

Inside the character class the # is just a #. If you try to put a comment in there all of the characters in your message become part of the character class:

/ <[
    \x[48] # uppercase
    \x[68] # lowercase
  ]>
/

You’ll probably get warnings about repeated characters if you try to do that.

CHARACTER CLASS RANGES

But that’s too much work. You can use .. to specify a range of characters. The literal characters work as well as the hexadecimal values and the names. Notice you don’t quote the literal characters in these ranges:

/ <[a..z]> /
/ <[ \x[61] .. \x[7a] ]> /
/ <[ \c[LATIN SMALL LETTER A] .. \c[LATIN SMALL LETTER Z] ]> /

The range doesn’t have to be the only thing in the square brackets:

/ <[a..z 123456789]> /

You could have two ranges:

/ <[a..z 1..9]> /

NEGATED CHARACTER CLASSES

Sometimes it’s easier to specify the characters that can’t match. You can create a negated character class by adding a - between the opening angle bracket and the opening square bracket. This example matches any character that is not a, b, or 3:

/ <-[ab3]> /

Space inside a character class is also insignificant:

/ <-[ a b 3 ]> /

You can use a negated character class of one character. Quotes inside the character class are literal characters because Raku knows you aren’t quoting:

/ <-[ ' ]>  /   # not a quote character

This one matches any character that is not a newline:

/ <-[ \n ]> /   # not a newline

The predefined character class shortcuts can be part of your character class:

/ <-[ \d \s ]> /   # neither a digit nor whitespace

Like the Unicode properties, you can combine sets of characters:

/ <[abc] + [xyz]> /    # but, also <[abcxyz]>

/ <[a..z] - [ijk]> /   # easier than two ranges
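
The same set operations work on your own character classes. In this sketch the sample Strs are made up, and .comb (covered later in the chapter) is only there to show which characters each class picks out:

```raku
# Lowercase letters minus the vowels
my $consonant = rx/ <[a..z] - [aeiou]> /;
put 'hamadryas'.comb($consonant).join;                  # hmdrys

# Negated class: drop the digits and the whitespace
put 'Hama dryas 42'.comb( / <-[ \d \s ]> / ).join;      # Hamadryas

# Union of two ranges
put 'hamadryas'.comb( / <[a..c] + [x..z]> / ).join;     # aaya
```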

EXERCISE 15.4 Create a program to output all the input lines. Skip any line that contains a letter unless it’s a vowel. Also skip any lines that are blank (that is, only have whitespace).

Matching Adverbs

You can change how the match operator works by applying adverbs, just like you changed how Q worked in Chapter 4. There are several, but you’ll only see the most commonly used here.

Matching Either Case

So far a character in your pattern matches exactly the same character in the target. An H only matches an uppercase H and not any other sort of H:

my $pattern = rx/ Hama /;
put 'Hamadryas' ~~ $pattern;  # Matches

Change your pattern by one character. Instead of an uppercase H, use a lowercase one:

my $pattern = rx/ hama /;
put 'Hamadryas' ~~ $pattern;  # Misses because h is not H

The pattern is case sensitive, so this doesn’t match. But you can make it case insensitive with an adverb. The :i adverb makes the literal alphabetic characters match either case. You can put the adverb right after the rx or the m:

my $pattern = rx:i/ hama /;
put 'Hamadryas' ~~ $pattern;  # Matches, :i outside

This is the reason you can’t use the colon as the delimiter!

When you use an adverb on the outside of the pattern, that adverb applies to the entire pattern. You can also put the adverb on the inside of the pattern:

my $pattern = rx/ :i hama /;
put 'Hamadryas' ~~ $pattern;  # Matches, :i inside

Isn’t that interesting? Now you start to see why whitespace isn’t counted as part of the pattern. There’s much more going on besides literal matching of characters.

The adverb applies from the point of its insertion to the end of the pattern. In this case it applies to the entire pattern because the :i is at the beginning. Put that adverb later in the pattern, and it applies from there to the rest of the pattern. Here the ha matches only lowercase because the adverb shows up later. The rest of the pattern after the :i is case insensitive:

my $pattern = rx/ ha :i ma /; # final ma case insensitive
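
A quick sketch makes the scope of the adverb visible. The candidates are made up to vary the case on each side of the :i:

```raku
my $pattern = rx/ ha :i ma /;

for <hama haMA Hama HAMA> -> $candidate {
    put $candidate, ' ', ($candidate ~~ $pattern ?? 'matches' !! 'misses');
    }
```

Only hama and haMA match, because the ha before the adverb must still be lowercase.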

You can group parts of patterns with square brackets. This example groups the am but doesn’t do much else because there’s nothing else special going on:

my $pattern = rx/ h [ am ] a /;

An adverb inside a group applies only to that group:

my $pattern = rx/ h [ :i am ] a /;

The rules are the same: the adverb applies from the point of its insertion to the end of the group:

my $pattern = rx/ h [ a :i m ] a /; # matches haMa or hama

At this point, you’re probably going to start mixing up what’s going on. There’s another reason whitespace doesn’t matter—you can add comments to your pattern:

my $pattern = rx/
    h
    [       # group this next part
        a
        :i   # case insensitive to end of group
        m
    ]       # end of group
    a
    /;
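
To check that the adverb really stops at the end of the group, this sketch runs the same grouped pattern against made-up candidates that capitalize one letter at a time:

```raku
my $pattern = rx/ h [ a :i m ] a /;

for <hama haMa hAma hamA> -> $candidate {
    put $candidate, ' ', ($candidate ~~ $pattern ?? 'matches' !! 'misses');
    }
```

Only the m is case insensitive, so hama and haMa match while hAma and hamA miss.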

Everything from the # character to the end of the line is a comment. You can use embedded comments too:

my $pattern = rx/
    :i #`( case insensitive ) Hama
    /;

These aren’t particularly good comments because you’re annotating what the syntax already denotes. As a matter of good practice, you should comment what you are trying to match rather than what the syntax does. However, the world isn’t going to end if you leave a reminder for yourself of what a new concept does.

EXERCISE 15.5 Write a program that outputs only the lines of input that contain the text ei. You’ll probably want to save this program to build on in later exercises.

Ignoring Marks

The :ignoremark adverb changes the pattern so that accents and other marks don’t matter. The marks can be there or not. It works if the marks are in the target or the pattern:

$_ = 'húdié';   # pinyin for "butterfly"
put m/ hudie /            ?? 'Matched' !! 'Missed';  # Missed
put m:ignoremark/ hudie / ?? 'Matched' !! 'Missed';  # Matched

$_ = 'hudie';
put m:ignoremark/ húdié / ?? 'Matched' !! 'Missed';  # Matched

It even works if both the target and the pattern have different marks in the same positions:

$_ = 'hüdiê';
put m:ignoremark/ húdié / ?? 'Matched' !! 'Missed';  # Matched

Some adverbs can show up inside the pattern. They apply to the parts of the pattern that come after them:

$_ = 'hüdiê';
put m/ :ignoremark hudie / ?? 'Matched' !! 'Unmatched';  # Matched

Global Matches

A pattern might be able to match several times in the same text. The :global adverb gets all of the nonoverlapping Matches. It returns a List:

$_ = 'Hamadryas perlicus';
my $matches = m:global/ . s /;
say $matches;   # (｢as｣ ｢us｣)

No matches gets you an empty List:

$_ = 'Hamadryas perlicus';
my $matches = m:global/ six /;
say $matches;   # ()

The match operator can find overlapping matches too. Use :overlap to return a potentially longer list. The ｢uta｣ and ｢ani｣ here both match the same a:

$_ = 'Bhutanitis thaidina';

my $global = m:global/ <[aeiou]> <-[aeiou]> <[aeiou]> /;
say $global;  # (｢uta｣ ｢iti｣ ｢idi｣)

my $overlap = m:overlap/ <[aeiou]> <-[aeiou]> <[aeiou]> /;
say $overlap; # (｢uta｣ ｢ani｣ ｢iti｣ ｢idi｣ ｢ina｣)
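
Since :global gives you a List of Match objects, you can walk through them like any other list. This sketch also uses .from, a Match method this chapter doesn't otherwise cover, to show where each match starts:

```raku
$_ = 'Hamadryas perlicus';

my @matches = m:global/ . s /;
put 'Found ', @matches.elems, ' matches';   # Found 2 matches

for @matches -> $match {
    put $match.from, ': ', ~$match;          # 7: as, then 16: us
    }
```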

Things That Use Patterns

There are many features that you haven’t been able to use so far because you hadn’t seen regexes yet. Now you’ve seen regexes, so you can see these things. There are a couple of Str methods that work with a pattern to transform values. This section is a taste of the features you’ll use most often.

The .words and .comb methods break up text. The .split method is the general case of that. It takes a pattern to decide how to break up the text. Whatever it matches are the parts that disappear. You could break up a line on tabs, for instance:

my @words = $line.split: / \t /;

.grep can use the match operator to select things. If the match operator succeeds it returns something that’s True, and that element is part of the result:

my @words-with-e = @words.grep: /:i e/;

Or, to put it all together:

my @words-with-e = $line.split( / \t / ).grep( /:i e/ );

.split can specify multiple possible separators. Not all of them need to be patterns. This breaks up a line on a literal comma or whitespace:

my @words-with-e = $line
    .split( [ ',', / \s / ] )
    .grep( /:i e/ );

.comb does a job similar to .split, but it breaks up the text by keeping the parts that matched. This keeps all the nonoverlapping groups of three digits and discards everything else:

my @digits = $line.comb: /\d\d\d/;

With no argument .comb uses the pattern of the single . to match any character. This breaks up a Str into its characters without discarding anything:

my @characters = $line.comb: /./;  # the same result as $line.comb
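
Here is a compact sketch of .split, .grep, and .comb working on made-up sample data (a tab-separated line and a string with some digits in it):

```raku
my $line = "Hamadryas\tperlicus\t42";

# .split throws away what the pattern matches (the tabs)
my @fields = $line.split: / \t /;
put @fields.elems;                                   # 3

# .grep keeps the elements where the pattern matches
my @with-e = @fields.grep: /:i e/;
put @with-e;                                         # perlicus

# .comb keeps what the pattern matches
put 'tel 555 867 5309'.comb( /\d\d\d/ ).join: '-';   # 555-867-530
```

The last line shows the nonoverlapping behavior: after 530 only a single 9 is left, so there's no fourth group.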

Substitutions

The .subst method works with a pattern to substitute the matched text with other text:

my $line = "This is PERL 6";
put $line.subst: /PERL/, 'Perl';  # This is Perl 6

This one makes the substitution for the first match:

my $line = "PERL PERL PERL";
put $line.subst: /PERL/, 'Perl';  # Perl PERL PERL

Use the :g adverb to make all possible substitutions:

my $line = "PERL PERL PERL";
put $line.subst: /PERL/, 'Perl', :g;  # Perl Perl Perl

Each of these returns the modified Str and leaves the original alone. Use .subst-mutate to change the original value:

my $line = "PERL PERL PERL";
$line.subst-mutate: /PERL/, 'Perl', :g;
put $line;  # Perl Perl Perl
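
The pattern adverbs you saw earlier work inside the pattern you give to .subst too. As a sketch with a made-up $line, this combines :i inside the pattern with :g on the method, and then uses the .= method-call assignment as another way to change the original:

```raku
my $line = 'PERL perl Perl';

# :i applies to the pattern; :g applies to the substitution
my $fixed = $line.subst: /:i perl/, 'Raku', :g;
put $fixed;   # Raku Raku Raku
put $line;    # PERL perl Perl (the original is unchanged)

# .= calls the method and assigns the result back to $line
$line .= subst( /PERL/, 'Perl' );
put $line;    # Perl perl Perl
```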

These will be much more useful with the regex features you’ll see in the next chapter.

EXERCISE 15.6 Using .split, output the third column of a tab-delimited file. The butterfly census file you made at the end of Chapter 9 would do nicely here.

Summary

You haven’t seen the full power of regexes in this chapter since it was mostly about the mechanism of applying the patterns to text. That’s not a big deal—the patterns can be much more sophisticated, but the mechanisms are the same. In the next chapter you’ll see most of the fancier features you’ll regularly use.
