Intro Into Raku Regexes and Grammars

tpm-regex.raku.party

STATUS QUO: PCRE

"Perl-Compatible Regular Expressions" are quite cryptic, yet many languages just blindly follow the status quo.

Was it <?!, <=?!, or <!#?@$%^( ?

/(?<!foo)bar(?=baz)/

BETTER REGEX SYNTAX

Raku is not afraid to reject the status quo.

/<!after foo> bar <before baz>/

WHITESPACE CAN BE USED FREELY

Literal strings: alphanumeric characters can be used as is. For anything else, just quote it or backslash it:

say so "I ♥ Raku" ~~ /I \♥ Raku/;    # False
say so "I ♥ Raku" ~~ / 'I ♥ Raku' /; # True
say so "I ♥ Raku" ~~ /
    I #`(BTW, you can use inline,) " ♥ "
    "Raku" # as well as end-of-line comments
/; # True

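VARIABLES AS LITERAL TEXT

By default, a variable's content is matched as literal text. Put the variable in angle brackets to have it interpreted as a regex.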

my $stuff := 'the.+stuff';
say so "the.+stuff"      ~~ / $stuff /; # True
say so "the other stuff" ~~ / $stuff /; # False
say so "the other stuff" ~~ /<$stuff>/; # True

SQUARE BRACKETS

Square brackets are used for non-capturing grouping:

say "I really love Raku" ~~ /
    I \s+ [really \s+]? 'love Raku'
/;  # OUTPUT: «｢I really love Raku｣␤»

PARENTHESES

Parentheses are still used for capturing groups:

say "I love Raku" ~~ /
    I \s+ (\w+) ' Raku'
/;  # OUTPUT: «｢I love Raku｣
    #             0 => ｢love｣␤»

SAME AS PCRE…

Some things stay the same:

say "I love Raku" ~~ /
    ^ . \s+ \S+ .? \w+ .*? \d* $
/;  # OUTPUT: «｢I love Raku｣␤»

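…EXCEPT

^ and $ have simpler meanings: the start and the end of the string, with no other magic. ^^ and $$ are for the start and end of a line (just before the "\n", which is not included).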

say so "foo\n" ~~ /^ foo $/;    # False
say so "foo\n" ~~ /^ foo \n $/; # True
say so "foo\nbar" ~~ /^ foo $$ \n ^^ bar $/; # True

HMM…

So… \s+ hell?

/^ I \s+ truly \s+ madly \s+ deeply
    \s+ love  \s+ Raku \n  $/

SIGSPACE

No: use :sigspace (:s for short). In grammars, declare a rule instead of a token.

/^ :s I truly madly deeply love Raku $/

SIGSPACE automatically places a <.ws> token after the terms in the regex. By default, <.ws> matches zero or more whitespace characters, as long as it is not inside a word (<!ww> \s*):

say so "This is neat!" ~~ /
  :s  This    is     neat \!
/; # True

say so "This is neat!" ~~ /
  :s  T h i s i s n e a t \!
/; # False

CHARACTER CLASSES

Same as before, [...], except you also add angle brackets (using angle brackets for special things is a common pattern).

say "I love Raku" ~~ /<[a..zIR\d\s]>+/;
# OUTPUT: «｢I love Raku｣␤»

Ranges use .. instead of -.

To negate a character class, put a minus sign before the square bracket:

say "I love Raku" ~~ /<-[I]> <-[A..Z]>+ $/;
# OUTPUT: «｢Raku｣»

You can create custom classes by adding or subtracting existing classes. To add something, use a plus sign instead of a minus sign:

say "Awesome Raku" ~~ /<[\w]-[a..z]+[aku]>**4/;
# OUTPUT: «｢Raku｣»

To match Unicode properties, use a colon inside the angle brackets:

say "Я люблю Raku" ~~ /<:Script('Latin')>+/
# OUTPUT: «｢Raku｣»

You can negate Unicode properties, or mix and match them with other character classes:

say "Я люблю Raku" ~~ /<-:Script('Latin')>+/
# OUTPUT: «｢Я люблю ｣»

say "Я люблю Raku" ~~ /<:Script('Latin')  +[\d\s]>**4..*/
# OUTPUT: «｢ Raku｣»

QUANTIFIER MODIFIERS

Match things separated by a delimiter:

say "this,is,a,really,neat,feature"
  ~~ /(\w+)**3 % ',' $/

# OUTPUT: «｢really,neat,feature｣
#             0 => ｢really｣
#             0 => ｢neat｣
#             0 => ｢feature｣»

Capture a "word" (\w+); we want three of them (**3), separated by commas (% ','), anchored to the end of the string ($).

ALTERNATIONS

Alternation with longest token matching (LTM) uses |; first-listed-wins alternation uses ||:

say "Perler" ~~ / [Perl |  \w+] /;
# OUTPUT: «｢Perler｣» # this alternative is the longest

say "Perler" ~~ / [Perl || \w+] /;
# OUTPUT: «｢Perl｣»
# this alternative was the first listed in our regex

VARIABLE CONTENTS

A list is interpreted as a list of alternatives to match:

my @stuff := <foo bar ber>;
say so "thefoo" ~~ /the @stuff/;      # True
say so "thefoo" ~~ /< foo bar ber >/; # True

# this is equivalent to the last regex above:
say so "thefoo" ~~ /[ foo | bar | ber ]/;

CONJUNCTIONS

&& tests that both regexes match the same part of the string:

say "Perl ۶ or Perl 6" ~~ / Perl \s+ \d /;
# OUTPUT: «｢Perl ۶｣␤»

say "Perl ۶ or Perl 6" ~~ /
    Perl \s+ [\d && <:Block('Basic Latin')>]
/; # OUTPUT: «｢Perl 6｣␤»

NAMED, STANDALONE REGEXES

Use its name inside angle brackets:

my regex quoted { \" <( <-["]>+ )> \" }

say 'I love "Perl" and "Raku"'
    ~~ /<quoted> && .+ R .+/;

# OUTPUT: «｢"Raku"｣
#            quoted => ｢Raku｣␤»

<( is the left match marker and )> is the right match marker. They control what ends up in the match. (<( is like \K in Perl 5.)
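
A minimal sketch of the match markers on their own. The input string and the variable name $m are made up for illustration and are not from the talk:

```raku
# Everything in the pattern must match, but only the part
# between <( and )> is kept as the resulting match.
my $m = 'price: 42 USD' ~~ / 'price: ' <( \d+ )> ' USD' /;

say ~$m;      # 42
say $m.from;  # 7 (the match starts after the 7-character prefix)
```

The surrounding text still has to be present for the match to succeed; it is just excluded from the result.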

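NAMED CAPTURES

Use a dollar sign, followed by the name in angle brackets and an equals sign. The next term gets captured (use square brackets to group multiple terms into one capture).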

say '2018-07-26T19:00:00-04:00' ~~ /
    $<year>=\d**4 '-' $<month>=\d**2 '-' $<day>=[\d\d]
  T $<time>=[\d\d]**3 % ':'
/;

say "TPM is at $<time> on $<year month day>.join('.')"
# OUTPUT: «TPM is at 19:00:00 on 2018.07.26␤»

Outside of a regex, $<...> is a shortcut for $/<...>. $/ is the default variable that stores the last match, but you can assign the match to any variable:

my $m := 'I really love Raku' ~~ / :s
  (I|You) .+  $<how>=< love like >
  $<what>=.+
/;

say "$0 $<how> $<what>";      # OUTPUT: «I love Raku␤»
say "$/[0] $/<how> $/<what>"; # OUTPUT: «I love Raku␤»
say "$m[0] $m<how> $m<what>"; # OUTPUT: «I love Raku␤»

A few modifiers for named captures:

<foo>       # match regex `foo` and capture under `foo`
<.foo>      # match, but don't capture
<bar=.foo>  # match, but name capture `bar` (in grammars)
<bar=.&foo> # same as above, but used for standalone regexes
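
A sketch of how the first three forms differ, using a throwaway grammar. Words, word, and last are names invented for this example:

```raku
grammar Words {
    # 1st word: captured as `word`
    # 2nd word: matched but not captured
    # 3rd word: captured under the alias `last`
    token TOP  { <word> ' ' <.word> ' ' <last=.word> }
    token word { \w+ }
}

my $m = Words.parse('one two three');
say ~$m<word>;                     # one
say ~$m<last>;                     # three
say $m.hash.keys.sort.join(', ');  # last, word
```

The middle word is consumed by the parse, yet leaves no entry in the Match object.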

OVERLOAD?

Is your brain melting from all the new information?

P5 MODE

Use the :P5 adverb on a regex to enable Perl 5 (PCRE+) regex mode:

say "barbarbaz" ~~ m:P5/(?<!foo)bar(?=baz)/;
# OUTPUT: «｢bar｣␤»

say "foo bar" ~~ m:P5/foo ([a-r]+)/;
# OUTPUT: «｢foo bar｣
#            0 => ｢bar｣␤»

Most of Perl 5.10's features are available. Handy training wheels while you learn Raku regexes, but rarely used by Raku users.

COMPLEX REGEXES

Let's break up our date-time matching regex:

my regex date {
    $<year>=\d**4 '-' $<month>=\d**2 '-' $<day>=[\d\d]
}
my regex time { $<clock>=[\d\d]**3 % ':' $<tz>=.+ }
my regex date-time { <date> T <time> }

Matching against it produces a tree of Match objects containing all of our named captures:

say '2018-07-26T19:00:00-04:00' ~~ &date-time;

# OUTPUT: «｢2018-07-26T19:00:00-04:00｣
#             date => ｢2018-07-26｣
#                year => ｢2018｣
#                month => ｢07｣
#                day => ｢26｣
#             time => ｢19:00:00-04:00｣
#                clock => ｢19:00:00｣
#                tz => ｢-04:00｣␤»

We can access individual pieces as if it were a nested hash:

'2018-07-26T19:00:00-04:00' ~~ &date-time;
say "In $<date><year> TPM had a meeting at $<time><clock>";

# OUTPUT: «In 2018 TPM had a meeting at 19:00:00␤»

GRAMMARS ARE JUST LIKE CLASSES

Standalone regexes are like standalone methods:

my regex  re { ...               }
my method me { self.substr: 0, 3 }

say 'foobar' ~~ &re; # OUTPUT: «｢foo｣␤»
say 'foobar' ~~ &me; # OUTPUT: «foo␤»

Standalone methods are just fancy subs. The first argument becomes the invocant.
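
The "fancy sub" claim can be checked directly. A small sketch, where the method name shout is made up for illustration:

```raku
# A lexical method, declared outside of any class.
my method shout { self.uc ~ '!' }

say shout('hello');   # HELLO!  (called like a sub; 'hello' becomes self)
say 'hello'.&shout;   # HELLO!  (called with method-like syntax)
```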

Methods live in classes. Regexes live in grammars.

grammar Re { regex  re { ...                 } }
class   Me { method me (\v) { v.substr: 0, 3 } }

say Re.subparse: 'foobar', :rule<re>; # OUTPUT: «｢foo｣␤»
say Me.me:       'foobar';            # OUTPUT: «foo␤»

By default, a grammar starts parsing at TOP; here, we override that with the re regex.

You can subclass grammars and mix roles into them…

grammar GDate {
    regex TOP {
        $<year>=\d**4 '-' $<month>=\d**2 '-' $<day>=[\d\d]
    }
}
role GDateTime is GDate {
    regex time { $<clock>=[\d\d]**3 % ':' $<tz>=.+ }
    regex date-time { <date=.GDate::TOP> [T <time>]? }
}

…and even define regular methods:

grammar TPM does GDateTime {
    regex TOP { <date-time> }
    method when (\date) {
        self.parse: date;
        say "TPM had a meeting at "
          ~ $<date-time><time><clock>
    }
}
TPM.when: '2018-07-26T19:00:00-04:00'
# OUTPUT: «TPM had a meeting at 19:00:00␤»

OTHER REGEX TYPES

Along with regex, you can also use rule and token:

grammar {
    regex TOP  { … }
    rule  date { … }
    token time { … }
}

token - like regex, but with :ratchet (short form: :r) turned on (no backtracking)
rule - like token, but with :sigspace (short form: :s) turned on
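
A short sketch of what the two adverbs change in practice. The names tight and spaced and the sample strings are illustrative assumptions, not part of the talk:

```raku
# Without ratchet, \w+ gives back the final "c" so that . can match it.
say so 'abc' ~~ /    \w+ . /;     # True
# With :ratchet (what token turns on), \w+ keeps everything it took.
say so 'abc' ~~ / :r \w+ . /;     # False

my token tight  { a b }   # whitespace in the pattern is ignored
my rule  spaced { a b }   # whitespace in the pattern becomes <.ws>

say so 'a b' ~~ /^ <tight>  $/;   # False
say so 'a b' ~~ /^ <spaced> $/;   # True
```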

You can turn these adverbs off locally:

rule date { # :ratchet and :sigspace here
    [ :!ratchet # no ratchet
        [
            :ratchet # ratchet again
        ]
        # no ratchet here
        [ :!sigspace
            # no ratchet and no sigspace here
        ]
    ]
}

LET'S PARSE SOMETHING

Build a grammar to parse this data:

[Grammars Talk]
    name: Zoffix
    lang: Raku
    topic: grammars and regexes
    length: 80
[Perf Talk]
    name: Zoffix
    lang: Raku
    topic: performance
    length: 30

grammar TPM {
    token key    { <-[:\n]>+         }
    token value  { <-[\]\n]>+        }
    rule row     { <key> ':' <value> }
    rule header  { '[' ~ ']' <value> }
    rule section { <header> <row>+   }
    rule TOP     { <section>+        }
}

TPM.parse: q:to/END/;
  [Grammars Talk]
      name: Zoffix
      lang: Raku
      topic: grammars and regexes
      length: 80
  [Perf Talk]
      name: Zoffix
      lang: Raku
      topic: performance
      length: 30
  END

my %result;
for $<section> {
    %result{.<header><value>} = .<row>.map({
        ~.<key> => ~.<value>
    }).hash
}
say %result;

# OUTPUT:
# {
#   Grammars Talk => {
#     lang => Raku, length => 80, name => Zoffix,
#     topic => grammars and regexes
#   },
#   Perf Talk => {
#     lang => Raku, length => 30, name => Zoffix,
#     topic => performance
#   }
# }

EWWW?

If we make changes to the tokens, it can be hard to make the matching changes here, especially in large grammars.

my %result;
for $<section> {
    %result{.<header><value>} = .<row>.map({
        ~.<key> => ~.<value>
    }).hash
}
say %result;

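(The Raku core grammar is currently 5,570 lines long)

ACTION CLASSES!

Once a token (rule/regex) has been parsed, the method with the same name in the Actions class is called with that token's Match object as its argument.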

grammar Grammar {
    token TOP   { <stuff> }
    token stuff { …       }
}
class Actions {
    method TOP   ($match) { $match.make: $match<stuff>.made }
    method stuff ($/)     { make 42 }
    # naming param `$/` lets us use some shortcuts
}
Grammar.parse('…', :actions(Actions)).made.say # 42

MAKE/MADE

It's really simple:

class PretendMatch {
    has $!stuff;
    method make($stuff) {
        $!stuff = $stuff
    }
    method made { $!stuff }
}

It's just a way to attach arbitrary data to a Match object and retrieve it later.
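
To see that nothing more is going on, here is a sketch that calls make and made by hand on an ordinary Match, outside of any grammar. The input string is invented for the example:

```raku
my $m = 'length: 80' ~~ / 'length: ' (\d+) /;

# Attach a payload to the Match: the captured digits as a number.
$m.make: +$m[0];

say $m.made;        # 80
say $m.made + 10;   # 90 (the payload is a number, not a Match)
```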

Methods make things for the tokens they're responsible for:

class TPMActions {
    method row     ($/) { make ~$<key> => ~$<value> }
    method header  ($/) { make ~$<value> }
    method section ($/) {
        make $<header>.made => $<row>».made.hash
    }
    method TOP ($/) { make $<section>».made.hash }
}

You don't need to define a method for every token.

Same as before, except now we pass our Actions class via the :actions named argument:

my $match := TPM.parse: q:to/END/, :actions(TPMActions);
    [Grammars Talk]
        name: Zoffix

        […]
    END

dd $match.made

# OUTPUT:
# {
#     Grammars Talk => {
#         lang => Raku, length => 80, name => Zoffix,
#         topic => grammars and regexes
#     },
#     Perf Talk => {
#         lang => Raku, length => 30, name => Zoffix,
#         topic => performance
#     }
# }

Now we can freely change individual tokens and their corresponding action methods without affecting anything else.

USEFUL MODULES

Install [Grammar::Debugger](https://github.com/jnthn/grammar-debugger) (which also includes Grammar::Tracer).

use Grammar::Tracer;
grammar {
    token TOP   { <stuff> }
    token stuff { <some> <other> }
    token some  { abc }
    token other { \d+ }
}.parse: 'abcdef';

Just use one of them…

…and pretty output will show where your grammar fails to match.

GRAMMAR BOOK

Written by core developer moritz++

Available on Amazon and Apress.

Grammars

Day 4 - Parsing with Grammars