Perl: сопоставить с большим массивом - PullRequest
1 голос
/ 29 мая 2020

Допустим, у меня есть $sentence = "The quick brown fox jumps over the lazy dog" и массив @targets = qw(brown red black lazy quick);.

Цель состоит в том, чтобы найти все слова в @targets, которые содержатся в $sentence, в данном случае это три слова : brown lazy quick.

Примечание. Массив @targets может быть очень большим, например, в нем может быть 10 000 элементов. И может быть 10 000 $sentence s, с которыми нужно справиться.

Есть ли эффективный и элегантный способ сделать это?

Ответы [ 5 ]

2 голосов
/ 29 мая 2020

Я быстро сравнил методы, представленные в других ответах:

use feature qw(say);
use strict;
use warnings;
use Algorithm::AhoCorasick::XS;
use Benchmark qw(cmpthese);
use Regexp::Assemble;

{
    my @words = ("barbarous", "blandit", "celery", "gravida", "pencil", "move", "rain",
            "diligent", "letter", "staking", "enjoy", "salty", "abnormal", "volatile",
            "scared", "unsightly", "curabitor", "possible", "morbi", "open", "road");

    my $str = <<'END';
Lorem ipsum dolor sit amet consectetur adipiscing elit Vivamus tempor nulla condimentum ex posuere sed vulputate purus laoreet Etiam ut justo pulvinar lobortis leo in dictum ligula Pellentesque porta et justo a vestibulum Fusce et nulla vitae nisi scelerisque iaculis Donec non mi ligula Donec et ullamcorper eros Duis tincidunt ante dui vel vestibulum risus gravida molestie Nam eleifend placerat enim ultricies eleifend Pellentesque dapibus orci quis nulla faucibus ut rhoncus urna interdum Nam lobortis vitae nunc vel bibendum Nulla sed condimentum justo
Donec mauris felis fermentum eu libero sed cursus fermentum nunc Fusce dolor sapien ultrices nec egestas vel aliquam in mi Duis ullamcorper elit ut ornare tincidunt urna velit ultrices est at molestie urna risus faucibus libero Cras nec quam in orci efficitur tristique venenatis sit amet odio Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas Pellentesque pretium metus eu eros vulputate non ornare orci imperdiet Ut pharetra efficitur urna sit amet sodales Donec pharetra eros sit amet dolor ullamcorper aliquet Cras ac ante sit amet erat fermentum consequat eget a turpis Donec a tellus ex Sed suscipit posuere hendrerit Praesent suscipit ante et tortor tempor congue Lorem ipsum dolor sit amet consectetur adipiscing elit Nunc tincidunt erat sit amet venenatis finibus ante massa ornare lacus at euismod tortor neque at velit
Sed ultricies velit ante vel convallis massa accumsan a Curabitur mauris purus tempus eu leo vitae fringilla porttitor augue Pellentesque mattis sed turpis quis viverra Aenean a mauris nec urna sollicitudin bibendum Etiam elementum efficitur lacus nec finibus Morbi ut dolor iaculis porttitor magna quis posuere arcu Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Duis at luctus sem Suspendisse vel erat porttitor lobortis tortor ac suscipit nulla Cras maximus dapibus odio sit amet aliquet Curabitur iaculis odio at aliquet tincidunt Suspendisse potenti Vivamus maximus est id rhoncus consectetur nisi purus rutrum massa nec ullamcorper risus nisi ut metus
Phasellus lobortis mauris turpis ut sollicitudin felis ultrices vel Ut tristique felis gravida est varius rutrum Sed porttitor pharetra sapien ac gravida urna Curabitur quis dignissim turpis Phasellus turpis sapien scelerisque vitae turpis tincidunt faucibus tempus tortor Proin bibendum elementum libero fermentum rhoncus Ut scelerisque sem eget rutrum scelerisque ex felis scelerisque augue vitae elementum lorem enim sit amet erat Sed congue volutpat mi pellentesque dapibus mi tempor quis Mauris nibh lorem mattis in nibh sed lacinia sagittis urna Etiam posuere odio non nisi placerat fermentum Mauris elit massa fermentum quis hendrerit ut condimentum a risus
Cras ut leo rhoncus laoreet turpis quis consectetur sem Fusce gravida neque et mauris molestie fermentum Proin non vehicula tellus vel volutpat est Nullam quis ornare ligula Nunc sed quam orci Nullam enim dolor cursus interdum fermentum sit amet gravida eu ex Aliquam nec porta mi Ut varius ut dolor vel convallis Curabitur massa purus varius nec diam eget sollicitudin finibus nisi Mauris pharetra in odio in rutrum Vestibulum vel eleifend ex Mauris tellus ex tempor ut dolor luctus facilisis pharetra justo
Cras elementum enim a egestas bibendum sapien ligula convallis tellus in dignissim augue massa vel diam Fusce quis mi diam Phasellus mollis est non ultricies dictum Nullam ullamcorper leo quis lacinia scelerisque libero enim faucibus dui placerat sodales dolor lectus sed ligula Nunc interdum justo vitae faucibus faucibus neque dui cursus nunc ut gravida dui urna sed ipsum Ut luctus lacus nulla id imperdiet elit fermentum sit amet Mauris nulla erat gravida ac enim ut blandit blandit felis Ut feugiat consequat vulputate Mauris erat neque vehicula ac purus ut tristique fringilla nulla
END

    cmpthese(
        -1,
        {
            regexp_assemble => sub { regexp_assemble( $str, \@words ) },
            aho_corasick    => sub { aho_corasick( $str, \@words ) },
            pp_trie         => sub { pp_trie( $str, \@words ) },
            hash_lookup     => sub { hash_lookup( $str, \@words ) },
        }
    );
}

sub hash_lookup {
    my ( $str, $targets ) = @_;

    my %targets = map { $_ => 1 } @$targets;
    my @words = split ' ', $str;
    my @matches;
    foreach my $each_word ( @words ){
        push @matches, $each_word if $targets{$each_word};
    }
    return \@matches;
}

sub pp_trie {
    my ( $str, $targets ) = @_;

    my $filter    = join '|', map {quotemeta} @$targets;
    my $re_filter = qr/\b($filter)\b/;
    my @matches = $str =~ /$re_filter/g;
    return \@matches;
}

sub aho_corasick {
    my ( $str, $targets ) = @_;

    my $matcher = Algorithm::AhoCorasick::XS->new($targets);
    my @matches = $matcher->matches($str);
    return \@matches;
}

sub regexp_assemble {
    my ( $str, $targets ) = @_;
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @$targets;
    my $regex = $ra->re;
    my @matches = $str =~ /\b($regex)\b/g;
    return \@matches;
}

Вывод :

                   Rate regexp_assemble   hash_lookup       pp_trie aho_corasick
regexp_assemble  1935/s              --          -86%          -90%         -90%
hash_lookup     14221/s            635%            --          -24%         -28%
pp_trie         18618/s            862%           31%            --          -5%
aho_corasick    19651/s            916%           38%            6%           --

Итак, для этой настройки aho_corasick был самым быстрым.

2 голосов
/ 29 мая 2020

Посмотрите, поможет ли вам этот c logi:

#!/usr/bin/perl

use strict;
use warnings;

my $sentence = "The quick brown fox jumps over the lazy dog";

my @targets = qw/brown red black lazy quick/;
my %targets = map { $_ => 1 } @targets;

my @words = split(' ', $sentence);

foreach my $each_word ( @words ){
    print $each_word."\n" if $targets{$each_word};
}

Вывод :

quick
brown
lazy
2 голосов
/ 29 мая 2020

Использование Algorithm :: AhoCorasick :: XS , который выполняет очень эффективный поиск нескольких строк:

#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;

my @targets = qw(brown red black lazy quick);
my $matcher = Algorithm::AhoCorasick::XS->new(\@targets);

my @sentences = (
    "The quick brown fox jumps over the lazy dog",
    "The man in black fled across the desert and the gunslinger followed."
);


for my $sentence (@sentences) {
    my @matches = $matcher->matches($sentence); # Or unique_matches
    say "@matches" if @matches;
}

Пример:

$ perl demo.pl
quick brown lazy
black
1 голос
/ 29 мая 2020

Использование только чистого perl

use strict;
use warnings;
use feature 'say';

my $sentence = "The quick brown fox jumps over the lazy dog";

my @targets   = qw/brown red black lazy quick/;
my $filter    = join('|',@targets);
my $re_filter = qr/\b($filter)\b/;

my @found = $sentence =~ /$re_filter/g;

say for @found;

Вывод

quick
brown
lazy
1 голос
/ 29 мая 2020

Вот пример использования Regexp::Assemble:

use feature qw(say);
use strict;
use warnings;
use Regexp::Assemble;

my $str = "The quick brown fox jumps over the lazy dog";
my @targets = qw(brown lazy quick);
my $ra = Regexp::Assemble->new;
$ra->add($_) for @targets;
my $regex = $ra->re;
my @matches = $str =~ /($regex)/g;
say for @matches;  # optionally add List::Util::uniq if you want the list 
                     to be unique

Вывод :

quick
brown
lazy
...