
> It’s a bit complicated to set up […] it’s really fast

Sigh, just use Perl. Writing the code with the general-purpose regex engine took me only one minute of effort, yet it already runs nearly 500× faster than codeplea's optimised special-purpose code.

Why 100000 loops and not 10 as in the original code? With fewer loops, Benchmark.pm shows "(warning: too few iterations for a reliable count)".

----

benchmark.php 100000 loops:

    Loaded 3000 keywords to search on a text of 19377 characters.

    Searching with aho corasick...
    time: 329.3522541523
----

benchmark.pl 100000 loops:

    Benchmark: timing 100000 iterations of regex...
    regex: 0.691561 wallclock secs ( 0.69 usr +  0.00 sys =  0.69 CPU) @ 144927.54/s (n=100000)
----

benchmark.pl (fill in the abbreviated ... parts from benchmark_setup.php):

    #!/usr/bin/env perl
    use Benchmark qw(timethese :hireswallclock);
    require Time::HiRes;
    my @needles = qw(
    abandonment abashed abashments abduction ...
    );
    my $haystack = 'unscathed grampus ...
    heroically';
    my $n = join '|', @needles;
    timethese 100000, {
        regex => sub {
            my @found;
            while ($haystack =~ /($n)/cg) {
                push @found, [$1, pos $haystack];
            }
            return @found;
        },
        index => sub {
            my @found;
            for (@needles) {
                my $pos = index $haystack, $_;
                push @found, [$_, $pos] if -1 < $pos;
            }
            return @found;
        }
    };


Is your solution broken for the cases where keywords are prefixes or suffixes of each other? This situation is very common in my use-case. Also, does your solution work if a keyword appears multiple times?
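To make the two questions concrete, here is a minimal sketch in Python rather than Perl. It assumes that Python's `re` alternation behaves like Perl's on this point: alternatives are tried in order and matches never overlap. The needles, the haystack, and all function names are made up for illustration, and it records match start positions, not the end positions that `pos` gives in the Perl script.

```python
import re


def regex_scan(needles, haystack):
    """One left-to-right pass with an ordered alternation, like /(a|b|c)/g."""
    pattern = re.compile("|".join(re.escape(n) for n in needles))
    return [(m.group(0), m.start()) for m in pattern.finditer(haystack)]


def first_occurrence_scan(needles, haystack):
    """One lookup per needle, like the index() variant: first hit only."""
    found = []
    for needle in needles:
        pos = haystack.find(needle)
        if pos != -1:
            found.append((needle, pos))
    return found


def all_occurrences(needles, haystack):
    """Reference answer: every occurrence of every needle, overlaps included."""
    found = []
    for needle in needles:
        pos = haystack.find(needle)
        while pos != -1:
            found.append((needle, pos))
            pos = haystack.find(needle, pos + 1)
    return sorted(found, key=lambda hit: (hit[1], hit[0]))


# Hypothetical data: "abash" is a prefix of "abashed", "shed" overlaps both.
needles = ["abash", "abashed", "shed"]
haystack = "abashed and abashed"

print(regex_scan(needles, haystack))             # 2 hits, only "abash"
print(first_occurrence_scan(needles, haystack))  # 3 hits, one per needle
print(all_occurrences(needles, haystack))        # 6 hits
```

On this input the alternation reports only `abash` (twice), because the first alternative wins and the scan resumes after each match. The per-needle lookup reports each keyword once and misses the repeats. Neither matches the full list of six hits that a multi-pattern matcher such as Aho-Corasick is meant to produce.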

I get what you're saying, but it's not quite as easy as you imply.

Pulling in an entire programming language is a much bigger dependency and maintenance cost than spending a couple hours writing an algorithm. It would make more sense to just use a C extension.

I did try PHP's regex. It was much, much slower.


You are right, the solution is broken. I can't make it work, so I take back what I said.

I learnt something valuable, thank you for that.



