home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


6.5. Finding the N th Occurrence of a Match

Problem

You want to find the N th match in a string, not just the first one. For example, you'd like to find the word preceding the third occurrence of "fish" :





One fish two fish red fish blue fish



Solution

Use the /g modifier in a while loop, keeping count of matches:

$WANT = 3;
$count = 0;
while (/(\w+)\s+fish\b/gi) {
    if (++$count == $WANT) {
        print "The third fish is a $1 one.\n";
        # Warning: don't `last' out of this loop
    }
}




The third fish is a red one.



Or use a repetition count and repeated pattern like this:

/(?:\w+\s+fish\s+){2}(\w+)\s+fish/i;

Discussion

As explained in the chapter introduction, using the /g modifier in scalar context creates something of a progressive match , useful in while loops. This is commonly used to count the number of times a pattern matches in a string:

# simple way with while loop
$count = 0;
while ($string =~ /PAT/g) {
    $count++;               # or whatever you'd like to do here
}

# same thing with trailing while
$count = 0;
$count++ while $string =~ /PAT/g;

# or with for loop
for ($count = 0; $string =~ /PAT/g; $count++) { }
    
# Similar, but this time count overlapping matches
$count++ while $string =~ /(?=PAT)/g;

To find the N th match, it's easiest to keep your own counter. When you reach the appropriate N, do whatever you care to. A similar technique could be used to find every N th match by checking for multiples of N using the modulus operator. For example, (++$count % 3) == 0 would be every third match.

If this is too much bother, you can always extract all matches and then hunt for the ones you'd like.

$pond  = 'One fish two fish red fish blue fish';

# using a temporary
@colors = ($pond =~ /(\w+)\s+fish\b/gi);      # get all matches
$color  = $colors[2];                         # then the one we want

# or without a temporary array
$color = ( $pond =~ /(\w+)\s+fish\b/gi )[2];  # just grab element 3

print "The third fish in the pond is $color.\n";




The third fish in the pond is red.



Or finding all even-numbered fish:

$count = 0;
$_ = 'One fish two fish red fish blue fish';
@evens = grep { $count++ % 2 == 1 } /(\w+)\s+fish\b/gi;
print "Even numbered fish are @evens.\n";




Even numbered fish are two blue.



For substitution, the replacement value should be a code expression that returns the proper string. Make sure to return the original as a replacement string for the cases you aren't interested in changing. Here we fish out the fourth specimen and turn it into a snack:

$count = 0;
s{
   \b               # makes next \w more efficient
   ( \w+ )          # this is what we'll be changing
   (
     \s+ fish \b
   )
}{
    if (++$count == 4) {
        "sushi" . $2;
    } else {
         $1   . $2;
    }
}gex;




One fish two fish red fish sushi fish



Picking out the last match instead of the first one is a fairly common task. The easiest way is to skip the beginning part greedily. After /.*\b(\w+)\s+fish\b/ , for example, the $1 variable would have the last fish.

Another way to get arbitrary counts is to make a global match in list context to produce all hits, then extract the desired element of that list:

$pond = 'One fish two fish red fish blue fish swim here.';
$color = ( $pond =~ /\b(\w+)\s+fish\b/gi )[-1];
print "Last fish is $color.\n";




Last fish is blue.



If you need to express this same notion of finding the last match in a single pattern without /g , you can do so with the negative lookahead assertion (?!THING) . When you want the last match of arbitrary pattern A, you find A followed by any amount of not A through the end of the string. The general construct is A(?!.*A)*$ , which can be broken up for legibility:

m{
    A               # find some pattern A
    (?!             # mustn't be able to find
        .*          # something
        A           # and A
    )
    $               # through the end of the string
}x

That leaves us with this approach for selecting the last fish:

$pond = 'One fish two fish red fish blue fish swim here.';
if ($pond =~ m{
                    \b  (  \w+) \s+ fish \b
                (?! .* \b fish \b )
            }six )
{
    print "Last fish is $1.\n";
} else {
    print "Failed!\n";
}




Last fish is blue.



This approach has the advantage that it can fit in just one pattern, which makes it suitable for similar situations as shown in Recipe 6.17 . It has its disadvantages, though. It's obviously much harder to read and understand, although once you learn the formula, it's not too bad. But it also runs more slowly though  - around twice as slowly on the data set tested above.

See Also

The behavior of m//g in scalar context is given in the "Regexp Quote-like Operators" section of perlop (1), and in the "Pattern Matching Operators" section of Chapter 2 of Programming Perl ; zero-width positive lookahead assertions are shown in the "Regular Expressions" section of perlre (1), and in the "rules of regular expression matching" section of Chapter 2 of Programming Perl