Perl Grymoire (sed)

Introduction
Command-line perl
The essential substitution operator
The slash as a delimiter
Storing the last successful pattern match
Perl regular expressions
Storing search pattern results
Perl modifiers
- Global replacement modifier
- Ignore case modifier
Separating multiple expressions for readability
Filenames on the command line
perl emulating grep
perl emulating tr
perl scripting
perl in shell scripts
perl version
Restricting processing to ranges of text
- Restricting to a specific line number
- Restricting to a range of lines
- Restricting to a range of patterns
Delete lines with flow control
- Deleting a line using a pattern
- Deleting a range of lines using pattern
- Deleting a line matching a specific line number
- Deleting a range of lines
Appending a line
Inserting a line
Printing a specific line number
Matching a multi-line pattern

Introduction

This post is an educational look at perl as an ad-hoc replacement for sed, and it echoes parts of the sed grymoire. It is not as thorough as the sed grymoire because man perlre and man perlretut already cover perl's regular expressions in depth. It is more of a “How To” for swapping perl in for sed on common problems, and is therefore focused on solving problems with perl’s regular expressions in place of sed. I hope someone finds this useful. This post is the result of a coworker complaining about not having access to GNU sed on AIX 7.2 systems for a script he was writing. Much of this post is copied and slightly modified from the sed grymoire. This was created with respect and love for the sed and awk grymoires.

Command-line perl

Perl can emulate many UNIX tools such as sed and awk, but most people only learn sed and awk. To use perl on the command line, one must understand a handful of options (see man perlrun):

-n: Often used with -e or -E. Causes perl to iterate over each line of a file or stream, effectively wrapping the code in a while loop that reads one line at a time.
-e: Evaluate one line of a program. Multiple -e options can be specified to combine multiple expressions, much like sed.
-E: Same as -e but enables all optional features.
-p: Often combined with -e. It does the same thing as the -n switch and additionally prints every line implicitly after the code has run. This makes -n redundant, and -p overrides -n when both are given. Still, many people such as myself write -pne because I never learn…
-a: Autosplit each line when used with -p or -n and store the fields in the @F array. The split pattern can be specified with -F. As of perl 5.20, -a implicitly enables -n.
-F: Specify the pattern to split on. As of perl 5.20, it implicitly sets both -a and -n.
-l: Enables automatic line-ending processing. When used with -n or -p, it chomps the line terminator off each input line. It also sets the output record separator $\ (to the current value of the input record separator $/ when no argument is given), so print adds the newline back.

The essential substitution operator

Most people only learn GNU sed and GNU awk. These are not as portable as one would think, and one will often run into problems porting shell scripts to systems such as AIX, the BSDs, etc., because those systems ship their own sed and awk tools that may not support the same options and features.

Perl, like sed, has a substitute operator s that replaces text matched by a regular expression pattern with a new (substituted) value. A simple example is changing “day” to “night” on each line of a file:

perl -pe 's/day/night/' file

Or piping a stream:

echo 'day' | perl -pe 's/day/night/'

This will result in the output of “night”

Perl, like sed, changes exactly what one would tell it to. So if one executed:

echo 'Sunday' | perl -pe 's/day/night/'

This would output the word “Sunnight” because perl found the string “day” in the input.

Another important concept is that perl when using the -n or -p option runs a file or stream through a while loop. Suppose the input file:

one two three, one two three

four three two one

one hundred

If one used the command:

perl -pe 's/one/ONE/' file

The output would be:

ONE two three, one two three

four three two ONE

ONE hundred

As with sed, this changed “one” to “ONE” once on each line. The first line had “one” twice, but only the first occurrence was changed. That is the default behavior. If one wanted to replace all instances of “one”, one would have to use the g modifier, which is discussed a bit later.

There are four parts to this substitute operator (see man perlop and man perlre):

s Substitute operator
/../../ Delimiter
one Regular Expression Search Pattern
ONE Replacement string

The search pattern is on the left hand side and the replacement string is on the right hand side.

This covers 90% of the effort needed to learn perl’s substitute operator. With this information one should be able to stop here and replace 99% of GNU sed with perl.

The slash as a delimiter

For those not familiar with sed, the character after the s is the delimiter. It is conventionally a slash, because this is what ed, more, and vi use. It can be anything one wants. If one wants to change a pathname that contains a slash - say /usr/local/bin to /common/bin - one could use the backslash to escape the slash:

perl -pe 's/\/usr\/local\/bin/\/common\/bin/' file

This is what the sed grymoire refers to as a ‘picket fence’ and I have to agree, it’s ugly. It is easier to read if one uses a different character as a delimiter instead, such as an underscore:

perl -pe 's_/usr/local/bin_/common/bin_' file

Some people use colons:

perl -pe 's:/usr/local/bin:/common/bin:' file

Others use the “#” character:

perl -pe 's#/usr/local/bin#/common/bin#' file

As long as it’s not in the string one is looking for, anything goes. Remember that the substitution operator requires three delimiters. If one gets a “Substitution replacement not terminated” error, it’s because a delimiter is missing.

Storing the last successful pattern match

Sometimes one wants to search for a pattern and add some characters, like parentheses, around or near the pattern one has found. It is easy to do this if one is looking for a particular string:

perl -pe 's/abc/(abc)/' file

This won’t work if one doesn’t know exactly what one will find. How can one put the string one matched in the replacement string if one doesn’t know what it is?

The solution requires the special variable $&. It holds the text matched by the last successful pattern match.

perl -pe 's/[a-z]*/($&)/' file

One can have any number of $& variables in the replacement string. One could also double a pattern, e.g. the first number of a line:

$ echo "123 abc" | perl -pe 's/[0-9]*/$& $&/'
123 123 abc

Perl will find the first match on the line and make it as long as possible (greedy). Greediness comes up again later. If one doesn’t want the match to be so greedy, one needs to put restrictions on the pattern.

The first match for '[0-9]*' is at the very start of the line, because the pattern matches zero or more digits and can therefore match the empty string. So if the input was “abc 123” the output would be unchanged (well, except for a space before the letters). A better way to duplicate the number is to make sure the pattern matches at least one digit:

$ echo "123 abc" | perl -pe 's/[0-9][0-9]*/$& $&/'
123 123 abc

The string “abc” is unchanged, because it was not matched by the regular expression. If one wanted to eliminate “abc” from the output, one must expand the regular expression to match the rest of the line and explicitly keep only part of the match using “(”, “)” and $1, which is the next topic.

Perl regular expressions

There is another way to write the above script. “[0-9]*” matches zero or more digits. “[0-9][0-9]*” matches one or more digits. The other way to do this is to use the “+” character and write the pattern as “[0-9]+”, because “+” is a special meta-character in perl regular expressions. Perl regular expressions are a lot more powerful than the POSIX basic or extended regular expressions available in GNU sed and GNU awk. In fact, they are so powerful that many languages implement support for perl-compatible regular expressions. See man perlre for more information about perl regular expressions.

Storing search pattern results

Parentheses remember a substring of the characters matched by the regular expression. One can use this to keep only part of the characters matched by the regular expression. The $1 is the first remembered pattern, and the $2 is the second remembered pattern.

If one wanted to keep the first word of a line, and delete the rest of the line, mark the important part with the parentheses:

perl -pe 's/(\w*).*/$1/'

This can also be achieved without regular expressions:

perl -lane 'print $F[0]'

Regular expressions are greedy, and try to match as much as possible. “\w*” matches zero or more word characters (letters, digits, and underscore) and grabs as many contiguous ones as it can. The “.*” matches zero or more characters after the first match. Since the first one grabs all of the contiguous word characters (greedy!), the second matches anything else. Note that \w includes digits, so to keep only the leading letters one must use a narrower class such as “[a-z]*”. Therefore if one types:

echo abcd123 | perl -pe 's/([a-z]*).*/$1/'

This will output “abcd” and delete the numbers.

If one wants to switch two words around, one can remember two patterns and change the order around:

perl -pe 's/(\w*) (\w*)/$2 $1/'

Note the space between the two remembered patterns. This is used to make sure two words are found. This can also be achieved without regular expressions using awk-like syntax (note that @F is indexed from zero and that this prints only the first two fields):

perl -lane 'print "$F[1] $F[0]"'

The remembered pattern doesn’t have to be used only in the replacement string (the right hand side). It can also be referenced in the pattern one is searching for (the left hand side), where perl spells the backreference \1 rather than $1. If one wants to eliminate duplicated words one can try something like:

perl -pe 's/\b([a-z]+) \1\b/$1/' # This will only fix the first pair of duplicated words on a line

If one only wants to detect lines whose first two words are identical, one can use a non-regular expression solution:

perl -lane 'print if $F[0] eq $F[1]'

Perl modifiers

One can add additional flags after the last delimiter to modify perl regular expression behaviour. See man perlre for more information.

Global replacement modifier

Most UNIX utilities work on files, reading a line at a time. Perl, by default, is the same way. If one tells it to change a word, it will only change the first occurrence of the word on a line. One may want to make the change on every word on the line instead of just the first. As with sed, we can substitute globally using the g modifier.

echo "Hello world. Hello universe." | perl -pe 's/Hello/Goodbye/g'

In this example, both instances of the word “Hello” are replaced with “Goodbye”.

Ignore case modifier

The i modifier makes the pattern match case-insensitive. This will match abc, aBc, ABC, AbC, etc.:

perl -ne 'print if /abc/i' file

Separating multiple expressions for readability

One method of combining multiple expressions is to use a -e (or -E to enable all features) before each command:

# Note the semi-colon. It's required to separate the two statements.
perl -p -e 's/a/A/;' -e 's/b/B/' file

The same can be achieved by separating expressions with semi-colons in a single statement:

perl -pe 's/a/A/;s/b/B/' file

Filenames on the command line

One can specify files on the command line if one wishes. Any argument to perl (when using the -p, -n, or -a options) that follows the options and the -e code is assumed to be a filename. This next example will count the number of lines in three files that don’t begin with a “#”:

# Note semi-colons required to separate statements.
perl -ne '$count+=1 unless /^#/;' -e 'END{print $count,"\n";}' file1 file2 file3

Let’s break this down into pieces. First, we specify the -ne options as we don’t want the lines implicitly printed to STDOUT as with -p. BEGIN and END are special code blocks in perl. Code in a BEGIN block runs before any input is processed, and code in an END block runs after all input has been processed.

-ne '$count+=1 unless /^#/;' we increase a variable $count by 1 if the line does not start with the # character
-e 'END{print $count,"\n";}' we print the $count variable at the end, once all lines have been processed

Of course one could write the last example with GNU grep and GNU wc:

grep -hv '^#' file1 file2 file3 | wc -l # not portable

And of course one can replace grep with perl:

perl -ne 'print unless /^#/' file1 file2 file3 | wc -l

And lastly a very programmatic perl example that won’t be explained here:

# There's more than one way to do it!
perl -e 'print scalar(grep !/^#/, <>),"\n";' file1 file2 file3

perl emulating grep

perl can easily emulate the behaviour of grep. GNU grep also supports perl-compatible regular expressions through its -P option (I’m not sure in which version this feature was added); however, the manual states that they are currently experimental.

# simple pattern match
perl -ne 'print if /pattern/' file
grep 'pattern' file

# match multiple patterns
perl -ne 'print if /pattern1|pattern2/' file
grep '\(pattern1\|pattern2\)' file # notice the annoying backslashes...

# match inverse of a pattern
perl -ne 'print unless /pattern/' file
grep -v 'pattern' file

perl emulating tr

The tr operator can emulate the UNIX tr command. For instance, one could change all lowercase characters to uppercase. Unlike the tr command, the perl operator does not need brackets around the ranges:

perl -pe 'tr/a-z/A-Z/' file1

And of course, there’s more than one way to do it:

perl -pe '$_ = uc' file1

perl scripting

Perl is way more powerful than tools such as sed or awk and as such, perl scripting is outside the scope of this post as it’s meant to cover command-line perl as a drop-in replacement for GNU sed only. See man perlintro for an introduction to perl scripting.

perl in shell scripts

Perl can easily be used in place of many unix tools in shell scripts, subbing for sed, awk, grep, etc in a pinch. It’s also way more portable than those tools.

perl version

Print the version of perl:

perl -v

Restricting processing to ranges of text

To restrict operations to specific ranges in perl, one will often see the “..” range operator used in conjunction with the special “$.” variable that tracks the current line number.

Restricting to a specific line number

If one wanted to delete the first digit on line 3 of a file, simply do the following:

perl -pe 's/[0-9]// if $. == 3' file

Restricting to a range of lines

If one wanted to delete the first digit on lines 3 through 5 in a file, one would use:

perl -pe 's/[0-9]// if $. == 3 .. $. == 5' file

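One can also specify multiple ranges to work on using the “||” operator. Because “||” binds more tightly than “..”, each range must be wrapped in parentheses (or the lower-precedence “or” used instead). This would operate on the range of 3 through 5 as well as 8 through 10:

perl -pe 's/[0-9]// if ($. == 3 .. $. == 5) || ($. == 8 .. $. == 10)' file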

Restricting to a range of patterns

Of course one can also restrict operations to a range of patterns. This will modify lines beginning with start through to a line beginning with end:

# Note: the range opens at the first line matching /^start/ and closes at the next line matching /^end/; it opens again if a later line matches /^start/.
perl -pe 's/[0-9]// if /^start/ .. /^end/' file

Delete lines with flow control

Deleting lines or ranges of lines in perl is easy using flow control, and it reads naturally to native English speakers. Think of print unless /pattern/ as “delete lines if the pattern matches” and print if /pattern/ as “delete lines unless the pattern matches”.

Deleting a line using a pattern

Here’s how one would delete all lines starting with “#”:

perl -ne 'print unless /^#/' file

Deleting a range of lines using pattern

Assuming the lines we wanted to delete were between the lines starting with “start” and “end”:

perl -ne 'print unless /^start/ .. /^end/' file

Deleting a line matching a specific line number

Assuming the line we wanted to remove was line 3:

perl -ne 'print unless $. == 3' file

Deleting a range of lines

Assuming the lines we wanted to delete were 3 through 5:

perl -ne 'print unless $. == 3 .. $. == 5' file

Appending a line

One can match a line and append another line. Assuming we wanted to append the line “after” after the line “before”:

perl -pe 's/(before)/$1\nafter/' file

Inserting a line

One can match a line and insert another line before it. Assuming we wanted to insert the line “before” before the line “after”:

perl -pe 's/(after)/before\n$1/' file

Printing a specific line number

Printing a specific line number in a file is trivial with perl and the “$.” variable. Assuming we wanted to print line 5:

perl -ne 'print if $. == 5' file

Matching a multi-line pattern

To match a multi-line pattern in command-line perl, one must first enable slurp mode, for example with undef $/. This causes perl to read in the entire stream all at once instead of reading the stream one line at a time. Next, use the s or m modifiers as needed. s allows the “.” character to match newline characters, and m allows “^” and “$” to match at the start and end of every line within the string rather than only at the start and end of the whole string. The modifiers are not mutually exclusive and can be used together.

If one wanted to modify a file containing:

Hello

World

To instead be:

Hello Readers

One could use the following:

perl -pe 'BEGIN{undef $/}; s/Hello.*World/Hello Readers/s' file

Breaking this down:

BEGIN{undef $/}; undef the default record separator before processing (slurp mode)
s/Hello.*World/Hello Readers/s the .* pattern between Hello and World matches the new line character because the /s modifier was used

I think this about covers the sed grymoire as most of the remaining topics of it are very sed specific. While it’s not a one-to-one translation, this post covers 95% of the core concepts. Perl’s regular expression engine can also do so many things that it’s impossible for me to cover them all so I didn’t even attempt it. Anyways, I’ll try to tackle the awk grymoire next as I typically use perl in place of awk way more often than I do for sed.