Perl Grymoire (sed)
- Introduction
- Command-line perl
- The essential substitution operator
- The slash as a delimiter
- Storing the last successful pattern match
- Perl regular expressions
- Storing search pattern results
- Perl modifiers
- Separating multiple expressions for readability
- Filenames on the command line
- perl emulating grep
- perl emulating tr
- perl scripting
- perl in shell scripts
- perl version
- Restricting processing to ranges of text
- Delete lines with flow control
- Appending a line
- Inserting a line
- Printing a specific line number
- Matching a multi-line pattern
Introduction
This post is meant to be an educational post about perl as an ad-hoc replacement for sed, and it echoes a bit of the sed grymoire. It is not as thorough as the sed grymoire because man perlre and man perlretut already cover perl’s regular expressions in depth. This is more of a “How To” for swapping perl in for sed on common problems and is therefore focused on solving problems using perl’s regular expressions in place of sed. I hope someone finds this useful. This post is a result of a coworker complaining about not having access to GNU sed on AIX 7.2 systems for a script he was writing. Much of this post is copied and slightly modified from the sed grymoire. This was created with respect and love for the sed and awk grymoires.
Command-line perl
Perl can emulate many UNIX tools such as sed and awk, but most people only learn sed and awk. In order to use perl on the command line, one must understand a handful of options (see man perlrun):
- -n: Often used with -e or -E. Causes perl to iterate over each line of a file or stream, effectively running each line through a while loop.
- -e: Evaluate one line of a program. Multiple -e options can be specified to combine multiple expressions, much like sed.
- -E: Same as -e but enables all optional features.
- -p: Often combined with -e. It does the same thing as the -n switch but also implicitly prints every line, which makes -n redundant; -p overrides -n. Still, many people such as myself write -pne because I never learn…
- -a: Autosplit lines when used with -p or -n and store the results in the @F array. Since perl 5.20, -a implicitly enables -n. The split pattern can be specified with -F.
- -F: Specify the pattern to split on. Since perl 5.20, it implicitly sets both -a and -n.
- -l: Enables automatic line-ending processing. When used with -n or -p, it chomps the input record separator $/ (a newline by default) off each line. With no argument, it also sets the output record separator $\ to the current value of $/, so print adds the newline back.
The essential substitution operator
Perl has several operators and can emulate many UNIX tools such as sed and awk, but most people only learn GNU sed and GNU awk. GNU sed and GNU awk are not as portable as one would think, and one will often run into problems trying to port shell scripts to systems such as AIX, the BSDs, etc. because those systems have their own sed and awk tools that may not support the same options and features.
Perl, like sed, has a substitution operator, s, that replaces the text matched by a regular expression pattern with a new (substituted) value. A simple example is changing “day” in the “old” file to “night” in the “new” file:
perl -pe 's/day/night/' old > new
Or piping a stream:
echo 'day' | perl -pe 's/day/night/'
This will output “night”.
Perl, like sed, changes exactly what one would tell it to. So if one executed:
echo 'Sunday' | perl -pe 's/day/night/'
This would output the word “Sunnight” because perl found the string “day” in the input.
Another important concept is that perl, when using the -n or -p option, runs a file or stream through a while loop one line at a time. Suppose the input file contains:
one two three, one two three
four three two one
one hundred
If one used the command:
perl -pe 's/one/ONE/' file
The output would be:
ONE two three, one two three
four three two ONE
ONE hundred
As with sed, this changed “one” to “ONE” once on each line. The first line had “one” twice, but only the first occurrence was changed. That is the default behavior. If one wanted to change all instances of the word “one”, one would have to use the g modifier. I’ll discuss the g modifier a bit later.
There are four parts to this substitute operator (see man perlop and man perlre):
- s: Substitute operator
- /../../: Delimiter
- one: Regular expression search pattern
- ONE: Replacement string
The search pattern is on the left hand side and the replacement string is on the right hand side.
This covers 90% of the effort needed to learn perl’s substitute operator. With this information one should be able to stop here and replace 99% of GNU sed with perl.
The slash as a delimiter
For those not familiar with sed, the character after the s is the delimiter. It is conventionally a slash, because this is what ed, more, and vi use. It can be almost anything one wants. If one wants to change a pathname that contains a slash, say /usr/local/bin to /common/bin, one could use the backslash to escape the slash:
perl -pe 's/\/usr\/local\/bin/\/common\/bin/' file
This is what the sed grymoire refers to as a ‘picket fence’ and I have to agree, it’s ugly. It is easier to read if one uses a different character as the delimiter instead, such as an underscore:
perl -pe 's_/usr/local/bin_/common/bin_' file
Some people use colons:
perl -pe 's:/usr/local/bin:/common/bin:' file
Others use the “#” character:
perl -pe 's#/usr/local/bin#/common/bin#' file
As long as it’s not in the string one is looking for, anything goes. Remember that the substitution operator requires three delimiters. If one gets a “Substitution replacement not terminated” error, it’s because a delimiter is missing.
Storing the last successful pattern match
Sometimes one wants to search for a pattern and add some characters, like parentheses, around or near the pattern one has found. It is easy to do this if one is looking for a particular string:
perl -pe 's/abc/(abc)/' file
This won’t work if one doesn’t know exactly what one will find. How can one put the string one matched in the replacement string if one doesn’t know what it is?
The solution requires the special variable $&. It holds the string matched by the last successful pattern match.
perl -pe 's/[a-z]*/($&)/' file
One can have any number of $& variables in the replacement string. One could also double a pattern, e.g. the first number of a line:
$ echo "123 abc" | perl -pe 's/[0-9]*/$& $&/'
123 123 abc
Perl will take the first (leftmost) match and make it as greedy, i.e. as long, as possible. Greediness is covered later. If one doesn’t want the match to be so greedy, one needs to put restrictions on it.
The first match for '[0-9]*' is at the very start of the line, because the pattern matches zero or more digits and therefore also matches the empty string. So if the input was “abc 123” the output would be unchanged (well, except for a space before the letters). A better way to duplicate the number is to make sure the pattern matches at least one digit:
$ echo "123 abc" | perl -pe 's/[0-9][0-9]*/$& $&/'
123 123 abc
The string “abc” is unchanged, because it was not matched by the regular expression. If one wanted to eliminate “abc” from the output, one must expand the regular expression to match the rest of the line and explicitly exclude part of the expression using “(”, “)” and $1, which is the next topic.
Perl regular expressions
There is another way to write the above script. “[0-9]*” matches zero or more digits. “[0-9][0-9]*” matches one or more digits. The other way to do this is to use the pattern “[0-9]+”, as “+” is a special meta-character meaning “one or more” in perl regular expressions. Perl regular expressions are a lot more powerful than the POSIX basic or extended regular expressions available in GNU sed and GNU awk. In fact, they are so powerful that many languages implement support for perl-compatible regular expressions. See man perlre for more information about perl regular expressions.
Storing search pattern results
Parentheses remember a substring of the characters matched by the regular expression. One can use this to exclude part of the characters matched by the regular expression. $1 is the first remembered pattern, and $2 is the second remembered pattern.
If one wanted to keep the first word of a line and delete the rest of the line, mark the important part with parentheses:
perl -pe 's/(\w*).*/$1/'
This can also be achieved without regular expressions:
perl -lane 'print $F[0]'
Regular expressions are greedy, and try to match as much as possible. “\w*” matches as many contiguous word characters (letters, digits, and underscores) as possible. The “.*” matches zero or more characters after the first match. Since the first one grabs all of the contiguous word characters (greedy!), the second matches anything else. Because “\w” also matches digits, one has to use “[a-z]*” to separate letters from numbers. Therefore if one types:
echo abcd123 | perl -pe 's/([a-z]*).*/$1/'
This will output “abcd” and delete the numbers.
If one wants to switch two words around, one can remember two patterns and change the order around:
perl -pe 's/(\w*) (\w*)/$2 $1/'
Note the space between the two remembered patterns. This is used to make sure two words are found. This can also be achieved without regular expressions using awk-like syntax (remember that @F is indexed from zero):
perl -lane 'print $F[1]." ".$F[0]'
A remembered pattern doesn’t have to be used only in the replacement string (the right hand side). It can also be referred to in the pattern one is searching for (the left hand side), where it is written as the backreference \1 rather than $1. If one wants to eliminate duplicated words, one can try something like:
perl -pe 's/\b([a-z]+) \1\b/$1/' # This only collapses the first doubled word on each line
If one wants to detect a line whose first two words are identical, one can use a non-regular expression solution:
perl -lane 'print if $F[0] eq $F[1]'
Perl modifiers
One can add additional flags after the last delimiter to modify perl regular expression behaviour. See man perlre for more information.
Global replacement modifier
Most UNIX utilities work on files, reading a line at a time. Perl, by default, is the same way. If one tells it to change a word, it will only change the first occurrence of the word on a line. One may want to make the change on every occurrence on the line instead of just the first. As with sed, we can substitute globally using the g modifier.
echo "Hello world. Hello universe." | perl -pe 's/Hello/Goodbye/g'
In this example, both instances of the word “Hello” are replaced with “Goodbye”.
Ignore case modifier
The i modifier makes the pattern match case-insensitive. This will match abc, aBc, ABC, AbC, etc.:
perl -ne 'print if /abc/i' file
Separating multiple expressions for readability
One method of combining multiple expressions is to use a -e (or -E to enable all features) before each command:
# Note the semi-colon. It's required to separate the two statements.
perl -p -e 's/a/A/;' -e 's/b/B/' file
The same can be achieved by separating expressions with semi-colons in a single statement:
perl -pe 's/a/A/;s/b/B/' file
Filenames on the command line
One can specify files on the command line if one wishes. When using the -p, -n, or -a options, any argument to perl that is not an option (or the program text given to -e) is assumed to be a filename. This next example will count the number of lines in three files that don’t begin with a “#”:
# Note the semi-colons required to separate statements.
perl -ne '$count+=1 unless /^#/;' -e 'END{print $count,"\n";}' file1 file2 file3
Let’s break this down into pieces. First, we specify the -ne options as we don’t want the lines implicitly printed to STDOUT as with -p. BEGIN and END are special code blocks in perl: code in a BEGIN block runs before any input is processed, and code in an END block runs after all input has been processed.
- -ne '$count+=1 unless /^#/;': we increase the variable $count by 1 if the line does not start with the # character
- -e 'END{print $count,"\n";}': we print the $count variable at the end, once all lines have been processed
Of course one could write the last example with GNU grep and GNU wc:
grep -hv '^#' file1 file2 file3 | wc -l # not portable
And of course one can replace grep with perl:
perl -ne 'print unless /^#/' file1 file2 file3 | wc -l
And lastly a very programmatic perl example that won’t be explained here:
# There's more than one way to do it!
perl -e 'print scalar(grep !/^#/, <>),"\n";' file1 file2 file3
perl emulating grep
perl can easily emulate the behaviour of grep. GNU grep (I’m not sure in what version this feature was added) also has support for perl-compatible regular expressions; however, the manual states that they are currently experimental.
# simple pattern match
perl -ne 'print if /pattern/' file
grep 'pattern' file
# match multiple patterns
perl -ne 'print if /pattern1|pattern2/' file
grep '\(pattern1\|pattern2\)' file # notice the annoying backslashes...
# match inverse of a pattern
perl -ne 'print unless /pattern/' file
grep -v 'pattern' file
perl emulating tr
The tr operator can emulate the UNIX tr command. For instance, one could change all lowercase characters to uppercase (unlike in a regular expression, no square brackets are needed around the ranges):
perl -pe 'tr/a-z/A-Z/' file1
And of course, there’s more than one way to do it:
perl -pe '$_ = uc' file1
perl scripting
Perl is way more powerful than tools such as sed or awk and, as such, perl scripting is outside the scope of this post, which is meant to cover command-line perl as a drop-in replacement for GNU sed only. See man perlintro for an introduction to perl scripting.
perl in shell scripts
Perl can easily be used in place of many unix tools in shell scripts, subbing for sed, awk, grep, etc. in a pinch. It’s also way more portable than those tools.
perl version
Print the version of perl:
perl -v
Restricting processing to ranges of text
To restrict operations to specific ranges in perl, one will often see the “..” range operator used in conjunction with the special “$.” variable that tracks the current line number.
Restricting to a specific line number
If one wanted to delete the first digit on line 3 of a file, simply do the following:
perl -pe 's/[0-9]// if $. == 3' file
Restricting to a range of lines
If one wanted to delete the first digit on lines 3 through 5 in a file, one would use:
perl -pe 's/[0-9]// if $. == 3 .. $. == 5' file
One can also specify multiple ranges to work on using the “||” operator. Because “||” binds more tightly than “..”, each range must be wrapped in parentheses. This would operate on the range of 3 through 5 as well as 8 through 10:
perl -pe 's/[0-9]// if ($. == 3 .. $. == 5) || ($. == 8 .. $. == 10)' file
Restricting to a range of patterns
Of course one can also restrict operations to a range of patterns. This will modify lines from a line beginning with start through to a line beginning with end:
# Note: the range turns on at a line matching /^start/ and off again at the next line matching /^end/, so it can trigger more than once in a file.
perl -pe 's/[0-9]// if /^start/ .. /^end/' file
Delete lines with flow control
Deleting lines or ranges of lines in perl is easy using flow control, and it reads naturally to English speakers. Think of print unless /pattern/ as “delete lines if the pattern matches” and print if /pattern/ as “delete lines unless the pattern matches”.
Deleting a line using a pattern
Here’s how one would delete all lines starting with ”#”:
perl -ne 'print unless /^#/' file
Deleting a range of lines using pattern
Assuming the lines we wanted to delete were between the lines starting with “start” and “end”:
perl -ne 'print unless /^start/ .. /^end/' file
Deleting a line matching a specific line number
Assuming the line we wanted to remove was line 3:
perl -ne 'print unless $. == 3'
Deleting a range of lines
Assuming the lines we wanted to delete were 3 through 5:
perl -ne 'print unless $. == 3 .. $. == 5'
Appending a line
One can match a line and append another line. Assuming we wanted to append the line “after” after the line “before”:
perl -pe 's/(before)/$1\nafter/' file
Inserting a line
One can match a line and insert another line before it. Assuming we wanted to insert the line “before” before the line “after”:
perl -pe 's/(after)/before\n$1/'
Printing a specific line number
Printing a specific line number in a file is trivial with perl and the “$.” variable. Assuming we wanted to print line 5:
perl -ne 'print if $. == 5' file
Matching a multi-line pattern
In order to match a multi-line pattern in command-line perl, one must first enable slurp mode with undef $/. This causes perl to read in the entire stream all at once instead of reading the stream one line at a time. Next, use either the s or m modifiers. s allows the “.” character to match new-line characters, and m allows “^” and “$” to match at the start and end of every line within the string instead of only at the start and end of the whole string. The modifiers are not mutually exclusive and can be used together.
If one wanted to modify a file containing:
Hello
World
To instead be:
Hello Readers
One could use the following:
perl -pe 'BEGIN{undef $/}; s/Hello.*World/Hello Readers/s' file
Breaking this down:
- BEGIN{undef $/};: undef the input record separator before processing (slurp mode)
- s/Hello.*World/Hello Readers/s: the .* pattern between Hello and World matches the new line character because the /s modifier was used
I think this about covers the sed grymoire as most of the remaining topics of it are very sed specific. While it’s not a one-to-one translation, this post covers 95% of the core concepts. Perl’s regular expression engine can also do so many things that it’s impossible for me to cover them all so I didn’t even attempt it. Anyways, I’ll try to tackle the awk grymoire next as I typically use perl in place of awk way more often than I do for sed.