Perl Grymoire (awk)
- Introduction
- Why learn perl instead of awk?
- Command-line perl
- Basic structure for emulating awk programs
- Our first perl script
- Arithmetic expressions
- Unary arithmetic operators
- The autoincrement and autodecrement operators
- Assignment operators
- Conditional expressions
- Regular expressions
- Compound conditional expressions
- Perl builtin variables
- The default split pattern
- The output field separator variable
- Getting the number of fields
- Getting the last field
- The current line variable
- The record separator variable
- The output record separator variable
- The current filename variable
Introduction
This post is meant to be an educational post about perl as an ad-hoc replacement for awk
that echoes a bit of the awk grymoire. It is more of a “How To” for swapping perl in for awk on common problems and is therefore focused on solving problems using as many awk-like features in perl as I know. I hope someone finds this useful. This post is the result of a coworker complaining about not having access to GNU awk on AIX 7.2 systems for a script he was writing. Much of this post is copied and slightly modified from the awk grymoire. It was created with respect and love for the sed and awk grymoires.
Why learn perl instead of awk?
Actually, I’d suggest you learn both! In fact, if you haven’t learned awk yet, I’d suggest learning it with the awk grymoire before continuing, as much of this article will reference awk.
Command-line perl
Perl has many command-line options and can emulate UNIX tools such as sed and awk, yet most people only ever learn sed and awk. In order to use perl on the command-line, one must understand a handful of options (man perlrun):
- -n : Often used with -e or -E. Causes perl to iterate over each line of a file or stream, effectively running each line through a while loop.
- -e : Evaluate one line of a program. Multiple -e options can be specified to combine multiple expressions, much like sed.
- -E : Same as -e but enables all optional features.
- -p : Often combined with -n and -e. It effectively does the same thing as the -n switch, making -n redundant, and as a result it will override -n. Still, many people such as myself write -pne because I never learn… -p has the added effect over -n in that it will implicitly print every line.
- -a : Autosplit lines when used with -p or -n and store the results in the @F array. -a implicitly enables -n, and the split pattern can be specified with -F.
- -F : Specify the pattern to split on. Implicitly sets both -a and -n.
- -l : Enables automatic line-ending processing. With no argument, this option chomps the newline character (the input record separator $/) off each input line and assigns it to the output record separator $\ so that print adds it back.
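To make these switches a little more concrete, here are a couple of illustrative one-liners (FILE being any text file you have handy):
# number each line of a file (similar to cat -n): -l chomps and re-adds newlines, -p prints every line
perl -lpe '$_ = "$. $_"' FILE
# print the first whitespace-delimited field of every line: -a autosplits into @F, -n loops over lines
perl -lane 'print $F[0]' FILE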
Basic structure for emulating awk programs
The essential organization of an awk program follows the form:
/pattern/{action}
The pattern specifies when the action is performed. The perl version would look more like this:
action if /pattern/
Perl, when used with the -n option, is line-oriented: each line of a stream is iterated over as if it were run through a while loop. The if /pattern/ will therefore perform action on every line that matches /pattern/. Two important functions are BEGIN and END. They are analogous to BEGIN and END in awk, and any code in the BEGIN and END functions is processed at the beginning and end of the program as opposed to on every line. For example, see the following perl one-liner:
perl -ne 'BEGIN { print "START\n" }; print; END { print "-DONE-\n" }' FILE
This code would print “START”, followed by every line in the file, followed by “-DONE-“. The bare print works because perl automatically stores the content of each line in the special $_ variable, or default iterator variable, which the print function will print implicitly if not provided any arguments. The print statement can also be excluded from the above example by using the -p option instead of -n:
perl -pe 'BEGIN { print "START\n" } END { print "-DONE-\n" }' FILE
Our first perl script
Perl (version 5) can act very funny in order to support the behaviour of older versions of perl5 (and when I say old, I mean old, like 20+ years old). In order to enforce good behaviour and coding habits, it’s customary to use strict;
in modern perl to avoid things like bare-word variables and just weird stuff overall. If the version of perl on your destination system is old enough not to include the strict
module, such as versions older than 5.10 (maybe even older than that, not sure when strict was added but I’ve used 5.05 and it wasn’t fun), then you may want to eval "use strict; 1"
instead of just use strict
. I’m going to assume you aren’t running ancient late 90s Solaris throughout this article though as it’s the year 2020… Additionally, it’s customary to use warnings;
in order to get better debugging and error messages from your scripts. Many will also add use v5.10
in order to add support for new features such as say
. I’m going to avoid this throughout this article, not because it’s a bad feature, but sometimes you gotta learn things the old way first ;).
So, after that long-winded introduction, our first program, FileOwner.pl:
#!/usr/bin/env perl
use strict;
use warnings;
print "File\tOwner\n";
while (<>) {
    chomp;
    my @F = split;
    print "$F[-1]" . "\t" . "$F[2]\n";
}
print "-DONE-\n";
chmod +x that file and run it in a directory that has files in it via:
ls -l | ./FileOwner.pl
It will output file names and owners, separated by the tab character.
The following is an explanation of the above program:
print "File\tOwner";
Obviously prints File and Owner separated by a tab characterswhile (<>){...}
Iterates over each line of STDIN (<>). In this case, the stream produced by thels -l
command.chomp;
Removes new-line character.my @F = split;
Somewhat cryptic perl here.my @F
creates an array called@F
whilesplit;
is actually callingsplit
with two arguments. The first argument, since it wasn’t specified, is implicitly the default input field separator (defaultsplit
pattern) which is equal to the regular expression/\s+/
which meanssplit
will split a string on every one-or-more spaces. The second argument tosplit
, since it wasn’t specified, is the special$_
variable or the default iterator which is also assumed.$_
stores the content of each line that we’re iterating over in thewhile {...}
loop. Lastly, note the use of the wordmy
. This is required for explicitly instantiating a variable in perl withuse strict
enabled.print "$F[-1]" . "\t" . "$F[2]\n";
print the last element in the@F
array ($F[-1]
, the file name) concatenated (.
) with a tab character (\t
), followed by the second element of the@F
array ($F[2]
, the file owner). Note that the numbers here, are one less than when a similar awk program is run. This is because perl’s array elements start at element 0. Also note that accessing elements in a perl array requires the use of the$
symbol as elements of an array must be accessed using scalar context. For more information seeman perlintro
.
NOTE: This program is incredibly flawed as it won’t handle file names with spaces properly. One should never pipe the output of ls for parsing. The reason this example is here is that, while it encourages potentially bad habits, it still illustrates these concepts well, and it comes straight out of the awk grymoire.
This can also be expressed using command-line perl as:
ls -l | perl -lae 'BEGIN{print "File\tOwner"} print $F[-1]."\t"."$F[2]"; END{ print "-DONE-" }'
The command-line perl version of the script uses the special BEGIN and END functions which were not required in the perl program. This is required for the command-line version because -a implicitly runs the STDIN (<>) stream through a while {...} loop. If BEGIN and END were not utilized here, "File\tOwner" and "-DONE-" would be printed for every line. Additionally, the command-line version has no explicit declaration of the @F variable. This is not required when the -a option is used, as it will automatically split each line using the default split pattern /\s+/ and store the results in @F. The default split pattern can be modified using the -F option and providing an alternative pattern.
Note: None of the command-line perl examples will have use strict or use warnings enabled. They may be enabled on the command line via the -M option or in the BEGIN function. Typically, one does not use these modules for one-liners as they are meant to be quick and dirty. You may find that your one-liners don’t throw much help your way in regards to debugging; if you’re having trouble troubleshooting, pass the option -M'warnings'.
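For instance, a quick sketch of enabling them for a one-liner:
perl -Mstrict -Mwarnings -le 'my $x = 42; print $x'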
Arithmetic expressions
There are several binary operators, similar to awk:
Operator | Type | Meaning |
+ | Arithmetic | Addition |
- | Arithmetic | Subtraction |
* | Arithmetic | Multiplication |
/ | Arithmetic | Division |
% | Arithmetic | Modulo |
Using variables with the value of “7” and “3”, perl returns the following results for each operator when using the print command:
Expression | Result |
7+3 | 10 |
7-3 | 4 |
7*3 | 21 |
7/3 | 2.33333333333333 |
7%3 | 1 |
If you’ve never worked with a programming language before, the %
or modulus operator returns the remainder after performing integer division. print 7/3
will output a floating point number if necessary. When concatenating two numbers using a “.
” character, perl will convert them to strings automatically (ex. 7 . 3
would result in the string "73"
). Perl, unlike C, only has 3 main variable types: scalars ($var
), arrays (@var
) and hashes (%var
). Scalars can be strings, integers, floating points or references to other types. Arrays and Hashes may contain scalars such as strings, integers, floats or references to arrays and/or hashes. See man perlintro
for more information.
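A quick illustration of the arithmetic and concatenation behaviour described above:
perl -e 'my $x = 7; my $y = 3; print $x % $y, "\n"; print $x / $y, "\n"; print $x . $y, "\n";'
# prints 1, 2.33333333333333 and 73 on separate lines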
Unary arithmetic operators
The “+” and “-“ operators can be used before variables and numbers. If X equals 4, then the statement:
my $x=4;print -$x;
This will result in the output of “-4”.
The autoincrement and autodecrement operators
Perl also supports the “++
” and “--
” operators of C. Both increment or decrement variables by one. The operator can only be used with a single variable, and can be before or after the variable:
my $x=4;print $x++," ",++$x;
This would print the numbers 4 and 6. These operators are also assignment operators, and can be used by themselves on a line:
$x++;
--$y;
Assignment operators
Variables can be assigned new values with the assignment operators. Knowing “++
” and “--
”, the other assignment statement is simply:
my $variable = arithmetic expression
Certain operators have precedence over others. Parentheses can be used to control grouping of operations. The statement:
my $x=1+2*3 . 4;print $x;
Is the same as:
my $x = (1 + (2 * 3)) . "4";print $x;
Both result in “74”. For more information about operator precedence, see man perlop
.
Notice spaces can be added for readability. Perl, like awk, has special assignment operators, which combine a calculation with an assignment. Instead of saying:
$x=$x+2;
One can instead say:
$x+=2;
Operator | Meaning |
+= | Add result to variable |
-= | Subtract result from variable |
*= | Multiply variable by result |
/= | Divide variable by result |
%= | Apply modulo to variable |
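A tiny example of the compound assignment operators in action:
perl -e 'my $x = 7; $x += 2; $x *= 3; $x %= 5; print $x, "\n";'
# ((7 + 2) * 3) % 5 = 27 % 5 = 2, so this prints 2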
Conditional expressions
Perl can also handle conditional expressions and has robust logical operators to do so. if, unless, while, etc. evaluate an expression to true or false. A value of 0 (or the string "0", an empty string, or undef) is evaluated to false while any other value is evaluated to true.
Operator | Meaning |
== | is numerically equal |
!= | is not numerically equal |
> | is numerically greater than |
>= | is numerically greater than or equal to |
< | is numerically less than |
<= | is numerically less than or equal to |
eq | is stringwise equal |
ne | is not stringwise equal |
gt | is stringwise greater than |
ge | is stringwise greater than or equal to |
lt | is stringwise less than |
le | is stringwise less than or equal to |
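The difference between the numeric and stringwise operators matters more than it might first appear. A small sketch:
perl -e 'print "10" == "10.0" ? "numerically equal\n" : "numerically different\n";'
# prints "numerically equal" because both strings become the number 10
perl -e 'print "10" eq "10.0" ? "stringwise equal\n" : "stringwise different\n";'
# prints "stringwise different" because the strings are compared character by character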
Regular expressions
Two operators are used to compare strings and regular expressions:
Operator | Meaning |
=~ | left-side matches provided right-side regular expression |
!~ | left-side does not match provided right-side regular expression |
NOTE: smart match ~~
also exists and can do a lot of things but its behaviour is pretty unpredictable as it has changed many times since it was first introduced in perl. It’s probably best just to avoid it.
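For example, to print the lines of /etc/passwd that match (and then those that don't match) the pattern /nologin/:
perl -ne 'print if $_ =~ /nologin/' /etc/passwd
perl -ne 'print if $_ !~ /nologin/' /etc/passwd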
Compound conditional expressions
Multiple conditional expressions can be compounded. One can combine two conditional expressions with the “and” (&&) or “or” (||) operators. One can also just type the English words and or or; however, they have much lower precedence than their symbolic equivalents. Additionally, truthiness can be inverted using the ! character in comparisons.
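Putting this together with autosplit, the following sketch prints accounts from /etc/passwd with a UID of at least 1000 (a common, though not universal, cutoff for regular users) and a shell that isn't nologin:
perl -F':' -lane 'print $F[0] if $F[2] >= 1000 && $F[-1] !~ /nologin/' /etc/passwd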
Perl Builtin Variables
Perl has many builtin variables that are useful for text processing. The builtin variables are all very symbolic but typically have awk-like equivalents if use English;
is set. See man perlvar
for a list of all pre-defined variables. The following table describes a number of variables useful for creating awk-like programs:
Variable | use English | Meaning |
<> | N/A | standard input |
$_ | $ARG | Default pattern searching space |
@_ | @ARG | Array containing all arguments passed to a subroutine |
$$ | $PID | Process id of current executing program |
$0 | $PROGRAM_NAME | Name of the current executing program |
$, | $OFS | Output field separator. Default is undef |
$. | $INPUT_LINE_NUMBER or $NR | Current line number of last file handle or stream accessed |
$/ | $INPUT_RECORD_SEPARATOR or $RS | Input record separator. Defaults to newline character |
$\ | $OUTPUT_RECORD_SEPARATOR or $ORS | Output record separator. Defaults to undef |
NOTE: Unlike awk, there is no $FS or default “input field separator” variable. The input field separator in perl is actually the default split pattern which defaults to the regular expression /\s+/
on the command line when the -a
option is specified or in a perl program when split is called without specifying a pattern. The field separator can be specified on the command-line by specifying a pattern for the -F
option and in a program as the first argument to the split function.
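As a small example of the English aliases, this one-liner numbers the lines of a file using $NR and $ARG instead of $. and $_:
perl -MEnglish -ne 'print "$NR: $ARG"' FILE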
The default split pattern
Perl can easily parse many system administration files; however, many of these files do not use the /\s+/ pattern as a delimiter. Many of them use colons ”:” instead. As with awk, one can easily set the field separator to a ”:” character with the -F option:
perl -F':' -lae 'print $F[0].":no shell!" if $F[-1] eq "/usr/sbin/nologin"' /etc/passwd
The equivalent perl program would be:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
    chomp;
    my @F = split /:/;
    print $F[0] . ":no shell!\n" if $F[-1] eq "/usr/sbin/nologin";
}
The new script could then be executed using ./script.pl /etc/passwd
.
An advantage of writing a script is that you could use multiple patterns with the split
function. This would be somewhat rare but given the following file:
ONE 1 I
TWO 2 II
#START
THREE:3:III
FOUR:4:IV
FIVE:5:V
#STOP
SIX 6 VI
SEVEN 7 VII
You could process it with a perl script such as the following:
#!/usr/bin/env perl
use strict;
use warnings;
my $FS = qr/\s+/;
while (<>) {
    chomp;
    if ($_ eq '#START') {
        $FS = qr/:/;
    } elsif ($_ eq '#STOP') {
        $FS = qr/\s+/;
    } else {
        # print the Roman number in column 3
        my @F = split /$FS/;
        print $F[2] . "\n";
    }
}
In the above script we set a variable $FS
that stores a default pattern /\s+/
. It’s enclosed in qr/.../
which is used to quote regex. When we reach the line in the file matching ‘#START’ we proceed to change the $FS pattern to /:/
and when the line matches ‘#STOP’ we change the $FS pattern back to /\s+/
. When neither ‘#START’ nor ‘#STOP’ is matched, we proceed to split the line using the pattern we stored in $FS
and print the value of the 3rd column (element 2 of our @F
array).
NOTE: the use of eq
instead of ==
for comparing strings. Don’t make the same mistake I did when writing this example lol…
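To see why == is the wrong tool for strings: in numeric context most non-numeric strings silently become 0, so two completely different strings can compare as "equal". With warnings enabled, perl will at least complain:
perl -Mwarnings -e 'print "equal?!\n" if "/bin/bash" == "/usr/sbin/nologin";'
# prints "equal?!" and warns that the arguments aren't numeric, because both strings evaluate to 0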
NOTE: Unlike awk
, $FS
is not a special perl variable. We’ve simply set it to store a pattern for use in our program. One cannot override the default split pattern in perl (except when using the -F
option on the command-line). $FS
is used here to mimic awk
. We could have just as easily stored this pattern in any variable.
The Output Field Separator Variable
Before starting, please note that there is a major difference between the following:
print "one"."two";
print "one","two";
The first line, print "one"."two";, concatenates the words “one” and “two”, while the second sends the words “one” and “two” as separate arguments to print. The output field separator variable will not affect the first example because it is concatenating two strings as opposed to passing two strings as arguments.
Let’s say you wanted to copy the /etc/shadow
file but exclude all the hashed passwords. This could be accomplished on the command-line with the following:
sudo perl -F':' -ae 'BEGIN{$,=":"} $F[1]=""; print @F;' /etc/shadow
An equivalent perl program would be:
#!/usr/bin/env perl
use strict;
use warnings;
$, = ':';
while (<>){
chomp;
my @F = split /:/,$_,-1;
$F[1] = "";
print @F;print "\n";
}
In the one-liner we excluded the -l option: it isn’t required here, and the chomp it performs would cause split to drop the trailing empty fields. In the script version, we specified a -1 argument to the split function in order for it to not drop trailing empty fields. We also used two print statements to avoid adding an extra colon character, which would occur if we passed the “\n” new-line character as another argument to the print statement that prints the array.
Getting the number of fields
Getting the number of fields in perl is easy. It’s equal to the number of elements in the @F array (the highest index plus 1). This can be retrieved automatically by evaluating the array in scalar context using the scalar function. For example:
perl -F':' -ae 'print scalar @F,"\n"' /etc/passwd | head -n1
The above example results in 7.
Getting the last field
The last field of a line (last element of an array of a split line) can easily be retrieved by using a -1 array index.
perl -F':' -ae 'print $F[-1],"\n"' /etc/passwd | head -n1
On my system the first line of /etc/passwd
is “root:x:0:0:root:/root:/bin/bash” so the above example returns “/bin/bash”.
The current line variable
$.
holds the current line number. Therefore, to get the number of lines in a file, emulating the behaviour of wc -l FILE
, you could use:
perl -ne 'END{print $.," ".$ARGV,"\n"}' FILE
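A script version of the same idea might look like this (assuming a single FILE argument; $. and $ARGV keep their values after the loop finishes):
#!/usr/bin/env perl
use strict;
use warnings;
# read every line just to advance the line counter
while (<>) {}
print $., " ", $ARGV, "\n";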
The record separator variable
Normally, perl does not slurp files or streams; it reads them one line at a time. The default record separator variable is $/. You can enable a slurp-like behaviour by undefining this variable. The command-line version would look something like this:
# print the second and third line of a file
perl -e 'undef $/;@F=split /\n/,<>; print $F[1],"\n";print $F[2],"\n"' FILE
Notice in the above example, we did not use the -a
option. That’s because it will enforce reading the file line-by-line which we do not want. In a script, the example would look something like this:
#!/usr/bin/env perl
use strict;
use warnings;
undef $/; # Enable slurp
my @F = split /\n/,<>;
# Print the second and third lines
print $F[1],"\n";
print $F[2],"\n";
The output record separator variable
The default output record separator variable, unlike awk’s, is undefined. It’s common (if not using newer features like the say function in place of print) to set $\ to the newline character “\n”.
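For example, a minimal sketch where every print ends the line for you, much like awk’s print:
perl -e '$\ = "\n"; print "first"; print "second";'
# prints "first" and "second" on separate lines with no explicit "\n"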
The current filename variable
When using command-line perl, the current filename can be obtained with the $ARGV
variable. For example:
perl -pe 'print "Processing file: ",$ARGV,"\n" if $. == 1' FILE
Note that I didn’t use the special BEGIN
function here because it’s executed even before $ARGV
is set. The equivalent script would look something like the following:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
    print "Processing file: " . $ARGV . "\n" if $. == 1;
    print;
}
Note that once again, $ARGV
is printed when $. == 1
. This is because $ARGV
is not initialized until STDIN (<>
) is read.
I think this is a good place to leave things. While I haven’t covered every single topic in the awk grymoire, I think that the remaining topics are not that commonly used in day-to-day system administration as they cover things like Trigonometry, strftime
for date formatting (which is incredibly useful but I’ve never seen anyone use awk for this. See perldoc POSIX
for how to use strftime
in perl as it’s more portable than the date
command), arrays (which we’ve covered pretty extensively here since perl relies on them), etc. To continue learning perl scripting for things other than emulating awk, see man perlintro
.
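Since strftime came up, here’s a small sketch of formatting the current time with POSIX’s strftime instead of shelling out to date:
perl -MPOSIX=strftime -le 'print strftime("%Y-%m-%d %H:%M:%S", localtime)'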