Perl Grymoire (awk)
- Introduction
- Why learn perl instead of awk?
- Command-line perl
- Basic structure for emulating awk programs
- Our first perl script
- Arithmetic expressions
- Unary arithmetic operators
- The autoincrement and autodecrement operators
- Assignment operators
- Conditional expressions
- Regular expressions
- Compound conditional expressions
- Perl builtin variables
- The default split pattern
- The output field separator variable
- Getting the number of fields
- Getting the last field
- The current line variable
- The record separator variable
- The output record separator variable
- The current filename variable
Introduction
This post is meant to be an educational post about perl as an ad-hoc replacement for awk
that echoes a bit of the awk grymoire. It is more of a “How To” for swapping perl in for awk on common problems and is therefore focused on solving problems using as many awk-like features in perl as I know. I hope someone finds this useful. This post is the result of a coworker complaining about not having access to GNU awk on AIX 7.2 systems for a script he was writing. Much of this post is copied and slightly modified from the awk grymoire. It was created with respect and love for the sed and awk grymoires.
Why learn perl instead of awk?
Actually, I’d suggest you learn both! In fact, if you haven’t learned awk yet, I’d suggest learning it with the awk grymoire before continuing, as much of this article will reference awk.
Command-line perl
Perl has many command-line options and can emulate UNIX tools such as sed and awk, yet most people only ever learn sed and awk. In order to use perl on the command-line, one must understand a handful of options (man perlrun):
- -n : Often used with -e or -E. Causes perl to iterate over each line of a file or stream, effectively running each line through a while loop.
- -e : Evaluate one line of a program. Multiple -e options can be specified to combine multiple expressions, much like sed.
- -E : Same as -e but enables all optional features.
- -p : Often combined with -n and -e. It effectively does the same thing as the -n switch, making -n redundant, and as a result it will override -n. Still, many people such as myself write -pne because I never learn… -p has the added effect over -n in that it will implicitly print every line.
- -a : Autosplit lines when used with -p or -n and store the results in the @F array. -a implicitly enables -n, and the split pattern can be specified with -F.
- -F : Specify the pattern to split on. Implicitly sets both -a and -n.
- -l : Enables automatic line-ending processing. With no argument, this option chomps the newline character (the input record separator $/) off each input line and assigns it to the output record separator $\ so that print adds it back.
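To make these switches a little more concrete, here are a couple of illustrative one-liners (FILE being any text file you have handy):
# number each line of a file (similar to cat -n): -l chomps and re-adds newlines, -p prints every line
perl -lpe '$_ = "$. $_"' FILE
# print the first whitespace-delimited field of every line: -a autosplits into @F, -n loops over lines
perl -lane 'print $F[0]' FILE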
Basic structure for emulating awk programs
The essential organization of an awk program follows the form:
/pattern/{action}
The pattern specifies when the action is performed. The perl version would look more like this:
action if /pattern/
Perl, when used with the -n option, is line-oriented: each line of a stream is iterated over as if it were run through a while loop. The if /pattern/ will therefore perform action on every line that matches /pattern/. Two important functions are BEGIN and END. They are analogous to BEGIN and END in awk, and any code in the BEGIN and END functions is processed at the beginning and end of the program as opposed to on every line. For example, see the following perl one-liner:
perl -ne 'BEGIN { print "START\n" }; print; END { print "-DONE-\n" }' FILE
This code would print “START”, followed by every line in the file, followed by “-DONE-“. The bare print works because perl automatically stores the content of each line in the special $_ variable, or default iterator variable, which the print function will print implicitly if not provided any arguments. The print statement can also be excluded from the above example by using the -p option instead of -n:
perl -pe 'BEGIN { print "START\n" } END { print "-DONE-\n" }' FILE
Our first perl script
Perl (version 5) can act very funny in order to support the behaviour of older versions of perl5 (and when I say old, I mean old, like 20+ years old). In order to enforce good behaviour and coding habits, it’s customary to use strict;
in modern perl to avoid things like bare-word variables and just weird stuff overall. If the version of perl on your destination system is old enough not to include the strict
module, such as versions older than 5.10 (maybe even older than that, not sure when strict was added but I’ve used 5.05 and it wasn’t fun), then you may want to eval "use strict; 1"
instead of just use strict
. I’m going to assume you aren’t running ancient late 90s Solaris throughout this article though as it’s the year 2020… Additionally, it’s customary to use warnings;
in order to get better debugging and error messages from your scripts. Many will also add use v5.10
in order to add support for new features such as say
. I’m going to avoid this throughout this article, not because it’s a bad feature, but sometimes you gotta learn things the old way first ;).
So, after that long-winded introduction, our first program, FileOwner.pl:
#!/usr/bin/env perl
use strict;
use warnings;
print "File\tOwner\n";
while (<>) {
    chomp;
    my @F = split;
    print "$F[-1]" . "\t" . "$F[2]\n";
}
print "-DONE-\n";
chmod +x that file and run it in a directory that has files in it via:
ls -l | ./FileOwner.pl
It will output file names and owners, separated by the tab character.
The following is an explanation of the above program:
print "File\tOwner";
Obviously prints File and Owner separated by a tab characterswhile (<>){...}
Iterates over each line of STDIN (<>). In this case, the stream produced by thels -l
command.chomp;
Removes new-line character.my @F = split;
Somewhat cryptic perl here.my @F
creates an array called@F
whilesplit;
is actually callingsplit
with two arguments. The first argument, since it wasn’t specified, is implicitly the default input field separator (defaultsplit
pattern) which is equal to the regular expression/\s+/
which meanssplit
will split a string on every one-or-more spaces. The second argument tosplit
, since it wasn’t specified, is the special$_
variable or the default iterator which is also assumed.$_
stores the content of each line that we’re iterating over in thewhile {...}
loop. Lastly, note the use of the wordmy
. This is required for explicitly instantiating a variable in perl withuse strict
enabled.print "$F[-1]" . "\t" . "$F[2]\n";
print the last element in the@F
array ($F[-1]
, the file name) concatenated (.
) with a tab character (\t
), followed by the second element of the@F
array ($F[2]
, the file owner). Note that the numbers here, are one less than when a similar awk program is run. This is because perl’s array elements start at element 0. Also note that accessing elements in a perl array requires the use of the$
symbol as elements of an array must be accessed using scalar context. For more information seeman perlintro
.
NOTE: This program is incredibly flawed as it won’t handle file names with spaces properly. One should never pipe the output of ls for parsing. The reason this example is here is that, while it encourages potentially bad habits, it still illustrates these concepts well, and it comes straight out of the awk grymoire.
This can also be expressed using command-line perl as:
ls -l | perl -lae 'BEGIN{print "File\tOwner"} print $F[-1]."\t"."$F[2]"; END{ print "-DONE-" }'
The command-line perl version of the script uses the special BEGIN and END functions which were not required in the perl program. This is required for the command-line version because -a implicitly runs the STDIN (<>) stream through a while {...} loop. If BEGIN and END were not utilized here, "File\tOwner" and "-DONE-" would be printed for every line. Additionally, the command-line version has no explicit declaration of the @F variable. This is not required when the -a option is used, as it will automatically split each line using the default split pattern /\s+/ and store the results in @F. The default split pattern can be modified using the -F option and providing an alternative pattern.
Note: None of the command-line perl examples will have use strict or use warnings enabled. They may be enabled on the command line via the -M option or in the BEGIN function. Typically, one does not use these modules for one-liners as they are meant to be quick and dirty. You may find that your one-liners don’t throw much help your way in regards to debugging; if you’re having trouble troubleshooting, pass the option -M'warnings'.
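For instance, a quick sketch of enabling them for a one-liner:
perl -Mstrict -Mwarnings -le 'my $x = 42; print $x'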
Arithmetic expressions
There are several binary operators, similar to awk:
Operator | Type | Meaning |
+ | Arithmetic | Addition |
- | Arithmetic | Subtraction |
* | Arithmetic | Multiplication |
/ | Arithmetic | Division |
% | Arithmetic | Modulo |
Using variables with the value of “7” and “3”, perl returns the following results for each operator when using the print command:
Expression | Result |
7+3 | 10 |
7-3 | 4 |
7*3 | 21 |
7/3 | 2.33333333333333 |
7%3 | 1 |
If you’ve never worked with a programming language before, the %
or modulus operator returns the remainder after performing integer division. print 7/3
will output a floating point number if necessary. When concatenating two numbers using a “.
” character, perl will convert them to strings automatically (ex. 7 . 3
would result in the string "73"
). Perl, unlike C, only has 3 main variable types: scalars ($var
), arrays (@var
) and hashes (%var
). Scalars can be strings, integers, floating points or references to other types. Arrays and Hashes may contain scalars such as strings, integers, floats or references to arrays and/or hashes. See man perlintro
for more information.
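A quick illustration of the arithmetic and concatenation behaviour described above:
perl -e 'my $x = 7; my $y = 3; print $x % $y, "\n"; print $x / $y, "\n"; print $x . $y, "\n";'
# prints 1, 2.33333333333333 and 73 on separate lines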
Unary arithmetic operators
The “+” and “-“ operators can be used before variables and numbers. If X equals 4, then the statement:
my $x=4;print -$x;
This will result in the output of “-4”.
The autoincrement and autodecrement operators
Perl also supports the “++
” and “--
” operators of C. Both increment or decrement variables by one. The operator can only be used with a single variable, and can be before or after the variable:
my $x=4;print $x++," ",++$x;
This would print the numbers 4 and 6. These operators are also assignment operators, and can be used by themselves on a line:
$x++;
--$y;
Assignment operators
Variables can be assigned new values with the assignment operators. Knowing “++
” and “--
”, the other assignment statement is simply:
my $variable = arithmetic expression
Certain operators have precedence over others. Parentheses can be used to control grouping of operations. The statement:
my $x=1+2*3 . 4;print $x;
Is the same as:
my $x = (1 + (2 * 3)) . "4";print $x;
Both result in “74”. For more information about operator precedence, see man perlop
.
Notice spaces can be added for readability. Perl, like awk, has special assignment operators, which combine a calculation with an assignment. Instead of saying:
$x=$x+2;
One can instead say:
$x+=2;
Operator | Meaning |
+= | Add result to variable |
-= | Subtract result from variable |
*= | Multiply variable by result |
/= | Divide variable by result |
%= | Apply modulo to variable |
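A tiny example of the compound assignment operators in action:
perl -e 'my $x = 7; $x += 2; $x *= 3; $x %= 5; print $x, "\n";'
# ((7 + 2) * 3) % 5 = 27 % 5 = 2, so this prints 2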
Conditional expressions
Perl can also handle conditional expressions and has robust logical operators to do so. if, unless, while, etc. evaluate an expression to true or false. A value of 0 (or the string "0", an empty string, or undef) is evaluated to false while any other value is evaluated to true.
Operator | Meaning |
== | is numerically equal |
!= | is not numerically equal |
> | is numerically greater than |
>= | is numerically greater than or equal to |
< | is numerically less than |
<= | is numerically less than or equal to |
eq | is stringwise equal |
ne | is not stringwise equal |
gt | is stringwise greater than |
ge | is stringwise greater than or equal to |
lt | is stringwise less than |
le | is stringwise less than or equal to |
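The difference between the numeric and stringwise operators matters more than it might first appear. A small sketch:
perl -e 'print "10" == "10.0" ? "numerically equal\n" : "numerically different\n";'
# prints "numerically equal" because both strings become the number 10
perl -e 'print "10" eq "10.0" ? "stringwise equal\n" : "stringwise different\n";'
# prints "stringwise different" because the strings are compared character by character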
Regular expressions
Two operators are used to compare strings and regular expressions:
Operator | Meaning |
=~ | left-side matches provided right-side regular expression |
!~ | left-side does not match provided right-side regular expression |
NOTE: smart match ~~
also exists and can do a lot of things but its behaviour is pretty unpredictable as it has changed many times since it was first introduced in perl. It’s probably best just to avoid it.
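For example, to print the lines of /etc/passwd that match (and then those that don't match) the pattern /nologin/:
perl -ne 'print if $_ =~ /nologin/' /etc/passwd
perl -ne 'print if $_ !~ /nologin/' /etc/passwd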
Compound conditional expressions
Multiple conditional expressions can be compounded. One can combine two conditional expressions with the “and” (&&) or “or” (||) operators. One can also just type the English words and or or; however, they have much lower precedence than their symbolic equivalents. Additionally, truthiness can be inverted using the ! character in comparisons.
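Putting this together with autosplit, the following sketch prints accounts from /etc/passwd with a UID of at least 1000 (a common, though not universal, cutoff for regular users) and a shell that isn't nologin:
perl -F':' -lane 'print $F[0] if $F[2] >= 1000 && $F[-1] !~ /nologin/' /etc/passwd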
Perl Builtin Variables
Perl has many builtin variables that are useful for text processing. The builtin variables are all very symbolic but typically have awk-like equivalents if use English;
is set. See man perlvar
for a list of all pre-defined variables. The following table describes a number of variables useful for creating awk-like programs:
Variable | use English | Meaning |
<> | N/A | standard input |
$_ | $ARG | Default pattern searching space |
@_ | @ARG | Array containing all arguments passed to a subroutine |
$$ | $PID | Process id of current executing program |
$0 | $PROGRAM_NAME | Name of the current executing program |
$, | $OFS | Output field separator. Default is undef |
$. | $INPUT_LINE_NUMBER or $NR | Current line number of last file handle or stream accessed |
$/ | $INPUT_RECORD_SEPARATOR or $RS | Input record separator. Defaults to newline character |
$\ | $OUTPUT_RECORD_SEPARATOR or $ORS | Output record separator. Defaults to undef |
NOTE: Unlike awk, there is no $FS or default “input field separator” variable. The input field separator in perl is actually the default split pattern which defaults to the regular expression /\s+/
on the command line when the -a
option is specified or in a perl program when split is called without specifying a pattern. The field separator can be specified on the command-line by specifying a pattern for the -F
option and in a program as the first argument to the split function.
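As a small example of the English aliases, this one-liner numbers the lines of a file using $NR and $ARG instead of $. and $_:
perl -MEnglish -ne 'print "$NR: $ARG"' FILE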
The default split pattern
Perl can easily parse many system administration files; however, many of these files do not use the /\s+/ pattern as a delimiter. Many of them use colons ”:” instead. As with awk, one can easily set the field separator to a ”:” character with the -F option:
perl -F':' -lae 'print $F[0].":no shell!" if $F[-1] eq "/usr/sbin/nologin"' /etc/passwd
The equivalent perl program would be:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
    chomp;
    my @F = split /:/;
    print $F[0] . ":no shell!\n" if $F[-1] eq "/usr/sbin/nologin";
}
The new script could then be executed using ./script.pl /etc/passwd
.
An advantage of writing a script is that you could use multiple patterns with the split
function. This would be somewhat rare but given the following file:
ONE 1 I
TWO 2 II
#START
THREE:3:III
FOUR:4:IV
FIVE:5:V
#STOP
SIX 6 VI
SEVEN 7 VII
You could process it with a perl script such as the following:
#!/usr/bin/env perl
use strict;
use warnings;
my $FS = qr/\s+/;
while (<>) {
    chomp;
    if ($_ eq '#START') {
        $FS = qr/:/;
    } elsif ($_ eq '#STOP') {
        $FS = qr/\s+/;
    } else {
        # print the Roman number in column 3
        my @F = split /$FS/;
        print $F[2] . "\n";
    }
}
In the above script we set a variable $FS
that stores a default pattern /\s+/
. It’s enclosed in qr/.../
which is used to quote regex. When we reach the line in the file matching ‘#START’ we proceed to change the $FS pattern to /:/
and when the line matches ‘#STOP’ we change the $FS pattern back to /\s+/
. When neither ‘#START’ nor ‘#STOP’ is matched, we proceed to split the line using the pattern we stored in $FS
and print the value of the 3rd column (element 2 of our @F
array).
NOTE: the use of eq
instead of ==
for comparing strings. Don’t make the same mistake I did when writing this example lol…
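To see why == is the wrong tool for strings: in numeric context most non-numeric strings silently become 0, so two completely different strings can compare as "equal". With warnings enabled, perl will at least complain:
perl -Mwarnings -e 'print "equal?!\n" if "/bin/bash" == "/usr/sbin/nologin";'
# prints "equal?!" and warns that the arguments aren't numeric, because both strings evaluate to 0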
NOTE: Unlike awk
, $FS
is not a special perl variable. We’ve simply set it to store a pattern for use in our program. One cannot override the default split pattern in perl (except when using the -F
option on the command-line). $FS
is used here to mimic awk
. We could have just as easily stored this pattern in any variable.
The Output Field Separator Variable
Before starting, please note that there is a major difference between the following:
print "one"."two";
print "one","two";
The first line, print "one"."two";, concatenates the words “one” and “two”, while the second sends the words “one” and “two” as separate arguments to print. The output field separator variable will not affect the first example because it is concatenating two strings as opposed to passing two strings as arguments.
Let’s say you wanted to copy the /etc/shadow
file but exclude all the hashed passwords. This could be accomplished on the command-line with the following:
sudo perl -F':' -ae 'BEGIN{$,=":"} $F[1]=""; print @F;' /etc/shadow
An equivalent perl program would be:
#!/usr/bin/env perl
use strict;
use warnings;
$, = ':';
while (<>){
chomp;
my @F = split /:/,$_,-1;
$F[1] = "";
print @F;print "\n";
}
In the one-liner we excluded the -l option: it isn’t required here, and the chomp it performs would cause split to drop the trailing empty fields. In the script version, we specified a -1 argument to the split function in order for it to not drop trailing empty fields. We also used two print statements to avoid adding an extra colon character, which would occur if we passed the “\n” new-line character as another argument to the print statement that prints the array.
Getting the number of fields
Getting the number of fields in perl is easy. It’s equal to the number of elements in the @F array (the highest index plus 1). This can be retrieved automatically by evaluating the array in scalar context using the scalar function. For example:
perl -F':' -ae 'print scalar @F,"\n"' /etc/passwd | head -n1
The above example results in 7.
Getting the last field
The last field of a line (last element of an array of a split line) can easily be retrieved by using a -1 array index.
perl -F':' -ae 'print $F[-1],"\n"' /etc/passwd | head -n1
On my system the first line of /etc/passwd
is “root:x:0:0:root:/root:/bin/bash” so the above example returns “/bin/bash”.
The current line variable
$.
holds the current line number. Therefore, to get the number of lines in a file, emulating the behaviour of wc -l FILE
, you could use:
perl -ne 'END{print $.," ".$ARGV,"\n"}' FILE
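A script version of the same idea might look like this (assuming a single FILE argument; $. and $ARGV keep their values after the loop finishes):
#!/usr/bin/env perl
use strict;
use warnings;
# read every line just to advance the line counter
while (<>) {}
print $., " ", $ARGV, "\n";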
The record separator variable
Normally, perl does not slurp files or streams; it reads them one line at a time. The default record separator variable is $/. You can enable a slurp-like behaviour by undefining this variable. The command-line version would look something like this:
# print the second and third line of a file
perl -e 'undef $/;@F=split /\n/,<>; print $F[1],"\n";print $F[2],"\n"' FILE
Notice in the above example, we did not use the -a
option. That’s because it will enforce reading the file line-by-line which we do not want. In a script, the example would look something like this:
#!/usr/bin/env perl
use strict;
use warnings;
undef $/; # Enable slurp
my @F = split /\n/,<>;
# Print the second and third lines
print $F[1],"\n";
print $F[2],"\n";
The output record separator variable
The default output record separator variable, unlike awk’s, is undefined. It’s common (if not using newer features like the say function in place of print) to set $\ to the newline character “\n”.
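For example, a minimal sketch where every print ends the line for you, much like awk’s print:
perl -e '$\ = "\n"; print "first"; print "second";'
# prints "first" and "second" on separate lines with no explicit "\n"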
The current filename variable
When using command-line perl, the current filename can be obtained with the $ARGV
variable. For example:
perl -pe 'print "Processing file: ",$ARGV,"\n" if $. == 1' FILE
Note that I didn’t use the special BEGIN
function here because it’s executed even before $ARGV
is set. The equivalent script would look something like the following:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
    print "Processing file: " . $ARGV . "\n" if $. == 1;
    print;
}
Note that once again, $ARGV
is printed when $. == 1
. This is because $ARGV
is not initialized until STDIN (<>
) is read.
I think this is a good place to leave things. While I haven’t covered every single topic in the awk grymoire, I think that the remaining topics are not that commonly used in day-to-day system administration as they cover things like Trigonometry, strftime
for date formatting (which is incredibly useful but I’ve never seen anyone use awk for this. See perldoc POSIX
for how to use strftime
in perl as it’s more portable than the date
command), arrays (which we’ve covered pretty extensively here since perl relies on them), etc. To continue learning perl scripting for things other than emulating awk, see man perlintro
.
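Since strftime came up, here’s a small sketch of formatting the current time with POSIX’s strftime instead of shelling out to date:
perl -MPOSIX=strftime -le 'print strftime("%Y-%m-%d %H:%M:%S", localtime)'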