[Appendix B] B.2 Language Summary for awk

This section summarizes how awk processes input records and describes the various syntactic elements that make up an awk program.

B.2.1 Records and Fields

Each line of input is split into fields. By default, the field delimiter is one or more spaces and/or tabs. You can change the field separator by using the -F command-line option. Doing so also sets the value of FS. The following command line changes the field separator to a colon:

awk -F: -f awkscr /etc/passwd

You can also assign the delimiter to the system variable FS. This is typically done in the BEGIN procedure, but FS can also be set by a variable assignment passed as a parameter on the command line:

awk -f awkscr FS=: /etc/passwd
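The three ways of setting the field separator can be compared side by side. The following is a minimal sketch, assuming a POSIX-compatible awk; the passwd-style sample line and the shell variable names (line, a, b, c) are made up for illustration.

```shell
# Sample passwd-style line (made up for illustration).
line='root:x:0:0:root:/root:/bin/sh'

# 1. The -F option on the command line.
a=$(printf '%s\n' "$line" | awk -F: '{ print $1 }')

# 2. Assigning FS in a BEGIN procedure.
b=$(printf '%s\n' "$line" | awk 'BEGIN { FS = ":" } { print $7 }')

# 3. A variable assignment given as a parameter before the input
#    ("-" stands for standard input).
c=$(printf '%s\n' "$line" | awk '{ print NF }' FS=: -)

echo "$a $b $c"    # prints: root /bin/sh 7
```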

Each input line forms a record containing any number of fields. Each field can be referenced by its position in the record. "$1" refers to the value of the first field; "$2" to the second field, and so on. "$0" refers to the entire record. The following action prints the first field of each input line:

{ print $1 }

The default record separator is a newline. The following procedure sets FS and RS so that awk interprets an input record as any number of lines up to a blank line, with each line being a separate field.

BEGIN { FS = "\n"; RS = "" }

It is important to know that when RS is set to the empty string, newline always separates fields, in addition to whatever value FS may have. This is discussed in more detail in both The AWK Programming Language and Effective AWK Programming.
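As a rough illustration of multiline records, the following sketch feeds two blank-line-separated records to awk. The names and addresses are invented sample data, and the shell variable out exists only to capture the result.

```shell
# Two records separated by a blank line; each line becomes one field.
out=$(printf 'Ann\n1 Elm St\n\nBob\n2 Oak Ave\n' |
  awk 'BEGIN { FS = "\n"; RS = "" }
       { print NR ": " $1 " has " NF " fields" }')

echo "$out"
# prints:
# 1: Ann has 2 fields
# 2: Bob has 2 fields
```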

B.2.2 Format of a Script

An awk script is a set of pattern-matching rules and actions:

pattern { action }

An action is one or more statements that will be performed on those input lines that match the pattern. If no pattern is specified, the action is performed for every input line. The following example uses the print statement to print each line in the input file:

{ print }

If only a pattern is specified, then the default action consists of the print statement, as shown above.
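A pattern-only rule can be tried directly from the shell. This is a small sketch with made-up input lines; out is just a shell variable used to hold the result.

```shell
# No action is given, so each line matching /an/ is printed.
out=$(printf 'apple\nbanana\ncherry\n' | awk '/an/')

echo "$out"    # prints: banana
```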

Function definitions can also appear:

function name (parameter list) { statements }

This syntax defines the function name, making available the list of parameters for processing in the body of the function. Variables specified in the parameter list are treated as local variables within the function. All other variables are global and can be accessed outside the function. When calling a user-defined function, no space is permitted between the name of the function and the opening parenthesis. Spaces are allowed in the function's definition. User-defined functions are described in Chapter 9, Functions.
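The scoping rule can be seen in a short sketch. The function total() and its extra parameters i and sum are hypothetical names chosen for this example; listing extra parameters is one way to obtain local variables, since anything in the parameter list is local.

```shell
out=$(awk '
# Hypothetical helper: "i" and "sum" are extra parameters used as locals.
function total(n,    i, sum) {
    for (i = 1; i <= n; i++)
        sum += i
    return sum
}
BEGIN {
    sum = "global"          # a global with the same name as the local
    print total(4), sum     # the call leaves the global untouched
}')

echo "$out"    # prints: 10 global
```

Note that the call total(4) has no space before the parenthesis, while the definition does.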

B.2.2.1 Line termination

A line in an awk script is terminated by a newline or a semicolon. Using semicolons to put multiple statements on a line, while permitted, reduces the readability of most programs. Blank lines are permitted between statements.

Program control statements (do, if, for, or while) continue on the next line, where a dependent statement is listed. If multiple dependent statements are specified, they must be enclosed within braces.

if (NF > 1) {
	name = $1
	total += $2
}

You cannot use a semicolon to avoid using braces for multiple statements.

You can type a single statement over multiple lines by escaping the newline with a backslash (\). You can also break lines after any of the following characters:

, { && ||

Gawk also allows you to continue a line after either a "?" or a ":". Strings cannot be broken across a line (except in gawk, using "\" followed by a newline).
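The continuation rules might be exercised as in the following sketch, which assumes a POSIX-compatible awk and uses two made-up input lines. It breaks a line after &&, after a comma, and with a backslash.

```shell
out=$(printf '5 7\n1 2\n' | awk '
# A newline is allowed after && ...
$1 > 2 &&
$2 > 2 {
    # ... after a comma ...
    printf("%d %d\n",
           $1 + $2, $1 * $2)
    # ... and after a backslash, which escapes the newline.
    total = $1 + \
            $2
}
END { print total }')

echo "$out"
# prints:
# 12 35
# 12
```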

B.2.2.2 Comments

A comment begins with a "#" and ends with a newline. It can appear on a line by itself or at the end of a line. Comments are descriptive remarks that explain the operation of the script. Comments cannot be continued across lines by ending them with a backslash.

B.2.3 Patterns

A pattern can be any of the following:

/regular expression/
relational expression
BEGIN
END
pattern, pattern

Regular expressions use the extended set of metacharacters and must be enclosed in slashes. For a full discussion of regular expressions, see Chapter 3, Understanding Regular Expression Syntax.
Relational expressions use the relational operators listed under "Expressions" later in this appendix.
The BEGIN pattern is applied before the first line of input is read and the END pattern is applied after the last line of input is read.
Use ! to negate the match; i.e., to handle lines not matching the pattern.
You can address a range of lines, just as in sed:
pattern, pattern
Patterns, except BEGIN and END, can be expressed in compound forms using the following operators:
&& Logical And
|| Logical Or
Sun's version of nawk (SunOS 4.1.x) does not support treating regular expressions as parts of a larger Boolean expression. E.g., "/cute/ && /sweet/" or "/fast/ || /quick/" do not work.
In addition, the C conditional operator ?: (pattern ? pattern : pattern) may be used in a pattern.
Patterns can be placed in parentheses to ensure proper evaluation.
BEGIN and END patterns must be associated with actions. If multiple BEGIN and END rules are written, they are merged into a single rule before being applied.
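Several of the pattern forms listed above can be sketched as follows. The input lines and the shell variable names (data, range, compound, begins) are invented for the example, and a POSIX-compatible awk is assumed.

```shell
data='start
alpha
stop
beta'

# Range pattern: from a line matching /start/ through a line matching /stop/.
range=$(printf '%s\n' "$data" | awk '/start/, /stop/')

# Negation combined with a relational expression using &&.
compound=$(printf '%s\n' "$data" | awk '!/^st/ && length($0) > 4')

# Two BEGIN rules behave like one rule, executed in the order written.
begins=$(awk 'BEGIN { printf "one " } BEGIN { print "two" }')

echo "$range"       # prints: start, alpha, stop (one per line)
echo "$compound"    # prints: alpha
echo "$begins"      # prints: one two
```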

B.2.4 Regular Expressions

Table B.1 summarizes the regular expressions as described in Chapter 3. The metacharacters are listed in order of precedence.

Table B.1: Regular Expression Metacharacters
Special Characters	Usage
c	Matches any literal character c that is not a metacharacter.
\	Escapes any metacharacter that follows, including itself.
^	Anchors following regular expression to the beginning of string.
$	Anchors preceding regular expression to the end of string.
.	Matches any single character, including newline.
[...]	Matches any one of the class of characters enclosed between the brackets. A circumflex (^) as the first character inside brackets reverses the match to all characters except those listed in the class. A hyphen (-) is used to indicate a range of characters. The close bracket (]) as the first character in a class is a member of the class. All other metacharacters lose their meaning when specified as members of a class, except \, which can be used to escape ], even if it is not first.
r1|r2	Between two regular expressions, r1 and r2, it allows either of the regular expressions to be matched.
(r1)(r2)	Used for concatenating regular expressions.
r*	Matches any number (including zero) of the regular expression that immediately precedes it.
r+	Matches one or more occurrences of the preceding regular expression.
r?	Matches 0 or 1 occurrences of the preceding regular expression.
(r)	Used for grouping regular expressions.

Regular expressions can also make use of the escape sequences for accessing special characters, as defined in the section "Escape sequences" later in this appendix.

Note that ^ and $ work on strings; they do not match against newlines embedded in a record or string.
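The behavior of the anchors on a string with an embedded newline can be checked with a sketch like this one. The variable names s, a, b, and c are arbitrary.

```shell
out=$(awk 'BEGIN {
    s = "first\nsecond"      # one string with an embedded newline
    a = (s ~ /^second/)      # 0: "second" is not at the start of the string
    b = (s ~ /second$/)      # 1: "second" is at the end of the string
    c = (s ~ /^first/)       # 1: "first" is at the start of the string
    print a, b, c
}')

echo "$out"    # prints: 0 1 1
```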

Within a pair of brackets, POSIX allows special notations for matching non-English characters. They are described in Table B.2.

Table B.2: POSIX Character List Facilities
Notation	Facility
[.symbol.]	Collating symbols. A collating symbol is a multi-character sequence that should be treated as a unit.
[=equiv=]	Equivalence classes. An equivalence class lists a set of characters that should be considered equivalent, such as "e" and "è".
[:class:]	Character classes. Character class keywords describe different classes of characters such as alphabetic characters, control characters, and so on.
[:alnum:]	Alphanumeric characters
[:alpha:]	Alphabetic characters
[:blank:]	Space and tab characters
[:cntrl:]	Control characters
[:digit:]	Numeric characters
[:graph:]	Printable and visible (non-space) characters
[:lower:]	Lowercase characters
[:print:]	Printable characters
[:punct:]	Punctuation characters
[:space:]	Whitespace characters
[:upper:]	Uppercase characters
[:xdigit:]	Hexadecimal digits

Note that these facilities (as of this writing) are still not widely implemented.
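Where the character classes are available, they are written inside a bracket expression, giving a doubled bracket. The sketch below assumes an awk that implements them (some older versions do not) and uses made-up input.

```shell
# Keep lines made of letters optionally followed by digits.
out=$(printf 'abc123\n---\nXYZ\n' | awk '/^[[:alpha:]]+[[:digit:]]*$/')

echo "$out"
# prints:
# abc123
# XYZ
```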

B.2.5 Expressions

An expression can be made up of constants, variables, operators and functions. A constant is a string (any sequence of characters) or a numeric value. A variable is a symbol that references a value. You can think of it as a piece of information that retrieves a particular numeric or string value.

B.2.5.1 Constants

There are two types of constants, string and numeric. A string constant must be quoted while a numeric constant is not.

B.2.5.2 Escape sequences

The escape sequences described in Table B.3 can be used in strings and regular expressions.

Table B.3: Escape Sequences
Sequence	Description
\a	Alert character, usually ASCII BEL character
\b	Backspace
\f	Formfeed
\n	Newline
\r	Carriage return
\t	Horizontal tab
\v	Vertical tab
\ddd	Character represented as 1 to 3 digit octal value
\xhex	Character represented as hexadecimal value[1]
\c	Any literal character c (e.g., `\"` for `"`)[2]

[1] POSIX does not provide "\x", but it is commonly available.
[2] Like ANSI C, POSIX leaves it purposely undefined what you get when you put a backslash before any character not listed in the table. In most awks, you just get that character.
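A few of these sequences are shown in the sketch below. It sticks to the tab, newline, octal, and escaped-quote forms; the strings are arbitrary examples.

```shell
out=$(awk 'BEGIN {
    printf "a\tb\n"          # tab and newline
    print "\101\102\103"     # octal values for A, B, and C
    print "say \"hi\""       # escaped double quotes
}')

echo "$out"
```

The first output line contains a and b separated by a tab, the second is ABC, and the third is say "hi".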

B.2.5.3 Variables

There are three kinds of variables: user-defined, built-in, and fields. By convention, the names of built-in or system variables consist of all capital letters.

The name of a variable cannot start with a digit. Otherwise, it consists of letters, digits, and underscores. Case is significant in variable names.

A variable does not need to be declared or initialized. A variable can contain either a string or numeric value. An uninitialized variable has the empty string ("") as its string value and 0 as its numeric value. Awk attempts to decide whether a value should be processed as a string or a number depending upon the operation.
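The following sketch illustrates both points: the two values of an uninitialized variable, and how the operation determines whether a value is treated as a string or a number. The variable names u, x, and y are arbitrary.

```shell
out=$(awk 'BEGIN {
    print "[" u "]", u + 0     # u is unset: empty as a string, 0 as a number
    x = "3x"; y = "10"
    print x + 1, y + 1, y 1    # addition is numeric; "y 1" is concatenation
}')

echo "$out"
# prints:
# [] 0
# 4 11 101
```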

The assignment of a variable has the form:

var = expr

It assigns the value of the expression to var. The following expression assigns a value of 1 to the variable x.

x = 1

The name of the variable is used to reference the value:

{ print x }

prints the value of the variable x. In this case, it would be 1.

See the section "System Variables" below for information on built-in variables. A field variable is referenced using $n, where n is any number 0 to NF, that references the field by position. It can be supplied by a variable, such as $NF meaning the last field, or constant, such as $1 meaning the first field.
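Field references by constant, by variable, and by expression can be sketched as follows; the input line and the variable n are made up for the example.

```shell
out=$(printf 'a b c d\n' | awk '{ n = 2; print $1, $n, $NF, $(NF - 1) }')

echo "$out"    # prints: a b d c
```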

B.2.5.4 Arrays

An array is a variable that can be used to store a set of values. The following statement assigns a value to an element of an array:

array[index] = value

In awk, all arrays are associative arrays. What makes an associative array unique is that its index can be a string or a number.

An associative array makes an "association" between the indices and the elements of an array. For each element of the array, a pair of values is maintained: the index of the element and the value of the element. Unlike the elements of a conventional array, these elements are not stored in any particular order.

You can use the special for loop to read all the elements of an associative array.

for (item in array)

The index of the array is available as item, while the value of an element of the array can be referenced as array[item].

You can use the operator in to test that an element exists by testing to see if its index exists.

if (index in array)

tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].

You can also delete individual elements of the array using the delete statement.
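The array operations described above might be combined as in this sketch, which counts made-up color names. The array name count and loop variable color are arbitrary.

```shell
out=$(printf 'red\nblue\nred\nred\n' | awk '
{ count[$1]++ }                  # the index is a string
END {
    delete count["blue"]         # remove one element
    if ("blue" in count)
        print "blue still present"
    if (!("green" in count))
        print "no green"
    for (color in count)         # visits the remaining indices
        print color, count[color]
}')

echo "$out"
# prints:
# no green
# red 3
```

Only one element remains after the delete, so the output order is predictable here; with several elements, the order of the for loop is not.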

B.2.5.5 System variables

Awk defines a number of special variables that can be referenced or reset inside a program, as shown in Table B.4 (defaults are listed in parentheses).

Table B.4: Awk System Variables
Variable	Description
ARGC	Number of arguments on command line
ARGV	An array containing the command-line arguments
CONVFMT	String conversion format for numbers (%.6g). (POSIX)
ENVIRON	An associative array of environment variables
FILENAME	Current filename
FNR	Like `NR`, but relative to the current file
FS	Field separator (a blank)
NF	Number of fields in current record
NR	Number of the current record
OFMT	Output format for numbers (%.6g)
OFS	Output field separator (a blank)
ORS	Output record separator (a newline)
RLENGTH	Length of the string matched by `match()` function
RS	Record separator (a newline)
RSTART	First position in the string matched by `match()` function
SUBSEP	Separator character for array subscripts (\034)
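As a brief sketch of a few of these variables in use, the following resets OFS and prints NR, NF, and the first field of each record. The two input lines are invented.

```shell
out=$(printf 'a b c\nd e\n' |
  awk 'BEGIN { OFS = "-" } { print NR, NF, $1 }')

echo "$out"
# prints:
# 1-3-a
# 2-2-d
```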

B.2.5.6 Operators

Table B.5 lists the operators available in awk in order of precedence, from lowest to highest.

Table B.5: Operators
Operators	Description
= += -= *= /= %= ^= **=	Assignment
?:	C conditional expression
||	Logical OR
&&	Logical AND
~ !~	Match regular expression and negation
< <= > >= != ==	Relational operators
(blank)	Concatenation
+ -	Addition, subtraction
* / %	Multiplication, division, and modulus
+ - !	Unary plus and minus, and logical negation
^ **	Exponentiation
++ --	Increment and decrement, either prefix or postfix
$	Field reference

NOTE: While "**" and "**=" are common extensions, they are not part of POSIX awk.
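A few consequences of this ordering can be checked with the sketch below, which assumes a POSIX-compatible awk. It uses only ^ for exponentiation; the right-to-left grouping of ^ is standard awk behavior rather than something the table itself states.

```shell
out=$(awk 'BEGIN {
    print 2 + 3 * 4        # * binds tighter than +
    print 1 " " 2 + 3      # + binds tighter than concatenation
    print -2 ^ 2           # ^ binds tighter than unary minus
    print 2 ^ 3 ^ 2        # ^ groups right to left: 2 ^ 9
}')

echo "$out"
# prints:
# 14
# 1 5
# -4
# 512
```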

B.2.6 Statements and Functions

An action is enclosed in braces and consists of one or more statements and/or expressions. The difference between a statement and a function is that a function returns a value, and its argument list is specified within parentheses. (The formal syntactical difference does not always hold true: printf is considered a statement, but its argument list can be put in parentheses; getline is a function that does not use parentheses.)

Awk has a number of predefined arithmetic and string functions. A function is typically called as follows:

return = function(arg1,arg2)

where return is a variable created to hold what the function returns. (In fact, the return value of a function can be used anywhere in an expression, not just on the right-hand side of an assignment.) Arguments to a function are specified as a comma-separated list. The left parenthesis follows after the name of the function. (With built-in functions, a space is permitted between the function name and the parentheses.)
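The calling conventions can be sketched with three built-in functions. The variable n and the sample strings are chosen only for illustration.

```shell
out=$(awk 'BEGIN {
    n = length("awk")      # return value assigned to a variable
    # Return values used directly inside a larger expression list.
    print n, toupper(substr("sed & awk", 7)), int(7 / 2)
}')

echo "$out"    # prints: 3 AWK 3
```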

