This section summarizes how awk processes input records and describes the various syntactic elements that make up an awk program.
Each line of input is split into fields. By default, the field delimiter is one or more spaces and/or tabs. You can change the field separator by using the -F command-line option. Doing so also sets the value of FS. The following command line changes the field separator to a colon:
awk -F: -f awkscr /etc/passwd
You can also assign the delimiter to the system variable FS. This is typically done in the BEGIN procedure, but can also be passed as a parameter on the command line.
awk -f awkscr FS=: /etc/passwd
Each input line forms a record containing any number of fields. Each field can be referenced by its position in the record. "$1" refers to the value of the first field; "$2" to the second field, and so on. "$0" refers to the entire record. The following action prints the first field of each input line:
{ print $1 }
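As a rough sketch of the three ways of setting the separator described above, the following commands should all print the same first fields. The sample data is hypothetical and stands in for /etc/passwd; the trailing "-" tells awk to read standard input.

```shell
# Hypothetical sample data standing in for /etc/passwd.
sample='root:x:0:0:root:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/false'

# Three ways to make the colon the field separator.
by_option=$(printf '%s\n' "$sample" | awk -F: '{ print $1 }')
by_begin=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = ":" } { print $1 }')
by_assign=$(printf '%s\n' "$sample" | awk '{ print $1 }' FS=: -)

printf '%s\n' "$by_option"
```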
The default record separator is a newline. The following procedure sets FS and RS so that awk interprets an input record as any number of lines up to a blank line, with each line being a separate field.
BEGIN { FS = "\n"; RS = "" }
It is important to know that when RS is set to the empty string, newline always separates fields, in addition to whatever value FS may have. This is discussed in more detail in both The AWK Programming Language and Effective AWK Programming.
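A minimal sketch of this multiline-record setup, using two made-up address blocks separated by a blank line, might look like this:

```shell
# Two hypothetical address records separated by a blank line.
records=$(printf 'John Doe\n12 Elm St\nBoston\n\nJane Roe\n9 Oak Ave\nAustin\n' |
    awk 'BEGIN { FS = "\n"; RS = "" } { print NR ": " $1 " lives in " $NF }')

printf '%s\n' "$records"
```

Each block of lines becomes one record, so NR counts blocks rather than lines, and $NF is the last line of the block.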
An awk script is a set of pattern-matching rules and actions:
pattern { action }
An action is one or more statements that will be performed on those input lines that match the pattern. If no pattern is specified, the action is performed for every input line. The following example uses the print statement to print each line in the input file:
{ print }
If only a pattern is specified, then the default action consists of the print statement, as shown above.
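The three rule forms (pattern only, action only, and both) can be compared side by side. This is only an illustration; the input lines are invented.

```shell
input='alpha 1
beta 22
gamma 3'

# Pattern only: matching lines are printed by the default action.
pattern_only=$(printf '%s\n' "$input" | awk '$2 > 2')
# Action only: the action runs for every line.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')
# Pattern and action: the action runs only for matching lines.
both=$(printf '%s\n' "$input" | awk '$2 > 2 { print $1 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```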
Function definitions can also appear:
function name (parameter list) { statements }
This syntax defines the function name, making available the list of parameters for processing in the body of the function. Variables specified in the parameter-list are treated as local variables within the function. All other variables are global and can be accessed outside the function. When calling a user-defined function, no space is permitted between the name of the function and the opening parenthesis. Spaces are allowed in the function's definition. User-defined functions are described in Chapter 9, Functions.
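The scoping rules can be sketched with a small function. The name max and the variable tmp are hypothetical; listing tmp as an extra parameter is the usual way to get a local variable.

```shell
result=$(awk '
# max() is a hypothetical helper; tmp is an extra parameter used as a local.
function max(a, b,    tmp) {
    tmp = (a > b) ? a : b
    return tmp
}
BEGIN {
    tmp = "global"
    print max(3, 7)      # no space between the name and the parenthesis
    print tmp            # the global tmp is untouched by the call
}')

printf '%s\n' "$result"
```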
A line in an awk script is terminated by a newline or a semicolon. Using semicolons to put multiple statements on a line, while permitted, reduces the readability of most programs. Blank lines are permitted between statements.
Program control statements (do, if, for, or while) continue on the next line, where a dependent statement is listed. If multiple dependent statements are specified, they must be enclosed within braces.
if (NF > 1) {
    name = $1
    total += $2
}
You cannot use a semicolon to avoid using braces for multiple statements.
You can type a single statement over multiple lines by escaping the newline with a backslash (\). You can also break a line after any of the following:
, { && ||
Gawk also allows you to continue a line after either a "?" or a ":". Strings cannot be broken across a line (except in gawk, using "\" followed by a newline).
A comment begins with a "#" and ends with a newline. It can appear on a line by itself or at the end of a line. Comments are descriptive remarks that explain the operation of the script. Comments cannot be continued across lines by ending them with a backslash.
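The line-breaking rules can be illustrated with a short script. This is a sketch only; the input data and the variable total are made up.

```shell
out=$(printf '5 apples\n0 pears\n12 plums\n' | awk '
# Lines may break after a comma, an open brace, && or ||.
$1 > 0 &&
$1 < 10 {
    printf "%s %s\n",
        $2, $1
}
# A backslash continues a statement across a newline.
$1 >= 10 { total = $1 \
    + 100; print $2, total }')

printf '%s\n' "$out"
```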
A pattern can be any of the following:
/regular expression/
relational expression
BEGIN
END
pattern, pattern
Regular expressions use the extended set of metacharacters and must be enclosed in slashes. For a full discussion of regular expressions, see Chapter 3, Understanding Regular Expression Syntax.
Relational expressions use the relational operators listed under "Expressions" later in this chapter.
The BEGIN pattern is applied before the first line of input is read and the END pattern is applied after the last line of input is read.
Use ! to negate the match; i.e., to handle lines not matching the pattern.
You can address a range of lines, just as in sed:
pattern, pattern
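As an illustration of range patterns and negation, the following sketch uses invented marker lines:

```shell
text='intro
START
body one
body two
STOP
outro'

# Range: print from the line matching the first pattern through the
# line matching the second pattern, inclusive.
section=$(printf '%s\n' "$text" | awk '/^START$/, /^STOP$/')
# Negation: print the lines that do not match.
not_body=$(printf '%s\n' "$text" | awk '!/^body/')

printf '%s\n' "$section"
```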
Patterns, except BEGIN and END, can be expressed in compound forms using the following operators:
&& | Logical And |
|| | Logical Or |
Sun's version of nawk (SunOS 4.1.x) does not support treating regular expressions as parts of a larger Boolean expression. E.g., "/cute/ && /sweet/" or "/fast/ || /quick/" do not work.
In addition, the C conditional operator ?: (pattern ? pattern : pattern) may be used in a pattern.
Patterns can be placed in parentheses to ensure proper evaluation.
BEGIN and END patterns must be associated with actions. If multiple BEGIN and END rules are written, they are merged into a single rule before being applied.
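The compound forms and the handling of multiple BEGIN rules can be sketched as follows, assuming an awk that supports regular expressions inside Boolean expressions. The input words are arbitrary.

```shell
out=$(printf 'cute sweet\ncute sour\nfast\n' | awk '
BEGIN { print "first BEGIN" }
BEGIN { print "second BEGIN" }
/cute/ && /sweet/           { print "both: " $0 }
/fast/ || /sour/            { print "either: " $0 }
(NR == 1) ? /sour/ : /fast/ { print "conditional: " $0 }
END { print NR " lines" }')

printf '%s\n' "$out"
```

The two BEGIN rules run in the order written, as if they were one rule.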
Table 13.2 summarizes the regular expressions as described in Chapter 3. The metacharacters are listed in order of precedence.
Special Characters | Usage |
---|---|
c | Matches any literal character c that is not a metacharacter. |
\ | Escapes any metacharacter that follows, including itself. |
^ | Anchors following regular expression to the beginning of string. |
$ | Anchors preceding regular expression to the end of string. |
. | Matches any single character, including newline. |
[...] | Matches any one of the class of characters enclosed between the brackets. A circumflex (^) as the first character inside brackets reverses the match to all characters except those listed in the class. A hyphen (-) is used to indicate a range of characters. The close bracket (]) as the first character in a class is a member of the class. All other metacharacters lose their meaning when specified as members of a class, except \, which can be used to escape ], even if it is not first. |
r1|r2 | Between two regular expressions, r1 and r2, it allows either of the regular expressions to be matched. |
(r1)(r2) | Used for concatenating regular expressions. |
r* | Matches any number (including zero) of the regular expression that immediately precedes it. |
r+ | Matches one or more occurrences of the preceding regular expression. |
r? | Matches 0 or 1 occurrences of the preceding regular expression. |
(r) | Used for grouping regular expressions. |
Regular expressions can also make use of the escape sequences for accessing special characters, as defined in the section "Escape sequences" later in this appendix.
Note that ^ and $ work on strings; they do not match against newlines embedded in a record or string.
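A few of the metacharacters from the table can be tried out on a small word list. The words are arbitrary and the sketch is not exhaustive.

```shell
words='cat
cot
cart
coat
ct'

# Bracket class with ?: an optional "a" or "o" between "c" and "t".
optional=$(printf '%s\n' "$words" | awk '/^c[ao]?t$/')
# Grouping, alternation, and *: "a" or "oa", then any number of "r".
alternation=$(printf '%s\n' "$words" | awk '/^c(a|oa)r*t$/')
# ^ anchors to the start of the string, not to an embedded newline.
anchored=$(awk 'BEGIN { s = "one\ntwo"; print ((s ~ /^two/) ? "match" : "no match") }')

printf '%s\n' "$optional" "$alternation" "$anchored"
```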
Within a pair of brackets, POSIX allows special notations for matching non-English characters. They are described in Table 13.3.
Notation | Facility |
---|---|
[.symbol.] | Collating symbols. A collating symbol is a multi-character sequence that should be treated as a unit. |
[=equiv=] | Equivalence classes. An equivalence class lists a set of characters that should be considered equivalent, such as "e" and "è". |
[:class:] | Character classes. Character class keywords describe different classes of characters such as alphabetic characters, control characters, and so on. |
[:alnum:] | Alphanumeric characters |
[:alpha:] | Alphabetic characters |
[:blank:] | Space and tab characters |
[:cntrl:] | Control characters |
[:digit:] | Numeric characters |
[:graph:] | Printable and visible (non-space) characters |
[:lower:] | Lowercase characters |
[:print:] | Printable characters |
[:punct:] | Punctuation characters |
[:space:] | Whitespace characters |
[:upper:] | Uppercase characters |
[:xdigit:] | Hexadecimal digits |
Note that these facilities (as of this writing) are still not widely implemented.
An expression can be made up of constants, variables, operators and functions. A constant is a string (any sequence of characters) or a numeric value. A variable is a symbol that references a value. You can think of it as a piece of information that retrieves a particular numeric or string value.
There are two types of constants, string and numeric. A string constant must be quoted while a numeric constant is not.
The escape sequences described in Table 13.4 can be used in strings and regular expressions.
Sequence | Description |
---|---|
\a | Alert character, usually ASCII BEL character |
\b | Backspace |
\f | Formfeed |
\n | Newline |
\r | Carriage return |
\t | Horizontal tab |
\v | Vertical tab |
\ddd | Character represented as 1 to 3 digit octal value |
\xhex | Character represented as hexadecimal value[1] |
\c | Any literal character c (e.g., \" for ")[2] |
[1] POSIX does not provide "\x", but it is commonly available.
[2] Like ANSI C, POSIX leaves it purposely undefined what you get when you put a backslash before any character not listed in the table. In most awks, you just get that character.
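As a brief illustration, a few of these escape sequences can be used in string constants like this (assuming an ASCII-based system, where octal 101 and 102 are "A" and "B"):

```shell
out=$(awk 'BEGIN {
    printf "a\tb\n"        # horizontal tab and newline
    printf "\101\102\n"    # octal values for A and B in ASCII
    print "say \"hi\""     # escaped double quotes
}')

printf '%s\n' "$out"
```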
There are three kinds of variables: user-defined, built-in, and fields. By convention, the names of built-in or system variables consist of all capital letters.
The name of a variable cannot start with a digit. Otherwise, it consists of letters, digits, and underscores. Case is significant in variable names.
A variable does not need to be declared or initialized. A variable can contain either a string or numeric value. An uninitialized variable has the empty string ("") as its string value and 0 as its numeric value. Awk attempts to decide whether a value should be processed as a string or a number depending upon the operation.
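The following sketch shows how an uninitialized variable behaves and how the operation determines whether a value is treated as a string or a number. The variable names are placeholders.

```shell
out=$(awk 'BEGIN {
    print "[" unset_var "]"      # empty string in a string context
    print unset_var + 0          # 0 in a numeric context
    x = "3" + 4                  # arithmetic treats "3" as a number
    y = 3 " " 4                  # concatenation treats 3 and 4 as strings
    print x
    print y
}')

printf '%s\n' "$out"
```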
The assignment of a variable has the form:
var = expr
It assigns the value of the expression to var. The following expression assigns a value of 1 to the variable x.
x = 1
The name of the variable is used to reference the value:
{ print x }
prints the value of the variable x. In this case, it would be 1.
See the section "System Variables" below for information on built-in variables. A field variable is referenced using $n, where n is any number 0 to NF, that references the field by position. It can be supplied by a variable, such as $NF meaning the last field, or constant, such as $1 meaning the first field.
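Field references by constant, by variable, and by expression can be sketched on a single made-up input line:

```shell
# n is a hypothetical variable holding a field position.
out=$(printf 'a b c d\n' | awk '{ n = 2; print $NF, $(NF - 1), $n, NF }')

printf '%s\n' "$out"
```

Here $NF is the last field, $(NF - 1) the one before it, and $n the field whose position is stored in n.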
An array is a variable that can be used to store a set of values. The following statement assigns a value to an element of an array:
array[index] = value
In awk, all arrays are associative arrays. What makes an associative array unique is that its index can be a string or a number.
An associative array makes an "association" between the indices and the elements of an array. For each element of the array, a pair of values is maintained: the index of the element and the value of the element. Unlike the elements of a conventional array, the elements are not stored in any particular order.
You can use the special for loop to read all the elements of an associative array.
for (item in array)
The index of the array is available as item, while the value of an element of the array can be referenced as array[item].
You can use the operator in to test that an element exists by testing to see if its index exists.
if (index in array)
tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].
You can also delete individual elements of the array using the delete statement (delete array[index]).
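The array operations above (assignment, the in operator, the for loop, and delete) can be combined in one small sketch. The array name count and the input words are illustrative.

```shell
out=$(printf 'apple\npear\napple\nfig\napple\n' | awk '
{ count[$1]++ }                  # the index is the string in the first field
END {
    delete count["fig"]
    if ("pear" in count) print "pear seen", count["pear"]
    if (!("fig" in count)) print "fig deleted"
    print "apple seen", count["apple"]
    n = 0
    for (item in count) n++      # visits each index in no particular order
    print n, "distinct items left"
}')

printf '%s\n' "$out"
```

Because the traversal order is not defined, the loop here only counts the elements rather than printing them.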
Awk defines a number of special variables that can be referenced or reset inside a program, as shown in Table 13.5 (defaults are listed in parentheses).
Variable | Description |
---|---|
ARGC | Number of arguments on command line |
ARGV | An array containing the command-line arguments |
CONVFMT | String conversion format for numbers (%.6g). (POSIX) |
ENVIRON | An associative array of environment variables |
FILENAME | Current filename |
FNR | Like NR, but relative to the current file |
FS | Field separator (a blank) |
NF | Number of fields in current record |
NR | Number of the current record |
OFMT | Output format for numbers (%.6g) |
OFS | Output field separator (a blank) |
ORS | Output record separator (a newline) |
RLENGTH | Length of the string matched by match() function |
RS | Record separator (a newline) |
RSTART | First position in the string matched by match() function |
SUBSEP | Separator character for array subscripts (\034) |
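A short sketch of resetting some of these variables follows. The input is made up, and the assignment $1 = $1 is there only to force awk to rebuild the record with the new OFS.

```shell
out=$(printf 'a:b:c\nd:e\n' | awk '
BEGIN { FS = ":"; OFS = "-"; ORS = ";\n" }
{ $1 = $1; print NR, NF, $0 }')

# SUBSEP joins the parts of a multi-part subscript into a single index.
pair=$(awk 'BEGIN {
    a["x", "y"] = 1
    for (k in a) { split(k, parts, SUBSEP); print parts[1], parts[2] }
}')

printf '%s\n' "$out" "$pair"
```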
Table 13.6 lists the operators available in awk in order of precedence (low to high).
Operators | Description |
---|---|
= += -= *= /= %= ^= **= | Assignment |
?: | C conditional expression |
|| | Logical OR |
&& | Logical AND |
~ !~ | Match regular expression and negation |
< <= > >= != == | Relational operators |
(blank) | Concatenation |
+ - | Addition, subtraction |
* / % | Multiplication, division, and modulus |
+ - ! | Unary plus and minus, and logical negation |
^ ** | Exponentiation |
++ -- | Increment and decrement, either prefix or postfix |
$ | Field reference |
NOTE: While "**" and "**=" are common extensions, they are not part of POSIX awk.
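A few expressions make the precedence order concrete. This is an informal check, not a complete test of the table.

```shell
out=$(awk 'BEGIN {
    print -2 ^ 2          # exponentiation before unary minus: -(2^2)
    print 2 ^ 3 ^ 2       # exponentiation groups right to left: 2^(3^2)
    print 1 + 2 * 3       # multiplication before addition
    print 1 " " 2 + 3     # addition before concatenation
}')

printf '%s\n' "$out"
```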
An action is enclosed in braces and consists of one or more statements and/or expressions. The difference between a statement and a function is that a function returns a value, and its argument list is specified within parentheses. (The formal syntactical difference does not always hold true: printf is considered a statement, but its argument list can be put in parentheses; getline is a function that does not use parentheses.)
Awk has a number of predefined arithmetic and string functions. A function is typically called as follows:
return = function(arg1,arg2)
where return is a variable created to hold what the function returns. (In fact, the return value of a function can be used anywhere in an expression, not just on the right-hand side of an assignment.) Arguments to a function are specified as a comma-separated list. The left parenthesis immediately follows the name of the function. (With built-in functions, a space is permitted between the function name and the parentheses.)
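As a small illustration of calling built-in functions, both in an assignment and inside a larger expression, consider the following sketch. The string value is arbitrary.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    n = length(s)                     # return value stored in a variable
    print n
    print toupper(substr(s, 1, 5))    # return value passed to another function
    print index(s, "world") + 1       # return value used in arithmetic
}')

printf '%s\n' "$out"
```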