sed & awk

Chapter 7
Writing Scripts for awk

7.7 System Variables

Awk defines a number of system, or built-in, variables, which fall into two types. The first type holds values whose defaults you can change, such as the default field and record separators. The second type holds values that awk updates automatically and that you can use in reports or processing, such as the number of fields in the current record, the number of the current record, and the name of the current input file.

There is a set of default values that affect the recognition of records and fields on input and their display on output. The system variable FS defines the field separator. By default, its value is a single space, which tells awk that any number of spaces and/or tabs separate fields. FS can also be set to any single character, or to a regular expression. Earlier, we changed the field separator to a comma in order to read a list of names and addresses.
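As a rough illustration of the three kinds of values FS can take, the following sketch splits a few made-up input lines (they are not taken from the earlier examples) using the default separator, a single character, and a regular expression. The shell variable names are arbitrary.

```shell
# Default FS (a single space): runs of spaces and tabs separate fields.
default_fs=$(printf 'one \t two   three\n' | awk '{ print NF ": " $2 }')

# FS set to a single character: every colon separates two fields.
colon_fs=$(printf 'root:x:0\n' | awk 'BEGIN { FS = ":" } { print NF ": " $3 }')

# FS set to a regular expression: a comma followed by any number of spaces.
regex_fs=$(printf 'John Robinson,   Boston,MA\n' |
    awk 'BEGIN { FS = ", *" } { print NF ": " $2 }')

printf '%s\n' "$default_fs" "$colon_fs" "$regex_fs"
```

Each command prints the number of fields found and one of the fields, so you can see where awk made the splits.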

The output equivalent of FS is OFS, which is a space by default. We'll see an example of redefining OFS shortly.

Awk defines the variable NF to be the number of fields for the current input record. Changing the value of NF has side effects. The interactions that occur when $0, the fields, and NF are changed are a murky area, particularly when NF is decreased.[7] Increasing NF creates new (empty) fields and rebuilds $0, with the fields separated by the value of OFS. When NF is decreased, gawk and mawk rebuild the record, and the fields that were above the new value of NF are set to the empty string. The Bell Labs awk does not change $0.

[7] Unfortunately, the POSIX standard isn't as helpful here as it should be.
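Because the behavior varies, it is worth trying this on your own awk. The following is a small experiment, assuming gawk or mawk; the input line and the OFS value are made up for the test. The first command increases NF and the second decreases it.

```shell
# Increase NF from 3 to 5: two empty fields are added and $0 is rebuilt
# with OFS between the fields (behavior assumed for gawk and mawk).
grown=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { NF = 5; print; print NF }')
printf '%s\n' "$grown"

# Decrease NF from 3 to 2: this is the murky case, so check what
# your own awk prints for $0 here.
shrunk=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { NF = 2; print }')
printf '%s\n' "$shrunk"
```

With gawk or mawk, the first command should print a-b-c-- followed by 5. The result of the second command is the one to compare across awk versions.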

Awk also defines RS, the record separator, as a newline. RS is a bit unusual; it's the only variable where awk only pays attention to the first character of the value.

The output equivalent to RS is ORS, which is also a newline by default. In the next section, "Working with Multiline Records," we'll show how to change the default record separator.

Awk sets the variable NR to the number of the current input record. It can be used to number records in a list. The variable FILENAME contains the name of the current input file. The variable FNR is useful when multiple input files are used, as it provides the number of the current record relative to the current input file.
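To see how NR, FNR, and FILENAME relate, here is a minimal sketch that reads two throwaway files. The file names and their contents are invented for the demonstration.

```shell
# Create two small input files in a temporary directory.
dir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$dir/first.txt"
printf 'gamma\n' > "$dir/second.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
counts=$(cd "$dir" && awk '{ print FILENAME, FNR, NR, $1 }' first.txt second.txt)
printf '%s\n' "$counts"

rm -r "$dir"
```

The third line of output shows the difference: for the only record in second.txt, FNR is 1 while NR is 3.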

Typically, the field and record separators are defined in the BEGIN procedure because you want these values set before the first input line is read. However, you can redefine these values anywhere in the script. In POSIX awk, assigning a new value to FS has no effect on the current input line; it only affects the next input line.

NOTE: Prior to the June 1996 release of Bell Labs awk, versions of awk for UNIX did not follow the POSIX standard in this regard. In those versions, if you have not yet referenced an individual field, and you set the field separator to a different value, the current input line is split into fields using the new value of FS. Thus, you should test how your awk behaves, and if at all possible, upgrade to a correct version of awk.
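One quick way to test how your awk behaves is a rule that assigns FS after the line has been read and then prints the first field. This is only a sketch; the two input lines are made up.

```shell
# FS is changed inside the main rule, after the current line has been read.
# A POSIX-conforming awk splits the first line on whitespace and only
# the second line on colons.
split_test=$(printf 'a:b c\nd:e f\n' | awk '{ FS = ":"; print $1 }')
printf '%s\n' "$split_test"
```

An awk that follows POSIX prints a:b for the first line and d for the second. If the first line of output is a instead, your awk applied the new FS to the line it had already read.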

Finally, POSIX added a new variable, CONVFMT, which is used to control number-to-string conversions. For example,

str = (5.5 + 3.2) " is a nice value"

Here, the result of the numeric expression 5.5 + 3.2 (which is 8.7) must be converted to a string before it can be used in the string concatenation. CONVFMT controls this conversion. Its default value is "%.6g", which is a printf-style format specification for floating-point numbers. Changing CONVFMT to "%d", for instance, would cause all numbers to be converted to strings as integers. Prior to the POSIX standard, awk used OFMT for this purpose. OFMT still does the same job, but only for numeric values output by the print statement. The POSIX committee wanted to separate the task of output conversion from that of simple string conversion. Note that numbers that are integers are always converted to strings as integers, no matter what the values of CONVFMT and OFMT may be.
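The difference between the two variables is easier to see when they are set to different formats. The following sketch assumes a POSIX awk; the formats and the value of x are chosen only for illustration.

```shell
conv=$(awk 'BEGIN {
    CONVFMT = "%.2f"            # used when a number is converted to a string
    OFMT = "%.3f"               # used when print outputs a number directly
    x = 3.14159
    str = x " is a nice value"  # concatenation converts x with CONVFMT
    print str
    print x                     # print converts x with OFMT
    print 17 ""                 # an integer value ignores both formats
}')
printf '%s\n' "$conv"
```

The same number appears as 3.14 inside the string and as 3.142 when printed on its own, while the integer 17 is unaffected by either setting.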

Now let's look at some examples, beginning with the NR variable. Here's a revised print statement for the script that calculates student averages:

print NR ".", $1, avg

Running the revised script produces the following output:

1. john 87.4
2. andrea 86
3. jasper 85.6

After the last line of input is read, NR contains the number of input records that were read. It can be used in the END action to provide a report summary. Here's a revised version of the phonelist.awk script.

# phonelist.awk -- print name and phone number. 
# input file -- name, company, street, city, state and zip, phone
BEGIN { FS = ", *" }  # comma-delimited fields
{ print $1 ", " $6 } 
END {
	print ""
	print NR, "records processed."
}

This program changes the default field separator and uses NR to print the total number of records read. Note that this program uses a regular expression for the value of FS. This program produces the following output:

John Robinson, 696-0987
Phyllis Chapman, 879-0900

2 records processed.

The output field separator (OFS) is generated when a comma is used to separate the arguments in a print statement. You may have wondered what effect the comma has in the following expression:

print NR ".", $1, avg

By default, the comma causes a space (the default value of OFS) to be output. For instance, you could redefine OFS to be a tab in a BEGIN action. Then the preceding print statement would produce the following output:

1.      john    87.4
2.      andrea  86
3.      jasper  85.6

This is especially useful if the input consists of tab-separated fields and you want to generate the same kind of output. OFS can be redefined to be a sequence of characters, such as a comma followed by a space.
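Here is a brief sketch of both settings, along with the case where no comma is used. The single input line is borrowed loosely from the student output above, and the shell variable names are arbitrary.

```shell
# OFS as a tab: each comma in the print statement becomes a tab.
tabbed=$(echo 'john 87.4' | awk 'BEGIN { OFS = "\t" } { print NR ".", $1, $2 }')

# OFS as a sequence of characters: a comma followed by a space.
listed=$(echo 'john 87.4' | awk 'BEGIN { OFS = ", " } { print $1, $2 }')

# Without the comma, the values are concatenated and OFS is not used.
joined=$(echo 'john 87.4' | awk 'BEGIN { OFS = "\t" } { print $1 $2 }')

printf '%s\n' "$tabbed" "$listed" "$joined"
```

The last command is a reminder that OFS only appears where a comma separates the arguments to print.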

Another commonly used system variable is NF, which is set to the number of fields for the current record. As we'll see in the next section, you can use NF to check that a record has the number of fields that you expect. You can also use NF to reference the last field of each record. Using the "$" field operator and NF produces that reference. If there are six fields, then "$NF" is the same as "$6". Given a list of names, such as the following:

John Kennedy
Lyndon B. Johnson
Richard Milhous Nixon
Gerald R. Ford
Jimmy Carter
Ronald Reagan
George Bush
Bill Clinton

you will note that the last name is not in the same field position for each record. You could print the last name of each President using "$NF".[8]

[8] This scheme breaks down for Martin Van Buren; fortunately, our list contains only recent U.S. presidents.
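A minimal sketch of this, using three names from the list, prints NF alongside $NF so you can see that the field count changes while $NF always picks out the last field:

```shell
last_names=$(printf 'John Kennedy\nLyndon B. Johnson\nGerald R. Ford\n' |
    awk '{ print NF, $NF }')
printf '%s\n' "$last_names"
```

The first record has two fields and the others have three, yet the same expression yields the last name in each case.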

These are the basic system variables, the ones most commonly used. There are more of them, as listed in Appendix B, and we'll introduce new system variables as needed in the chapters that follow.

7.7.1 Working with Multiline Records

All of our examples have used input files whose records consisted of a single line. In this section, we show how to read a record where each field consists of a single line.

Earlier, we looked at an example of processing a file of names and addresses. Let's suppose that the same data is stored on file in block format. Instead of having all the information on one line, the person's name is on one line, followed by the company's name on the next line and so on. Here's a sample record:

John Robinson
Koren Inc.
978 Commonwealth Ave.
Boston
MA 01760
696-0987

This record has six fields. A blank line separates each record.

To process this data, we can specify a multiline record by defining the field separator to be a newline, represented as "\n", and setting the record separator to the empty string, which stands for a blank line.

BEGIN { FS = "\n"; RS = "" }

We can print the first and last fields using the following script:

# block.awk - print first and last fields 
# $1 = name; $NF = phone number

BEGIN { FS = "\n"; RS = "" }

{ print $1, $NF }

Here's a sample run:

$ awk -f block.awk phones.block
John Robinson 696-0987
Phyllis Chapman 879-0900
Jeffrey Willis 914-636-0000
Alice Gold (707) 724-0000
Bill Gold 1-707-724-0000

The two fields are printed on the same line because the default output field separator (OFS) remains a single space. If you want the fields to be output on separate lines, change OFS to a newline. While you're at it, you probably want to preserve the blank line between records, so you must specify the output record separator ORS to be two newlines.

OFS = "\n"; ORS = "\n\n"
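Putting the four assignments together, a sketch of the revised script might look like the following. The first record is the sample shown earlier. For the second record, only the name and phone number come from the earlier output; the parenthesized lines are placeholders, not real data.

```shell
blocks=$(printf '%s\n' \
    'John Robinson' 'Koren Inc.' '978 Commonwealth Ave.' \
    'Boston' 'MA 01760' '696-0987' \
    '' \
    'Phyllis Chapman' '(company)' '(street)' \
    '(city)' '(state and zip)' '879-0900' |
    awk 'BEGIN { FS = "\n"; RS = ""; OFS = "\n"; ORS = "\n\n" }
         { print $1, $NF }')
printf '%s\n' "$blocks"
```

Each name and phone number now appears on its own line, and a blank line separates one record from the next, so the output has the same block layout as the input.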

7.7.2 Balance the Checkbook

This is a simple application that processes items in your check register. While not necessarily the easiest way to balance the checkbook, it is amazing how quickly you can build something useful with awk.

This program presumes you have entered in a file the following information:

1000
125	Market          125.45
126	Hardware Store   34.95
127	Video Store       7.45
128	Book Store       14.32
129	Gasoline         16.10

The first line contains the beginning balance. Each of the other lines represents information from a single check: the check number, a description of where it was spent, and the amount of the check. The three fields are separated by tabs.

The core task of the script is that it must get the beginning balance and then deduct the amount of each check from that balance. We can provide detail lines for each check to compare against the check register. Finally, we can print the ending balance. Here it is:

# checkbook.awk
BEGIN { FS = "\t" }

#1 Expect the first record to have the starting balance.
NR == 1 { print "Beginning Balance: \t" $1
	balance = $1
	next		# get next record and start over
}

#2 Apply to each check record, subtracting amount from balance.
{	print $1, $2, $3
	print balance -= $3
}

Let's run this program and look at the results:

$ awk -f checkbook.awk checkbook.test
Beginning Balance:      1000
125 Market 125.45
874.55
126 Hardware Store 34.95
839.6
127 Video Store 7.45
832.15
128 Book Store 14.32
817.83
129 Gasoline 16.10
801.73

The report is difficult to read, but later we will learn to fix the format using the printf statement. What's important is to confirm that the script is doing what we want. Notice, also, that getting this far takes only a few minutes in awk. In a programming language such as C, it would take you much longer to write this program: you would probably need many more lines of code, and you'd be programming at a much lower level. There are any number of refinements that you'd want to make to this program to improve it, and refining a program takes much longer. The point is that with awk you are able to isolate and implement the basic functionality quite easily.
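As one possible refinement, and only as a sketch, the script could stay quiet while it deducts each check and print a single summary line in an END action. This variation is not the version developed in this chapter; it borrows the printf statement a little early, and it uses just the first two checks from the sample file.

```shell
report=$(printf '1000\n125\tMarket\t125.45\n126\tHardware Store\t34.95\n' |
    awk 'BEGIN { FS = "\t" }
         NR == 1 { balance = $1; next }   # first record: starting balance
         { balance -= $3 }                # every other record: deduct the check
         END { printf "%d checks, ending balance: %.2f\n", NR - 1, balance }')
printf '%s\n' "$report"
```

Because NR still holds the total number of records in the END action, NR - 1 gives the number of checks processed after the balance line.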

