[Chapter 2] 2.5 Using sed and awk Together

2.5 Using sed and awk Together

In UNIX, pipes can be used to pass the output from one program as input to the next program. Let's look at a few examples that combine sed and awk to produce a report. The sed script that replaced the postal abbreviation of a state with its full name is general enough that it might be used again as a script file named nameState:

$ cat nameState
s/ CA/, California/
s/ MA/, Massachusetts/
s/ OK/, Oklahoma/
s/ PA/, Pennsylvania/
s/ VA/, Virginia/

Of course, you'd want to handle all states, not just five, and if you were running it on documents other than mailing lists, you should make sure that it does not make unwanted replacements.

The output for this program, using the input file list, is the same as we have already seen. In the next example, the output produced by nameState is piped to an awk program that extracts the name of the state from each record.

$ sed -f nameState list | awk -F, '{ print $4 }'
 Massachusetts
 Virginia
 Oklahoma
 Pennsylvania
 Massachusetts
 Virginia
 California
 Massachusetts

The awk program is processing the output produced by the sed script. Remember that the sed script replaces the abbreviation with a comma and the full name of the state. In effect, it splits the third field containing the city and state into two fields. "$4" references the fourth field.

What we are doing here could be done completely in sed, but probably with more difficulty and less generality. Also, since awk allows you to replace the string you match, you could achieve this result entirely with an awk script.

While the result of this program is not very useful, it could be passed to sort | uniq -c, which would sort the states into an alphabetical list with a count of the number of occurrences of each state.

Now we are going to do something more interesting. We want to produce a report that sorts the names by state and lists the name of the state followed by the name of each person residing in that state. The following example shows the byState program.

#! /bin/sh
awk -F, '{ 
	print $4 ", " $0 
	}' $* | 
sort |
awk -F, '
$1 == LastState { 
      print "\t" $2
}
$1 != LastState { 
      LastState = $1
      print $1
}'

This shell script has three parts. The program invokes awk to produce input for the sort program and then invokes awk again to test the sorted input and determine if the name of the state in the current record is the same as in the previous record. Let's see the script in action:

$ sed -f nameState list | byState
 California
	 Amy Wilde
 Massachusetts
	 Eric Adams
	 John Daggett
	 Sal Carpenter
 Oklahoma
	 Orville Thomas
 Pennsylvania
	 Terry Kalkas
 Virginia
	 Alice Ford
	 Hubert Sims

The names are sorted by state. This is a typical example of using awk to generate a report from structured data.

To examine how the byState program works, let's look at each part separately. It's designed to read input from the nameState program and expects "$4" to be the name of the state. Look at the output produced by the first line of the program:

$ sed -f nameState list | awk -F, '{ print $4 ", " $0 }'
 Massachusetts, John Daggett, 341 King Road, Plymouth, Massachusetts
 Virginia, Alice Ford, 22 East Broadway, Richmond, Virginia
 Oklahoma, Orville Thomas, 11345 Oak Bridge Road, Tulsa, Oklahoma
 Pennsylvania, Terry Kalkas, 402 Lans Road, Beaver Falls, Pennsylvania
 Massachusetts, Eric Adams, 20 Post Road, Sudbury, Massachusetts
 Virginia, Hubert Sims, 328A Brook Road, Roanoke, Virginia
 California, Amy Wilde, 334 Bayshore Pkwy, Mountain View, California
 Massachusetts, Sal Carpenter, 73 6th Street, Boston, Massachusetts

The sort program, by default, sorts lines in alphabetical order, looking at characters from left to right. In order to sort records by state, and not names, we insert the state as a sort key at the beginning of the record. Now the sort program can do its work for us. (Notice that using the sort utility saves us from having to write sort routines inside awk.)

The second time awk is invoked we perform a programming task. The script looks at the first field of each record (the state) to determine if it is the same as in the previous record. If it is not the same, the name of the state is printed followed by the person's name. If it is the same, then only the person's name is printed.

$1 == LastState { 
      print "\t" $2
}
$1 != LastState { 
      LastState = $1
      print $1
      print "\t" $2 
}'

There are a few significant things here, including assigning a variable, testing the first field of each input line to see if it contains a variable string, and printing a tab to align the output data. Note that we don't have to assign to a variable before using it (because awk variables are initialized to the empty string). This is a small script, but you'll see the same kind of routine used to compare index entries in a much larger indexing program in Chapter 12, Full-Featured Applications. However, for now, don't worry too much about understanding what each statement is doing. Our point here is to give you an overview of what sed and awk can do.

In this chapter, we have covered the basic operations of sed and awk. We have looked at important command-line options and introduced you to scripting. In the next chapter, we are going to look at regular expressions, something both programs use to match patterns in the input.


2.4 Using awk		3. Understanding Regular Expression Syntax