sed & awk

sed & awkSearch this book
Previous: 4.3 Testing and Saving OutputChapter 4
Writing sed Scripts
Next: 4.5 Getting to the PromiSed Land
 

4.4 Four Types of sed Scripts

In this section, we are going to look at four types of scripts, each one illustrating a typical sed application.

4.4.1 Multiple Edits to the Same File

The first type of sed script demonstrates making a series of edits in a file. The example we use is a script that converts a file created by a word processing program into a file coded for troff.

One of the authors once did a writing project for a computer company, here referred to as BigOne Computer. The document had to include a product bulletin for "Horsefeathers Software." The company promised that the product bulletin was online and that they would send it. Unfortunately, when the file arrived, it contained the formatted output for a line printer, the only way they could provide it. A portion of that file (saved for testing in a file named horsefeathers) follows.

                  HORSEFEATHERS SOFTWARE PRODUCT BULLETIN

  DESCRIPTION
+   ___________

  BigOne Computer  offers three  software packages from the  suite  
  of Horsefeathers  software products  --  Horsefeathers  Business
  BASIC, BASIC  Librarian,  and LIDO.  These software products can
  fill  your    requirements    for    powerful,    sophisticated,
  general-purpose business  software providing you with a base for
  software customization or development.

  Horsefeathers  BASIC is  BASIC optimized for use on  the  BigOne  
  machine with UNIX  or MS-DOS operating systems.  BASIC Librarian
  is a full screen program editor, which also provides the ability

Note that the text has been justified with spaces added between words. There are also spaces added to create a left margin.

We find that when we begin to tackle a problem using sed, we do best if we make a mental list of all the things we want to do. When we begin coding, we write a script containing a single command that does one thing. We test that it works, then we add another command, repeating this cycle until we've done all that's obvious to do. ("All that's obvious" because the list is not always complete, and the cycle of implement-and-test often adds other items to the list.)

It may seem to be a rather tedious process to work this way and indeed there are a number of scripts where it's fine to take a crack at writing the whole script in one pass and then begin testing it. However, the one-step-at-a-time technique is highly recommended for beginners because you isolate each command and get to easily see what is working and what is not. When you try to do several commands at once, you might find that when problems arise you end up recreating the recommended process in reverse; that is, removing commands one by one until you locate the problem.

Here is a list of the obvious edits that need to be made to the Horsefeathers Software bulletin:

  1. Replace all blank lines with a paragraph macro (.LP).

  2. Remove all leading spaces from each line.

  3. Remove the printer underscore line, the one that begins with a "+".

  4. Remove multiple blank spaces that were added between words.

The first edit requires that we match blank lines. However, in looking at the input file, it wasn't obvious whether the blank lines had leading spaces or not. As it turns out, they do not, so blank lines can be matched using the pattern "^$". (If there were spaces on the line, the pattern could be written "^ *$".) Thus, the first edit is fairly straightforward to accomplish:

s/^$/.LP/

It replaces each blank line with ".LP". Note that you do not escape the literal period in the replacement section of the substitute command. We can put this command in a file named sedscr and test the command as follows:

$ sed -f sedscr horsefeathers
                  HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
  DESCRIPTION
+   ___________
.LP
  BigOne Computer  offers three  software packages from the  suite
  of Horsefeathers  software products  --  Horsefeathers  Business
  BASIC, BASIC  Librarian,  and LIDO.  These software products can
  fill  your    requirements    for    powerful,    sophisticated,
  general-purpose business  software providing you with a base for
  software customization or development.
.LP
  Horsefeathers  BASIC is  BASIC optimized for use on  the  BigOne
  machine with UNIX  or MS-DOS operating systems.  BASIC Librarian
  is a full screen program editor, which also provides the ability

It is pretty obvious which lines have changed. (It is frequently helpful to cut out a portion of a file to use for testing. It works best if the portion is small enough to fit on the screen yet is large enough to include different examples of what you want to change. After all edits have been applied successfully to the test file, a second level of testing occurs when you apply them to the complete, original file.)

The next edit that we make is to remove the line that begins with a "+" and contains a line-printer underscore. We can simply delete this line using the delete command, d. In writing a pattern to match this line, we have a number of choices. Each of the following would match that line:

/^+/
/^+ /
/^+  */
/^+  *__*/

As you can see, each successive regular expression matches a greater number of characters. Only through testing can you determine how complex the expression needs to be to match a specific line and not others. The longer the pattern that you define in a regular expression, the more comfort you have in knowing that it won't produce unwanted matches. For this script, we'll choose the third expression:

/^+  */d

This command will delete any line that begins with a plus sign and is followed by at least one space. The pattern specifies two spaces, but the second is modified by "*", which means that the second space might or might not be there.

This command was added to the sed script and tested but since it only affects one line, we'll omit showing the results and move on. The next edit needs to remove the spaces that pad the beginning of a line. The pattern for matching that sequence is very similar to the address for the previous command.

s/^  *//

This command removes any sequence of spaces found at the beginning of a line. The replacement portion of the substitute command is empty, meaning that the matched string is removed.

We can add this command to the script and test it.

$ sed -f sedscr horsefeathers
HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
DESCRIPTION
.LP
BigOne Computer  offers three  software packages from the  suite
of Horsefeathers  software products  --  Horsefeathers  Business
BASIC, BASIC  Librarian,  and LIDO.  These software products can
fill  your    requirements    for    powerful,    sophisticated,
general-purpose business  software providing you with a base for
software customization or development.
.LP
Horsefeathers  BASIC is  BASIC optimized for use on  the  BigOne
machine with UNIX  or MS-DOS operating systems.  BASIC Librarian
is a full screen program editor, which also provides the ability

The next edit attempts to deal with the extra spaces added to justify each line. We can write a substitute command to match any string of consecutive spaces and replace it with a single space.

s/  */ /g

We add the global flag at the end of the command so that all occurrences, not just the first, are replaced. Note that, like previous regular expressions, we are not specifying how many spaces are there, just that one or more be found. There might be two, three, or four consecutive spaces. No matter how many, we want to reduce them to one.[3]

[3] This command will also match just a single space. But since the replacement is also a single space, such a case is effectively a "no-op."

Let's test the new script:

$ sed -f sedscr horsefeathers
HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
DESCRIPTION
.LP
BigOne Computer offers three software packages from the suite
of Horsefeathers software products -- Horsefeathers Business
BASIC, BASIC Librarian, and LIDO. These software products can
fill your requirements for powerful, sophisticated,
general-purpose business software providing you with a base for
software customization or development.
.LP
Horsefeathers BASIC is BASIC optimized for use on the BigOne
machine with UNIX or MS-DOS operating systems. BASIC Librarian
is a full screen program editor, which also provides the ability

It works as advertised, reducing two or more spaces to one. On closer inspection, though, you might notice that the script removes a sequence of two spaces following a period, a place where they might belong.

We could perfect our substitute command such that it does not make the replacement for spaces following a period. The problem is that there are cases when three spaces follow a period and we'd like to reduce that to two. The best way seems to be to write a separate command that deals with the special case of a period followed by spaces.

s/\.  */.  /g

This command replaces a period followed by any number of spaces with a period followed by two spaces. It should be noted that the previous command reduces multiple spaces to one, so that only one space will be found following a period.[4] Nonetheless, this pattern works regardless of how many spaces follow the period, as long as there is at least one. (It would not, for instance, affect a filename of the form test.ext if it appeared in the document.) This command is placed at the end of the script and tested:

[4] The command could therefore be simplified to:

s/\. /.  /g

$ sed -f sedscr horsefeathers
HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
DESCRIPTION
.LP
BigOne Computer offers three software packages from the suite 
of Horsefeathers software products -- Horsefeathers Business 
BASIC, BASIC Librarian, and LIDO.  These software products can
fill your requirements for powerful, sophisticated,
general-purpose business software providing you with a base for
software customization or development.
.LP
Horsefeathers BASIC is BASIC optimized for use on the BigOne 
machine with UNIX or MS-DOS operating systems.  BASIC Librarian
is a full screen program editor, which also provides the ability

It works. Here's the completed script:

s/^$/.LP/
/^+  */d
s/^  *//
s/  */ /g
s/\.  */.  /g

As we said earlier, the next stage would be to test the script on the complete file (hf.product.bulletin), using testsed, and examine the results thoroughly. When we are satisfied with the results, we can use runsed to make the changes permanent:

$ runsed hf.product.bulletin
done

By executing runsed, we have overwritten the original file.

Before leaving this script, it is instructive to point out that although the script was written to process a specific file, each of the commands in the script is one that you might expect to use again, even if you don't use the entire script again. In other words, you may well write other scripts that delete blank lines or check for two spaces following a period. Recognizing how commands can be reused in other situations reduces the time it takes to develop and test new scripts. It's like a singer learning a song and adding it to his or her repetoire.

4.4.2 Making Changes Across a Set of Files

The most common use of sed is in making a set of search-and-replacement edits across a set of files. Many times these scripts aren't very unusual or interesting, just a list of substitute commands that change one word or phrase to another. Of course, such scripts don't need to be interesting as long as they are useful and save doing the work manually.

The example we look at in this section is a conversion script, designed to modify various "machine-specific" terms in a UNIX documentation set. One person went through the documentation set and made a list of things that needed to be changed. Another person worked from the list to create the following list of substitutions.

s/ON switch/START switch/g
s/ON button/START switch/g
s/STANDBY switch/STOP switch/g
s/STANDBY button/STOP switch/g
s/STANDBY/STOP/g
s/[cC]abinet [Ll]ight/control panel light/g
s/core system diskettes/core system tape/g
s/TERM=542[05] /TERM=PT200 /g
s/Teletype 542[05]/BigOne PT200/g
s/542[05] terminal/PT200 terminal/g
s/Documentation Road Map/Documentation Directory/g
s/Owner\/Operator Guide/Installation and Operation Guide/g
s/AT&T 3B20 [cC]omputer/BigOne XL Computer/g
s/AT&T 3B2 [cC]omputer/BigOne XL Computer/g
s/3B2 [cC]omputer/BigOne XL Computer/g
s/3B2/BigOne XL Computer/g

The script is straightforward. The beauty is not in the script itself but in sed's ability to apply this script to the hundreds of files comprising the documentation set. Once this script is tested, it can be executed using runsed to process as many files as there are at once.

Such a script can be a tremendous time-saver, but it can also be an opportunity to make big-time mistakes. What sometimes happens is that a person writes the script, tests it on one or two out of the hundreds of files and concludes from that test that the script works fine. While it may not be practical to test each file, it is important that the test files you do choose be both representative and exceptional. Remember that text is extremely variable and you cannot typically trust that what is true for a particular occurrence is true for all occurrences.

Using grep to examine large amounts of input can be very helpful. For instance, if you wanted to determine how "core system diskettes" appears in the documents, you could grep for it everywhere and pore over the listing. To be thorough, you should also grep for "core," "core system," "system diskettes," and "diskettes" to look for occurrences split over multiple lines. (You could also use the phrase script in Chapter 6 to look for occurrences of multiple words over consecutive lines.) Examining the input is the best way to know what your script must do.

In some ways, writing a script is like devising a hypothesis, given a certain set of facts. You try to prove the validity of the hypothesis by increasing the amount of data that you test it against. If you are going to be running a script on multiple files, use testsed to run the script on several dozen files after you've tested it on a smaller sample. Then compare the temporary files to the originals to see if your assumptions were correct. The script might be off slightly and you can revise it. The more time you spend testing, which is actually rather interesting work, the less chance you will spend your time unraveling problems caused by a botched script.

4.4.3 Extracting Contents of a File

One type of sed application is used for extracting relevant material from a file. In this way, sed functions like grep, with the additional advantage that the input can be modified prior to output. This type of script is a good candidate for a shell script.

Here are two examples: extracting a macro definition from a macro package and displaying the outline of a document.

4.4.3.1 Extracting a macro definition

troff macros are defined in a macro package, often a single file that's located in a directory such as /usr/lib/macros. A troff macro definition always begins with the string ".de", followed by an optional space and the one- or two-letter name of the macro. The definition ends with a line beginning with two dots (..). The script we show in this section extracts a particular macro definition from a macro package. (It saves you from having to locate and open the file with an editor and search for the lines that you want to examine.)

The first step in designing this script is to write one that extracts a specific macro, in this case, the BL (Bulleted List) macro in the -mm package.[5]

[5] We happen to know that the -mm macros don't have a space after the ".de" command.

$ sed -n  "/^\.deBL/,/^\.\.$/p" /usr/lib/macros/mmt
.deBL
.if\\n(.$<1 .)L \\n(Pin 0 1n 0 \\*(BU
.if\\n(.$=1 .LB 0\\$1 0 1 0 \\*(BU
.if\\n(.$>1 \{.ie !\w^G\\$1^G .)L \\n(Pin 0 1n 0 \\*(BU 0 1
.el.LB 0\\$1 0 1 0 \\*(BU 0 1 \}
..

Sed is invoked with the -n option to keep it from printing out the entire file. With this option, sed will print only the lines it is explicitly told to print via the print command. The sed script contains two addresses: the first matches the start of the macro definition ".deBL" and the second matches its termination, ".." on a line by itself. Note that dots appear literally in the two patterns and are escaped using the backslash.

The two addresses specify a range of lines for the print command, p. It is this capability that distinguishes this kind of search script from grep, which cannot match a range of lines.

We can take this command line and make it more general by placing it in a shell script. One obvious advantage of creating a shell script is that it saves typing. Another advantage is that a shell script can be designed for more general usage. For instance, we can allow the user to supply information from the command line. In this case, rather than hard-code the name of the macro in the sed script, we can use a command-line argument to supply it. You can refer to each argument on the command line in a shell script by positional notation: the first argument is $1, the second is $2, and so on. Here's the getmac script:

#! /bin/sh
# getmac -- print mm macro definition for $1 
sed -n "/^\.de$1/,/^\.\.$/p" /usr/lib/macros/mmt

The first line of the shell script forces interpretation of the script by the Bourne shell, using the "#!" executable interpreter mechanism available on all modern UNIX systems. The second line is a comment that describes the name and purpose of the script. The sed command, on the third line, is identical to the previous example, except that "BL" is replaced by "$1", a variable representing the first command-line argument. Note that the double quotes surrounding the sed script are necessary. Single quotes would not allow interpretation of "$1" by the shell.

This script, getmac, can be executed as follows:

$ getmac BL

where "BL" is the first command-line argument. It produces the same output as the previous example.

This script can be adapted to work with any of several macro packages. The following version of getmac allows the user to specify the name of a macro package as the second command-line argument.

#! /bin/sh
# getmac - read macro definition for $1 from package $2
file=/usr/lib/macros/mmt
mac="$1"
case $2 in
 -ms) file="/work/macros/current/tmac.s";;
 -mm) file="/usr/lib/macros/mmt";;
 -man) file="/usr/lib/macros/an";;
esac
sed -n "/^\.de *$mac/,/^\.\.$/p" $file

What is new here is a case statement that tests the value of $2 and then assigns a value to the variable file. Notice that we assign a default value to file so if the user does not designate a macro package, the -mm macro package is searched. Also, for clarity and readability, the value of $1 is assigned to the variable mac.

In creating this script, we discovered a difference among macro packages in the first line of the macro definition. The -ms macros include a space between ".de" and the name of the macro, while -mm and -man do not. Fortunately, we are able to modify the pattern to accommodate both cases.

/^\.de *$mac/

Following ".de", we specify a space followed by an asterisk, which means the space is optional.

The script prints the result on standard output, but it can easily be redirected into a file, where it can become the basis for the redefinition of a macro.

4.4.3.2 Generating an outline

Our next example not only extracts information; it modifies it to make it easier to read. We create a shell script named do.outline that uses sed to give an outline view of a document. It processes lines containing coded section headings, such as the following:

.Ah "Shell Programming"

The macro package we use has a chapter heading macro named "Se" and hierarchical headings named "Ah", "Bh", and "Ch". In the -mm macro package, these macros might be "H", "H1", "H2", "H3", etc. You can adapt the script to whatever macros or tags identify the structure of a document. The purpose of the do.outline script is to make the structure more apparent by printing the headings in an indented outline format.

The result of do.outline is shown below:

$ do.outline ch13/sect1
CHAPTER  13 Let the Computer Do the Dirty Work
     A.  Shell Programming 
          B.  Stored Commands
          B.  Passing Arguments to Shell Scripts
          B.  Conditional Execution
          B.  Discarding Used Arguments
          B.  Repetitive Execution
          B.  Setting Default Values
          B.  What We've Accomplished

It prints the result to standard output (without, of course, making any changes within the files themselves).

Let's look at how to put together this script. The script needs to match lines that begin with the macros for:

  • Chapter title (.Se)

  • Section heading (.Ah)

  • Subsection heading (.Bh)

We need to make substitutions on those lines, replacing macros with a text marker (A, B, for instance) and adding the appropriate amount of spacing (using tabs) to indent each heading. (Remember, the "·" denotes a tab character.)

Here's the basic script:

sed -n '
s/^\.Se /CHAPTER /p
s/^\.Ah /·A. /p
s/^\.Bh /··B.  /p' $*

do.outline operates on all files specified on the command line ("$*"). The -n option suppresses the default output of the program. The sed script contains three substitute commands that replace the codes with the letters and indent each line. Each substitute command is modified by the p flag that indicates the line should be printed.

When we test this script, the following results are produced:

CHAPTER  "13" "Let the Computer Do the Dirty Work"
     A.  "Shell Programming" 
          B.  "Stored Commands"
          B.  "Passing Arguments to Shell Scripts"

The quotation marks that surround the arguments to a macro are passed through. We can write a substitute command to remove the quotation marks.

s/"//g

It is necessary to specify the global flag, g, to catch all occurrences on a single line. However, the key decision is where to put this command in the script. If we put it at the end of the script, it will remove the quotation marks after the line has already been output. We have to put it at the top of the script and perform this edit for all lines, regardless of whether or not they are output later in the script.

sed -n '
s/"//g
s/^\.Se /CHAPTER /p
s/^\.Ah /·A. /p
s/^\.Bh /··B.  /p' $*

This script now produces the results that were shown earlier.

You can modify this script to search for almost any kind of coded format. For instance, here's a rough version for a LaTeX file:

sed -n '
s/[{}]//g
s/\\section/·A. /p
s/\\subsection/··B.  /p' $*

4.4.4 Edits To Go

Let's consider an application that shows sed in its role as a true stream editor, making edits in a pipeline - edits that are never written back into a file.

On a typewriter-like device (including a CRT), an em-dash is typed as a pair of hyphens (--). In typesetting, it is printed as a single, long dash ( - ). troff provides a special character name for the em-dash, but it is inconvenient to type "\(em".

The following command changes two consecutive dashes into an em-dash.

s/--/\\(em/g

We double the backslashes in the replacement string for \(em, since the backslash has a special meaning to sed.

Perhaps there are cases in which we don't want this substitute command to be applied. What if someone is using hyphens to draw a horizontal line? We can refine this command to exclude lines containing three or more consecutive hyphens. To do this, we use the ! address modifier:

/---/!s/--/\\(em/g

It may take a moment to penetrate this syntax. What's different is that we use a pattern address to restrict the lines that are affected by the substitute command, and we use ! to reverse the sense of the pattern match. It says, simply, "If you find a line containing three consecutive hyphens, don't apply the edit." On all other lines, the substitute command will be applied.

We can use this command in a script that automatically inserts em-dashes for us. To do that, we will use sed as a preprocessor for a troff file. The file will be processed by sed and then piped to troff.

sed '/---/!s/--/\\(em/g' file | troff

In other words, sed changes the input file and passes the output directly to troff, without creating an intermediate file. The edits are made on-the-go, and do not affect the input file. You might wonder why not just make the changes permanently in the original file? One reason is simply that it's not necessary - the input remains consistent with what the user typed but troff still produces what looks best for typeset-quality output. Furthermore, because it is embedded in a larger shell script, the transformation of hyphens to em-dashes is invisible to the user, and not an additional step in the formatting process.

We use a shell script named format that uses sed for this purpose. Here's what the shell script looks like:

#! /bin/sh
eqn=  pic=  col=
files=  options=  roff="ditroff -Tps"
sed="| sed '/---/!s/--/\\(em/g'"
while [ $# -gt 0 ]
do
   case $1 in
     -E) eqn="| eqn";;
     -P) pic="| pic";;
     -N) roff="nroff"  col="| col"  sed= ;;
     -*) options="$options $1";;
      *) if [ -f $1 ]
         then files="$files $1"
         else echo "format: $1: file not found"; exit 1
         fi;;
   esac
   shift
done
eval "cat $files $sed | tbl $eqn $pic | $roff $options $col | lp"

This script assigns and evaluates a number of variables (prefixed by a dollar sign) that construct the command line that is submitted to format and print a document. (Notice that we've set up the -N option for nroff so that it sets the sed variable to the empty string, since we only want to make this change if we are using troff. Even though nroff understands the \(em special character, making this change would have no actual effect on the output.)

Changing hyphens to em-dashes is not the only "prettying up" edit we might want to make when typesetting a document. For example, most keyboards do not allow you to type open and close quotation marks (" and " as opposed to "and"). In troff, you can indicate a open quotation mark by typing two consecutive grave accents, or "backquotes" (``), and a close quotation mark by typing two consecutive single quotes (''). We can use sed to change each doublequote character to a pair of single open-quotes or close-quotes (depending on context), which, when typeset, will produce the appearance of a proper "double quote."

This is a considerably more difficult edit to make, since there are many separate cases involving punctuation marks, space, and tabs. Our script might look like this:

s/^"/``/
s/"$/''/
s/"? /''? /g
s/"?$/''?/g
s/ "/ ``/g
s/" /'' /g
s/·"/·``/g
s/"·/''·/g
s/")/'')/g
s/"]/'']/g
s/("/(``/g
s/\["/\[``/g
s/";/'';/g
s/":/'':/g
s/,"/,''/g
s/",/'',/g
s/\."/.\\\&''/g
s/"\./''.\\\&/g
s/\\(em\\^"/\\(em``/g
s/"\\(em/''\\(em/g
s/\\(em"/\\(em``/g
s/@DQ@/"/g

The first substitute command looks for a quotation mark at the beginning of a line and changes it to an open-quote. The second command looks for a quotation mark at the end of a line and changes it to a close-quote. The remaining commands look for the quotation mark in different contexts, before or after a punctuation mark, a space, a tab, or an em-dash. The last command allows us to get a real doublequote (@DQ@) into the troff input if we need it. We put these commands in a "cleanup" script, along with the command changing hyphens to dashes, and invoke it in the pipeline that formats and prints documents using troff.


Previous: 4.3 Testing and Saving Outputsed & awkNext: 4.5 Getting to the PromiSed Land
4.3 Testing and Saving OutputBook Index4.5 Getting to the PromiSed Land

The UNIX CD Bookshelf NavigationThe UNIX CD BookshelfUNIX Power ToolsUNIX in a NutshellLearning the vi Editorsed & awkLearning the Korn ShellLearning the UNIX Operating System