Handling Errors

Systematically handling errors in scripts

The Problem

Scripts are programs too! Some of the commands executed in your script will fail, and if you don't test for that, just as you would test for a function's error conditions in any other programming language, then you can't really complain when bad things happen.

An Example

Let's take a particularly cheesy example, a tidy-up script:

#! /bin/bash

cd $1
rm -r *

Oh, the horror!

The obvious problem of not passing an argument to this script is not really what we're interested in here, though it's just as catastrophic. If you don't pass an argument then the cd $1 line becomes simply cd.

cd with no arguments means "change to the user's home directory". Thereon, the script will happily delete all the user's files and directories. Not clever.

However, we're interested in potential errors. There are two obvious errors:

$1 is not a valid directory
* expands to nothing and rm fails.

The first problem is the bad one. If cd is given an invalid argument (not a directory, not a directory you are allowed to cd into, etc.) then it fails and prints out an error message. What it doesn't do is change the working directory anywhere. The net result is that rm immediately starts deleting as many of the files and directories in the current directory as it has permission to. Not so clever if you were root and ran the script from /.

How many people check whether cd succeeded or failed?

The second problem is that * fails to expand to any filenames and is passed through as itself, so rm tries to delete a file literally called *. That isn't such a big problem in this script, although it is a problem and we should be handling it with more care. It matters less here because rm is the last thing the script does, so the script exits with the exit status of rm (non-zero; the exact value depends on the rm implementation) and the last thing the user sees is the error message from rm:

rm: *: No such file or directory

To repeat: rm really is complaining that there is no file called *, not that there are no files! The difference here is subtle as the effect is the same.

We can't really let the script get away with not handling rm failing, as we don't know whether someone else is expecting rm to have succeeded. Perhaps they were expecting files to be there, and the fact that rm failed because the files aren't there should be considered a big problem.

Whatever has gone wrong in the script, we should look to exit before any more problems start cascading from this one. Most important of all, we should make sure that whoever called us knows that we failed, by issuing a suitable message and calling exit with a non-zero value.

Solution 1

Everyone knows (hopefully) that in a shell script $? contains the exit status of the last run command. So we can test that:

cd $1
if [[ $? -ne 0 ]] ; then
    echo "cd \$1 ($1): failed" >&2
    exit 1
fi

Note

$? contains the exit status of the last foreground pipeline if we want to be pernickety. Which we do.
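
To make the pernickety point concrete, here is a small sketch (not part of the original script) showing that $? only reflects the last command of a pipeline. PIPESTATUS is a Bash array that records the status of every stage; the variable names last_status and stage_statuses are purely illustrative.

```shell
#!/bin/bash
# $? reports the pipeline as a whole, i.e. its last command.
false | true
last_status=$?                        # status of true, not of false

# PIPESTATUS (Bash-specific) keeps one status per pipeline stage.
# Copy it immediately, before another command overwrites it.
false | true
stage_statuses=("${PIPESTATUS[@]}")   # statuses of false and of true

echo "pipeline: ${last_status}, stages: ${stage_statuses[*]}"
```

This prints "pipeline: 0, stages: 1 0": the failure of false is invisible to $?.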

Actually, it would be neat if we could exit with the same value that cd exited with, as it might be useful for whoever has to debug this script. There is a problem here:

cd $1
if [[ $? -ne 0 ]] ; then
    echo "cd \$1 ($1): failed with $?" >&2
    exit $?
fi

This will exit with 0, and the message will report 0 as well. Why? The problem is that $? is the exit status of the last command run (foreground pipeline...yeah, yeah) and the [[ test is itself a command. It succeeded, because the value really was non-zero, and promptly reset $? to 0. The echo then succeeds too and sets it to 0 again. To be safe we need to capture the value of $? as soon as possible:

cd $1
err=$?
if [[ $err -ne 0 ]] ; then
    echo "cd \$1 ($1): failed with $err" >&2
    exit $err
fi

Which does what we want.

But you have to be honest and say that that looks like a lot of work.

Solution 2

Everyone is less aware that if allows for much more complicated expressions. In fact, the syntax is:

if *list* ; then *list*; ... fi

A *list* is a sequence of pipelines (separated by ;, &, ||, and &&) and a pipeline is a number of simple commands (separated by |). In other words, the expression you pass to if isn't limited to simple tests but in essence can be arbitrary commands like cd and rm.
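
As a rough illustration of a list used as the condition, the sketch below joins two ordinary commands with &&. The function name enter_and_list is hypothetical, and its body runs in a subshell only so that the cd does not affect the caller.

```shell
#!/bin/bash
# enter_and_list is a hypothetical helper; the ( ) body runs in a subshell
# so the cd does not change the caller's working directory.
enter_and_list () (
    # The condition is a list: two real commands joined by &&.
    if cd "$1" 2>/dev/null && ls >/dev/null ; then
        echo "ok: ${PWD}"
    else
        # In the else branch $? still holds the exit status of the list.
        echo "failed with $?"
    fi
)

enter_and_list /
enter_and_list /no/such/directory
```

The first call prints "ok: /" and the second prints "failed with 1", the exit status that Bash's cd returns on failure.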

We can revisit our script. The failure handling goes in the else branch, where $? still holds the exit status of cd:

if cd $1 ; then
    :
else
    err=$?
    echo "cd \$1 ($1): failed with $err" >&2
    exit $err
fi

Although, to be honest, it doesn't look like we've done anything more useful than show off that we've read the manual about if.

Really, the problem is that both cd and rm already tell us what went wrong (No such file or directory), so we don't need to echo that. All we really want to do is quit before the error has any knock-on effects. We could then reduce our script to:

if cd $1 ; then
    :
else
    exit $?
fi

Which is better but it still means wrapping every command in your script in a test. There must be a better way.

Solution 3

Traps! What? Aren't traps associated with signals? Yes, but there are a few pseudo-traps that have been squeezed in and one of them is our kiddie.

The ERR trap is raised whenever a simple command exits with a non-zero exit status. That sounds good, but what is a simple command? The easy answer is anything that isn't a shell conditional or loop control construct. Uh? if, while, case, [[, (( etc.: all those things that control the flow or perform some test rather than doing something. Look them up in the Bash man page under Compound Commands.

So this is looking pretty good.

Trap Tricks

When traps are raised an expression is evaluated. So, in the simplest case you might say:

% trap 'echo oops' ERR
% true
% false
oops
% cd /
% cd /bad
bash: cd: /bad: No such file or directory
oops

This is looking good. Now, what about $??

% trap 'echo "command called exit ($?)"' ERR
% false
command called exit (1)
% cd /bad
bash: cd: /bad: No such file or directory
command called exit (1)

This is looking very good. You might be getting a bad feeling about trying to get the quoting right in the expression. We can fix that by making a call to a function instead:

handle_ERR ()
{
    typeset what=$1

    echo "command called exit (${what})"
    exit ${what}
}

trap 'handle_ERR $?' ERR

Warning

If you've just typed that blindly in and run cd /bad you'll have been thrown out of your shell. Sorry!

You might want to change the exit ${what} line to say echo exit ${what} while you're testing on the command line.

Of course, if we have lots of cd commands in our script:

cd here
...
cd here

we won't be sure which one had the error in it. Unless we knew what line we were at in the script...

handle_ERR ()
{
    typeset what=$1
    typeset where=$2

    echo "command called exit (${what}) at line ${where}"
    exit ${what}
}

trap 'handle_ERR $? $LINENO' ERR

What if we're running a lot of programs all with the same error handling functionality, how do we tell them apart?

PROGRAM="${0##*/}"

handle_ERR ()
{
    typeset what=$1
    typeset where=$2

    echo "${PROGRAM} called exit (${what}) at line ${where}"
    exit ${what}
}

And if we were running on several different machines at any old time of the day?

PROGRAM="${0##*/}"
HOSTNAME="$(uname -n)"

handle_ERR ()
{
    typeset what=$1
    typeset where=$2

    echo "$(date +'%b %d %T') ${HOSTNAME} [$$]: ${PROGRAM} called exit (${what}) at line ${where}"
    exit ${what}
}

we might then see:

% cd /bad
bash: cd: /bad: No such file or directory
Feb 17 12:28:11 hostname [23593]: scriptname called exit (1) at line 19

Now that's looking a whole lot better!
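
Pulling the pieces above together, a complete script might look roughly like the sketch below. PROGRAM, HOSTNAME and handle_ERR come from the text; the enter_dir function is hypothetical, and sending the log line to stderr is an assumption rather than something the examples above do. The trap is set inside a subshell body so that experimenting with it cannot exit your login shell.

```shell
#!/bin/bash
PROGRAM="${0##*/}"
HOSTNAME="$(uname -n)"

handle_ERR ()
{
    typeset what=$1
    typeset where=$2

    # Assumption: the log line goes to stderr, like the messages in Solution 1.
    echo "$(date +'%b %d %T') ${HOSTNAME} [$$]: ${PROGRAM} called exit (${what}) at line ${where}" >&2
    exit "${what}"
}

# enter_dir is a hypothetical example function. Its ( ) body runs in a
# subshell, so both the trap and the exit stay local to the call.
enter_dir () (
    trap 'handle_ERR $? $LINENO' ERR
    cd "$1"
    echo "now in ${PWD}"
)

enter_dir /
```

Calling enter_dir /no/such/directory instead should print the cd error followed by the log line, and return 1 without ever reaching the echo.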

Special Cases

You'll be pleased to know that the ERR trap is not raised in conditional expressions, i.e. the tests of if and while statements, so you don't get the ERR trap raised because of the false here:

if false ; then
    echo "You will not read this"
else
    echo "This is expected"
fi
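
One way to convince yourself of this is to count how often the trap fires. The sketch below is only an illustration: the function name count_err_traps and the fired counter are made up, and the || and ! cases rely on the exemptions listed in the Bash manual's description of the ERR trap.

```shell
#!/bin/bash
# count_err_traps is a hypothetical demo function. Its ( ) body runs in a
# subshell, so the trap does not leak into the calling shell.
count_err_traps () (
    set +e      # keep errexit, if the caller enabled it, from ending the demo
    fired=0
    trap 'fired=$((fired + 1))' ERR

    if false ; then : ; fi          # test of an if: no trap
    while false ; do : ; done       # test of a while: no trap
    false || true                   # left-hand side of ||: no trap
    ! true                          # status inverted with !: no trap
    false                           # plain simple command: trap fires

    echo "ERR trap fired ${fired} time(s)"
)

count_err_traps
```

Only the final, bare false raises the trap, so this prints "ERR trap fired 1 time(s)".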

Problems

The small print for the ERR trap is that it only applies to simple statements. More annoyingly, in the small print somewhere else you'll recall reading that the ERR trap is not inherited by shell functions, command substitutions, and commands executed in a subshell environment.

Inheritance

In Bash (3 & 4) you can use set -E to overcome the inheritance problem. In Bash 2 and Ksh you'll have to manually set the trap at the start of every function and subshell!
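
The effect of set -E can be seen with a small experiment. This is a sketch under the assumption of a Bash that supports set -E; demo_inherit, inner and the counter names are hypothetical.

```shell
#!/bin/bash
# demo_inherit and inner are hypothetical names. The ( ) body keeps the
# trap and the set -E out of the calling shell.
demo_inherit () (
    set +e      # keep errexit, if the caller enabled it, from ending the demo
    fired=0
    trap 'fired=$((fired + 1))' ERR

    inner () { false ; true ; }     # a command fails inside, but inner returns 0

    inner                           # trap not inherited: nothing fires
    before=${fired}

    set -E                          # errtrace: functions now inherit the ERR trap
    inner                           # the false inside inner raises the trap
    after=${fired}

    echo "without -E: ${before}, with -E: $((after - before))"
)

demo_inherit
```

This prints "without -E: 0, with -E: 1": the same failing command inside the function goes unnoticed until set -E is in effect.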

Compound Statements

This is a very subtle problem as for the most part the ERR trap will do what you want:

if true ; then
    cd /bad
fi

will raise the ERR trap as you might expect at the cd line.

Most of the time, the right thing seems to happen. Except here:

% trap 'echo "command called exit ($?)"' ERR
% ( false )
% echo $?
1

Note

Unless you're using Bash 4 which has fixed/changed this behaviour.

The subshell hasn't issued the expected trap message (we haven't run set -E), this shell hasn't issued the trap message (as a subshell is a Compound Command) and yet the subshell exited with a non-zero exit status.

What can we do? Nothing, sadly. We'll have to revert to the old-school techniques of checking $?.

Downsides

Yes, there can be downsides. Sometimes you know a command is going to fail:

ls foo*

Well, perhaps you didn't know this was a failure but ls will exit with a status of 2 if the thing you asked it to list does not exist.

Prior to using the ERR trap we would have blithely passed over this inconvenience and accepted that the message appearing on stderr was good enough for us to understand and cope with. Now, the script will exit.

Just running ls in a script (presumably to see what files exist for reporting/debug purposes) doesn't look very purposeful. Another example might be to capture the output of any reports:

reports=$(cat *.report)

Here, we might have been using the semantic of

at this point we don't really care if there are any reports or not but let's capture the reports if any exist

However, our ERR trap will defeat us and the script will exit. Not good.

We can fix it by a little shell tweakery. We can suffix the existing simple command with || true:

reports=$(cat *.report || true)

where the original simple command has been replaced with a list. The exit status of the list depends on the operators used, but in this case either cat *.report is successful or true is. Either way, even though cat might have failed (and printed an error message), the command substitution as a whole is successful and the ERR trap is not raised.

Of course, in this instance you would be a lot better off testing the results of filename expansion, as you can give a much clearer report, if required, and choose whether to run cat at all (and therefore avoid its distracting error message).
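
Testing the results of filename expansion might look something like this. It is only a sketch: collect_reports is a hypothetical helper, and the message it prints when nothing matches is an assumption about what a clearer report could be.

```shell
#!/bin/bash
# collect_reports is a hypothetical helper: it prints the contents of every
# *.report file in the given directory, or nothing if there are none.
collect_reports () {
    typeset dir=$1
    typeset file
    typeset -a found=()

    # An unmatched glob is passed through literally, so keep only the
    # names that really exist.
    for file in "${dir}"/*.report ; do
        if [[ -e ${file} ]] ; then
            found+=("${file}")
        fi
    done

    if [[ ${#found[@]} -eq 0 ]] ; then
        echo "no reports in ${dir}" >&2
        return 0
    fi

    cat -- "${found[@]}"
}

reports=$(collect_reports .)
```

Every test here sits inside an if, so nothing in the function raises the ERR trap merely because no reports exist, and cat only runs when it has something to read.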

Caveats

Many people will advise against using either the ERR trap or its sibling set -e (exit on error) for reasons including ones we've just demonstrated to our advantage! What's the problem?

|| Operator

The list OR operator || suppresses the ERR trap in all but the last pipeline. Let's take an example:

% trap 'err=$?; echo "command called exit ($err)"; exit $err' ERR
% set -E
% ( echo a ; false ; echo b)
a
command called exit (1)

but:

% ( echo a ; false ; echo b) || echo fallback
a
b

Here, the ERR trap has been suppressed in the subshell to the left of the || and so, even though false fails the failure is ignored and the subshell can evaluate echo b.

Note

echo b succeeds and, as the last command of the subshell, its exit status becomes that of the subshell itself, i.e. the subshell succeeds and therefore the || operator has no need to execute echo fallback. Try a final false command in the subshell.

You cannot even reset the ERR trap or equivalent:

% ( set -e ; echo a ; false ; echo b) || echo fallback
a
b

So, if your script is using || operators then you will have to resort to some more traditional error checking (see Solution 2 for example).

Note

The suppression of ERR handling occurs whether your pipeline involves simple commands, subshells, functions, etc.
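
A minimal sketch of that more traditional checking, reusing the subshell from the example above: the step that matters carries its own explicit test, so it no longer depends on the ERR trap or set -e. The output variable is just for illustration.

```shell
#!/bin/bash
# Inside a || list the ERR trap and set -e are suppressed, so the step
# that matters gets its own explicit check.
output=$( ( echo a ; false || exit $? ; echo b ) || echo fallback )
echo "${output}"
```

This prints a and then fallback: the subshell now stops at the failing command, exits with its status, and lets the || operator do its job.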

(In)compatibility

Various versions of Bash have implemented different behaviours. You're on your own on this front.

Arithmetic

let and (( are implemented such that an Arithmetic Expression that evaluates to 0 (zero) makes them exit with a non-zero status (1). Bash treats that as an error and raises the ERR trap.

Note

This is true for Bash 4.1 and later. Previously ((a=0)) would not trigger an ERR trap whereas let a=0 would.
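
The classic way to trip over this is a counter that starts at zero. The sketch below assumes Bash 4.1 or later, as described in the note; arith_demo and its variables are hypothetical names.

```shell
#!/bin/bash
# arith_demo is a hypothetical demo function. Its ( ) body runs in a
# subshell, so the trap does not leak into the calling shell.
arith_demo () (
    set +e      # keep errexit, if the caller enabled it, from ending the demo
    fired=0
    trap 'fired=$((fired + 1))' ERR

    n=0
    (( n++ ))             # evaluates to 0 (the old value): status 1, trap fires
    (( n++ ))             # evaluates to 1: status 0, no trap
    n=$(( n + 1 ))        # arithmetic expansion in an assignment: status 0
    (( n = 0 )) || true   # evaluates to 0, but || true masks the failure

    echo "n=${n} fired=${fired}"
)

arith_demo
```

On Bash 4.1 or later this should print "n=0 fired=1". Using an arithmetic expansion in a plain assignment, or appending || true, are two ways of keeping a legitimate zero from being treated as an error.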

POSIX

Both the incompatibilities and the behaviour of Arithmetic Expressions can trace their roots back to POSIX.

Philosophy

There is an element of philosophy to be brought to bear with respect to the use of automatic error handling. Some people see automatic error handling as wrong. That is:

% rm *.tar

should fail if it wants to fail. If they were concerned about rm failing then they would write the appropriate error handling code.

That is a perfectly acceptable viewpoint although one that I cannot see the value in. To my mind, scripting is the lazy man's typing. If I ran a command at the prompt and it failed I would be concerned as to why. I would want to know the reason why it failed and what, if anything, I should be doing next time in advance of the command to ensure that it didn't fail.

At the keyboard I can see the error message and react accordingly, almost certainly by not performing the next task immediately. Perhaps I might perform some checks or pre-requisites and maybe re-run the rm. In a script, nothing stops just because a command failed: it runs the next line of code regardless of how critical the failure might have been.

Of course there are some command invocations where you don't care whether the command succeeds or fails and you can handle those appropriately (|| true). However, as my script runs, if a command fails then I have not made the appropriate preparations and I don't want other commands to continue until I have handled this case.

There are a number of problems, particularly the suppressed error handling with || operators, but for the most part automatically stopping processing when a command fails is a better default behaviour. It represents a safety net to catch the unexpected.
