Jason L Causey

Fixing Pointer-Related Segmentation Faults using GDB

This post was originally written for the A-State CS Department wiki on March 22, 2014. It provides a beginner’s overview of using gdb.

So, your program crashed...

While testing a program involving pointers, it is common to experience a program crash related to a bad pointer value somewhere in the code. On Linux systems, the message displayed when your program crashes is often “Segmentation Fault” — meaning the actual “crime” your code committed (as judged by the OS) was to try to access memory outside the “segment” in which it was allocated. Without getting to far into the details of how operating systems handle memory, suffice it to say that this shouldn’t have happened unless you have mis-handled a pointer (or an array access) in some way.

Segmentation faults and other crashes related to pointer mis-handling can be tricky to track down using program tracing techniques (like adding “debug output”) alone. To find the source of these problems, it really helps to be able to “see” exactly what the program was trying to do at the moment that the crash occurred. Debuggers are special programs designed to help us do exactly that. One such debugger that is available on our class virtual servers is the GNU Project Debugger, known as “GDB” and accessed through the ‘gdb’ command.

What is GDB?

The following was taken from the official GNU Project Debugger website, under the heading “What is GDB?”:

GDB, the GNU Project debugger, allows you to see what is going on ‘inside’ another program while it executes – or what another program was doing at the moment it crashed.

GDB can do four main kinds of things (plus other things in support of these) to help you catch bugs in the act:

  • Start your program, specifying anything that might affect its behavior.
  • Make your program stop on specified conditions.
  • Examine what has happened, when your program has stopped.
  • Change things in your program, so you can experiment with correcting the effects of one bug and go on to learn about another.

The program being debugged can be written in Ada, C, C++, Objective-C, Pascal (and many other languages). Those programs might be executing on the same machine as GDB (native) or on another machine (remote). GDB can run on most popular UNIX and Microsoft Windows variants.

Learn by Example

GDB has a substantial user manual, and provides an extensive set of commands for controlling the debugging session. In reality, you can get a lot of benefit from GDB with only a small subset of the total available command set. This will be shown by working through an example below, given some code that is crashing with a Segmentation Fault error. The commands that will be used in this example are:

run : Start program execution from the beginning of the program. Allows command-line arguments to be listed, and also allows basic I/O redirection.
backtrace : If (or when) a program has crashed, this command will show trace of where you are currently (which functions you are in). The “trace” consists of the current function call stack, where each call (referred to as a “frame”) is shown in order from most recent to earliest (which will be the entry into main()), and each is numbered for reference.
backtrace full : Just like backtrace, but local variables are also shown for each frame.

The Program

The program for this example will consist of a single header file (olist_with_bug.h) and a single source file (debugexample.cpp). The header file contains a class definition and implementation for a singly-linked ordered list. The source file contains a main() function and some code to fill the list, print it to the screen, remove some elements, and print it out again. The source code for each file is shown below:

olist_with_bug.h

/**
 * @file olist.h
 * Defines OList Ordered Linked List
 *
 * OList is a minimally functional singly-linked ordered list
 * designed to contain integer values.
 */

#ifndef OLLIST_H
#define OLLIST_H

#include<iostream>
#include<string>

/**
 * OList is a minimally functional singly-linked list of integer
 * values.  
 */
class OList{
    public:
        // ctor, dtor
        OList()     : head(0) , m_count(0)  {                   }                       
        ~OList()                            { erase();          }                       

        // methods
        void          insert(int value);                                                    /// insert a new item
        bool          remove(int value);                                                    /// remove an item, if it is there
        bool          contains(int value) const;                                            /// see if the list contains an item
        int           count() const         { return m_count;   }                           /// get the current number of items in the list
        bool          is_empty() const      { return head == 0; }                           /// see if the list is empty
        void          erase();                                                              /// erase all values in the list
        std::ostream& print(std::ostream& stream=std::cout, std::string delim=", ") const;  /// print the list to the specified stream

    private:
        /**
         * container node for an integer and a pointer to the next item in the list.
         */
        class Node {
            public:
                // ctor
                Node(int value) 
                    : data(value), next(0)  { }                                             /// Node requires data value to construct
                
                // Attributes:
                int   data;                                                                 /// data value
                Node* next;                                                                 /// pointer to the next Node in the list
        };

        Node* _find(int value) const;
        Node* _find(int value, Node*& previous) const;

        Node* head;
        int   m_count;
};

/**
 * inserts a new data value in the ordered list
 * @param value the new data item (an integer) to insert
 */
void OList::insert(int value){
    OList::Node* previous;
    OList::Node* newnode = new Node(value);
    _find(value, previous);
    if(!is_empty()){
        newnode->next  = previous->next;
        previous->next = newnode;
    }
    else{
        head           = newnode;
    }
    m_count++;
}

/**
 * remove an item from the list if it is there
 * @param  value item to remove
 * @return       true if item was found and removed, false otherwise
 */
bool OList::remove(int value){
    OList::Node* previous;
    OList::Node* victim = _find(value, previous);
    bool  found  = victim != 0;
    if(victim){
        previous->next = victim->next;
        if(victim == head){
            head = victim->next;
        }
        delete victim;
        m_count--;
    }
    return found;
}

/**
 * returns true if the list contains the value
 * @param  value item to look for
 * @return       true if item is found in the list, false otherwise
 */
bool OList::contains(int value) const {
    return _find(value) != 0;
}

/**
 * erase the entire list, leaving it empty
 */
void OList::erase(){
    for(OList::Node* current = head; current != 0;){
        OList::Node* victim = current;
        current             = current->next;
        delete victim;
    }
    m_count = 0;
    head    = 0;
}

/**
 * print the list to the stream specified by the first parameter, 
 * delimited by the string given in the second parameter
 * @param  stream std::ostream stream to send output to
 * @param  delim  string to place between items (defaults to comma-separated)
 * @return        the modified stream object is returned
 */
std::ostream& OList::print(std::ostream& stream, std::string delim) const{
    for(OList::Node* current = head; current != 0; current=current->next){
        stream << current->data << (current->next != 0 ? delim : "");
    }
    return stream;
}

/**
 * (internal use) - finds location of a value in the list, 
 * @param  value value to find
 * @return       pointer to the Node where `value` was found, or NULL if not found
 */
OList::Node* OList::_find(int value) const{
    OList::Node* n;
    return _find(value, n);
}

/**
 * (internal use) - finds location of a value in the list, along with
 * the node immediately to its "left" (the previous node) in the list.
 * In cases where the node isn't in the list, the previous pointer will
 * be set to the node immediately to the "left" of where the node should
 * be located in the list.  This makes the function useful for determining
 * where a node should be added in the list.
 * @param           value    value to find
 * @param[out]      previous pointer to the previous Node in the list (or NULL if the item's position is at head)
 * @return                   pointer to the Node where `value` was found, or NULL if not found
 */
OList::Node* OList::_find(int value, Node*& previous)const {
    OList::Node* current = head;
    previous             = 0;
    while(current && current->data < value){
        previous = current;
        current  = current->next;
    }
    return current && current->data == value ? current : 0;
}

#endif

debugexample.cpp

#include<iostream>
#include<cstdlib>
#include<ctime>
#include<string>

//#include "olist_with_bug.h"
#include "olist_with_bug.h"


int main()
{
    // Un-comment the appropriate line below to
    srand(time(0));     // produce different pseudo-random values on every run
    //  ... or ... 
    //srand(5);           // Produce the same set of "random" values every time.
    OList l;
    
    // Add some values
    for(int i = 0; i < 30; i++){
        l.insert(rand() % 100);
    }
    std::cout << "Initial List: ";
    l.print() << "\n\n";

    // Remove some values:
    for(int i = 0; i < 30; i++){
        l.remove(rand() % 100);
    }

    std::cout << "Final List: ";
    l.print() << "\n\n";

    return 0;
}

Compile and Run

At this point, the usual thing to do is to try to compile and run the program as usual. The corresponding terminal session is shown below:

exampleuser@exampleserver:~/OList_with_error$ g++ -W -Wall -ansi -pedantic debugexample.cpp 
exampleuser@exampleserver:~/OList_with_error$ ./a.out 
Initial List: 3, 7, 8, 12, 17, 20, 21, 22, 24, 25, 25, 30, 33, 33, 36, 48, 53, 53, 57, 61, 63, 64, 71, 78, 82, 85, 86, 87, 90, 92

Final List: 3, 7, 8, 12, 17, 20, 21, 25, 25, 30, 33, 33, 53, 53, 57, 61, 63, 64, 71, 82, 87, 90, 92

exampleuser@exampleserver:~/OList_with_error$ ./a.out 
Segmentation fault

Notice here that on the first run, everything went OK. The main program adds data values at random to the list, then tries to remove them at random... In some cases, the program may run as expected. But in other cases, the program may crash with a Segmentation fault, and it did that on the second run.

This behavior is typical of the way these sorts of errors behave in the real world. It is often the case that these errors only show up intermittently; reproducing them during testing can sometimes be difficult. A good approach is to make sure that your testing code exercises all possible “edge conditions” in the source - for an ordered list, that would mean adding to an empty list, adding a value that would be inserted at the head, and adding a value that would be inserted at the end of the list (in addition to the “normal” case of the value being added in the middle somewhere).

Compiling for Debugging

The compiler flags shown above did not produce an executable file that was ideal for debugging. To be most useful, the debugger will need information like the original source code, the line numbers, and the symbol (variable and function) names. This information is removed from the executable produced by the compiler by default (and for good reason -- it increases the output file size significantly). But since the program crashed and a debugging session would be helpful, the program should now be re-compiled in such a way that the compiler will produce a debugger-friendly executable.

Since the debugger used in this example is GDB, the compiler flag -ggdb will be added to the g++ command (the -g flag tells the compiler to produce debugging information, and the -ggdb flag says to produce that information optimized for the GDB debugger). The full command is:

g++ -W -Wall -ansi -pedantic -ggdb debugexample.cpp

No additional compiler output is shown on the console window, but the size of the executable file a.out has increased (on the system being used to produce this example, it was ~13KiB before and ~40KiB after).

The Debugging Session

Now that the program has been compiled with debugging information included, a debugging session can be initiaited. The command to start GDB is ‘gdb’, and the program’s executable name should be passed as a command-line argument. The full command to start debugging this example (whose executable is named a.out) is shown below:

gdb a.out

This assumes that a.out is in the current working directory. The actual terminal session (with the command and its output) is shown below:

exampleuser@exampleserver:~/OList_with_error$ gdb a.out
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/exampleuser/OList_with_error/a.out...done.
(gdb) 

At this point, GDB has loaded the executable file and is ready for a command. The (gdb) at the end is GDB’s own command prompt. First thing’s first - run the program to try to reproduce the error:

(gdb) run
Starting program: /home/exampleuser/OList_with_error/a.out 

Program received signal SIGSEGV, Segmentation fault.
0x08048969 in OList::insert (this=0xbffff6b4, value=35) at olist_with_bug.h:65
65          newnode->next  = previous->next;
(gdb) 

In this case, the error occurred on the first try. If it had not, the run command could be repeated until the crash occurred.

Notice that the function call where the Segmentation fault occurred is shown (OList::insert()) and the actual parameters are also shown. At this point it is clear that the object being operated on (this) is located at address 0xbffff6b4, and the value passed to the value parameter was 35. There is nothing obviously wrong with either of those actual parameters, so continue looking...

The line of code that was actually being executed when the SIGSEGV occurred was line 65 of olist_with_bug.h, which contains the code:

newnode->next  = previous->next;

So, something bad happened on that line of code. Notice that there are two pointer dereferences (by way of the -> operator) on this line. Either of those could be to blame. We need to know what values are contained in the newnode and previous pointers. To find out, the command backtrace full can be used to show the function call stack as well as the values of any local variables along the way:

(gdb) backtrace full
#0  0x08048969 in OList::insert (this=0xbffff6b4, value=35) at olist_with_bug.h:65
        previous = 0x0
        newnode = 0x804c038
#1  0x08048c63 in main () at debugexample.cpp:20
        i = 3
        l = {head = 0x804c008, m_count = 3}
(gdb) 

From this output, it is clear that previous is NULL (contains the value 0x0, which is zero expressed as hexidecimal). Therefore, it was the attempt to dereference previous (in previous->next) that caused the fault. Dereferencing a NULL pointer is always illegal.

So the question is: “How did previous come to be NULL in the first place?”

To answer this, we can consult the source code itself. The code from the insert() function is shown below:

void OList::insert(int value){
    OList::Node* previous;
    OList::Node* newnode = new Node(value);
    _find(value, previous);
    if(!is_empty()){
        newnode->next  = previous->next;        // how is `previous` NULL here?
        previous->next = newnode;
    }
    else{
        newnode->next  = head;
        head           = newnode;
    }
    m_count++;
}

Looking at the path taken to arrive at that particular line, we see that previous is intended to receive its value on the line:

    _find(value, previous);

Which is line 63 in the file. This means that the value being placed in previous by the _find() method is NULL. So the next thing to do is examine the _find() method to see if there is a bug there, or perhaps it is just being used incorrectly. The _find() method (along with its documentation) is:

/**
 * (internal use) - finds location of a value in the list, along with
 * the node immediately to its "left" (the previous node) in the list.
 * In cases where the node isn't in the list, the previous pointer will
 * be set to the node immediately to the "left" of where the node should
 * be located in the list.  This makes the function useful for determining
 * where a node should be added in the list.
 * @param       value    value to find
 * @param[out]  previous pointer to the previous Node in the list (or NULL if the item's position is at head)
 * @return               pointer to the Node where `value` was found, or NULL if not found
 */
OList::Node* OList::_find(int value, Node*& previous)const {
    OList::Node* current = head;
    previous             = 0;
    while(current && current->data < value){
        previous = current;
        current  = current->next;
    }
    return current && current->data == value ? current : 0;
}

From reading the method’s documentation, it seems that previous will indeed be set to NULL if the item’s position is at the head of the list. This makes sense — there is nothing to the logical “left” of the head of a singly-linked list, after all.

So it probably isn’t an error that the _find() method is setting previous to NULL; it just means that the item that is being inserted belongs to the logical “left” of the current head of the list (thus becoming the new head item). Has this situation been properly handled in the insert() method? Take another look:

void OList::insert(int value){
    OList::Node* previous;
    OList::Node* newnode = new Node(value);
    _find(value, previous);                  
    if(!is_empty()){
        newnode->next  = previous->next;        
        previous->next = newnode;
    }
    else{
        head           = newnode;
    }
    m_count++;
}

Now that it is clear that previous will be set to NULL any time the new node belongs at the head of the list, it is worth listing the situations in which that will be the case:

or

Examining the code carefully shows that the “empty list” case has been planned for, but the “new smallest item” case has not. In either of these cases, the new node’s next pointer should be pointed to the current head (which may be NULL if the list is empty, but that is not a problem). Then, the head pointer must be pointed to the new node (the item being inserted becomes the first item). In both cases, the previous pointer will be set to NULL by the call to _find(), so that can be used as the condition in the if, replacing the check for “not empty”:

    if(previous != 0){      // replaces "if(!is_empty()){"

The else clause must also be modified to properly set the new node’s next pointer (for the “new first item” case):

    else{
        newnode->next  = head;      // link in `newnode` in case list is non-empty
        head           = newnode;
    }

This means that the modifed insert() function should look like:

void OList::insert(int value){
    OList::Node* previous;
    OList::Node* newnode = new Node(value);
    _find(value, previous);                  
    if(previous != 0){             // if list is non-empty and new node isn't a minimum
        newnode->next  = previous->next;        
        previous->next = newnode;
    }
    else{
        newnode->next  = head;     // link in `newnode` in case list is non-empty
        head           = newnode;
    }
    m_count++;
}

Now it is once again time to compile and test the code. The GDB debugging session is still running, so it should be stopped before re-compiling:

(gdb) quit
A debugging session is active.

    Inferior 1 [process 5617] will be killed.

Quit anyway? (y or n) y
exampleuser@exampleserver:~/OList_with_error$

The last line shown indicates that the shell prompt is back and waiting for a command. The compile (with debugger output) and test session (testing directly in GDB just in case) is shown below:

exampleuser@exampleserver:~/OList_with_error$ g++ -W -Wall -ansi -pedantic -ggdb debugexample.cpp
exampleuser@exampleserver:~/OList_with_error$ gdb a.out
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/exampleuser/OList_with_error/a.out...done.
(gdb) run
Starting program: /home/exampleuser/OList_with_error/a.out 
Initial List: 4, 7, 10, 11, 15, 20, 24, 25, 29, 35, 35, 36, 37, 39, 39, 41, 42, 48, 56, 57, 65, 67, 68, 73, 73, 82, 82, 86, 89, 96

Final List: 4, 7, 10, 11, 15, 20, 25, 29, 35, 35, 37, 39, 39, 48, 56, 57, 65, 68, 73, 73, 82, 89, 96

[Inferior 1 (process 6529) exited normally]
(gdb) run
Starting program: /home/exampleuser/OList_with_error/a.out 
Initial List: 0, 3, 13, 20, 20, 22, 23, 33, 34, 35, 38, 44, 46, 48, 49, 53, 58, 60, 62, 64, 68, 70, 70, 74, 81, 90, 90, 94, 95, 99

Final List: 0, 3, 13, 20, 20, 22, 23, 33, 34, 35, 38, 44, 46, 48, 49, 53, 60, 62, 68, 70, 70, 74, 81, 90, 99

[Inferior 1 (process 6533) exited normally]
(gdb) run
Starting program: /home/exampleuser/OList_with_error/a.out 
Initial List: 1, 6, 6, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 31, 34, 38, 42, 44, 57, 65, 66, 68, 69, 69, 70, 73, 90, 94

Final List: 1, 6, 6, 7, 9, 11, 14, 16, 17, 18, 24, 31, 34, 44, 65, 66, 73, 90, 94

[Inferior 1 (process 6534) exited normally]
(gdb) run
Starting program: /home/exampleuser/OList_with_error/a.out 
Initial List: 1, 6, 14, 15, 16, 17, 19, 26, 34, 37, 38, 39, 40, 42, 43, 47, 50, 50, 51, 52, 57, 58, 63, 63, 69, 69, 70, 72, 80, 98

Final List: 1, 6, 17, 19, 26, 34, 37, 38, 39, 40, 42, 43, 47, 50, 50, 51, 57, 58, 63, 69, 69, 70, 72, 98

[Inferior 1 (process 6537) exited normally]
(gdb) run
Starting program: /home/exampleuser/OList_with_error/a.out 
Initial List: 1, 4, 6, 12, 16, 17, 18, 21, 22, 27, 37, 42, 42, 43, 46, 48, 50, 51, 51, 56, 57, 61, 65, 74, 77, 92, 94, 94, 95, 96


Program received signal SIGSEGV, Segmentation fault.
0x080489d5 in OList::remove (this=0xbffff6b4, value=1) at olist_with_bug.h:85
85          previous->next = victim->next;
(gdb) 

As you can see, the program ran well four times before crashing with another Segmentation fault on the fifth run!

This error is in a different place (line 85 in the remove() function). Once again, the actual parameters look reasonable. There are two pointers being dereferenced in that line, so a backtrace with local variable values might help shed some light on what happened:

(gdb) backtrace full
#0  0x080489d5 in OList::remove (this=0xbffff6b4, value=1) at olist_with_bug.h:85
        previous = 0x0
        victim = 0x804c0c8
        found = true
#1  0x08048d29 in main () at debugexample.cpp:27
        i = 29
        l = {head = 0x804c0c8, m_count = 20}
(gdb) 

Once again, the previous pointer seems to be to blame, containing a NULL value at the time it is being dereferenced. Back to the source code! The remove() function looks like:

bool OList::remove(int value){
    OList::Node* previous;
    OList::Node* victim = _find(value, previous);
    bool  found  = victim != 0;
    if(victim){
        previous->next = victim->next;
        if(victim == head){
            head = victim->next;
        }
        delete victim;
        m_count--;
    }
    return found;
}

And once again the _find() helper method is being used. From the earlier experience, it is clear that previous will be set to NULL if the value being searched for isn’t found or if it is found at the head of the list. A quick look a the documentation for the _find() method’s return value indicates that if the value is not found, the return value is also NULL. That leaves only one situation: The value being searched for must be at the beginning of the list.

Like before, it seems that the code is handling the “value not found” and “empty list” special case just fine (in fact, those cases are the same when trying to remove a value). It is the special case where the value is at the beginning of the list that is being ignored. In that case, the item at the head of the list is being removed... The code is already checking for this on the line:

        if(victim == head){

All that is needed is to take advantage of this already-existing check to specify that the previous pointer should only be accessed when the “victim” node is not the head node. This is most easily accomplished by adding an else clause for that if statement and moving the line causing the crash into it. The modified remove() function looks like:

bool OList::remove(int value){
    OList::Node* previous;
    OList::Node* victim = _find(value, previous);
    bool  found  = victim != 0;
    if(victim){
        if(victim == head){
            head = victim->next;
        }
        else{
            previous->next = victim->next;  // only if `victim` is not `head`!
        }
        delete victim;
        m_count--;
    }
    return found;
}

After making the code adjustment, stop the GDB session using the quit command, then re-compile and test the program. At this point the program no longer crashes under any sequence of data additions and deletions! The Segmentation fault errors have been tracked down and squashed thanks to some help from GDB.

Don’t forget to re-compile without the additional debugging information if you are intending to distribute your object/executable file (to reduce the file size).

Other Useful GDB Commands

This example only made use of the run and backtrace full commands from GDB; it is compelling that only two commands can help speed the correction of subtle coding errors so dramatically, but sometimes these two just aren’t quite enough. Adding just a few more commands to your GDB arsenal will allow you to debug a majority of simple (common) programming errors of all kinds:

run : Start program execution from the beginning of the program. Allows command-line arguments to be listed, and also allows basic I/O redirection.
backtrace : If (or when) a program has crashed, this command will show trace of where you are currently (which functions you are in). The “trace” consists of the current function call stack, where each call (referred to as a “frame”) is shown in order from most recent to earliest (which will be the entry into main()), and each is numbered for reference.
backtrace full : Just like backtrace, but local variables are also shown for each frame.
list : List the source code surrounding the current execution point, or in the currently selected frame.
print : Prints the value contained in a variable, or in some cases the result returned by a function that has been called in the current frame. print also has several options available to select the desired representation of the value.
break : create a breakpoint, which is a point at which execution of the program will stop during your debugging session so that you can examine local variables, etc.
step : “steps into” next line of code - will drop into a function call
next : like step but will not drop into a function call
continue : continue running the program as usual, until next breakpoint is encountered or program exits


Epsilon Comparison

This post was originally written for the A-State CS Department wiki in November, 2013.

The Numerical (In)accuracy Problem

Computers use binary (base-2) numbers internally. This means that our familiar decimal (base-10) numbers must be converted to binary before they can be stored in the computer. The problem is that some—many, actually—base-10 real numbers do not have an exact representation in base-2. This may seem odd until you consider that we deal with lots of numbers that do not have an exact representation in base-10 either. Two examples are $\pi$ and $e$. Neither of these values can be written exactly in decimal no matter how many digits you allow after the decimal point. Numbers of this variety are called irrational numbers. It turns out that many numbers that are easily written in decimal become irrational numbers when translated to binary. A simple example is the number 0.1. It is easy enough to write in decimal, but in binary it becomes impossible to represent exactly.

So what do we do when we can’t exactly represent a number? We round. If you were asked to write down $\pi$, what would you write? You would probably write down some value between 3.14 and 3.14159, depending on how precise you wanted to be. Neither number is exactly correct, but both are reasonable representations given the number of digits used. In a computer, there is a limit to the number of binary digits (bits) that may be used to represent a number. Because of this, numbers like 0.1 must be rounded to some limited precision. This rounding produces a “pretty good”, but inexact representation of 0.1.

At this point, you are probably asking yourself “why should I care?”, and that is a reasonable question. The answer is: Because it will cause errors in your programs! To help illustrate why, open the interactive Python interpreter and follow along with the following example:

First, let’s see how 0.1 is approximated in Python (Python is used for this example because it’s default numeric output mode shows more decimal places than C++). At the interactive prompt, type 0.1 and <ENTER>. You should see something like the following:

>>> 0.1
0.10000000000000001

Hmm. It looks like 0.1 gained an extra 1 in the 17^th^ decimal position! Now enter this relational expression and note the result:

0.1 == 3.1 - 3

Here’s the output that was produced on the computer used to write this page:

>>> 0.1 == 3.1 - 3
False

Why was the result False? To see the answer, let’s enter each operand into the interpreter separately and look at the result:

>>> 0.1
0.10000000000000001
>>> 3.1 - 3
0.10000000000000009

Notice that 0.1 is represented internally as 0.10000000000000001 (as we saw before). Now notice that 3.1 - 3 (which should also be 0.1) is represented internally as 0.10000000000000009. There is a difference in the 17^th^ digit following the decimal point! When we compare these two values with the == operator, the expression will return False since the two numbers (internally) are not exactly the same! We refer to this as rounding error, and if we want to compare real numbers for equality in a computer program, we must learn how to deal with rounding error properly.

A C++ Example

In C++, showing the inaccuracy is a little more difficult due to differences in the internal representation of floating-point values. The following example may or may not show the “error”, but it is likely to do so:

double pi1 = acos(-1);
double pi2 = (pi1 * (0.1 * 0.1)) / (0.1 * 0.1);
if(pi1 == pi2){
    cout << "OK, " << pi1 << " == " << pi2 << "!\n";
}
else{
    cout << "An ERROR occurred: " << pi1 << " != " << pi2 << "!\n";
}

Here, a very accurate value for $\pi$ is calculated using the inverse cosine function acos(). Then, this version of $\pi$ is used to calculate another representation (by calculating the area of a circle of radius 0.1 and then solving for $\pi$ by dividing out the $r^2$ term). In theory, these two values of $\pi$ should be exactly the same — but in the process of doing imprecise floating-point operations (the operations involving 0.1), rounding error was (possibly) introduced. Therefore, it is likely that the following output is produced:

An error occurred: 3.14159 != 3.14159!

Even worse, since the default behavior of the << operator in C++ is to round the output, you cannot even see the difference that caused the two values to be considered “not equal” (it was beyond the 5^th^ decimal place).

The Right Way to Compare Real Numbers for Equality

Ok, if we can’t use the == operator by itself to compare real numbers, how can we hope to determine if two real number values are equal? The answer is that we must accept the fact that we can’t compare real numbers exactly with a computer—not by using the built-in float type anyway. Once we have accepted that limitation, we can find a way to compare real numbers that takes the rounding error into account. We will decide on some amount of error that we will consider acceptable, then we will determine if the two values are reasonably close to equal. This makes sense in the real world, since you would probably say that both 3.1416 and 3.14159 are reasonable representations of $\pi$. To determine if two numbers are functionally equal (or “close enough”), we can ask the following question:

“Is the difference between the two values so small that we can ignore it?”

If the answer to this question is “Yes”, we will consider the numbers equal. Otherwise, we will consider them to be unequal. To answer this question, we can use an expression like the following (which compares 0.1 and 3.1 - 3 ):

In Python

abs(0.1 - (3.1 - 3)) < 1e-11

In C++

fabs(0.1 - (3.1 - 3)) < 1e-11

In the expression above, we take the difference of the two values (by subtracting them), then we look at the magnitude of that number (by taking its absolute value with abs() (in Python), or fabs() (in C++) to remove any negative sign). If the magnitude of the difference of the two numbers is less than a very small value ($1\times10^{-11}$ in this case), we consider the numbers equal. We usually refer to the small number representing allowable error as EPSILON. Thus, a comparison of this type is usually referred to as an EPSILON comparison. At the Python interactive prompt, enter the expression above and note the result. You should see something like the following:

>>> abs(0.1 - (3.1 - 3)) < 1e-11
True

For the C++ example, consider the same code testing the two values of $\pi$ as shown above, but now using an EPSILON comparison:

double pi1 = acos(-1);
double pi2 = (pi1 * (0.1 * 0.1)) / (0.1 * 0.1);

if(fabs(pi1 - pi2) < 1e-11){
    cout << "OK, " << pi1 << " == " << pi2 << "!\n";
}
else{
    cout << "An ERROR occurred: " << pi1 << " != " << pi2 << "!\n";
}    

Now, you can be confident that the outcome will always be as expected:

OK, 3.14159 == 3.14159!