Research on Silent Failures in Scientific Software

Software Fault, Error, and Failure Definitions for Scientific Software

Acknowledged Error: error in computer output that, given the constraints of the software environment, is either unavoidable or intentionally introduced to make a problem tractable. Sources of acknowledged error include simplifying assumptions, finite-precision calculations, and domain discretizations. In principle, an acknowledged error can be measured because its origins are fully identified.
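As a small illustration of an acknowledged error whose origin is fully identified, the Python sketch below measures the truncation error of a forward-difference derivative. The test function (sin, whose exact derivative cos is known), the evaluation point, and the step sizes are our own illustrative choices, and the names forward_difference and discretization_errors are hypothetical.

```python
import math


def forward_difference(f, x, h):
    """Approximate f'(x) with a first-order forward difference of step h."""
    return (f(x + h) - f(x)) / h


def discretization_errors(x=1.0, steps=(1e-1, 1e-2, 1e-3)):
    """Return (h, |approximation - exact|) pairs for d/dx sin(x) = cos(x).

    Illustrative assumption: sin is used only because its exact derivative
    is known, so the acknowledged (discretization) error can be measured.
    """
    exact = math.cos(x)
    return [(h, abs(forward_difference(math.sin, x, h) - exact)) for h in steps]


if __name__ == "__main__":
    for h, err in discretization_errors():
        print(f"h = {h:.0e}  error = {err:.3e}")
```

Because the error comes from a known discretization, it can be measured directly and shrinks roughly in proportion to h (about h·sin(1)/2 at this point), until round-off from finite-precision arithmetic begins to dominate at much smaller steps.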

Unacknowledged Error: error that results from blunders or mistakes, such as programming mistakes, input data errors, and problems arising from compiler optimizations or language libraries. There are no straightforward methods for estimating, bounding, or ordering the contributions of unacknowledged errors.

Failure: computer output produced by a run that executes an unacknowledged error.

Detected Failure: a failure that causes a program's output error to exceed some detection boundary.

Terminal Failure: a failure that causes the executing program to produce an illegal output or to abort. Terminal failures are generally easy to detect because their outputs are unambiguously wrong.

Silent Failure: any failure that is neither a terminal failure nor a detected failure.
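The three failure categories can be read as a decision procedure. The Python sketch below applies them to a single run, assuming the run is already known to execute an unacknowledged error, produces a scalar numeric output, and can be compared with a trusted reference value under an absolute detection boundary. The names Outcome and classify_failure are hypothetical.

```python
import math
from enum import Enum


class Outcome(Enum):
    TERMINAL = "terminal failure"
    DETECTED = "detected failure"
    SILENT = "silent failure"


def classify_failure(run, reference, tolerance):
    """Classify a run that is assumed to execute an unacknowledged error.

    run:       zero-argument callable returning a scalar numeric output
    reference: output of the trusted program (assumption: one is available)
    tolerance: absolute detection boundary on |output - reference|
    """
    try:
        output = run()
    except Exception:
        return Outcome.TERMINAL  # the program aborted
    if not math.isfinite(output):
        return Outcome.TERMINAL  # illegal output (NaN or infinity)
    if abs(output - reference) > tolerance:
        return Outcome.DETECTED  # output error exceeds the detection boundary
    return Outcome.SILENT  # neither terminal nor detected


if __name__ == "__main__":
    print(classify_failure(lambda: 2.0 + 1e-9, 2.0, 1e-6))
```

The terminal checks come first because a silent failure is defined only as what remains once terminal and detected failures are excluded.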

We have been studying the characteristics of silent failures in scientific software by generalizing a well-known testing technique, mutation testing, into a technique we call Mutation Sensitivity Testing (MST).
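To make the starting point concrete, here is a minimal Python sketch of classic mutation testing applied to a toy numerical routine. It is not MST or MATMUTE: the subject program, the operator-swap mutations, and the names SOURCE, SWAPS, MutateOneOperator, load, and mutation_outcomes are all hypothetical. Each mutant carries one deliberately injected unacknowledged error (a swapped arithmetic operator), and its output is sorted into the terminal, detected, or silent categories, using the unmutated program's output as the reference and an assumed absolute tolerance as the detection boundary.

```python
import ast
import math

# Hypothetical subject program: composite trapezoidal rule for x**2 on [0, 1].
SOURCE = """
def trapezoid(n):
    h = 1.0 / n
    total = 0.5 * (0.0 ** 2 + 1.0 ** 2)
    for i in range(1, n):
        x = i * h
        total = total + x ** 2
    return total * h
"""

# Arithmetic operator replacement, one of the classic mutation operators.
SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Div, ast.Div: ast.Mult}


class MutateOneOperator(ast.NodeTransformer):
    """Swap the operator of the k-th mutable BinOp node, counted in post-order."""

    def __init__(self, k):
        self.k = k
        self.seen = 0

    def visit_BinOp(self, node):
        self.generic_visit(node)  # visit operands first, so counting is post-order
        if type(node.op) in SWAPS:
            if self.seen == self.k:
                node.op = SWAPS[type(node.op)]()
            self.seen += 1
        return node


def load(tree):
    """Compile a module AST and return the trapezoid function it defines."""
    namespace = {}
    exec(compile(ast.fix_missing_locations(tree), "<mutant>", "exec"), namespace)
    return namespace["trapezoid"]


def mutation_outcomes(n=100, tolerance=1e-3):
    """Run every single-operator mutant and classify it against the original."""
    reference = load(ast.parse(SOURCE))(n)
    sites = sum(isinstance(node, ast.BinOp) and type(node.op) in SWAPS
                for node in ast.walk(ast.parse(SOURCE)))
    outcomes = []
    for k in range(sites):
        mutant = load(MutateOneOperator(k).visit(ast.parse(SOURCE)))
        try:
            output = mutant(n)
        except Exception:
            outcomes.append("terminal")  # the mutant aborted
            continue
        if not math.isfinite(output):
            outcomes.append("terminal")  # illegal output (NaN or infinity)
        elif abs(output - reference) > tolerance:
            outcomes.append("detected")  # error exceeds the detection boundary
        else:
            outcomes.append("silent")  # error hidden inside the boundary
    return outcomes


if __name__ == "__main__":
    for k, outcome in enumerate(mutation_outcomes()):
        print(f"mutant {k}: {outcome}")
```

With these settings, the mutant that turns `0.5 * (...)` into `0.5 / (...)` lands in the silent category because its divisor is exactly 1.0, so its output is unchanged; the mutation testing literature calls this an equivalent mutant, and such cases have to be separated from genuine silent failures when interpreting the counts.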


Dan Hook has created MATMUTE, a mutation generator for MATLAB code. The current development version is available on SourceForge at matmute.sourceforge.net.

Our paper “Mutation Sensitivity Testing” will be published in IEEE CiSE. All data files, scripts, and programs (except the MATLAB and Python interpreters) needed to generate the mutants and carry out the analysis for this paper are available here.

The files for the graphs and analysis in a paper submitted to IEEE TSE are available here.