Software Reliability


A failure is defined as the unacceptable departure of program operation from program requirements.

A fault is the software defect that causes a failure.

An error is the programmer action or omission that results in a fault.

Software reliability is the probability of failure-free operation of a software component or system in a specified environment for a specified time.


The Faces of Failure

Failures range from annoying to catastrophic.

Sonny Walker, 19, withdrew money from a Richmond Saving Credit Union bank machine on December 25, 1994. On the same day, someone used a stolen card at the same machine. The clock in the video camera indicated that Walker had withdrawn his money at the time the fraud occurred, so the bank forwarded his photo to the RCMP. The clock had been off by about one hour, which the bank apparently did not realize until September. (SEN 21(2), March 1996, p. 16-17)


Failures result in different levels of service loss.

The New York Stock Exchange opened an hour late on Dec. 18, 1995. The weekend had been spent upgrading the system software. However, at 9:15 A.M., it was discovered that there were serious communications problems in the software. The problem was fixed by 10 A.M., and the market opened at 10:30 A.M. The Chicago Mercantile Exchange, Boston Stock Exchange, and Philadelphia Stock Exchange all waited until the NYSE opened. (SEN 21(2), March 1996, p. 16)

Failures have different repair times.

The FAA's Fremont (Oakland) air-traffic control center was "off the air" on August 9, 1995, due to a power outage. Both the archaic system and the even more archaic backup system were down. The center lost radar and radio contact with airborne planes. The Oakland center covers roughly 18 million square miles. (SEN 20(5), December 1995, p. 12)

In suburban Chicago, the main computer used by air-traffic controllers for the busy midsection of the United States was out of service on July 24, 1995, for the third time in a week. The control system handles almost 10,000 passing flights a day. (SEN 20(5), December 1995, p. 12)

CNN reported ATC primary computers and their backup went down for several hours the morning of September 12, 1995, in Chicago. (SEN 20(5), December 1995, p. 12)

Failure repair can introduce new errors.

Chemical Bank's ATMs were out of commission for more than five hours on July 20, 1994. A routine file update was botched, overloading the computer system. (SEN 19(4), p. 6)

San Francisco has been trying for the past three years to upgrade its 911 system, but computer outages and unanswered calls remain rampant. On October 12, 1995 the dispatch system crashed for over 30 minutes in the midst of a search for an armed suspect. The dispatch system was installed two months ago as a temporary fix to the recurrent problems. Dispatchers are not able to answer between 100 and 200 calls a day. (SEN 21(2), March 1996, p. 19)

Reliability Measures

MTBF = MTTF + MTTR

Mean time between failures = Mean time to failure + Mean time to repair

Some contend that MTBF is a better measure than defects per KLOC (thousand lines of code), because what matters in operation is how often the software fails, not how many defects it contains.

Availability = MTTF / (MTTF + MTTR) * 100%
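
As a worked illustration of the two formulas above, here is a minimal Python sketch. The function names and the sample figures (998 hours to failure, 2 hours to repair) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability_percent(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR) * 100%."""
    return 100.0 * mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
mttf = 998.0  # average hours of operation before a failure
mttr = 2.0    # average hours needed to restore service

print(mtbf(mttf, mttr))                  # 1000.0
print(availability_percent(mttf, mttr))  # 99.8
```

Note that availability depends on repair time as well as on failure frequency: halving MTTR raises availability even if the software fails just as often.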

Reliability Models

1. Models that predict reliability as a function of chronological time.

2. Models that predict reliability as a function of elapsed processing time.

3. Models that are based on estimates of the number of errors (defects) in the software.

These models are based on stochastic processes, usually in the form of a probability distribution.
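
To make "a probability distribution" concrete, the sketch below uses the simplest common choice, an exponential time-to-failure distribution. This is an assumption made for illustration, not a model named in these notes: it presumes a constant failure rate of 1/MTTF, so that the probability of failure-free operation for time t is exp(-t/MTTF). The function name and the 1000-hour MTTF are hypothetical.

```python
import math


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of failure-free operation for t_hours.

    ASSUMPTION (not from the notes): time to failure is exponentially
    distributed, i.e. the failure rate is constant at 1 / MTTF.
    """
    failure_rate = 1.0 / mttf_hours
    return math.exp(-failure_rate * t_hours)


# Hypothetical MTTF of 1000 hours.
for t in (0, 100, 1000):
    print(t, round(reliability(t, 1000.0), 3))
# 0 1.0
# 100 0.905
# 1000 0.368
```

Whether t is measured in chronological time or in elapsed processing time is what separates the first two kinds of model listed above; the arithmetic is the same either way.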

Seeding Model

Seeding models attempt to provide an estimate of the number of defects in a program. A program is randomly seeded with a number of known errors. The program is tested using the standard testing strategies.

The result:

(Found j real errors) / (Total real errors) = (Found k seeded errors) / (Total seeded errors)

Based on these proportions you can estimate the number of errors in the software.

The assumptions:

1) Real and seeded faults have the same distribution.

2) Seeded errors are realistic.
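
Solving the proportion above for the unknown total gives: total real errors = j * (total seeded errors) / k. A minimal Python sketch of that arithmetic follows. The function name and the sample counts are hypothetical, and the estimate is only as good as the two assumptions just listed.

```python
def estimate_total_real_errors(found_real: int,
                               found_seeded: int,
                               total_seeded: int) -> float:
    """Estimate the total number of real errors from a seeding experiment.

    Rearranges  found_real / total_real = found_seeded / total_seeded
    into        total_real = found_real * total_seeded / found_seeded.
    """
    if found_seeded <= 0:
        raise ValueError("no seeded errors found; the proportion is undefined")
    if found_seeded > total_seeded:
        raise ValueError("cannot find more seeded errors than were seeded")
    return found_real * total_seeded / found_seeded


# Hypothetical test campaign: 20 errors were seeded, testing found
# 16 of them along with 40 real errors.
total = estimate_total_real_errors(found_real=40, found_seeded=16, total_seeded=20)
print(total)       # 50.0 real errors estimated in total
print(total - 40)  # 10.0 real errors estimated to remain
```

If testing finds none of the seeded errors, the proportion gives no estimate at all, which is why the sketch raises an exception in that case.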

Error Prone Modules

Finding numerous errors in a piece of software is not cause to celebrate. The probability that further defects remain in the software increases with the number of defects already found.

Clearly, different programmer and designer characteristics lead to differing defect densities in software. If testing reveals a high error rate in a module, we should scrutinize that module even more closely.

Myers, reporting on the OS/370 project, found that 47% of the faults were associated with 4% of the modules.

N-version Programming

Different versions of the software are developed by different groups using the same specification. The versions are executed in parallel and the results are "voted" on.
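
The voting step can be sketched in a few lines of Python. The notes do not say how the vote is decided, so a strict-majority rule is assumed here. The three "versions" are hypothetical stand-ins for independently developed implementations, and they are run one after another for simplicity rather than in parallel.

```python
from collections import Counter


def vote(results):
    """Return the result that a strict majority of versions agree on.

    ASSUMPTION: strict-majority voting; other voting rules are possible.
    """
    if not results:
        raise ValueError("no results to vote on")
    winner, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results: {list(results)}")
    return winner


def run_n_versions(versions, x):
    """Run every version on the same input and vote on the outputs."""
    # Sequential here for simplicity; a real system would run them in parallel.
    return vote([version(x) for version in versions])


# Three hypothetical versions written from the spec "square a number".
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x * 2  # faulty: doubles instead of squaring


print(run_n_versions([version_a, version_b, version_c], 3))  # 9
```

The faulty version is outvoted only because the other two fail independently of it. If two versions share the same misunderstanding of the specification, the vote returns the wrong answer.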

Software Safety

Software safety is a software quality assurance activity that focuses on the identification and assessment of hazards or potential hazards related to the use of software in a particular context. A modeling and analysis process is conducted to identify hazards and categorize them by criticality and risk. Once the hazards are identified, requirements can be specified for the software; that is, undesired events and the desired system response to them can be described.