
Capers Jones on The Mess of Software Metrics

What follows are selected excerpts from Capers Jones' whitepaper, The Mess of Software Metrics:


The software industry is one of the largest, wealthiest, and most important industries in the modern world.  The software industry is also troubled by very poor quality and very high cost structures due to the expense of software development, maintenance, and endemic problems with poor quality control.

Accurate measurements of software development and maintenance costs and accurate measurement of quality would be extremely valuable.  But as of 2014 the software industry labors under a variety of non-standard and highly inaccurate measures compounded by very sloppy measurement practices.  For that matter, there is little empirical data about the efficacy of software standards themselves.

The industry also lacks effective basic definitions for “software productivity” and “software quality” and uses a variety of ambiguous definitions that are difficult to predict before software is released and difficult to measure after the software is released.  This paper suggests definitions for both economic software productivity and software quality that are both predictable and measurable.


The software industry has become one of the largest and most successful industries in history.  However, software applications are among the most expensive and error-prone manufactured objects in history.

Software needs a careful analysis of economic factors and much better quality control than is normally accomplished.  In order to achieve these goals, software also needs accurate and reliable metrics and good measurement practices.  Unfortunately, as of 2014 the software industry lacks both.

This short paper deals with some of the most glaring problems of software metrics and suggests a metrics and measurement suite that can actually explore software economics and software quality with precision.  The suggested metrics can be predicted prior to development and then measured after release.

Defining Software Productivity and Software Quality

For more than 200 years the standard economic definition of productivity has been, “Goods or services produced per unit of labor or expense.”  This definition is used in all industries, but has been hard to use in the software industry.  For software there is ambiguity in what constitutes our “goods or services.”

The oldest unit for software “goods” was a “line of code” or LOC.  More recently software goods have been defined as “function points.”   Even more recent definitions of goods include “story points” and “use case points.”   The pros and cons of these units have been discussed and some will be illustrated in the appendices. 

Another important topic taken from manufacturing economics has a big impact on software productivity that is not yet well understood even in 2014: fixed costs.

A basic law of manufacturing economics that is valid for all industries including software is the following:  “When a development process has a high percentage of fixed costs, and there is a decline in the number of units produced, the cost per unit will go up.”

When a “line of code” is selected as the manufacturing unit and there is a switch from a low-level language such as assembly to a high-level language such as Java, there will be a reduction in the number of units developed.

But the non-code tasks of requirements and design act like fixed costs.  Therefore the cost per line of code will go up for high-level languages.  This means that LOC is not a valid metric for measuring economic productivity as proven in Appendix B.
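The fixed-cost effect described above can be sketched with a few lines of arithmetic. The figures below (400 hours of requirements and design, 0.1 hours of coding per line, and the two program sizes) are hypothetical values chosen for illustration and are not taken from the whitepaper or its appendices.

```python
# Hypothetical figures, assumed only to illustrate the fixed-cost effect.
FIXED_HOURS = 400.0    # requirements + design; the same for both versions
HOURS_PER_LOC = 0.1    # coding effort per line; assumed equal for both


def total_hours(loc: int) -> float:
    """Fixed (non-code) work plus coding work for a program of `loc` lines."""
    return FIXED_HOURS + HOURS_PER_LOC * loc


def hours_per_loc(loc: int) -> float:
    """Apparent unit cost when a line of code is the manufacturing unit."""
    return total_hours(loc) / loc


assembly_loc = 10_000  # low-level version of the application
java_loc = 2_000       # high-level version of the same application

for name, loc in (("assembly", assembly_loc), ("Java", java_loc)):
    print(f"{name:8s} total = {total_hours(loc):7.1f} h, "
          f"per LOC = {hours_per_loc(loc):.3f} h")
```

Under these assumptions the Java version needs far fewer total hours, yet its cost per line is higher, because the same fixed hours are spread over fewer lines.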

For software there are two definitions of productivity that match standard economic concepts:

  1. Producing a specific quantity of deliverable units for the lowest number of work hours.
  2. Producing the largest number of deliverable units in a standard work period such as an hour, month, or year.

In definition 1 deliverable goods are constant and work hours are variable.

In definition 2 deliverable goods are variable and work periods are constant.

The common metric “work hours per function point” is a good example of productivity definition 1.  The metrics “function points per month” and “lines of code per month” are examples of definition 2. 

However, for “lines of code” the fixed costs of requirements and design will cause apparent productivity to be reversed, with low-level languages seeming better than high-level languages, as shown by the 79 languages listed in Appendix B.

Definition 2 will also encounter the fact that the number of work hours per month varies widely from country to country.  For example, India works 190 hours per month while the Netherlands works only 116 hours per month.  This means that productivity definitions 1 and 2 will not give the same comparison across countries.  A given number of work hours would take fewer calendar months in India than in the Netherlands due to the larger number of monthly work hours.
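As a rough sketch of how the two definitions diverge, the snippet below applies both to one project. The monthly work hours (190 and 116) come from the text above; the project size of 100 function points and the effort of 2,204 work hours are hypothetical values picked so the arithmetic is easy to follow.

```python
# Monthly work hours quoted in the text above.
HOURS_PER_MONTH = {"India": 190, "Netherlands": 116}

# Hypothetical project, assumed only for illustration.
FUNCTION_POINTS = 100
WORK_HOURS = 2_204


def hours_per_function_point(work_hours: float, function_points: float) -> float:
    """Definition 1: deliverable is constant, work hours are variable."""
    return work_hours / function_points


def function_points_per_month(function_points: float, work_hours: float,
                              hours_per_month: float) -> float:
    """Definition 2: work period is constant, deliverable is variable."""
    calendar_months = work_hours / hours_per_month
    return function_points / calendar_months


def1 = hours_per_function_point(WORK_HOURS, FUNCTION_POINTS)
print(f"Definition 1: {def1:.2f} work hours per function point in any country")

def2 = {}
for country, monthly_hours in HOURS_PER_MONTH.items():
    def2[country] = function_points_per_month(
        FUNCTION_POINTS, WORK_HOURS, monthly_hours)
    months = WORK_HOURS / monthly_hours
    print(f"Definition 2, {country}: {months:.1f} calendar months, "
          f"{def2[country]:.2f} function points per month")
```

With these inputs the same effort spans 11.6 calendar months in India and 19 in the Netherlands, so definition 1 reports identical productivity while definition 2 reports roughly 8.6 versus 5.3 function points per month.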

The quality standard ISO/IEC 9126 includes a list of words such as portability, maintainability, and reliability.  It is astonishing that there is no discussion of defects or bugs.  Worse, the ISO/IEC definitions are not quantified, are almost impossible to predict before development, and are not easy to measure after release.  It is obvious that an effective quality measure needs to be predictable, measurable, and quantifiable.

An effective definition for software quality that can be both predicted before applications are built and then measured after applications are delivered is:  “Software quality is the absence of defects which would either cause the application to stop working, or cause it to produce incorrect results.” 

This definition has the advantage of being applicable to all software deliverables including requirements, architecture, design, code, documents, and even test cases.

If software quality focuses on the prevention or elimination of defects, there are some effective corollary metrics that are quite useful.

The “defect potential” of a software application is defined as the sum total of bugs or defects that are likely to be found in requirements, architecture, design, source code, and documents, plus “bad fixes” or secondary bugs found in bug repairs themselves.  The “defect potential” metric originated at IBM circa 1973 and is fairly widely used among technology companies.

The “defect detection efficiency” (DDE) is the percentage of bugs found prior to release of the software to customers.

The “defect removal efficiency” (DRE) is the percentage of bugs found and repaired prior to release of the software to customers.
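The difference between the two percentages is only whether a bug was found or found and repaired. A minimal sketch, using hypothetical counts (100 total defects, 95 found before release, 90 of those repaired before release):

```python
def percentage(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Hypothetical counts, assumed only for illustration.
total_defects = 100            # all defects eventually known for the release
found_before_release = 95      # detected by the development organization
repaired_before_release = 90   # detected and also fixed before release

dde = percentage(found_before_release, total_defects)
dre = percentage(repaired_before_release, total_defects)

print(f"DDE = {dde:.1f}%  DRE = {dre:.1f}%")
```

Because a bug has to be found before it can be repaired, DRE computed this way can never exceed DDE.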

DDE and DRE were developed at IBM circa 1973 and are now widely used by technology companies in every country.  As of 2014 the average DRE for the United States is just over 90%.

(DRE is normally measured by comparing internal bugs against customer reported bugs for the first 90 days of use.  If developers found 90 bugs and users reported 10 bugs, the total is 100 bugs and DRE would be 90%.)
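The 90-day measurement convention can be sketched as follows. The function name, the sample dates, and the choice to count reports from the release day up to but not including day 90 are assumptions made for this illustration; the text does not specify how the window boundary is handled.

```python
from datetime import date, timedelta


def measured_dre(internal_defects: int, release_date: date,
                 customer_report_dates: list[date],
                 window_days: int = 90) -> float:
    """DRE (%) from internal bugs and customer bugs reported in the window."""
    cutoff = release_date + timedelta(days=window_days)
    # Assumption: the window is [release_date, release_date + window_days).
    customer_defects = sum(
        1 for d in customer_report_dates if release_date <= d < cutoff)
    total = internal_defects + customer_defects
    if total == 0:
        raise ValueError("no defects recorded")
    return 100.0 * internal_defects / total


# Hypothetical data: 90 internal bugs, 10 customer reports inside the window,
# and 2 later reports that fall outside it.
release = date(2014, 1, 1)
reports = [release + timedelta(days=5 * i) for i in range(10)]
reports += [release + timedelta(days=120), release + timedelta(days=200)]

dre = measured_dre(90, release, reports)
print(f"DRE = {dre:.1f}%")
```

This reproduces the example above: 90 internal bugs and 10 customer-reported bugs within the window give a DRE of 90%, and the two late reports do not change the result.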
