STATISTICAL METHOD FOR PROFICIENCY TESTING

B.1

GENERAL

 

Proficiency Test results can appear in many forms, spanning a wide range of data types and underlying statistical distributions. The statistical methods used to analyse the results need to be appropriate for each situation, and so are too varied to be specified in this International Standard. ISO 13528 describes preferred specific methods for each of the situations discussed below, but also states that other methods may be used as long as they are statistically valid and are fully described to participants. Some of the methods in ISO 13528, especially for homogeneity and stability testing, are modified slightly in the IUPAC Technical Report “The International Harmonized Protocol for the Proficiency Testing of Analytical Chemistry Laboratories”. These documents also present guidance on design and visual data analysis. Other references may be consulted for specific types of Proficiency Testing Schemes, e.g. measurement comparison schemes for calibration.

 

The methods discussed in this annex and in the reference documents cover the fundamental steps common to nearly all Proficiency Testing Schemes, i.e.

a.     Determination of the assigned value,

b.     Calculation of performance statistics,

c.      Evaluation of performance, and

d.     Preliminary determination of Proficiency Test item homogeneity and stability

With new Proficiency Testing schemes, initial agreement between results is often poor, due to new questions, new forms, artificial test items, poor agreement of test or measurement methods, or variable measurement procedures. Coordinators may have to use robust indicators of relative performance (such as percentiles) until agreement improves. Statistical methods may need to be refined once participant agreement has improved and Proficiency Testing is well established.

This annex does not consider statistical methods for analytical studies other than for treatment of Proficiency Test data. Different methods may be needed to implement the other uses of interlaboratory comparison data listed in the introduction.

B.2

DETERMINATION OF THE ASSIGNED VALUE AND ITS UNCERTAINTY

 

B.2.1

There are various procedures available for the establishment of assigned values. The most common procedures are listed below, in an order that, in most cases, will result in increasing uncertainty for the assigned value. These procedures involve the use of:

a.     Known values – with results determined by specific Proficiency Test item formulation (e.g. manufacture or dilution)

b.     Certified reference values – as determined by definitive test or measurement methods (for quantitative tests);

c.      Reference values – as determined by analysis, measurement or comparison of the Proficiency Test item alongside a reference material or standard, traceable to a national or international standard;

d.     Consensus values from expert participants – experts (which may, in some situations, be reference laboratories) should have demonstrable competence in the determination of the measurements under test, using validated methods known to be highly accurate and comparable to methods in general use;

e.     Consensus values from participants – using statistical methods described in ISO 13528 and the IUPAC International Harmonized Protocol and with consideration of the effect of outliers.

B.2.2

Assigned values should be determined to evaluate participants fairly, yet to encourage agreement among test or measurement methods. This is accomplished through selection of common comparison groups and the use of common assigned values, wherever possible.

B.2.3

Procedures for determining the uncertainty of assigned values are discussed in detail in ISO 13528 and the IUPAC International Harmonized Protocol, for each common statistic used (as mentioned above). Additional information on uncertainty is also provided in ISO/IEC Guide 98–3.

B.2.4

Statistical methods for determining the assigned value for qualitative data (also called “categorical” or “nominal” values), or semi–quantitative values (also called “ordinal” values) are not discussed in ISO 13528 or the IUPAC International Harmonized Protocol. In general, these assigned values need to be determined by expert judgement or manufacture. In some cases, a proficiency testing provider may use a consensus value, as defined by agreement of a predetermined majority percentage of responses (e.g. 80% or more). However, the percentage used should be determined based on objectives for the Proficiency Testing Scheme and the level of competence and experience of the participants.
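For illustration only, the following sketch (in Python) shows a consensus assigned value for qualitative responses of the kind described above; the 80 % agreement threshold and the responses are invented, and in practice the threshold would come from the scheme plan.

```python
# Hypothetical sketch: consensus assigned value for qualitative (categorical)
# responses, accepted only if a predetermined majority (here 80 %) agree.
# The 80 % figure is the example quoted in B.2.4, not a fixed requirement.
from collections import Counter

def consensus_value(responses, min_agreement=0.80):
    """Return the modal response if it reaches the agreement threshold, else None."""
    counts = Counter(responses)
    value, n = counts.most_common(1)[0]
    if n / len(responses) >= min_agreement:
        return value
    return None  # no consensus; the assigned value must come from expert judgement

print(consensus_value(["positive"] * 17 + ["negative"] * 3))  # 'positive' (85 % agreement)
```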

B.2.5

Outliers are treated statistically as described below:

a.     Obvious blunders, such as results reported in incorrect units, with decimal point errors, or for a different Proficiency Test item, should be removed from the data set and treated separately. These results should not be subject to outlier tests or robust statistical methods.

b.     When participants’ results are used to determine assigned values, statistical methods should be in place to minimize the influence of outliers. This can be accomplished with robust statistical methods or by removing outliers prior to calculation. In larger or routine Proficiency Testing schemes, it may be possible to have automated outlier screens, if justified by objective evidence of effectiveness.

c.      If results are removed as outliers, they should be removed only from the calculation of summary statistics. These results should still be evaluated within the Proficiency Testing scheme and be given the appropriate performance evaluation.

Note

ISO 13528 describes a specific robust method for determination of the consensus mean and standard deviation, without the need for outlier removal.
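As a minimal sketch of that kind of robust procedure (the iteration commonly referred to as Algorithm A in ISO 13528), the following example uses the published constants 1,483, 1,5 and 1,134; the data are invented and the current edition of ISO 13528 should be consulted for the authoritative algorithm.

```python
# Minimal sketch of the robust mean/standard deviation iteration described in
# ISO 13528 (commonly referred to as Algorithm A). The constants 1.483, 1.5 and
# 1.134 are the commonly published values; verify them against the current
# edition of the standard before use.
import statistics

def robust_mean_sd(results, tol=1e-6, max_iter=100):
    x_star = statistics.median(results)
    s_star = 1.483 * statistics.median(abs(x - x_star) for x in results)
    for _ in range(max_iter):
        delta = 1.5 * s_star
        # winsorize: pull results lying outside x* +/- delta back to those limits
        w = [min(max(x, x_star - delta), x_star + delta) for x in results]
        new_x = statistics.mean(w)
        new_s = 1.134 * statistics.stdev(w)
        if abs(new_x - x_star) < tol and abs(new_s - s_star) < tol:
            return new_x, new_s
        x_star, s_star = new_x, new_s
    return x_star, s_star

print(robust_mean_sd([10.1, 10.3, 9.8, 10.0, 10.2, 14.9]))  # the outlier has limited influence
```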

B.2.6

Other considerations are outlined below:

a.     Ideally, if assigned values are determined by participant consensus, the Proficiency Testing provider should have a procedure to establish the trueness of the assigned values and for reviewing the distribution of the data.

b.     The proficiency testing provider should have criteria for the acceptability of an assigned value in terms of its uncertainty. In ISO 13528 and in the IUPAC International Harmonized Protocol, criteria are provided that are based on a goal to limit the effect that uncertainty in the assigned value has on the evaluation, i.e. the criteria limit the probability of having a participant receive an unacceptable evaluation because of uncertainty in the assigned value.

B.3

CALCULATION OF PERFORMANCE STATISTICS

 

B.3.1

Performance for Quantitative Results

 

B.3.1.1

Proficiency Test results often need to be transformed into a performance statistic, in order to aid interpretation and to allow comparison with defined objectives. The purpose is to measure the deviation from the assigned value in a manner that allows comparison with performance criteria. Statistical methods may range from no processing required to complex statistical transformations.

B.3.1.2

Performance statistics should be meaningful to participants. Therefore, statistics should be appropriate for the relevant tests and be well understood or traditional within a particular field.

B.3.1.3

Commonly used statistics for quantitative results are listed below, in order of increasing degree of transformation of participants’ results. A short computational sketch of these statistics follows the notes at the end of the list.

a.     The difference, D, is calculated using Equation (B.1):

D = (x – X)

where

x is the participant’s result;

X is the assigned value.

b.     The percent difference, D%, is calculated using equation (B.2):

D% = 100 × (x – X)/X

c.      The z scores are calculated using Equation (B.3)

z = (x – X)/δ

where δ is the standard deviation for Proficiency Assessment, which may be determined on the basis of:

-        a fitness for purpose goal for performance, as determined by expert judgement or regulatory mandate (prescribed value);

-        an estimate from previous rounds of Proficiency Testing or expectations based on experience (by perception);

-        an estimate from a statistical model (general model);

-        the results of a precision experiment; or

-        participant results, i.e. a traditional or robust standard deviation based on participant results.

d.     The zeta score, ζ, is calculated using Equation (B.4); the calculation is very similar to that of the En number (see e) below), except that standard uncertainties are used rather than expanded uncertainties. This allows the same interpretation as for traditional z scores.

ζ = (x – X)/√(ulab² + uav²)

where

ulab is the combined standard uncertainty of a participant’s result;

uav is the standard uncertainty of the assigned value.

e.     En numbers are calculated using Equation (B.5):

En = (x – X)/√(Ulab² + Uref²)

where

Ulab is the expanded uncertainty of a participant’s result;

Uref is the expanded uncertainty of the reference laboratory’s assigned value.

Note 1

The formulae in Equations (B.4) and (B.5) are correct only if x and X are independent.

Note 2

For additional statistical approaches, see ISO 13528 and the IUPAC International Harmonized Protocol.
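The sketch below simply evaluates Equations (B.1) to (B.5) for one illustrative result; all numeric values are invented, and the variable names mirror the symbols defined above.

```python
# Illustrative computation of the statistics in Equations (B.1) to (B.5).
# Variable names (x, X, sigma_pt for the delta in B.3, u_lab, u_av, U_lab, U_ref)
# follow the text; the numeric values are invented for the example.
from math import sqrt

x, X = 10.45, 10.00          # participant result and assigned value
sigma_pt = 0.20              # standard deviation for proficiency assessment
u_lab, u_av = 0.15, 0.05     # standard uncertainties (participant, assigned value)
U_lab, U_ref = 0.30, 0.10    # expanded uncertainties (participant, reference)

D = x - X                                    # Equation (B.1)
D_pct = 100 * (x - X) / X                    # Equation (B.2)
z = (x - X) / sigma_pt                       # Equation (B.3)
zeta = (x - X) / sqrt(u_lab**2 + u_av**2)    # Equation (B.4)
E_n = (x - X) / sqrt(U_lab**2 + U_ref**2)    # Equation (B.5)

print(f"D={D:.3f}  D%={D_pct:.2f}  z={z:.2f}  zeta={zeta:.2f}  En={E_n:.2f}")
```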

B.3.1.4

The aspects below should be taken into consideration:

a.     The simple difference between the participant’s result and the assigned value may be adequate to determine performance, and is most easily understood by participants. The quantity (x – X) is called “estimate of laboratory bias” in ISO 5725–4 and ISO 13528.

b.     The percent difference is independent of the magnitude of the assigned value and is well understood by participants.

c.      Percentiles or ranks are useful for highly dispersed or skewed results, ordinal responses, or when there are a limited number of different responses. This method should be used with caution.

d.     Transformed results may be preferred, or necessary, depending on the nature of the test. For example, dilution–based results are a form of geometric scale, transformable by logarithms.

e.     If consensus is used to determine δ, the estimate of variability should be reliable, i.e. based on enough observations to reduce the influence of outliers and achieve sufficiently low uncertainty.

f.      If scores consider the participant’s reported estimates of measurement uncertainty (e.g. with En scores or zeta scores), these will only be meaningful if the uncertainty estimates are determined in a consistent manner by all participants, such as in accordance with the principles in ISO/IEC Guide 98-3.

B.3.2

Performance for Qualitative and Semi–quantitative Results

 

B.3.2.1

For qualitative or semi–quantitative results, if statistical methods are used, they must be appropriate for the nature of the responses. For qualitative data (also called “categorical” data), the appropriate technique is to compare a participant’s result with the assigned value. If they are identical, then performance is acceptable. If they are not identical, then expert judgement is needed to determine if the result is fit for its intended use. In some situations, the proficiency testing provider may review the results from participants and determine that a proficiency testing item was not suitable for evaluation, or that the assigned value was not correct. These determinations should be part of the plan for the scheme and understood by the participants in advance of the operation of the scheme.

B.3.2.2

For semi–quantitative results (also called “ordinal” results), the techniques used for qualitative data (B.3.2.1) are appropriate. Ordinal results include, for example, responses such as grades or rankings, sensory evaluations, or strength of a chemical reaction (e.g. 1+, 2+, 3+, etc.). Sometimes these responses are given as numbers, e.g. 1 = Poor, 2 = Unsatisfactory, 3 = Satisfactory, 4 = Good, 5 = Very Good.

B.3.2.3

It is not appropriate to calculate usual summary statistics for ordinal data, even if the results are numerical. This is because the numbers are not on an interval scale, i.e., the difference between 1 and 2, in some objective sense, may not be the same as the difference between 3 and 4, so averages and standard deviations cannot be interpreted. Therefore, it is not appropriate to use evaluation statistics such as z scores for semi–quantitative results. Specific statistics, such as rank or order statistics, designed for ordinal data, should be used.

B.3.2.4

It is appropriate to list the distribution of results from all participants (or produce a graph), along with the number or percentage of results in each category, and to provide summary measures, such as the mode(s) (most common responses) and range (lowest and highest response). It may also be appropriate to evaluate results as acceptable based on closeness to the assigned value, e.g. results within plus or minus one response from the assigned value may be fit for the purpose of the measurement. In some situations, it may be appropriate to evaluate performance based on percentiles, e.g. the 5% of results farthest from the mode or farthest from the assigned value may be determined to be unacceptable. This should be based on the Proficiency Testing scheme plan (i.e. fitness for purpose) and understood by participants in advance.
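The sketch below illustrates these summaries for an invented set of ordinal responses; the “within plus or minus one response” acceptance rule is only the example mentioned above, not a general requirement.

```python
# Sketch of the summaries suggested in B.3.2.4 for ordinal results: distribution,
# mode(s), range, and an illustrative "within +/- 1 category" acceptance rule.
# The acceptance rule is an example only; the scheme plan defines the real criterion.
from collections import Counter

responses = [3, 4, 3, 2, 3, 4, 5, 3, 4, 1]   # e.g. 1 = Poor ... 5 = Very Good
assigned = 3

counts = Counter(responses)
max_count = max(counts.values())
modes = sorted(v for v, n in counts.items() if n == max_count)

print("distribution:", dict(sorted(counts.items())))
print("mode(s):", modes, " range:", (min(responses), max(responses)))
for r in responses:
    print(r, "acceptable" if abs(r - assigned) <= 1 else "unacceptable")
```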

B.3.3

Combined Performance Scores

 

Performance may be evaluated on the basis of more than one result in a single Proficiency Testing round. This occurs when there is more than one Proficiency Test item for a particular measurement, or a family of related measurements. This would be done to provide a more comprehensive evaluation of performance.

Graphical methods, such as the Youden Plot or a plot showing Mandel’s h–statistics, are effective tools for interpreting performance (ISO 13528).

In general, averaged performance scores are discouraged because they can mask poor performance on one or more Proficiency Test items that should be investigated. The most common combined performance score is simply the number (or percentage) of results determined to be acceptable.
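A minimal sketch of such a combined score, assuming the per-item evaluations have already been made by one of the methods described earlier:

```python
# Trivial sketch of the combined score mentioned above: the number and
# percentage of results in a round judged acceptable. The evaluations are
# assumed to come from the per-item scoring already performed.
evaluations = ["acceptable", "acceptable", "unacceptable", "acceptable"]
n_ok = sum(e == "acceptable" for e in evaluations)
print(f"{n_ok} of {len(evaluations)} acceptable ({100 * n_ok / len(evaluations):.0f} %)")
```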

B.4

EVALUATION OF PERFORMANCE

 

B.4.1

Initial Performance

 

 

B.4.1.1

Criteria for performance evaluation should be established after taking into account whether the performance measure involves any of the following features:

a.     Expert consensus, where the advisory group, or other qualified experts, directly determine whether reported results are fit for their intended purposes; agreement of experts is the typical way to assess results for qualitative tests;

b.     Fitness for purpose, predetermined criteria that consider, for example, method performance specifications and participant’s recognized level of operation;

c.      Statistical determination for scores, i.e. where criteria should be appropriate for each score; common examples of the application of scores are as follows (a short classification sketch follows these examples):

1.     For z scores and zeta scores (for simplicity, only “z” is indicated in the examples below, but “ζ” may be substituted for “z” in each case):

-        |z| ≤ 2,0 – indicates “satisfactory” performance and generates no signal;

-        2,0 < |z| < 3,0 – indicates “questionable” performance and generates a warning signal;

-        |z| ≥ 3,0 – indicates “unsatisfactory” performance and generates an action signal;

2.     For En numbers:

-        |En| ≤ 1,0 – indicates “satisfactory” performance and generates no signal;

-        |En| > 1,0 – indicates “unsatisfactory” performance and generates an action signal.
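The classification sketch referred to in c) above applies the conventional 2,0/3,0 limits for z (or ζ) scores and the 1,0 limit for En numbers; the example values are invented.

```python
# Sketch of the conventional score interpretation in B.4.1.1 c): the 2,0 / 3,0
# limits for z (or zeta) scores and the 1,0 limit for En numbers.
def classify_z(z):
    if abs(z) <= 2.0:
        return "satisfactory"            # no signal
    if abs(z) < 3.0:
        return "questionable"            # warning signal
    return "unsatisfactory"              # action signal

def classify_en(en):
    return "satisfactory" if abs(en) <= 1.0 else "unsatisfactory"

print(classify_z(2.25), classify_en(0.9))   # questionable satisfactory
```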

B.4.1.2

For split–sample designs, an objective may be to identify inadequate calibration or large random fluctuations in results, or both. In these cases, evaluations should be based on an adequate number of results and across a wide range of concentrations. Graphical presentations are useful for identifying and describing these problems, and are described in ISO 13528. These graphs should use differences between results on the vertical axis, rather than plots of results from one participant versus another, because of problems of scale. One key consideration is whether results from one of the participants have, or can be expected to have, lower measurement uncertainty. In this case, those results are the best estimate of the actual level of measurement. If both participants have approximately the same measurement uncertainty, the average of the pair of results is the preferred estimate of the actual level.
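As a rough illustration of the data preparation for such a graph (not the procedure prescribed by ISO 13528), the sketch below computes the pairwise differences and the level estimates against which they would be plotted; the choice of the lower-uncertainty participant’s results as the level estimate is an assumption, and the data are invented.

```python
# Sketch for a split-sample comparison as outlined in B.4.1.2: the difference
# between the two participants' results (vertical axis) would be plotted against
# an estimate of the level for each sample. Assumption: use the lower-uncertainty
# participant's result as the level estimate if one exists, else the pair mean.
results_a = [5.1, 12.4, 20.2, 33.0]   # participant A (assumed lower uncertainty)
results_b = [5.4, 12.1, 21.0, 34.1]   # participant B

a_has_lower_uncertainty = True
levels = results_a if a_has_lower_uncertainty else [
    (a + b) / 2 for a, b in zip(results_a, results_b)]
differences = [b - a for a, b in zip(results_a, results_b)]

for lvl, d in zip(levels, differences):
    print(f"level {lvl:6.2f}  difference {d:+.2f}")
```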

B.4.1.3

Graphs should be used whenever possible to show performance (e.g. histograms, error bar charts, ordered z score charts), as described in ISO 13528 and the IUPAC International Harmonized Protocol. These charts can be used to show:

a.     Distribution of participant values;

b.     Relationship between results on multiple Proficiency Test items;

c.      Comparative distributions for different methods

B.4.2

Monitoring Performance over time

 

B.4.2.1

A Proficiency Testing scheme can include procedures to monitor performance over time. The procedures should allow participants to see the variability in their performance, whether there are general trends or inconsistencies, and whether the performance varies randomly.

B.4.2.2

Graphical methods should be used to facilitate interpretation by a wider variety of readers. Traditional “Shewhart” control charts are useful, particularly for self-improvement purposes. Data listings and summary statistics allow more detailed review. Standardized performance scores used to evaluate performance, such as the z score, should be used for these graphs and tables. ISO 13528 presents additional examples and graphical tools.
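As a simple text substitute for such a chart, the sketch below flags standardized scores from successive rounds against warning and action limits at |z| = 2,0 and |z| = 3,0; the scores are invented, and plotting them round by round gives the kind of control chart described in ISO 13528.

```python
# Sketch of monitoring standardized scores over successive rounds (B.4.2.2):
# a text substitute for a Shewhart-style chart, flagging warning and action
# limits at |z| = 2 and |z| = 3.
z_by_round = {1: 0.4, 2: -1.1, 3: 2.3, 4: 0.8, 5: -3.2}

for rnd, z in z_by_round.items():
    flag = "action" if abs(z) >= 3.0 else "warning" if abs(z) > 2.0 else ""
    print(f"round {rnd}: z = {z:+.1f} {flag}")
```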

B.4.2.3

Where a consensus standard deviation is used as the standard deviation for Proficiency Testing, caution should be taken when monitoring performance over time, as the participant group can change, which can have unknown effects on the scores. It is also common for the interlaboratory standard deviation to decrease over time, as participants become familiar with the Proficiency Testing scheme or as methodology improves. This could cause an apparent increase in z scores even when a participant’s individual performance has not changed.

B.5

DEMONSTRATION OF PROFICIENCY TEST ITEM HOMOGENEITY AND STABILITY

 

B.5.1

The requirements of this International Standard call for a demonstration of “sufficient homogeneity” with valid statistical methods, including a statistically random selection of a representative number of samples. Procedures for this are detailed in ISO 13528 and the IUPAC International Harmonized Protocol. These documents define “sufficient homogeneity” relative to the evaluation interval for the Proficiency Testing scheme, and so the recommendations are based on allowances for uncertainty due to inhomogeneity relative to the evaluation interval. While ISO 13528 places a strict limit on inhomogeneity and instability to limit the effect on uncertainty and therefore the effect it has on the evaluation, the IUPAC International Harmonized Protocol expands the criteria to allow a statistical test of the estimate of inhomogeneity and instability, relative to the same criterion recommended in ISO 13528.
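For illustration, the sketch below estimates the between-item standard deviation from duplicate measurements by one-way analysis of variance and compares it with the commonly cited allowance of 0,3 times the standard deviation for proficiency assessment; the data and the allowance as applied here are assumptions, and ISO 13528 and the IUPAC protocol remain the authoritative procedures (including their treatment of sampling uncertainty in the estimate).

```python
# Sketch of a basic homogeneity assessment on proficiency test items measured
# in duplicate. It estimates the between-item standard deviation by one-way
# ANOVA and compares it with the commonly cited 0.3 * sigma_pt allowance.
from statistics import mean
from math import sqrt

duplicates = [(10.02, 10.05), (9.98, 10.01), (10.04, 10.00),
              (10.03, 10.06), (9.99, 10.02), (10.01, 10.03)]
sigma_pt = 0.10   # standard deviation for proficiency assessment (invented)

item_means = [mean(pair) for pair in duplicates]
grand_mean = mean(item_means)
ms_between = 2 * sum((m - grand_mean) ** 2 for m in item_means) / (len(duplicates) - 1)
ms_within = sum((a - b) ** 2 / 2 for a, b in duplicates) / len(duplicates)
s_between = sqrt(max(0.0, (ms_between - ms_within) / 2))

print(f"s_between = {s_between:.4f}, limit = {0.3 * sigma_pt:.4f}",
      "-> sufficiently homogeneous" if s_between <= 0.3 * sigma_pt else "-> investigate")
```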

B.5.2

These needs differ from the requirements in ISO Guide 34 and ISO Guide 35, which concern the determination of reference values for certified reference materials, including their uncertainties. ISO Guide 35 uses statistical analysis of variance to estimate the “bottle–to–bottle” variability and “within–bottle” variability (as appropriate), and subsequently uses those variances as components of the uncertainty of the assigned value. Given the need to estimate components accurately for certified reference materials, the number of randomly selected samples may exceed what is needed for Proficiency Testing, where the main objective is to check for unexpected inconsistencies in batches of manufactured proficiency test items.

B.5.3

Stability is normally checked to ensure that the measurements did not change during the course of the round. As specified in ISO 13528, the IUPAC International Harmonized Protocol and ISO Guide 35, proficiency test items should be tested under the variety of conditions that occur in the normal operation of a proficiency testing scheme, e.g. conditions of shipping and handling when distributed to participants. The criterion for acceptable instability is the same as the criterion for inhomogeneity in ISO 13528, although typically with fewer tests or measurements.
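A correspondingly simple stability sketch, comparing means measured before and after the round against the same assumed 0,3 × σpt allowance used in the homogeneity sketch above (the data and the allowance as applied here are assumptions):

```python
# Sketch of a simple stability check (B.5.3): compare the mean of measurements
# made before distribution with the mean made after the round, using the same
# assumed 0.3 * sigma_pt allowance as for homogeneity.
from statistics import mean

before = [10.02, 10.05, 10.01]
after = [10.03, 10.05, 10.02]
sigma_pt = 0.10

drift = abs(mean(after) - mean(before))
print(f"drift = {drift:.4f}, limit = {0.3 * sigma_pt:.4f}",
      "-> adequately stable" if drift <= 0.3 * sigma_pt else "-> investigate")
```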

SELECTION AND USE OF PROFICIENCY TESTING

C.1

GENERAL

 

This annex establishes principles for the selection and use of Proficiency Testing schemes by participants and other interested parties. This annex is also intended to promote the harmonized use of Proficiency Testing schemes by interested parties (e.g. accreditation bodies, regulatory bodies, or customers of the participant). Since results from Proficiency Testing schemes may be used in the evaluation of a participant’s performance, it is important that both the interested parties and participants have confidence in the development and operation of the Proficiency Testing schemes.

It is also important for participants to have a clear understanding of the policies of the interested parties for participation in such Proficiency Testing schemes, and their policies and procedures for following up any unsatisfactory results from a proficiency test round. However, apart from specific requirements from regulatory bodies, it is the responsibility of the participants themselves to select the appropriate proficiency testing scheme and to evaluate their results correctly.

It should be recognized, however, that interested parties also take into account the suitability of test data produced from activities other than Proficiency Testing schemes, including, for example, results of a participant’s own internal quality control procedures with control samples, comparison with split–sample data from other participants and performance on tests of certified reference materials. Therefore, when selecting a Proficiency Testing scheme, the participant should take into consideration the other quality control activities that are available or have already been performed.

C.2

SELECTION OF PROFICIENCY TESTING SCHEMES

 

C.2.1

Laboratories (and other types of participants) need to select Proficiency Testing schemes that are appropriate for their scope of testing or scope of calibration. The Proficiency Testing schemes selected should comply with the requirements of this International Standard.

C.2.2

In selecting a Proficiency Testing scheme, the following factors should be considered:

a.     The tests, measurements or calibrations involved should match the types of tests, measurements or calibrations performed by the participants;

b.     The availability to interested parties of details about the scheme design, procedures for establishment of assigned values, instructions to participants, statistical treatment of data, and the Final Summary Report;

c.      The frequency at which the Proficiency Testing scheme is operated;

d.     The suitability of the organizational logistics for the Proficiency Testing scheme (e.g. timing, location, sample stability considerations, distribution arrangements) relevant to the group of participants proposed for the Proficiency Testing scheme;

e.     The suitability of acceptance criteria (i.e. for judging successful performance in the Proficiency Test);

f.      The costs;

g.     The Proficiency Testing provider’s policy on maintaining participant’s confidentiality;

h.     The timescale for reporting of results and for analysis of performance data;

i.       The characteristics that instil confidence in the suitability of Proficiency Test items (e.g. homogeneity, stability, and, where appropriate, metrological traceability to national or international standards);

j.       Its conformance with this International Standard

Note

Some Proficiency Testing schemes can include tests which are not an exact match for the tests performed by the participants (e.g. the use of a different national standard for the same determination), but it can still be technically justified to participate in the Proficiency Testing scheme if the treatment of the data allows for consideration of any significant differences in test methodology or other factors.

C.3

POLICIES ON PARTICIPATION IN PROFICIENCY TESTING SCHEMES

 

C.3.1

If relevant, interested parties should document their policies for participation in Proficiency Testing schemes; such documented policies should be publicly available to laboratories and other interested parties.

C.3.2

Issues which should be addressed in participation policies for specific Proficiency Testing schemes include:

a.     Whether participation in specific Proficiency Testing schemes is mandatory or voluntary;

b.     The frequency of participation;

c.      The criteria used by the interested party to judge satisfactory or unsatisfactory performance;

d.     Whether participants may be required to participate in follow-up Proficiency Testing schemes if performance is judged to be unsatisfactory;

e.     How the results of Proficiency Testing will be used in the evaluation of performance and subsequent decisions;

f.      Details of the interested party’s policy on preserving participant’s confidentiality.

C.4

USE OF PROFICIENCY TESTING BY PARTICIPANTS

 

C.4.1

Participants should draw their own conclusions about their performance from an evaluation of the organization and design of the Proficiency Testing scheme. Reviews should consider the relation between the Proficiency Testing scheme and the needs of the participant’s customers. The information that should be taken into consideration includes:

a.     The origin and character of Proficiency Test items;

b.     The test and measurement methods used and, where possible, the assigned values for particular test or measurement methods;

c.      The organization of the Proficiency Testing scheme (e.g. the statistical design, the number of replicates, the measurements, the manner of execution);

d.     The criteria used by the Proficiency Testing provider to evaluate the participant’s performance;

e.     Any relevant regulatory, accreditation or other requirements

C.4.2

Participants should maintain their own records of performance in Proficiency Testing, including the outcomes of investigations of any unsatisfactory results and any subsequent corrective or preventive actions.

C.5

USE OF RESULTS BY INTERESTED PARTIES

 

C.5.1

Accreditation Bodies

 

C.5.1.1

The requirements for an accreditation body with regard to the use of Proficiency Testing are specified in ISO/IEC 17011:2004, 7.15.

Note

Further policies on Proficiency Testing relevant to the compliance of accreditation bodies with requirements for membership in the ILAC mutual recognition arrangement are specified in ILAC P–9.

C.5.1.2

The results from Proficiency Testing schemes are useful for both participants and accreditation bodies. There are, however, limitations on the use of such results to determine competence. Successful performance in a specific Proficiency Testing scheme may represent evidence of competence for that exercise, but may not reflect ongoing competence; likewise, an isolated unsatisfactory result may reflect only a random departure from a participant's normal state of competence. It is for these reasons that Proficiency Testing should not be the only tool used by accreditation bodies in their accreditation process.

C.5.1.3

For participants reporting unsatisfactory results, the accreditation bodies should have policies to

a.     Ensure that the participants investigate and comment on their performance within an agreed time–frame, and take appropriate corrective action,

b.     (where necessary) ensure that the participants undertake any subsequent Proficiency Testing to confirm that any corrective actions taken by them are effective, and

c.      (where necessary) ensure that on–site evaluation of the participants is carried out by appropriate technical assessors to confirm that corrective actions are effective.

C.5.1.4

The accreditation bodies should advise their accredited bodies of the possible outcomes of unsatisfactory performance in a Proficiency Testing scheme. These may range from continuing accreditation subject to successful attention to corrective actions within agreed timeframes, temporary suspension of accreditation for the relevant tests (subject to corrective action), through to withdrawal of accreditation for the relevant tests.

C.5.1.5

The accreditation bodies should have policies for feedback from accredited bodies relating to action taken on the basis of results of Proficiency Testing schemes, particularly for unsatisfactory performance.

C.5.2

Other interested parties

 

C.5.2.1

Participants may need to demonstrate their competence to other interested parties, such as customers or in subcontracting arrangements. Proficiency Testing results, as well as other quality control activities, can be used to demonstrate competence, although this should not be the only activity.

Note

Proficiency testing data used to validate claims of competence are normally used by organizations in conjunction with other evidence, such as accreditation. See C.5.1.2.

C.5.2.2

It is the responsibility of the participants to ensure that they have provided all the appropriate information to interested parties wishing to evaluate the participants’ competence.

C.6

USE OF PROFICIENCY TESTING BY REGULATORY BODIES

 

C.6.1

The results from Proficiency Testing schemes are useful for regulatory bodies that need to evaluate the performance of participants covered by regulations.

C.6.2

If the Proficiency Testing scheme is operated by a regulatory body, it should be operated in accordance with the requirements of this International Standard.

C.6.3

Regulatory bodies that use independent Proficiency Testing providers should

a.     Seek documentary evidence that the Proficiency Testing schemes comply with the requirements of this International Standard before recognizing the Proficiency Testing scheme, and

b.     Discuss with the participants the scope and operational parameters of the Proficiency Testing schemes, in order that the participants’ performance may be judged adequately in relation to the regulations.

 
