Replacing missing data

When questionnaires are self-administered, some respondents occasionally leave one or two questions unanswered (blank) for one reason or another, so the data contain missing values. Yet deriving the 15D score requires a response to every question (dimension). This raises the question of how to deal with missing data. The best approaches are as follows:

1) First check whether your SPSS license includes the Missing Values MVA procedure (Multiple imputation). If so, contact harri.sintonen@helsinki.fi to obtain a combined valuation and replacement algorithm entitled 15D_VALUE_REPLACE 3 MISSING VALUES. This script assumes that, in addition to the 15D raw data, your data file also includes the variables AGE and GENDER under exactly these names. The names of the 15D dimensions must be MOVE, SEE, HEAR, BREATH, SLEEP, EAT, SPEECH, EXCRET, UACT, MENTAL, DISCO, DEPR, DISTR, VITAL and SEX (lowercase names are also OK). With the data file and script open and active, run the script (Run => All). It replaces up to 3 missing values per observation (case) on any of the dimensions and produces the dimension level values move1, see1, ..., sex1 on a 0-1 scale (e.g. for drawing 15D profiles for a group with mean values) and the overall 15D score (D15score) on a 0-1 scale. If there are more than 3 missing values per observation, the 15D score is not calculated but is left missing. The imputation is based on the regression method of the MVA procedure. The text sections of the script explain what is going on in its different parts.

2) If your SPSS license does not cover the Missing Values MVA procedure (Multiple imputation), request from harri.sintonen@helsinki.fi another combined valuation and replacement algorithm entitled 15D_VALUE_REPLACE 1 MISSING VALUE. This long script can replace only one missing value per observation (case) on any of the dimensions in a single run. The algorithm requires that your data file have the same variables and variable names as indicated above. With the data file and script open and active, run the script (Run => All). It replaces any single missing value per observation (case) on any of the dimensions and produces the dimension level values move1, see1, ..., sex1 on a 0-1 scale (e.g. for drawing 15D profiles for a group with mean values) and the overall 15D score (D15score) on a 0-1 scale. If there is more than one missing value per observation, the 15D score is not calculated but is left missing.

If the long script leaves the 15D score missing, the most likely reason is that the regression analysis could not calculate a prediction for the observation because it has missing data on one or more independent variables. If you would like to have up to three missing values replaced, you can proceed as follows:

Remove the variable(s) with missing data from the list of independent variables and run the regression and replacement procedure again. Since different observations may have missing data on different independent variables, this procedure may have to be repeated separately for each observation to get all missing values replaced, if that is regarded as necessary. A rule of thumb is that if an observation has missing data on more than three dimensions, there is no point in carrying out replacements, and the 15D score is left missing.
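The rule of thumb above can be sketched as a simple check before any replacement is attempted. The Python snippet below is only an illustration and is not part of the 15D scripts: the function names, the constant MAX_REPLACEMENTS and the representation of a case as a dictionary with None for a blank answer are all assumptions made for the example.

```python
from typing import Dict, Optional

# The 15 dimension names required by the scripts.
DIMENSIONS = ["MOVE", "SEE", "HEAR", "BREATH", "SLEEP", "EAT", "SPEECH",
              "EXCRET", "UACT", "MENTAL", "DISCO", "DEPR", "DISTR",
              "VITAL", "SEX"]

# Rule of thumb from the text: do not replace more than three values.
MAX_REPLACEMENTS = 3


def count_missing(case: Dict[str, Optional[int]]) -> int:
    """Count the dimensions with no response (None or key absent)."""
    return sum(1 for dim in DIMENSIONS if case.get(dim) is None)


def worth_replacing(case: Dict[str, Optional[int]]) -> bool:
    """True if the case has at most MAX_REPLACEMENTS missing dimensions."""
    return count_missing(case) <= MAX_REPLACEMENTS


# Hypothetical case: level 1 everywhere except three blank answers.
example = {dim: 1 for dim in DIMENSIONS}
for dim in ("DISTR", "VITAL", "SEX"):
    example[dim] = None

print(count_missing(example), worth_replacing(example))
```

A case for which worth_replacing returns False would simply keep a missing 15D score.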

An example may clarify the procedure. Let us assume that an observation has a missing value on the dimensions of DISTRESS, VITALITY and SEX, and that running the long script therefore does not produce the 15D score. To get the missing value of SEX replaced first, copy the parts SEX REGRESSION, SEX REPLACEMENT and SEX VALUATION 2 from the long script onto a new syntax sheet (File => New => Syntax), set SEX1 as the dependent variable in the REGRESSION and remove DISTR1 and VITAL1 from the list of independent variables (METHOD=ENTER…). After running this new script (Run => All), the missing value of SEX has been replaced.

Then copy the parts VITAL REGRESSION, VITAL REPLACEMENT and VITAL VALUATION 2 from the long script onto a new syntax sheet (File => New => Syntax), set VITAL1 as the dependent variable in the REGRESSION and remove DISTR1 from the list of independent variables (METHOD=ENTER…). After running this new script (Run => All), the missing value of VITALITY has been replaced. Finally, run the original long script again (Run => All); the remaining missing value of DISTRESS is then replaced and the 15D score is calculated.
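The order of steps in this example can be written down as a small planning routine. The Python sketch below is only an illustration of the bookkeeping, not of the SPSS scripts themselves: the function name replacement_plan is hypothetical, the "1" suffix follows the variable names SEX1, VITAL1 and DISTR1 used in the example, and processing the dimensions in the order given is an assumption taken from the example rather than a requirement stated in the text.

```python
from typing import List, Tuple


def replacement_plan(missing: List[str]) -> List[Tuple[str, List[str]]]:
    """For each missing dimension, in the order given, return the
    dependent variable and the independent variables to remove from
    the regression because they are still missing at that step."""
    plan = []
    remaining = list(missing)
    while remaining:
        target = remaining.pop(0)
        # Dimensions not yet replaced cannot serve as predictors.
        dropped = [dim + "1" for dim in remaining]
        plan.append((target + "1", dropped))
    return plan


# The example from the text: SEX first, then VITAL, then DISTR.
plan = replacement_plan(["SEX", "VITAL", "DISTR"])
for dependent, dropped in plan:
    print(dependent, "remove:", dropped)
```

The last step removes no variables, which corresponds to running the original long script unchanged once only one missing value is left.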

With these procedures, the probability of getting the missing level codes (1-5) of the dependent variables (dimensions) right is 55-60 % (across all levels, missing or not, on average 80 % are right). The probability of getting the missing level codes right, or at least within one level of the right one, is about 90 %. As a reference for these approaches you can use:

Sintonen H. The 15D-measure of health-related quality of life. I. Reliability, validity and sensitivity of its health state descriptive system. National Centre for Health Program Evaluation, Working Paper 41, Melbourne 1994 (can be downloaded from http://www.buseco.monash.edu.au/centres/che/pubs/wp41.pdf).