Analysis of patient satisfaction survey data






We analyzed a dataset provided by an anonymous hospital in the Midwest to identify characteristics that affect two response variables of interest: Topbox Overall score and Advocacy. Topbox Overall score indicates whether a patient rated the hospital a 9 or 10 on overall patient satisfaction. Advocacy indicates whether a patient answered “Yes” when asked if they would recommend the hospital to a close family member or friend. Because both Topbox Overall score and Advocacy are binary variables, we fit a logistic model for each response. The dataset contains 434 observations and 21 potential predictors. Most predictors are on an ordinal scale and contain many missing values. Ordinal predictors were converted to a Likert scale and treated as numeric, which reduced the number of parameters required to fit the logistic models. Missing values were examined to determine the cause of missingness, and most were found to be missing because the question was not applicable. These not-applicable values were set to zero on the Likert scale, which allowed the affected observations to remain in the analysis. In total, 16 observations were removed from the analysis due to missing values, leaving 418 observations for the model building process. We used three variable selection techniques to identify a parsimonious model for each response variable. Forward selection and backward elimination, two common techniques, were used with a penalized AIC. Backward elimination was also performed via the p-value approach, with p-values computed from the chi-squared distribution. Multiple techniques were used to determine whether the results could be replicated, and all three identified the same models.
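The recoding step described above can be illustrated with a minimal sketch. It is not the code used in the report: the response labels (`Never` through `Always`), the `N/A` marker, and the column names are hypothetical, since the survey wording is not given here. Not-applicable answers become 0, and rows that still contain a missing value are dropped.

```python
# Hypothetical response labels; the actual survey wording is not given in the report.
LIKERT = {"Never": 1, "Sometimes": 2, "Usually": 3, "Always": 4}
NOT_APPLICABLE = "N/A"  # assumed marker for "question did not apply"


def encode_response(value):
    """Map one ordinal answer to a number; N/A becomes 0, unknown answers stay None."""
    if value == NOT_APPLICABLE:
        return 0
    return LIKERT.get(value)  # None when the answer is missing or unrecognized


def encode_rows(rows):
    """Encode every row and keep only rows with no remaining missing values."""
    encoded = []
    for row in rows:
        coded = {name: encode_response(answer) for name, answer in row.items()}
        if all(v is not None for v in coded.values()):
            encoded.append(coded)
    return encoded


# Hypothetical column names, for illustration only.
rows = [
    {"nurse_communication": "Always", "er_care_30_min": "N/A"},
    {"nurse_communication": None, "er_care_30_min": "Usually"},
    {"nurse_communication": "Sometimes", "er_care_30_min": "Always"},
]
clean = encode_rows(rows)
print(len(clean), clean[0])
```

Treating each predictor as a single numeric score means one coefficient per predictor instead of one per response level, which is the parameter reduction mentioned above.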
After the reduced models were identified, two methods were used for model checking: Cook’s distances and the Hosmer-Lemeshow test. The Cook’s distances identified no influential points or outliers, and the Hosmer-Lemeshow test indicated that the logistic models were appropriate for both response variables. The variable selection process resulted in three predictors for Topbox Overall score and two predictors for Advocacy. Using these predictors, a full interaction model was fit for each response. None of the interactions were significant, so the additive models were accepted as the final models. For Topbox Overall score, the three predictors identified were clear communication by nurses, receiving care within 30 minutes of arriving in the emergency room, and doctors spending enough time with the patient. For Advocacy, the two predictors identified were doctors listening carefully and nurses spending enough time with the patient. Both models included predictors involving doctors and nurses, but the specific variables differed. Communication and time spent with the patient were important themes in both models. Timeliness of care had a greater impact on Topbox Overall score than on Advocacy.
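As a rough illustration of the Hosmer-Lemeshow check mentioned above, the sketch below computes the test statistic from observed binary outcomes and fitted probabilities. It is an assumption-laden example, not the report's implementation: the function name, the equal-size grouping rule, and the toy data are all hypothetical, and statistical packages differ in how they handle ties when forming groups.

```python
def hosmer_lemeshow_statistic(y, p, groups=10):
    """Return (statistic, degrees of freedom) for the Hosmer-Lemeshow test.

    y: observed 0/1 outcomes; p: fitted probabilities from a logistic model.
    Observations are sorted by fitted probability and split into roughly
    equal-size groups; the statistic sums (O - E)^2 / (n * pbar * (1 - pbar)).
    """
    if len(y) != len(p):
        raise ValueError("y and p must have the same length")
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        if not chunk:
            continue
        n_k = len(chunk)
        observed = sum(yi for _, yi in chunk)   # events seen in the group
        expected = sum(pi for pi, _ in chunk)   # events the model predicts
        mean_p = expected / n_k
        denom = n_k * mean_p * (1 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat, groups - 2


# Toy data (hypothetical): fitted probabilities that match the observed rates.
p = [0.25] * 4 + [0.5] * 4 + [0.75] * 4
y = [1, 0, 0, 0] + [1, 1, 0, 0] + [1, 1, 1, 0]
stat, df = hosmer_lemeshow_statistic(y, p, groups=3)
print(stat, df)
```

The statistic is compared against a chi-squared distribution with the returned degrees of freedom; a large p-value gives no evidence of lack of fit, which is the sense in which the test can indicate that a logistic model is appropriate.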



Statistics, Logistic model, Variable selection, Patient, Survey, Data




Master of Science


Department of Statistics

Major Professor

Karen Keating