Study design and data collection
The study was a prospective educational study of a VR simulation-based training program for hip fracture surgery, estimating CL with two subjective questionnaires and an objective relative reaction time test. The study flow chart is presented in Fig. 2. Participants were first introduced to the simulator by observing an instructor perform one simulation at the lowest competency level before commencing hands-on simulation training and CL measurement.
Participants and setting
First-year orthopedic residents employed at departments in the Central Denmark Region and the North Denmark Region were invited to participate in this study from November 2016 to March 2019. Only novices who had performed fewer than 10 hip fracture-related surgeries and had no prior experience with the VR simulator were enrolled. The study was performed at a centralized simulation center (Corporate HR, MidtSim, Central Denmark Region, Denmark). Participants were not compensated financially for their participation.
Simulator and metrics
We used a VR surgical simulator that can simulate hip fracture surgical procedures (TraumaVision, Swemac, Sweden). Radiographs are obtained using two foot pedals and presented on the left display (AP view to the left, lateral view to the right), while the operating field is displayed on a separate screen (Fig. 1). The surgical instruments are controlled through a haptic device with force feedback (Geomagic Touch X, 3D Systems, Rock Hill, SC, USA), allowing the trainee to feel the contours of the femoral shaft and the varying resistance of cortical and trabecular bone.
The simulation training program consisted of three competency levels of increasing complexity:
Competency level 0 included only the basic simulation of the placement of a Kirschner wire in a patient model (left-sided, simple hip fracture).
Competency level 1 introduced clinical variability, such as surgery on the left or right side of the patient and different fracture patterns (24 different cases).
Competency level 2 consisted of simulation of the complete dynamic hip screw (DHS) surgical procedure (24 different cases).
In order to progress from one competency level to the next, participants had to achieve 13 consecutively passed procedures at each level (a “learning curve-cumulative sum” approach) [15]. Each failed test was penalized by subtracting 7 from the number of consecutively passed simulations. The pass/fail criteria were defined based on clinical studies and practical considerations: the main criterion, achieving a tip-apex distance < 20 mm, i.e., the distance from the tip of the DHS (or Kirschner wire) to the center of the joint, has been clinically validated [16,17,18]. Furthermore, breach of cortical bone/cartilage into the hip joint, or more than three attempts to place the Kirschner wire within the same procedure, resulted in failing the test. Further details regarding the pass/fail criteria are provided as a supplement (Additional file 1).
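The progression rule above can be sketched as a simple counter. This is an illustrative reading of the protocol, not the simulator's actual software; in particular, flooring the counter at zero after repeated failures is our assumption, as the text does not state what happens when the penalty exceeds the current count.

```python
def progression_counter(results, target=13, penalty=7):
    """Track the pass counter of the learning curve-cumulative sum rule.

    results: chronological pass/fail outcomes (True = passed).
    Each pass adds 1; each failure subtracts the penalty (floored at 0,
    which is our assumption, as the floor is not stated in the text).
    Returns the 1-based attempt at which the target is reached, or None.
    """
    score = 0
    for attempt, passed in enumerate(results, start=1):
        score = score + 1 if passed else max(0, score - penalty)
        if score >= target:
            return attempt
    return None

# 13 straight passes reach the target on attempt 13
print(progression_counter([True] * 13))  # -> 13
# one failure late in the run costs 7 and delays progression
print(progression_counter([True] * 12 + [False] + [True] * 8))  # -> 21
```

Under this reading, a single failure near the target sets a trainee back roughly seven procedures, which matches the intent of demanding a sustained run of passes rather than a lucky streak.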
Outcomes
Performance
Performance was measured as passing or failing the simulated procedure according to the pre-defined pass/fail criteria (Additional file 1). CL outcomes were investigated in relation to the number of failures within the last three and five test attempts in the simulator.
Reaction time test for CL estimation
We used a secondary task (a reaction time test) to estimate CL at predefined simulation tests at each competency level (Fig. 2). The reaction test entailed pressing an arrow key (left or right) on a keyboard as quickly as possible in response to a visual cue, presented at one or two different times during the test procedure (marked with arrows in Fig. 1). We similarly measured the baseline reaction time prior to each simulation training block using five measurements at varying intervals. We used the ratio of in-simulation to baseline reaction time (relative reaction time, RRT) to estimate the individual change in CL.
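The RRT is a simple ratio, sketched below. Summarizing the five baseline measurements by their mean is our assumption; the text does not state which summary statistic was used.

```python
from statistics import mean

def relative_reaction_time(in_sim_rt_ms, baseline_rts_ms):
    """Relative reaction time (RRT, unitless): in-simulation reaction time
    divided by the participant's baseline reaction time (here, the mean of
    the five baseline measurements -- an assumption, not stated in the text).
    """
    return in_sim_rt_ms / mean(baseline_rts_ms)

baseline = [400, 420, 380, 410, 390]  # five baseline measurements (ms)
print(relative_reaction_time(800, baseline))  # -> 2.0
```

An RRT above 1 indicates slower responses under simulation load than at rest, the intended marker of elevated CL.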
Questionnaires for CL estimation
Two validated questionnaires (NASA-TLX and PAAS) were administered after 1, 10, and 25 simulations at each competency level in order to estimate CL [12, 13]. These time points were chosen to avoid administering the questionnaires too often while still sampling a phase of training likely to contain a mix of passed and failed procedures.
The NASA-TLX questionnaire consists of six questions rated on a visual analog scale, resulting in a 0–100 point score for each question. We registered the responses in 5-point intervals. The domains represented by the six questions are mental demand, physical demand, temporal demand, performance, effort, and frustration. We chose the NASA-TLX RAW analysis method to estimate CL in the present study and did not weight the different domain scores (as done in the standard NASA-TLX method) [19].
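The RAW variant reduces scoring to a plain unweighted mean of the six domain ratings, as in this sketch (the example ratings are invented for illustration):

```python
def nasa_tlx_raw(domain_scores):
    """Unweighted (RAW) NASA-TLX score: the plain mean of the six domain
    ratings, each 0-100 (registered in 5-point steps in this study)."""
    assert len(domain_scores) == 6, "NASA-TLX has exactly six domains"
    return sum(domain_scores) / 6

# hypothetical ratings: mental, physical, temporal, performance, effort, frustration
ratings = [70, 25, 55, 40, 60, 50]
print(nasa_tlx_raw(ratings))  # -> 50.0
```

The standard (weighted) NASA-TLX would additionally weight each domain by pairwise-comparison counts; dropping that step is exactly what distinguishes the RAW method.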
The PAAS questionnaire consists of a single question in which participants rated their mental effort in the preceding simulation procedure on a 9-point Likert scale (from very, very low mental effort to very, very high mental effort) [13].
Sample size and statistics
The sample size was one of convenience, recruiting as many residents as possible during the study period. Data were analyzed using SPSS version 25 for MacOS X (SPSS Inc., IBM Corp., Armonk, NY). Linear mixed models were used due to the repeated-measurements design. Models were built on principles for repeated-measurements statistics in medical education research as outlined by Leppink and iteratively optimized to account for relevant factors and potential interactions [20]. For all models, the total procedure number was used as the repeated effect. For the reaction time data, the final models included time of measurement (0/18 s) and either pass/fail in the latest simulated procedure or the number of failures in the latest three or five procedures as fixed factors. For the questionnaire data, the final models included competency level (0/1/2) and either pass/fail in the latest procedure or the number of failures in the latest three or five procedures as fixed factors.
Due to the computerized measurement system, reaction time measurements where the participant did not react before the next reaction time measurement were assigned a value of 99,999 by the system; if the participant did not respond at all, the system instead recorded the time until the next reaction time test. Consequently, the overall distribution of the reaction time measurements was extremely right-skewed. Further, a few recorded values were unrealistically low. However, simply censoring extreme values would eliminate measurement of the very high cognitive load associated with missing the test. We therefore needed to define the realistically possible range of reaction times and assign the highest cutoff value as a “penalty” for not reacting when needed, similar to previous descriptions [11]. First, reaction times were log-transformed and, based on the resulting frequency distribution, values with log(reaction time) > 4 were censored; the remaining data were used to estimate the mean ± 2 standard deviations (SD) of the central data. After back-transformation, these values were used for Winsorizing the reaction times, such that reaction times < 747.1 ms were assigned the value 747.1 ms and reaction times > 4855.8 ms were assigned the value 4855.8 ms. Finally, RRT (unitless) was calculated for each measurement.
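The cleaning pipeline can be sketched as follows. This is an illustrative reconstruction under one key assumption: the text does not state the logarithm base, and we assume base 10 here, since a censoring threshold of log10(reaction time) > 4 (i.e., > 10,000 ms) would catch the 99,999 non-response penalty while the reported bounds of 747.1 and 4855.8 ms remain plausible as mean ± 2 SD on the log10 scale.

```python
import math
from statistics import mean, stdev

def winsorize_rts(rts_ms, censor_log10=4.0):
    """Winsorize reaction times: log10-transform (base 10 is our assumption),
    censor values above the threshold (> 10,000 ms, catching the 99,999
    non-response penalty), estimate mean +/- 2 SD on the retained log values,
    back-transform those bounds, and clip ALL raw times to them -- so penalty
    values are kept, but capped at the upper bound rather than discarded.
    Returns the clipped times and the (lower, upper) bounds in ms.
    """
    logs = [math.log10(rt) for rt in rts_ms if math.log10(rt) <= censor_log10]
    m, s = mean(logs), stdev(logs)
    lo, hi = 10 ** (m - 2 * s), 10 ** (m + 2 * s)
    return [min(max(rt, lo), hi) for rt in rts_ms], (lo, hi)

# synthetic example: five plausible times plus one 99,999 ms penalty value
clipped, (lo, hi) = winsorize_rts([500, 800, 1200, 2000, 3000, 99999])
```

In the study's actual data, this procedure yielded the bounds 747.1 ms and 4855.8 ms; the bounds in the synthetic example above differ because they are estimated from the example data themselves.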
Estimated marginal means of the linear mixed models are reported along with 95% confidence intervals (95% CI). Correlations were explored using standard linear regression. p values < 0.05 were considered statistically significant.
Ethics
Ethical approval was granted by the Ethical Committee of Central Denmark Region (ref: 251/2016). This study complies with the Helsinki Declaration. All participants were informed and consented that their personal data and simulator-generated data were stored and sent back to them and to the head of education at their department. Furthermore, consent was given to store and use anonymized data (personal data, simulator-generated data, and questionnaire data) for educational research purposes.