Validity assessment is necessary for an SG, as for any new educational tool [14, 15]. In this study, we used the five domains of validity evidence described by Messick et al. [14] (content, response process, internal structure, relations to other variables and consequences). The main finding is that neither the gameplay scores nor the playing time of LabForGames Warning differentiated between the nurses’ skill levels. However, other domains of validity evidence for this SG were demonstrated.
First, content validity evidence is the most frequently assessed domain in the educational literature [14, 28–31]. LabForGames Warning (educational content and objectives, the different branched steps, scoring) was produced by clinical, pedagogical and interprofessional experts in conformity with the French nurse training curriculum [5] and a literature review. The relevance of the educational content was supported by the experts (group E), who expressed a positive attitude toward the medical algorithm and the nursing decision-making process, confirming content legitimacy.
A second domain of validity evidence was the response process, which was assessed through rigorous quality and security control during the study. Moreover, both experts and novices were asked to assess the tool’s apparent similarity to reality and its usefulness for educational purposes. Evaluation by experts is especially important for collecting this type of validity evidence. In our study, the experts came from the units in which the game’s cases took place (the orthopaedic and psychiatry departments, but not the nursing home). Moreover, nursing students rotate through many different units (surgery, medicine, etc.) before graduating. Realism was considered for the gameplay as a whole as well as for its different parts (i.e. nursing care, clinical examination, care records and graphics). The between-group difference found for care record realism may be explained by the fact that electronic care records are not available in all hospital units, which complicates extrapolation. Moreover, the SG’s ability to improve skills and its impact on professional outcomes were evaluated positively, especially by students and recently graduated nurses, confirming our initial educational choice to target this population.
The training process, self-assessed skill improvement and the impact on professional outcomes were all considered satisfactory. Moreover, after training with the SG, all of the participants felt that their skills had improved in the different steps of the nurse clinical reasoning process, with a global score of 52/70. Teaching clinical reasoning with the aid of an SG therefore appears valuable and relevant for trainees. The virtual cases constitute experiential learning as described by Kolb [32] and explore the four domains of the clinical reasoning process [33]. The learning of clinical reasoning is difficult to assess [34], and although self-assessment reflects only a subjective perception, it does provide important information. The tool we used was based on the clinical reasoning process described by Levett-Jones and used by Koivisto [23, 24]. Other tools have also been published [35, 36]. Despite their uncertain validity, these tools aim to assess the various steps of clinical reasoning. However, most studies have analysed only the outcome of the overall reasoning process (i.e. diagnosis and treatment) rather than all of the steps of clinical reasoning [8, 37–40].
Third, with regard to validity evidence for internal structure, factor analysis proved a useful tool for identifying behaviours specific to each group by assessing the relations between the parts of the clinical reasoning process and errors. Clinical reasoning is a complex cognitive process [41]. According to dual-process theory, healthcare providers use two cognitive systems. System 1 is heuristic reasoning based on illness pattern recognition (matching the actual configuration of signs with previously encountered, equivalent situations), allowing intuitive mental shortcuts that reduce the cognitive load of decision making. System 2 is an analytical reasoning model that integrates all available information and requires considerable effort. Simply stated, system 1 is more readily used by experts owing to their clinical experience, whereas system 2 is used more often by novices. Interestingly, in our study, the organisation of the links between the parts of the clinical reasoning process and errors was found to be an indicator of expertise. In group E, no relation between clinical reasoning, communication and situation awareness was observed, suggesting that encapsulation of clinical reasoning occurs with experience and is congruent with more frequent use of system 1 (intuitive) processing [41]. Each step is dependent on the following one, as a single “module” of clinical reasoning [42]. Moreover, the independence of the clinical reasoning items of the self-assessment suggests that the modularity of clinical reasoning is embedded in a deeper structure that is inaccessible to awareness and has no explicit link. By contrast, in group S, the first part of reasoning was linked to both situational awareness and communication errors, whereas the implementation part was linked to communication errors only. In group R, only communication was found to be related to reasoning, positively with the first part and negatively with the implementation part, which suggests emerging expertise. However, we did not assess situation awareness itself using validated methods [43]; rather, we classified errors (negative points) occurring in the diagnostic part of the scenario, together with those related to decisions regarding the next monitoring interval, as “situation awareness” errors and investigated their relations with the other variables through factor analysis. Moreover, we did not record the exact time at which each action was performed during each step. Furthermore, we do not know precisely when deterioration was identified, since situational awareness is a progressive process with interconnecting steps. Only the overall playing time of each case was recorded, with no significant difference between groups. It was therefore not possible to isolate the different steps of the clinical reasoning process during the game or to determine how participants performed individually in each case. It could be interesting to introduce markers for each step of the clinical reasoning process in a future version of the game.
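Purely as an illustration of this type of analysis, the minimal Python sketch below shows how per-participant reasoning subscores and error counts could be submitted to a factor analysis for one group and the loadings inspected; the variable names are invented and the data are simulated, so this is not the analysis pipeline actually used in the study.

```python
# Minimal sketch (hypothetical variables, simulated data): factor analysis of
# clinical-reasoning subscores and error counts within one expertise group.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Illustrative columns: first reasoning part, implementation part,
# communication errors, situation-awareness errors.
variables = ["reasoning_part1", "implementation", "comm_errors", "sa_errors"]

def group_loadings(scores: np.ndarray, n_factors: int = 2) -> np.ndarray:
    """Standardise the subscores and return the factor loadings."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    fa = FactorAnalysis(n_components=n_factors, random_state=0)
    fa.fit(z)
    return fa.components_  # shape: (n_factors, n_variables)

# Simulated group of 30 participants x 4 variables, standing in for one group.
group_S = rng.normal(size=(30, len(variables)))
print(dict(zip(variables, group_loadings(group_S)[0].round(2))))
```

Variables that load on the same factor within a group would correspond to the "links" between reasoning parts and error types discussed above.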
The fourth domain of validity evidence studied was the relation to other variables, assessed by comparing the groups’ scores and playing times. In previous studies, the validity of the scoring system was either not assessed [24, 44–46] or was assessed only by means of “static” multiple-choice questions that were obvious to the participant because they represented the logical steps of a surgical procedure [28–30]. In contrast, our game design was branched and included algorithms with many possibilities and different interactions between the patient and the physician, as the various ramifications in the storyboard aimed to reproduce the most likely clinical situations. Our SG combined the assessment of several non-technical skills, including situational awareness and communication. Our results therefore highlight the difficulty of establishing a scoring system in the presence of several interactive and complex problems. In a similar SG in critical paediatric emergency care, Gerard et al. demonstrated validity evidence for Pediatric Sim Game scores, with attendings scoring higher than residents, who in turn scored higher than medical students [20]. A strong positive correlation was found between game scores and written knowledge test scores. However, as with our SG (Table 4), game scores were low across all groups (68/100 for attendings), which confirms the difficulty of constructing a score.
Additionally, the score captures only a limited part of the tool and does not reflect all of its pedagogical impact and utility. The assessment of construct validity is essential if the SG is to be used in a summative educational process. Indeed, if the score’s construct validity is not demonstrated, the game cannot be used to evaluate student learning at the end of an instructional unit. To our knowledge, no SG like this one has been used for summative assessment. However, this version of our SG can be used for training within an educational programme, since some domains of validity evidence could be demonstrated. Certain teams have already included an SG in their training programme [24, 44, 47]. The nursing school participating in this project recently introduced the SG into its student curriculum because the detailed score breakdown for each trainee, available on an e-platform, can enhance the instructors’ debriefing. More studies are necessary to define the place of this tool in professional healthcare education.
Since LabForGames Warning scores could not differentiate between levels of expertise, one might wonder how the game could be improved. The instructors tried to align the scoring system with case complexity, for which a certain level of proficiency was expected. In the post-operative haemorrhage case, for example, analysis of the detailed subscores showed that the majority (> 85%) of essential nursing care actions were performed by each group. Participant actions were stereotyped and limited. Allocating points differently and/or increasing the number of tasks available to the participant, including some unnecessary (or even deceptive) actions, might be more discriminating. Indeed, one limitation of the game itself is that a participant who performs all available actions may earn many positive points without any actual clinical reasoning. Recording the response time at each point could also be useful.
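As a purely hypothetical illustration of this suggestion (the action names and point values below are invented, not taken from LabForGames Warning), a scoring scheme in which distractor actions carry penalties could be sketched as follows, so that performing every available action no longer maximises the score:

```python
# Hypothetical sketch of a more discriminating scoring scheme: distractor
# actions are penalised, so indiscriminate clicking lowers the total score.
ESSENTIAL = {"check_vitals": 10, "call_physician": 10, "apply_pressure": 10}
DISTRACTOR = {"order_xray": -5, "give_oral_fluids": -5}  # unnecessary/deceptive

def score_case(actions_performed: list[str]) -> int:
    """Sum positive points for essential actions and penalties for distractors."""
    total = 0
    for action in actions_performed:
        total += ESSENTIAL.get(action, 0) + DISTRACTOR.get(action, 0)
    return total

# Performing everything earns less than targeted clinical reasoning.
print(score_case(list(ESSENTIAL) + list(DISTRACTOR)))  # 20
print(score_case(list(ESSENTIAL)))                      # 30
```

Weighting the distractor penalties against the essential-action points is the design lever that would make the score sensitive to reasoning rather than to the number of actions performed.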
Playing time is a surrogate marker of the time it takes to care for the patient, collect data, make a decision and call for help. Although one might expect playing time to be longer for novices than for experts, this study found no difference in playing time between the groups. The results of previous studies on this subject are mixed and do not consistently show a direct correlation between playing time and expertise [28, 48]. The absence of a difference in playing time could be explained by the fact that the time devoted to each participant action and communication is limited and predefined by the game itself.
With regard to the consequences of using the SG (i.e. the last domain of validity evidence), an a posteriori analysis found no significant difference in examination results between the students who had played the SG and the control group. However, no definitive conclusion can be drawn since the groups studied were not randomised.
One limitation of the study was that several items in our set of validity measures were based on participant perception, although objective measures would be more robust. Perception, however, is the more frequently studied approach in the literature. For example, Graafland et al. and Sugand et al. assessed the content and face validity of their SGs with self-report questionnaires [28–31].
Another limitation was that mean scores were low (< 50/100). Some of the negative points attached to the communication part were related to the use of the SBAR tool, as even our expert nurses had received no previous training in its use; a French version of the tool had only recently become available [5]. However, even when scores were recalculated after excluding the SBAR-related items, no significant differences were observed between groups.
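Purely as an illustration (the item names and point values are invented), the recalculation described above amounts to re-summing each participant’s score after dropping the SBAR-related items:

```python
# Hypothetical example of recomputing a participant's score without SBAR items.
scores = {"detect_deterioration": 8, "call_physician": 6,
          "sbar_situation": -3, "sbar_background": -2}  # one participant's items

def total_without_sbar(items: dict[str, int]) -> int:
    """Re-sum the score after excluding SBAR-related items."""
    return sum(v for k, v in items.items() if not k.startswith("sbar_"))

print(sum(scores.values()))        # original total: 9
print(total_without_sbar(scores))  # recalculated total: 14
```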
Writing the gameplay is a difficult task, since each clinical situation has several possible outcome branches and poor writing may lead to low scores and poor discrimination. However, the pedagogical team was composed of medical experts (anaesthesiologists) and nursing experts who were also experts in pedagogy. Interestingly, all of the teachers were also experienced instructors of high-fidelity simulation sessions.