Run continuous speech-to-text (STT) on the mic input. If the player says something very similar to the in-game text, except that key words are transposed or omitted so that the sentence means something different, slow down the display speed and lock out fast-forward for a few seconds while highlighting the differences.
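For what it's worth, the "similar but not the same" check could be roughed out with a word-level diff. This is only a sketch: it assumes the STT engine hands back a plain transcript string, the names `misread_diff` and `threshold` are made up for illustration, and the 0.6 similarity cutoff is a guess that would need tuning. It also can't tell whether the changed words actually alter the meaning, only that they differ.

```python
import difflib

PUNCT = ".,!?;:\"'"

def _words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    stripped = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in stripped if w]

def misread_diff(heard: str, shown: str, threshold: float = 0.6):
    """Return word-level differences if `heard` is a near-miss of `shown`.

    Returns None when the transcript matches exactly or is too dissimilar
    to be an attempt at reading the on-screen line.
    `threshold` is a hypothetical tuning knob, not a known-good value.
    """
    shown_words = _words(shown)
    heard_words = _words(heard)
    if heard_words == shown_words:
        return None  # read correctly, nothing to highlight
    matcher = difflib.SequenceMatcher(None, shown_words, heard_words)
    if matcher.ratio() < threshold:
        return None  # probably unrelated speech
    # Keep only the spans that changed: (operation, words shown, words heard)
    return [
        (tag, shown_words[i1:i2], heard_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

diff = misread_diff("Do open the red door", "Do not open the red door.")
print(diff)  # [('delete', ['not'], [])]
```

A non-`None` result would be the cue to slow the text, lock fast-forward, and highlight the listed words.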
If frustrated sounds are detected after the player gives an incorrect puzzle solution, or if the player sounds surprised by information the game already provided, insert a slow-motion replay of when the information was provided earlier, followed by a “comprehension of implications” quiz that must be passed before proceeding.
If the handholding mechanisms are triggered too often, start removing player decision points, eventually leading to Full Auto Mode, which just plays itself from a walkthrough and makes little “wait”/buzzer sounds if the player tries to push any buttons.
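The escalation could be sketched as a counter over a sliding time window. Everything here is an assumption for illustration: the class name `HandholdingTracker`, the ten-minute window, and the rule that each recent trigger removes one decision point. Letting old triggers expire (so the player can earn choices back) is also my own addition, not part of the idea above.

```python
from collections import deque

class HandholdingTracker:
    """Tracks how often handholding fired recently and how many choices remain."""

    def __init__(self, window_seconds: float = 600.0, max_decision_points: int = 3):
        self.window_seconds = window_seconds
        self.max_decision_points = max_decision_points
        self._triggers: deque[float] = deque()  # timestamps of recent triggers

    def record_trigger(self, now: float) -> None:
        """Call whenever a replay, quiz, or slowdown is forced on the player."""
        self._triggers.append(now)

    def decision_points(self, now: float) -> int:
        """Choices the player is still allowed to make at time `now`."""
        cutoff = now - self.window_seconds
        while self._triggers and self._triggers[0] < cutoff:
            self._triggers.popleft()  # forget triggers outside the window
        return max(0, self.max_decision_points - len(self._triggers))

    def full_auto(self, now: float) -> bool:
        """True once no decision points are left: play from the walkthrough."""
        return self.decision_points(now) == 0

tracker = HandholdingTracker(window_seconds=60.0, max_decision_points=3)
for t in (0.0, 10.0, 20.0):
    tracker.record_trigger(t)
print(tracker.full_auto(25.0))  # True
```

In Full Auto Mode the input handler would swallow button presses and play the buzzer instead.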
Not the sort of thing that should be in every game, but I could see it actually working in some contexts, with adequate warning…