Targeting Design-Loop Adaptivity

Stephen E. Fancsali, Hao Li, Michael Sandbothe, Steven Ritter
Carnegie Learning, Inc.
{sfancsali, hli, msandbothe, sritter}@carnegielearning.com

ABSTRACT
Recent work describes methods for systematic, data-driven improvement to instructional content and calls for diverse teams of learning engineers to implement and evaluate such improvements. Focusing on an approach called “design-loop adaptivity,” we consider the problem of how developers might use data to target or prioritize particular instructional content for improvement processes when faced with large portfolios of content and limited engineering resources to implement improvements. To do so, we consider two data-driven metrics that may capture different facets of how instructional content is “working.” The first is a measure of the extent to which learners struggle to master target skills, and the second is a metric based on the difference in prediction performance between deep learning and more “traditional” approaches to knowledge tracing. This second metric may point learning engineers to workspaces that are, effectively, “too easy.” We illustrate aspects of the diversity of learning content and variability in learner performance often represented by large educational datasets. We suggest that “monolithic” treatment of such datasets in prediction tasks and other research endeavors may be missing out on important opportunities to drive improved learning within target systems.

Keywords
Design-loop adaptivity, deep knowledge tracing, Bayesian knowledge tracing, mastery learning, learning engineering.
1. INTRODUCTION
Recent work calls on researchers and developers, including teams of learning engineers [14, 26], to focus on “explanatory” models of learners [25] and “design-loop adaptivity” processes [1, 15] to practically improve learning systems. While researchers describe specific examples of how explanatory learner models and design-loop adaptivity can be used to drive improvements to instruction, less (if any) attention has been paid in the literature to the practical problem of how content developers and learning engineers target and prioritize content for improvement.

We focus on cases in which a target system has a large portfolio of content, elements of which must be prioritized and targeted for improvement given finite learning engineering and software development resources. We present a case study using a data set that is among the largest considered in the literature on knowledge tracing and related methods [9, 18, 22], comprised of middle school and high school student work over an academic year on several hundred mathematics topics, each generally completed by thousands of students, generating several hundred million data points tracking student actions. We motivate, describe, and illustrate two approaches to targeting content for improvement within this portfolio, focusing primarily on what Aleven et al. [1] call “design-loop adaptation to student knowledge,” relying on large-scale data to find similarities amongst learners that we might leverage to redesign instructional content for better learning.

One targeting method is based on a measure of the extent to which learners tend to struggle with particular pieces of content, and we contrast it with an approach based on the relative prediction performance of deep learning models (i.e., Deep Knowledge Tracing; DKT [18, 22]) compared to traditional Bayesian Knowledge Tracing (BKT; [9]) models.

The first method targets content students struggle to learn, relying on measures of knowledge component (KC [19]; or skill) mastery that are internal to the target intelligent tutoring system (ITS). In contrast, the second method is roughly motivated by the idea that identifying content in which there is a large difference in performance between deep learning and traditional Bayesian approaches may suggest areas in which deep learning can leverage statistical regularities in students’ performance that could point to improvements in the KC models that are used to drive adaptation with BKT. Such performance differences may suggest a particular focus area for KC model improvements. Relative DKT performance versus BKT performance also provides an instance of a metric that is perhaps less dependent on how the run-time ITS has “set the bar” for success in terms of KC mastery.

In exploring these two approaches, we illustrate the variability in learning content and experiences within widely deployed systems like Carnegie Learning’s MATHia (formerly Cognitive Tutor) [23]. While different facets of variation may at times call for different approaches to content improvement (e.g., variation in student motivation could call for redesigns that discourage “gaming the system” [3]), our present work explores how to guide learning engineers’ “attention” to particular pieces of content to then consider specific improvements via processes for design-loop adaptivity [1, 15].

Original contributions of this work are two-fold: (1) We describe a novel problem in the literature related to how to target instructional content improvement or design-loop adaptivity and explore two targeting approaches, and (2) we shed light on opportunities in treating large-scale educational datasets that may be missed by treating such datasets as “monolithic” targets for data-intensive approaches. Treating datasets in a “monolithic” way, though not a universal practice (e.g., [4-5, 10]), may inhibit practical progress in learning engineering.

One important facet of this work (and related work on off-task behavior [4]) is its appreciation of the extent to which there is variability in how learning occurs across different (types of) content within adaptive instructional systems. Appreciating and surveying this variability is vital to ascertaining where, within large portfolios of content, to target design-loop adaptivity efforts and related data-driven instructional improvement efforts.
2. DESIGN-LOOP ADAPTIVITY
2.1 Background
A recent survey of adaptive instructional technologies [1] describes three categories along which learners’ experience can be varied, including “step-loop adaptivity,” “task-loop adaptivity,” and “design-loop adaptivity.” Step-loop adaptivity and task-loop adaptivity roughly correspond to the “inner” and “outer” loop adaptive functionality in ITSs distinguished by VanLehn (e.g., [28]), respectively. We briefly describe step-loop and task-loop adaptivity before considering design-loop adaptivity.

Step-loop or inner-loop adaptivity enables an adaptive instructional system or ITS to provide support to learners within a particular learning task based on their performance (e.g., providing context-sensitive hints or just-in-time feedback within a math problem based on learner responses). Task-loop or outer-loop adaptivity enables an instructional system to choose the next appropriate task for a learner based on a model of student learning and evolving estimates of a learner’s mastery of underlying competencies, skills, or KCs [19] based on the learner’s performance. Extensive educational data mining (EDM) literature considers, for example, variants of and data-driven parameter optimizations for BKT (e.g., [18]), which can be used to select tasks for learners as their mastery of KCs evolves.

In their recent survey, Aleven and colleagues describe design-loop adaptivity as involving

    data-driven decisions made by course designers before and between iterations of system design, in which a… system is updated based on data about student learning, specifically, data collected with the same system… [1].

They go on to describe goals toward which design-loop adaptations might be made, including adaptations to student knowledge, affect and motivation, student strategies and errors, and self-regulated learning, providing examples of each.

Canonical examples of design-loop adaptivity or adaptation to student knowledge, the goal of our present targeting and prioritization endeavor, generally involve situations in which content within tutoring systems or online courses is improved by refining the fine-grained KC models that drive the adaptive experience of learners, using a combination of data and human expertise [17, 20, 27].

Design-loop adaptivity for motivation and affect might drive content or system design and redesign to discourage off-task behavior [4] and “gaming the system” [3], wherein students attempt to make progress in a system by taking advantage of system features like hints, rather than making genuine attempts to master content. Aleven et al. [1] suggest that an approach to modeling gaming the system behavior based on a large-scale survey of the extent to which gaming the system [3] manifests across topics (what we will refer to as “workspaces”) in an intelligent tutoring system like MATHia provides a foundation for future design-loop adaptivity investigations.

In addition to considering one of the largest-scale applications of DKT (and BKT) modeling in the literature, we illuminate avenues for research at the intersection of educational data science and learning engineering at scale in a widely-deployed adaptive learning platform for K-12 mathematics. We seek to amplify extant calls for a more nuanced approach to work on performance prediction [15, 25] while illustrating solutions to practical problems in learning engineering and product improvement.
2.2 A Process for Design-Loop Adaptivity
Huang et al. [15] describe a systematic approach to design-loop adaptivity, or data-driven instructional redesign and improvement. They suggest three general goals for such redesign efforts. For a particular piece of content in an ITS or similar adaptive instructional system with a KC model, the goals are: (1) refine the KC model for the target content, (2) redesign the content, and (3) optimize individualized learning within the content. Existing EDM methods and novel analyses are then described to achieve each of these goals, targeting an “Algebraic Expressions” unit of content within the Mathtutor ITS [2]. For example, KC models can be refined using data-driven, computationally intensive methods like Learning Factors Analysis (LFA; [8]) or a simpler approximation of such an approach that uses regression techniques, called “difficulty factor effect analysis” by Huang et al. [15]. Human expertise also plays an important role in such refinements, including in setting up data-driven analyses to produce meaningful results, interpreting these results for inclusion in potential task redesigns, and often in providing suggested refinements for target tasks.

Huang et al. [15] demonstrate that redesigned content improves learning as measured by pre-tests and post-tests. Broadly, these goals align with on-going, data-driven content improvement efforts pursued by learning engineers working with MATHia. Nevertheless, the process of design-loop adaptivity generally requires extensive human and computational resources to be carried out in ways that will drive improved instructional effectiveness. The present work seeks to illustrate how EDM techniques might help improve targeting of this process.

3. MATHia
3.1 Learning Platform
Carnegie Learning’s MATHia [23] is an ITS used by hundreds of thousands of learners each year, mostly in middle and high school classrooms as a part of a blended math curriculum that combines collaborative work guided by instructors and Carnegie Learning’s MATHbook worktexts (60% of instructional time in recommended implementations) with individual work in MATHia (40% of instructional time). Nevertheless, usage of MATHia, contexts in which it is used, and other implementation details vary across a diverse, nationwide user-base.

Grade levels of content in MATHia (e.g., Grade 7, Algebra I) are organized into a series of “modules,” each of which is comprised of a series of “units.” Units are composed of a series of “workspaces.” Workspaces represent the underlying unit of learner progress to mastery in MATHia. Each workspace presents a set of problems associated with a set of KCs; student progress within the system is determined by students’ achievement of mastery of all of the KCs associated with a particular workspace, estimated by MATHia using BKT (see §3.2). Learning experiences vary substantially between workspaces with respect to design patterns, content areas, types of practice and instruction provided, (quality of) KC models intended to support practice of such content (e.g., some the result of years of iterative refinements, others introduced more recently), BKT parameters, and other parameters that drive task selection and mastery judgment.

Consider the problem-solving task illustrated in Figures 1 and 2. Figure 1 illustrates the workspace “Modeling the Constant of Proportionality.” In this workspace, students are provided with a word problem and several associated questions (left pane; Figure 1). On the right-hand side of Figure 1, tools are presented to solve the problem’s “steps.” There is a worksheet or table in which they can provide units of measurement, responses to questions, and fields in which to write expressions to model the problem’s scenario. After they have completed entries in the worksheet, students work with a graphing tool. Each problem-step in the ITS can provide context-sensitive hints upon request as well as just-in-time feedback that tracks errors that students often make. Most problem-steps are mapped to KCs, for which MATHia provides an evolving mastery estimate to adapt problem selection to the individual student’s needs (see §3.2).

Figure 1. Problem-solving screenshot from a MATHia workspace called “Modeling the Constant of Proportionality.”

Contrast the learning experience of the problem in Figure 1 with that of Figure 2. “Modeling the Constant of Proportionality” (Figure 1) involves substantive reading, modeling the problem scenario via algebraic expressions, working through concrete instances of these expressions, and using a graphing tool. Figure 2 illustrates problem-solving in a menu-based equation “solver” workspace, “Solving with the Distributive Property Over Multiplication.” Here the student is tasked with solving for x in the equation 65 = 10(x + 6). There is little reading and no context provided for the equation, but hints and just-in-time feedback are available. Learners’ progress toward mastery is tracked for a different set of KCs. The menu-based solver constrains possible student actions at various points in the equation-solving process compared to the typed-in input that students provide in the worksheet in Figure 1. Far from an exhaustive list, we seek to illustrate a few from among substantial differences in types of content provided, design patterns, interaction modalities, underlying KC models, and tools available, even within the relatively constrained domain of math, any of which may have important impacts on inferences that might be drawn from data or the ability of different methods to predict performance and learning within such content.

Figure 2. Screenshot from the MATHia workspace “Solving with the Distributive Property Over Multiplication.”

While any of the features in these examples might reasonably be refined as a part of the design-loop adaptivity or content improvement process, we leave to future work the data-driven targeting of specific improvements within a workspace. Here, we consider how to target specific workspaces for design-loop adaptivity improvements.
3.2 Knowledge Tracing & Mastery Learning
BKT [9] posits a binary (i.e., “mastered” or “unmastered”) knowledge state for each independently modeled KC and can be formalized as a four-parameter hidden Markov model. One parameter represents the probability that a learner has already mastered a KC before their first opportunity to practice it. A second parameter represents the probability that a learner transitions from the unmastered to the mastered state at any particular KC practice opportunity. Two parameters link the knowledge state to observable outcomes at any KC practice opportunity: the probability that a student is in the unmastered state and responds correctly (“guessing”) and the probability that a student is in the mastered state and answers incorrectly (“slipping”). Extensive EDM literature has explored the data-driven fitting of BKT parameters as well as individualized (e.g., [30]) and more sophisticated variants of this approach (e.g., [18]).

Based on parameter settings and performance data collected as a student practices each KC, the system can use BKT to infer and update estimates of the probability that a student is in the “mastered” state for any particular KC. Typically, systems set a threshold for mastery (often 0.95, as in MATHia); if the system’s estimate of the probability that a student has mastered a particular KC is above the threshold, then the system considers the KC mastered for that student.
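To make the four-parameter formulation concrete, the following sketch (our illustration, not MATHia’s production code; the parameter values are placeholders rather than fitted estimates) applies the standard BKT update after each observed first attempt and checks the mastery estimate against a 0.95 threshold.

```python
from dataclasses import dataclass

@dataclass
class BKTParams:
    p_init: float   # P(L0): already mastered before the first practice opportunity
    p_learn: float  # P(T): transition unmastered -> mastered at each opportunity
    p_guess: float  # P(G): correct response while unmastered
    p_slip: float   # P(S): incorrect response while mastered

def update_mastery(p_mastered: float, correct: bool, p: BKTParams) -> float:
    """One BKT step: condition on the observed first attempt, then apply learning."""
    if correct:
        evidence = p_mastered * (1 - p.p_slip) + (1 - p_mastered) * p.p_guess
        posterior = p_mastered * (1 - p.p_slip) / evidence
    else:
        evidence = p_mastered * p.p_slip + (1 - p_mastered) * (1 - p.p_guess)
        posterior = p_mastered * p.p_slip / evidence
    return posterior + (1 - posterior) * p.p_learn

def trace_kc(attempts, p: BKTParams, mastery_threshold: float = 0.95):
    """Return the evolving mastery estimates and whether the threshold was reached."""
    p_mastered = p.p_init
    history = []
    for correct in attempts:  # attempts: booleans over first attempts at one KC
        p_mastered = update_mastery(p_mastered, correct, p)
        history.append(p_mastered)
    return history, p_mastered >= mastery_threshold

# Illustrative (not fitted) parameter values:
params = BKTParams(p_init=0.25, p_learn=0.2, p_guess=0.2, p_slip=0.1)
history, mastered = trace_kc([True, False, True, True, True, True], params)
```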
Relying on evolving estimates of learner KC mastery, instructional systems can use knowledge tracing frameworks like BKT to drive “task-loop” (or “outer loop”) adaptivity [1, 28] and mastery learning [7, 24]. After a student completes a problem (or task, like the problems illustrated in Figures 1-2), the system can select the next problem based on KCs that the student has yet to master. In this way, systems can adapt to the student’s evolving mastery of KCs, providing (ideally) just enough practice for students to master KCs and avoiding cases in which the system provides too little or too much practice.

Implementing self-paced mastery learning [7, 24], MATHia provides practice to a student until they have either mastered all KCs associated with a particular workspace or they have reached the maximum number of problems that designers have specified for that workspace. Once the student masters all of the KCs in a particular workspace (or reaches the max number of problems), they are moved on to the next workspace in an assigned content sequence. Teachers are alerted when students reach the max number of problems in a workspace without reaching mastery. Setting a max number of problems ensures that students do not endlessly struggle unproductively within a piece of content [11].

3.3 Data
We consider data from 252,036 learners who used MATHia during the 2018-19 academic year and completed at least one of 308 workspaces that track KC mastery across math content for Grades 6-8, Algebra I, Algebra II, and Geometry. These data account for approximately 3.8 million workspace completions. Models are learned over subsets of 267,419,999 student actions (i.e., first attempts, including hint requests) at problem-steps mapped to KCs. Over the 308 workspaces, MATHia tracks 2,152 KCs. Table 1 provides summary statistics.

Table 1. Summary statistics for 308 MATHia workspaces in 2018-2019; “KCs” = # KCs tracked; “Comps.” = # student-workspace-completions; “Actions” = sum, across all students completing the workspace, of the count of first attempts (including hint requests) at problem-steps within workspace problems.

          Min.     Q1        Med.      Q3         Max.
KCs       2        5         6         9          15
Comps.    167      4275      9414      18801      51097
Actions   5530     197757    489159    1278325    7191034

When working with large, complex datasets, it is essential to focus learning engineering efforts on the portions of the system for which improvements can be most impactful. Rather than consider such a broad dataset as a single monolithic target, especially for the performance prediction modeling in §4.2, we learn models for each workspace within the dataset; input data are sequences of correctness labels for learner actions (e.g., binary correct or incorrect, where incorrect includes both errors and hint requests) and labels for the KCs mapped to each action.
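As a concrete illustration of this per-workspace, non-monolithic setup, the sketch below is ours (column names such as student_id, workspace, kc, timestamp, and correct are assumptions for illustration, not MATHia’s actual schema); it groups a first-attempt action log into the per-student sequences of (KC, correctness) pairs that knowledge tracing models consume.

```python
import pandas as pd

# Assumed log format: one row per first attempt at a KC-mapped problem-step.
actions = pd.read_csv("mathia_first_attempts.csv")  # hypothetical file

def workspace_sequences(actions: pd.DataFrame, workspace: str):
    """Build per-student (kc, correct) sequences for one workspace.

    Hint requests and errors are both coded incorrect (correct == 0),
    matching the labeling described in Section 3.3.
    """
    ws = actions[actions["workspace"] == workspace].sort_values(
        ["student_id", "timestamp"]
    )
    return {
        student: list(zip(group["kc"], group["correct"].astype(int)))
        for student, group in ws.groupby("student_id")
    }

# One model per workspace rather than a single monolithic model:
sequences_by_workspace = {
    ws: workspace_sequences(actions, ws) for ws in actions["workspace"].unique()
}
```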
4. METRICS FOR TARGETING IMPROVEMENTS
As illustrated in Figures 1 and 2, workspace-to-workspace variability in learning experiences is substantial. Types of practice vary (e.g., equation solving, graphing, etc.), and developers make a plethora of design choices in creating content. Some workspaces require more reading; KC models vary in complexity, and some have been iteratively refined over the course of nearly two decades while others are newly deployed in a given year. Given this variation and the nature of grade-level content standards, there is also variability in the extent to which learners find particular content difficult.

Learner difficulties manifest at the problem-step level in the form of problem-solving errors and hint requests, and at the workspace level in at least two ways: (1) some learners require a greater number of problems to achieve mastery of all KCs, and (2) some learners reach the maximum number of problems set by designers without having achieved mastery of all KCs. These latter students are moved along within their curriculum sequence without mastery. Teachers are alerted of this failure to reach mastery via reporting analytics available to them as well as in the LiveLab teacher companion app to MATHia. Some students fail to reach mastery in a workspace because of genuine difficulty with presented math content, but relatively frequent instances of such failure to reach mastery often indicate that content improvements (i.e., design-loop adaptivity) are called for to enhance experiences for learners.

Prior research considers MATHia’s workspace level as a unit of analysis. Researchers have focused on associations between characteristics of Cognitive Tutor “lessons” (MATHia’s workspaces) and learners’ affective states like confusion and frustration [10] as well as the extent to which students go off-task [4] and game the system [5]. In what follows, we adopt an approach similar in spirit to this literature by considering a large corpus of MATHia data as broken down into workspaces rather than treating the entire dataset in a monolithic fashion.

The first metric we consider helps identify content that is instructionally ineffective in ways that manifest as difficulty for learners to successfully complete the content. In considering the second metric, we explore one example where the metric may be providing some insights into places where content is not “difficult” (i.e., measures of difficulty do not “raise flags” about improvement needs) but where design-loop adaptivity improvements might drastically improve student learning.

4.1 Proportion of Failures to Reach Mastery
The first design-loop adaptivity targeting metric we consider is the proportion of learners who fail to reach mastery of at least one of the KCs associated with a workspace before reaching the maximum number of problems set by content designers. Figure 3 provides a histogram showing the overall distribution of this proportion across workspaces. The median workspace has 4.3% of students fail to reach mastery of all its KCs (minimum = 0%; Q1 = 0.7%; Q3 = 12.1%; maximum = 77.7%).

Figure 3. Histogram illustrating the distribution of the proportion of students failing to reach mastery of all KCs associated with 308 workspaces in the 2018-19 academic year.
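A minimal sketch of this metric (ours, not the authors’ production analysis; column names such as workspace, student_id, and mastered_all_kcs are assumptions) computes the per-workspace proportion of students who hit the problem cap without mastering every KC and summarizes its distribution across workspaces.

```python
import pandas as pd

# Assumed format: one row per student-workspace completion, with a flag that is
# True when the student mastered all KCs before reaching the max-problem cap.
completions = pd.read_csv("workspace_completions.csv")  # hypothetical file

failure_rate = (
    completions.assign(failed=lambda df: ~df["mastered_all_kcs"])
    .groupby("workspace")["failed"]
    .mean()
    .sort_values(ascending=False)  # workspaces most in need of attention first
)

# Distribution across workspaces (compare with the reported median of ~4.3%).
print(failure_rate.describe(percentiles=[0.25, 0.5, 0.75]))
print(failure_rate.head(20))  # candidate targets for design-loop adaptivity
```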
Fancsali et al. [11] argue that students’ failure to achieve mastery at a level of aggregation like that of a workspace is an important outcome for predictive modeling, mostly overlooked in the literature on so-called “wheel spinning” (e.g., [6]), which tends to develop models to predict whether students will master particular KCs in a tutoring system, ignoring other elements of how instructional content is presented. Fancsali et al. argue that, given the clustering of KCs within problems, the clustering of problems within workspaces, and the fact that workspaces are the unit at which learners make progress in ITSs like MATHia, reporting outcomes like the count and percentage of KCs that students fail to master (à la Beck and Gong [6]) is of dubious practical value.

Since design-loop adaptivity improvements are likely to often involve redesign of instructional content, we similarly contend that measures closely aligned to instructional delivery are likely to be helpful in targeting this process. Large proportions of students failing to master instructional content are likely to be important in determining what learning content to improve with limited resources. This metric serves as a foil to a second approach.

4.2 DKT vs. BKT Prediction Performance
Extensive recent literature (e.g., [12, 18, 22]) considers deep learning approaches to the problem of predicting student performance at fine-grained opportunities to demonstrate mastery of KCs in learning systems like ASSISTments [13]. DKT [22] has been compared (e.g., [12, 18]) to BKT and logistic regression approaches to the same type of prediction task [21]. Early work demonstrated that DKT generally had superior prediction performance compared to BKT [22], but subsequent literature also suggests that variations of BKT (e.g., modeling “forgetting”) and logistic regression approaches can bridge some, if not most, of the gap in prediction performance (e.g., [18, 29]).

Nevertheless, we seek to better understand the extent to which DKT out-performs BKT when considered workspace-by-workspace across a large dataset from MATHia, which presents a wide variety of learning experiences. We find that, for a variety of workspaces, classic BKT’s performance is often comparable to DKT even without the accoutrements added in the work of Khajah et al. [18]. Further, in keeping with our primary concern in the present work, we explore the extent to which observed differences in performance between the two approaches, especially examples of DKT’s far superior prediction performance, might serve as a metric for targeting improvement work for MATHia workspaces, possibly indicating an especially flawed KC model.
4.2.1 Modeling Approach
We rely on the Khajah et al. [18] implementation of DKT with long short-term memory (LSTM) recurrent units (https://github.com/mmkhajah/dkt). We use Yudelson’s hmm-scalable (https://github.com/myudelson/hmm-scalable) implementation of classic BKT parameter fitting using expectation maximization [30]. We learn DKT and BKT models for each of the 308 workspaces, splitting the data for each workspace into training and test sets with an 80%-20% student-level split, and we calculate the AUC (area under the receiver-operating characteristic curve) on the test set following methods in Khajah et al. [18]. BKT and DKT models are trained and tested on the same datasets. AUC is a measure of the extent to which a model can “discriminate” between, or predict, students’ correct and incorrect responses in the held-out test set. An AUC value of 0.5 indicates “chance” ability to discriminate between two classes; a value of 1.0 indicates perfect discrimination.
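The evaluation protocol just described can be sketched as follows (our illustration, not the authors’ code; it assumes per-attempt labels and predicted probabilities produced by whichever knowledge tracing model is being evaluated, and the helper fit_and_predict named in the comments is a hypothetical stand-in).

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def split_students(student_ids, seed=0):
    """80%/20% student-level split: each student's whole sequence lands on one side."""
    train_ids, test_ids = train_test_split(
        np.unique(student_ids), test_size=0.2, random_state=seed
    )
    return set(train_ids), set(test_ids)

def workspace_auc(y_true, y_pred):
    """AUC over all held-out first attempts in one workspace.

    y_true: 0/1 correctness labels for test students' attempts.
    y_pred: the model's predicted probabilities of a correct response.
    """
    return roc_auc_score(y_true, y_pred)

# Sketch of per-workspace evaluation; fit_and_predict is a hypothetical stand-in
# for training either model (DKT or BKT) on the training students and predicting
# each held-out first attempt:
# auc_dkt[ws] = workspace_auc(*fit_and_predict(dkt, sequences_by_workspace[ws], test_ids))
# auc_bkt[ws] = workspace_auc(*fit_and_predict(bkt, sequences_by_workspace[ws], test_ids))
```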
4.2.2 Results
Table 2 provides summary statistics for AUC performance for DKT, for BKT, and for the AUC difference between these methods over all workspaces. As expected, DKT generally provides superior prediction performance to classic BKT over the 308 workspaces. However, there is substantial variability, with classic BKT in some cases, albeit many (but not all) with relatively small sample sizes, even out-performing DKT. While there is a modest, statistically significant positive correlation between the DKT-BKT AUC difference and sample size (i.e., the number of student-sequences available for training and testing) (r = .2; p < .001), BKT performs comparably to DKT on a number of workspaces with tens of thousands of students’ data, and BKT only underperforms DKT by approximately .07 AUC units for the median workspace. The Q1 value for this difference (i.e., the greatest difference observed over the 77 workspaces in the lowest quartile) is approximately in line, in terms of AUC units, with a value (.03 AUC units) declared comparable by Khajah et al. [18] for BKT “variants” compared to DKT.

The difference in AUC between DKT and BKT is uncorrelated with the proportion of students who fail to reach mastery (r = -.05; p = .4) and is thus not an indicator of the relative difficulty of particular workspaces, regardless of the source of difficulty.

Table 2. Summary statistics for AUC performance over 308 workspaces of DKT and BKT models and of the difference between DKT and BKT performance (∆); the negative minimum value indicates better BKT performance for some workspaces.

AUC    Min.      Q1       Med.     Q3       Max.
DKT    .5852     .7839    .8331    .8783    .9763
BKT    .5150     .7045    .7456    .7854    .9563
∆      -.0802    .0361    .0676    .1281    .3073
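Given per-workspace AUCs and the other per-workspace quantities discussed above, ranking candidate workspaces and checking the reported (un)correlations is straightforward. The sketch below is ours and assumes the per-workspace results have been collected into a table with columns named workspace, auc_dkt, auc_bkt, n_students, and fail_rate (the Section 4.1 metric).

```python
import pandas as pd
from scipy.stats import pearsonr

# Assumed per-workspace results: one row per workspace.
results = pd.read_csv("workspace_model_results.csv")  # hypothetical file
results["delta_auc"] = results["auc_dkt"] - results["auc_bkt"]

# Modest positive correlation of the gap with sample size (reported r = .2),
# and no correlation with the failure-to-master proportion (reported r = -.05).
r_size, p_size = pearsonr(results["delta_auc"], results["n_students"])
r_fail, p_fail = pearsonr(results["delta_auc"], results["fail_rate"])

# Workspaces where DKT most outperforms BKT: candidate targets for KC model
# review and redesign, in the spirit of Section 4.2.3.
targets = results.sort_values("delta_auc", ascending=False).head(20)
print(targets[["workspace", "delta_auc", "fail_rate", "n_students"]])
```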
4.2.3 Practical Promise
We consider two observations relating to workspace design patterns that emerge from considering workspaces with the largest differences in terms of DKT’s (generally better) prediction performance compared to BKT. First, we consider the design of a particular workspace as a prime target for design-loop adaptivity to student knowledge, motivation, and affect. Second, we consider more general design patterns in workspaces on which DKT and BKT performance differences are greatest, suggesting more “macro-level” design-loop adaptivity that may affect broader categories of workspaces.

4.2.3.1 Example Workspace
The second greatest observed difference in AUC occurred for the workspace “Checking Solutions to Linear Equations” (DKT AUC = .968; BKT AUC = .684). A mere 0.2% of students fail to master all KCs in this workspace, suggesting that it may not be “flagged” for design-loop adaptivity improvements based on difficulty. Nevertheless, careful inspection of the workspace yields several areas for improvement.

This workspace presents students with problems (see Figure 4) like: “Jordan solved the equation -3u – 8 = 10. She calculated u = -6. Use the Solver to check Jordan’s solution.” The student is then presented with a menu-based equation solver. Work with the equation solver should involve the student substituting in the solution value from the problem presentation and checking whether the result is a balanced equation. After choosing “Substitute for variable” from the menu, the student then must input a value on the left-hand side of the equation (see Figure 5).

Figure 4. Screenshot in the MATHia workspace “Checking Solutions to Linear Equations.”

Figure 5. Screenshot after the student has selected “Substitute for variable” from the equation solving menu (see Figure 4).

Problems in this workspace present both correct and incorrect cases, but the KC model does not distinguish between correct and incorrect cases, making problems with a correct solution targets for possible gaming the system. For example, in the problem in Figure 5, the student might enter 10 to complete “10 = 10.” This response may not reflect having correctly carried out the variable substitution to arrive at this solution. KCs in this workspace are also not currently mapped to work in the solver; the solver provides hints and just-in-time feedback on errors, but it is not instrumented to track KC mastery. Once the student has entered the appropriate value, two questions appear on the left of the screen (see Figure 6), and student responses to these questions trigger updates to two KCs at a time.

Figure 6. Screenshot after the student has entered the value 10 on the left-hand side of the equation (see Figure 5).

Following the design-loop adaptivity steps laid out by Huang et al. [15], we have identified (1) where the KC model can be refined, and (2) areas for task redesign. The third step would involve fitting a BKT model using the hypothetical re-mapping of KCs to steps within problems in existing data to determine whether the hypothesized, refined KC model fits the data better than the existing KC model. Future work will experimentally test workspace redesigns to “close the loop” (cf. [16, 25]) between this data-driven approach and empirical learning outcomes.

4.2.3.2 Prominent Design Patterns
Patterns emerge in comparing the performance of DKT to BKT over 308 workspaces. In the top twenty workspaces in which DKT outperforms BKT, differences in AUC units range from .307 to .212, and all provide constrained input mechanisms relative to broader MATHia content. Fourteen workspaces (70%) involve equation solving, and the others are split between those that involve placing values on a number line and those in which problem input is provided via drop-down menus.

Gervet et al. raise questions about explanations for observed properties of DKT in predicting student performance. Can DKT, for example, “better pick up on local patterns of student behavior like gaming the systems” [12]? While far from conclusive, DKT’s performance for the workspace “Checking Solutions to Linear Equations” could exemplify this phenomenon. Workspaces with more constrained inputs may provide examples where DKT “picks up” on local patterns that BKT does not. Future work ought to investigate whether these particular types of relatively constrained input mechanisms are easy to “game” or whether and how DKT learns local performance patterns.

Equation solver and number line workspaces are widespread among the top workspaces in which DKT outperforms BKT. “Checking Solutions to Linear Equations” has readily apparent flaws, suggesting that our approach may be promising in targeting instructional improvement work. Systematic review of these results remains future work.

5. DISCUSSION
There are numerous questions for future research. That differences in AUC between DKT and BKT are uncorrelated with an important measure of instructional ineffectiveness, combined with DKT’s ability to find regularities in data that are not found by BKT, suggests that this difference may be signaling important workspace characteristics. Analysis of a particular workspace (§4.2.3.1) suggests that DKT-BKT differences may signal inadequacies in the KC model. These findings can be compared to the results of data-driven search for better KC models [8]. Improvements can be made to the workspace, and A/B tests can “close the loop” and establish more effective approaches.

Systematic analysis of instructional content and prediction performance differences in DKT and BKT might follow work that explores a space of properties and features of particular tutor “lessons” to determine which predict students’ affect, gaming the system, and off-task behavior [4-5, 10]. Comparisons to logistic regression methods (e.g., [12]) are also needed.

Naïve learning engineering may focus on reducing students’ mastery failures. Such an approach could lead to “over-simplified” tasks that don’t produce failure because they don’t require much knowledge. Large differences between DKT and BKT may help identify over-simplified workspaces that provide opportunities for students to game the system [3, 12]. To what extent do gaps in modeling techniques’ performance indicate unproductive patterns of “local” behavior in particular workspaces? What else drives differences? What other behavior patterns indicate ways to target improvement?

Methodologically, our “non-monolithic” analysis of a large educational data set treats component instructional experiences as units for analysis. Such analytical decomposition is vital to practical learning engineering to improve instructional systems and large portfolios of content used by learners every day.

6. ACKNOWLEDGMENTS
This research was sponsored by the National Science Foundation under the award The Learner Data Institute (Award #1934745). The opinions, findings, and results are solely those of the authors and do not reflect those of the National Science Foundation.
7. REFERENCES
[1] Aleven, V., McLaughlin, E.A., Glenn, R.A., Koedinger, K.R. 2017. Instruction based on adaptive learning technologies. In Handbook of Research on Learning and Instruction, 2nd Ed., Routledge, New York, 522–560.
[2] Aleven, V., Sewall, J. 2016. The frequency of tutor behaviors: a case study. In ITS 2016. LNCS, vol. 9684, Springer, Cham, 396–401. https://doi.org/10.1007/978-3-319-39583-8_47
[3] Baker, R.S., Corbett, A.T., Koedinger, K.R., Wagner, A.Z. 2004. Off-task behavior in the Cognitive Tutor classroom: When students "game the system." In Proceedings of ACM CHI 2004: Computer-Human Interaction, 383–390.
[4] Baker, R.S.J.d. 2009. Differences between intelligent tutor lessons, and the choice to go off-task. In Proceedings of the 2nd International Conference on Educational Data Mining, 11–20.
[5] Baker, R.S.J.d., de Carvalho, A.M.J.A., Raspat, J., Aleven, V., Corbett, A.T., Koedinger, K.R. 2009. Educational software features that encourage and discourage "gaming the system". In Proceedings of the 14th International Conference on Artificial Intelligence in Education, 475–482.
[6] Beck, J.E., Gong, Y. 2013. Wheel-spinning: Students who fail to master a skill. In Artificial Intelligence in Education. AIED 2013. LNCS, vol. 7926. Springer, Berlin/Heidelberg. https://doi.org/10.1007/978-3-642-39112-5_44
[7] Bloom, B.S. 1968. Learning for mastery. Evaluation Comment 1(2). Los Angeles: University of California at Los Angeles, Center for the Study of Evaluation of Instructional Programs.
[8] Cen, H., Koedinger, K.R., Junker, B. 2006. Learning factors analysis: A general method for cognitive model evaluation and improvement. In Proceedings of the 8th International Conference on Intelligent Tutoring Systems. Springer-Verlag, Berlin, 164–175.
[9] Corbett, A.T., Anderson, J.R. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4, 253–278.
[10] Doddannara, L.S., Gowda, S.M., Baker, R.S.J.d., Gowda, S.M., de Carvalho, A.M.J.B. 2013. Exploring the relationships between design, students' affective states, and disengaged behaviors within an ITS. In Artificial Intelligence in Education. AIED 2013. LNCS, vol. 7926. Springer, Berlin/Heidelberg. https://doi.org/10.1007/978-3-642-39112-5_4
[11] Fancsali, S.E., Holstein, K., Sandbothe, M., Ritter, S., McLaren, B.M., Aleven, V. 2020. Towards practical detection of unproductive struggle. In Artificial Intelligence in Education. AIED 2020. LNCS, vol. 12164. Springer, Cham. https://doi.org/10.1007/978-3-030-52240-7_17
[12] Gervet, T., Koedinger, K., Schneider, J., Mitchell, T. 2020. When is deep learning the best approach to knowledge tracing? Journal of Educational Data Mining 12(3), 31–54. https://doi.org/10.5281/zenodo.4143614
[13] Heffernan, N.T., Heffernan, C.L. 2014. The ASSISTments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. Int. J. Artif. Intell. Educ. 24(4), 470–497.
[14] Hess, F., Saxberg, B. 2014. Breakthrough Leadership in the Digital Age: Using Learning Science to Reboot Schooling. Thousand Oaks, CA: Corwin.
[15] Huang, Y., Aleven, V., McLaughlin, E., Koedinger, K. 2020. A general multi-method approach to design-loop adaptivity in intelligent tutoring systems. In Artificial Intelligence in Education. AIED 2020. LNCS, vol. 12164. Springer, Cham, 124–129. https://doi.org/10.1007/978-3-030-52240-7_23
[16] Liu, R., Koedinger, K.R. 2017. Closing the loop: Automated data-driven cognitive model discoveries lead to improved instruction and learning gains. Journal of Educational Data Mining 9(1), 25–41.
[17] Lovett, M., Meyer, O., Thille, C. 2008. The Open Learning Initiative: Measuring the effectiveness of the OLI statistics course in accelerating student learning. Journal of Interactive Media in Education, 14.
[18] Khajah, M., Lindsey, R.V., Mozer, M.C. 2016. How deep is knowledge tracing? In Proceedings of the 9th International Conference on Educational Data Mining (Jun 29 - Jul 2, 2016, Raleigh, NC, USA). EDM 2016. International Educational Data Mining Society, 94–101.
[19] Koedinger, K.R., Corbett, A.T., Perfetti, C. 2012. The knowledge-learning-instruction framework: Bridging the science-practice chasm to enhance robust student learning. Cognitive Science 36(5), 757–798.
[20] Koedinger, K.R., McLaughlin, E.A. 2010. Seeing language learning inside the math: Cognitive analysis yields transfer. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society. Austin, TX, Cognitive Science Society, 471–476.
[21] Pavlik, P.I., Cen, H., Koedinger, K.R. 2009. Performance factors analysis – a new alternative to knowledge tracing. In Proceedings of the 2009 Conference on Artificial Intelligence in Education. IOS Press, 531–538.
[22] Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L.J., Sohl-Dickstein, J. 2015. Deep knowledge tracing. In Advances in Neural Information Processing Systems, 505–513.
[23] Ritter, S., Anderson, J.R., Koedinger, K.R., Corbett, A.T. 2007. Cognitive Tutor: Applied research in mathematics education. Psychon. B. Rev. 14, 249–255.
[24] Ritter, S., Yudelson, M., Fancsali, S.E., Berman, S.R. 2016. How mastery learning works at scale. In Proceedings of the 3rd Annual ACM Conference on Learning at Scale (Apr 25-26, 2016, Edinburgh, UK). L@S 2016. ACM, New York, NY, 71–79.
[25] Rosé, C.P., McLaughlin, E.A., Liu, R., Koedinger, K.R. 2019. Explanatory learner models: Why machine learning (alone) is not the answer. Br J Educ Technol 50, 2943–2958. https://doi.org/10.1111/bjet.12858
[26] Simon, H.A. 1967. The job of a college president. Educational Record 48(Winter), 68–78.
[27] Stamper, J., Koedinger, K.R. 2011. Human-machine student model discovery and improvement using data. In Proceedings of the 15th International Conference on Artificial Intelligence in Education. Springer, Berlin/Heidelberg, 353–360.
[28] VanLehn, K. 2006. The behavior of tutoring systems. International Journal of Artificial Intelligence in Education 16(3), 227–265.
[29] Wilson, K.H., Xiong, X., Khajah, M., Lindsey, R.V., Zhao, S., Karklin, Y., Van Inwegen, E.G., Han, B., Ekanadham, C., Beck, J.E., Heffernan, N., Mozer, M.C. 2016. Estimating student proficiency: Deep learning is not the panacea. In Proceedings of the Workshop on Machine Learning for Education at the 30th Conference on Neural Information Processing Systems (NIPS 2016).
[30] Yudelson, M., Koedinger, K., Gordon, G. 2013. Individualized Bayesian knowledge tracing models. In Proceedings of the 16th International Conference on Artificial Intelligence in Education (Memphis, TN). AIED 2013. LNCS, vol. 7926, Springer-Verlag, Berlin/Heidelberg, 171–180.