
In this chapter, we discuss three studies to evaluate various aspects of \Edgeworth (\cref{chp:edgeworth}). To effectively scale up visual practice authoring, \Edgeworth must support a diverse set of instructional domains, generate high-quality diagrams consistently, and allow educators to author real-world problems. In \cref{sec:edgeworth-user-study,sec:reliability-eval,sec:expert-feedback}, we evaluate \Edgeworth by answering the following research questions on these qualities:

% self-contained RQs
% \refstepcounter{rqsupcounter}\label{rq:mut}
% \refstepcounter{rqsupcounter}\label{rq:eff}
% \refstepcounter{rqsupcounter}\label{rq:eco}


\begin{itemize}
    \item\textbf{Reliability} (\ref{rq:mut}): Can \Edgeworth reliably generate translation problems within relatively few diagram variations?
    \item\textbf{Efficiency} (\ref{rq:eff}): Compared with a conventional drawing tool, are authors more efficient at making translation problems using \Edgeworth?
    \item\textbf{Ecological validity} (\ref{rq:eco}): Do real-world instructors consider \Edgeworth-generated translation problems to be useful? 
\end{itemize}

First, we evaluated the reliability of \Edgeworth by hand-labeling 310 diagram variations generated from the translation problem dataset (\cref{sec:edgeworth-case-studies}). With high inter-rater reliability ($\kappa=1$), the results show that \Edgeworth can reliably generate diagrams that constitute valid four-choice translation problems, when constrained to 10 variations per problem.

Second, we performed a user study to measure authors' efficiency at creating translation problems using \Edgeworth, compared with a conventional drawing tool. The results show that once authors make a correct diagram, they are about 3 times faster at making diagrammatic options for translation problems using \Edgeworth compared to Google Drawings. 

Finally, we conducted walkthrough demonstrations with 9 educators who have experience creating problems. The goal of the demonstrations was to obtain feedback on the ecological validity of \Edgeworth-generated problems and the usefulness of \Edgeworth in general. Overall, these experts found \Edgeworth-generated problems to contain pedagogically useful variations and to be of high visual quality. They provided detailed feedback on individual diagram variations and suggested how \Edgeworth might fit into their instructional contexts.

\section{Reliability Evaluation (\ref{rq:mut})}
\label{sec:reliability-eval}

\Edgeworth's approach involves random mutations. The mutation operations are type-safe, but type-safety does not prevent degenerate diagram layouts. For instance, \sub{Point A, B} followed by \sub{Triangle t := MkTriangle(A, A, B)} will typecheck. However, since the triangle described in this scenario involves the \sub{Point A} twice, \Edgeworth will produce a line segment, not a triangle from this scenario. Are \Edgeworth suggestions dominated by these nonsensical scenarios? In this section, we evaluate whether \Edgeworth can reliably suggest diagrams that are valid answer options to multiple-choice translation problems (\ref{rq:mut}). 

\subsection{Methods}
\label{sec:reliability-method}

The goal of \Edgeworth is to generate enough diagram variations to assemble a four-choice multiple-choice problem for a given prompt. To this end, we use the following classification scheme for diagram variations: a variation can be a \textbf{Correct} or \textbf{Incorrect} answer to the prompt, or \textbf{Discard}ed because the diagram is invalid for missing key components or lacking readability.

For \ref{rq:mut}, we define ``relatively few variations'' to be 10 diagrams, and consider \Edgeworth to have generated a translation problem in $n$ variations if at that point we have (possibly including the original diagram) at least one \textbf{Correct} diagram, at least one \textbf{Incorrect} diagram, and in total at least four diagrams that are either \textbf{Correct} or \textbf{Incorrect}.
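
Stated as a condition, and using notation introduced here only for illustration: let $c_n$ and $i_n$ denote the numbers of \textbf{Correct} and \textbf{Incorrect} diagrams among the original diagram and the first $n$ variations. Under this reading of the definition above, \Edgeworth generates a translation problem within $n$ variations when
\[
  c_n \geq 1, \qquad i_n \geq 1, \qquad c_n + i_n \geq 4,
\]
and \ref{rq:mut} asks whether this condition holds at $n = 10$. \textbf{Discard}ed diagrams count toward $n$ but toward neither $c_n$ nor $i_n$.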

We used \Edgeworth to generate 10 diagrams per problem for all 31 problems in the translation problem dataset (\cref{sec:edgeworth-case-studies}), which yielded 310 diagrams in total. To evaluate this coding scheme, we randomly sampled 2 problems from each of our 3 domains, for 60 generated diagrams total. The first two authors each coded all 60 of those sample diagrams, after which we calculated Cohen's $\kappa$~\cite{cohen1960coefficient} ($\kappa=1$). Then, assuming that our coding scheme has reasonable inter-rater reliability, at least one author\footnote{The study was conducted jointly with the authors of~\citet{ni_edgeworth_2024}.} coded all remaining diagrams, allowing us to determine the number of our prompts for which \Edgeworth was able to successfully generate a multiple-choice problem. The coding results are included in \cref{app:reliability}.

\subsection{Results}

\subsubsection{Reliability of Problem Generation}

For \ref{rq:mut}, we found that \Edgeworth generated valid multiple-choice problems for 27/31 prompts within 10 variations, and for 30/31 problems within 20 variations. For each of these four failures with 10 variations, \Edgeworth did generate at least four \textbf{Correct} examples, but we had to \textbf{Discard} all the other diagrams, leaving no \textbf{Incorrect} examples. For the one remaining failure with 20 variations, \Edgeworth never succeeded even after we increased the number of variations to 50.


\subsubsection{Distribution}

\begin{table}
    \centering
    \begin{tabular}{r|rrr|r}
        & \textbf{Correct} & \textbf{Incorrect} & \textbf{Discard} & \textit{total} \\
        \hline
        geometry & 52 & 54 & 64 & 170 \\
        chemistry & 3 & 54 & 13 & 70 \\
        discrete & 28 & 25 & 17 & 70 \\
        \hline
        \textit{total} & 83 & 133 & 94 & 310
    \end{tabular}
    \caption{Distribution of diagram variation classes.}
    % \Description{This table describes the coding results from the reliability evaluation. There are five rows (blank, geometry, chemistry, discrete, and total) and five columns (blank, correct, incorrect, discard, and total). The first row and first column contain the headers. The numbers starting from row 2, column 2, in row order are: Row 2: 52, 54, 64, 170; Row 3: 3, 54, 13, 70; Row 4: 28, 25, 17, 70; Row 5: 83, 133, 94, 310}
    \label{tab:distribution}
\end{table}

The original diagram is a \textbf{Correct} answer for every prompt, except for the two Euler circuit prompts, in which the original diagram is \textbf{Incorrect}. For \Edgeworth-generated variations, the full distribution of classes is shown in Table~\ref{tab:distribution}.

The chemistry domain had a far smaller proportion of \textbf{Correct} variations than the other two domains because the only way for a variation to be \textbf{Correct} is for it to coincidentally be identical to the original diagram. Interestingly, in the other two domains, there were about the same number of \textbf{Correct} and \textbf{Incorrect} variations.
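
As a rough check on these proportions, the share of \textbf{Correct} variations in each domain can be read off the per-domain rows of \cref{tab:distribution} (percentages rounded to one decimal place):
\[
  \frac{3}{70} \approx 4.3\%\ (\mathrm{chemistry}), \qquad
  \frac{52}{170} \approx 30.6\%\ (\mathrm{geometry}), \qquad
  \frac{28}{70} = 40.0\%\ (\mathrm{discrete}).
\]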

In the geometry domain, \textbf{Discard}ed diagrams were primarily either diagrams missing elements referred to in the question prompt, or diagrams that were visually degenerate (\eg everything compressed into a single line). In chemistry, we \textbf{Discard}ed diagrams where the molecule was disconnected. Finally, in the graph domain, we \textbf{Discard}ed diagrams in which some nodes were labeled and others were unlabeled (\ie \Edgeworth had inserted new unlabeled nodes when all nodes in the original diagram were labeled).

\subsubsection{Inter-rater Agreement}

We sampled two problems per domain from the problems collected in~\cref{sec:edgeworth-case-studies} to evaluate inter-rater agreement (six problems or sixty diagrams in total, 19\% of the 310 generated diagrams). We found perfect agreement on that sample, so $\kappa = 1$.
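
For reference, Cohen's $\kappa$ compares the observed agreement $p_o$ between two raters with the agreement $p_e$ expected by chance given each rater's label frequencies:
\[
  \kappa = \frac{p_o - p_e}{1 - p_e}.
\]
With identical labels on all sixty diagrams, $p_o = 1$, so $\kappa = (1 - p_e)/(1 - p_e) = 1$, provided that $p_e < 1$ (\ie the raters used more than one class on the sample).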

\section{Experimental Evaluation of Authoring Efficiency (\ref{rq:eff})}
\label{sec:edgeworth-user-study}

To answer \ref{rq:eff}, we conducted an experiment that compares \Edgeworth against a conventional drawing tool in translation problem authoring tasks. In this section, we describe the experimental setup and findings.

\subsection{Study Design}


\subsubsection{Participants}

We recruited 16 participants through advertisements in the university community (\eg emails and Slack channels). Participants were screened for past experience using digital drawing tools. All participants reported that they had used Google Drawings and/or equivalent tools to make diagrams in the past. Three of the 16 participants were Software Engineering Ph.D. students at Carnegie Mellon University and 13 were undergraduate students participating in a Research Experiences for Undergraduates program at Carnegie Mellon. All participants had previously taken at least an introductory computer science course.

\subsubsection{Tasks}
\label{sec:edgeworth-user-tasks}

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{assets/edgeworth-eval/user-study-tasks.pdf}
    \caption{Tasks used in the \Edgeworth experimental evaluation. Each participant was given a textual prompt and a correct diagram for this prompt at the beginning of each task. They were asked to first reproduce the correct diagram using the designated tool in the correct segment, and then edit this diagram to produce up to 10 incorrect diagrams for the prompt in the incorrect segment.}
    \label{fig:edgeworth-user-study-tasks}
\end{figure}

We selected four problem prompts from the translation problem dataset (\cref{sec:edgeworth-case-studies}), two from the chemistry domain and two from geometry, shown in \cref{fig:edgeworth-user-study-tasks}.

We segmented the authoring of the first correct diagram and subsequent incorrect diagrams in the tasks. This segmentation allows us to separately measure the authoring efficiency of creating the example scenario (\cref{sec:create-scenario}) and creating counterexamples (\cref{sec:select-diagrams}). For participants who used \Edgeworth, we were particularly interested in the upfront cost of making the first \Substance diagram in the \Penrose editor. 

For each task, participants were given (1) a textual problem prompt and (2) an example diagram (\ie a correct response to the prompt). Participants were then given up to 20 minutes to complete each task, which involved two segments: (a) \textbf{correct segment}: participants first re-created one example visually similar to the given diagram, and then (b) \textbf{incorrect segment}: made up to 10 incorrect diagrams by editing the diagram produced in segment (a). Each segment was limited to 10 minutes. If a participant failed to produce a correct diagram in the first segment, they were provided with one so they could continue to the next segment. Each participant completed two problem prompts in either chemistry or geometry.

\subsubsection{Experimental Design}

\begin{table}[t]
\centering
\begin{tabular}{l|llll}
Domain & Task 1 (Prompt 1) & Task 2 (Prompt 1) & Task 3 (Prompt 2) & Task 4 (Prompt 2)  \\ \hline
Chemistry &  Google Drawings & \Edgeworth & Google Drawings & \Edgeworth \\
Chemistry & \Edgeworth & Google Drawings & \Edgeworth & Google Drawings \\
Geometry &  Google Drawings & \Edgeworth & Google Drawings & \Edgeworth \\
Geometry & \Edgeworth & Google Drawings & \Edgeworth & Google Drawings \\
\end{tabular}
\caption{Participants were divided into 4 groups by the tools they used and the diagramming domains of the tasks. Each row corresponds to the task sequence of one of the groups. Participants used both \Edgeworth and Google Drawings to author problems for two prompts in chemistry or geometry (\cref{fig:edgeworth-user-study-tasks}).} \label{tab:edgeworth-experiment-groups}
\end{table}

% groups
The study used a within-subject design, where participants were divided into four groups by the ordering of the tools they used and the diagramming domain of their tasks. Participants used both \Edgeworth and Google Drawings to author diagrams in a randomly assigned, counterbalanced order. Participants were further randomly assigned to one of two subgroups: one subgroup made chemistry diagrams and the other made geometry diagrams. \cref{tab:edgeworth-experiment-groups} summarizes the four groups that resulted from the tool and domain assignments. Each group had four participants.

In the 90-minute study session, each participant was given two problem prompts in total, each repeated twice for \Edgeworth and Google Drawings, so four tasks in total. For instance, a participant in the chemistry-drawing group (the first row in \cref{tab:edgeworth-experiment-groups}) would spend up to 10 minutes making 1 correct diagram (correct segment) and then up to 10 minutes to make incorrect diagrams of \ensuremath{\mathrm{CH_2O}} using Google Drawings (incorrect segment) first, and then another 20 minutes on the same prompt using \Penrose for the correct segment and \Edgeworth for the incorrect segment. After that, this participant would repeat the same for  \ensuremath{\mathrm{HNO_3}}.


% authoring assistance

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{assets/edgeworth-eval/user-study-examples.pdf}
    \caption{Participants were provided both Google Drawings and \Substance examples throughout the study. The \SubstanceColored code (left) was given in the \Edgeworth tasks and a Google Drawings file that visually resembles the \Penrose output (right) was given for the Google Drawings tasks.}
    \label{fig:edgeworth-user-study-examples}
\end{figure}

At the start of each study session, participants were given 5-minute tutorials on \Edgeworth and Google Drawings, in which they were guided to draw either a right triangle or the Lewis structure of \ensuremath{\mathrm{O_2}}. The ordering of the tutorials matched the counterbalanced ordering of Google Drawings and \Edgeworth. Throughout the session, participants had access to one Google Drawings example and one \Substance example. \cref{fig:edgeworth-user-study-examples} shows the chemistry and geometry examples. The examples are samples from the translation problem dataset (\cref{sec:edgeworth-case-studies}) that are visually more complex than the actual study tasks. We provided them to the participants as an authoring aid so that they could copy elements from the examples to save time, analogous to the real-world experience of copying and pasting online examples reported in \cref{sec:edgeworth-formative}.

Participants received no further instructions during the tasks. The experimenter only observed the participant and used a stopwatch to measure the time on task. After completing each task, participants completed a survey that asked whether they agreed with the following statements on a 5-point Likert scale:

\begin{itemize}
  \item I would use this problem for a class that I teach.
  \item The problem is pedagogically useful (\ie students will benefit from doing this problem).  
  \item The diagrams in the problem are of high visual quality.
\end{itemize}

The study took about 90 minutes per participant and was run on a provided MacBook Pro with the latest version of Chrome installed. The study sessions were audio-recorded and transcribed. Each participant was compensated with a \$25 Amazon gift card for their time.

\subsection{Results}

% completion
\cref{tab:edgeworth-user-study-timing} shows the average total time, diagrams produced, and time per diagram for all participants. The \textbf{Diagram Ct} column shows how many diagrams participants produced on average in each segment. An average below 1 for correct segments, or below 10 for incorrect segments, indicates that at least one participant did not complete the segment.
All participants authored 10 incorrect diagrams within 10 minutes using \Edgeworth for both domains. In the geometry group, 6 out of 8 participants (11 out of 16 total segments) failed to do so using Google Drawings for at least one segment. In the chemistry group, 1 out of 8 participants failed for both incorrect segments. All participants were able to complete the correct diagram for both chemistry prompts using both tools. For the geometry tasks, all participants produced one correct diagram in the first segment using Google Drawings, but 3 failed using \Penrose. 

\begin{table}[h!]
\centering
\begin{tabular}{|l|l|l|r|r|r|}
\hline
\textbf{Domain} & \textbf{Segment} & \textbf{Tool} & \textbf{Total Time} & \textbf{Diagram Ct} & \textbf{Time/Diagram} \\ \hline
Chemistry & correct   & \Penrose        & 144.19s & 1.00  & 144.19s \\ \cline{3-6} 
          &           & Google Drawings & 231.81s & 1.00  & 231.81s \\ \cline{2-6}
          & incorrect & \Edgeworth      & 150.13s & 10.00 & 15.01s  \\ \cline{3-6} 
          &           & Google Drawings & 440.63s & 9.38  & 51.56s  \\ \hline
Geometry  & correct   & \Penrose        & 390.94s & 0.81  & 390.94s \\ \cline{3-6} 
          &           & Google Drawings & 228.50s & 1.00  & 228.50s \\ \cline{2-6}
          & incorrect & \Edgeworth      & 257.25s & 10.00 & 25.73s  \\ \cline{3-6} 
          &           & Google Drawings & 549.38s & 7.38  & 100.90s \\ \hline
\end{tabular}
\caption{Average total time, diagram count, and time per diagram for each domain (chemistry and geometry), segment (\cref{sec:edgeworth-user-tasks}), and tool. Each participant produced up to 1 correct diagram first and then up to 10 incorrect diagrams. Times reported for the ``correct'' segment are for the correct diagram, and times reported for the ``incorrect'' segment are for the incorrect diagrams.}
\label{tab:edgeworth-user-study-timing}
\end{table}
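
The fractional values in the \textbf{Diagram Ct} column can be related back to segment counts. Assuming each table cell averages over 16 segments (8 participants per domain, 2 prompts each), which is consistent with the 16 geometry segments mentioned above, the averages correspond approximately to
\[
  0.81 \times 16 \approx 13, \qquad
  9.38 \times 16 \approx 150, \qquad
  7.38 \times 16 \approx 118,
\]
\ie roughly 13 of 16 correct geometry diagrams with \Penrose, and roughly 150 and 118 of a possible 160 incorrect diagrams with Google Drawings in chemistry and geometry, respectively.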

% time
A two-way repeated measures ANOVA was conducted to examine the within-subject effects of tool (\Edgeworth vs. Google Drawings) and task (correct vs. incorrect) on various outcomes. The analysis included task completion time and other performance metrics across the domains of chemistry and geometry.

\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{assets/edgeworth-eval/timing-violin.pdf}
    \caption{Violin plots showing the distribution of time-on-task for both correct (\textbf{Left}) and incorrect (\textbf{Right}) segments of tasks. The shape of the violins represents a smoothed approximation of the data distribution, with wider sections representing higher density. The embedded box plots within the violins show the median (white line) and inter-quartile range (thick black bar), with the whiskers (thin black lines) extending to the data range.}
    \label{fig:timing-violin}
\end{figure}


For the incorrect segments, the analysis revealed significant differences in task completion times between the tools used, visualized in \cref{fig:timing-violin} (right). In the chemistry domain, participants completed tasks significantly faster (almost 3 times faster) using \Edgeworth (\textit{M} = 150.13s, \textit{SD} = 59.15s) compared to Google Drawings (\textit{M} = 440.63s, \textit{SD} = 113.04s), as indicated by the significant main effect of tool, $F(1, 7) = 53.33, p = 0.0002$. There was no significant effect of the task itself, $F(1, 7) = 1.29, p = 0.293$, suggesting that the difficulty of the tasks was consistent regardless of the tool used. Similarly, in the geometry domain, participants completed tasks significantly faster (about 2.1 times faster in total time) with \Edgeworth (\textit{M} = 257.25s, \textit{SD} = 139.55s) compared to Google Drawings (\textit{M} = 549.38s, \textit{SD} = 91.28s), with a significant effect of the tool, $F(1, 7) = 90.97, p < 0.0001$. There was a marginal effect of task, $F(1, 7) = 3.64, p = 0.098$, indicating a trend towards task differences that did not reach statistical significance.

For the correct segments, the analysis showed mixed results on \Penrose's performance depending on the domain, illustrated in \cref{fig:timing-violin} (left). In the chemistry domain, participants completed tasks significantly faster using \Penrose (\textit{M} = 144.19s, \textit{SD} = 79.22s) compared to Google Drawings (\textit{M} = 231.81s, \textit{SD} = 129.78s), as indicated by the significant main effect of tool, $F(1, 7) = 6.65, p = 0.037$. There was no significant effect of the task itself, $F(1, 7) = 0.99, p = 0.353$. In the geometry domain, however, Google Drawings outperformed \Penrose, with participants completing tasks faster using Google Drawings (\textit{M} = 228.50s, \textit{SD} = 71.74s) compared to \Penrose (\textit{M} = 390.94s, \textit{SD} = 149.74s), as indicated by the significant main effect of tool, $F(1, 7) = 15.95, p = 0.005$. Again, there was no significant effect of the task itself, $F(1, 7) = 0.28, p = 0.611$.

% survey

In the per-task survey, summarized in \cref{tab:edgeworth-user-study-survey}, participants provided feedback on the tasks across two domains and both tools. For the chemistry tasks, \Edgeworth was rated highly across all survey items, with participants expressing a strong likelihood of using the problems in their classes (\textit{M} = 4.19), finding them pedagogically useful (\textit{M} = 4.25), and rating the visual quality of the diagrams as excellent (\textit{M} = 4.50). In contrast, Google Drawings received lower ratings in chemistry, particularly in terms of visual quality (\textit{M} = 2.81). In the geometry domain, \Edgeworth was also rated well on usefulness (\textit{M} = 4.06), although its visual quality rating (\textit{M} = 3.38) was lower than in chemistry. Google Drawings in geometry was rated lower on willingness to use (\textit{M} = 3.38) and usefulness (\textit{M} = 3.50), but marginally higher on visual quality (\textit{M} = 3.44). Overall, \Edgeworth out-rated Google Drawings on every item except visual quality in geometry, where the two tools were rated about equally; the largest gap was in the visual quality of the chemistry diagrams.

% MANOVA

We conducted a Multivariate Analysis of Variance (MANOVA) on the survey data to quantitatively assess the impact of the tool (\Edgeworth{} vs. Google Drawings) and the domain (chemistry vs. geometry) on three dependent variables corresponding to the survey questions. The results showed a significant effect of the tool on the combined dependent variables, with Wilks' lambda indicating that the choice of tool had a statistically significant influence on the survey responses, $F(3, 59) = 3.3995$, $p = 0.0235$. The domain (chemistry vs. geometry) did not have a significant effect on the combined dependent variables, $F(3, 59) = 0.7550$, $p = 0.5239$. A significant intercept observed in the analysis, $F(3, 59) = 217.9321$, $p < 0.0001$, suggests that the overall mean response across all groups was significantly different from zero, indicating that participants generally provided positive ratings across all survey items. In summary, \Edgeworth{} was perceived more favorably across the three survey questions compared to Google Drawings.


\begin{table}[t]
\centering
\begin{tabular}{l|l|l|l|l}
\hline
\textbf{Domain} & \textbf{Tool} & \textbf{Would Use} & \textbf{Useful} & \textbf{High Quality} \\ \hline
\multirow{2}{*}{\centering Chemistry} 
    & \Edgeworth
    & \progressbar{4.19} 4.19 & \progressbar{4.25} 4.25 & \progressbar{4.50} 4.50 \\ \cline{2-5}
    & Google Drawings 
    & \progressbar{3.63} 3.63 & \progressbar{3.94} 3.94 & \progressbar{2.81} 2.81 \\ \hline

\multirow{2}{*}{\centering Geometry} 
    & \Edgeworth 
    & \progressbar{3.94} 3.94 & \progressbar{4.06} 4.06 & \progressbar{3.38} 3.38 \\ \cline{2-5}
    & Google Drawings 
    & \progressbar{3.38} 3.38 & \progressbar{3.50} 3.50 & \progressbar{3.44} 3.44 \\ \hline
\end{tabular}
\caption{Survey responses for chemistry and geometry tasks using \Edgeworth and Google Drawings. Higher numbers (visualized in green hue) indicate positive responses and lower numbers (yellow and red hue) negative responses.}
\label{tab:edgeworth-user-study-survey}
\end{table}

\subsection{Discussion}

The results show a trade-off between the time taken to create correct diagrams using \Penrose and the efficiency of generating incorrect variations using \Edgeworth. Participants may spend more time on the initial correct diagram using \Penrose, but are significantly and consistently faster at making incorrect diagrams using \Edgeworth than using Google Drawings. On average, participants were 3--4$\times$ faster per incorrect diagram using \Edgeworth. In geometry, the initial correct diagram took more time with \Penrose ($390.94$s per diagram versus $228.50$s with Google Drawings), but the time per incorrect diagram was much lower with \Edgeworth ($25.73$s) compared to Google Drawings ($100.90$s).
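
The per-diagram speedups behind this range can be recomputed from the \textbf{Time/Diagram} column of \cref{tab:edgeworth-user-study-timing}. This is only a rough check on the reported averages, not an additional analysis:
\[
  \frac{51.56}{15.01} \approx 3.4\ (\mathrm{chemistry}), \qquad
  \frac{100.90}{25.73} \approx 3.9\ (\mathrm{geometry}).
\]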

The initial investment in the first \Substance program differs depending on the language complexity and layout consistency. Generally speaking, the \Penrose chemistry domain is simpler to learn than the \Penrose geometry domain. The chemistry domain has a simpler grammar consisting of atoms, bonds, and valence electrons. The layout for chemistry diagrams is also more stable and consistent. In contrast, the \Penrose geometry domain includes many predicates among points, line segments, lines, rays, angles, and so on. The geometry \Style is also less polished than that of chemistry. We observed that participants were sometimes confused by bad layouts produced by \Penrose, and doubted the correctness of their \Substance programs. The timing data in the correct segment shows the difference: on average, participants were about $1.6\times$ faster at making the correct diagram using \Penrose in chemistry ($144.19$s versus $231.81$s), but about $1.7\times$ slower in geometry ($390.94$s versus $228.50$s).


\section{Expert Walkthrough Demonstration and Feedback (\ref{rq:eco})}
\label{sec:expert-feedback}

The intended users of \Edgeworth are educators who create problems. These users have broad influence because other teachers reuse the problems they create. Therefore, we recruited educators who have created visual practice problems in multiple domains and educational settings to evaluate the ecological validity of \Edgeworth-generated problems (\ref{rq:eco}). While an expert survey may suffice for rating problem quality, we opted for walkthrough demonstrations, based on prior research on evaluation methods by \citet{ledo_evaluation_2018}, to gather additional qualitative feedback on the value of having the toolkit in their day-to-day work.

\subsection{Participants and Procedure}
\label{sec:expert-procedure}

We recruited domain expert educators of chemistry, geometry, and graph theory. Experts were invited based on their extensive teaching experience in the domain and past experience in \emph{authoring} diagrammatic content. In contrast to the criteria in the formative study (\cref{sec:edgeworth-formative}), this study selected participants based on their domain-specific expertise in authoring problems. Recruited educators came from a wide range of institutions, including Massive Open Online Courses (MOOC) platforms, liberal arts colleges, community colleges, research universities, and secondary schools. The average teaching experience among the 9 expert educators (E1--9) was 10.33 years, with a sample standard deviation of 8.62 years, highlighting a broad range of teaching experience. One of the participants is the original author of the chemistry problems reproduced in the translation problem dataset (\cref{sec:edgeworth-case-studies}).
\cref{tab:demographics} summarizes the demographic information for the 9 expert educators (E1--9) who participated in the study.

\begin{table}
    \centering
    \begin{tabular}{l|l|r|l}
        \textbf{ID} & \textbf{Occupation} & \textbf{Years of Experience} & \textbf{Domain(s)}    \\
        \hline
        E1 & MOOC Course Designer       &  7 & Chemistry           \\ 
        E2 & Liberal Arts College Professor &  4 & Chemistry, Geometry \\
        E3 & Community College Professor    & 30 & Chemistry           \\
        E4 & Liberal Arts College Professor & 11 & Graphs              \\
        E5 & Research University Professor  & 17 & Graphs              \\
        E6 & Research University Professor  &  5 & Graphs              \\
        E7 & Middle School Teacher      &  5 & Geometry            \\
        E8 & Undergraduate Teaching Assistant &  3 & Geometry, Graphs    \\
        E9 & High School Teacher        & 11 & Geometry, Graphs    \\
    \end{tabular}

    \caption{Demographics of walkthrough demonstration participants.}
    % \Description{A table showing the demographics of participants in walkthrough demonstration sessions. It has four columns labeled: ID, Occupation, Teaching Experience, and Domain(s). The entries are: E1 as a MOOC Course Designer with 7 years in Chemistry, E2 as a Professor at a Liberal Arts College with 4 years in Chemistry and Geometry, E3 as a Professor at a Community College with 30 years in Chemistry, E4 as a Professor at a Liberal Arts College with 11 years in Graphs, E5 as a Professor at a Research University with 17 years in Graphs, E6 as a Professor at a Research University with 5 years in Graphs, E7 as a Middle School Teacher with 5 years in Geometry, and E8 as an Elementary School Tutor and Undergraduate TA with 3 years in Geometry and Graphs.}
    \label{tab:demographics}
\end{table}

Each expert participated in a 60- to 90-minute session via video conferencing, which was recorded with their consent. At the start of each session, we demonstrated the workflow of \Edgeworth end-to-end, as described in \cref{sec:edgeworth-workflow}, on one problem outside of the expert's domain. For the remainder of the session, we asked the expert to assemble problems from the \Edgeworth output of two to four problem prompts randomly sampled from the translation problem dataset (\cref{sec:edgeworth-case-studies}) in their domain. For each prompt, the expert rated 10 diagram variations based on the categories described in \cref{sec:reliability-method}. In addition, we asked participants to provide more granular feedback on diagram quality. After rating the diagram variations, they were asked to pick diagrams to assemble a four-choice diagrammatic translation problem. After the problem was assembled and shown on the interface, we asked (1) if they would use the problem in their instruction and (2) how they would author the diagram using their own workflow. The full study protocols for both the chemistry and geometry groups are included in \cref{app:edgeworth-user-study-protocol}.

% \subsection{Existing Problem Authoring and Diagramming Processes}

% % thesis: diagramming is a design problem and experts 

% % use of diagrams
% Experts report a wide range of diagram use such as problem sets (E1, E3, E4, \hl{E9}), worked examples (E2--\hl{9}), tests (E1, E2), and in-class activities (E2, E7). Most experts favor multiple-choice translation problems, especially when \quotei{the class size grows} (E2). A few experts favor free-response questions for better feedback to students but noted the scalability problem with them (E2, E3, E7, \hl{E9}). For instance, E3 pointed out that they \quotei{can't monitor how 75 students are drawing a Lewis structure.} 

% % diagram is hard
% To make diagrams, experts used tools such as Microsoft Powerpoint (E1, E3), LaTeX (E4, E5, E6, E8), InkScape~\cite{bah2011inkscape} (E4), Geogebra~\cite{geogebra5} (E2, E7), \hl{and Desmos}~\cite{desmos} (E9). Similar to prior studies on diagramming tools~\cite{naturalDiagramming}, they reported barriers to using these tools that led to \quotei{painful} (E1, E2, E5, E6, E8) diagramming processes. As a result, if possible, they often fell back to hand-drawn diagrams because they \quotei{take less time} (E1, E4, E6), but E5 noted drawing skill is \quotei{one of the talents I did not have and I wish I did.} High-quality diagrams also take significant crafting to get right. For example, E4 would still \quotei{easily spend a day} on a figure because \quotei{if I start trying to make a perfect vector graphics version of it, it's inevitable. We just go down a rabbit hole of trying to make it look nicer and nicer.} 

\subsection{Ecological Validity of Generated Problems}

Overall, experts were happy with the problems they assembled with \Edgeworth-generated diagrams. Experts (E1--9) indicated that they would use all of the problems they created using \Edgeworth in their coursework. Some experts said they would use \Edgeworth-generated problems \quotei{early in the learning process} (E3) and \quotei{as a warm up exercise at the start of the next lecture} (E4). In addition, experts said these problems could be used to review previously introduced concepts. For example, E3 found the diagram variations that break the octet rule to be useful for \quotei{after you've also introduced expanded octet or non-octet-rule things.} Experts plan to use \Edgeworth-generated problems to \quotei{focus on things that students struggle with} (E3) and when introducing concepts that are \quotei{all about visualization} (E5) such as planarity of graphs. E7 asked to see all problems we gathered in the translation problem dataset (\cref{sec:edgeworth-case-studies}) and was excited to use them in their class because they were \quotei{going to be covering everything [on the list].} In addition to just asking students to select correct diagrams, E3 also pointed out that by prompting students to \quotei{tell me what is wrong rather than just which is the correct one,} the problem can be used to \quotei{dive deeper.} Similarly, E4 proposed to use \Edgeworth problems as \quotei{an interactive warm-up for reviewing the last lecture, where students vote on and explain why a diagram is correct.} E7 even plans to use \Edgeworth as \quotei{a creative instead of assessment piece} and \quotei{have the students be the teacher \dots{} playing this role more, they get better at tests, because they understand what the test makers are doing.} 

\subsection{Expert Feedback}

\subsubsection{Experts provided positive qualitative feedback on \Edgeworth}
% fast and good
Experts reacted positively to \Edgeworth. They found \Edgeworth to be a \quotei{perfect fit} (E1, E6, E8) for generating multiple-choice problems, especially \quotei{low-stake} (E2, E3, E5, E6, E8, E9) quizzes that \quotei{incentivize [students] to keep up with the class} (E8). Experts said the automatic layout of \Edgeworth \quotei{draws things really fast} (E5), \quotei{saves you the time of drawing multiple structures} (E3), and produces \quotei{beautiful} (E4, E7) diagrams. Compared with their existing tools, \Edgeworth is a \quotei{nice time-saver} (E3), and the translation problems they authored during the session would otherwise take an \quotei{enormous amount of work} (E4), \quotei{infinitely longer than this took} (E6).

Notably, experts pointed out that \Edgeworth aids creativity by promoting \quotei{recognition over recall} (E6). Specifically, \Edgeworth helps with \quotei{the thinking about how to come up with the graphs} and simplifies the diagram layout such that \quotei{you just generate some mutations that you click refresh until it looks nice} (E6). E2 liked that \quotei{it can come up with different possibilities than the ones that would be immediately apparent to me.} 

In addition, experts commented that \Edgeworth can enable them to give students more practice. For instance, E4 noted that \quotei{there's a feedback loop where \dots{} if I had a really good tool for generating nice multi-choice questions, then I could envision doing that much more frequently.}  

Importantly, in the context of students authoring problems themselves, E7 thinks that lowering the barrier to problem authoring helps students \quotei{feel they have ownership in their learning as well as sharing their ownership with other students in the class.}

\subsubsection{Experts used visual selection to express diverse standards on diagram quality}
\label{sec:edgeworth-expert-standards}

When rating diagram mutants, experts agreed with the diagram ratings of \cref{sec:reliability-eval}, but expressed unique standards for selecting answer choices (\cref{sec:expert-procedure}). Since experts had different standards, they selected different diagrams to assemble problems. This suggests \Edgeworth's use of visual selection met experts' needs.

One group of experts (E2, E4, E6, E7, E8, E9) preferred to maintain a balanced mix of answer choices, \quotei{at least one that's obviously correct, at least one that's obviously incorrect, and then \dots{} two where you have to think about that a little bit} (E6). One rationale was to \quotei{make sure [the problem] is challenging enough, but also has some things that are accessible to students that haven't completely mastered the material} (E4). Another was to teach \quotei{the process of elimination} (E7, E9). Another group of experts (E1, E3, E5) had much higher standards for including a mutant in a multiple-choice problem. For example, E1 preferred problems to contain one correct answer and multiple distractor options that are \quotei{less obvious} such that students won't \quotei{pattern match without looking at the details.}

On a problem where E2 accepted 7 out of 10 mutants as good incorrect options, E3 discarded 8 out of 10 because they \quotei{violated the octet rule in egregious or blatantly egregious way.} However, E3 said whether the octet rule can be broken depends on \quotei{where students are in the course.}

The difference in standards is highly individual. From E2's knowledge of \quotei{colleagues [who] only give difficult distractors} and \quotei{certain profs [who] are legendary for having really hard multiple choice,} they guessed that harder problems \quotei{motivate the students to try harder,} but also pointed out that it \quotei{only works for certain students in my experience.} E3 stated that their choice in diagrams \quotei{hinges upon my perception of whether students will automatically disqualify something,} which they admitted is \quotei{a certain premise or bias.} In E3's words: \quotei{Wow, it's really tough to \dots{} completely take off the instructor hat.} 

This comment reflects the concept of \textit{expert blind spot} in the learning sciences literature, where experts fail to \quotei{understand the processes of novices who are struggling to understand new ideas during their constructive learning process}~\cite{expertBlindspot}.

\subsubsection{Experts selected isomorphic diagrams to build conceptual understanding}
\label{sec:isomorphic-diagrams}

\Edgeworth sometimes produces isomorphic diagrams, \ie diagrams with identical content but different layouts. These diagrams occur when \Edgeworth's mutations have no net impact on the example diagram, \eg the mutator removes an edge from a graph and adds it back. Surprisingly, experts found value in these isomorphic diagrams. 
In their geometry course, E2 said that their textbook's diagrams \quotei{get drawn the same way over and over again. And some students get stuck into thinking that the concept is only communicated when the diagram is drawn [exactly] that way.} When assembling a problem about the \ensuremath{\mathrm{HCN}} molecule, E3 compared two isomorphic variations, and picked one over another because \quotei{it's drawn the opposite \dots{} which is interesting and I think students are going to get it wrong.}  Similarly, E1 found isomorphic diagrams to be useful for \quotei{molecules with resonance structures.} 
% However, for simpler molecules, E1 cautioned that might mislead students to think that \quotei{they are different structures when they really are the same.} 
E5 found isomorphic planar graphs to be particularly useful because students find them \quotei{painstaking to visualize when they just started.} E5 planned to use \Edgeworth to \quotei{draw a graph that doesn't look like it could be planar first, but then untangle it to show that the graph is actually planar.}

% \cref{sec:problem-generation} further discusses this possibility in the context of existing template-based problem generation tools.


\section{Limitations of the Studies}

We discuss some limitations to the studies presented in this chapter. 

% \subsection{Sample Size and Generalizability}
% The sample sizes in the user study and expert walkthrough were relatively small, with only 16 participants in the user study and 9 expert educators in the walkthrough sessions. This limited number of participants may not fully represent the broader population of educators or the variety of instructional contexts in which \Edgeworth might be applied. Consequently, the generalizability of the findings to other domains, educational settings, or different types of diagrammatic problems is limited.

\subsection{Ecological Validity}

The studies primarily focused on specific instructional domains, including chemistry, geometry, and graph theory. While these domains were chosen from the translation problem dataset, the studies are limited by the scope of the dataset itself. The performance and effectiveness of \Edgeworth might vary when applied to other fields that require different types of visual problem representations, which were not explored in this research.

Although the expert walkthroughs provided valuable feedback on the ecological validity of \Edgeworth-generated problems, the artificial nature of the study environment might not fully capture the complexities and constraints of real-world educational settings. The experts' feedback was based on hypothetical scenarios and short-term interactions with the tool, which may not fully reflect the challenges and demands of long-term usage in a classroom or curriculum development context.

\subsection{Tool}
\label{sec:edgeworth-limitations}

Participants in the user study were provided with brief tutorials on using \Edgeworth and Google Drawings, which may not have been sufficient for them to become fully proficient with these tools. The learning curve associated with \Penrose and \Edgeworth, especially their unique approach to generating diagrammatic problems, might have influenced the results, particularly in the efficiency evaluation. Participants with more extensive experience or training in using either tool might exhibit different levels of efficiency and satisfaction than those observed in the study.

% \subsection{Focus on Multiple-Choice Problems}
% The studies were designed around the generation and evaluation of multiple-choice problems, which, while useful in many educational contexts, represent only a subset of the types of problems educators might want to create. Other problem types such as open-ended questions or interactive exercises, were not explored.

% \subsection{Technical Constraints and Evolution}

Both \Penrose and \Edgeworth are still under development, and the studies were conducted on particular versions of them. As they evolve, with potential updates and improvements, the findings presented here might become outdated. Future research would need to re-evaluate both tools' performance and usability in light of new changes. \cref{sec:penrose-limitations} and \cref{sec:limitations} discuss the specifics of \Penrose's and \Edgeworth's system limitations.

\subsection{Authoring Speed vs. Problem Quality}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{assets/edgeworth-eval/p4-drawings.pdf}
    \includegraphics[width=\linewidth]{assets/edgeworth-eval/p4-edgeworth.png}
    \caption{Screenshots of Google Drawings \figloc{(top)} and \Edgeworth selections \figloc{(bottom)} of diagrams by P4 of the user study (\cref{sec:edgeworth-user-study}). They are instances of ``shortcuts'' participants took when using both tools, avoiding large layout edits \figloc{(top)} in Google Drawings and selecting counterexamples seemingly at random in \Edgeworth \figloc{(bottom)}. }
    \label{fig:edgeworth-user-study-shortcuts}
\end{figure}


Participants in the user study described in \cref{sec:edgeworth-user-study} were not screened for prior experience or expertise in diagrammatic problem authoring. In fact, a majority of them were undergraduate students. Therefore, their judgment of what makes a good problem (or lack thereof) might influence the task performance data reported in \cref{sec:edgeworth-user-study}. Theoretically, to get through the tasks, participants could pick incorrect diagrams at random using \Edgeworth, or make minimal edits to the correct diagram to get alternatives using Google Drawings. To address this limitation of the user study, we conducted expert demonstration walkthroughs (\cref{sec:expert-feedback}) to gain a deeper understanding of the ecological validity of \Edgeworth-generated problems.

During the user study, we did in fact observe participants taking shortcuts in both conditions. Here we show some examples of these shortcuts and contrast them with the experts' opinions. For instance, \cref{fig:edgeworth-user-study-shortcuts} (top) shows P4's pattern of making small edits to the correct diagram to quickly produce incorrect diagrams. These edits avoid moving many diagram components around while maintaining a good layout. The similar layouts of atoms and bonds contrast with experts' feedback on isomorphic diagrams (\cref{sec:isomorphic-diagrams}, \eg \quotei{some students get stuck into thinking that the concept is only communicated when the diagram is drawn [exactly] that way.} by E2). As another example for \Edgeworth, \cref{fig:edgeworth-user-study-shortcuts} (bottom) shows two diagrams selected by a participant as suitable incorrect answers for chemistry prompt 1 (\textit{Which of the following diagrams shows the correct Lewis structure for \ensuremath{\mathrm{CH_2O}}?}) in \cref{fig:edgeworth-user-study-tasks}. Per their feedback in \cref{sec:edgeworth-expert-standards}, E3 would not accept ``Mutated Diagram \#6'' in the figure as a good incorrect option, because it \quotei{violated the octet rule in egregious or blatantly egregious way.}\footnote{The carbon (C) atom in the diagram has 6 valence electrons, 2 double bonds, and one single bond, which is well beyond the expected 8 electrons and bonds combined per the octet rule.} Notably, as we discussed in \cref{sec:edgeworth-expert-standards}, experts did not agree on a single standard of high problem quality among themselves. 

Overall, while participants took shortcuts in both user study conditions, we do not know how much impact their quality standards have on the task performance. Importantly, there is not a single standard for ``good'' judgment of problem quality, as the expert demonstration walkthroughs showed a diversity of standards among experienced educators. Therefore, future research is needed to tease out (1) what is an acceptable quality standard for diagrammatic translation problems and (2) the effect of problem quality on authoring speed of diagrammatic problems. 

\section{Limitations of the \Edgeworth System}
\label{sec:limitations}

In this section, we further discuss some limitations of the \Edgeworth system in general.

\subsection{Numerical and textual variations}

\Edgeworth cannot produce numerical and textual variations like traditional problem generators~\cite{aleven_cognitive_2006, ASSISTment} do. It is, however, possible to build this functionality on top of \Edgeworth to produce further problem variations. 

\subsection{Usability of UI components}

The design presented in \cref{sec:edgeworth-system-design} focuses on generating diagram variations and selecting diagram mutants to create problems. We use the \Substance language and \Penrose's textual interface without modification. Any limitations of \Substance and its UI are inherited by \Edgeworth. We use standard Material UI elements\footnote{\url{https://mui.com/}} to allow users to configure \Edgeworth (\eg a standard text box for changing diagram variations in \cref{fig:edgeworth-interface}\uilabel{c}). While these components might be usable as-is, they are not designed explicitly for the problem authoring workflow. 

\subsection{New domains of instruction}
\label{sec:extension}

As shown in \cref{sec:edgeworth-mutation}, the design of the \Edgeworth mutator is domain-agnostic, as the mutation operators do not require any domain-specific knowledge to produce mutants. However, improving \Style requires domain expertise, and future \Edgeworth authors may not have the technical background or the time to invest in a new \Style program. This might prohibit them from using \Edgeworth if the domain is not well supported by the \Penrose ecosystem. That said, as noted in \cref{chp:penrose}, the effort to build \Penrose stylesheets for new domains is only necessary once per domain and not once per diagram or problem.

\subsection{Mismatches with the author's intents} 
\label{sec:edgeworth-limit-intent}

\Edgeworth provides a mixed-initiative~\cite{allen1999mixedinitiative} workflow: authors focus on specifying the content and the general direction of variations through the example scenario, while \Edgeworth fully automates the details of variation generation and layout. The evaluation studies presented in \cref{chp:edgeworth-eval} showed that this workflow improves authoring speed and can already produce diagrams that are useful to educators. In this section, we focus on the current state of \Edgeworth's outputs and propose future work for improving problem quality.

As discussed in \cref{sec:expert-feedback}, experts used terms like \quotei{obviously incorrect} (E6) and \quotei{less obvious} (E1) to characterize the quality of problem options in a multiple-choice translation problem. Based on their feedback, we divide these options into four categories: given a set of mathematical statements describing logical entities and their relationships, a diagram can be associated with them in one of the following ways:

\vspace{0.5em}
\begin{figure}[h]
\begin{minipage}[b]{0.48\linewidth}
$\bullet$ \textbf{Example}: the diagram represents the math statements, \ie all the statements hold true in the diagram. 
    \vspace{3pt}
    
$\bullet$ \textbf{Counterexample}: the diagram clearly violates the math statements, \ie one or more statements are false in the diagram.
    \vspace{3pt}
    
$\bullet$ \textbf{Positive edge case}: the diagram is an example of the math statements, but contains extraneous entities and/or more specialized relationships. 
    \vspace{3pt}
    
$\bullet$ \textbf{Negative edge case}: the diagram is a counterexample, but only requires a few changes to become an example.
\end{minipage}
\hfill
\begin{minipage}[b]{0.45\linewidth}
    \centering
    \includegraphics[width=\textwidth]{assets/appendix/definitions-examples.pdf}
\end{minipage}
\end{figure}
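
\noindent One way to make these categories more precise is sketched below. The notation is introduced here only for illustration and is an assumption on our part: $S$ denotes the set of math statements, $D \models s$ means that statement $s$ holds in diagram $D$, and $d(D)$ is a hypothetical count of the fewest changes needed to turn $D$ into an example. Under these assumptions, the first two categories read
\[
  \mbox{Example:}\quad \forall s \in S,\; D \models s
  \qquad\qquad
  \mbox{Counterexample:}\quad \exists s \in S,\; D \not\models s .
\]
A positive edge case is then an example in which some additional entity or statement $s' \notin S$ also holds, and a negative edge case is a counterexample with $0 < d(D) \le k$ for some small threshold $k$, which we leave unspecified.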

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{assets/appendix/edgeworth-bad-output.pdf}
    \caption{A screenshot of the \Edgeworth interface, after generating examples for a translation problem focusing on improper subsets. The first pool of mutants isn't suitable for this problem.}
    \label{fig:edgeworth-bad-output}
\end{figure}

Using \Edgeworth, the author creates an example scenario and \Edgeworth's mutator generates a set of diagrams. When these diagrams don't satisfy the needs of the author (\eg missing counterexamples that are important for an educational goal), the author can only generate more variations and hope to get better ones. For example, suppose an author would like to create problems that test students' knowledge of improper subsets, especially the fact that if $A \subseteq B$, $A = B$ is allowed. Using \Edgeworth, the author first creates a \Substance program and clicks ``Generate Diagrams.'' 

\noindent\hspace*{\fill}
\begin{minipage}[c]{0.23\columnwidth}
\begin{mdframed}[style=SUBCode]
\begin{lstlisting}[language=Sub-SET,escapechar=@,numbers=none]
Set A, B, C
IsSubset(B, A)
IsSubset(C, A)
\end{lstlisting}
\end{mdframed}
\end{minipage}
\hspace*{\fill}

Ideally, \Edgeworth should generate a set of examples of the subset relations that include the edge cases of $A = B$, $A = C$, or $B = C$, and counterexamples of $B \not\subseteq A$ or $C \not\subseteq A$. However, those particular mutated programs are extremely unlikely to be generated by \Edgeworth. The default \Edgeworth output for this scenario is shown in \cref{fig:edgeworth-bad-output}. There are useful counterexamples, but none of the diagrams include edge cases such as:

\noindent\hspace*{\fill}
\begin{minipage}[c]{0.23\columnwidth}
\begin{mdframed}[style=SUBCode]
\begin{lstlisting}[language=Sub-SET,escapechar=@,numbers=none]
Set A, B, C
IsSubset(B, A)
IsSubset(C, A)
Equal(B, C)
\end{lstlisting}
\end{mdframed}
\end{minipage}
\hspace*{\fill}

\noindent In our experience, it is not uncommon for \Edgeworth to miss important edge cases. In addition, the author cannot express their intent easily with the current version of \Edgeworth and may have trouble finding good mutants if they intend to create specific types of examples. We discuss the possibility of augmenting \Edgeworth with domain-specific knowledge, and allowing the user to express their pedagogical intents, in \cref{sec:knowledge}.
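
As a sketch of the kind of program such an author might have in mind, the improper-subset edge case $A = B$ could be written by adding one statement to the example scenario, mirroring the $B = C$ program above. This assumes that \texttt{Equal} accepts any pair of sets; it is an illustration only, not an output of \Edgeworth.

\noindent\hspace*{\fill}
\begin{minipage}[c]{0.3\columnwidth}
\begin{mdframed}[style=SUBCode]
\begin{lstlisting}[language=Sub-SET,escapechar=@,numbers=none]
-- assumed: A = B
Set A, B, C
IsSubset(B, A)
IsSubset(C, A)
Equal(A, B)
\end{lstlisting}
\end{mdframed}
\end{minipage}
\hspace*{\fill}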

\section{Summary}

This chapter presents an evaluation of \Edgeworth through three studies focusing on its reliability, efficiency, and ecological validity. The research questions addressed are whether \Edgeworth can reliably generate translation problems with minimal variations (\ref{rq:mut}), if it enhances authoring efficiency compared to conventional tools (\ref{rq:eff}), and whether educators find the generated problems useful in real-world contexts (\ref{rq:eco}).

The first study (\cref{sec:reliability-eval}) assessed the reliability of \Edgeworth by analyzing 310 diagram variations across 31 problems. The results indicated that \Edgeworth successfully generated valid multiple-choice problems for most prompts within 10 diagram variations, demonstrating its reliability in producing consistent and usable outputs (\ref{rq:mut}).

The second study (\cref{sec:edgeworth-user-study}) compared the efficiency of authoring translation problems using \Edgeworth versus Google Drawings. The findings revealed that, once they had made a correct diagram, participants were about 3 times faster at making diagrammatic options using \Edgeworth than using Google Drawings (\ref{rq:eff}).

The final study (\cref{sec:expert-feedback}) involved expert educators who provided feedback on the ecological validity of \Edgeworth-generated problems. The educators found the problems to be pedagogically useful and expressed interest in using \Edgeworth in their instructional practices. Their feedback also emphasized \Edgeworth's potential to save time and enhance creativity in problem design (\ref{rq:eco}).

Overall, these studies demonstrate that \Edgeworth is a reliable and efficient tool for authoring educational problems, with strong support from educators for its application in diverse instructional contexts.