Publications
2026
- [CHI’26] MetaMate: Understanding How Educational Researchers Experience AI-Assisted Data Extraction for Systematic Reviews. Xue Wang* and Gaoxiang Luo*. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Barcelona, Spain, 2026.
Systematic reviews are essential for evidence synthesis in education, yet data extraction remains a bottleneck: labor-intensive and error-prone. Large language models offer automation potential, but questions remain about AI performance compared to human coders and how researchers experience these tools in practice. We present MetaMate, an open-access web-based tool for automated data extraction in educational systematic reviews. Our mixed-methods evaluation combines a quantitative validation study benchmarking MetaMate against trained human coders across 32 studies and 20 data elements with a qualitative user study involving six educational researchers using think-aloud protocols. MetaMate achieves precision (81-96%), recall (90-100%), and F1 scores (88-96%) comparable to or exceeding human coders, with strengths in mathematical reasoning and semantic comprehension. Qualitative findings reveal insights about trust calibration, verification behaviors, usability challenges, and human-AI collaboration. We contribute empirical evidence on LLM extraction capabilities and design implications for AI-assisted research tools balancing automation with human oversight. MetaMate is available at https://metamate.online.
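The precision, recall, and F1 figures above come from comparing per-element extractions against an expert gold standard. A minimal sketch of how such scores can be computed for a single data element, using hypothetical extracted values rather than the paper's data:

```python
# Minimal sketch: scoring one extracted data element against a gold standard.
# The element and the value sets below are hypothetical, not taken from the paper.
def precision_recall_f1(extracted: set, gold: set) -> tuple:
    tp = len(extracted & gold)                      # values present in both sets
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: grade levels extracted from one study versus the gold standard
ai_values = {"grade 1", "grade 2", "grade 3"}
gold_values = {"grade 1", "grade 2"}
print(precision_recall_f1(ai_values, gold_values))  # approximately (0.667, 1.0, 0.8)
```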
- [Nature] A benchmark of expert-level academic questions to assess AI capabilities. Center for AI Safety, Scale AI, and HLE Contributors Consortium. Nature, Jan 2026.
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
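The abstract reports low accuracy and calibration for frontier models. One standard way to quantify calibration is an expected calibration error over binned confidences; a minimal sketch under that assumption, with illustrative values rather than HLE results:

```python
import numpy as np

# Minimal sketch of expected calibration error (ECE) over binned confidences.
# The confidence/correctness values are illustrative, not HLE results.
def expected_calibration_error(confidence, correct, n_bins=10):
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap    # weight each bin by its share of answers
    return ece

# Four hypothetical answers: stated confidence vs. whether the answer was correct
print(expected_calibration_error([0.95, 0.80, 0.90, 0.60], [1, 0, 1, 0]))
```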
@article{Phan2026, author = {{Center for AI Safety} and {Scale AI} and {HLE Contributors Consortium}}, title = {A benchmark of expert-level academic questions to assess AI capabilities}, journal = {Nature}, year = {2026}, month = jan, day = {01}, volume = {649}, number = {8099}, pages = {1139--1146}, issn = {1476-4687}, doi = {10.1038/s41586-025-09962-4}, url = {https://doi.org/10.1038/s41586-025-09962-4}, }
- [Report] Improving Evidence Synthesis with Artificial Intelligence. Amir Mehr, Joshua Howard, Cyrus Nouroozi, Behrad Khorramnazari, and 12 more authors. Jan 2026.
Scientific knowledge is represented by approximately 3.3 million new journal articles each year and is expanding at an unprecedented pace, increasing in total size by 59% between 2012 and 2022. Systematic reviews and meta-analyses provide a structured means of evidence synthesis, but they are slow and labor-intensive, often requiring more than a year to complete. This bottleneck constrains scientific progress and is especially consequential in contexts such as public health crises (e.g., the COVID-19 pandemic), where timely evidence is essential for guiding policy and practice. Here we show that artificial intelligence methods can substantially improve both the efficiency and accuracy of systematic reviews. Using diverse datasets and examining over 30,000 data points, our AI-assisted approach matched or exceeded human performance while greatly reducing the risk of overlooking relevant evidence. In multiple tests of screening performance, the AI achieved 97.2% sensitivity and 96.84% specificity. With respect to extraction, the AI obtained 96.96% extraction accuracy, outperforming human efforts, and completed tasks up to 99% faster. These results demonstrate that AI augmentation can enable more timely and comprehensive evidence synthesis, facilitate living systematic reviews, and better support researchers, policymakers, and practitioners in responding to fast-moving scientific developments. Integrating AI into evidence synthesis represents a decisive advance in the accumulation of scientific knowledge.
@misc{mehrimproving, title = {Improving Evidence Synthesis with Artificial Intelligence}, author = {Mehr, Amir and Howard, Joshua and Nouroozi, Cyrus and Khorramnazari, Behrad and Banks, George Christopher and Tay, Louis and Cuijpers, Pim and Miguel, Clara and Harrer, Mathias and Meyer, John P. and Stanley, David and Wang, Xue and Luo, Gaoxiang and Huo, Bright and Liu, Jason and Rousseau, Denise}, doi = {10.31222/osf.io/6d2wm_v1}, publisher = {MetaArXiv}, year = {2026}, month = jan, }
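The screening results above are reported as sensitivity and specificity over include/exclude decisions. A minimal sketch of how those two rates are computed from paired AI and reference screening labels (the labels below are hypothetical, not data from the report):

```python
# Minimal sketch: sensitivity and specificity for study-screening decisions.
# The include/exclude labels below are hypothetical, not data from the report.
def sensitivity_specificity(ai_labels, reference_labels):
    pairs = list(zip(ai_labels, reference_labels))
    tp = sum(a and r for a, r in pairs)              # correctly included
    tn = sum((not a) and (not r) for a, r in pairs)  # correctly excluded
    fn = sum((not a) and r for a, r in pairs)        # relevant studies missed
    fp = sum(a and (not r) for a, r in pairs)        # irrelevant studies included
    return tp / (tp + fn), tn / (tn + fp)

ai_screen = [True, True, False, False, True, False]
reference = [True, True, False, False, False, False]
print(sensitivity_specificity(ai_screen, reference))  # (1.0, 0.75)
```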
2025
- [Journal of School Choice] Self-Regulated Learning and Self-Efficacy Among Homeschooled Students: A Systematic Review. Xue Wang, Genevieve Smith, and Angela Watson. Journal of School Choice, Oct 2025.
Self-regulated learning and self-efficacy are fundamental competencies for academic success, yet their development in homeschooling contexts remains understudied despite growing enrollment in home education. This systematic review synthesized research examining these critical learning competencies among homeschooled students. We identified 23 studies (6,312 participants) meeting inclusion criteria. Results revealed that homeschooled students generally reported moderate to high levels of self-regulated learning and self-efficacy. Comparative studies suggested possible advantages for homeschooled students in autonomy and academic self-efficacy, with broadly similar outcomes in other domains. However, the non-causal nature of the studies precludes conclusions on whether and how homeschooling influences these outcomes.
@article{Wang23102025, author = {Wang, Xue and Smith, Genevieve and Watson, Angela}, title = {Self-Regulated Learning and Self-Efficacy Among Homeschooled Students: A Systematic Review}, journal = {Journal of School Choice}, volume = {0}, number = {0}, pages = {1--26}, month = oct, year = {2025}, publisher = {Routledge}, doi = {10.1080/15582159.2025.2575559}, url = {https://doi.org/10.1080/15582159.2025.2575559}, eprint = {https://doi.org/10.1080/15582159.2025.2575559}, }
- [Large-scale Assess Educ] Positive teacher feedback and adolescents’ reading self-efficacy: a quasi-experimental analysis using PISA 2018. Xue Wang, Qiyang Zhang, and Marcia H. Davis. Large-scale Assessments in Education, May 2025.
Positive teacher feedback plays a crucial role in student learning. While previous studies on positive teacher feedback have primarily focused on its relationship with student achievement, much less attention has been given to how such feedback is associated with student self-efficacy, particularly reading self-efficacy, a key predictor of both reading and overall academic success, among adolescents. The existing research that has explored this relationship has largely relied on small-sample, correlational studies, which often failed to account for a broad range of potential confounding variables. This study addresses these gaps by applying propensity score matching and weighting to a nationally representative sample of 4,838 U.S. adolescents from the Programme for International Student Assessment (PISA) 2018 data, controlling for an extensive set of covariates. The rationale for using propensity score matching and weighting was to reduce selection bias and better estimate the relationship between positive teacher feedback and student reading self-efficacy. Results show that positive teacher feedback is significantly associated with higher reading self-efficacy among adolescents, with stronger associations observed among students who receive it less frequently. This means increasing positive feedback could potentially benefit adolescents’ reading self-efficacy overall, but may be particularly beneficial for certain groups of students. This study highlights the potential value of teacher training programs focused on positive feedback to support students’ reading self-efficacy.
@article{10.1186/s40536-025-00253-y, title = {Positive teacher feedback and adolescents’ reading self-efficacy: a quasi-experimental analysis using {PISA} 2018}, volume = {13}, issn = {2196-0739}, shorttitle = {Positive teacher feedback and adolescents’ reading self-efficacy}, url = {https://doi.org/10.1186/s40536-025-00253-y}, doi = {10.1186/s40536-025-00253-y}, number = {1}, urldate = {2025-06-02}, journal = {Large-scale Assessments in Education}, author = {Wang, Xue and Zhang, Qiyang and Davis, Marcia H.}, month = may, year = {2025}, keywords = {Adolescent, Programme for international student assessment (PISA), Propensity-score matching, Reading self-efficacy, Teacher feedback}, pages = {17}, }
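As a rough illustration of the propensity-score weighting strategy described in the abstract above, the sketch below estimates a weighted treatment-control difference on simulated data. The variable names (ses, prior_reading, feedback, self_efficacy) and the data-generating model are placeholders, not the PISA 2018 measures or the paper's analysis:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Minimal sketch of inverse-probability (propensity-score) weighting for a binary
# treatment. Variable names and simulated data are illustrative, not PISA 2018.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "ses": rng.normal(size=n),              # covariate: socioeconomic status
    "prior_reading": rng.normal(size=n),    # covariate: prior reading score
})
p_treat = 1 / (1 + np.exp(-(0.5 * df["ses"] + 0.3 * df["prior_reading"])))
df["feedback"] = rng.binomial(1, p_treat)   # "treatment": frequent positive feedback
df["self_efficacy"] = 0.4 * df["feedback"] + 0.6 * df["ses"] + rng.normal(size=n)

# 1) Estimate propensity scores from the observed covariates
covs = df[["ses", "prior_reading"]]
ps = LogisticRegression().fit(covs, df["feedback"]).predict_proba(covs)[:, 1]

# 2) Inverse-probability weights balance covariates across the two groups
w = np.where(df["feedback"] == 1, 1 / ps, 1 / (1 - ps))

# 3) Weighted difference in means estimates the treatment association
treated = df["feedback"] == 1
effect = (np.average(df.loc[treated, "self_efficacy"], weights=w[treated])
          - np.average(df.loc[~treated, "self_efficacy"], weights=w[~treated]))
print(round(effect, 2))  # should land near the simulated effect of 0.4
```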
2024
- [IJER] The effects of language learning strategy instruction on college students’ English achievement and learner autonomy in mainland China: A meta-analysis. Xue Wang, Qiyang Zhang, Huanchun Chen, Amanda J. Neitzel, and Marcia H. Davis. International Journal of Educational Research, Aug 2024.
This meta-analysis synthesized recent research on promoting learner autonomy through strategy instruction in tertiary English classrooms in China for two main purposes: (a) to estimate the effects of strategy instruction on English achievement and outcomes related to learner autonomy; and (b) to examine the moderating effects of a set of intervention and contextual characteristics. A total of 49 studies contributed to 111 effect sizes for this meta-analysis. Using a random-effects model, this study finds that the overall effect size of strategy instruction was 0.92 (p < 0.001). The type of outcome measure significantly moderated the impact of strategy instruction. Interventions that assessed multiple aspects of learner autonomy (cognitive, metacognitive, motivational, and social) had a significantly higher mean effect size than those assessing a single aspect. The results suggest strategy instruction is a viable instructional tool for promoting learner autonomy among college students in English-as-a-foreign-language classrooms. The results emphasize that fully understanding the impact of a strategy instruction intervention on learner autonomy requires an examination of its various dimensions.
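The pooled effect of 0.92 comes from a random-effects model. A minimal sketch of a DerSimonian-Laird random-effects pool, using made-up effect sizes and variances rather than the 111 effect sizes analyzed in the paper:

```python
import numpy as np

# Minimal sketch of a DerSimonian-Laird random-effects pooled effect size.
# The effect sizes and variances below are made up, not the paper's data.
def random_effects_pool(effects, variances):
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w = 1 / variances                                  # fixed-effect weights
    q = np.sum(w * (effects - np.average(effects, weights=w)) ** 2)
    c = w.sum() - (w ** 2).sum() / w.sum()
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)      # between-study variance
    w_star = 1 / (variances + tau2)                    # random-effects weights
    pooled = np.average(effects, weights=w_star)
    se = np.sqrt(1 / w_star.sum())
    return pooled, se, tau2

print(random_effects_pool([0.8, 1.1, 0.6, 1.0], [0.04, 0.05, 0.03, 0.06]))
```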
@article{WANG2024102442, title = {The effects of language learning strategy instruction on college students' English achievement and learner autonomy in mainland China: A meta-analysis}, journal = {International Journal of Educational Research}, volume = {127}, pages = {102442}, year = {2024}, month = aug, issn = {0883-0355}, doi = {10.1016/j.ijer.2024.102442}, url = {https://www.sciencedirect.com/science/article/pii/S0883035524001289}, author = {Wang, Xue and Zhang, Qiyang and Chen, Huanchun and Neitzel, Amanda J. and Davis, Marcia H.}, keywords = {Learner autonomy, Strategy instruction, English as a foreign language education, Meta-analysis, Mainland China}, }
- [JESPAR] Lightning Squad: Assessing the Dosage Effect of Computer-Assisted Tutoring with Cooperative Learning for Struggling Readers. Xue Wang, Amanda Neitzel, and Nancy Madden. Journal of Education for Students Placed at Risk (JESPAR), Aug 2024.
This quantitative study examines the dosage effect of a computer-assisted tutoring approach, Tutoring with the Lightning Squad, on the reading achievement of struggling students. Using a multi-site, cluster-randomized controlled trial design, 188 students in Grades 1-3 from six schools in Minnesota and Virginia were randomly assigned to receive either Lightning Squad tutoring or regular reading instruction between November 2016 and January 2017. Implementation results revealed that many students in the treatment group did not receive the full intended dosage of treatment. Dosage analyses indicated that high attendance in the intervention had a significant positive effect on reading outcomes, including Passage Comprehension and Word Attack. The findings emphasized the critical role of implementation fidelity in program effectiveness and suggested that a higher dosage of the Lightning Squad intervention may lead to better outcomes for struggling readers.
@article{doi:10.1080/10824669.2024.2388242, author = {Wang, Xue and Neitzel, Amanda and Madden, Nancy}, title = {Lightning Squad: Assessing the Dosage Effect of Computer-Assisted Tutoring with Cooperative Learning for Struggling Readers}, journal = {Journal of Education for Students Placed at Risk (JESPAR)}, volume = {0}, number = {0}, pages = {1--19}, year = {2024}, month = aug, publisher = {Routledge}, doi = {10.1080/10824669.2024.2388242}, url = {https://doi.org/10.1080/10824669.2024.2388242}, eprint = {https://doi.org/10.1080/10824669.2024.2388242}, }
- [SREE’24, AREA’25, SREE’25] MetaMate: Large Language Model to the Rescue of Automated Data Extraction for Educational Systematic Reviews and Meta-analyses. Xue Wang and Gaoxiang Luo. May 2024.
Systematic reviews and meta-analyses are crucial for synthesizing evidence but are time-consuming and labor-intensive, especially during data extraction. To address this challenge, we developed MetaMate, an open-access web-based tool leveraging large language models (LLMs) for automated data extraction in educational systematic reviews and meta-analyses. MetaMate utilizes a hierarchical schema and divide-and-conquer approach in its extraction chain, and a from-global-to-local lens and example retriever in its verification chain. We evaluated MetaMate’s performance on 32 empirical studies, extracting 20 data elements related to participants and interventions. MetaMate achieved high precision, recall, and F1 scores, with performance comparable to human coders when benchmarked against an expert-defined gold standard. Notably, MetaMate demonstrated advanced mathematical reasoning and semantic comprehension, surpassing keyword-based approaches and avoiding common human errors. As the first LLM-powered data extraction tool designed specifically for educational research, MetaMate has the potential to significantly streamline the systematic review process and reduce time and effort for researchers. MetaMate is available at https://metamate.online.
@misc{wang_luo_2024, title = {MetaMate: Large Language Model to the Rescue of Automated Data Extraction for Educational Systematic Reviews and Meta-analyses}, url = {https://osf.io/preprints/edarxiv/wn3cd}, doi = {10.35542/osf.io/wn3cd}, publisher = {EdArXiv}, author = {Wang, Xue and Luo, Gaoxiang}, year = {2024}, month = may, }
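As a rough sketch of the divide-and-conquer extraction idea described in the abstract, the snippet below splits a hierarchical schema into per-section LLM calls and merges the results. The schema, prompt wording, and call_llm stub are hypothetical illustrations, not MetaMate's actual implementation:

```python
import json

# Rough sketch of a divide-and-conquer extraction pass over a hierarchical schema.
# The schema, prompts, and call_llm stub are hypothetical, not MetaMate's code.
SCHEMA = {
    "participants": ["sample_size", "grade_levels", "country"],
    "intervention": ["name", "duration_weeks", "delivery_mode"],
}

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call; replace with a real model client."""
    return json.dumps({})  # a real call would return the requested fields as JSON

def extract(study_text: str) -> dict:
    results = {}
    for section, fields in SCHEMA.items():           # divide: one call per branch
        prompt = (
            f"From the study text below, extract {', '.join(fields)} for "
            f"'{section}' as a JSON object. Use null for unreported values.\n\n"
            f"{study_text}"
        )
        results[section] = json.loads(call_llm(prompt))  # conquer: merge branches
    return results

print(extract("Participants were 120 third-grade students at two schools ..."))
```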
2022
- [RELC] Segmental versus Suprasegmental: Which One is More Important to Teach? Xue Wang. RELC Journal, May 2022.
This article considers the continuing debate in pronunciation instruction (PI) about whether segmental or suprasegmental features are more important in teaching English to speakers of other languages. While evidence has accumulated on both sides of the debate, the emergence of the notion of English as a Lingua Franca (ELF) further complicates the issue. This article provides a review of current research supporting the different views in the segmental/suprasegmental debate. The review highlights research evidence that examines either the impact of segmental and suprasegmental features on intelligibility or the effectiveness of teaching these features to improve intelligibility. A review of this line of research underlines the context-specific nature of the debate and a third view that blurs the boundary between segmentals and suprasegmentals.
@article{doi:10.1177/0033688220925926, author = {Wang, Xue}, title = {Segmental versus Suprasegmental: Which One is More Important to Teach?}, journal = {RELC Journal}, volume = {53}, number = {1}, pages = {194-202}, year = {2022}, doi = {10.1177/0033688220925926}, url = {https://doi.org/10.1177/0033688220925926}, eprint = {https://doi.org/10.1177/0033688220925926}, }
2019
- [RELC] Exploring EFL Learners’ Accent Preferences for Effective ELF Communication. Fan-Wei Kung and Xue Wang. RELC Journal, May 2019.
This study explores Chinese English as a foreign language (EFL) learners’ attitudes towards accent for effective English as a lingua franca (ELF) communication. Notwithstanding the research conducted on EFL learners’ perceptions of different variations of the English language for their language learning, few attempts have been made to investigate their perspectives in detail within the context of China. This inquiry thus intends to bridge this gap by exploring EFL learners’ accent preferences for ELF communication. Data were collected qualitatively from 34 students at an international university in China to examine their experiences of EFL learning and ELF communication. Data were classified and categorized based on learners’ accent preferences and then coded for analysis from their learning discourses, cultural media, material conditions and social agents. The results point to various sociocultural and sociohistorical variables that have reified their language choices and ideology to further underpin their native speaker (NS) and non-native speaker (NNS) dichotomy.
@article{doi:10.1177/0033688218765306, author = {Kung, Fan-Wei and Wang, Xue}, title = {Exploring EFL Learners' Accent Preferences for Effective ELF Communication}, journal = {RELC Journal}, volume = {50}, number = {3}, pages = {394-407}, year = {2019}, doi = {10.1177/0033688218765306}, url = {https://doi.org/10.1177/0033688218765306}, eprint = {https://doi.org/10.1177/0033688218765306}, }