We introduce LearnWeak, an automated training framework for domain specialization of small computer-use agents (CUAs).
On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains.
Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open CUAs are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small CUAs that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student-awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small CUAs in diverse domains.
LearnWeak decomposes domain specialization into two coupled stages: LearnWeak-Gen, an annotation-free data generation loop driven by student failures, and LearnWeak-DPO, an error-aware preference optimization that adaptively targets each failure type.
1. A small set of seed tasks per domain is initialized with no trajectory annotation required.
2. Teacher and student are run on the same tasks. Cases where the teacher succeeds but the student fails are collected, and a verifier summarizes recurring failure modes into a weakness report.
3. New tasks are synthesized via two strategies: weakness-focused (conditioned on the weakness report) and exploration-focused (conditioned on screenshots), balancing targeted repair with domain coverage.
4. Steps 2 and 3 repeat for multiple iterations, progressively focusing the task distribution on unresolved student weaknesses.
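The generation loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Task` alias and every callable passed in (`teacher_passes`, `student_passes`, `summarize`, `synth_from_weakness`, `synth_from_exploration`) are hypothetical stand-ins for the agents, the verifier, and the two task synthesizers.

```python
from typing import Callable, List, Set

Task = str  # assumption: a task is represented by its instruction text


def learnweak_gen(
    seed_tasks: List[Task],
    teacher_passes: Callable[[Task], bool],
    student_passes: Callable[[Task], bool],
    summarize: Callable[[List[Task]], str],
    synth_from_weakness: Callable[[str], List[Task]],
    synth_from_exploration: Callable[[], List[Task]],
    num_iters: int = 3,
) -> List[Task]:
    """Collect tasks that the teacher solves and the student fails."""
    pool: List[Task] = list(seed_tasks)  # step 1: seed tasks, no annotation
    seen: Set[Task] = set(pool)
    failures: List[Task] = []

    for it in range(num_iters):  # step 4: repeat steps 2 and 3
        # Step 2: keep only teacher-pass / student-fail cases.
        failures.extend(
            t for t in pool if teacher_passes(t) and not student_passes(t)
        )
        if it == num_iters - 1:
            break  # nothing left to evaluate, so skip the final synthesis

        # Step 2 (cont.): summarize recurring failures into a weakness report.
        report = summarize(failures)

        # Step 3: mix weakness-focused and exploration-focused tasks.
        candidates = synth_from_weakness(report) + synth_from_exploration()
        pool = [t for t in dict.fromkeys(candidates) if t not in seen]
        seen.update(pool)

    return failures
```

Deduplicating against `seen` is a design choice of this sketch; it keeps later iterations from re-evaluating tasks the student has already been tested on.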
The teacher trajectory is replayed step by step. At each step, the student is queried under the teacher's context, and steps with differing tool executions form preference pairs.
Each pair is labeled as a planning error (different action type) or an execution error (same action type, different parameters).
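A small sketch of the pairing and labeling rule may help. It assumes the student's per-step outputs have already been collected under the teacher's context; the `ToolCall` type, `make_call`, `label_error`, and `build_pairs` are illustrative names, and exact parameter equality is a simplifying assumption (real coordinates would likely need a tolerance).

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ToolCall:
    action_type: str                      # e.g. "click", "type", "hotkey"
    params: Tuple[Tuple[str, Any], ...]   # sorted (name, value) pairs


def make_call(action_type: str, **params: Any) -> ToolCall:
    """Build a ToolCall with parameters in a canonical order."""
    return ToolCall(action_type, tuple(sorted(params.items())))


def label_error(teacher: ToolCall, student: ToolCall) -> Optional[str]:
    """Return None if the calls agree, else the error type."""
    if teacher == student:
        return None
    if teacher.action_type != student.action_type:
        return "planning"    # different action type
    return "execution"       # same action type, different parameters


def build_pairs(
    teacher_steps: List[ToolCall], student_steps: List[ToolCall]
) -> List[Dict[str, Any]]:
    """Form one preference pair per step where the tool calls differ."""
    pairs = []
    for step, (t, s) in enumerate(zip(teacher_steps, student_steps)):
        error = label_error(t, s)
        if error is not None:
            pairs.append(
                {"step": step, "chosen": t, "rejected": s, "error": error}
            )
    return pairs
```

For instance, a student that types a slightly different string at one step yields an execution pair, while a student that clicks where the teacher pressed a hotkey yields a planning pair.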
A token-level mask restricts gradient updates to the relevant failure site. Reasoning traces are excluded; action descriptions are supervised only for planning errors; tool executions are always supervised.
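The masking rule can be illustrated with a short sketch. It assumes each response token carries one of three segment tags and plugs the mask into a standard DPO-style loss over per-token log-probabilities. The tag names, the function names, and the exact way the mask enters the loss are assumptions of this example, not the paper's confirmed formulation.

```python
import math
from typing import List, Tuple

# Hypothetical segment tags for the tokens of one agent response.
REASONING, DESCRIPTION, TOOL = "reasoning", "description", "tool"


def build_mask(segments: List[str], error_type: str) -> List[int]:
    """1 = token contributes to the loss, 0 = token is ignored."""
    mask = []
    for seg in segments:
        if seg == TOOL:
            mask.append(1)  # tool executions are always supervised
        elif seg == DESCRIPTION:
            # action descriptions are supervised only for planning errors
            mask.append(1 if error_type == "planning" else 0)
        else:
            mask.append(0)  # reasoning traces are excluded
    return mask


def masked_logratio(
    policy_lp: List[float], ref_lp: List[float], mask: List[int]
) -> float:
    """Sum of log(pi_theta / pi_ref) over the unmasked tokens only."""
    return sum(m * (p - r) for p, r, m in zip(policy_lp, ref_lp, mask))


Sample = Tuple[List[float], List[float], List[int]]  # (policy, ref, mask)


def masked_dpo_loss(chosen: Sample, rejected: Sample, beta: float = 0.1) -> float:
    """-log(sigmoid(margin)), with the margin built from masked log-ratios."""
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    return math.log1p(math.exp(-margin))
```

Because masked tokens are multiplied by zero, their log-probabilities have no effect on the loss, so no gradient would flow through them in an autodiff framework.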
Domain knowledge is stored in lightweight LoRA adapters $\{\Delta_d\}$. The base student is shared across domains, and the matching adapter is activated at inference time.
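As a toy illustration of per-domain adapters over a shared base, the sketch below keeps one low-rank pair per domain and forms the effective weight $W + BA$ on demand. The `AdapterBank` class and its methods are hypothetical, a single weight matrix stands in for the whole model, and the usual LoRA scaling factor is omitted for brevity.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise matrix sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class AdapterBank:
    """Shared base weight plus one low-rank update per domain."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def register(self, domain: str, B: Matrix, A: Matrix) -> None:
        """Store the factors of Delta_d = B @ A for one domain."""
        self.adapters[domain] = (B, A)

    def effective_weight(self, domain: str) -> Matrix:
        """Return W + B @ A for a known domain, else the base weight."""
        if domain not in self.adapters:
            return self.base_weight
        B, A = self.adapters[domain]
        return add(self.base_weight, matmul(B, A))
```

Calling `effective_weight("calc")` after registering a `"calc"` adapter returns the specialized weight, while an unregistered domain falls back to the shared base student.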
LearnWeak yields consistent improvements across all eight domains for both student models.
| Model | Gimp | Calc | Impress | Writer | OS | Thunderbird | VLC | VSCode | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Generalized Models | | | | | | | | | |
| Kimi K2.6 | 73.08 | 80.85 | 82.19 | 73.91 | 79.17 | 80.00 | 75.71 | 91.30 | 79.53 |
| Claude Sonnet 4.6 | 69.23 | 74.47 | 70.21 | 86.83 | 91.67 | 66.67 | 81.41 | 72.73 | 76.65 |
| Qwen3.5-27B | 39.74 | 22.70 | 43.97 | 52.17 | 41.67 | 66.67 | 44.12 | 47.83 | 44.86 |
| Domain Specialized CUA Models | | | | | | | | | |
| SEAgent | 42.30 | — | 22.70 | 31.80 | — | — | 35.30 | 40.50 | — |
| OSExpert | 30.80 | 44.70 | 42.60 | 34.70 | — | — | — | — | — |
| CUA Models | | | | | | | | | |
| EvoCUA-32B Teacher | 76.29 | 51.06 | 52.98 | 65.22 | 75.00 | 60.00 | 64.65 | 65.22 | 63.80 |
| OpenCUA-32B | 74.36 | 35.46 | 48.21 | 56.52 | 61.11 | 57.78 | 37.25 | 72.73 | 55.43 |
| EvoCUA-8B | 66.15 | 28.07 | 37.66 | 50.43 | 60.83 | 65.33 | 45.71 | 51.30 | 50.69 |
| EvoCUA-8B + LearnWeak | 82.05 | 41.13 | 50.35 | 55.07 | 66.67 | 73.33 | 56.86 | 72.46 | 62.24 |
| Δ | +15.9 | +13.1 | +12.7 | +4.6 | +5.8 | +8.0 | +11.2 | +21.2 | +11.6 |
| OpenCUA-7B | 48.46 | 11.91 | 31.49 | 30.43 | 40.00 | 54.67 | 32.94 | 51.30 | 37.65 |
| OpenCUA-7B + LearnWeak | 57.69 | 19.15 | 36.88 | 40.58 | 59.42 | 66.67 | 47.06 | 62.32 | 48.72 |
| Δ | +9.2 | +7.2 | +5.4 | +10.2 | +19.4 | +12.0 | +14.1 | +11.0 | +11.1 |
The specialized EvoCUA-8B outperforms the 32B teacher on Gimp, Thunderbird, and VSCode, showing that targeted corrective supervision can go beyond imitation.
EvoCUA-8B improves most on VSCode, Gimp, Calc, and Impress, while OpenCUA-7B gains most on OS, VLC, Thunderbird, and VSCode. This suggests specialization depends more on how each student adapts to a domain's interaction patterns than on domain difficulty alone.
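The per-domain ranking can be reproduced directly from the table. The snippet below copies the before and after scores for both students and sorts domains by gain; the names `DOMAINS`, `SCORES`, and `top_gains` are introduced here only for illustration.

```python
from typing import List

DOMAINS = ["Gimp", "Calc", "Impress", "Writer", "OS", "Thunderbird", "VLC", "VSCode"]

# (zero-shot scores, scores after LearnWeak), copied from the main results table.
SCORES = {
    "EvoCUA-8B": (
        [66.15, 28.07, 37.66, 50.43, 60.83, 65.33, 45.71, 51.30],
        [82.05, 41.13, 50.35, 55.07, 66.67, 73.33, 56.86, 72.46],
    ),
    "OpenCUA-7B": (
        [48.46, 11.91, 31.49, 30.43, 40.00, 54.67, 32.94, 51.30],
        [57.69, 19.15, 36.88, 40.58, 59.42, 66.67, 47.06, 62.32],
    ),
}


def top_gains(model: str, k: int = 4) -> List[str]:
    """Return the k domains with the largest gain, largest first."""
    before, after = SCORES[model]
    gains = {d: a - b for d, b, a in zip(DOMAINS, before, after)}
    return sorted(gains, key=gains.get, reverse=True)[:k]


print(top_gains("EvoCUA-8B"))   # ['VSCode', 'Gimp', 'Calc', 'Impress']
print(top_gains("OpenCUA-7B"))  # ['OS', 'VLC', 'Thunderbird', 'VSCode']
```

Only VSCode appears in both top-four lists, which is consistent with the observation that the two students benefit in different domains.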
Domain-level statistics of the generated datasets, including teacher-pass/student-fail trajectory counts and the breakdown of planning and execution errors. The heterogeneous error distribution across domains confirms that different software environments expose different student failure modes.
We compare LearnWeak-Gen against existing datasets and alternative data construction pipelines under a matched training budget (EvoCUA-8B student).
| Method | Calc | Impress | VLC | VSCode | Avg. |
|---|---|---|---|---|---|
| Zero-shot | 28.07 | 37.66 | 45.71 | 51.30 | 40.69 |
| Existing Data | | | | | |
| AgentNet | 34.04 | 39.01 | 49.01 | 69.57 | 47.91 |
| AgentNet (N.) | 32.62 | 40.43 | 49.02 | 63.77 | 46.46 |
| Minimal Human Annotation | | | | | |
| Trajectory Boosting | 30.50 | 19.88 | 45.10 | 49.28 | 36.19 |
| Zero Human Annotation | | | | | |
| AgentSynth | 31.21 | 39.01 | 39.22 | 71.01 | 45.11 |
| OS-Genesis | 31.91 | 37.59 | 45.10 | 68.12 | 45.68 |
| ZeroGUI | 36.17 | 40.43 | 48.86 | 62.30 | 46.94 |
| WebSTAR | 31.21 | 40.43 | 52.94 | 73.91 | 49.62 |
| LearnWeak (Ours) | 41.13 | 50.35 | 56.86 | 72.46 | 55.20 |
LearnWeak outperforms all baselines on average, achieving a 5.58 pp gain over WebSTAR. Note that WebSTAR is a step-level filtering method applied to the same weakness-aware generated data, so the two methods differ only in the filtering stage. Reusing existing data (AgentNet) or generating without weakness targeting (AgentSynth, OS-Genesis, ZeroGUI) provides limited gains.
We ablate the key design choices of LearnWeak-Gen: iterative generation and weakness-report conditioning.
| Domain Specialization | Iterative Generation | Weakness Report | Calc | Impress | VLC | VSCode | Avg. |
|---|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 28.07 | 37.66 | 45.71 | 51.30 | 40.69 |
| ✓ | ✗ | ✗ | 34.57 | 39.72 | 47.06 | 73.91 | 48.82 |
| ✓ | ✓ | ✗ | 24.82 | 42.55 | 43.14 | 72.46 | 45.74 |
| ✓ | ✓ | ✓ | 41.13 | 50.35 | 56.86 | 72.46 | 55.20 |
One-shot domain specialization (no iteration) already provides a useful baseline (40.69 → 48.82 avg.). However, adding iteration without the weakness report lowers the average relative to the one-shot setting (48.82 → 45.74), with Calc and VLC falling below their zero-shot scores, showing that exploration-only generation fails to collect effectively targeted training samples. The full pipeline combining iteration with weakness-report conditioning achieves the best average (55.20), demonstrating that the benefit of iterative expansion depends on student-aware guidance.
| Teacher Policy | Teacher (Calc) | Teacher (VSCode) | Specialized Student (Calc) | Specialized Student (VSCode) |
|---|---|---|---|---|
| Zero-shot | — | — | 28.07 | 51.30 |
| Claude Haiku 4.6 | 36.17 | 69.60 | 30.50 | 71.01 |
| EvoCUA-32B | 51.06 | 65.22 | 41.13 | 72.46 |
| Kimi K2.5 | 63.83 | 86.96 | 41.13 | 73.91 |
Teacher strength matters up to a point: the weaker Claude Haiku 4.6 yields smaller student gains than EvoCUA-32B or Kimi K2.5. Yet despite a large gap in their own success rates, EvoCUA-32B and Kimi K2.5 produce nearly identical specialized student performance. Once the teacher is strong enough to generate reliable trajectories, further gains depend on whether its supervision targets actionable weaknesses for the student, not on raw teacher performance.
We train each target model on datasets constructed from weakness reports derived from different source students. Because failure cases and weakness types differ across student models, a student-aware generator produces the most useful data when the weakness report is derived from the target model itself.
| Target Model $\pi_\theta$ / Source Student $\pi_S$ | Calc | Impress | VLC | VSCode |
|---|---|---|---|---|
| OpenCUA-7B (target) | | | | |
| Zero-shot | 11.91 | 31.49 | 32.94 | 51.30 |
| Source: EvoCUA-8B | 9.93 | 27.54 | 45.10 | 50.72 |
| Source: UI-TARS-1.5-7B | 7.80 | 33.35 | — | 49.28 |
| Source: OpenCUA-7B (own) | 19.15 | 36.88 | 47.06 | 62.32 |
| EvoCUA-8B (target) | | | | |
| Zero-shot | 28.07 | 37.66 | 45.71 | 51.30 |
| Source: UI-TARS-1.5-7B | 22.70 | 31.21 | — | 71.01 |
| Source: OpenCUA-7B | 39.01 | 43.26 | 47.06 | 73.91 |
| Source: EvoCUA-8B (own) | 41.13 | 50.35 | 56.86 | 72.46 |
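Both models perform best in nearly every domain when trained on datasets generated from their own failure cases, while cross-student datasets generally yield lower gains and sometimes fall below zero-shot. The one exception is EvoCUA-8B on VSCode, where data derived from OpenCUA-7B is slightly ahead (73.91 vs. 72.46). Overall, this indicates that LearnWeak-Gen captures model-specific deficiencies rather than generic task distributions.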
We compare LearnWeak-DPO with standard SFT, DPO, and variants with different error-aware supervision masks, all trained on the same dataset from LearnWeak-Gen.
| Method | Calc | Impress | VLC | VSCode | Avg. |
|---|---|---|---|---|---|
| Zero-shot | 28.07 | 37.66 | 45.71 | 51.30 | 40.69 |
| SFT | | | | | |
| Standard SFT ($m = \varnothing$) | 29.08 | 39.72 | 45.10 | 68.12 | 45.51 |
| LearnWeak-SFT | 34.04 | 46.81 | 45.10 | 69.57 | 48.88 |
| DPO | | | | | |
| Standard DPO ($m = \varnothing$) | 27.66 | 40.43 | 49.02 | 65.22 | 45.58 |
| DPO, planning-only mask | 18.44 | 17.02 | 41.18 | 63.77 | 35.10 |
| DPO, execution-only mask | 24.82 | 39.72 | 45.10 | 71.01 | 45.16 |
| LearnWeak-DPO (Ours) | 41.13 | 50.35 | 56.86 | 72.46 | 55.20 |
Full-response optimization (standard SFT/DPO) improves only modestly over zero-shot (about 4.8 to 4.9 pp on average). Error-aware masking improves SFT on Calc, Impress, and VSCode and ties on VLC, and LearnWeak-DPO outperforms standard DPO by 9.62 pp on average. Using only planning-level or only execution-level supervision is insufficient; effective specialization requires both.
The following case studies compare OSWorld benchmark trajectories before and after specialization in LibreOffice Calc and LibreOffice Impress.
Each study shows how specialization changes the agent's local decision-making behavior, not merely the final task outcome.
The baseline selects only columns B–D, so the sort operates on a partial table rather than the full sheet range, misaligning rows.
The specialized model explicitly expands the selection to A1:D19 before sorting, preserving row integrity across all columns.
The baseline commits to an unnecessarily complex VLOOKUP-based construction and ends with a #NAME? error in Sheet2!A2 instead of a usable row-wise formula.
The specialized model uses a simple direct row-wise reference, =Sheet1.A2&"_"&INT(Sheet1.J2), which immediately yields a valid Year_Profit value and replaces brittle lookup logic with the simplest pattern compatible with the task.
In summary, the baseline's lookup formula fails with a #NAME? error. The specialized model instead uses =Sheet1.A2&"_"&INT(Sheet1.J2), the simplest formula consistent with the task, and propagates it down the column.
The baseline falls into a long recovery loop, repeatedly clicking empty slide space without returning to a successful object-move workflow.
The specialized model recovers by leaving text-edit mode and then resumes the correct object-drag sequence until the title reaches the bottom region.
In summary, the specialized model leaves text-edit mode with Esc and completes the object-level drag needed to reposition the title.
The baseline stops after recoloring two obvious textboxes, leaving the broader "all textboxes" requirement only partially satisfied.
The specialized model continues beyond the first successful edits and only finishes once all required text elements on slide 5 are yellow.
@article{kim2026learnweaknessesautomateddomain,
title = {Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents},
author = {Kim, Suji and Kim, Kangsan and Hwang, Sung Ju},
journal = {arXiv preprint arXiv:2605.28775},
year = {2026}
}