LearnWeak: Automated Domain Specialization for Small Computer-Use Agents

TL;DR

We introduce LearnWeak, an automated training framework for domain specialization of small computer-use agents (CUAs) that:

✓ requires no human trajectory annotation,

✓ constructs synthetic training datasets focused on the student's weaknesses, and

✓ trains the student using DPO with adaptive loss selection based on error types.

On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains.

Figure (LearnWeak concept). **Left:** Unlike existing approaches that aim for broad domain coverage, LearnWeak targets the specific weaknesses of the current student. **Right:** Representative performance gains across target software domains after domain specialization with LearnWeak.

Abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open CUAs are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small CUAs that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student-awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small CUAs in diverse domains.

Method

LearnWeak decomposes domain specialization into two coupled stages: LearnWeak-Gen, an annotation-free data generation loop driven by student failures, and LearnWeak-DPO, an error-aware preference optimization that adaptively targets each failure type.

Stage 1

LearnWeak-Gen: Weakness-Aware Data Generation

Step 1: Seed Query Setup

Each domain starts from a small set of seed tasks; no trajectory annotation is required.

Step 2: Weakness Discovery

Teacher and student are run on the same tasks. Cases where the teacher succeeds but the student fails are collected, and a verifier summarizes recurring failure modes into a weakness report.

Step 3: Screenshot-Guided Query Generation

New tasks are synthesized via two strategies: weakness-focused (conditioned on the weakness report) and exploration-focused (conditioned on screenshots), balancing targeted repair with domain coverage.

Step 4: Iterative Refinement

Steps 2 and 3 repeat for multiple iterations, progressively focusing the task distribution on unresolved student weaknesses.
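The four steps above can be outlined as a short loop. This is a minimal sketch, not the authors' implementation: every callable passed in (`teacher_passes`, `student_passes`, `summarize`, `weakness_tasks`, `exploration_tasks`) is a hypothetical stand-in for running an agent, a verifier, or a task generator, and the toy lambdas in the usage lines exist only to make the example runnable.

```python
from typing import Callable, List, Tuple


def learnweak_gen(
    seed_tasks: List[str],
    teacher_passes: Callable[[str], bool],
    student_passes: Callable[[str], bool],
    summarize: Callable[[List[str]], str],
    weakness_tasks: Callable[[str, int], List[str]],
    exploration_tasks: Callable[[int], List[str]],
    iterations: int = 3,
) -> Tuple[List[str], List[str]]:
    """Hypothetical outline of the weakness-aware generation loop."""
    tasks = list(seed_tasks)          # Step 1: seed tasks, no annotation
    collected: List[str] = []         # teacher-pass / student-fail tasks
    reports: List[str] = []           # one weakness report per iteration
    for it in range(iterations):
        # Step 2: keep tasks the teacher solves and the student does not.
        failures = [t for t in tasks
                    if teacher_passes(t) and not student_passes(t)]
        collected.extend(f for f in failures if f not in collected)
        report = summarize(failures)
        reports.append(report)
        # Step 3: mix weakness-focused and exploration-focused tasks.
        tasks = weakness_tasks(report, it) + exploration_tasks(it)
        # Step 4: the next pass repeats discovery on the new tasks.
    return collected, reports


# Toy usage: the "student" fails every task that mentions sorting.
collected, reports = learnweak_gen(
    seed_tasks=["sort column A", "open file"],
    teacher_passes=lambda t: True,
    student_passes=lambda t: "sort" not in t,
    summarize=lambda fails: "range selection before sort" if fails else "none",
    weakness_tasks=lambda rep, it: [f"sort table variant {it}"] if rep != "none" else [],
    exploration_tasks=lambda it: [f"explore menu {it}"],
    iterations=3,
)
print(collected)
```

In this toy run the exploration tasks never enter the collected set because the student solves them, while each weakness-focused task does, which mirrors how the task distribution drifts toward unresolved weaknesses.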

Stage 2

LearnWeak-DPO: Error-Aware Preference Optimization

Step 1: Teacher-Replay Preference Construction

The teacher trajectory is replayed step by step. At each step, the student is queried under the teacher's context, and steps with differing tool executions form preference pairs.
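A minimal sketch of this replay, under stated assumptions: actions are plain dictionaries with a `type` and `params`, `student_policy` is a hypothetical callable, and the teacher's action is treated as the chosen response with the student's as the rejected one (the text implies this ordering but does not spell it out).

```python
from typing import Callable, Dict, List, Tuple

Action = Dict[str, object]            # e.g. {"type": "click", "params": {...}}
Step = Tuple[str, Action]             # (observation, teacher action)


def build_preference_pairs(
    teacher_steps: List[Step],
    student_policy: Callable[[List[Step], str], Action],
) -> List[Dict[str, object]]:
    """Replay the teacher trajectory and collect steps where the student differs."""
    pairs: List[Dict[str, object]] = []
    history: List[Step] = []
    for observation, teacher_action in teacher_steps:
        # The student sees the teacher's history, not its own rollout.
        student_action = student_policy(history, observation)
        if student_action != teacher_action:
            pairs.append({
                "step": len(history),
                "observation": observation,
                "chosen": teacher_action,      # assumption: teacher is preferred
                "rejected": student_action,
            })
        # Always advance with the teacher's action to stay on its trajectory.
        history.append((observation, teacher_action))
    return pairs


# Toy usage with a lookup-table "student".
teacher = [
    ("sheet open", {"type": "click", "params": {"cell": "A1"}}),
    ("A1 selected", {"type": "drag", "params": {"to": "D19"}}),
    ("range selected", {"type": "click", "params": {"menu": "Data"}}),
]
student_table = {
    "sheet open": {"type": "click", "params": {"cell": "B1"}},
    "A1 selected": {"type": "drag", "params": {"to": "D19"}},
    "range selected": {"type": "type", "params": {"text": "sort"}},
}
pairs = build_preference_pairs(teacher, lambda hist, obs: student_table[obs])
print([p["step"] for p in pairs])
```

Advancing the history with the teacher's action at every step is what keeps each pair conditioned on the teacher's context, so a single early student mistake does not contaminate later comparisons.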

Step 2: Error-Type Classification

Each pair is labeled as a planning error (different action type) or an execution error (same action type, different parameters).
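The labeling rule is simple enough to state as a function. The sketch below assumes the same hypothetical action format as elsewhere on this page (a dictionary with `type` and `params`); the real action schema of the agents may differ.

```python
from typing import Dict, Optional

Action = Dict[str, object]


def classify_error(chosen: Action, rejected: Action) -> Optional[str]:
    """Label a preference pair by where the student diverged from the teacher."""
    if chosen["type"] != rejected["type"]:
        return "planning"      # different action type
    if chosen["params"] != rejected["params"]:
        return "execution"     # same action type, different parameters
    return None                # identical actions do not form a pair


label_a = classify_error(
    {"type": "click", "params": {"cell": "A1"}},
    {"type": "type", "params": {"text": "sort"}},
)
label_b = classify_error(
    {"type": "click", "params": {"cell": "A1"}},
    {"type": "click", "params": {"cell": "B1"}},
)
print(label_a, label_b)
```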

Step 3: Selective Supervision Masking

A token-level mask restricts gradient updates to the relevant failure site. Reasoning traces are excluded; action descriptions are supervised only for planning errors; tool executions are always supervised.
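The masking rule can be sketched as follows. The segment names (`reasoning`, `action_description`, `tool_call`) are hypothetical labels for the three response parts named above, and the way the mask enters the DPO loss (summing masked per-token log-ratios before the usual sigmoid margin) is an assumption for illustration, not a formula taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[str, int]   # (segment kind, number of tokens)


def supervision_mask(segments: Sequence[Segment], error_type: str) -> List[float]:
    """Build a token-level mask from the rule described in the text."""
    mask: List[float] = []
    for kind, n_tokens in segments:
        if kind == "tool_call":
            keep = True                          # always supervised
        elif kind == "action_description":
            keep = error_type == "planning"      # planning errors only
        else:
            keep = False                         # reasoning traces excluded
        mask.extend([1.0 if keep else 0.0] * n_tokens)
    return mask


def masked_logratio(policy_logps, ref_logps, mask) -> float:
    """Sum of masked per-token log-probability ratios (policy vs. reference)."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def masked_dpo_loss(chosen, rejected, beta: float = 0.1) -> float:
    """DPO loss on masked tokens; chosen/rejected are (policy, ref, mask) triples."""
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    return math.log1p(math.exp(-margin))         # equals -log(sigmoid(margin))


segments = [("reasoning", 3), ("action_description", 2), ("tool_call", 2)]
planning_mask = supervision_mask(segments, "planning")
execution_mask = supervision_mask(segments, "execution")

ref = [-1.0] * 7
policy_same = [-1.0] * 7
policy_better = [-1.0] * 5 + [-0.5, -0.5]        # higher likelihood on tool tokens
loss_zero = masked_dpo_loss((policy_same, ref, execution_mask),
                            (policy_same, ref, execution_mask))
loss_better = masked_dpo_loss((policy_better, ref, execution_mask),
                              (policy_same, ref, execution_mask))
print(planning_mask, execution_mask)
```

With an execution-error mask, only the two tool-call tokens contribute, so raising the policy's likelihood on those tokens for the chosen response is the only way to reduce the loss in this sketch.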

Step 4: Domain-Scalable LoRA Deployment

Domain knowledge is stored in lightweight LoRA adapters $\{\Delta_d\}$. The base student is shared across domains, and the matching adapter is activated at inference time.
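To illustrate the shared-base, per-domain-adapter layout, here is a toy sketch on a single linear layer using plain Python lists. The `AdapterBank` class is hypothetical, and the low-rank update is written as $\Delta_d = B_d A_d$ without the scaling factor that LoRA implementations usually apply; a real deployment would rely on an adapter library instead.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class AdapterBank:
    """One shared base weight plus a low-rank (B, A) pair per domain."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def register(self, domain: str, B: Matrix, A: Matrix) -> None:
        self.adapters[domain] = (B, A)

    def forward(self, x: Vector, domain: Optional[str] = None) -> Vector:
        y = matvec(self.base_weight, x)           # shared base computation
        if domain in self.adapters:               # activate the matching adapter
            B, A = self.adapters[domain]
            delta = matvec(B, matvec(A, x))       # low-rank update B(Ax)
            y = [yi + di for yi, di in zip(y, delta)]
        return y


bank = AdapterBank(base_weight=[[1.0, 0.0], [0.0, 1.0]])
bank.register("calc", B=[[1.0], [0.0]], A=[[0.5, 0.5]])   # rank-1 adapter

x = [2.0, 4.0]
base_out = bank.forward(x)                  # no adapter active
calc_out = bank.forward(x, domain="calc")   # Calc adapter active
gimp_out = bank.forward(x, domain="gimp")   # unregistered domain falls back to base
print(base_out, calc_out, gimp_out)
```

Because the base weight is never modified, adding a new domain only means registering another small (B, A) pair.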

Experiments

Domain Specialization Results on OSWorld

LearnWeak yields consistent improvements across all eight domains for both student models.

Model	Gimp	Calc	Impress	Writer	OS	Thunderbird	VLC	VSCode	Avg.
Generalized Models
Kimi K2.6	73.08	80.85	82.19	73.91	79.17	80.00	75.71	91.30	79.53
Claude Sonnet 4.6	69.23	74.47	70.21	86.83	91.67	66.67	81.41	72.73	76.65
Qwen3.5-27B	39.74	22.70	43.97	52.17	41.67	66.67	44.12	47.83	44.86
Domain Specialized CUA Models
SEAgent	42.30	—	22.70	31.80	—	—	35.30	40.50	—
OSExpert	30.80	44.70	42.60	34.70	—	—	—	—	—
CUA Models
EvoCUA-32B Teacher	76.29	51.06	52.98	65.22	75.00	60.00	64.65	65.22	63.80
OpenCUA-32B	74.36	35.46	48.21	56.52	61.11	57.78	37.25	72.73	55.43

EvoCUA-8B	66.15	28.07	37.66	50.43	60.83	65.33	45.71	51.30	50.69
EvoCUA-8B + LearnWeak	82.05	41.13	50.35	55.07	66.67	73.33	56.86	72.46	62.24
Δ	+15.9	+13.1	+12.7	+4.6	+5.8	+8.0	+11.2	+21.2	+11.6

OpenCUA-7B	48.46	11.91	31.49	30.43	40.00	54.67	32.94	51.30	37.65
OpenCUA-7B + LearnWeak	57.69	19.15	36.88	40.58	59.42	66.67	47.06	62.32	48.72
Δ	+9.2	+7.2	+5.4	+10.2	+19.4	+12.0	+14.1	+11.0	+11.1

+11.6 pp average improvement for EvoCUA-8B
+11.1 pp average improvement for OpenCUA-7B
3 / 8 domains where the specialized EvoCUA-8B surpasses the 32B teacher

Finding 1: Small students can surpass the teacher

The specialized EvoCUA-8B outperforms the 32B teacher on Gimp, Thunderbird, and VSCode, showing that targeted corrective supervision can go beyond imitation.

Finding 2: Gains are student-dependent, not domain-dependent

EvoCUA-8B improves most on VSCode, Gimp, Calc, and Impress, while OpenCUA-7B gains most on OS, VLC, Thunderbird, and VSCode. This suggests specialization depends more on how each student adapts to a domain's interaction patterns than on domain difficulty alone.
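Both findings can be re-derived from the results table with a few lines of arithmetic. The scores below are copied from the table above; the helper names are illustrative only.

```python
DOMAINS = ["Gimp", "Calc", "Impress", "Writer", "OS", "Thunderbird", "VLC", "VSCode"]
SCORES = {
    "EvoCUA-32B Teacher":     [76.29, 51.06, 52.98, 65.22, 75.00, 60.00, 64.65, 65.22],
    "EvoCUA-8B":              [66.15, 28.07, 37.66, 50.43, 60.83, 65.33, 45.71, 51.30],
    "EvoCUA-8B + LearnWeak":  [82.05, 41.13, 50.35, 55.07, 66.67, 73.33, 56.86, 72.46],
    "OpenCUA-7B":             [48.46, 11.91, 31.49, 30.43, 40.00, 54.67, 32.94, 51.30],
    "OpenCUA-7B + LearnWeak": [57.69, 19.15, 36.88, 40.58, 59.42, 66.67, 47.06, 62.32],
}


def gains(base: str, tuned: str) -> dict:
    """Per-domain gain in percentage points."""
    return {d: round(t - b, 2)
            for d, b, t in zip(DOMAINS, SCORES[base], SCORES[tuned])}


def top_k(gain_by_domain: dict, k: int = 4) -> list:
    """Domains with the k largest gains, largest first."""
    return sorted(gain_by_domain, key=gain_by_domain.get, reverse=True)[:k]


evo = gains("EvoCUA-8B", "EvoCUA-8B + LearnWeak")
opn = gains("OpenCUA-7B", "OpenCUA-7B + LearnWeak")
evo_top, opn_top = top_k(evo), top_k(opn)
evo_avg = round(sum(evo.values()) / len(evo), 1)
opn_avg = round(sum(opn.values()) / len(opn), 1)

# Finding 1: domains where the specialized 8B student beats the 32B teacher.
beats_teacher = [
    d for d, s, t in zip(DOMAINS, SCORES["EvoCUA-8B + LearnWeak"],
                         SCORES["EvoCUA-32B Teacher"]) if s > t
]
print(evo_top, opn_top, evo_avg, opn_avg, beats_teacher)
```

The two top-four lists share only VSCode, which is the basis for reading the gains as student-dependent.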

Data Generation Statistics

Domain-level statistics of the generated datasets, including teacher-pass/student-fail trajectory counts and the breakdown of planning and execution errors. The heterogeneous error distribution across domains confirms that different software environments expose different student failure modes.

Figure: Generated data statistics for the EvoCUA-8B student with the EvoCUA-32B teacher.

Figure: Generated data statistics for the OpenCUA-7B student with the EvoCUA-32B teacher.

Data Construction Comparison

We compare LearnWeak-Gen against existing datasets and alternative data construction pipelines under a matched training budget (EvoCUA-8B student).

Method	Calc	Impress	VLC	VSCode	Avg.
Zero-shot	28.07	37.66	45.71	51.30	40.69
Existing Data
AgentNet	34.04	39.01	49.01	69.57	47.91
AgentNet (N.)	32.62	40.43	49.02	63.77	46.46
Minimal Human Annotation
Trajectory Boosting	30.50	19.88	45.10	49.28	36.19
Zero Human Annotation
AgentSynth	31.21	39.01	39.22	71.01	45.11
OS-Genesis	31.91	37.59	45.10	68.12	45.68
ZeroGUI	36.17	40.43	48.86	62.30	46.94
WebSTAR	31.21	40.43	52.94	73.91	49.62
LearnWeak (Ours)	41.13	50.35	56.86	72.46	55.20

LearnWeak outperforms all baselines on average, achieving a 5.58 pp gain over WebSTAR. Note that WebSTAR is a step-level filtering method applied to the same weakness-aware generated data, so the two methods differ only in the filtering stage. Reusing existing data (AgentNet) or generating without weakness targeting (AgentSynth, OS-Genesis, ZeroGUI) provides limited gains, with averages between 45.11 and 47.91 against 40.69 for zero-shot.

Analysis

Data Generation Pipeline Ablation

We ablate the key design choices of LearnWeak-Gen: iterative generation and weakness-report conditioning.

Domain Specialization	Iterative Generation	Weakness Report	Calc	Impress	VLC	VSCode	Avg.
✗	✗	✗	28.07	37.66	45.71	51.30	40.69
✓	✗	✗	34.57	39.72	47.06	73.91	48.82
✓	✓	✗	24.82	42.55	43.14	72.46	45.74
✓	✓	✓	41.13	50.35	56.86	72.46	55.20

Configurations compared: one-shot (no iteration, no weakness report); iterative without weakness report; full pipeline (LearnWeak).

Key Insight: Iteration alone is insufficient; weakness-aware guidance is essential

One-shot domain specialization (no iteration) already provides a useful baseline (40.69 → 48.82 avg.). However, adding iteration without the weakness report lowers the average to 45.74, below the one-shot setting, and Calc drops to 24.82, under its 28.07 zero-shot score. Exploration-only generation therefore fails to collect effectively targeted training samples. The full pipeline, which combines iteration with weakness-report conditioning, achieves the best average (55.20), so the benefit of iterative expansion depends on student-aware guidance.

Effect of Teacher Choice

Teacher Policy	Teacher Calc	Teacher VSCode	Specialized Student Calc	Specialized Student VSCode
Zero-shot	—	—	28.07	51.30
Claude Haiku 4.6	36.17	69.60	30.50	71.01
EvoCUA-32B	51.06	65.22	41.13	72.46
Kimi K2.5	63.83	86.96	41.13	73.91

Teacher strength matters up to a point: the weaker Claude Haiku 4.6 yields smaller student gains than EvoCUA-32B or Kimi K2.5. Yet despite a large gap in their own success rates, EvoCUA-32B and Kimi K2.5 produce nearly identical specialized student performance. Once the teacher is strong enough to generate reliable trajectories, further gains depend on whether its supervision targets actionable weaknesses for the student, not on raw teacher performance.

Student-Awareness of Dataset Generation

We train each target model on datasets constructed from weakness reports derived from different source students. Because failure cases and weakness types differ across student models, a student-aware generator produces the most useful data when the weakness report is derived from the target model itself.

Target Model $\pi_\theta$ / Source Student $\pi_S$	Calc	Impress	VLC	VSCode
OpenCUA-7B (target)
Zero-shot	11.91	31.49	32.94	51.30
Source: EvoCUA-8B	9.93	27.54	45.10	50.72
Source: UI-TARS-1.5-7B	7.80	33.35	—	49.28
Source: OpenCUA-7B (own)	19.15	36.88	47.06	62.32
EvoCUA-8B (target)
Zero-shot	28.07	37.66	45.71	51.30
Source: UI-TARS-1.5-7B	22.70	31.21	—	71.01
Source: OpenCUA-7B	39.01	43.26	47.06	73.91
Source: EvoCUA-8B (own)	41.13	50.35	56.86	72.46

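Both models score highest in nearly every domain when trained on data generated from their own failure cases. The one exception is EvoCUA-8B on VSCode, where data sourced from OpenCUA-7B scores slightly higher (73.91 vs. 72.46). Cross-student data otherwise yields smaller gains and sometimes falls below zero-shot: OpenCUA-7B on Calc drops from 11.91 to 9.93 and 7.80. This supports the view that LearnWeak-Gen captures model-specific deficiencies rather than generic task distributions.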
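For a per-domain view of the table above, the following sketch finds which source student gives each target its best score. The numbers are copied from the table, with `None` marking the cells reported as "—"; the function name is illustrative.

```python
DOMAINS = ["Calc", "Impress", "VLC", "VSCode"]

# target model -> source student -> scores per domain (None = not reported)
RESULTS = {
    "OpenCUA-7B": {
        "EvoCUA-8B":      [9.93, 27.54, 45.10, 50.72],
        "UI-TARS-1.5-7B": [7.80, 33.35, None, 49.28],
        "own":            [19.15, 36.88, 47.06, 62.32],
    },
    "EvoCUA-8B": {
        "UI-TARS-1.5-7B": [22.70, 31.21, None, 71.01],
        "OpenCUA-7B":     [39.01, 43.26, 47.06, 73.91],
        "own":            [41.13, 50.35, 56.86, 72.46],
    },
}


def best_source(target: str) -> dict:
    """For each domain, return the source student with the highest score."""
    best = {}
    for i, domain in enumerate(DOMAINS):
        scored = {src: vals[i]
                  for src, vals in RESULTS[target].items()
                  if vals[i] is not None}
        best[domain] = max(scored, key=scored.get)
    return best


opencua_best = best_source("OpenCUA-7B")
evocua_best = best_source("EvoCUA-8B")
own_wins = sum(src == "own"
               for table in (opencua_best, evocua_best)
               for src in table.values())
print(opencua_best, evocua_best, own_wins)
```

In this tabulation, own-source data gives the best score in seven of the eight target/domain cells.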

Training Objective Ablation

We compare LearnWeak-DPO with standard SFT, DPO, and variants with different error-aware supervision masks, all trained on the same dataset from LearnWeak-Gen.

Method	Calc	Impress	VLC	VSCode	Avg.
Zero-shot	28.07	37.66	45.71	51.30	40.69
SFT
Standard SFT ($m = \varnothing$)	29.08	39.72	45.10	68.12	45.51
LearnWeak-SFT	34.04	46.81	45.10	69.57	48.88
DPO
Standard DPO ($m = \varnothing$)	27.66	40.43	49.02	65.22	45.58
DPO, planning-only mask	18.44	17.02	41.18	63.77	35.10
DPO, execution-only mask	24.82	39.72	45.10	71.01	45.16
LearnWeak-DPO (Ours)	41.13	50.35	56.86	72.46	55.20

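Full-response optimization (standard SFT/DPO) improves only modestly, lifting the average from 40.69 to 45.51 and 45.58, respectively. Error-aware masking improves SFT in three of four domains and ties on VLC (48.88 vs. 45.51 avg.), and LearnWeak-DPO outperforms standard DPO by 9.62 points on average. Using only planning-level or only execution-level supervision is insufficient: the planning-only mask falls below zero-shot (35.10 avg.), and the execution-only mask roughly matches standard DPO (45.16 avg.). Effective specialization requires both.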

Qualitative Results

The following case studies compare OSWorld benchmark trajectories before and after specialization in LibreOffice Calc and LibreOffice Impress. Each study shows how specialization changes the agent's local decision-making behavior, not merely the final task outcome.

Case Study #1: LibreOffice Calc

"Sort the records according to the amounts ascendingly."

Before Specialization

Step 4

Selected columns B–D only, starting from Customer column.

Step 5

Range B1:D19 selected (3 of 4 columns, missing column A).

Step 6

Opened Data menu to proceed with partial-range sort.

The baseline selects only columns B–D, so the sort operates on a partial table rather than the full sheet range, misaligning rows.

After Specialization

Step 4

Selected entire data range from A1 including all four columns.

Step 5

Range A1:D19 selected (all 19 rows, all 4 columns).

Step 6

Opened Data menu with the correct full-table selection active.

The specialized model explicitly expands the selection to A1:D19 before sorting, preserving row integrity across all columns.

Analysis: The decisive difference is the selected range. The failing baseline sorts only B1:D19, violating the full-table requirement. The specialized model selects A1:D19, matching the weakness category identified during data generation (incorrect range selection before table operations).

Case Study #2: LibreOffice Calc

"Fill the Gross profit column, then create Year_Profit column in Sheet2 combining year and integer gross profit."

Before Specialization

Step 47

Prepared to replace formula bar with a VLOOKUP-based construction.

Step 48

Typed a complex nested VLOOKUP concatenation formula.

Step 49

Formula resulted in a #NAME? error in Sheet2!A2.

The baseline commits to an unnecessarily complex VLOOKUP-based construction and ends with a #NAME? error instead of a usable row-wise formula.

After Specialization

Step 13

Entered header "Year_Profit" in Sheet2!A1 correctly.

Step 14

Confirmed header and moved to next row for formula entry.

Step 15

Entered direct row-wise formula =Sheet1.A2&"_"&INT(Sheet1.J2), producing a valid result.

The specialized model uses a simple direct row-wise reference that immediately yields a valid Year_Profit value, replacing brittle lookup logic with the simplest pattern compatible with the task.

Analysis: The baseline chooses an unnecessarily complex VLOOKUP formula that repeatedly fails with a #NAME? error. The specialized model instead uses =Sheet1.A2&"_"&INT(Sheet1.J2), the simplest formula consistent with the task, and propagates it down the column.

Case Study #3: LibreOffice Impress

"Move the title of page 2 to the bottom of the slide."

Before Specialization

Step 43

Clicked empty slide space trying to deselect before moving title.

Step 45

Still stuck in the same deselection loop; title remains at top.

Step 47

Same ineffective click action repeated without progress.

The baseline falls into a long recovery loop, repeatedly clicking empty slide space without returning to a successful object-move workflow.

After Specialization

Step 16

Explicitly exited text-edit mode with Esc after recognizing the incorrect interaction state.

Step 18

Dragged the title downward once object handles became visible.

Step 22

Final drag places the title at the bottom region of the slide.

The specialized model recovers by leaving text-edit mode and then resumes the correct object-drag sequence until the title reaches the bottom region.

Analysis: The core difference is recovery from an incorrect interaction mode. The baseline remains trapped in a repeated deselection loop. The specialized model recognizes it is in text-edit mode, exits with Esc, and completes the object-level drag needed to reposition the title.

Case Study #4: LibreOffice Impress

"Navigate to slide 5 and set the font color of all textboxes to yellow."

Before Specialization

Step 7

Switched from title to subtitle after partially recoloring the slide.

Step 9

Opened yellow color picker for the subtitle text.

Step 11

Marked DONE after recoloring only title and subtitle, leaving other text elements unchanged.

The baseline stops after recoloring two obvious textboxes, leaving the broader "all textboxes" requirement only partially satisfied.

After Specialization

Step 26

Re-selected the subtitle with Ctrl+A to ensure all text in the box was targeted.

Step 28

Applied yellow through the font-color control after returning to the correct formatting context.

Step 31

Continued until all text elements on slide 5 were yellow before marking DONE.

The specialized model continues beyond the first successful edits and only finishes once all required text elements on slide 5 are yellow.

Analysis: This example highlights improved task-completion coverage. The baseline performs the obvious recoloring steps and terminates early. The specialized model keeps checking whether the entire slide matches the instruction, re-enters the formatting context when needed, and finishes only after all required textboxes are yellow.

BibTeX

@article{kim2026learnweaknessesautomateddomain,
  title   = {Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents},
  author  = {Kim, Suji and Kim, Kangsan and Hwang, Sung Ju},
  journal = {arXiv preprint arXiv:2605.28775},
  year    = {2026}
}
