2. Task Type
Detection: find all instances | Grounding: match description | OCR: extract text | GUI: locate UI element | Pointing: point to target
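As a minimal sketch, the five task types above could be written as an enumeration so that a caller selects one by name. The names `TaskType` and `describe` are hypothetical and only illustrate the mapping; they are not taken from any actual interface.

```python
from enum import Enum


class TaskType(Enum):
    """Hypothetical enumeration of the task types listed above."""
    DETECTION = "find all instances"
    GROUNDING = "match description"
    OCR = "extract text"
    GUI = "locate UI element"
    POINTING = "point to target"


def describe(task_name: str) -> str:
    """Look up a task by case-insensitive name and return its description."""
    return TaskType[task_name.upper()].value


if __name__ == "__main__":
    for task in TaskType:
        print(f"{task.name}: {task.value}")
```

An unknown task name raises `KeyError`, which keeps a misspelled task from silently falling back to a default.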
4. Inference Mode
fast: multi-token prediction (MTP) parallel decoding | slow: standard autoregressive (AR) decoding | hybrid: auto-switches between the two to balance quality and speed
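To make the speed difference between the modes concrete, here is a rough sketch that counts forward passes. It assumes MTP stands for multi-token prediction and idealizes it as emitting a fixed number of tokens per pass, while AR decoding emits one. `InferenceMode`, `forward_passes`, and `tokens_per_step` are hypothetical names, and the hybrid mode is not modeled because the switching rule is not stated here.

```python
import math
from enum import Enum


class InferenceMode(Enum):
    """Hypothetical enumeration of the three inference modes."""
    FAST = "fast"      # parallel (multi-token) decoding
    SLOW = "slow"      # standard autoregressive decoding
    HYBRID = "hybrid"  # switches between the two; rule not modeled here


def forward_passes(num_tokens: int, tokens_per_step: int = 1) -> int:
    """Count decoder passes needed to emit num_tokens.

    Idealized assumption: every pass emits exactly tokens_per_step tokens
    (1 for autoregressive decoding, more than 1 for multi-token decoding).
    """
    if num_tokens < 0 or tokens_per_step < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(num_tokens / tokens_per_step)


if __name__ == "__main__":
    print(forward_passes(12, tokens_per_step=1))  # autoregressive: 12 passes
    print(forward_passes(12, tokens_per_step=4))  # 4 tokens per pass: 3 passes
```

Under this idealization a 12-token output needs 12 passes one token at a time and 3 passes at four tokens per pass. Real parallel decoders may emit fewer tokens on some passes, so the actual gain is typically smaller.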