Ghosthand
Ghosthand is a loopback HTTP server on the Android phone. All interaction is via HTTP GET, POST, and a small amount of DELETE to http://127.0.0.1:5583.
Always do this first:
| Step | Command | Purpose |
|---|---|---|
| 1 | GET /ping | Is Ghosthand alive? |
| 2 | GET /state | Is the runtime healthy, and is the capability you need usable now? |
| 3 | GET /screen?source=accessibility | What is the current actionable surface? |
Use this skill to operate Ghosthand as an Android agent substrate.
Ghosthand is not generic Android advice. It is a local runtime with a route-based control surface. Use this skill only when the task is actually about Ghosthand routes, Ghosthand capability state, or acting through Ghosthand.
What Ghosthand is
Ghosthand exposes a local HTTP API for Android observation and control. The important categories are:
- runtime and health:
/ping,/health,/state,/device,/foreground,/commands,/capabilities - structured UI inspection:
/screen,/tree,/focused,/find - semantic or coordinate interaction:
/click,/tap,/input,/type,/setText,/scroll,/swipe,/longpress,/gesture - app and navigation control:
/back,/home,/recents - sensing and transport:
/screenshot,/wait,/clipboard,/notify
Treat /commands as the current machine-readable capability catalog when route details matter.
When to use this skill
Use it when the task requires any of the following:
- checking whether Ghosthand is running or ready
- checking whether a capability is both authorized by Android and allowed by Ghosthand policy
- inspecting the current Android surface before acting
- finding or clicking UI targets by
text,desc, orid - recovering from Ghosthand misses or ambiguous action results
- using Ghosthand to type, scroll, swipe, wait, read clipboard, or read notifications
- debugging Ghosthand-specific behaviors such as partial output, stale assumptions about selectors, or snapshot-scoped node IDs
Do not use it for:
- generic Android usage advice unrelated to Ghosthand
- root-only methods that Ghosthand does not expose
- imaginary routes or undocumented behavior when
/commandscan answer directly
Operating model
1. Start from truth, not intent
Before acting, establish three things:
- Is Ghosthand alive and usable?
- What surface is actually visible now?
- Which selector surface and route shape are most plausible for the target?
Typical order:
- read
/ping - read
/state - read
/commandsif route shape, selector support, or response fields are uncertain - read
/screen?source=accessibilityfor the current actionable surface - if accessibility read is unavailable or clearly insufficient, retry with
/screen?source=hybridor/screen?source=ocr - only then choose
/find,/click, or/tap
2. Capability access has two layers
A capability is usable only when both are true:
- Android/system authorization exists
- Ghosthand policy allows the capability
Do not confuse “permission granted” with “usable now”. Read /state before diagnosing failures, especially for accessibility and screenshot capture.
/state is the best live summary. /capabilities is the fuller catalog-style view when an agent needs route-capability mapping and availability details.
3. Node IDs are snapshot-scoped
Treat nodeId as ephemeral. Do not cache it across fresh observations unless the snapshot context is clearly the same. Prefer re-resolving via /screen, /find, or selector-based /click instead of assuming old node IDs remain valid.
Primitive selection
/screen
Use /screen first when you need a compact actionable view. The default mode is source=accessibility.
Use it to answer:
- what is visible now
- which elements are actionable, editable, or scrollable
- whether coordinates are trustworthy enough for
/tap - whether the current surface even contains the target
Important details:
source=accessibilityis the default and supportseditable,scrollable,clickable, andpackagefilterssource=hybridorsource=ocris useful when accessibility is temporarily unavailable or operationally insufficientsummaryOnly=trueis for compact orientation, not detailed targetingpreviewPathis a hint that a lightweight screenshot fetch is available;/screendoes not embed image bytes
If /screen reports partialOutput=true, warnings, foreground drift, or fallback hints, do not assume you saw the whole surface. Escalate to /tree, /screenshot, or a non-accessibility screen mode before blaming the app.
/tree
Use /tree when you need fuller structure, raw hierarchy, or to inspect why /screen may have omitted or shaped output. Use it for diagnosis and structural truth, not as your default first read.
/find
Use /find when you already have a selector hypothesis and want a bounded lookup.
Prefer it when you need:
- selector testing before interaction
- disambiguation by
index - confirmation that a target exists before a coordinate fallback
- inspection of whether a visible label is discoverable on
text,contentDesc,resourceId, or only as a focused node
A miss usually means one of four things:
- wrong screen
- wrong selector surface
- wrong match semantics
- target not exposed the way you assumed
Supported strategies are text, textContains, contentDesc, contentDescContains, resourceId, and focused. text, desc, and id are convenience aliases in the request body; Ghosthand normalizes them internally.
/click
Prefer /click over /tap when you have a plausible semantic target. Ghosthand can resolve wrapper targets, bounded selector fallbacks, and clickable ancestors, then expose how it actually landed on an actionable node.
Use /click first for:
- text-labeled controls
- content-description labeled controls
- stable resource IDs
- cases where ancestor click resolution may help
For selector-based /click, Ghosthand treats clickable=true as the default unless you explicitly set clickable=false. That default is optimized for action, not inspection. Use /find or disable clickable resolution when you need to inspect the raw matched node.
/tap
Use /tap only when coordinates come from the current trusted surface. Do not guess coordinates. Coordinate fallback is justified only after semantic targeting has narrowed the uncertainty.
/input and /setText
Use /input for the focused editable field. Prefer it over /type when you need explicit text mutation or Enter dispatch semantics.
Use /type only for simpler focused text entry when the current focus is already correct.
Use /setText only when you have a trusted same-snapshot editable nodeId and need to target that exact node.
When entering text, do not assume the Enter key will successfully submit or confirm the input. If Enter does not work or the field remains uncommitted, use the on-screen IME confirmation action instead, typically the confirm button in the bottom-right corner of the keyboard.
/scroll and /swipe
Use /scroll when the goal is container movement or list advancement.
Use /swipe when the task is truly geometric.
Do not interpret performed=true as proof that content changed. Check returned change fields, then verify with /screen, /tree, or /wait.
/wait
Use /wait after actions that may change UI state.
There are two different uses:
GET /wait: wait for UI change and inspect final settled statePOST /wait: wait for a selector condition
Do not confuse changed=false with action failure. It only means a transition was not observed during the wait window. Re-check the final surface before concluding the action failed.
For POST /wait, the supported strategies are bounded and query rules matter: focused takes no query, while text/content-description/resource-id waits require one.
/clipboard, /notify, /screenshot
Use /clipboard as a transport primitive for long text or repeated entry.
Use /notify to read or post local notifications only when the task is explicitly notification-related.
Use /screenshot when visual truth is needed and structured UI output is insufficient, ambiguous, or suspected stale.
Important details:
/screenshotsupportsGETandPOST- width and height must be provided together or omitted together
- screenshot capability is separately policy-gated from accessibility
- if
/screenpublishespreviewPath, use that exact path before inventing a new screenshot size
Selector judgment
Selectors are not interchangeable.
text
Use text when the visible label is likely the actual text field of the node.
desc
Use desc when the control is icon-like, accessibility-labeled, nav-like, or visibly sparse. Many controls that look label-based are actually better matched through content description.
id
Use id when a meaningful resourceId is present. This is often the strongest selector.
Exact vs contains
Do not over-read exact-match misses.
If the visible phrase may be part of a longer text block, retry with a contains-style strategy where the route supports it. A visible phrase on screen is not proof that exact text lookup should succeed.
/find supports explicit contains strategies. /click can use bounded contains fallback internally and tells you when it did so; do not mistake that for an exact selector hit.
Recovery rules
When a Ghosthand action misses, do not branch into random retries. Make one bounded correction:
- re-read
/screen - if accessibility is unavailable or weak, re-read
/screen?source=hybridor/screen?source=ocr - switch
texttodescorid - switch exact semantics to contains semantics when justified
- if text entry succeeded but submission did not, use the on-screen IME confirm action instead of retrying Enter
- move from
/clickto/taponly after trustworthy coordinates exist - use
/capabilitieswhen the route exists but capability availability is ambiguous - use
/waitto settle state before the next action
Repeated misses should be classified, not brute-forced.
Minimal workflows
Check whether Ghosthand is ready
- read
/ping - read
/state - if needed, read
/capabilities - if needed, read
/commands
Operate a visible control safely
- read
/screen?source=accessibility - choose
text,desc, orid - call
/click - call
/waitor re-read/screen - if accessibility surface truth is weak, retry
/screen?source=hybridor/screenshot - only use
/tapif semantic action remains weak but coordinates are trusted
Enter text and confirm it reliably
- focus the intended editable field
- use
/inputfor the focused field,/typefor simple focused typing, or/setTextfor a trusted same-snapshot editablenodeId - verify the text appears in the field or the focused surface reflects the update
- if Enter does not submit or confirm the input, use the on-screen IME confirm action, typically the bottom-right keyboard button
- call
/waitor re-read/screento confirm the post-input state
Diagnose a miss
- confirm Ghosthand and capability state with
/state - re-read
/screen?source=accessibility - inspect selector surface mismatch
- escalate to
/screen?source=hybrid,/tree, or/screenshotif accessibility output is partial, unavailable, or misleading - retry one bounded correction
Reporting standard
When summarizing a Ghosthand run, report only:
- what route you used
- what state changed
- whether the target was achieved
- the first narrow failing step if it was not
- the next best correction
Do not dump logs unless the task is explicitly diagnostic.
Reference files
Detailed route notes are in resources/references/ghosthand-api-quick-reference.md.