Android Use

Overview

This skill enables Claude to control Android devices using ADB (Android Debug Bridge) and the Android Accessibility API (uiautomator). It works by capturing the device's UI hierarchy as structured XML text, parsing interactive elements with their coordinates, and executing actions like tapping, typing, swiping, and navigation.

IMPORTANT: This skill uses TEXT-BASED UI automation, NOT image/screenshot processing.

Why Text-Based (Not Screenshots)?

Approach Cost Speed Accuracy

Screenshots + Vision ~$0.15/action 3-5s 70-80%

Text UI Dump ~$0.01/action <1s 99%+

The Android Accessibility API provides structured XML with:

Element text content (exact, no OCR errors)
Precise coordinates (deterministic)
Clickable/focusable state
Resource IDs for identification

Only use screenshots as a LAST RESORT when:

UI text is ambiguous or unclear
You need to verify visual state (colours, images)
Text dump is empty but screen clearly has content

Prerequisites

Before using this skill, ensure:

ADB installed: brew install android-platform-tools (macOS) or equivalent
Device connected: Via USB with "USB debugging" enabled in Developer Options
Device authorised: Accept the RSA key prompt on the device when first connecting

To verify connection:

adb devices -l

Quick Start Workflow

To control an Android device:

Get screen state - Dump UI hierarchy as text
Parse elements - Extract interactive components with coordinates
Decide action - Based on text content, find target element
Execute action - Tap, type, swipe, back, home
Wait & repeat - Allow UI to update, then reassess

Core Commands

Device Management

List all connected devices

adb devices -l

Target specific device (when multiple connected)

adb -s <DEVICE_ID> <command>

UI Inspection (Text-Based)

Dump UI hierarchy to device

adb -s <DEVICE_ID> shell uiautomator dump /sdcard/window_dump.xml

Pull to local machine

adb -s <DEVICE_ID> pull /sdcard/window_dump.xml /tmp/screen.xml

Parse and extract interactive elements

python3 scripts/parse_ui.py /tmp/screen.xml

One-liner for convenience:

adb -s <DEVICE_ID> shell uiautomator dump /sdcard/window_dump.xml &&
adb -s <DEVICE_ID> pull /sdcard/window_dump.xml /tmp/screen.xml &&
python3 scripts/parse_ui.py /tmp/screen.xml

The parser outputs interactive elements like:

TAPPABLE: 👆 "Submit" @ (540, 1200) 👆 "Cancel" @ (200, 1200) 👆 [search_button] @ (980, 156)

INPUT FIELDS: ⌨️ "Enter email" @ (540, 600)

TEXT/INFO: 👁️ "Welcome back, John" @ (540, 300)

Actions

Tap at coordinates

adb shell input tap <x> <y>

Type text (use %s for spaces)

adb shell input text "hello%sworld"

Press keys

adb shell input keyevent KEYCODE_HOME # Home button adb shell input keyevent KEYCODE_BACK # Back button adb shell input keyevent KEYCODE_ENTER # Enter key adb shell input keyevent KEYCODE_DEL # Backspace

Swipe (x1, y1 to x2, y2 over duration_ms)

adb shell input swipe <x1> <y1> <x2> <y2> <duration_ms>

Long press (swipe with same start/end)

adb shell input swipe <x> <y> <x> <y> 1000

App Management

Launch app by package name (simplest)

adb shell monkey -p <package> -c android.intent.category.LAUNCHER 1

Launch app with specific activity

adb shell am start -n <package>/<activity>

List installed packages

adb shell pm list packages | grep <search>

Force stop app

adb shell am force-stop <package>

Standard Workflow

Step 1: Get Current Screen State (TEXT ONLY)

adb -s <DEVICE_ID> shell uiautomator dump /sdcard/window_dump.xml &&
adb -s <DEVICE_ID> pull /sdcard/window_dump.xml /tmp/screen.xml &&
python3 scripts/parse_ui.py /tmp/screen.xml

This gives you a text list of all interactive elements with coordinates.

Step 2: Find Target Element

From the parsed output, identify the element you need:

Match by text content: "Submit" , "Login" , "Search"
Match by resource ID: [login_button] , [search_field]
Match by type: Button, EditText, TextView

Step 3: Execute Action

To tap "Submit" button at (540, 1200)

adb -s <DEVICE_ID> shell input tap 540 1200

To type in a field - first tap it, then type

adb -s <DEVICE_ID> shell input tap 540 600 adb -s <DEVICE_ID> shell input text "user@email.com"

Step 4: Wait and Repeat

Wait 1-2 seconds for UI to update

sleep 2

Dump again and reassess

adb -s <DEVICE_ID> shell uiautomator dump /sdcard/window_dump.xml &&
adb -s <DEVICE_ID> pull /sdcard/window_dump.xml /tmp/screen.xml &&
python3 scripts/parse_ui.py /tmp/screen.xml

When to Use Screenshots (LAST RESORT ONLY)

Only take a screenshot if:

UI dump returns empty/incomplete but you know the screen has content
You need to verify something visual (image loaded, colour state)
Text elements are ambiguous and you need visual context
User explicitly asks to "see" the screen

Screenshot command (USE SPARINGLY - costs ~15x more to process)

adb -s <DEVICE_ID> shell screencap -p /sdcard/screen.png &&
adb -s <DEVICE_ID> pull /sdcard/screen.png /tmp/screen.png

Common Patterns

Opening an App and Navigating

1. Launch app

adb -s DEVICE shell monkey -p com.example.app -c android.intent.category.LAUNCHER 1

2. Wait for launch

sleep 2

3. Get UI state (text)

adb -s DEVICE shell uiautomator dump /sdcard/window_dump.xml adb -s DEVICE pull /sdcard/window_dump.xml /tmp/screen.xml python3 scripts/parse_ui.py /tmp/screen.xml

4. Find and tap target element based on text output

adb -s DEVICE shell input tap <x> <y>

Searching in an App

1. Find search field in UI dump

Output shows: ⌨️ "Search" @ (540, 200)

2. Tap search field

adb shell input tap 540 200

3. Type search query

adb shell input text "cheesecake"

4. Press enter

adb shell input keyevent KEYCODE_ENTER

Scrolling to Find Elements

If an element isn't visible in the UI dump, scroll and re-dump:

Scroll down

adb shell input swipe 540 1500 540 500 300

Wait and re-dump

sleep 1 adb shell uiautomator dump /sdcard/window_dump.xml

... parse again

Handling Popups/Dialogs

Popups appear in UI dump with their own elements:

👆 "Allow" @ (700, 1200) 👆 "Deny" @ (300, 1200)

Just tap the appropriate button coordinates.

Parser Output Reference

The parse_ui.py script outputs elements grouped by action type:

TAPPABLE (clickable=true)

Elements that respond to taps: buttons, links, icons

👆 "Button Text" @ (x, y) 👆 [resource_id] @ (x, y)

INPUT FIELDS (EditText)

Text input fields

⌨️ "Placeholder text" @ (x, y)

TEXT/INFO (readable)

Non-interactive text elements (limited to 10)

👁️ "Display text" @ (x, y)

JSON Output

For programmatic use:

python3 scripts/parse_ui.py /tmp/screen.xml --json

Returns:

[ { "id": "com.app:id/submit_btn", "text": "Submit", "type": "Button", "bounds": "[400,1100][680,1300]", "center": [540, 1200], "clickable": true, "action": "tap" } ]

Troubleshooting

"error: device not found"

Check USB connection
Run adb devices to verify
Try adb kill-server && adb start-server

UI dump returns empty/incomplete

Screen may be loading - wait and retry
Some apps have accessibility restrictions
Only then consider a screenshot to diagnose

Taps not registering

Verify coordinates are within screen bounds
Check if element is actually clickable
Some overlays may intercept touches

Text input issues

Replace spaces with %s
Special characters may need escaping
For passwords, some apps block programmatic input

Element not found in dump

It may be off-screen - try scrolling
It may be in a WebView (limited accessibility)
It may be loading - wait and retry

Reference

Common Package Names

App Package

Uber Eats com.ubercab.eats

DoorDash com.dd.dasher

Chrome com.android.chrome

Settings com.android.settings

Messages com.google.android.apps.messaging

Common Keycodes

Key Code

Home KEYCODE_HOME

Back KEYCODE_BACK

Enter KEYCODE_ENTER

Delete KEYCODE_DEL

Tab KEYCODE_TAB

Escape KEYCODE_ESCAPE

Resources

scripts/parse_ui.py
Parses Android UI XML and outputs interactive elements
Based on android-action-kernel approach

android-use

Safety Notice

Copy this and send it to your AI assistant to learn

List all connected devices

Target specific device (when multiple connected)

Dump UI hierarchy to device

Pull to local machine

Parse and extract interactive elements

Tap at coordinates

Type text (use %s for spaces)

Press keys

Swipe (x1, y1 to x2, y2 over duration_ms)

Long press (swipe with same start/end)

Launch app by package name (simplest)

Launch app with specific activity

List installed packages

Force stop app

To tap "Submit" button at (540, 1200)

To type in a field - first tap it, then type

Wait 1-2 seconds for UI to update

Dump again and reassess

Screenshot command (USE SPARINGLY - costs ~15x more to process)

1. Launch app

2. Wait for launch

3. Get UI state (text)

4. Find and tap target element based on text output

1. Find search field in UI dump

Output shows: ⌨️ "Search" @ (540, 200)

2. Tap search field

3. Type search query

4. Press enter

Scroll down

Wait and re-dump

... parse again

Source Transparency

Related Skills

twilio-phone

deepgram-transcription

email-himalaya

skill-creator