Voice Mode
GitClaw supports real-time bidirectional voice via two adapters:
OpenAI Realtime (default)
- Model: gpt-realtime-2025-08-28
- Real-time audio streaming over WebSocket
- Supports image input (camera frames)
- Requires: OPENAI_API_KEY
Gemini Live
- Model: gemini-2.0-flash
- Alternative voice provider
- Free tier available
- Requires: GEMINI_API_KEY
```bash
# OpenAI voice (default)
gitclaw --voice --dir ~/assistant

# Gemini voice
gitclaw --voice gemini --dir ~/assistant
```
Text-Only Fallback
If no voice API key is set, GitClaw still starts the web UI server, but with voice disabled. A warning banner appears in the UI, the mic, camera, and speaker buttons are hidden, and text input is routed directly to the agent via query().
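The adapter choice and the text-only fallback described above can be sketched as a small decision function. This is a minimal illustration in Python, not GitClaw's actual code: `resolve_voice`, `VoiceConfig`, and the warning text are hypothetical names, and the sketch assumes the check is made against the key of the selected adapter, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Adapter name -> required environment variable (both named in this section).
REQUIRED_KEY = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


@dataclass(frozen=True)
class VoiceConfig:
    adapter: str             # "openai" or "gemini"
    voice_enabled: bool      # False means text-only fallback
    warning: Optional[str]   # banner text for the UI, if any (hypothetical wording)


def resolve_voice(adapter: Optional[str], env: Mapping[str, str]) -> VoiceConfig:
    """Hypothetical helper: pick the adapter and decide whether voice can run."""
    name = adapter or "openai"  # `--voice` with no value selects OpenAI
    if name not in REQUIRED_KEY:
        raise ValueError(f"unknown voice adapter: {name!r}")
    key = REQUIRED_KEY[name]
    if env.get(key):
        return VoiceConfig(name, True, None)
    # Assumption: a missing key disables voice but never blocks startup.
    return VoiceConfig(name, False, f"{key} is not set; voice is disabled")
```

Calling `resolve_voice("gemini", os.environ)` without `GEMINI_API_KEY` exported would yield a config with `voice_enabled=False`, which corresponds to the banner-and-text-input mode described above.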
Camera
- Front/back camera toggle (mobile)
- Captures one JPEG frame per second
- Frames are injected into the conversation as images
- Auto-captures on "memorable moments" (laughter, excitement)
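The one-frame-per-second capture and the image injection can be illustrated with a short sketch. The names `FrameSampler` and `jpeg_to_image_message`, and the shape of the message dictionary, are assumptions made for illustration; the real capture happens in the browser and the actual message format depends on the voice adapter.

```python
import base64
from typing import Optional

CAPTURE_INTERVAL_S = 1.0  # one frame per second, as listed above


class FrameSampler:
    """Lets a frame through at most once per interval."""

    def __init__(self, interval_s: float = CAPTURE_INTERVAL_S) -> None:
        self.interval_s = interval_s
        self._last: Optional[float] = None  # time of the last accepted frame

    def should_capture(self, now_s: float) -> bool:
        if self._last is None or now_s - self._last >= self.interval_s:
            self._last = now_s
            return True
        return False


def jpeg_to_image_message(jpeg_bytes: bytes) -> dict:
    """Wrap JPEG bytes as a base64 data URL in a hypothetical image message."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return {"type": "image", "url": f"data:image/jpeg;base64,{encoded}"}
```

A caller would pass a monotonic timestamp to `should_capture` for each available frame and forward only the accepted frames to the conversation.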
Web UI
The voice server runs at http://localhost:3333 and provides a full-featured web interface.
Tabs
| Tab | Features |
|---|---|
| Chat | Real-time conversation, voice controls, camera, agent vitals, file system viewer |
| Skills | Browse and install skills from the marketplace |
| Integrations | Connect Composio services (Gmail, Calendar, Slack, GitHub) |
| Communication | Telegram bot setup, WhatsApp connection, phone/SMS webhook |
| SkillFlows | Visual workflow builder — chain skills into multi-step flows |
| Scheduler | Create cron jobs — run prompts on a schedule |
| Settings | Model selection, API keys, custom base URL — saves to .env and agent.yaml |
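As an illustration of what saving a setting to .env involves, the sketch below replaces an existing key or appends a new one while leaving comments and other entries untouched. `upsert_env` is a hypothetical helper, not part of GitClaw, and it assumes simple `KEY=value` lines without quoting or multi-line values.

```python
def upsert_env(text: str, key: str, value: str) -> str:
    """Return .env content with `key` set to `value` (hypothetical helper)."""
    out = []
    replaced = False
    for line in text.splitlines():
        is_comment = line.lstrip().startswith("#")
        name = line.split("=", 1)[0].strip()
        if not is_comment and "=" in line and name == key:
            out.append(f"{key}={value}")  # overwrite the existing entry
            replaced = True
        else:
            out.append(line)              # keep comments and other keys as-is
    if not replaced:
        out.append(f"{key}={value}")      # key was absent: append it
    return "\n".join(out) + "\n"
```

Working on the file content as a string keeps the function easy to test; reading and writing the file itself is left to the caller.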
Agent Vitals
Real-time metrics displayed in the Chat tab:
- CPU — Delta-based percentage (blue)
- Memory — RSS in MB (orange)
- Tokens — Total tokens used in session (purple)
- Uptime — Server uptime synced from backend (green)
- Pulse — CPU wave visualization
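A delta-based CPU percentage compares two samples rather than reading an instantaneous value: the CPU time consumed between the samples is divided by the wall-clock time that elapsed. The sketch below shows that calculation along with the RSS conversion to MB. It is an assumption about how such metrics are typically derived, not GitClaw's implementation; `CpuSample`, `cpu_percent`, `rss_mb`, and `sample_now` are hypothetical names, and 1 MB is taken as 1024 * 1024 bytes.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    cpu_seconds: float   # cumulative CPU time used by the process
    wall_seconds: float  # monotonic wall-clock time of the sample


def sample_now() -> CpuSample:
    """Take a sample for the current process."""
    return CpuSample(time.process_time(), time.monotonic())


def cpu_percent(prev: CpuSample, curr: CpuSample, cores: int = 1) -> float:
    """CPU used between two samples, as a percentage clamped to 0..100."""
    elapsed = curr.wall_seconds - prev.wall_seconds
    if elapsed <= 0:
        return 0.0  # samples out of order or identical: nothing to report
    used = curr.cpu_seconds - prev.cpu_seconds
    return max(0.0, min(100.0, 100.0 * used / (elapsed * cores)))


def rss_mb(rss_bytes: int) -> float:
    """Resident set size in MB, rounded to one decimal place."""
    return round(rss_bytes / (1024 * 1024), 1)
```

A server would keep the previous sample, call `sample_now()` on each tick, and push `cpu_percent(prev, curr)` to the UI before storing the new sample as `prev`.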
Mobile Responsive
Below a viewport width of 700px, the UI switches to a mobile layout:
- Tabs become a scrollable horizontal strip
- Camera panel stacks vertically
- Controls have 44px touch targets
- Sidebar overlays instead of pushing content
- All views stack vertically
