Writing Article

Splitting the Stack and Making Setup Actually Work

ยท 4 min read

Five days ago, I published the first post about GPUShare. Since then, the project has moved from "technically functional" to "something people can actually install without messaging me for help."

The biggest change is architectural. The other changes make it usable.

The Backend Split

The original FastAPI server did everything: handled user requests, ran inference, rendered Blender frames, monitored the GPU, and managed billing. That meant every route was exposed to the internet via the Cloudflare Tunnel, including hardware operations that had no business being public.

I've split it into two services:

Middleware (port 8000): the public-facing API. Handles authentication, billing, user management, Stripe integration, and response caching. This is what the Cloudflare Tunnel exposes.

Backend (port 8080): internal-only service that never touches the internet. Runs Ollama inference, Blender rendering, R2 uploads, GPU monitoring, and Tapo energy tracking. All routes require an X-Internal-Token header that only the middleware knows.

The middleware proxies hardware operations to the backend using a shared INTERNAL_SECRET. Inference streams are forwarded byte-for-byte using Server-Sent Events, so the chat interface still gets real-time token streaming.

This also let me add server-side caching. Health checks are cached for 15 seconds, model lists for 30 seconds, and the model picker data for 6 hours. The frontend used to make 9 separate API calls to load the account page; now it makes one aggregated request using asyncio.gather.

One-Click Installers

The setup process was a 12-step README that assumed you knew what PostgreSQL was and had opinions about Docker networking. That's fine for developers. It's not fine if you want anyone else to try this.

I rewrote the setup scripts from scratch. There are now three ways to install:

macOS/Linux:

curl -fsSL https://raw.githubusercontent.com/Slaymish/GPUShare/main/install.sh | bash

Windows (PowerShell):

irm https://raw.githubusercontent.com/Slaymish/GPUShare/main/setup.ps1 | iex

Windows (double-click): Download setup.bat and run it.

All three scripts do the same thing:

  • Detect your GPU (NVIDIA, AMD, Intel, or Apple Silicon)
  • Check VRAM and recommend a model size (4B for 4GB, 8B for 8GB, 14B for 16GB, 32B for 24GB+)
  • Estimate GPU wattage based on detected hardware
  • Install dependencies (Ollama, cloudflared, Docker if missing)
  • Generate secure secrets (JWT, internal token, bootstrap token)
  • Run health checks before and after startup
  • Verify the server is reachable

There's a --quick flag for completely non-interactive installs with sensible defaults, and a --dry-run flag to preview what would happen without executing anything.

GPU wattage is now auto-detected at runtime. NVIDIA GPUs query nvidia-smi for the actual power limit. Apple Silicon and AMD GPUs use chip-family estimates. You can still override it in .env if the detection is wrong, but most people won't need to.

OpenAI-Compatible Tool Calling

OpenCode (and other OpenAI-compatible clients) can now use function calling through GPUShare. The middleware passes through tools, tool_choice, tool_calls, and tool_call_id to both Ollama and OpenRouter backends.

Models were returning raw XML <tool_call> tags instead of structured JSON. That's fixed. Function calls now work as expected.

Bug Fixes

The account page pie chart was showing incorrect costs. Cloud inference was being calculated from only the first 50 usage logs instead of the full total, so anyone with more than 50 entries saw wrong numbers. The backend now tracks cloud_inference_usage as a separate ledger type and returns the real total.

The donut chart was deriving render costs from totalUsed - inferenceCost, which included non-usage ledger entries (top-ups, adjustments, invoice payments). It now uses per-type ledger sums queried from the backend, matching the admin dashboard logic.

CORS errors from the Vercel deployment are fixed. Models with vision support now correctly filter file attachments by capability.

What's Next

The core platform works. Setup is straightforward. Costs are metered correctly. The architecture is cleaner.

Next priorities:

  • Usage analytics dashboard (per-model breakdowns, hourly patterns)
  • Render queue improvements (priority scheduling, better progress feedback)
  • Multi-GPU support for people running more than one card

The code is open source at github.com/Slaymish/GPUShare. If you have a GPU sitting idle and people who'd use it, try it out.