QUICKSTART
Get an inference call working in 60 seconds. All you need is a $SHIVER API key from the dashboard.
curl https://api.shrk.cloud/v1/chat \
-H "Authorization: Bearer YOUR_SHIVER_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "shark-llama-3.3-70b",
"messages": [{"role":"user","content":"hello shiver"}]
}'
The response shape matches OpenAI's chat completions API so you can drop SHARK in as a one-line provider swap — see Migrate from OpenAI below for a single-line change for the OpenAI Python and JS SDKs.
Authentication
Every request needs a bearer key in the Authorization header. Keys are issued on the dashboard, prefixed shvr_, and scoped to a single billing wallet. Keys can be rotated and revoked at any time.
Three key tiers, all priced the same per token but with different concurrency:
shvr_dev_— sandbox key. 10 concurrent requests. No billing — gated to a $5/day cap.shvr_pro_— production key. 200 concurrent requests. Pay-per-token in $SHIVER.shvr_ent_— enterprise key. Custom concurrency, dedicated routing tier, optional SLAs.
For client-side use, prefer the gateway's session-token endpoint (POST /v1/sessions) which mints a short-lived browser-safe token from a server-held parent key.
Migrate from OpenAI in one line
Python (openai 1.x):
from openai import OpenAI
client = OpenAI(
base_url="https://api.shrk.cloud/v1",
api_key="YOUR_SHIVER_KEY",
)
r = client.chat.completions.create(
model="shark-llama-3.3-70b",
messages=[{"role":"user","content":"hi"}],
)
print(r.choices[0].message.content)
TypeScript / Node:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.shrk.cloud/v1",
apiKey: process.env.SHIVER_KEY,
});
const r = await client.chat.completions.create({
model: "shark-mistral-24b-cold",
messages: [{ role: "user", content: "hi" }],
});
That's literally it. Everything else — streaming, function calling, JSON mode, response formatting — works identically.
Chat completions
Endpoint: POST https://api.shrk.cloud/v1/chat
Request body:
model— one of the published Shark model slugs (see below).messages— OpenAI-style array of{role, content}. Supportssystem,user,assistant, andtoolroles.stream— boolean; if true the response is SSE-streamed (see Streaming).temperature,top_p,max_tokens,frequency_penalty,presence_penalty— standard sampling controls.tools,tool_choice— function-calling spec (see Function calling).response_format— set{"type":"json_object"}to constrain the model to valid JSON output.
Response shape matches OpenAI chat.completions exactly, plus two extra response headers:
x-shiver-node— pseudonymous ID of the node that served you (useful for debugging routing).x-shiver-cost— the $SHIVER amount deducted for this request.
Streaming responses (SSE)
Set stream: true to get token-by-token output as an text/event-stream. Each chunk is a JSON line prefixed with data: , terminated by data: [DONE].
data: {"choices":[{"delta":{"content":"hel"}}]}
data: {"choices":[{"delta":{"content":"lo"}}]}
data: {"choices":[{"finish_reason":"stop"}]}
data: [DONE]
The OpenAI Python and JS SDKs handle this transparently — no code change beyond setting stream=True.
Function calling
Define tools in the same format as OpenAI's tools API. The model returns tool_calls in the assistant message; you execute the function on your end and feed the result back as a tool-role message.
{
"model": "shark-llama-3.3-70b",
"messages": [{"role":"user","content":"weather in eindhoven"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type":"string"}},
"required": ["city"]
}
}
}]
}
Vision (beta)
Vision input is in beta on shark-llama-3.3-70b-vision and shark-gemma-9b-vision. Pass image URLs or base64 data inline in the user message content array:
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "what's in this image?"},
{"type": "image_url", "image_url": {"url": "https://…"}}
]
}]
Pricing: vision tokens are billed at 2× the text-token rate.
Embeddings
Endpoint: POST https://api.shrk.cloud/v1/embeddings. Currently serving shark-embed-large-v1 (1536 dims, multilingual). Same request shape as OpenAI text-embedding-3-large.
{
"model": "shark-embed-large-v1",
"input": ["text to embed", "another text"]
}
Available models
shark-llama-3.3-70b— flagship reasoning + code (beta). 128k context. Best for complex tasks.shark-llama-3.3-70b-vision— same base + vision adapter (beta).shark-mistral-24b-cold— fast, cheap, multilingual. 32k context. Best for throughput.shark-gemma-9b— lightweight, edge-friendly. 8k context. Best for cost-sensitive workloads.shark-gemma-9b-vision— same + vision (beta).shark-embed-large-v1— embeddings, 1536 dims.
All weights also live on HuggingFace under shark-shiver/ — you're free to download and self-host if you don't want to use the Shiver Network.
Sampling parameters
Sensible defaults if not set: temperature=0.7, top_p=1.0, max_tokens=4096. The vision and embedding endpoints ignore most sampling params.
Run a node
One-liner install (Linux + macOS):
curl -fsSL https://shrk.cloud/install | sh
The installer will:
- Detect your GPU(s) and propose a safe thermal cap (default 70 °C).
- Ask you for the Base address that should receive $SHIVER payouts.
- Pull the latest node binary and register your endpoint with the routing layer.
- Start serving inference in the background as a systemd service.
Monitor with shrk status. Stop with shrk stop. No staking, no penalty for stopping.
Multi-GPU & advanced node configs
The client supports a few patterns out of the box:
- Multi-GPU on a single host — the client shards models across visible CUDA devices automatically. Override with
SHRK_VISIBLE_DEVICES=0,2. - Model whitelist — by default the client serves every model that fits in VRAM. Restrict with
SHRK_MODELS=shark-mistral-24b-cold,shark-gemma-9b. - Multi-host pool — point the client at a shared cluster controller (
SHRK_CONTROLLER=tcp://10.0.0.1:9100) to serve a single virtual node across many physical machines. - Schedule — only serve during off-peak hours:
SHRK_SCHEDULE="22:00-08:00". Useful for shared workstations. - Custom thermal target —
SHRK_TEMP_MAX=68. The router will avoid sending requests if the GPU is within 2 °C of the cap.
All config can also live in /etc/shrk/config.toml; the env vars override it.
Billing & $SHIVER
Inference is priced per million tokens, denominated in $SHIVER. Roughly 30% cheaper than centralized providers at the same model size — and stays cheap because new node supply keeps coming online.
Indicative pricing (per 1M tokens, may vary with $SHIVER price):
shark-llama-3.3-70b— ~3500 $SHIVER input / ~7000 $SHIVER outputshark-mistral-24b-cold— ~800 $SHIVER input / ~1600 $SHIVER outputshark-gemma-9b— ~300 $SHIVER input / ~600 $SHIVER outputshark-embed-large-v1— ~150 $SHIVER per 1M tokens
The protocol takes a 1% fee on every inference call. That fee is converted to $SHIVER and burned, creating deflationary pressure tied directly to network usage.
Limits & SLAs
- No per-minute rate limits on production keys — you're throughput-capped only by available nodes for your model.
- No content guardrails — outputs are unaligned. You are responsible for usage.
- Latency — TTFT typically under 600ms when routed to a nearby cold node.
- Uptime — peer-to-peer with N=3 redundancy per model; one node failure is invisible to callers.
- Max request size — 1 MB (covers most use cases including vision data URIs).
- Max context window — model-dependent, see Available models.
Error codes
400— malformed request (bad JSON, unknown model, missing field).401— invalid or revoked API key.402— insufficient $SHIVER balance on the billing wallet.409— concurrent-request cap exceeded for your key tier.429— gateway rate-limit (only triggers for abusive patterns, not normal use).503— no node currently available for the requested model. Retry or pick a different model.
Every error response includes a code field, a human-readable message, and an x-shiver-request-id header you can include in bug reports.
SDKs
Because the API is OpenAI-compatible, the official OpenAI SDKs work out of the box with base_url overridden. Dedicated SDKs are in development:
- Python —
pip install shrk(drop-in superset ofopenai, adds node-pinning + $SHIVER balance helpers). - TypeScript / Node —
npm i @shrk/sdk. - Go —
go get cloud.shrk.io/sdk-go. - Rust —
cargo add shrk.
Webhooks
Subscribe to events at the dashboard. Available event types:
node.online/node.offline— for your owned nodes.payout.epoch— when an epoch payout is finalized for your wallet.balance.low— when your billing wallet drops below a configurable threshold.request.failed— for monitoring inference reliability at scale.
Every webhook is signed with an HMAC-SHA256 header (x-shrk-signature) so you can verify origin.
Community
Talk to the team and other node operators in the SHARK Telegram and Discord. Bug reports and pull requests welcome on the open-source repos. Anyone shipping with the API or running a node is welcome to a free monthly office-hours call.
- Telegram —
t.me/sharknet - Discord —
discord.gg/sharknet - GitHub —
github.com/shrk-cloud - Status page —
status.shrk.cloud