Docs — SHARK

QUICKSTART

Get an inference call working in 60 seconds. All you need is a $SHIVER API key from the dashboard.

curl https://api.shrk.cloud/v1/chat \
  -H "Authorization: Bearer YOUR_SHIVER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "shark-llama-3.3-70b",
    "messages": [{"role":"user","content":"hello shiver"}]
  }'

The response shape matches OpenAI's chat completions API so you can drop SHARK in as a one-line provider swap — see Migrate from OpenAI below for a single-line change for the OpenAI Python and JS SDKs.

Authentication

Every request needs a bearer key in the Authorization header. Keys are issued on the dashboard, prefixed shvr_, and scoped to a single billing wallet. Keys can be rotated and revoked at any time.

Three key tiers, all priced the same per token but with different concurrency:

shvr_dev_ — sandbox key. 10 concurrent requests. No billing — gated to a $5/day cap.
shvr_pro_ — production key. 200 concurrent requests. Pay-per-token in $SHIVER.
shvr_ent_ — enterprise key. Custom concurrency, dedicated routing tier, optional SLAs.

For client-side use, prefer the gateway's session-token endpoint (POST /v1/sessions) which mints a short-lived browser-safe token from a server-held parent key.

Migrate from OpenAI in one line

Python (openai 1.x):

from openai import OpenAI
client = OpenAI(
    base_url="https://api.shrk.cloud/v1",
    api_key="YOUR_SHIVER_KEY",
)
r = client.chat.completions.create(
    model="shark-llama-3.3-70b",
    messages=[{"role":"user","content":"hi"}],
)
print(r.choices[0].message.content)

TypeScript / Node:

import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "https://api.shrk.cloud/v1",
  apiKey: process.env.SHIVER_KEY,
});
const r = await client.chat.completions.create({
  model: "shark-mistral-24b-cold",
  messages: [{ role: "user", content: "hi" }],
});

That's literally it. Everything else — streaming, function calling, JSON mode, response formatting — works identically.

Chat completions

Endpoint: POST https://api.shrk.cloud/v1/chat

Request body:

model — one of the published Shark model slugs (see below).
messages — OpenAI-style array of {role, content}. Supports system, user, assistant, and tool roles.
stream — boolean; if true the response is SSE-streamed (see Streaming).
temperature, top_p, max_tokens, frequency_penalty, presence_penalty — standard sampling controls.
tools, tool_choice — function-calling spec (see Function calling).
response_format — set {"type":"json_object"} to constrain the model to valid JSON output.

Response shape matches OpenAI chat.completions exactly, plus two extra response headers:

x-shiver-node — pseudonymous ID of the node that served you (useful for debugging routing).
x-shiver-cost — the $SHIVER amount deducted for this request.

Streaming responses (SSE)

Set stream: true to get token-by-token output as an text/event-stream. Each chunk is a JSON line prefixed with data: , terminated by data: [DONE].

data: {"choices":[{"delta":{"content":"hel"}}]}
data: {"choices":[{"delta":{"content":"lo"}}]}
data: {"choices":[{"finish_reason":"stop"}]}
data: [DONE]

The OpenAI Python and JS SDKs handle this transparently — no code change beyond setting stream=True.

Function calling

Define tools in the same format as OpenAI's tools API. The model returns tool_calls in the assistant message; you execute the function on your end and feed the result back as a tool-role message.

{
  "model": "shark-llama-3.3-70b",
  "messages": [{"role":"user","content":"weather in eindhoven"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type":"string"}},
        "required": ["city"]
      }
    }
  }]
}

Vision (beta)

Vision input is in beta on shark-llama-3.3-70b-vision and shark-gemma-9b-vision. Pass image URLs or base64 data inline in the user message content array:

"messages": [{
  "role": "user",
  "content": [
    {"type": "text", "text": "what's in this image?"},
    {"type": "image_url", "image_url": {"url": "https://…"}}
  ]
}]

Pricing: vision tokens are billed at 2× the text-token rate.

Embeddings

Endpoint: POST https://api.shrk.cloud/v1/embeddings. Currently serving shark-embed-large-v1 (1536 dims, multilingual). Same request shape as OpenAI text-embedding-3-large.

{
  "model": "shark-embed-large-v1",
  "input": ["text to embed", "another text"]
}

Available models

shark-llama-3.3-70b — flagship reasoning + code (beta). 128k context. Best for complex tasks.
shark-llama-3.3-70b-vision — same base + vision adapter (beta).
shark-mistral-24b-cold — fast, cheap, multilingual. 32k context. Best for throughput.
shark-gemma-9b — lightweight, edge-friendly. 8k context. Best for cost-sensitive workloads.
shark-gemma-9b-vision — same + vision (beta).
shark-embed-large-v1 — embeddings, 1536 dims.

All weights also live on HuggingFace under shark-shiver/ — you're free to download and self-host if you don't want to use the Shiver Network.

Sampling parameters

Sensible defaults if not set: temperature=0.7, top_p=1.0, max_tokens=4096. The vision and embedding endpoints ignore most sampling params.

Run a node

One-liner install (Linux + macOS):

curl -fsSL https://shrk.cloud/install | sh

The installer will:

Detect your GPU(s) and propose a safe thermal cap (default 70 °C).
Ask you for the Base address that should receive $SHIVER payouts.
Pull the latest node binary and register your endpoint with the routing layer.
Start serving inference in the background as a systemd service.

Monitor with shrk status. Stop with shrk stop. No staking, no penalty for stopping.

Multi-GPU & advanced node configs

The client supports a few patterns out of the box:

Multi-GPU on a single host — the client shards models across visible CUDA devices automatically. Override with SHRK_VISIBLE_DEVICES=0,2.
Model whitelist — by default the client serves every model that fits in VRAM. Restrict with SHRK_MODELS=shark-mistral-24b-cold,shark-gemma-9b.
Multi-host pool — point the client at a shared cluster controller (SHRK_CONTROLLER=tcp://10.0.0.1:9100) to serve a single virtual node across many physical machines.
Schedule — only serve during off-peak hours: SHRK_SCHEDULE="22:00-08:00". Useful for shared workstations.
Custom thermal target — SHRK_TEMP_MAX=68. The router will avoid sending requests if the GPU is within 2 °C of the cap.

All config can also live in /etc/shrk/config.toml; the env vars override it.

Billing & $SHIVER

Inference is priced per million tokens, denominated in $SHIVER. Roughly 30% cheaper than centralized providers at the same model size — and stays cheap because new node supply keeps coming online.

Indicative pricing (per 1M tokens, may vary with $SHIVER price):

shark-llama-3.3-70b — ~3500 $SHIVER input / ~7000 $SHIVER output
shark-mistral-24b-cold — ~800 $SHIVER input / ~1600 $SHIVER output
shark-gemma-9b — ~300 $SHIVER input / ~600 $SHIVER output
shark-embed-large-v1 — ~150 $SHIVER per 1M tokens

The protocol takes a 1% fee on every inference call. That fee is converted to $SHIVER and burned, creating deflationary pressure tied directly to network usage.

Limits & SLAs

No per-minute rate limits on production keys — you're throughput-capped only by available nodes for your model.
No content guardrails — outputs are unaligned. You are responsible for usage.
Latency — TTFT typically under 600ms when routed to a nearby cold node.
Uptime — peer-to-peer with N=3 redundancy per model; one node failure is invisible to callers.
Max request size — 1 MB (covers most use cases including vision data URIs).
Max context window — model-dependent, see Available models.

Error codes

400 — malformed request (bad JSON, unknown model, missing field).
401 — invalid or revoked API key.
402 — insufficient $SHIVER balance on the billing wallet.
409 — concurrent-request cap exceeded for your key tier.
429 — gateway rate-limit (only triggers for abusive patterns, not normal use).
503 — no node currently available for the requested model. Retry or pick a different model.

Every error response includes a code field, a human-readable message, and an x-shiver-request-id header you can include in bug reports.

SDKs

Because the API is OpenAI-compatible, the official OpenAI SDKs work out of the box with base_url overridden. Dedicated SDKs are in development:

Python — pip install shrk (drop-in superset of openai, adds node-pinning + $SHIVER balance helpers).
TypeScript / Node — npm i @shrk/sdk.
Go — go get cloud.shrk.io/sdk-go.
Rust — cargo add shrk.

Webhooks

Subscribe to events at the dashboard. Available event types:

node.online / node.offline — for your owned nodes.
payout.epoch — when an epoch payout is finalized for your wallet.
balance.low — when your billing wallet drops below a configurable threshold.
request.failed — for monitoring inference reliability at scale.

Every webhook is signed with an HMAC-SHA256 header (x-shrk-signature) so you can verify origin.

Community

Talk to the team and other node operators in the SHARK Telegram and Discord. Bug reports and pull requests welcome on the open-source repos. Anyone shipping with the API or running a node is welcome to a free monthly office-hours call.

Telegram — t.me/sharknet
Discord — discord.gg/sharknet
GitHub — github.com/shrk-cloud
Status page — status.shrk.cloud