Bring Your Own Model

Paste a HuggingFace model id. Watch it download. The moment it lands, every cartridge that knows how to use it picks it up — embeddings, transcription, vision, generation. Apple’s MLX, llama.cpp, Candle: pick the runtime, pick the size, pick the quant.

The 0.5B model that fits in a few hundred megabytes works. The 70B beast that needs your whole GPU works. Same surface, same right-click, same Transmute menu — only the speed and the answer change.

This page covers the user-facing side of model management: model specs, supported backends, model-family detection, where models live on disk, structured generation, and what LlmGenerationRequest lets you tune.

For the cartridge-developer side (declaring an LLM cap, building a LlmGenerationRequest, parsing LlmStreamMessages), see the SDK Reference.

Model Specs

A model is referred to by a model spec string. The most common form is HuggingFace:

hf:OWNER/REPO
hf:OWNER/REPO?include=*PATTERN*

Examples:

hf:meta-llama/Llama-3.2-1B-Instruct
hf:MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF?include=*Q8_0*
hf:mlx-community/Llama-3.2-1B-Instruct-4bit

The include= filter is matched against filenames in the repo, which is useful for picking a specific quantization out of a multi-file GGUF release. Without a filter, the cartridge picks a default, usually the largest available quant that fits the runtime’s preferences.
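The spec format above can be illustrated with a small parser. This is a minimal sketch, not the SDK's implementation: ModelSpec, parse_model_spec, and select_files are hypothetical names, and it assumes the include pattern follows shell-style wildcard rules (as the *Q8_0* example suggests).

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List, Optional
from urllib.parse import parse_qs


@dataclass
class ModelSpec:
    """Hypothetical parsed form of an 'hf:' model spec string."""
    owner: str
    repo: str
    include: Optional[str] = None  # wildcard pattern such as "*Q8_0*"


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'hf:OWNER/REPO?include=PATTERN' into owner, repo, and filter."""
    if not spec.startswith("hf:"):
        raise ValueError(f"unsupported model spec scheme: {spec!r}")
    body, _, query = spec[len("hf:"):].partition("?")
    owner, sep, repo = body.partition("/")
    if not (owner and sep and repo):
        raise ValueError(f"expected hf:OWNER/REPO, got {spec!r}")
    # parse_qs returns a dict of lists; take the first include value if any.
    include = parse_qs(query).get("include", [None])[0]
    return ModelSpec(owner, repo, include)


def select_files(spec: ModelSpec, filenames: List[str]) -> List[str]:
    """Keep the files the include filter matches (all files if no filter).

    Assumption: the filter uses shell-style wildcard matching.
    """
    if spec.include is None:
        return list(filenames)
    return [name for name in filenames if fnmatch(name, spec.include)]
```

Given a repo listing that contains several quantizations, select_files narrows it to the filenames matching the pattern; a spec without include= leaves the listing untouched, which is where a cartridge's default choice would apply.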

The model spec string is also what cartridges send in LlmGenerationRequest.model_spec and what the dispatch rule routes on. Two cartridges that both serve cap:llm-inference;... will be ranked by how specifically they match the spec — a Mistral-aware cartridge that names the family in its tags will beat a generic one for Mistral models.

Backends

Three runtimes are first-class today:

| Backend | Source | Models |
|---|---|---|
| GGUF (llama.cpp bindings) | BACKEND_GGUF | Quantized GGUF models: Llama, Mistral, Qwen, Phi, and the long tail of *-GGUF repos on HuggingFace |
| MLX | BACKEND_MLX | Apple’s MLX: Apple Silicon native, mlx-community/* models |
| Candle | BACKEND_CANDLE | Rust-native inference for selected model families |

A cartridge advertises which backend it implements; the SDK exposes a backend_for_model_spec(model_spec) helper that classifies a spec into one of these. Cartridges that wrap a single backend use the helper to advertise only the model specs they can handle.
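To make the idea of classifying a spec concrete, here is a rough heuristic. It is not the logic of backend_for_model_spec, which this page does not describe; guess_backend is a hypothetical function, its return labels are not the BACKEND_* constant values, and the two rules are only inferred from the repo naming hints in the backend table.

```python
from typing import Optional


def guess_backend(model_spec: str) -> Optional[str]:
    """Hypothetical heuristic: infer a backend label from naming conventions.

    Returns "gguf", "mlx", or None when the spec carries no recognizable hint.
    """
    lowered = model_spec.lower()
    # Assumption: GGUF releases mention "gguf" in the repo name or the filter.
    if "gguf" in lowered:
        return "gguf"
    # Assumption: MLX conversions are published under the mlx-community owner.
    if lowered.startswith("hf:mlx-community/"):
        return "mlx"
    return None
```

A spec such as hf:meta-llama/Llama-3.2-1B-Instruct carries neither hint, so this sketch returns None; how the real helper treats such specs is not covered here.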

The standard cap URN constants the cartridge SDK provides correspond directly to these backends:

  • CAP_LLM_INFERENCE_GGUF
  • CAP_LLM_INFERENCE_MLX
  • CAP_LLM_INFERENCE_CANDLE
  • CAP_LLM_INFERENCE_CONSTRAINED — constrained generation (JSON Schema, regex, grammar, tool-call) routed to whichever backend supports the constraint

When you call a generic LLM cap, dispatch picks the backend automatically based on the model spec. When you call a backend-specific cap directly, you’ve named the runtime yourself.

Model Families

Different LLM families speak different prompt formats. Mistral uses [INST] ... [/INST]; Llama 3 uses header-based turns (<|start_header_id|>user<|end_header_id|>...); Phi uses <|user|> ... <|end|>; Qwen uses ChatML-style <|im_start|>user ... <|im_end|>. If a model receives the wrong turn markers, it doesn’t fail loudly — it generates plausible-looking but hallucinated multi-turn conversations.

MachineFabric handles this automatically. When a generation cap is dispatched, the host inspects the model spec string, detects the family from the path (mistral, llama, phi, qwen keywords), and picks the correct adapter — a small piece of code that knows the family’s prompt format, its conversation-boundary stop sequences, and how to parse its response.

Today the recognised families are:

| Family | Recognized by | Adapter handles |
|---|---|---|
| Mistral | mistral in path | [INST]/[/INST] formatting; stop on [INST], <<SYS>> |
| Llama 3 | llama in path | Header-based turns; stop on <\|start_header_id\|>, <\|eot_id\|> |
| Phi | phi in path | Phi-3 special tokens; stop on <\|user\|>, <\|end\|> |
| Qwen | qwen in path | ChatML format; stop on <\|im_start\|>, <\|im_end\|> |
| (fallback) | none of the above | Generic User:/Assistant: formatting |

New families are added by adding family detection in the host, defining the adapter, and adding the corresponding media URN entries. Cartridges that wrap a model don’t have to know about any of this — the adapter is selected by the host before the cartridge sees the request.
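The detection and adapter selection described in this section can be sketched as follows. This is an illustration under assumptions, not the host's code: detect_family and format_user_turn are hypothetical names, matching is assumed to be a case-insensitive substring check in the order shown, the fallback's stop sequences are not listed on this page (left empty here), and the single-turn templates are approximations that ignore system prompts and begin-of-text tokens.

```python
from typing import Dict, List, Tuple

# Keywords and stop sequences are taken from the family table above.
FAMILY_KEYWORDS: List[Tuple[str, str]] = [
    ("mistral", "mistral"),
    ("llama", "llama3"),
    ("phi", "phi"),
    ("qwen", "qwen"),
]

STOP_SEQUENCES: Dict[str, List[str]] = {
    "mistral": ["[INST]", "<<SYS>>"],
    "llama3": ["<|start_header_id|>", "<|eot_id|>"],
    "phi": ["<|user|>", "<|end|>"],
    "qwen": ["<|im_start|>", "<|im_end|>"],
    "generic": [],  # not specified on this page
}


def detect_family(model_spec: str) -> str:
    """Return the first family whose keyword appears in the spec's path."""
    path = model_spec.split("?", 1)[0].lower()  # ignore the include filter
    for keyword, family in FAMILY_KEYWORDS:
        if keyword in path:
            return family
    return "generic"


def format_user_turn(family: str, prompt: str) -> str:
    """Wrap one user message in the family's turn markers (approximate)."""
    if family == "mistral":
        return f"[INST] {prompt} [/INST]"
    if family == "llama3":
        return (
            "<|start_header_id|>user<|end_header_id|>\n\n"
            f"{prompt}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        )
    if family == "phi":
        return f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n"
    if family == "qwen":
        return f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    return f"User: {prompt}\nAssistant:"
```

One caveat of a plain substring check: a repo name containing "dolphin" also contains "phi", so a naive matcher like this one would misclassify it. A production detector would likely need a stricter rule.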

Generation Parameters

A generation request is a typed payload (LlmGenerationRequest, carried as media:llm-generation-request;json;record). The fields a cartridge can read:

| Field | Purpose |
|---|---|
| prompt | The user-facing input text. Required. |
| model_spec | The model spec string (HuggingFace id with optional include filter). Required. |
| request_type | Generate, GetVocab, or GetInfo. Defaults to Generate. |
| system_prompt | Optional system prompt prepended ahead of the user prompt. |
| max_tokens, temperature | Output-length cap and sampling temperature. |
| top_k, top_p, min_p | Truncation/nucleus sampling. |
| seed | Optional integer seed for reproducible sampling. |
| repeat_penalty | Default 1.1. Penalises recently generated tokens to break repetition loops. |
| stop_sequences | Additional caller-supplied stop strings, merged with the family adapter’s. |
| chat_template | Optional override for the model’s chat template. |
| max_context_length, batch_size | Runtime-tuning knobs. |
| rope_freq_base, rope_freq_scale | RoPE scaling for context-extended models. |
| grammar, json_schema, constraint | See Structured Generation. |
| hf_token | HuggingFace token if the model needs auth at download time. |

Defaults are applied by LlmGenerationRequest::with_defaults() in the SDK; callers only set what they want to override.
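The "set only what you override" pattern can be mirrored in a small dataclass. This is a sketch, not the SDK type: GenerationRequest and merge_stop_sequences are hypothetical, only a subset of fields is shown, and the JSON shape (field names as keys, unset fields omitted) is an assumption. The two defaults used, request_type Generate and repeat_penalty 1.1, come from the table above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class GenerationRequest:
    """Hypothetical stand-in for a subset of LlmGenerationRequest fields."""
    prompt: str
    model_spec: str
    request_type: str = "Generate"   # default per the field table
    system_prompt: Optional[str] = None
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    seed: Optional[int] = None
    repeat_penalty: float = 1.1      # default per the field table
    stop_sequences: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize, dropping unset fields (assumed wire shape)."""
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload)


def merge_stop_sequences(adapter_stops: List[str], caller_stops: List[str]) -> List[str]:
    """Combine adapter and caller stop strings, preserving order, no duplicates."""
    merged: List[str] = []
    for stop in [*adapter_stops, *caller_stops]:
        if stop not in merged:
            merged.append(stop)
    return merged
```

A caller that only wants a lower temperature would write GenerationRequest(prompt="Hi", model_spec="hf:meta-llama/Llama-3.2-1B-Instruct", temperature=0.2) and leave everything else at its default.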

Streaming Responses

Generation responses arrive as a stream of LlmStreamMessage values, one per line of media:llm-text-stream;ndjson:

| Message | Carries |
|---|---|
| Token | A single generated token (or text fragment), for live display |
| Status | Progress info (preparing, generating, finalising) |
| Complete | End-of-generation marker with final stats (tokens generated, time spent) |
| ToolRequest | Constrained tool-call output: the model has produced a structured tool invocation |
| Error | A failure during generation; the stream ends after this |

Cartridges that consume the stream can render tokens live, count throughput, and surface tool calls back to whatever orchestrates the conversation. The MachineFabric UI uses exactly the same stream — the live progress shown during a generation comes from the Token and Status messages.
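Consuming such a stream amounts to reading one JSON object per line and branching on the message kind. The sketch below assumes a wire shape that this page does not specify: each line is an object with a "type" key naming the variant, Token messages carry a "text" key, and Error messages carry a "message" key. The function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_messages(lines: Iterable[str]) -> Iterator[Dict]:
    """Decode an NDJSON stream: one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate Token fragments until Complete; raise on Error.

    Assumed message shape: {"type": "Token", "text": "..."} and so on.
    Status and ToolRequest messages are ignored in this sketch.
    """
    parts = []
    for msg in iter_messages(lines):
        kind = msg.get("type")
        if kind == "Token":
            parts.append(msg["text"])
        elif kind == "Error":
            raise RuntimeError(msg.get("message", "generation failed"))
        elif kind == "Complete":
            break
    return "".join(parts)
```

A live UI would act on each Token as it arrives rather than collecting them; the loop structure stays the same.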

Structured Generation

For structured outputs the request can carry a constraint:

| Constraint | Carries |
|---|---|
| JsonSchema | A JSON Schema describing the expected output shape |
| Regex | A regular expression the output must match |
| Grammar | A formal grammar (Lark-style) the output must conform to |
| ToolCall | A list of ToolDefinitions the model can invoke |

Constrained generation is dispatched through CAP_LLM_INFERENCE_CONSTRAINED. The runtime enforces the constraint at sampling time — the model cannot generate output that violates the constraint, so structured outputs (JSON, tool calls, classifications) come back well-formed without a parser-and-retry loop.
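Enforcing a constraint at sampling time means masking out every token that would take the output outside the allowed set before one is chosen. The toy below shows the idea for a "pick one of these options" constraint. It is a simplified illustration, not how any particular backend implements it: the function names are hypothetical, and a fixed score table stands in for a real model's per-step logits.

```python
from typing import Dict, List


def allowed_tokens(prefix: str, options: List[str], vocab: List[str]) -> List[str]:
    """Tokens that keep prefix + token a prefix of at least one option."""
    return [
        tok for tok in vocab
        if tok and any(opt.startswith(prefix + tok) for opt in options)
    ]


def constrained_greedy(scores: Dict[str, float], options: List[str]) -> str:
    """Greedy decoding with a mask: take the best-scoring allowed token.

    'scores' is a fixed token-to-score table standing in for model logits.
    """
    vocab = list(scores)
    out = ""
    while out not in options:
        candidates = allowed_tokens(out, options, vocab)
        if not candidates:
            raise ValueError("vocabulary cannot spell any option")
        out += max(candidates, key=scores.__getitem__)
    return out
```

With scores = {"!": 0.95, "ma": 0.9, "yes": 0.5, "no": 0.4, "ybe": 0.1} and options ["maybe", "no"], unconstrained greedy decoding would start with "!", while constrained_greedy skips it, picks "ma", then "ybe", and returns "maybe". The output is one of the options by construction, so no parse-and-retry step is needed.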

Standard URN helpers in the SDK pre-build the cap URNs for common structured patterns:

  • generate_json_urn(lang_code) — generate a JSON record matching a schema
  • make_decision_urn(lang_code) — pick one of a set of options
  • make_multiple_decisions_urn(lang_code) — pick a subset of a set of options
  • llm_summarization_urn(lang_code) — summarize text
  • llm_codegeneration_urn(lang_code) — generate code

The full URN catalogue is in capdag.standard.caps.

Embeddings and Vision

Embeddings and image description follow the same shape — typed cap URNs, typed payloads:

| Cap constant | What it does |
|---|---|
| CAP_GENERATE_EMBEDDINGS | Produce a vector embedding from text or other input |
| CAP_EMBEDDINGS_DIMENSIONS | Report the embedding dimensions for a model |
| CAP_DESCRIBE_IMAGE | Generate a textual description of an image |

Embeddings are routed by model spec the same way generation is — paste a HuggingFace embeddings model id and any cartridge that knows how to run it picks it up.

Where Models Live

Downloaded models are cached on disk so they don’t have to be re-fetched. The exact directory layout is a host-implementation detail that may change; under the hood, MachineFabric tracks each downloaded model with a cap:model-status cap and exposes installation, removal, and disk-usage views in the UI. The cartridge SDK exposes the model lifecycle through these standard URN builders:

  • model_download_urn() — download a model by spec
  • model_list_urn() — list installed models
  • model_status_urn() — query a model’s installation status
  • model_contents_urn() — list the files in an installed model
  • model_availability_urn() — check whether a spec resolves to a downloadable model
  • model_path_urn() — get the on-disk path of an installed model

Cartridges that need a model file can resolve it through model_path_urn() rather than hardcoding paths. The host enforces that a model is fully downloaded before the path resolves; cartridges don’t have to handle partial downloads.

Mixing Local and Remote

Cartridges currently run in heavily sandboxed processes with no default network access. The protocol does not preclude a cartridge that proxies to a cloud service — a cartridge that declares network entitlements and gets reviewed for inclusion can route requests to a remote model with the same media URNs every other cartridge uses. Users can mix local and remote capabilities in the same pipeline with clear semantics: the dispatch rule, the planner graph, and the live progress all behave identically. The only difference is where the inference physically runs.

The default, today, is local. Network proxying is opt-in.