Bring Your Own Model
Paste a HuggingFace model id. Watch it download. The moment it lands, every cartridge that knows how to use it picks it up — embeddings, transcription, vision, generation. Apple’s MLX, llama.cpp, Candle: pick the runtime, pick the size, pick the quant.
The 0.5B model that fits in a few hundred megabytes works. The 70B beast that needs your whole GPU works. Same surface, same right-click, same Transmute menu — only the speed and the answer change.
This page covers the user-facing side of model management: model specs, supported backends, model-family detection, generation parameters, streaming responses, structured generation, embeddings and vision, and where models live on disk.
For the cartridge-developer side (declaring an LLM cap, building an LlmGenerationRequest, parsing LlmStreamMessages), see the SDK Reference.
Model Specs
A model is referred to by a model spec string. The most common form is HuggingFace:
hf:OWNER/REPO
hf:OWNER/REPO?include=*PATTERN*
Examples:
hf:meta-llama/Llama-3.2-1B-Instruct
hf:MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF?include=*Q8_0*
hf:mlx-community/Llama-3.2-1B-Instruct-4bit
The include= filter is matched against filenames in the repo, which is useful for picking a specific quantization out of a multi-file GGUF release. Without a filter, the cartridge picks a default, usually the largest available quant that fits the runtime’s preferences.
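To make the spec format concrete, here is a minimal Python sketch that splits a spec into owner, repo, and include filter, then applies the filter to a list of filenames. It is an illustration only: `ModelSpec`, `parse_model_spec`, and `select_files` are hypothetical names, the filenames are made up, and treating the pattern as a shell-style glob (via `fnmatch`) is an assumption about how the filter is matched.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional
from urllib.parse import parse_qs


@dataclass(frozen=True)
class ModelSpec:
    owner: str
    repo: str
    include: Optional[str] = None  # glob pattern such as "*Q8_0*"


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'hf:OWNER/REPO[?include=PATTERN]' into its parts."""
    if not spec.startswith("hf:"):
        raise ValueError(f"not a HuggingFace model spec: {spec!r}")
    path, _, query = spec[len("hf:"):].partition("?")
    owner, sep, repo = path.partition("/")
    if not (owner and sep and repo):
        raise ValueError(f"expected hf:OWNER/REPO, got {spec!r}")
    # parse_qs returns a dict of lists; take the first include value if any.
    include = parse_qs(query).get("include", [None])[0]
    return ModelSpec(owner, repo, include)


def select_files(spec: ModelSpec, filenames: List[str]) -> List[str]:
    """Keep only filenames matching the include filter (all files if absent)."""
    if spec.include is None:
        return list(filenames)
    return [name for name in filenames if fnmatchcase(name, spec.include)]


spec = parse_model_spec(
    "hf:MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF?include=*Q8_0*"
)
# Hypothetical repo listing, used only to exercise the filter.
files = [
    "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    "Mistral-7B-Instruct-v0.3.Q8_0.gguf",
    "README.md",
]
selected = select_files(spec, files)
```

With this listing, only the Q8_0 file survives the filter. A spec without `?include=` keeps every file, and choosing a default quant from that set is left to the cartridge.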
The model spec string is also what cartridges send in LlmGenerationRequest.model_spec and what the dispatch rule routes on. Two cartridges that both serve cap:llm-inference;... will be ranked by how specifically they match the spec — a Mistral-aware cartridge that names the family in its tags will beat a generic one for Mistral models.
Backends
Three runtimes are first-class today:
| Backend | Source | Models |
|---|---|---|
| GGUF (llama.cpp bindings) | BACKEND_GGUF | Quantized GGUF models: Llama, Mistral, Qwen, Phi, and the long tail of *-GGUF repos on HuggingFace |
| MLX | BACKEND_MLX | Apple’s MLX: Apple Silicon native, mlx-community/* models |
| Candle | BACKEND_CANDLE | Rust-native inference for selected model families |
A cartridge advertises which backend it implements; the SDK exposes a backend_for_model_spec(model_spec) helper that classifies a spec into one of these. Cartridges that wrap a single backend use the helper to advertise only the model specs they can handle.
The standard cap URN constants the cartridge SDK provides correspond directly to these backends:
- CAP_LLM_INFERENCE_GGUF
- CAP_LLM_INFERENCE_MLX
- CAP_LLM_INFERENCE_CANDLE
- CAP_LLM_INFERENCE_CONSTRAINED: constrained generation (JSON Schema, regex, grammar, tool-call) routed to whichever backend supports the constraint
When you ask a generic LLM cap, dispatch picks the backend automatically based on the model spec. When you call a backend-specific cap directly, you’ve named the runtime yourself.
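As a rough illustration of what classifying a spec into a backend could look like, the sketch below uses the naming conventions from the backend table (mlx-community/* repos for MLX, *-GGUF repos for llama.cpp). It is not the SDK's backend_for_model_spec helper: the function name `classify_backend`, the string values of the constants, and the fallback to Candle are all assumptions made for the example.

```python
# Assumed string values; the real constants may be defined differently.
BACKEND_GGUF = "gguf"
BACKEND_MLX = "mlx"
BACKEND_CANDLE = "candle"


def classify_backend(model_spec: str) -> str:
    """Guess a backend from repo naming conventions (heuristic sketch)."""
    body = model_spec[3:] if model_spec.startswith("hf:") else model_spec
    repo_path = body.split("?", 1)[0].lower()  # drop any ?include= filter
    owner, _, repo = repo_path.partition("/")
    if owner == "mlx-community":
        return BACKEND_MLX
    if "gguf" in repo:
        return BACKEND_GGUF
    # Assumption: anything else is treated as a Candle candidate.
    return BACKEND_CANDLE
```

A single-backend cartridge could use a check of this kind to advertise only the specs it can serve, for example by accepting a spec only when the classifier returns its own backend.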
Model Families
Different LLM families speak different prompt formats. Mistral uses [INST] ... [/INST]; Llama 3 uses header-based turns (<|start_header_id|>user<|end_header_id|>...); Phi uses <|user|> ... <|end|>; Qwen uses ChatML-style <|im_start|>user ... <|im_end|>. If a model receives the wrong turn markers, it doesn’t fail loudly — it generates plausible-looking but hallucinated multi-turn conversations.
MachineFabric handles this automatically. When a generation cap is dispatched, the host inspects the model spec string, detects the family from the path (mistral, llama, phi, qwen keywords), and picks the correct adapter — a small piece of code that knows the family’s prompt format, its conversation-boundary stop sequences, and how to parse its response.
Today the recognised families are:
| Family | Recognized by | Adapter handles |
|---|---|---|
| Mistral | mistral in path | [INST]/[/INST] formatting; stop on [INST], <<SYS>> |
| Llama 3 | llama in path | Header-based turns; stop on <\|start_header_id\|>, <\|eot_id\|> |
| Phi | phi in path | Phi-3 special tokens; stop on <\|user\|>, <\|end\|> |
| Qwen | qwen in path | ChatML format; stop on <\|im_start\|>, <\|im_end\|> |
| (fallback) | none of the above | Generic User:/Assistant: formatting |
New families are added by adding family detection in the host, defining the adapter, and adding the corresponding media URN entries. Cartridges that wrap a model don’t have to know about any of this — the adapter is selected by the host before the cartridge sees the request.
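The keyword-based detection described in this section can be sketched in a few lines of Python. The keywords and stop sequences come from the family table; everything else is an assumption for illustration, including the function names, the first-match-wins ordering, the empty stop list for the fallback, and the way caller-supplied stop sequences are merged.

```python
from typing import List, Optional, Tuple

# (keyword looked for in the lowercased spec, family name, stop sequences)
FAMILIES: List[Tuple[str, str, List[str]]] = [
    ("mistral", "Mistral", ["[INST]", "<<SYS>>"]),
    ("llama", "Llama 3", ["<|start_header_id|>", "<|eot_id|>"]),
    ("phi", "Phi", ["<|user|>", "<|end|>"]),
    ("qwen", "Qwen", ["<|im_start|>", "<|im_end|>"]),
]


def detect_family(model_spec: str) -> Tuple[str, List[str]]:
    """Return (family, stop sequences); the first matching keyword wins."""
    path = model_spec.lower()
    for keyword, family, stops in FAMILIES:
        if keyword in path:
            return family, list(stops)
    # Assumption: the fallback adapter contributes no stop sequences here.
    return "fallback", []


def effective_stop_sequences(
    model_spec: str, caller_stops: Optional[List[str]] = None
) -> List[str]:
    """Family stop sequences first, then caller extras, without duplicates."""
    _, stops = detect_family(model_spec)
    for stop in caller_stops or []:
        if stop not in stops:
            stops.append(stop)
    return stops
```

Plain substring matching is order-sensitive: a repo name that contains more than one keyword resolves to whichever family is checked first, so a real implementation may need stricter matching than this sketch.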
Generation Parameters
A generation request is a typed payload (LlmGenerationRequest, carried as media:llm-generation-request;json;record). The fields a cartridge can read:
| Field | Purpose |
|---|---|
| prompt | The user-facing input text. Required. |
| model_spec | The model spec string (HuggingFace id with optional include filter). Required. |
| request_type | Generate, GetVocab, or GetInfo. Defaults to Generate. |
| system_prompt | Optional system prompt prepended ahead of the user prompt. |
| max_tokens, temperature | The usual sampling controls. |
| top_k, top_p, min_p | Truncation/nucleus sampling. |
| seed | Optional integer seed for reproducible sampling. |
| repeat_penalty | Default 1.1. Penalises recently generated tokens to break repetition loops. |
| stop_sequences | Additional caller-supplied stop strings, merged with the family adapter’s. |
| chat_template | Optional override for the model’s chat template. |
| max_context_length, batch_size | Runtime-tuning knobs. |
| rope_freq_base, rope_freq_scale | RoPE scaling for context-extended models. |
| grammar, json_schema, constraint | See Structured generation. |
| hf_token | HuggingFace token if the model needs auth at download time. |
Defaults are applied by LlmGenerationRequest::with_defaults() in the SDK; callers only set what they want to override.
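The "set only what you override" pattern can be illustrated with a small Python stand-in for the request payload. This is not the SDK type: `GenerationRequest` is a hypothetical mirror covering a subset of the fields above, only the two defaults documented in the table (request_type and repeat_penalty) are filled in, and dropping unset fields from the JSON is an assumption about the wire format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class GenerationRequest:
    # Required fields.
    prompt: str
    model_spec: str
    # Defaults documented in the field table.
    request_type: str = "Generate"
    repeat_penalty: float = 1.1
    # Optional overrides; None means "let the runtime decide" in this sketch.
    system_prompt: Optional[str] = None
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    seed: Optional[int] = None
    stop_sequences: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to a JSON record, omitting fields left unset."""
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload, sort_keys=True)


request = GenerationRequest(
    prompt="Summarize this paragraph in one sentence.",
    model_spec="hf:mlx-community/Llama-3.2-1B-Instruct-4bit",
    temperature=0.2,  # the only sampling knob this caller overrides
)
record = json.loads(request.to_json())
```

Here the caller touches only temperature. The record still carries request_type and repeat_penalty from the defaults, while unset fields such as seed are left out.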
Streaming Responses
Generation responses arrive as a stream of LlmStreamMessage values, one per line of media:llm-text-stream;ndjson:
| Message | Carries |
|---|---|
| Token | A single generated token (or text fragment), for live display |
| Status | Progress info (preparing, generating, finalising) |
| Complete | End-of-generation marker with final stats (tokens generated, time spent) |
| ToolRequest | Constrained tool-call output: the model has produced a structured tool invocation |
| Error | A failure during generation; the stream ends after this |
Cartridges that consume the stream can render tokens live, count throughput, and surface tool calls back to whatever orchestrates the conversation. The MachineFabric UI uses exactly the same stream — the live progress shown during a generation comes from the Token and Status messages.
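A consumer of such a stream might look like the sketch below, which reads one JSON object per line and collects the generated text. The message names come from the table above, but the JSON layout is an assumption made for the example: the `type`, `text`, `stats`, and `message` keys are hypothetical, and the real LlmStreamMessage encoding may differ.

```python
import json
from typing import Dict, Iterable, Tuple


def consume_stream(lines: Iterable[str]) -> Tuple[str, Dict]:
    """Collect Token text until Complete; raise if the stream reports Error."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        message = json.loads(line)
        kind = message.get("type")
        if kind == "Token":
            parts.append(message["text"])
        elif kind == "Complete":
            return "".join(parts), message.get("stats", {})
        elif kind == "Error":
            raise RuntimeError(message.get("message", "generation failed"))
        # Status and ToolRequest messages are skipped in this sketch.
    raise RuntimeError("stream ended without a Complete or Error message")


# Hypothetical stream contents, one NDJSON record per line.
sample = [
    '{"type": "Status", "stage": "generating"}',
    '{"type": "Token", "text": "Hello"}',
    '{"type": "Token", "text": " world"}',
    '{"type": "Complete", "stats": {"tokens_generated": 2}}',
]
text, stats = consume_stream(sample)
```

A live UI would render each Token as it arrives instead of joining them at the end; the control flow around Complete and Error stays the same.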
Structured Generation
For structured outputs the request can carry a constraint:
| Constraint | Carries |
|---|---|
| JsonSchema | A JSON Schema describing the expected output shape |
| Regex | A regular expression the output must match |
| Grammar | A formal grammar (Lark-style) the output must conform to |
| ToolCall | A list of ToolDefinitions the model can invoke |
Constrained generation is dispatched through CAP_LLM_INFERENCE_CONSTRAINED. The runtime enforces the constraint at sampling time — the model cannot generate output that violates the constraint, so structured outputs (JSON, tool calls, classifications) come back well-formed without a parser-and-retry loop.
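The idea of enforcing a constraint at sampling time can be shown with a toy decoder that must answer with one of a fixed set of options. At every step it discards any token that would take the output off every allowed option, then picks the best-scoring survivor. This is a simplified illustration, not how the runtime is implemented: the vocabulary, the scores, and the `constrained_greedy` function are invented for the example, and real backends apply the same kind of masking to model logits.

```python
from typing import Callable, List


def constrained_greedy(
    options: List[str],
    vocab: List[str],
    score: Callable[[str, str], float],
) -> str:
    """Greedy decoding restricted to outputs that spell one of `options`."""
    text = ""
    while text not in options:
        # Mask: keep only tokens that leave `text` a prefix of some option.
        allowed = [
            token for token in vocab
            if token and any(opt.startswith(text + token) for opt in options)
        ]
        if not allowed:
            raise ValueError("vocabulary cannot spell any allowed option")
        text += max(allowed, key=lambda token: score(text, token))
    return text


# Toy "model": it likes "maybe" best, which the constraint forbids.
PREFERENCES = {"maybe": 5.0, "n": 2.0, "y": 1.0, "o": 0.5, "es": 0.1}
answer = constrained_greedy(
    options=["yes", "no"],
    vocab=["maybe", "y", "es", "n", "o"],
    score=lambda text, token: PREFERENCES[token],
)
```

Even though the unconstrained favourite is "maybe", the decoder can only ever emit "yes" or "no", which is why the caller needs no parser-and-retry loop.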
Standard URN helpers in the SDK pre-build the cap URNs for common structured patterns:
- generate_json_urn(lang_code): generate a JSON record matching a schema
- make_decision_urn(lang_code): pick one of a set of options
- make_multiple_decisions_urn(lang_code): pick a subset of a set of options
- llm_summarization_urn(lang_code): summarize text
- llm_codegeneration_urn(lang_code): generate code
The full URN catalogue is in capdag.standard.caps.
Embeddings and Vision
Embeddings and image description follow the same shape — typed cap URNs, typed payloads:
| Cap constant | What it does |
|---|---|
| CAP_GENERATE_EMBEDDINGS | Produce a vector embedding from text or other input |
| CAP_EMBEDDINGS_DIMENSIONS | Report the embedding dimensions for a model |
| CAP_DESCRIBE_IMAGE | Generate a textual description of an image |
Embeddings are routed by model spec the same way generation is — paste a HuggingFace embeddings model id and any cartridge that knows how to run it picks it up.
Where Models Live
Downloaded models are cached on disk so they don’t have to be re-fetched. The exact directory layout is a host-implementation detail that may change; under the hood, MachineFabric tracks each downloaded model with a cap:model-status cap and exposes installation, removal, and disk-usage views in the UI. The cartridge SDK exposes the model lifecycle through these standard URN builders:
- model_download_urn(): download a model by spec
- model_list_urn(): list installed models
- model_status_urn(): query a model’s installation status
- model_contents_urn(): list the files in an installed model
- model_availability_urn(): check whether a spec resolves to a downloadable model
- model_path_urn(): get the on-disk path of an installed model
Cartridges that need a model file can resolve it through model_path_urn() rather than hardcoding paths. The host enforces that a model is fully downloaded before the path resolves; cartridges don’t have to handle partial downloads.
Mixing Local and Remote
Cartridges currently run in heavily sandboxed processes with no default network access. The protocol does not preclude a cartridge that proxies to a cloud service — a cartridge that declares network entitlements and gets reviewed for inclusion can route requests to a remote model with the same media URNs every other cartridge uses. Users can mix local and remote capabilities in the same pipeline with clear semantics: the dispatch rule, the planner graph, and the live progress all behave identically. The only difference is where the inference physically runs.
The default, today, is local. Network proxying is opt-in.