Models - MachineFabric Docs

Bring Your Own Model

Paste a HuggingFace model id. Watch it download. The moment it lands, every cartridge that knows how to use it picks it up — embeddings, transcription, vision, generation. Apple’s MLX, llama.cpp, Candle: pick the runtime, pick the size, pick the quant.

The 0.5B model that fits in a few hundred megabytes works. The 70B beast that needs your whole GPU works. Same surface, same right-click, same Transmute menu — only the speed and the answer change.

This page covers the user-facing side of model management: model specs, supported backends, model-family detection, what LlmGenerationRequest lets you tune, streaming responses, structured generation, and where models live on disk.

For the cartridge-developer side (declaring an LLM cap, building an LlmGenerationRequest, parsing LlmStreamMessage values), see the SDK Reference.

Model Specs

A model is referred to by a model spec string. The most common form is HuggingFace:

hf:OWNER/REPO
hf:OWNER/REPO?include=*PATTERN*

Examples:

hf:meta-llama/Llama-3.2-1B-Instruct
hf:MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF?include=*Q8_0*
hf:mlx-community/Llama-3.2-1B-Instruct-4bit

The include= filter is matched against filenames in the repo. It is useful for picking a specific quantization out of a multi-file GGUF release. Without a filter, the cartridge picks a default, usually the largest available quant that fits the runtime’s preferences.
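The spec format above can be sketched in a few lines of Python. This is a minimal illustration, not the SDK's parser: the names ModelSpec, parse_model_spec, and select_files are hypothetical, and treating the include value as a case-sensitive glob is an assumption based on the `*PATTERN*` form shown above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional
from urllib.parse import parse_qs


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical parsed form of an 'hf:OWNER/REPO?include=PATTERN' spec."""
    owner: str
    repo: str
    include: Optional[str] = None  # filename filter, if one was given


def parse_model_spec(spec: str) -> ModelSpec:
    """Split a HuggingFace-style spec string into owner, repo, and filter."""
    scheme, sep, rest = spec.partition(":")
    if scheme != "hf" or not sep:
        raise ValueError(f"unsupported model spec: {spec!r}")
    path, _, query = rest.partition("?")
    owner, slash, repo = path.partition("/")
    if not (owner and slash and repo):
        raise ValueError(f"expected OWNER/REPO in {spec!r}")
    include = parse_qs(query).get("include", [None])[0]
    return ModelSpec(owner, repo, include)


def select_files(spec: ModelSpec, filenames: List[str]) -> List[str]:
    """Keep filenames matching the include filter; keep all when unfiltered.

    Assumption: the filter behaves like a case-sensitive glob.
    """
    if spec.include is None:
        return list(filenames)
    return [name for name in filenames if fnmatchcase(name, spec.include)]
```

With a made-up file listing such as `["model.Q4_K_M.gguf", "model.Q8_0.gguf"]`, a spec ending in `?include=*Q8_0*` would keep only the Q8_0 file.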

The model spec string is also what cartridges send in LlmGenerationRequest.model_spec and what the dispatch rule routes on. Two cartridges that both serve cap:llm-inference;... will be ranked by how specifically they match the spec — a Mistral-aware cartridge that names the family in its tags will beat a generic one for Mistral models.

Backends

Three runtimes are first-class today:

Backend	SDK constant	Models
GGUF (`llama.cpp` bindings)	`BACKEND_GGUF`	Quantized GGUF models — Llama, Mistral, Qwen, Phi, and the long tail of `*-GGUF` repos on HuggingFace
MLX	`BACKEND_MLX`	Apple’s MLX — Apple Silicon native, `mlx-community/*` models
Candle	`BACKEND_CANDLE`	Rust-native inference for selected model families

A cartridge advertises which backend it implements; the SDK exposes a backend_for_model_spec(model_spec) helper that classifies a spec into one of these. Cartridges that wrap a single backend use the helper to advertise only the model specs they can handle.
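As a rough illustration of what such a classification could look like, the sketch below guesses a backend from the hints in the table above (`mlx-community/*` repos, `GGUF` in the repo name or include filter). It is not the SDK's backend_for_model_spec: the function name guess_backend, the constant values, and the fallback to Candle are all assumptions made for the example.

```python
# Placeholder values; the real SDK constants may be defined differently.
BACKEND_GGUF = "gguf"
BACKEND_MLX = "mlx"
BACKEND_CANDLE = "candle"


def guess_backend(model_spec: str) -> str:
    """Guess a backend from a spec string using naming conventions only."""
    # Drop the scheme ("hf:") and separate the optional "?include=..." part.
    rest = model_spec.split(":", 1)[-1]
    path, _, query = rest.partition("?")
    owner = path.split("/", 1)[0].lower()

    if owner == "mlx-community":
        return BACKEND_MLX
    if "gguf" in path.lower() or "gguf" in query.lower():
        return BACKEND_GGUF
    # Assumption: anything else is handed to the Rust-native backend.
    return BACKEND_CANDLE
```

A cartridge wrapping a single backend could use a check of this kind to decline specs it cannot serve.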

The standard cap URN constants the cartridge SDK provides correspond directly to these backends:

CAP_LLM_INFERENCE_GGUF
CAP_LLM_INFERENCE_MLX
CAP_LLM_INFERENCE_CANDLE
CAP_LLM_INFERENCE_CONSTRAINED — constrained generation (JSON Schema, regex, grammar, tool-call) routed to whichever backend supports the constraint

When you invoke a generic LLM cap, dispatch picks the backend automatically based on the model spec. When you call a backend-specific cap directly, you’ve named the runtime yourself.

Model Families

Each model family expects its own prompt format and stop sequences. MachineFabric handles this automatically. When a generation cap is dispatched, the host inspects the model spec string, detects the family from keywords in the path (mistral, llama, phi, qwen), and picks the correct adapter: a small piece of code that knows the family’s prompt format, its conversation-boundary stop sequences, and how to parse its response.

Today the recognised families are:

Family	Recognized by	Adapter handles
Mistral	`mistral` in path	`[INST]/[/INST]` formatting; stop on `[INST]`, `<<SYS>>`
Llama 3	`llama` in path	Header-based turns; stop on `<|start_header_id|>`, `<|eot_id|>`
Phi	`phi` in path	Phi-3 special tokens; stop on `<|user|>`, `<|end|>`
Qwen	`qwen` in path	ChatML format; stop on `<|im_start|>`, `<|im_end|>`
(fallback)	none of the above	Generic `User:/Assistant:` formatting

Supporting a new family means adding family detection in the host, defining the adapter, and adding the corresponding media URN entries. Cartridges that wrap a model don’t have to know about any of this: the adapter is selected by the host before the cartridge sees the request.
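The detection rule in the table can be sketched as a keyword lookup. This is an illustrative approximation, not the host's code: the function names are hypothetical, the order in which keywords are checked is an assumption, and the fallback family is given no stop sequences here because the table does not list any.

```python
from typing import Iterable, List, Optional

# Stop sequences per family, taken from the table above.
FAMILY_STOPS = {
    "mistral": ["[INST]", "<<SYS>>"],
    "llama": ["<|start_header_id|>", "<|eot_id|>"],
    "phi": ["<|user|>", "<|end|>"],
    "qwen": ["<|im_start|>", "<|im_end|>"],
}


def detect_family(model_spec: str) -> Optional[str]:
    """Return the first family keyword found in the spec path, else None."""
    path = model_spec.partition("?")[0].lower()  # ignore any include filter
    for family in FAMILY_STOPS:  # assumption: checked in this order
        if family in path:
            return family
    return None  # caller falls back to generic User:/Assistant: formatting


def effective_stops(model_spec: str, caller_stops: Iterable[str] = ()) -> List[str]:
    """Merge the family adapter's stops with caller-supplied ones, no duplicates."""
    merged = list(FAMILY_STOPS.get(detect_family(model_spec), []))
    for stop in caller_stops:
        if stop not in merged:
            merged.append(stop)
    return merged
```

Plain substring matching is simple but can misfire on repos whose owner or name happens to contain a family keyword, which is one reason a real implementation may be stricter.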

Generation Parameters

A generation request is a typed payload (LlmGenerationRequest, carried as media:llm-generation-request;json;record). The fields a cartridge can read:

Field	Purpose
`prompt`	The user-facing input text. Required.
`model_spec`	The model spec string (HuggingFace id with optional include filter). Required.
`request_type`	`Generate`, `GetVocab`, or `GetInfo`. Defaults to `Generate`.
`system_prompt`	Optional system prompt placed ahead of the user prompt.
`max_tokens`, `temperature`	The usual sampling controls.
`top_k`, `top_p`, `min_p`	Truncation/nucleus sampling.
`seed`	Optional integer seed for reproducible sampling.
`repeat_penalty`	Default `1.1` — penalises recently-generated tokens, breaks repetition loops.
`stop_sequences`	Additional caller-supplied stop strings, merged with the family adapter’s.
`chat_template`	Optional override for the model’s chat template.
`max_context_length`, `batch_size`	Runtime-tuning knobs.
`rope_freq_base`, `rope_freq_scale`	RoPE scaling for context-extended models.
`grammar`, `json_schema`, `constraint`	See Structured generation.
`hf_token`	HuggingFace token if the model needs auth at download time.

Defaults are applied by LlmGenerationRequest::with_defaults() in the SDK; callers only set what they want to override.
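To make the "only set what you override" idea concrete, here is a small Python stand-in for the request payload. It is a sketch under assumptions: the class name GenerationRequestSketch is hypothetical, only a subset of the fields is modelled, the only defaults taken from this page are request_type and repeat_penalty, and omitting unset fields from the JSON is a guess about serialization rather than documented behaviour.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class GenerationRequestSketch:
    """Illustrative subset of the LlmGenerationRequest fields listed above."""
    prompt: str                      # required
    model_spec: str                  # required
    request_type: str = "Generate"   # documented default
    system_prompt: Optional[str] = None
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    seed: Optional[int] = None
    repeat_penalty: float = 1.1      # documented default
    stop_sequences: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to a JSON record, leaving out fields that were never set."""
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload)


# The caller overrides temperature and seed; everything else keeps its default.
request = GenerationRequestSketch(
    prompt="Summarize this paragraph.",
    model_spec="hf:meta-llama/Llama-3.2-1B-Instruct",
    temperature=0.2,
    seed=42,
)
```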

Streaming Responses

Generation responses arrive as a stream of LlmStreamMessage values, one per line of media:llm-text-stream;ndjson:

Message	Carries
`Token`	A single generated token (or text fragment) — for live display
`Status`	Progress info (preparing, generating, finalising)
`Complete`	End-of-generation marker with final stats (tokens generated, time spent)
`ToolRequest`	Constrained tool-call output — the model has produced a structured tool invocation
`Error`	A failure during generation; the stream ends after this

Cartridges that consume the stream can render tokens live, count throughput, and surface tool calls back to whatever orchestrates the conversation. The MachineFabric UI uses exactly the same stream — the live progress shown during a generation comes from the Token and Status messages.
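A consumer of such a stream might look like the sketch below. The message kinds come from the table above, but the JSON layout is assumed: the `type`, `text`, and `message` keys are hypothetical, since this page does not specify how LlmStreamMessage is encoded on the wire.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> dict:
    """Read NDJSON stream lines and gather text, tool requests, and final stats."""
    text_parts, tool_requests, stats = [], [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        message = json.loads(line)
        kind = message.get("type")  # assumed discriminator key
        if kind == "Token":
            text_parts.append(message["text"])
        elif kind == "ToolRequest":
            tool_requests.append(message)
        elif kind == "Complete":
            stats = message
            break  # end-of-generation marker
        elif kind == "Error":
            raise RuntimeError(message.get("message", "generation failed"))
        # Status messages only report progress and are ignored here.
    return {
        "text": "".join(text_parts),
        "tool_requests": tool_requests,
        "stats": stats,
    }
```

A UI would typically act on each Token as it arrives instead of joining them at the end; collecting them here keeps the example short.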

Structured Generation

For structured outputs the request can carry a constraint:

Constraint	Carries
`JsonSchema`	A JSON Schema describing the expected output shape
`Regex`	A regular expression the output must match
`Grammar`	A formal grammar (Lark-style) the output must conform to
`ToolCall`	A list of `ToolDefinition`s the model can invoke

Constrained generation is dispatched through CAP_LLM_INFERENCE_CONSTRAINED. The runtime enforces the constraint at sampling time — the model cannot generate output that violates the constraint, so structured outputs (JSON, tool calls, classifications) come back well-formed without a parser-and-retry loop.
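The idea of enforcing a constraint at sampling time can be shown with a toy decoder that must output exactly one of a fixed set of options. This is a simplified sketch, not how any MachineFabric backend is implemented: it works on single characters instead of tokenizer tokens, decodes greedily, and takes a hypothetical `scores` callback in place of a real model.

```python
from typing import Callable, Dict, List, Set

EOS = ""  # stand-in for an end-of-sequence token


def allowed_next(prefix: str, options: List[str]) -> Set[str]:
    """Characters (or EOS) that keep the output a prefix of some option."""
    allowed = set()
    for option in options:
        if option.startswith(prefix):
            allowed.add(option[len(prefix)] if len(option) > len(prefix) else EOS)
    return allowed


def constrained_decode(scores: Callable[[str], Dict[str, float]],
                       options: List[str]) -> str:
    """Greedy decoding where disallowed characters can never be chosen."""
    if not options:
        raise ValueError("at least one option is required")
    prefix = ""
    while True:
        allowed = allowed_next(prefix, options)
        ranked = scores(prefix)
        # Only allowed candidates are considered, however highly the model
        # scores anything else; sorting makes tie-breaking deterministic.
        choice = max(sorted(allowed), key=lambda t: ranked.get(t, float("-inf")))
        if choice == EOS:
            return prefix
        prefix += choice
```

Even if the stand-in model scores a character outside the options highest, the result is always one of the options, which is why no parse-and-retry loop is needed afterwards.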

Standard URN helpers in the SDK pre-build the cap URNs for common structured patterns:

generate_json_urn(lang_code) — generate a JSON record matching a schema
make_decision_urn(lang_code) — pick one of a set of options
make_multiple_decisions_urn(lang_code) — pick a subset of a set of options
llm_summarization_urn(lang_code) — summarize text
llm_codegeneration_urn(lang_code) — generate code

The full URN catalogue is in capdag.standard.caps.

Embeddings and Vision

Embeddings and image description follow the same shape — typed cap URNs, typed payloads:

Cap constant	What it does
`CAP_GENERATE_EMBEDDINGS`	Produce a vector embedding from text or other input
`CAP_EMBEDDINGS_DIMENSIONS`	Report the embedding dimensions for a model
`CAP_DESCRIBE_IMAGE`	Generate a textual description of an image

Embeddings are routed by model spec the same way generation is — paste a HuggingFace embeddings model id and any cartridge that knows how to run it picks it up.

Where Models Live

Downloaded models are cached on disk so they don’t have to be re-fetched. The exact directory layout is a host-implementation detail that may change. MachineFabric tracks each downloaded model with a cap:model-status cap and exposes installation, removal, and disk-usage views in the UI. The cartridge SDK exposes the model lifecycle through these standard URN builders:

model_download_urn() — download a model by spec
model_list_urn() — list installed models
model_status_urn() — query a model’s installation status
model_contents_urn() — list the files in an installed model
model_availability_urn() — check whether a spec resolves to a downloadable model
model_path_urn() — get the on-disk path of an installed model

Cartridges that need a model file can resolve it through model_path_urn() rather than hardcoding paths. The host resolves the path only once the model is fully downloaded, so cartridges don’t have to handle partial downloads.

Mixing Local and Remote

Cartridges currently run in heavily sandboxed processes with no default network access. The protocol does not preclude a cartridge that proxies to a cloud service — a cartridge that declares network entitlements and gets reviewed for inclusion can route requests to a remote model with the same media URNs every other cartridge uses. Users can mix local and remote capabilities in the same pipeline with clear semantics: the dispatch rule, the planner graph, and the live progress all behave identically. The only difference is where the inference physically runs.

The default, today, is local. Network proxying is opt-in.