
Houdini Cloud Rendering: A VFX Simulation Deep Dive for Pyro, FLIP, Vellum, Destruction, and Crowds
Overview
| Field | Value |
|---|---|
| metaTitle | Houdini Cloud Rendering: VFX Simulation Deep Dive |
| metaDescription | Cache, upload, and render Houdini Pyro, FLIP, Vellum, RBD, and Crowds on a cloud farm — substep tuning, .bgeo.sc sizing, and honest limits. |
| slug | houdini-cloud-rendering-vfx-simulation-deep-dive-2026 |
| author | Thierry Marc |
| categories | Rendering |
| tags | Houdini, VFX, Cloud Rendering, Advanced, Performance, GPU Rendering |
Houdini scenes have a way of generating output long before they generate frames. A FLIP simulation that takes nine hours to cache locally, a Pyro plume baked across 240 frames, a Vellum cloth solve that fills a 4 TB scratch disk — and that is before a single Karma sample lands on the beauty pass. For the FX TDs and lookdev artists we work with, the bottleneck is rarely the render itself. It is the simulation, the cache, and the version-juggling that consumes the week, and then "the render" becomes the thing that has to fit into Friday afternoon.
That gap — between the sim that finishes and the frames that ship — is where cloud rendering decisions live. We have been operating Super Renders Farm since 2017, with the team running distributed rendering for FX-heavy production work since 2010. The questions we hear from Houdini FX TDs are almost never "should we cloud-render?" They are "will my Pyro cache survive the trip?" and "if I move my Vellum bake to the farm, will the substeps still be stable?" The answer depends on what the sim is doing, which is why this article is organized per simulation type rather than per workflow stage.
What follows is a per-sim-type optimization manual. For end-to-end workflow setup — scene preparation, $HIP and $JOB paths, USD asset resolution, plugin version pinning — see our Houdini cloud render farm setup guide. For a vendor comparison of managed Houdini farms across pricing, hardware, and renderer support, see our head-to-head comparison of Houdini render farms. This deep-dive assumes the scene is upload-clean and focuses on the per-sim-type knobs that decide whether the simulation itself survives the trip to a worker fleet, and what comes back from it.
Why Houdini Sims Stress a Cloud Farm Differently
Most render farm content frames the workload as "frames per hour" — a fixed scene rendered N times across N workers. That model fits a static lookdev pass on Karma or Redshift. It does not fit a Houdini sim, because in Houdini the "scene" is not finalized until the sim cache is finalized. A Pyro plume, a FLIP volume, a Vellum cloth pre-roll — these are intermediate state, not scene state. The farm has to either receive that intermediate state pre-baked, or rebuild it from a .hip file the worker just received, and those two paths have very different cost profiles.
The single-machine bound on most Houdini solvers is the operative constraint here. DOPnet substep coherence (the requirement that frame N depends on frame N-1 in the same solver context) means Pyro, FLIP, Vellum, and RBD solves are mostly not parallelizable across worker nodes mid-sim. PDG distributes wedges and frame-independent SOP work; the underlying solve loop generally does not fan out. Practical implication: a sim either fits on a single worker, just as it would have to fit on a single workstation, or it does not run on the farm at all. The farm wins on render-side parallelism, not sim-side parallelism.
On our farm, the CPU side runs Dual Intel Xeon E5-2699 V4 nodes with 96–256 GB RAM (20,000+ cores aggregate), which is the relevant tier for cache rebuild and CPU sim/render passes; the GPU side runs RTX 5090 cards with 32 GB VRAM each, which is the tier that Karma XPU and Redshift consume at the render phase. The sim phase and the render phase land on different fleets, which is why pricing on Houdini work is almost always a two-line item — CPU GHz-hours for the cache rebuild (if needed), GPU node-hours for the render.
The canonical command-line entry points are hbatch (classic Houdini scene execution) and husk (Solaris/USD stage rendering; see the husk command-line reference). Most farm-side automation runs through one of these, with the .hip uploaded once and either re-executed per frame range (hbatch) or rendered against a pre-baked USD stage (husk). Per sim type, the question is: do we ship a baked cache and run husk, or do we ship the .hip and let hbatch rebuild?
Pyro: Smoke, Fire, Explosion Caches at Cloud Scale
Pyro is the Houdini smoke, fire, and explosion solver: a sparse-grid combustion model built on the Pyro Solver DOPnet, writing .vdb volumes per frame. Combustion produces temperature, density, fuel, velocity, and divergence fields, and the voxel grid is the primary control on cache size: halving the voxel size multiplies memory and disk cost by roughly 8 (cubic scaling). For full technical context, see the SideFX Pyro documentation.
Cache strategy. Almost always bake to .vdb (OpenVDB sparse) rather than .bgeo.sc, because Pyro fields are sparse by nature: most voxels are empty air. OpenVDB's narrow-band storage drops dead voxels from disk. Substep count matters here: Pyro solves cleanly at 1–2 substeps for slow-moving plumes and 4–8 substeps for fast combustion or shockwaves. A higher substep count on the farm means the worker spends more CPU per frame; a lower count costs less, but the sim can lose coherence on fast motion. Pin the substep count in the DOPnet rather than relying on farm-default behavior.
Voxel size, advection scheme (single-stage semi-Lagrangian vs BFECC vs Modified MacCormack), and the combustion model parameters together set the per-frame .vdb size. A mid-complexity Pyro plume at 0.05 voxel size over 240 frames typically lands in the 20–60 GB total range. Check the per-frame cache size before upload, because upload bandwidth is often the bottleneck, not the render.
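The cubic relationship between voxel size and cache cost is easy to sanity-check before committing to a finer grid. The sketch below is a rough estimate only: it assumes the active voxel count scales with the inverse cube of the voxel size, which tends to overestimate for thin, sparse features. The function name and sample numbers are illustrative and not part of any Houdini API.

```python
def cache_scale_factor(old_voxel_size: float, new_voxel_size: float) -> float:
    """Approximate multiplier on memory and disk when the voxel size changes.

    Assumption: active voxel count scales with (1 / voxel_size) ** 3.
    """
    if old_voxel_size <= 0 or new_voxel_size <= 0:
        raise ValueError("voxel sizes must be positive")
    return (old_voxel_size / new_voxel_size) ** 3


# Halving the voxel size from 0.05 to 0.025.
factor = cache_scale_factor(0.05, 0.025)
print(f"scale factor: {factor:.1f}x")

# Applied to the 20 to 60 GB range quoted for a 0.05 plume.
low_gb, high_gb = 20 * factor, 60 * factor
print(f"estimated range at 0.025: {low_gb:.0f} to {high_gb:.0f} GB")
```

Treat the result as an upper-bound planning figure; the real per-frame size still has to be measured on a short test bake.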
Cloud farm considerations. Pyro renders on GPU via Karma XPU or Redshift volume rendering, both of which consume .vdb natively. The sim itself is CPU-bound and OpenCL-accelerable, but the OpenCL acceleration mostly helps workstation bake speed, not farm-side parallel-frame sim (because each frame still depends on the prior frame). Practical pattern: bake locally, upload the .vdb sequence, render on the GPU fleet.
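```bash
# Render a cached Pyro plume via husk on a USD stage,
# with Karma XPU on the GPU node and volume samples raised.
# --engine selects the Karma engine (--settings expects a
# RenderSettings prim path, not an engine name). Check the
# --override syntax against `husk --help` for your Houdini build.
husk --renderer karma \
    --engine xpu \
    --frame 1 --frame-count 240 --frame-inc 1 \
    --verbose 3a \
    --output "$HIP/render/pyro_plume.\$F4.exr" \
    --override "/Render/rendersettings:karma:volumesamples=8" \
    "$HIP/stage/pyro_plume_volumes.usd"
```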
The husk invocation against a USD stage that references the cached .vdb sequence lets the GPU worker draw the volume without re-solving. Raising volumesamples from the Karma default 4 to 8 reduces noise on dense plumes at the cost of roughly 1.5–2x render time. Use 16 for hero shots, leave at 4 for pre-vis.
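To budget render time across sample settings, the 1.5 to 2x figure can be compounded per doubling of the sample count. This is a back-of-envelope sketch, assuming the cost multiplies by a constant factor each time the sample count doubles; real scaling depends on the shot, and the function name is hypothetical.

```python
import math


def render_time_multiplier(samples: int, base_samples: int = 4,
                           cost_per_doubling: float = 1.5) -> float:
    """Estimated render-time multiplier relative to base_samples.

    Assumption: each doubling of the sample count multiplies render
    time by cost_per_doubling.
    """
    if samples <= 0 or base_samples <= 0:
        raise ValueError("sample counts must be positive")
    doublings = math.log2(samples / base_samples)
    return cost_per_doubling ** doublings


for samples in (4, 8, 16):
    low = render_time_multiplier(samples, cost_per_doubling=1.5)
    high = render_time_multiplier(samples, cost_per_doubling=2.0)
    print(f"{samples:>2} samples: {low:.2f}x to {high:.2f}x the 4-sample time")
```

Under that assumption a 16-sample hero pass lands somewhere between about 2.25x and 4x the pre-vis render time, which is worth knowing before booking GPU node-hours.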
FLIP: Liquids, Surface Reconstruction, Narrow-Band
FLIP — the Fluid Implicit Particle solver — combines particle and grid representations to simulate water, viscous liquids, and free-surface flow. The output is two things: a particle cache (.bgeo.sc packed sequence) and, optionally, a reconstructed surface mesh (also .bgeo.sc). Both go to the farm, which means FLIP almost always doubles its disk footprint compared to Pyro on equivalent complexity.
Cache strategy. Separate particle cache and surface mesh into two cache directories — particles bake first, mesh reconstructs from particles in a downstream SOP network. This split lets you re-mesh without re-simming, which matters when surface tension or particle separation needs a second pass. A 200-frame mid-complexity FLIP sim at 0.02 particle separation often lands at 80–200 GB on the particle side and 20–40 GB on the mesh side. Narrow-band FLIP — where only particles near the surface are stored at full density — cuts the particle cache by 60–80% on shots where the deep volume is not visible. Turn it on when the camera does not look through the water.
Viscosity stiffness and CFL constraints set the substep count. Water sims at viscosity 0 typically run at 1–2 substeps; honey or molten metal at high viscosity often needs 5–10 substeps to remain stable. CFL violation produces particle explosions, which on a farm are far more expensive than on a workstation because you do not see them until the render finishes.
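The CFL side of that substep budget can be estimated ahead of the bake. The sketch below uses the textbook CFL bound (no particle should cross more than a set number of cells per substep) and is not the FLIP solver's internal substepping logic. It ignores viscosity entirely, and the function name, the 24 fps default, and the sample speeds are assumptions for illustration.

```python
import math


def substeps_for_cfl(max_speed: float, particle_separation: float,
                     fps: float = 24.0, cfl: float = 1.0) -> int:
    """Minimum substeps per frame so that no particle travels more than
    `cfl` cells of size `particle_separation` within one substep."""
    if particle_separation <= 0 or fps <= 0 or cfl <= 0:
        raise ValueError("particle_separation, fps and cfl must be positive")
    frame_dt = 1.0 / fps
    cells_per_frame = max_speed * frame_dt / particle_separation
    return max(1, math.ceil(cells_per_frame / cfl))


# A splash peaking at 2 units/s with 0.02 particle separation.
print(substeps_for_cfl(max_speed=2.0, particle_separation=0.02))

# The same shot with a looser CFL condition of 2 cells per substep.
print(substeps_for_cfl(max_speed=2.0, particle_separation=0.02, cfl=2.0))
```

If the estimate is well above the substep count pinned in the DOPnet, that is a signal to test the fastest frames locally before uploading anything.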
Cloud farm considerations. The cache upload time is the dominant cost for FLIP on a cloud farm. A 100 GB particle cache over a 100 Mbps client uplink takes roughly 2.5 hours before the first render frame can start. On a 1 Gbps uplink, the same cache uploads in ~15 minutes — the difference is often what decides whether cloud FLIP is operational for a shot. Audit cache size before uploading.
```python
# Hython probe: run from a workstation or worker to compute
# per-frame cache size before paying upload bandwidth on a
# multi-TB FLIP sim. Use as a pre-flight gate.
import os

import hou

cache_dir = hou.expandString("$HIP/cache/flip/v003")
total = 0   # bytes across all cache files
frames = 0  # number of .bgeo.sc files found
for f in sorted(os.listdir(cache_dir)):
    if f.endswith(".bgeo.sc"):
        total += os.path.getsize(os.path.join(cache_dir, f))
        frames += 1

if frames:
    print(f"{frames} frames, {total/1e9:.2f} GB total, "
          f"{(total/frames)/1e6:.1f} MB/frame avg")
else:
    # Avoid a divide-by-zero when the directory holds no cache files.
    print(f"No .bgeo.sc files found in {cache_dir}")
```
If the per-frame average exceeds 500 MB and the total exceeds 100 GB, either accept the upload window or revisit particle separation, narrow-band, and the surface mesh threshold before transferring.
Vellum: Cloth, Soft Body, Constraint Serialization
Vellum is the Houdini position-based dynamics framework — cloth, soft body, hair, grains, fluids on a constrained-position formulation. Output is a .bgeo.sc cache per frame, but unlike Pyro or FLIP, Vellum caches carry constraint state in addition to point positions. The constraint graph (pin, stretch, bend, attach-to-static) must serialize cleanly or the farm worker re-solves with broken constraints. See Vellum solver documentation for the constraint type matrix.
Cache strategy. Cache after the Vellum Solver DOPnet, before any downstream SOP cleanup. Use the Vellum I/O SOP rather than a generic File Cache, because Vellum I/O preserves constraint attributes (__constraintnetwork, restlength, stiffness) that a generic cache will strip. Pre-roll matters: cloth needs 20–40 frames of settle before the camera frame range starts, and the pre-roll must be in the cache or the worker will render frame 1 with un-settled cloth. Most Vellum production rigs bake pre-roll into frames -20 through 0 of the cache.
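A missing pre-roll frame is cheap to catch before upload. The following is a minimal sketch that scans a list of cache filenames for gaps in an expected frame range. It assumes the frame number sits directly before the .bgeo.sc extension (for example cloth.0001.bgeo.sc); how negative pre-roll frames are named depends on the output path expression in your scene, so adjust the pattern to match. The filenames and function name are hypothetical.

```python
import re

# Frame number immediately before the extension, optionally negative.
FRAME_RE = re.compile(r"\.(-?\d+)\.bgeo\.sc$")


def missing_frames(filenames, first_frame, last_frame):
    """Return the frames in [first_frame, last_frame] with no cache file."""
    present = set()
    for name in filenames:
        match = FRAME_RE.search(name)
        if match:
            present.add(int(match.group(1)))
    return [f for f in range(first_frame, last_frame + 1) if f not in present]


# Hypothetical listing: pre-roll -20..0 plus shot frames 1..240, one gap.
names = [f"cloth.{f}.bgeo.sc" for f in range(-20, 241) if f != -3]
print(missing_frames(names, -20, 240))
```

In practice the filename list would come from os.listdir on the cache directory; an empty result means the pre-roll and shot range are both fully present.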
Substep count is the single most common cause of farm-side Vellum failures. The Vellum Solver default substeps (5) work for slow drapes and basic character cloth, but fast motion, high stretch ratios, and tight pin networks often need 10–20 substeps to remain stable. Pin the substep count explicitly in the DOPnet — letting the farm worker use its default is where cross-frame stability tends to break, because workstation-baked caches at substep 10 do not match worker-rebuilt caches at substep 5.
Cloud farm considerations. Vellum is CPU-bound, single-threaded per solver (with limited multi-thread on constraint resolution), and the cache size is modest — usually 5–20 GB per shot — so the upload bottleneck is less acute than FLIP. The dominant cost on a farm is the rebuild time if the .hip is shipped instead of a baked cache. A 4 GB Vellum cache (mid-complexity costume) typically takes 30–90 minutes to bake on a single E5-2699 worker. If you bake locally first and upload the cache, the farm sees only the render cost.
```bash
# Batch-bake a Vellum cloth solve to .bgeo.sc with an explicit
# substep count. Default substep counts are where farm-side
# stability tends to break on production-grade cloth.
# The solver node name (vellumsolver1) is a placeholder for the
# Vellum Solver DOP in your own scene.
hbatch -c "opparm /obj/cloth_sim/dopnet1/vellumsolver1 substeps 10; \
    render -f 1 240 -i 1 /obj/cloth_sim/dop_cache_OUT; \
    quit" \
    "$HIP/scenes/cloth_main_v007.hip"
```
The opparm call sets the Vellum Solver's substeps parameter to 10 before the render command runs, so the bake uses that value regardless of what was saved in the .hip. This is the single safest hedge against farm-side Vellum stability drift.
Destruction: RBD, Bullet, Constraint Networks
Destruction in Houdini means rigid-body dynamics: the RBD Solver, Bullet, and the constraint networks that glue, pin, or snap fractured geometry. The cache format is .bgeo.sc packed primitives, with each packed prim representing a fractured chunk and the constraint network stored as a separate .bgeo.sc per frame. Fracturing happens upstream of the sim, in SOPs rather than DOPs, so only the post-fracture geometry plus the constraint network goes to the solver.
Cache strategy. Two caches matter: the fractured geometry (static, one frame) and the dynamic transform-per-frame state. Cache only the transforms during the sim, then apply them at render time via the Packed Disk Primitive workflow. This separates the heavy geometry (often 5–50 GB of fractured pieces) from the cheap dynamic state (typically 10–50 MB per frame). The geometry uploads once; the dynamic cache uploads per shot.
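The payoff of splitting static geometry from per-frame transforms is easiest to see as arithmetic. This sketch compares the two upload totals using figures from the ranges above; the naive case assumes the full fractured geometry is written out on every frame, and the function name is illustrative.

```python
def rbd_upload_gb(geometry_gb: float, transform_mb_per_frame: float,
                  frames: int):
    """Return (split_total_gb, naive_total_gb) for an RBD shot.

    split: geometry uploaded once plus a transform cache per frame.
    naive: full geometry written and uploaded for every frame.
    """
    split_total = geometry_gb + transform_mb_per_frame * frames / 1000.0
    naive_total = geometry_gb * frames
    return split_total, naive_total


# 5 GB of fractured pieces, 50 MB of transforms per frame, 200 frames.
split_gb, naive_gb = rbd_upload_gb(5.0, 50.0, 200)
print(f"split cache: {split_gb:.0f} GB, per-frame geometry: {naive_gb:.0f} GB")
```

Even at the light end of the geometry range, the split workflow keeps the upload in the tens of gigabytes rather than around a terabyte.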
Constraint network serialization is the gotcha. RBD constraints (glue, hard, soft, cone-twist) carry a __constraintnetwork attribute that the RBD I/O SOP, the RBD counterpart to Vellum I/O, handles correctly, but a generic File Cache will not. Use RBD I/O for the constraint side; use the standard packed-prim cache for the transforms.
Cloud farm considerations. RBD sims are deterministic if random seeds are pinned. The default behavior — $F-driven seeds, time-of-day seeds, or unset seeds — produces different fracture patterns on different workers. On a farm where one worker rebuilds the cache and another renders against an expected pattern (e.g., a comp setup pre-built on a workstation cache), seed drift produces visible mismatches that only surface after the render lands. Pin every random seed before the bake.
```bash
# Deterministic RBD bake: set RBD_SEED as a global variable so the
# workstation and every farm worker bake the same frame range from
# the same seed. Without this, solves can desync between
# workstation and farm, surfacing as comp-time mismatches.
# Assumes the scene's seed parameters reference $RBD_SEED.
hbatch -c "set -g RBD_SEED = 42; \
    render -f 1 200 /obj/destruction/dop_constraint_OUT; \
    quit" \
    "$HIP/scenes/destruction_v012.hipnc"
```
Bullet vs RBD Solver: Bullet is faster for large piece counts (1000+ chunks) and acceptable for mid-quality destruction; RBD Solver is more accurate for hero-shot dynamics, stack-collapse, and constraint-driven setups, at roughly 3–5x the per-frame solve cost. On a farm, Bullet is the practical default unless the shot is hero.
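Seed pinning is only useful if it can be verified. One way to check, sketched below, is to hash the cache files from a workstation bake and a farm bake and list the frames that differ. This is a strict byte-level comparison and is an assumption about what counts as a match: files can differ in metadata while holding equivalent geometry, so treat a mismatch as a prompt to inspect that frame rather than proof of seed drift. The directory layout and filenames in the demo are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large caches fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_frames(dir_a, dir_b, pattern="*.bgeo.sc"):
    """Names of cache files that are missing from one side or differ."""
    a = {p.name: p for p in Path(dir_a).glob(pattern)}
    b = {p.name: p for p in Path(dir_b).glob(pattern)}
    diffs = set(a) ^ set(b)  # present in only one directory
    for name in set(a) & set(b):
        if file_digest(a[name]) != file_digest(b[name]):
            diffs.add(name)
    return sorted(diffs)


# Demo with stand-in files: frame 1 matches, frame 2 differs.
with tempfile.TemporaryDirectory() as ws, tempfile.TemporaryDirectory() as farm:
    Path(ws, "rbd.0001.bgeo.sc").write_bytes(b"same")
    Path(ws, "rbd.0002.bgeo.sc").write_bytes(b"workstation")
    Path(farm, "rbd.0001.bgeo.sc").write_bytes(b"same")
    Path(farm, "rbd.0002.bgeo.sc").write_bytes(b"farm")
    diffs = mismatched_frames(ws, farm)

print(diffs)
```

Running this on a short frame range from both machines before the full bake is a cheap way to confirm the seeds are actually being read.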
Crowds: Agents, LOD, Ragdoll Handoff
Houdini Crowds is the agent simulation framework — populations of agents with motion clip libraries, behavior states, and LOD variants. The cache is more complex than other sim types: agent caches (.bclip.sc motion clips), crowd transform caches (.bgeo.sc per frame), and agent prim packed-prim references that resolve at render time. Each agent has its own LOD hierarchy, swapped at render time via Solaris variant sets.
Cache strategy. Bake the crowd sim to a .bgeo.sc transform cache (one file per frame, holding per-agent transform plus motion-clip index). The agent geometry, the actual mesh data, lives in a separate agent library alongside the .bclip.sc motion clips, and is referenced rather than baked per frame. This split is the entire reason crowd renders are tractable: a thousand-agent shot might have a 200 MB transform cache per frame but only 2 GB of agent geometry total, referenced from disk.
Motion clip caching matters because crowds animate by blending clips, not by per-frame keyframes. The clip library has to be on the worker before the render starts. Bake the clip library once, upload to the worker's persistent storage, then per-shot uploads are only the transform cache.
Ragdoll handoff — where an agent transitions from clip-driven animation to RBD-driven physics — needs special treatment on the farm. The ragdoll state cache is separate from the crowd transform cache, and the handoff frame must be deterministic, which means pinning seeds and ragdoll start frames explicitly. Otherwise different workers produce different ragdoll trajectories.
Cloud farm considerations. Crowds render on Karma (CPU and XPU) via Solaris stages, with agent LOD variants resolved at render time. Render-time LOD swap means you can change LOD per shot without re-simming — the high-LOD agents render for hero shots, low-LOD for wide shots, without touching the cache.
```bash
# Render a Solaris crowd stage with agent LOD selected at husk
# invocation. Karma honors agent prim variants for LOD swapping
# without rebuilding the crowd sim.
# --engine selects Karma XPU; check the --override syntax against
# `husk --help` for your Houdini build.
husk --renderer karma --engine xpu \
    --frame 1 --frame-count 240 \
    --output "$HIP/render/crowd.\$F4.exr" \
    --override "/World/crowd:variantSet:lodVariant:value=mid" \
    "$HIP/stage/crowd_main.usd"
```
The lodVariant:value=mid override selects the mid-LOD agent variant set at render time. Swap to low for distant background passes and high for hero foreground without re-running the crowd sim. This is the largest single-shot cost saver in cloud crowd rendering — render-time LOD lets one cache serve every shot in a sequence.
Honest Limits: When the Farm Is Not the Right Tool
A cloud farm is not the answer to every Houdini sim problem, and being explicit about that prevents shots from ending up on the farm because nobody asked the upstream question.
Distributed simulation is largely infeasible. Pyro, FLIP, Vellum, and RBD solvers are mostly single-machine-bound by DOPnet substep coherence. PDG can distribute wedges and frame-independent SOP work, but the inner solve loop generally cannot fan out across worker nodes mid-sim. If your sim does not fit on one machine — and the farm worker is typically a Dual Xeon E5-2699 V4 with 96–256 GB RAM, not radically different from a high-end workstation — moving it to the farm does not solve the problem.
Cache upload bandwidth math. A 100 GB FLIP cache over a 100 Mbps client uplink takes roughly 2.5 hours before the first render frame starts. Cache uploads are wall-clock time you pay before any rendering happens. A gigabit uplink helps, but the uplink actually available at the client's workstation is often far below that.
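The upload window is straightforward to estimate for any cache size and uplink. The sketch below assumes decimal gigabytes and a 90% effective link efficiency to cover protocol overhead; that efficiency figure is an assumption chosen for illustration, not a measured value, and the function name is hypothetical.

```python
def upload_hours(cache_gb: float, uplink_mbps: float,
                 efficiency: float = 0.9) -> float:
    """Estimated hours to upload cache_gb over an uplink_mbps link.

    Assumptions: 1 GB = 8000 megabits (decimal units), and only
    `efficiency` of the nominal link rate is usable.
    """
    if uplink_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("uplink must be positive and efficiency in (0, 1]")
    megabits = cache_gb * 8 * 1000
    seconds = megabits / (uplink_mbps * efficiency)
    return seconds / 3600.0


for mbps in (100, 1000):
    hours = upload_hours(100, mbps)
    print(f"100 GB at {mbps} Mbps: {hours:.2f} h ({hours * 60:.0f} min)")
```

With those assumptions the estimate comes out near 2.5 hours at 100 Mbps and about 15 minutes at 1 Gbps, in line with the figures quoted above.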
GPU vs CPU sim trade-off is settled, but not the way users expect. Pyro and FLIP have OpenCL paths that accelerate substep solves on the workstation. The farm-side win is parallel-frame rendering of the cached sim, not parallel-frame sim. Reframe: GPU on the farm equals render acceleration via Karma XPU or Redshift; CPU on the farm equals cache rebuild if you ship a .hip instead of a cache.
Iteration latency on cloud-side sim adjustment. If you tweak a Pyro density parameter and need to re-sim, you re-upload the modified .hip, re-cache on the worker, then render. The cycle on the farm is often 4–8x the local cycle for sim-heavy work. Cache on the workstation if the workstation can hold the resolution; the farm wins on the render side, not the iteration side.
License token availability for Houdini Engine. Render-only utilization on a managed farm covers Houdini render workers, but Houdini Engine licenses (for HDA-heavy procedural pipelines, game-asset workflows) are a separate seat type. Confirm with the farm whether Engine tokens are pooled and how concurrency is handled before submitting Engine-dependent scenes. When the farm is the right tool but the question becomes which farm, our head-to-head comparison of Houdini render farms covers five managed providers across Houdini-specific criteria.
Workflow Recommendations Summary
Per sim type, the cache-locally-versus-bake-on-farm decision usually falls out like this:
- Pyro: bake locally, upload .vdb, render on GPU.
- FLIP: bake locally if the uplink supports the cache size; otherwise consider an hbatch rebuild on the worker.
- Vellum: almost always bake locally, since caches are small and rebuild times are non-trivial.
- Destruction: bake locally with seeds pinned, upload the transform cache (not full geometry per frame).
- Crowds: bake the transform cache locally, upload once with the agent library, render with LOD variants per shot.
The decision tree we walk new Houdini clients through on our Houdini cloud render farm landing page covers the renderer matrix at the buyer level; this article covered the FX-TD-level optimization knobs underneath. CPU pricing on our published rate is $0.004/GHz-Hr — relevant when sizing a multi-day cache rebuild against a workstation alternative. The renderer matrix on our farm supports Karma XPU and Karma CPU, Mantra, Redshift, Arnold, V-Ray for Houdini, and Octane.
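For sizing a cache rebuild against the published CPU rate, a rough calculation looks like the sketch below. The billing formula here is an assumption: it treats GHz-hours as cores times base clock times wall-clock hours, and the farm's actual metering may differ, so confirm the method before relying on the number. The 44 cores at 2.2 GHz reflect the base specification of a dual E5-2699 V4 node; the function name is hypothetical.

```python
RATE_USD_PER_GHZ_HOUR = 0.004  # published CPU rate quoted above


def rebuild_cost_usd(cores: int, clock_ghz: float, hours: float,
                     rate: float = RATE_USD_PER_GHZ_HOUR) -> float:
    """Estimated cost of a CPU job.

    Assumption: GHz-hours = cores * base clock (GHz) * wall-clock hours.
    """
    ghz_hours = cores * clock_ghz * hours
    return ghz_hours * rate


# A 90-minute Vellum rebuild on one dual E5-2699 V4 node
# (2 x 22 cores at a 2.2 GHz base clock).
cost = rebuild_cost_usd(cores=44, clock_ghz=2.2, hours=1.5)
print(f"estimated rebuild cost: ${cost:.2f}")
```

The same function scales to a multi-day rebuild by changing the hours, which makes it easy to compare against the cost of tying up a workstation for the same period.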
FAQ
Q: What is the best .bgeo.sc compression setting for cloud upload bandwidth?
A: For FLIP particle caches, the default .bgeo.sc (Blosc-compressed .bgeo) is already near-optimal for upload; the format is designed for it. The largest single bandwidth win is upstream: turn on narrow-band FLIP when the camera does not look through the deep volume, which can cut particle cache size by 60–80% before the compression even runs. For Vellum and RBD caches, .bgeo.sc is similarly already optimal; gains come from caching only what changes (transforms, not full geometry per frame), not from changing the format.
Q: Can I run distributed Pyro or FLIP simulations across multiple cloud workers?
A: No, not for the inner solve loop. Pyro, FLIP, Vellum, and RBD all rely on DOPnet substep coherence (frame N depends on frame N-1 in the same solver context), so the solve cannot fan out across worker nodes mid-sim. PDG can distribute wedges (parameter sweeps) and frame-independent SOP work, but the actual solver runs on one machine. The farm's win on sim work is parallel-frame rendering of the baked cache, not parallel-frame simulation.
Q: Should I cache Vellum on the workstation or the farm?
A: Almost always on the workstation. Vellum caches are modest (typically 5–20 GB per shot), workstation bake times are manageable (30–90 minutes for a mid-complexity costume on a single CPU), and the cache is cheap to upload. Letting the farm worker rebuild a Vellum cache from the .hip means paying CPU GHz-hours on the worker side for work you could have done locally for free. The exception is a shot revision where a tweak to the .hip, such as a substep change, invalidates the cache; in those cases, a farm rebuild is reasonable.
Q: Does Karma XPU support volume rendering of cached Pyro .vdb files on a cloud worker?
A: Yes. Karma XPU consumes OpenVDB volumes natively via Solaris stages, and a husk invocation against a USD stage that references the cached .vdb sequence renders the volume without re-solving. The GPU worker draws the volume directly; the sim does not have to be present on the worker. Raise karma:volumesamples from the default 4 to 8 for production-quality volumes, 16 for hero shots — the cost is roughly 1.5–2x render time per doubling.
Q: How do I keep RBD destruction sims deterministic across cloud workers?
A: Pin every random seed in the DOPnet before the bake: RBD_SEED, fracture seeds, and any $F-driven or time-of-day seeds. Without seed pinning, the same RBD scene baked on two different workers produces different fracture patterns, which surfaces as comp-time mismatches when a workstation-rendered reference and a farm-rendered final do not match. Set the seed as a global variable in the hbatch invocation (set -g RBD_SEED = 42) and verify the DOPnet reads it.
Q: Do I need Houdini Engine licenses to render crowd sims on a cloud farm?
A: It depends on how the crowd pipeline is constructed. A crowd sim that bakes to a .bgeo.sc transform cache and renders via Karma against the cache does not need Engine — the render-only license tier handles it. A crowd sim that runs HDAs at render time (procedural agent generation, instanced procedural assets) may need Engine seats. Confirm with the farm whether Engine tokens are pooled and how concurrency is handled. On our farm, the render-only license model lets us render Karma XPU on the GPU fleet and CPU renderers on the CPU fleet without Engine seat constraints; HDA-heavy crowd pipelines should be discussed during shot setup.
Related Reading
- Houdini Cloud Render Farm
- Houdini Cloud Render Farm Setup Guide for 2026
- Head-to-Head Comparison of Houdini Render Farms 2026
- Render Farm Cost per Frame Guide
About Thierry Marc
3D Rendering Expert with over 10 years of experience in the industry. Specialized in Maya, Arnold, and high-end technical workflows for film and advertising.



