This out-of-core pipeline is a super cool idea! I work on OptiX, and this is the first time I’ve seen someone try this kind of thing. There are different kinds of out-of-core rendering, and not all of them are guaranteed to finish without running out of memory. This one is, as long as each individual mesh can fit in GPU memory.
Chris, are you on HN? I am curious whether some savings might be possible here by not snapshotting the IASes and GASes. Instead, you could snapshot only the raw geometry and then rebuild the GASes and IASes before every launch. It may be faster to rebuild the BVHs than to copy them from CPU RAM!
*Oh, incidentally, looking through the code I also noticed that the GASes could be compacted, which would save a lot of memory and time/bandwidth on the snapshot copies during rendering. Since you do compact the IAS, I'm assuming it's problematic to try to compact the GASes?
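For reference, the compaction flow I have in mind is the standard two-step OptiX 7 pattern: build with the compaction flag set, read back the emitted compacted size, then compact into a smaller buffer. Here's a generic sketch (simplified, error checks omitted, and `buildCompactedGAS` is just an illustrative name, not code from your repo):

    // Generic OptiX 7 build-with-compaction sketch (not from the Motunui repo).
    #include <optix.h>
    #include <optix_stubs.h>
    #include <cuda_runtime.h>

    // Builds a GAS with compaction enabled and returns the (possibly compacted)
    // traversable handle. outBuffer receives the device buffer backing the handle.
    OptixTraversableHandle buildCompactedGAS(OptixDeviceContext ctx,
                                             CUstream stream,
                                             const OptixBuildInput& input,
                                             CUdeviceptr& outBuffer)
    {
        OptixAccelBuildOptions opts = {};
        opts.buildFlags = OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
        opts.operation  = OPTIX_BUILD_OPERATION_BUILD;

        OptixAccelBufferSizes sizes = {};
        optixAccelComputeMemoryUsage(ctx, &opts, &input, 1, &sizes);

        CUdeviceptr temp = 0, full = 0, compactedSizePtr = 0;
        cudaMalloc(reinterpret_cast<void**>(&temp), sizes.tempSizeInBytes);
        cudaMalloc(reinterpret_cast<void**>(&full), sizes.outputSizeInBytes);
        cudaMalloc(reinterpret_cast<void**>(&compactedSizePtr), sizeof(size_t));

        // Ask the build to emit the compacted size alongside the full-size GAS.
        OptixAccelEmitDesc emit = {};
        emit.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
        emit.result = compactedSizePtr;

        OptixTraversableHandle handle = 0;
        optixAccelBuild(ctx, stream, &opts, &input, 1,
                        temp, sizes.tempSizeInBytes,
                        full, sizes.outputSizeInBytes,
                        &handle, &emit, 1);
        cudaStreamSynchronize(stream);

        size_t compactedSize = 0;
        cudaMemcpy(&compactedSize, reinterpret_cast<void*>(compactedSizePtr),
                   sizeof(size_t), cudaMemcpyDeviceToHost);

        if (compactedSize < sizes.outputSizeInBytes) {
            // Compact into a smaller buffer and free the full-size one.
            cudaMalloc(reinterpret_cast<void**>(&outBuffer), compactedSize);
            optixAccelCompact(ctx, stream, handle, outBuffer, compactedSize, &handle);
            cudaStreamSynchronize(stream);
            cudaFree(reinterpret_cast<void*>(full));
        } else {
            outBuffer = full;
        }
        cudaFree(reinterpret_cast<void*>(temp));
        cudaFree(reinterpret_cast<void*>(compactedSizePtr));
        return handle;
    }

The same helper would apply whether you snapshot the compacted GASes or rebuild them before every launch.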
Thanks! I really enjoyed working with OptiX. BVH re-building potentially being faster than the associated transfer cost is something I didn't consider; I'll have to try that out and compare.
As for the lack of GAS compaction: I added IAS compaction to squeeze all of the beach's debris into a single snapshot, and just never got to GAS compaction. Right now each rendering pass is only computing one sample per pixel, so in addition to reducing the BVH footprints, I'm hoping I can also hide the transfer costs by computing many more samples per pass.
Yes, if you have a viable way to compute multiple samples per pass and store the updates, you’ll definitely be able to amortize the cost of the snapshot copies and save a lot of time.
It’s been a while since I played with Moana’s meshes, so I don’t remember what GAS compaction will buy you exactly, but generally speaking it’s common to see the compacted GAS end up at about 50% of the size of the first build. If you’re allocating and building multiple GASes in parallel, then it might mean you have to defrag your snapshot in a separate pass using GAS relocation. Once you put all that together, it is possible that copying is faster than rebuilding, so just be aware I’m not certain that my suggestion to build every time is a winner in your case.
I can also envision some other strategies that may save considerably on snapshot copies in many cases (and may be difficult to implement :) ). If you keep going with this and would like more support or suggestions or perf tips, please feel free to get in touch via the OptiX forum (feel free to DM me there if you like). We’d be happy to try to help you squeeze out more samples per second.
Let me explain. GPU rendering of cinema quality scenes by itself is not that new. V-Ray GPU was around in 2013 and already had an impressive showreel back then: https://www.youtube.com/watch?v=RYPFY5OUzdk
I'd count Octane and Redshift as the next big contenders, who, together with V-Ray's switch from perpetual to subscription pricing, took over the market. Here's an example of a Redshift render from 2016: https://www.youtube.com/watch?v=to8yh83jlXg
Lately, Pixar has been handing out RenderMan GPU betas, which have already been used in the new Jungle Book movie: https://youtu.be/tiWr5aqDeck?t=171
Along with this development, prices have come down: from $1200 per V-Ray license, to $600 per Redshift license, and now it seems the previously prohibitively expensive Pixar renderer will soon be offered for $500.
And recently, people have been using Houdini ($6000 per seat) to export their scenes as USD (a free, open file format) so that they can use Blender 2.81 for rendering (also free). But the conversion to get things into Blender is work. This renderer can use the Disney production data directly, which is very convenient if you want to drop it in at the last minute to save costs.
GPU rendering has gone from a novelty in 2013, to established in 2016, to the new default in 2018/2019. And this being released as open source implies, to me, that the commercial renderer market will soon be dead. It looks "good enough" for advertisements and archviz, which is where the money is made.
For ideal scenes that fit completely in GPU memory and use hardware meshes and hardware texture filtering and GPU shading, rendering on a GPU is often in the range of 10x to 100x faster than CPU (as reported by the people writing production renderers today). Since production scenes are commonly larger than GPU RAM, and often larger than CPU RAM too, it complicates things. Embree is a CPU ray tracing kernel library and OptiX is a GPU ray tracing framework, so they can’t be directly compared, but I think it’s fair to say that OptiX based renderers are typically significantly faster for ideal scenes that fit in memory and use hardware supported primitives, while there are cases where Embree has some huge advantages, like scenes that don’t fit in memory, and Embree has oriented bounding boxes that can make some kinds of scene geometry render much faster.
Dave, I’m curious to see what happens now that the latest Quadro and Tesla parts have 40 or 48 GB of memory.
That’s now plenty for all the geometry of a production scene plus a texture cache, but almost never enough for the base textures themselves.
I like the “use the CPU to do texture lookups while the GPU is tracing shadow rays” solution here. But I thought you or Keith had done a similar out-of-core texture caching implementation for inclusion directly in OptiX. My assumption is that once the texture cache gets full enough, stalling on a Unified Memory page fault or similar wouldn’t be super common (again, it depends on the hit rate).
Does the new cuMemAddressReserve combined with mmap on the host side let you have a giant virtual mmap’ed host side memory region?
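To make the question concrete, here's the device-side half of what I mean: a minimal sketch of the CUDA virtual memory management calls (reserve a huge VA range, back pieces of it with physical memory on demand). Whether and how this composes with a host-side mmap'ed region is exactly the part I'm unsure about; error handling is omitted and the sizes are made up:

    // Sketch of the CUDA driver VMM API: reserve 64 GB of virtual address
    // space and map the first 1 GB of it with physical device memory.
    #include <cuda.h>
    #include <cstdio>

    int main() {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        const size_t reserveSize = 64ull << 30; // virtual address space only
        const size_t chunkSize   =  1ull << 30; // physically backed portion

        CUmemAllocationProp prop = {};
        prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id   = dev;

        size_t granularity = 0;
        cuMemGetAllocationGranularity(&granularity, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);

        // Reserve a large VA range; this costs no physical memory by itself.
        CUdeviceptr base = 0;
        cuMemAddressReserve(&base, reserveSize, 0, 0, 0);

        // Create a physical allocation and map it into the reserved range.
        CUmemGenericAllocationHandle handle;
        cuMemCreate(&handle, chunkSize, &prop, 0);
        cuMemMap(base, chunkSize, 0, handle, 0);

        CUmemAccessDesc access = {};
        access.location = prop.location;
        access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(base, chunkSize, &access, 1);

        printf("reserved %zu GB at %p, mapped first %zu GB (granularity %zu)\n",
               reserveSize >> 30, (void*)base, chunkSize >> 30, granularity);

        cuMemUnmap(base, chunkSize);
        cuMemRelease(handle);
        cuMemAddressFree(base, reserveSize);
        cuCtxDestroy(ctx);
        return 0;
    }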
Yeah, exactly right Simon. The memory is now getting big enough for production geometry, but we still need out of core texture support.
We are shipping some (open source) out-of-core texture loading with OptiX now, led by Mark Leone. Definitely yes, the idea is that after a few re-launches you’ll have all the mipmap texture tiles you need resident, and you can render many samples without a stall.
To be honest I’m not sure whether we’re using mmap for this. (I’ve had my head buried in curves.) Mark is calling the high level concept “cooperative paging”, meaning the GPU and CPU work together to handle page faults and fill new load requests. The user can decide / define what the page fault behavior is, and whether & how they might want to fill the requests. They’re currently working on cache eviction, which will be a big leap when it arrives.
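To give a flavor of the idea, here's a toy sketch of the general request/fill pattern (this is not the actual demand-loading library API, just an illustration with made-up names): the kernel records the pages it needed but didn't have, the CPU fills those requests between launches, and after a few passes most accesses hit resident data.

    // Toy illustration of cooperative paging: GPU records misses, CPU fills
    // them, kernel is relaunched. Not the real OptiX demand-loading API.
    #include <cuda_runtime.h>
    #include <cstdio>

    constexpr int kNumPages = 256;

    __global__ void renderPass(const int* resident, int* requests, int* numRequests)
    {
        int page = blockIdx.x * blockDim.x + threadIdx.x; // pretend each thread wants one page
        if (page >= kNumPages) return;
        if (!resident[page]) {
            int slot = atomicAdd(numRequests, 1);         // GPU side: enqueue a miss
            requests[slot] = page;
        }
        // else: sample the resident page and accumulate into the frame buffer
    }

    int main()
    {
        int *resident, *requests, *numRequests;
        cudaMallocManaged(&resident, kNumPages * sizeof(int));
        cudaMallocManaged(&requests, kNumPages * sizeof(int));
        cudaMallocManaged(&numRequests, sizeof(int));
        for (int i = 0; i < kNumPages; ++i) resident[i] = 0;

        for (int pass = 0; pass < 4; ++pass) {
            *numRequests = 0;
            renderPass<<<(kNumPages + 63) / 64, 64>>>(resident, requests, numRequests);
            cudaDeviceSynchronize();

            // CPU side: fill the requests (load texture tiles / geometry), then relaunch.
            for (int i = 0; i < *numRequests; ++i) resident[requests[i]] = 1;
            printf("pass %d: filled %d page requests\n", pass, *numRequests);
        }
        return 0;
    }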
This basic technique could be used in Chris’ Motunui to fill geometry as well - you could use stand-in boxes for the instances, and a ray hit would fault & load the internal geometry. So, obviously, that’s pretty complicated and not necessarily guaranteed to fit in memory, but when it worked it could save a ton of time. I’m thinking about ways to blend what Chris did with something like this to try and get the advantages of both.
In my experience, rendering a natural landscape with clouds, trees, and grass at 4K resolution took about 40 hours on CPU (Ryzen 2700X), while it took about 1 hour on GPU (1080 Ti). It could have been even faster on the GPU, but the scene was 70GB of geometry and textures, so the GPU spent some time swapping with main memory.
I just have to say that this is really excellent work.
Especially impressed as I've watched this movie at least 100 times over the last year. My 20-month-old daughter is infatuated with it (not to worry, we usually limit viewing to a song-scene or two per sit-down). The songs are involuntarily memorized, and cries of "mana song" (sic) can be heard each time the toddler gets into the car. To the point that my wife, 3-month-old daughter, the oldest, and myself are dressing up for Halloween as Te Fiti, kakamora, Moana, and Maui, respectively.
If the OP sees this: what prompted you to attempt it? Was it just the availability of the dataset, or were you drawn into the story by screaming children as well?
I've always wished Pixar would do what id Software did and release an open-source version of some small portion of their movies for analysis, like this Moana scene.
Because the question on my mind is: given the improvements in rendering, there must be some year by which we can render in real time what once took Pixar a long time to render.
So, what is it? I believe I read that creating the 3D version of Toy Story ran at 24 fps, averaged across the entire film. So, basically, we've already crossed that threshold for Toy Story. (But maybe that was only if you have a render farm?)
So, where are we at? Could we render a plausible Toy Story in real time now? A Bug's Life? Monsters, Inc.?
Last month they discovered a bug in Blender that was the cause of very slow hair rendering. The fix dropped render times for some scenes from 4 minutes to 40 seconds.
So it can indeed be useful to check if things can be improved for old movies.
But I am not sure real time is an option right now. Most movies also use post-processing per frame, which also takes some time.
KH3 from 3 years ago is clearly missing dynamic shadows (e.g., in the movie, when Woody moves his head, the lighting around his face changes based on shadows cast by his hat and nose). Also, the original Toy Story appeared to reflect at least two surfaces off of the floor, while I didn't see any reflections off the floor in KH3. Granted, that's 3 years old now.
Also, Toy Story was definitely rendered at higher than 1080p to make the transfer to 35mm film for large screens.
(Also if I understand correctly) there are two kinds of batches here - the instances and the path segments (or path "depth"). Path segment batches are the outer loop, and instance batches are the inner loop. So in terms of path tracing, the instances being batched is a detail that is hidden from the rendering algorithm. Each launch (or batch of instances) performs partial updates to the render buffers (tmax, normal, etc.), until all the instances have been processed, and then the tracing of the next path segment begins, which loops through the instance batches again.
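In rough pseudo-C++, my mental model of it looks like the sketch below (all of the names are made up and the helpers are stubs; this is just my reading of the write-up, not the actual source):

    // Rough sketch of the two-level batching described above.
    #include <vector>

    struct Snapshot      { /* geometry + GAS/IAS for one batch of instances */ };
    struct RenderBuffers { /* per-pixel tmax, normal, throughput, ... */ };

    void resetHitRecords(RenderBuffers&) {}                     // tmax = inf, clear hit info
    void uploadSnapshot(const Snapshot&) {}                     // copy geometry/BVHs to the GPU
    void launchTraceKernel(const Snapshot&, RenderBuffers&) {}  // keep the closest hit seen so far
    void shadeAndSpawnNextRays(RenderBuffers&) {}               // shade once all instances were seen

    void renderOneSample(const std::vector<Snapshot>& snapshots,
                         RenderBuffers& buffers, int maxDepth)
    {
        // Outer loop: path segments ("depth"). Inner loop: instance batches.
        for (int depth = 0; depth < maxDepth; ++depth) {
            resetHitRecords(buffers);
            for (const Snapshot& snapshot : snapshots) {
                uploadSnapshot(snapshot);
                launchTraceKernel(snapshot, buffers);   // partial update of the render buffers
            }
            // Only after every snapshot has been traced do we know the true
            // closest hit for each ray, so shading and the next segment happen here.
            shadeAndSpawnNextRays(buffers);
        }
    }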
Matt Pharr recently released an early version of PBRTv4, which has been rewritten extensively to make heavy use of OptiX and the GPU. Given that he previously wrote a series on rendering Moana using PBRTv3, I wonder if v4 could be used as a comparison?
I was comparing the reference images with the rendered ones, and this tree is not at the same location:
https://imgur.com/a/JIpgAb8 Some low vegetation (grass) under the palm tree on the left is also absent.
Is this part of the limitations the author discloses?: «Other features of the scene are possible to render but out of my initial scope, notably subdivision surfaces and their displacement maps, and a full Disney BSDF implementation»
It may also be a simple divergence in the data used to render the scenes. The article talks about some material colors that aren't present in the data that make pixel perfect rendering impossible. It wouldn't surprise me if they added or removed a tree or two between whenever the data was exported and when the example renders were made.
In addition to dahart’s suggestions, I’d be curious to see some of the profiling you did. Like, what’s the breakdown of the timing now? Are you PCIe limited, or have you overlapped the I/O with enough compute that it’s only XX% overhead?
How much memory do you need for all the geometry and BVHs? (Particularly post compression). I’m curious if this just barely fits on an A100 or similar.
Thanks! The current version requires around 18GB of total GPU memory. Right now I am heavily bottlenecked by the memory transfer costs, but there is some low-hanging fruit to amortize them by doing much more compute per snapshot transfer. After that, I'm very curious to see what the profiler will show.
"Although OSPRay already supported textures, it only supported image textures, however Ptex is a geometry based texture format baked on top of the underlying meshes [Burley and Lacewell 2008]. Not only does this
mean there is no reasonable way these textures could be converted to 2D images for use in OSPRay, but that OSPRay’s entire view of how textures can be applied to geometry—which was inherently based on image textures—would have to change."
Not the author, and it's a totally reasonable suggestion, but having done it, one decent reason is that baking the Ptex into textures that actually fit in the 2070's RAM requires making camera-specific resolution decisions, adding gutters between adjacent textures, and probably still using out-of-core texture loading anyway. In other words, it's doable but not easy. In the Moana dataset, the textures consume quite a bit more space than the geometry, IIRC.
I tried to run the pbrt scene locally for timing comparisons, and it kept running out of memory. Thanks for that quote; I realize now that I forgot to switch to the pbrt-next branch. In general, this project has a ton of room left for optimization, so I hope to catch up and maybe pass pbrt. But I'd imagine that a commercial CPU renderer would also perform much better than pbrt.
Yeah, PBRT is in an interesting place when it comes to changes and optimization. They have an explicit goal of keeping the current released repository as close to the book ("Physically Based Rendering") as possible, so all of the optimizations and changes go into -next.
Matt's blog talks a _lot_ about all the optimizations he had to do just to get the scene to _load_ in PBRT. It's a beast.
Oh that’s cool, I didn’t realize pbrt-next was that much leaner. Yeah I was asking a nonsense question, but just tongue-in-cheek trying to hint that we shouldn’t compare PBRT run times to Chris’ project to render Moana on a 2070, they’re solving different problems. Chris’ renderer is also brand new and hasn’t had a chance to evolve the way PBRT did. But all that said, it’s very impressive IMO to render the whole scene with (cpu)Ptex using such a small GPU without it taking far longer. I happen to know for a fact that it’s possible to render Moana much faster than cpu-pbrt if you have access to one of the 48 GB models and don’t need to use out-of-core techniques.
This looks mostly like an academic or 'for fun' exercise.
People run Doom on their fridge for fun so why not squeeze the Moana asset through the bottleneck that sits between your CPU and your GPU? :)
Is it practical/useful? Let's put the timing in perspective.
A commercial CPU production renderer, 3Delight, has timing for the Moana asset rendered at 4k on their website.[1]
Time: ~34 minutes.
I asked them for details about the settings they used before posting this, as the page only lists the resolution.
4k resolution, 64 (shading) samples per pixel (spp); ray depths: diffuse 2, specular 2, refraction 4 (or 3, 3, 5, depending on how you count ray depth).
The machine was a contemporary 24-core server at the end of 2018.
Mind you, the image is fine with 64 spp.
Spp are hard to compare between renderers because optimizing path tracers is a lot about sampling.
One renderer will converge to something usable with 1k samples while another just needs 64.
The 3Delight example is rendering all the geometry as subdivision surfaces with displacement (and their own Ptex implementation for texture lookups).
Timing comparisons of a different scene with recent 24-core desktop AMD CPUs suggest that this asset would render much faster in 2020.[2]
The timing shows the issue with GPUs vs CPUs for this kind of asset.
5h for a 1k (!) resolution image with <= 5 bounces and 1024 spp (samples per pixel).
That is terrible.
And that is without using the real (subdivision) geometry or displacement.
I would love to see a breakdown of how much of these 5 hours is owed to the fact that the data doesn't fit on the device.
Using subdivision surfaces and displacement mapping makes the amount of geometry grow exponentially.
I.e., the out-of-core handling would predictably take an exponentially larger part of the render time.
Looking at the numbers I regularly get to see when counseling VFX companies on their rendering pipelines, I don't see GPU offline rendering going anywhere for complex scenes.
And even for simpler scenes where GPUs still have an advantage: with the CPUs AMD has been putting out recently, the gap is becoming very tight, and if you do the math you often pay dearly for having an image a few minutes earlier (not even double-digit minutes or hours earlier).
Regardless of what Nvidia's marketing and some vendors who IMHO wasted years optimizing their renderers for a moving hardware target may want you to believe.
Regarding the latter: another point to consider is that you need to spend time working with/around the hardware limitations/bottlenecks of GPUs for this very "scene doesn't fit on the device" use case.
Someone writing a CPU renderer can spend that time working on the actual renderer itself. This kind of software takes years to develop. Go figure.
Finally, as I expect this to be downvoted because of what I just said: don't take my word for any of the above. Just try it yourself.
The Moana Asset can be downloaded at [3].
A script to convert the entire asset and launch a 3Delight render can be had at [4].
The unlimited core version of the renderer can be downloaded for free, after registering with your email, at [5].
It renders with any number of cores your box has but it adds a watermark if no license is available.
Or you can try their cloud rendering. You get 1,000 free 24-core server minutes, which is plenty to run this test.
It seems like you've missed the point. Chris isn't making any claim that the run time is fast compared to an in-core production renderer. He rendered Moana in 8GB of GPU memory, which is smaller than the input data. From the advertisement, it's clear you are a 3Delight fan (employee?), but I bet 3Delight cannot do that: render Moana without using more than 8GB of CPU memory.
The algorithm here to do out of core rendering is the important part, and it doesn't make sense for you to try to compare in-core CPU rendering to out-of-core GPU rendering.
> I would love to see a breakdown of how much of these 5 hours is owed to the fact that the data doesn't fit on the device.
I already know the answer: it's pretty close to 100% of the time that is spent handling out-of-core requests. That's not surprising, nor is it a bad thing (though it probably can be improved). If it were a CPU renderer doing this - streaming the geometry & BVH - the result would be the same (or much worse if streaming from an SSD instead of external RAM of some sort).
In any case, nice work!