Wednesday, 22 February 2017

Graph Evaluation (Again)

I did not post for a while, it doesn't means I've been doing nothing, actually quite the opposite.

Lately I've been reworking a bit of my graph internals (basically read, decouple some bits), and I had lot of ideas and implemented quite a few, so I decided I wanted to share some theory again.

First thing I already spoke about the fact that multi core evaluation is tricky to get right (basically, you need your nodes to do enough work to make it worthwhile, otherwise you gonna have serious issues or just have something slower). 

A second thought was to eventually have a compiled form of the graph, while that's a bit of buzzword, I also believe that compiled ones will not always be faster.

Why? Wouldn't removing evaluation overhead make things faster?

Yes, but by still having access to your model, there's a serious lot of optimizations that you can perform, so answer is : not always.

So I'm gonna talk a bit about graph evaluation strategies. (I'll speak of data flows mostly)

Let's take the following simple patch, as you can see, nothing really fancy, but that will perfectly fit

1/A very basic graph

A/Lazy recursion

This is by far the simplest to implement (to be honest, you can do it in less than 100 lines of code probably).

Pseudo code:
Start with NodeFinal

function processnode(node)

if node is already processed 

for each input pin  
    get upwards node (info is likely stored in a link structure)
    processnode (parent node connected to input pin)
end for
end function

process node (some form of update/evaluate call)
mark node as processed

Yes, that's about it really, simple to understand and very idiomatic.

Now first things that we can notice:
We will call processnode several times on some of them (for example, Time is used by 3 different nodes)

In our case (Since I did not introduce any form of "graph cutting techniques") we can also see that actually order of execution is immutable (our graph will always run and process nodes in the same order).

So let's optimize that

B/Reversed List

Getting the above graph (i add the screenshot again):

We can deduce our dependency order (if I consider I'll scan for pins in a left to right fashion)

NodeFinal -> Node12 -> Node1 -> Time -> Node2 -> Time -> Node 9 -> Node 2 -> Time -> Node 5 -> Time

As mentioned, we prevent to execute twice, which gives the following order:

NodeFinal -> Node12 -> Node1 -> Time -> Node2  -> Node 9 -> Node 5

Now we need to transform that into a running order, we could reverse the second list, but that will not work.

Why? Simply because for example Node5 needs Time to run before to be updated.

So let's use the first list, and reverse it:
Time -> Node5 -> Time -> Node2 -> Node9 -> Time -> Node 2 ->Time -> Node 1 -> Node 12 -> NodeFinal

This list is correct, but replicates some elements, filtered version is :
Time -> Node 5 -> Node 2 -> Node 9 -> Node 1 -> Node 12 -> Node final

To build the list, we perform a first lazy evaluation pass as seen upwards, but instead of calling Update, we simply add the node to a list.

Once we get that first list, building the reverse order is done as follows:

var reversedList = reverse(dependencylist)
var executionList = new EmptyList
foreach node in reversedList
   if not executionList.Contains(node) then executionList.Add(node)

Store that list somewhere, and then graph evaluation becomes:
foreach node in executionList

As you can see, that removes recursion, and actually this is really easily compilable (just generate llvm/il/source code) and compile

Issue is of course that each time the graph changes, you need to preparse it again, which is not always advisable at runtime (the compilation step might severely hurt).

Now let's add some features in our graph

2/Lazy pins (and virtual links)

Let's take the following graph.

Here we consider that LargeData and Generator1 nodes are really heavy, and feed some data into builder, but does not need frequent updates.

Builder has been built with first pin to be Lazy (eg: only build upwards if the second pin something like Apply is true (which is dealt somehow by the conditions upwards)

Graph is then decomposed int the following path:

Builder -> And -> Condition1 -> UserInput -> Condition2

If Second pin (eg result of and node) is true

LargeData -> Generator1 -> Time

As you can see, now our evaluation is split into 2 different parts (basically, everything non lazy, then eventually the lazy part is required)

This is suddenly harder to convert into list (and kinda difficult to auto compile).
So suddenly the idiomatic purely lazy version above is not so bad after all. You only need to add a condition on the node to tell if you want to go upwards or not.

Another version is to have some form of internal link eg:

Here the switch node will decide (depending on SwitchValue result) to run either from Part1 or Part2

This can be implemented in 2 ways:
Use lazy as above.

Rebuild the graph structure (basically if SwitchValue  0, the internal representation of the graph is like this):

As you can see the user "sees" the first version of the graph, but internally the link is removed, so the Part2 node (and above) are now orphans and never run.

This is also really efficient, but of course if you use an optimization technique above, every time the internal representation changes (which can be every frame), you need to rebuild your structure. 

So deciding to do so means that you need to decide if it's worth it (does the optimization technique + optimized evaluation is still faster that idiomatic version).

Please note those techniques are always defined on runtime (eg: at some point, some node is granted "power" to cut the graph).

As we have seen, they add some complexity to the evaluation part.

Now let's separate our usage into 2 parts : 
  • Authoring : When user is effectively building the patch
  • Runtime : When we are running

In case you want to deploy your application, you will of course have effective gains by optimizing runtime performance version. 

But in many cases we still need fast authoring, even if those are niche scenarios, I always have cases where I need to go modify something very early before a show (or actually , during the show). 

Some people will say it's bad, but in cases where it's not avoidable (position some element properly, add a quick animation or whatever some client urgently requires) this becomes critical not to disrupt the render output too much (basically, freeze or lag).

So the whole key is to balance both (of course you can also provide several graph evaluation strategies, so the user can decide depending on his use case).

3/Dual graph

One very easy technique to balance this part is to maintain two evaluators, one idiomatic version, and one optimized version.

When patch structure changes, we swap to idiomatic (since it has close to zero rebuild time), and build a new optimized version in background.

When optimized version is ready, switch back to it.

And key if you use c#, keep a reference to your old optimized evaluators somewhere, it's likely that you'll have a Gen 1/2 GC trigger, which is not desirable either,a small "memory increase instead" might be preferred.


Now let's introduce something interesting, since all the above was fairly basic.

Basically, let's consider a node as "Take some input and produce some outputs"

We could consider 3 cases of "output production"
  • Deterministic : This is identical to "pure functions" in functional paradigm. Same inputs will always produce same outputs.
  • Non deterministic : This will produce varying results (for example, an animaton filter with keep previous frame result and depend on that)
  • User input, non deterministic: This is the same as above, but is a special case (for example a node that has an editable value)
So let's consider the following patch:

Let's say that Constant (and Constant2) nodes are immutable (basically you set the value when you create the node and you are not allowed to modify it).

Other operators are basic (math operators are fully deterministic).

This patch only needs to run once, it will always produce the same results.

You will of course say : I need to animate my data to show on the screen, and I agree, But I can be remotely sure that in large patches, there are a lot of sections like that.

You could of course use a lazy pin as specified above (below finalvalue), but we could be able to just automate this.

Let's take a second example:

In that case, only the Constant node is constant, since Time node is non deterministic, there is not much we can do, but then having:

Here we can see that all the section in gray can be ran only once and cached, it never needs to run several times.

If now we add user input to the mix:

Here the group on the right only changes when user input is changed, so we can also really easily cache results (as user data will not change ever so often).

We do not need to check for user changes manually (generally editors tend to trigger an event).

Pseudo code for evaluation is a tad more complex.
First when graph is updated, we need to propagate some information downwards as per:

Walk though node recursively, check if it has a Fully non deterministic parent (or a user input based one), mark that in a flag (and in case of user input, store some reference to it).

void ProcessNode(node)
   if node has no deterministic parents, stop here (unless it's the first frame or graph has changed)
   if node has a fully non deterministic parent (anything that depends on Time node in our case), process it
   if node has one "user signal" version, check if user has changed value (some form of dirty flag), and only process it if required.

In our example patch, we can see that unless user is editing value, only FinalValue, +, Add, Divide and Time need to be processed, the rest can be completely ignored most of the time, and this can be automated!

Again, in our example, if we deploy our application, the UserValue node can automatically be changed to a constant node (since user will not have access to editor anymore). So we can also even completely remove the User input, non deterministic case, and stay with only Deterministic of not.

As a side note, in case of deploy, we can also even store the result values we need and just remove the nodes completely.

For example :

Can be transformed to:

Here we store all results that are needed (Sine and Multiplies), and don't need those processors anymore.

Of course, in case nodes operate on large lists, storing results can require some large files, so we can of course keep the option to keep nodes and run it only once (and replace them my constants in the graph right after).

Where is the catch then?
There is one, every node must be annotated, we need to provide metadata in order to indicate about a node Determinism, this can of course easily be achieved using some attributes (in c# example)

What's then important is that each node must be "curated", as a Non deterministic node marked as Non Deterministic will prevent downward optimization, opposite scenario is worse, as it will lead to incorrect results (data will not change whereas it should), but adding those is a small cost versus the reward.


Compiling graphs as flattened structures can feel tempting as first sight, but graph data structure can hold a large amount of "knowledge" which is then lost, so before to decide to do so it can be interesting to take a step back and see if it's worthwhile at all (that for sure, is another debate and there will be pros and cons in each case).

Since this post probably feels long to digest, I'll keep the next section for later, which will include shaders in the mix, so stay tuned.

Sunday, 25 October 2015

Intersections Part2 : Id Maps

In the previous post, I spoke about the ability to perform hit detection using analytical functions.

This works extremely well when we can restrict our use case to it, but now we have some other cases where this is not as ideal:

  • Perform detection on arbitrary shape/3d model.
  • User input is not a pointer anymore, but can also be arbitrary (threshold camera texture, Kinect Body Index)
  • Both previous cases combined together

While we can often perform detection for 3d model by using triangle raycast (I'll keep that one for next post), it can be pretty expensive (specially if we perform a 10 touch hit detection, we need to raycast 10 times).

So instead, one easy technique is to use ID map.

Concept is extremely simple, instead of performing hit with a function, we will render our scene into a UInt texture, where each pixel will be object ID.

Of course it means you have to render your scene another time, but in that case you can also easily use the following:
  • Render to a downsized texture (512*512 is often sufficient)
  • Render either bounding volumes, or simplified versions of our 3d models.
Great thing with this technique, our depth buffer already makes sure that we have closest object ID stored (so we get that for "free").

So now we have our ID map, picking objectID from pointer is trivial:

Code Snippet
  1. Texture2D<uint> ObjectIDTexture;
  3. RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;
  5. float2 MousePosition;
  6. int width = 512;
  7. int height = 424;
  8. [numthreads(1,1,1)]
  9. void CS(uint3 tid : SV_DispatchThreadID)
  10. {
  11.     uint w,h;
  12.     ObjectIDTexture.GetDimensions(w,h);
  14.     float2 p = MousePosition;
  15.     p = p * 0.5f  + 0.5f;
  16.     p.y = 1.0f-p.y;
  17.     p.x *= (float)w;
  18.     p.y *= (float)h;
  20.     uint obj = ObjectIDTexture.Load(int3(p,0));
  21.     RWObjectBuffer[0] = obj;
  23. }

Not much more is involved, we grab the pixel id, store in a buffer that we can retrieve in staging.

In case we need multiple pointer, we only need to grab N pixels instead, so process stays pretty simple (and we don't need to render scene for each pointer).

Now as mentioned before, we might need to perform detection against arbitrary texture.

As a starter, for simplicity, I will restrict the use case to single user texture.

So first we render user into a R8_Uint texture , where 0 means no active user and anything else = active.

We render our object map next in the same resolution.

We create a buffer (same size as object count, uint), that will store how many user pixel hit an object pixel.

Dispatch to perform this count.

Use another Append buffer, that select elements over a minimum account of pixel (this is generally important to avoid noise with camera/kinect textures).

Accumulating pixel hit count is done this way:

Code Snippet
  1. Texture2D<uint> ObjectIDTexture;
  2. Texture2D<float> InputTexture;
  4. RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;
  6. float Minvalue;
  7. int maxObjectID;
  9. [numthreads(8,8,1)]
  10. void CS(uint3 tid : SV_DispatchThreadID)
  11. {
  12.     uint obj = ObjectIDTexture[tid.xy];
  13.     float value = InputTexture[tid.xy];
  15.     if (value > Minvalue && obj < maxObjectID)
  16.     {
  17.         uint oldValue;
  18.         InterlockedAdd(RWObjectBuffer[obj],1,oldValue);
  19.     }
  20. }

Make sure you use InterlockedAdd, as you need atomic operation in that case.

Next we can filter elements:

Code Snippet
  1. StructuredBuffer<uint> HitCountBuffer;
  2. AppendStructuredBuffer<uint> AppendObjectIDBuffer : BACKBUFFER;
  4. int minHitCount;
  6. [numthreads(64,1,1)]
  7. void CS(uint3 tid : SV_DispatchThreadID)
  8. {
  9.     uint c,stride;
  10.     HitCountBuffer.GetDimensions(c,stride);
  11.     if (tid.x >= c)
  12.         return;
  14.     int hitcount = HitCountBuffer[tid.x];
  15.     if (hitcount >= minHitCount)
  16.     {
  17.         AppendObjectIDBuffer.Append(tid.x);
  18.     }
  19. }

This is that easy, of course instead of only rendering ObjectID in the map, we can easily add some extra metadata (triangle ID, closest vertexID) for easier lookup.

Now in order to perform multi user detection (for example, using Kinect2 body Index texture), process is not much different.

Instead of having a buffer of ObjectCount, we create it of ObjectCount*UserCount

Accumulator becomes:

Code Snippet
  1. Texture2D<uint> ObjectIDTexture;
  2. Texture2D<uint> UserIDTexture;
  4. RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;
  6. float Minvalue;
  7. int maxObjectID;
  8. int objectCount;
  10. [numthreads(8,8,1)]
  11. void CS(uint3 tid : SV_DispatchThreadID)
  12. {
  13.     uint obj = ObjectIDTexture[tid.xy];
  14.     uint pid = UserIDTexture[tid.xy];
  16.     if (pid != 255 < maxObjectID)
  17.     {
  18.         uint oldValue;
  19.         InterlockedAdd(RWObjectBuffer[pid*objectCount+obj],1,oldValue);
  20.     }
  21. }

And filtering becomes:

Code Snippet
  1. StructuredBuffer<uint> HitCountBuffer;
  2. AppendStructuredBuffer<uint2> AppendObjectIDBuffer : BACKBUFFER;
  4. int minHitCount;
  5. int objectCount;
  6. [numthreads(64,1,1)]
  7. void CS(uint3 tid : SV_DispatchThreadID)
  8. {
  9.     uint c,stride;
  10.     HitCountBuffer.GetDimensions(c,stride);
  11.     if (tid.x >= c)
  12.         return;
  14.     int hitcount = HitCountBuffer[tid.x];
  15.     if (hitcount >= minHitCount)
  16.     {
  17.         uint2 result;
  18.         result.x = tid.x % objectCount; //objectid;
  19.         result.y = tid.x / objectCount;
  20.         AppendObjectIDBuffer.Append(result);
  21.     }
  22. }

We now have a tuple userid/object id instead, as shown in the following screenshot:

Please also note this technique can also easily be optimized with stencil, setting a bit per user. You get then limited to 8 users tho (7 users in case you also want to reserve one bit for object itself).

You will need one pass per user also (so 6 pass with proper depth stencil state/reference value).

If you lucky enough and can run on Windows10/DirectX11.3, and have a card that allows you, you can also simply do :

Code Snippet
  1. Texture2D<uint> BodyIndexTexture : register(t0);
  3. uint PS(float4 p : SV_Position) : SV_StencilRef
  4. {
  5.     uint id = BodyIndexTexture.Load(int3(p.xy, 0));
  6.     if (id = 255) //No user magic value provided by Kinect2
  7.         discard;
  8.     return id;
  9. }

Here is a simple stencil test rig, to show all of the intermediates:

That's it for part 2 (that was simple no?)

For the next (and last) part, I'll explain a few more advanced cases (triangle raycast, scene precull....)

Intersections Part1

So last month been working on latest commercial project (nothing involving any extreme creative skills), so back into research mode.

I got plenty of new ideas for rendering, and quite some parts of my engine are undergoing some reasonable cleanup (mostly new binding model to ease dx12 transition later on).

There's different areas in my tool that I'm keen on improving, many new parts will be for other blog posts, but one has lately drawn my attention and I really wanted to get this one sorted.

As many of you know (or don't), I've been working on many interactive installations around, from small to (very) large.

One common requirement for those are some form of Hit Detection, you have some input device (Kinect, Camera, Mouse, Touch, Leap....), and you need to know if you hit some object in your scene in order to have those elements to react.

After many years in the industry, I've been developing a lot of routines in that aspect, so I thought it would be nice to have all of that as a decent library (to just pick when needed).

After a bit of conversation with my top coder Eric, we wanted to do a bit of feature list, what do we expect of an intersection engine, then the following came up:
  • We have various scenarios, some routines are better fit to some use cases, so we don't want a "one mode to rule them all". For example, if our objects are near spherical, we don't want to ray cast mesh triangles, ray cast on bounding sphere is appropriate (and of course much faster).
  • We want our routines sandboxed, so 4v/flaretic subpatch, it should be one node with inputs/outputs, cooking done properly inside and optimized. That saves us load time, reduce compilation times for shaders (or allow precompiled), and easier to control workflow (if our routine is not needed it costs 0).
  • We want our library minimal, so actually hit routines should not even create data themselves, they are a better fit as pure behaviours (It also helps to have those routines working in different environments).
  • We don't want to be gpu only, if a case fits better as CPU, then we should use CPU (if preparing buffers costs more time than performing the test directly, then let's just do it directly in cpu).

Next we wanted to decide which type of outputs we needed, this came out:

  • bool/int flag, which indicates if object is hit or not
  • filtered version for hit objects
  • filtered version for non hit objects

Then here are the most important hit detection features we require (they cover a large part of our use cases in general)
  • Mouse/Pointer(s) to 2d shape (in most cases we want rectangle, circle).
  • Pointer(s) to 3d object (with selectable precision, either raycast bounding volume eg sphere/box, or go at triangle level).
  • Area(s) to shapes (rectangle selection)
  • Arbitrary texture to shape (most common scenario for this is infrared camera, or kinect body index texture). In that case we also want the ability to differenciate between user id as well as object id.
  • In any 3d scenario, we also eventually want either closest object or all objects that get from the test.
  • We also have 3 general cases in 3d : Intersect (ray), Containment (is our object inside another test primitive), Proximity (is our object "close enough" to some place).
So once those requirements are set, to perform hit detection we generally have the 2 main following scenarios, you use analytical function or you use a map.

So let's show some examples, if that first post about it, I'll only speak about analytical functions.

In this case, we generally follow the usual pattern, convert our input into the desired structure (point to ray for example), and every mode follow the current pseudo code (c# version)

Code Snippet
  1. bool[] hitResults = new bool[objectCount];
  2. List<int> hitObjectList = new List<int>();
  4. for (int i = 0; i < objectCount; i++)
  5. {
  6.     var obj = testObjects[i];
  8.     bool hr = Performtest(userObject, obj);
  9.     hitResults[i] = hr;
  11.     if (hr)
  12.     {
  13.         hitObjectList.Add(i);
  14.     }
  15. }

Pretty much all test modes will follow this pattern, only difference after is the test function.

Obviously when we start to reach a certain number of elements, this can become slow. And many times, our objects might be on our GPU, so we are not gonna load them back into CPU.

Translating this into hlsl is extremely straightforward, here is some pesudo code for it.

Code Snippet
  1. bool PerformTest(SomeUserObject userInput, SomeStruct object)
  2. {
  3.     return //Perform your intersection/containment/proximity routine here
  4. }
  6. StructuredBuffer<SomeStruct> ObjectBuffers  : register(t0);
  8. RWStructuredBuffer<uint> RWObjectHitResultBuffer : register(u0);
  10. AppendStructuredBuffer<uint> AppendObjectHitBuffer : register(u1);
  11. RWStructuredBuffer<uint> RWObjectHitBuffer : register(u1); //In this case, UAV should have a counter flag
  13. cbuffer cbUserInput : register(b0)
  14. {
  15.     SomeStruct userInput;
  16. };
  18. cbuffer cbObjectData : register(b1)
  19. {
  20.     uint objectCount;
  21. };
  23. [numthreads(128, 1, 1)]
  24. void CS(uint3 i : SV_DispatchThreadID)
  25. {
  26.     if (i.x >= objectCount)
  27.         return;
  29.     uint oid = i.x;
  31.     SomeUserObject object = ObjectBuffers[oid];
  33.     bool hitResult = PerformTest(userInput, object);
  34.     RWObjectHitResultBuffer[oid] = hitResult;
  35.     if (hitResult)
  36.     {
  37.         //If we use append buffer
  38.         AppendObjectHitBuffer.Append(oid);
  40.         //If we use counter buffer
  41.         uint idx = RWObjectHitBuffer.IncrementCounter();
  42.         RWObjectHitBuffer[idx] = oid;
  43.     }
  44. }

As you can see there's no huge difference into that.

It's pretty straightforward to perform ray to sphere/triangle/box as a starter.

Rectangle selection is also extremely simple:

  • Construct a 2d transformation for the screen area to check
  • Multiply inverse by camera projection
  • Build a frustrum from this
  • Perform a object/frustrum test insqtead of ray test.
here is a small range test example

Simple no? ;)

Now I can foresee 2 important question that our acute reader is probably already thinking of:
  • How do we get closest object?
  • What if we perfom several user inputs?

Of course, there are solutions for that.

Closest object.

First we will consider that our test function is also capable of returning distance.

So we modify our code by:

Code Snippet
  1. struct HitResult
  2. {
  3.     uint objectID;
  4.     float distanceToObject;
  5. };
  7. bool PerformTest(SomeUserObject userInput, SomeStruct object, out float distanceToObject)
  8. {
  9.     return //Perform your intersection/containment/proximity routine here
  10. }
  12. StructuredBuffer<SomeStruct> ObjectBuffers  : register(t0);
  14. RWStructuredBuffer<uint> RWObjectHitResultBuffer : register(u0);
  16. AppendStructuredBuffer<HitResult> AppendObjectHitBuffer : register(u1);
  18. cbuffer cbUserInput : register(b0)
  19. {
  20.     SomeStruct userInput;
  21. };
  23. cbuffer cbObjectData : register(b1)
  24. {
  25.     uint objectCount;
  26. };
  28. [numthreads(128, 1, 1)]
  29. void CS(uint3 i : SV_DispatchThreadID)
  30. {
  31.     if (i.x >= objectCount)
  32.         return;
  34.     uint oid = i.x;
  36.     SomeUserObject object = ObjectBuffers[oid];
  38.     float d;
  39.     bool hitResult = PerformTest(userInput, object,  d);
  40.     RWObjectHitResultBuffer[oid] = hitResult;
  41.     if (hitResult)
  42.     {
  43.         HitResult hr;
  44.         hr.objectID = oid;
  45.         hr.distanceToObject = d;
  46.         //If we use append buffer
  47.         AppendObjectHitBuffer.Append(hr);
  48.     }
  49. }

Now our buffer also contains our distance to object, the only leftover is to grab the closest element.

We have 2 ways to work that out:

  • Use Compute shader (Use InterlockedMin to filter closest element, since distance is generally positive there's no float to uint tricks to apply), then perform another pass to check if element distance is equal to minimum.
  • Use Pipeline ; DepthBuffer is pretty good to keep closest element, so we might as well let him do it for us ;)
Using pipeline is extremely easy as well, process is as follow:
  • Create a 1x1 render target (uint), Associated with a 1x1 depth buffer
Prepare an indirect draw buffer (from the UAV counter), and draw as point list, write to pixel 0 in vertex, and pass distance so it's written to depth buffer, since code speaks more, here it is:

Code Snippet
  1. struct HitResult
  2. {
  3.     uint objectID;
  4.     float distanceToObject;
  5. };
  7. StructuredBuffer<HitResult> ObjectHitBuffer : register(u0);
  9. cbuffer cbObjectData : register(b1)
  10. {
  11.     float invFarPlane;
  12. };
  14. void VS(uint iv: SV_vertexID, out float4 p : SV_Position,
  15.     out float objDist : OBJECTDISTANCE,
  16.     out uint objID : OBJECTID)
  17. {    
  18.     p = float4(0, 0, 0, 1); //We render to a 1x1 texture, position is always 0
  19.     HitResult hr = ObjectHitBuffer[iv];
  21.     objID = hr.objectID;
  22.     //Make sure we go in 0-1 range
  23.     objDist = hr.distanceToObject * invFarPlane;
  24. }
  26. void PS(float4 p : SV_Position, float objDist : OBJECTDISTANCE, uint objID : OBJECTID,
  27.     out uint closestObjID : SV_Target0, out float d : SV_Depth)
  28. {
  29.     //Just push object id
  30.     closestObjID = objID;
  31.     d = objDist; //Depth will preserve closest distance
  32. }

Now our pixel contains our closest object (clear to 0xFFFFFFFF so this value will mean "no hit")

To finish for this first part, let's now add the fact that we have multiple "user Inputs".

We want to know the closest object per user.

This is not much more complicated (but of course will cost a test for each user/object).

Code Snippet
  1. struct HitResult
  2. {
  3.     uint objectID;
  4.     float distanceToObject;
  5. };
  7. bool PerformTest(UserInput userInput, SomeStruct object, out float distanceToObject)
  8. {
  9.     return //Perform your intersection/containment/proximity routine here
  10. }
  12. StructuredBuffer<SomeStruct> ObjectBuffers  : register(t0);
  13. StructuredBuffer<UserInput> UserInputBuffer : register(t1);
  15. RWStructuredBuffer<uint> RWObjectHitResultBuffer : register(u0);
  17. RWStructuredBuffer<HitResult> RWObjectHitBuffer : register(u1); //Counter flag
  18. RWStructuredBuffer<uint> RWObjectHitUserIDBuffer : register(u2);
  20. cbuffer cbObjectData : register(b0)
  21. {
  22.     uint objectCount;
  23.     uint userCount;
  24. };
  26. [numthreads(128, 1, 1)]
  27. void CS(uint3 tid : SV_DispatchThreadID)
  28. {
  29.     if (tid.x >= objectCount)
  30.         return;
  32.     uint oid = tid.x;
  34.     SomeUserObject object = ObjectBuffers[oid];
  35.     uint hitCount = 0;
  36.     for (uint i = 0; i < userCount; i++)
  37.     {
  38.         float d;
  39.         bool hitResult = PerformTest(userInput, object, d);
  41.         if (hitResult)
  42.         {
  43.             hitCount++;
  44.             HitResult hr;
  45.             hr.objectID = oid;
  46.             hr.distanceToObject = d;
  48.             uint idx = RWObjectHitBuffer.IncrementCounter();
  49.             RWObjectHitBuffer[idx] = hr;
  50.             RWObjectHitUserIDBuffer[idx] = i;
  51.         }
  52.     }
  53.     RWObjectHitResultBuffer[oid] = hitCount;
  54. }

Now we have a buffer with every hit from every user (here is a small example screenshot):

So instead of using a 1x1 texture, we use a Nx1 texture (where N is user Input count).

Process to get closest element is (almost) the same as per the single input.

Only difference, in Vertex Shader, route the objectID/Distance to the relevant user pixel, and you're set!

That's it for first part, next round, I'll explain how the "map technique works", stay tuned.

Wednesday, 12 August 2015

DirectX12, First impressions

So here we go, Windows 10 is out, and so is "officially" DirectX12 (first official samples are finally now available).

Since I was not part of early access program, I could not see much samples, but setting up pipeline for tests was reasonably easy, even tho you have no idea about "best practices".
I helped a bit fixing some of the remaining issues in SharpDX (and also integrated DirectX11.3 support in there on the way, welcome volume tiled resources and conservative rasterization).

So way before official samples I managed to have most features I needed running, but official samples helped to finally nail down a bit the "last pieces of the puzzle" (mostly swap chain handling and descriptor heaps best practices).

First thing we obviously do is to build a small utility library (to remove the boilerplate and be able to prototype fast), and then play :)

So new API means changes, new approaches, new possibilities, everything that I love.

As I'm generally not dealing with the "general case" (game engines with ton of assets), eg: lot of procedural geometry, tons of different effects permutations, many different scenes... I will of course speak about what it changes for those scenarios.

So.... let's go


Finally (and really I mean it), a resource is just a resource, eg: some place in memory, there are no restrictions anymore about binding.

Before in DirectX11, you could, for example, not create a resource usable both as Vertex/Index buffer and StructuredBuffer.

For vertex buffer it's okay (can just bind as StructuredBuffer and fetch in Vertex Shader using VertexID), but index buffer can only be bound within the pipeline, so either had to copy to byte address, or mimic Append/Counter features with ByteAddressBuffer.

Now I can create a resource, create an Index buffer view for it, and an append or counter structured view as well, no problem, way it should be.

(Note: From DirectX11.2 it was possible to do it in some ways using Tiled resources, you create 2 "empty resources" (one for index, one for structured), and allocate the same tile mapping to each of those, works but a bit hacky).

This is a huge step forward in my opinion, since now I can construct geometry once and seamlessly use it in pipeline or compute.


As well as having resource as locations, now we have several ways to create resources. 

  • Commited : this is roughly equivalent to DirectX11 style, you create a resource and the runtime allocates backing memory for it
  • Placed : Now we can create a Heap (eg: some memory), and place resources in there, so several resources can share the same heap (and even overlap, so for small temporary resources this is rather useful)
  • Reserved : Reserved resources is the Tiled resources equivalent
Heaps can also be of 3 types:
  • Default : same as previously
  • Upload : Cpu accessible for writing
  • Readback : Cpu Accessible for reading

As a side note, the way to upload resources now is totally up to you, recommended way is to allocate an upload resource, write data in there, and use a copy command to copy data to a default resources, since it's mentionned that gpu access to Upload "resources" is slower (which I actually confirmed from some benchmarks).

So you have pretty much full control on how you organize/load your memory (specially as you can have dedicated copy Command Queues, but that one likely for another post).


Queries in DirectX11 were an utter pain (and often some form of disaster waiting to happen).

You had to wait for some point later in the frame, loop until data is available, then get data, Stall party in brief.

Now queries handling is much simpler, you just create a query heap (per query type), which can also contain backing memory for several of those, then use begin/end (except time which now only uses end as it gets gpu timer)

Then when you need to access data, you resolve query in a resource (which can be of any type), so you can either wait for end of frame and readback or also even use that data right away (stream output to indirect draw, I was waiting for this for so long..., bye bye DrawAuto, will no miss you anytime ;)

4/Uav Counters

On the same fashion, uav counters have simply.... (drum rolls) ... disappeared.

Now uav counters are simply backed using a resource, which of course means that you have full read/write control over it in a cpu/gpu fashion (you can even share counter between different resources/uavs, I'm pretty sure I'll find a weird use case for it at some points)

Previously in DirectX11, you could only set initial count before a dispatch in Cpu, which was sometimes quite limiting (quite often just ended up using a small buffer and interlocked functions to mimic append/counter, now all can also be used seamlessly, also a Stream Output query result can now be set as an initial count for an append buffer, which is something I already see myself abusing).

5/Pipeline State Objects

PSO are an obvious huge difference in how you set your pipeline.

Before in DirectX11, you have to set several states individually (blend/rasterizer, shaders...), which of course provided a reasonable level of flexibility (really easy to switch from solid to wireframe for example), but at the expense of a performance hit (as well as have to implement state tracking).

Now the whole pipeline (shaders/states/input layout, all except resource binding and render targets) is created up front and sent in a single call.

This offers the obvious advantage of a very big performance gain (since now the card know up front what to do and doesn't need to "reconstruct" the pipeline every draw, but pipeline states are expensive to create, so don't expect to tweak small states and have an immediate feedback (pso creation cost some tenth of milliseconds, please note they can of course be created async).

PSO can also be cached (hence also serialized to disk), so this can improve loading times drastically for applications in runtime mode.

So that's it for the first overview part, there's of course many more new features and changes, but let's keep those for another post.


Monday, 15 June 2015

Resource Handling with Tiled Resources in DirectX11.2

After upgrading my particle system, the next part that needs my attention is my deferred renderer.

It has all type of lights (using compute shader when possible, pixel shaders if light also has shadows), and all the standard usual suspects (HBAO, Dof, Bokeh, Tonemap...)

Now I started to upgrade my shaders for the next level:

  • Msaa support with subpixel (sample shading)
  • Allow better AO usage (AO Factor can be set as low quality or each light can have an AO factor)
  • Better organization of some calculations (to avoid them to be done twice).
  • Some other cool things ;)
Before I start to revamp the glue code to handle all this, as soon as you start to use Msaa targets  (this is no news of course), your memory footprint grows quite drastically.

In my use case, since I'm not dealing with the usual "single level game scenario", I can also have several full HD (or more, or less) which all need to be rendered every frame and composited. 
I looked a bit at my memory usage, and while it's not too bad (reasonably efficient usage of pools and temporary resources), I thought I could start to have a proper think about it before to start coding ;)

So generally when we render scenes, we have several type of resources lifetimes:
  • "Scene lifetime" : Those are resources which live with your scene, so until you decide to unload your whole scene, those resources must live. A good example is some particle buffers, as they are read write, they need to be persisted across frames.
  • "Frame lifetime" : Those are the ones that we use for a single frame, often some intermediate results, that needs to be persisted across a sufficiently long part in the frame duration. For example, Linear Depth is quite often required for a long part in your post processing pipeline, since it's used by a decent amount of post processors.
  • "Scoped lifetime" : Those have a very short lifetime (generally within a unit of work/function call)
When I did a first memory profile test, I could see that actually a lot of my memory footprint is caused by those Scoped resources, so I decided to first focus on those.

So as a starter, here are my local resources for some of my post processors
  • Depth of field: 1 target for CoC (R16f), 1 target for temp blur (4 channels, format is renderer dependent)
  • Hbao : 4 targets for AO + blur (2 are R16f, other 2 are R16G16f)
  • Bokeh : 1 buffer for sprite filtering, 1 target for overdraw (renderer dependent).

Now for those short lived resources, you can handle them in the following way.

  • Create/Dispose : you create resources every time they are needed and release become to leave the function, in c# code, this would look like the traditional use pattern as:

Code Snippet
  1. using (var buffer = DX11StructuredBuffer.CreateAppend<float>(device, 1024))
  2. {
  3.     //Do some work
  5. } //Buffer is now disposed

While this is a natural pattern in c#, it is not designed to work well with real time graphics (resource creation is expensive, and creating / releasing gpu resources all the time is not such a good idea, memory fragmentation looming).

  • Resource Pool : instead of creating resources all the time, to create a small wrapper around it to keep a isLocked flag. This looks this way in c#:

Code Snippet
  1. var buffer = Device.ResourcePool.LockStructuredBuffer<float>(1024);
  3. //do something with buffer : buffer.Element.ShaderView;
  5. buffer.UnLock(); // Markas free to reuse

When we request a resource, the pool will check if a buffer with the required flags/stride is available, if this is the case, mark it as "In Use" and return it. If no buffer matching specifications is found, create and return a new one.

This scheme is quite popular (I've seen this in Unity and many other code bases), has an advantage of being simple, but also has some issues.

First, you need to keep a pool per resource type, eg: one for textures (of each type), one for buffers.

Second, the biggest disadvantage (specially for Render Targets), we need an exact size and format.
We can certainly optimize format support using Typeless resources (but you still need to create views in that case), but for size that's a no go (or at least a non practical thing), since we would often need to implement our own sampler (which is not a big deal, except for Anisotropic, but that would badly pollute our shader code base). Also we would need a bit of viewport/scissor gymnastic. Again, not that hard but really not convenient.

So if you render 3 scenes, each with different resolutions, your pool starts to collect a lot of resources of different sizes, your resource lists become bigger....

Of course you can clear any unused resources from time to time (eg: Dispose anything that has not been used), add a number of frames since not used and threshold that (yay, let's write a garbage collector for GPU resources, hpmf....).

Nevertheless, I find that for "Frame lifetime" (and eventually a small subset of "Scene lifetime") resources, this model fits reasonably well, so I'll definitely keep it for a while (I guess DX12 Heaps will change that part, but let's keep DirectX12 for later posts ;)

So now we have clearly seen the problem, for my scoped resources, if I go back to the ao->dof->bokeh case, I have to create 6 targets + one buffer (one of them can be reused, but lot of intermediates are in different formats depending on which port processing I'm currently applying)

Adding a second scene with a different resolution, that's of course 12 targets.

One main thing is, all this post processing is not applied at the same time (since your GPU serializes commands anyway). So all that memory could be happily shared. But we haven't got enough fine tuned access to gpu memory for this (again, resources pointing to the same locations are now trivial in dx12, but here still in dx11). So it looks like a dead end.

In the mean time in the Windows 8.1 world, some great Samaritan (s) have introduced a few new features, one of them called Tiled Resources.

Basically a resource created as Tiled has no initial memory, you have to map Tiles (which are 64k chunks of memory) to them. Memory tiles are provided by a buffer created with a tile pool attribute.

Code Snippet
  1. BufferDescription bdPool = new BufferDescription()
  2. {
  3.     BindFlags = BindFlags.None,
  4.     CpuAccessFlags = CpuAccessFlags.None,
  5.     OptionFlags = ResourceOptionFlags.TilePool,
  6.     SizeInBytes = memSize,
  7.     Usage = ResourceUsage.Default
  8. };

So you can create a huge resource with no memory, and assign tiles to some parts of it depending on your scene.

This of course has a wide use for games (terrain/landscapes streaming, large shadow map rendering), and most examples follow that direction (check for sparse textures if you want documentation about those).

Then I noticed in one slide (forgot if it was from NVidia or Microsoft), "Tiled resources can eventually be used for more aggressive memory packing of short lived data".

There was no further explanation, but that sounds like my use case, so obviously, let's remove the "eventually" word and try it (here understand : have it working).

So in that case we think about it in reverse. Instead of having a large resources backed by a pool and which is partly updated/cleared, We provide a back end, and allocate the same tiles to different resources (of course they need to belong to different unit of work, the ones that need to be used at the same time must not overlap).

So let's do a first try, and create two Tiled buffers, which point to the same memory location.

Code Snippet
  1. SharpDX.Direct3D11.BufferDescription bd = new BufferDescription()
  2. {
  3.     BindFlags = BindFlags.ShaderResource | BindFlags.UnorderedAccess,
  4.     CpuAccessFlags = CpuAccessFlags.None,
  5.     OptionFlags = ResourceOptionFlags.BufferAllowRawViews | ResourceOptionFlags.Tiled,
  6.     SizeInBytes = memSize,
  7.     Usage = ResourceUsage.Default,
  8.     StructureByteStride = 4
  9. };
  11. SharpDX.Direct3D11.BufferDescription bd2 = new BufferDescription()
  12. {
  13.     BindFlags = BindFlags.ShaderResource | BindFlags.UnorderedAccess,
  14.     CpuAccessFlags = CpuAccessFlags.None,
  15.     OptionFlags = ResourceOptionFlags.Tiled,
  16.     SizeInBytes = memSize,
  17.     Usage = ResourceUsage.Default,
  18.     StructureByteStride = 4
  19. };
  21. var Buffer1 = DX11StructuredBuffer.CreateTiled<int>(device, elemCount);
  22. var Buffer2 = DX11StructuredBuffer.CreateTiled<int>(device, elemCount);

Here I show the arguments of CreateTiled static constructor, for readability.

And we need to provide a backend for it:

Code Snippet
  1. SharpDX.Direct3D11.BufferDescription bdPool = new BufferDescription()
  2. {
  3.     BindFlags = BindFlags.None,
  4.     CpuAccessFlags = CpuAccessFlags.None,
  5.     OptionFlags = ResourceOptionFlags.TilePool,
  6.     SizeInBytes = 65536,
  7.     Usage = ResourceUsage.Default
  8. };
  10. var bufferPool = new SharpDX.Direct3D11.Buffer(device, bdPool);

Now we assign the same tile(s) to each buffer like this:

Code Snippet
  1. var rangeFlags = new TileRangeFlags[] { TileRangeFlags.None };
  2. context.Context.UpdateTileMappings(resource, 1, new TiledResourceCoordinate[] { }, new TileRegionSize[] { }, this.tilePoolBuffer, 1, rangeFlags, new int[] { 0 }, new int[] { }, TileMappingFlags.None);

We do this for each buffer

Next a simple test for it, create an Immutable resource, with some random data, same size as our tiled buffer (not pool)

Use copy resource on either Buffer1 or Buffer2 (not both).

Create two staging resources s1 and s2

Readback data from buffer1 into s1 and Buffer2 into s2.

Surprise, we then have data uploaded and copied from our immutable buffer in both s1 and s2.

So now we have the proof of concept working, we create a small tile pool (with an inital memory allocation).

Now for each post processor, we register our resources (and update tile offset accordingly to avoid overlap, we also need to eventually pad buffers/textures since our start location needs to be aligned to a full tile).

Before that, we need to check if our pool is big enough, and resize if required, the beauty of it, this is not destructive (increasing pool size will allocate new Tiles but will preserve mappings of existing ones), so this is done as:

Code Snippet
  2. context.Context.ResizeTilePool(this.tilePoolBuffer, newPageCount * PageSize);

And that's more or less it, now our pool size is equal to the largest unit of work instead of adding between them.

It simply means, in the case above, that our pool size is the sum of the size of the largest effector (which is hbao in the example).

Adding another post processing chain somewhere else will reuse that same memory for all the short lived objects, so if I start to have 5 scenes rendering that's a quite significant gain.

As a side note registration looks (for now) this way:

Code Snippet
  1. this.Device.SharedTiledPool.BeginCollect();
  3. lindepth = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R32_Float);
  4. rthbaox = this.Device.SharedTiledPool.PlaceRenderTarget(context,DepthTexture.Width, DepthTexture.Height, Format.R16_Float);
  5. rthbaoy = this.Device.SharedTiledPool.PlaceRenderTarget(context,DepthTexture.Width, DepthTexture.Height, Format.R16G16_Float);
  6. rtblurx = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R16G16_Float);
  7. rtblury = this.Device.SharedTiledPool.PlaceRenderTarget(context, DepthTexture.Width, DepthTexture.Height, Format.R16_Float);
  9. this.Device.SharedTiledPool.EndCollect();

Now of course I had a performance check Pool vs Pool, which ended up on a draw (so I keep the same performances, no penalty, always a good thing), and here is a small memory profiling.

I render one scene in full hd + hdr, and a second scene in half hd + hdr

Case 1 (old pool only):

  • Global pool : 241 459 200
  • Tiled Pool: 0

Case 2 (starting to use shared pool):

  • Global pool: 109 670 400
  • Tiled pool : 58 064 896
  • Total memory: 167 735 296

So ok, 70 megs gain in a day where some people will say that you have 12 gigs of ram in a Titan card is meaningless, but well :
  • Most people don't have a Titan card. (I normally plan on 4gb cards when doing projects).
  • Adding a new scene will not change the tiled pool size, and increase the global pool in a much smaller fashion.
  • If you start to add Msaa or render 3x full HD, you can expect a larger gain
  • When you start to have a few assets in the Mix (like a 2+ gigs car model, never happened to me did it? ;) a hundred meg can make a huge difference.
  • For cards that don't support tiled resources, the technique is really easily swappable, so it's not a big deal to fallback to global pool if feature is not supported (or leave the user decide).
  • I applied it quickly as a proof of concept and only on 3 effectors, now this work I can also more aggressively optimize the post processing pipeline chain in the same way (and actually also anything that needs temporary resources in my general rendering, and there's a lot).
That's it for now, as a side note, I have some other pretty cool use cases for this, they will likely end up here when I'll have implemented them.