Wednesday, 22 February 2017

Graph Evaluation (Again)

I haven't posted for a while, but that doesn't mean I've been doing nothing; actually quite the opposite.

Lately I've been reworking a bit of my graph internals (read: decoupling some bits). I had a lot of ideas and implemented quite a few, so I decided to share some theory again.

First, I already spoke about the fact that multi core evaluation is tricky to get right (basically, your nodes need to do enough work to make it worthwhile, otherwise you're going to have serious issues or just end up with something slower).

A second thought was to eventually have a compiled form of the graph. While that's a bit of a buzzword, I also believe that compiled graphs will not always be faster.

Why? Wouldn't removing evaluation overhead make things faster?

Yes, but by still having access to your model, there are a lot of optimizations that you can perform, so the answer is: not always.

So I'm going to talk a bit about graph evaluation strategies (I'll mostly speak about data flows).

Let's take the following simple patch. As you can see, nothing really fancy, but it will fit our purpose perfectly.

1/A very basic graph

A/Lazy recursion

This is by far the simplest to implement (to be honest, you can probably do it in less than 100 lines of code).

Pseudo code:
Start with NodeFinal

function processnode(node)

    if node is already processed, return

    for each input pin
        get upwards node (info is likely stored in a link structure)
        processnode (parent node connected to input pin)
    end for

    process node (some form of update/evaluate call)
    mark node as processed
end function

Yes, that's about it really, simple to understand and very idiomatic.
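
For reference, here is a minimal Python sketch of the lazy recursion above. The `Node` class, the `update` stand-in and the wiring are assumptions made for illustration only (the wiring is one plausible reading of the patch, since the exact links are only visible in the screenshot), so treat it as a sketch rather than the actual implementation.

```python
class Node:
    """A graph node; `inputs` lists the upstream nodes feeding its input pins."""

    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)
        self.processed = False

    def update(self, log):
        # Stand-in for real work: just record that this node ran.
        log.append(self.name)


def process_node(node, log):
    if node.processed:
        return  # already evaluated during this pass
    for parent in node.inputs:
        process_node(parent, log)  # pull upstream nodes first
    node.update(log)
    node.processed = True


# Assumed wiring (hypothetical, pins scanned left to right).
time = Node("Time")
node1 = Node("Node1", [time])
node2 = Node("Node2", [time])
node5 = Node("Node5", [time])
node12 = Node("Node12", [node1, node2])
node9 = Node("Node9", [node2, node5])
node_final = Node("NodeFinal", [node12, node9])

log = []
process_node(node_final, log)
print(log)
```

In a real evaluator the `processed` flags would need to be reset (or replaced by a frame counter) before each new pass.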

Now the first things we can notice:
processnode will be called several times on some nodes (for example, Time is used by 3 different nodes).

In our case (since I did not introduce any form of "graph cutting techniques"), we can also see that the order of execution is actually immutable (our graph will always run and process nodes in the same order).

So let's optimize that

B/Reversed List

Taking the above graph (I add the screenshot again):

We can deduce our dependency order (assuming I scan pins in a left to right fashion):

NodeFinal -> Node12 -> Node1 -> Time -> Node2 -> Time -> Node9 -> Node2 -> Time -> Node5 -> Time

As mentioned, we prevent a node from executing twice, which gives the following order:

NodeFinal -> Node12 -> Node1 -> Time -> Node2 -> Node9 -> Node5

Now we need to transform that into a running order. We could reverse the second list, but that will not work.

Why? Simply because, for example, Node5 needs Time to run before it can be updated.

So let's use the first list, and reverse it:
Time -> Node5 -> Time -> Node2 -> Node9 -> Time -> Node2 -> Time -> Node1 -> Node12 -> NodeFinal

This list is correct, but replicates some elements. The filtered version (keeping the first occurrence of each node) is:
Time -> Node5 -> Node2 -> Node9 -> Node1 -> Node12 -> NodeFinal

To build the list, we perform a first lazy evaluation pass as seen above, but instead of calling Update, we simply add the node to a list every time we visit it (without the "already processed" check, so repeats are kept).

Once we get that first list, building the reverse order is done as follows:

var reversedList = reverse(dependencylist)
var executionList = new EmptyList
foreach node in reversedList
   if not executionList.Contains(node) then executionList.Add(node)

Store that list somewhere, and then graph evaluation becomes:
foreach node in executionList
   process node (update/evaluate call)

As you can see, that removes recursion, and this is actually really easy to compile (just generate LLVM/IL/source code and compile it).
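
The list building described above can be sketched in Python as follows. The `GRAPH` dictionary is an assumed wiring, chosen so that a left to right scan reproduces the dependency order listed earlier; the function names are made up for the example.

```python
# Assumed wiring: node -> upstream nodes, in left to right pin order.
GRAPH = {
    "NodeFinal": ["Node12", "Node9"],
    "Node12": ["Node1", "Node2"],
    "Node9": ["Node2", "Node5"],
    "Node1": ["Time"],
    "Node2": ["Time"],
    "Node5": ["Time"],
    "Time": [],
}


def dependency_list(graph, node, out=None):
    """Record every visit, repeats included (no 'already processed' check)."""
    if out is None:
        out = []
    out.append(node)
    for parent in graph[node]:
        dependency_list(graph, parent, out)
    return out


def execution_list(graph, final_node):
    """Reverse the dependency list, keeping the first occurrence of each node."""
    seen = set()
    order = []
    for node in reversed(dependency_list(graph, final_node)):
        if node not in seen:
            seen.add(node)
            order.append(node)
    return order


deps = dependency_list(GRAPH, "NodeFinal")
order = execution_list(GRAPH, "NodeFinal")
print(order)
```

The sketch uses a set for the membership test instead of `executionList.Contains`, which only matters for large graphs; the resulting order is the same.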

The issue is of course that each time the graph changes, you need to preparse it again, which is not always advisable at runtime (the compilation step might hurt severely).

Now let's add some features in our graph

2/Lazy pins (and virtual links)

Let's take the following graph.

Here we consider that the LargeData and Generator1 nodes are really heavy and feed some data into Builder, but do not need frequent updates.

Builder has been built with its first pin set as Lazy (eg: only evaluate upwards if the second pin, something like Apply, is true, which is decided by the conditions upwards).

The graph is then decomposed into the following paths:

Builder -> And -> Condition1 -> UserInput -> Condition2

If the second pin (eg: the result of the And node) is true:

LargeData -> Generator1 -> Time

As you can see, our evaluation is now split into 2 different parts (basically everything non lazy, then the lazy part if it is required).

This is suddenly harder to convert into a list (and kinda difficult to auto compile).
So suddenly the idiomatic, purely lazy version above is not so bad after all. You only need to add a condition on the node to tell whether you want to go upwards or not.
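
A lazy pin in the recursive evaluator could look roughly like the Python sketch below. Everything here is hypothetical (the `Node` layout, the `pull_lazy` callback, the simplified patch where a single flag stands in for the And/Condition nodes); it only illustrates that the heavy branch is walked on demand. For brevity, eager and lazy pins are kept in two separate lists rather than in pin order.

```python
class Node:
    def __init__(self, name, compute, inputs=(), lazy_inputs=()):
        self.name = name
        self.compute = compute                # callable(eager_values, pull_lazy)
        self.inputs = list(inputs)            # always evaluated
        self.lazy_inputs = list(lazy_inputs)  # evaluated only when requested


def evaluate(node, trace):
    eager = [evaluate(parent, trace) for parent in node.inputs]

    def pull_lazy(index):
        # Walk upwards through a lazy pin only when the node asks for it.
        return evaluate(node.lazy_inputs[index], trace)

    value = node.compute(eager, pull_lazy)
    trace.append(node.name)
    return value


def make_builder():
    state = {"built": None}  # last built result, reused while Apply is false

    def compute(eager, pull_lazy):
        apply = eager[0]
        if apply:
            state["built"] = pull_lazy(0)  # the heavy branch runs only now
        return state["built"]

    return compute


flag = {"apply": False}  # stands in for the result of the conditions
time = Node("Time", lambda eager, pull: 42)
generator = Node("Generator1", lambda eager, pull: eager[0] + 1, [time])
large_data = Node("LargeData", lambda eager, pull: [eager[0]] * 3, [generator])
condition = Node("And", lambda eager, pull: flag["apply"])
builder = Node("Builder", make_builder(), inputs=[condition], lazy_inputs=[large_data])

trace_idle = []
result_idle = evaluate(builder, trace_idle)

flag["apply"] = True
trace_apply = []
result_apply = evaluate(builder, trace_apply)
```

This sketch leaves out the "already processed" check to stay short, so a node shared by several branches would run more than once.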

Another version is to have some form of internal link, eg:

Here the Switch node will decide (depending on the SwitchValue result) to run either Part1 or Part2.

This can be implemented in 2 ways:
Use lazy as above.

Rebuild the graph structure (basically if SwitchValue is 0, the internal representation of the graph is like this):

As you can see, the user "sees" the first version of the graph, but internally the link is removed, so the Part2 node (and everything above it) is now orphaned and never runs.

This is also really efficient, but of course if you use one of the optimization techniques above, every time the internal representation changes (which can be every frame), you need to rebuild your structure.

So deciding to do so means that you need to check if it's worth it (is the optimization step + optimized evaluation still faster than the idiomatic version?).

Please note that those techniques are always decided at runtime (eg: at some point, some node is granted the "power" to cut the graph).

As we have seen, they add some complexity to the evaluation part.

Now let's separate our usage into 2 parts:
  • Authoring: when the user is actually building the patch
  • Runtime: when we are running

In case you want to deploy your application, you will of course get real gains by optimizing the runtime version.

But in many cases we still need fast authoring. Even if those are niche scenarios, I always have cases where I need to modify something very shortly before a show (or actually, during the show).

Some people will say it's bad, but in cases where it's not avoidable (position some element properly, add a quick animation or whatever some client urgently requires), it becomes critical not to disrupt the render output too much (basically, no freeze or lag).

So the whole key is to balance both (of course you can also provide several graph evaluation strategies, so users can decide depending on their use case).

3/Dual graph

One very easy technique to balance this is to maintain two evaluators: one idiomatic version and one optimized version.

When the patch structure changes, we swap to the idiomatic one (since it has close to zero rebuild time) and build a new optimized version in the background.

When the optimized version is ready, we switch back to it.

And a key point if you use C#: keep a reference to your old optimized evaluators somewhere. Otherwise it's likely that you'll get a Gen 1/2 GC triggered, which is not desirable either; a small memory increase might be preferred instead.
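
A rough Python sketch of the swap logic follows, assuming a background thread does the rebuild. `DualEvaluator`, `build_execution_list` and the `retired` list are names invented for this example; the "optimized" form is just a flat execution list here, and the `retired` list only mirrors the C# advice above (Python's memory management behaves differently).

```python
import threading


def build_execution_list(graph, final_node):
    """Optimized form: a flat, dependency-ordered list (each node visited once)."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for parent in graph[node]:
            visit(parent)
        order.append(node)

    visit(final_node)
    return order


class DualEvaluator:
    def __init__(self, final_node):
        self.final_node = final_node
        self.graph = {}
        self.optimized = None  # None means: use the idiomatic (recursive) path
        self.retired = []      # old optimized lists, kept alive on purpose
        self._lock = threading.Lock()
        self._version = 0
        self._builder = None

    def set_graph(self, graph):
        """Call whenever the patch structure changes."""
        with self._lock:
            self._version += 1
            version = self._version
            if self.optimized is not None:
                self.retired.append(self.optimized)
            self.optimized = None  # swap back to idiomatic immediately
            self.graph = graph
        self._builder = threading.Thread(target=self._rebuild, args=(graph, version))
        self._builder.start()

    def _rebuild(self, graph, version):
        order = build_execution_list(graph, self.final_node)
        with self._lock:
            if version == self._version:  # drop results built for a stale graph
                self.optimized = order

    def mode(self):
        with self._lock:
            return "idiomatic" if self.optimized is None else "optimized"

    def wait_for_optimized(self):
        if self._builder is not None:
            self._builder.join()


evaluator = DualEvaluator("NodeFinal")
evaluator.set_graph({"NodeFinal": ["Time"], "Time": []})
evaluator.wait_for_optimized()
print(evaluator.mode())
```

The version counter is one way to make sure a slow rebuild for an older graph never overwrites the result for a newer one.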


Now let's introduce something interesting, since all the above was fairly basic.

Basically, let's consider a node as "take some inputs and produce some outputs".

We could consider 3 cases of "output production":
  • Deterministic: this is identical to "pure functions" in the functional paradigm. The same inputs will always produce the same outputs.
  • Non deterministic: this will produce varying results (for example, an animation filter which keeps the previous frame's result and depends on it)
  • User input, non deterministic: this is the same as above, but is a special case (for example a node that has an editable value)
So let's consider the following patch:

Let's say that Constant (and Constant2) nodes are immutable (basically you set the value when you create the node and you are not allowed to modify it).

Other operators are basic (math operators are fully deterministic).

This patch only needs to run once, it will always produce the same results.

You will of course say: I need to animate my data to show on the screen, and I agree. But I can be fairly sure that in large patches there are a lot of sections like that.

You could of course use a lazy pin as described above (below FinalValue), but we should be able to just automate this.

Let's take a second example:

In that case, only the Constant node is constant; since the Time node is non deterministic, there is not much we can do. But then consider:

Here we can see that the whole section in gray can be run only once and cached; it never needs to run again.

If now we add user input to the mix:

Here the group on the right only changes when the user input changes, so we can also really easily cache results (as user data will not change that often).

We do not need to check for user changes manually (generally editors tend to trigger an event).

The pseudo code for evaluation is a tad more complex.
First, when the graph is updated, we need to propagate some information downwards as follows:

Walk through the nodes recursively, check if each one has a fully non deterministic parent (or a user input based one), and mark that in a flag (in the case of user input, also store some reference to it).

void ProcessNode(node)
   if node has no non deterministic (or user input) parents, stop here (unless it's the first frame or the graph has changed)
   if node has a fully non deterministic parent (anything that depends on the Time node in our case), process it
   if node has a "user signal" parent, check if the user has changed the value (some form of dirty flag), and only process it if required.

In our example patch, we can see that unless the user is editing the value, only FinalValue, +, Add, Divide and Time need to be processed. The rest can be completely ignored most of the time, and this can be automated!
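
To make the propagation and the per-frame decision concrete, here is a small Python sketch. The node kinds, the `dynamic`, `user_sources` and `dirty` fields and the example patch are all assumptions for illustration (the patch is a made-up one, not the screenshot's); it works on a dependency-sorted list rather than recursing.

```python
import math

DETERMINISTIC = "deterministic"
NON_DETERMINISTIC = "non_deterministic"
USER_INPUT = "user_input"


class Node:
    def __init__(self, name, kind, compute, inputs=()):
        self.name, self.kind, self.compute = name, kind, compute
        self.inputs = list(inputs)
        self.dynamic = False       # depends on a non deterministic node
        self.user_sources = set()  # user input nodes this node depends on
        self.dirty = False         # only meaningful for USER_INPUT nodes
        self.cache = None


def propagate(order):
    """Push determinism info downwards; `order` lists parents before children."""
    for node in order:
        node.dynamic = node.kind == NON_DETERMINISTIC or any(p.dynamic for p in node.inputs)
        node.user_sources = {node} if node.kind == USER_INPUT else set()
        for parent in node.inputs:
            node.user_sources |= parent.user_sources


def evaluate(order, first_frame=False):
    """Run one frame and return the names of the nodes actually processed."""
    processed = []
    for node in order:
        user_changed = any(source.dirty for source in node.user_sources)
        if first_frame or node.dynamic or user_changed:
            node.cache = node.compute(*[p.cache for p in node.inputs])
            processed.append(node.name)
    for node in order:
        node.dirty = False
    return processed


# Hypothetical patch.
clock = {"t": 0.0}
user = {"value": 1.0}
constant = Node("Constant", DETERMINISTIC, lambda: 2.0)
constant2 = Node("Constant2", DETERMINISTIC, lambda: 3.0)
multiply = Node("Multiply", DETERMINISTIC, lambda a, b: a * b, [constant, constant2])
user_value = Node("UserValue", USER_INPUT, lambda: user["value"])
sine = Node("Sine", DETERMINISTIC, math.sin, [user_value])
time = Node("Time", NON_DETERMINISTIC, lambda: clock["t"])
add = Node("Add", DETERMINISTIC, lambda a, b: a + b, [time, multiply])
final_value = Node("FinalValue", DETERMINISTIC, lambda a, b: a + b, [add, sine])
order = [constant, constant2, multiply, user_value, sine, time, add, final_value]

propagate(order)
frame1 = evaluate(order, first_frame=True)   # everything runs once
clock["t"] = 1.0
frame2 = evaluate(order)                     # only the Time-dependent chain
user["value"] = 0.0
user_value.dirty = True                      # what an editor event would set
frame3 = evaluate(order)                     # user chain + Time chain
```

In this sketch the constant branch (Constant, Constant2, Multiply) runs on the first frame only, and the UserValue/Sine branch runs again only when the dirty flag is raised.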

Again, in our example, if we deploy our application, the UserValue node can automatically be changed to a constant node (since the user will not have access to the editor anymore). So we can even completely remove the "User input, non deterministic" case and stay with only deterministic or not.

As a side note, when deploying, we can even store the result values we need and just remove the nodes completely.

For example :

Can be transformed to:

Here we store all results that are needed (Sine and Multiplies), and don't need those processors anymore.

Of course, in case nodes operate on large lists, storing results can require some large files, so we can keep the option to keep the nodes and run them only once (and replace them by constants in the graph right after).
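
The deploy-time replacement could be sketched like this in Python. The patch format (name mapped to a function and its input names), the `bake` function and the example nodes are all hypothetical; the idea is only that static nodes feeding a dynamic node get replaced by their stored value, and everything above them disappears.

```python
import math

# Hypothetical patch: name -> (function, input names).
PATCH = {
    "Constant": (lambda: 0.5, []),
    "Sine": (math.sin, ["Constant"]),
    "Multiply": (lambda a: a * 2.0, ["Sine"]),
    "Time": (lambda: 0.0, []),  # placeholder clock
    "Add": (lambda a, b: a + b, ["Time", "Multiply"]),
}


def bake(patch, non_deterministic, outputs=()):
    """Return (baked_patch, stored_values) for a deployed build."""
    values = {}

    def is_static(name):
        _, inputs = patch[name]
        return name not in non_deterministic and all(is_static(i) for i in inputs)

    def value_of(name):
        if name not in values:
            func, inputs = patch[name]
            values[name] = func(*[value_of(i) for i in inputs])
        return values[name]

    baked, stored = {}, {}
    for name, (func, inputs) in patch.items():
        if is_static(name):
            continue  # kept only if a dynamic node (or an output) needs it
        baked[name] = (func, inputs)
        for parent in inputs:
            if is_static(parent):
                stored[parent] = value_of(parent)
    for name in outputs:
        if is_static(name):
            stored[name] = value_of(name)
    for name, value in stored.items():
        baked[name] = (lambda v=value: v, [])  # constant stand-in, no inputs
    return baked, stored


baked, stored = bake(PATCH, non_deterministic={"Time"})
print(sorted(baked))
```

Here Constant and Sine are dropped entirely, and Multiply survives only as a stored value feeding Add.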

Where is the catch then?
There is one: every node must be annotated. We need to provide metadata indicating a node's determinism; this can of course easily be achieved using attributes (in C#, for example).

What's important then is that each node must be "curated": a deterministic node marked as non deterministic will prevent downward optimization. The opposite scenario is worse, as it will lead to incorrect results (data will not change whereas it should). But adding those annotations is a small cost versus the reward.
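
As an illustration of the annotation idea, here is a Python sketch using a class decorator as a stand-in for C# attributes. The `determinism` decorator, `determinism_of` helper and node classes are invented for the example. Unannotated nodes are treated as non deterministic, which is the safe side described above (slower, but never wrong).

```python
DETERMINISTIC = "deterministic"
NON_DETERMINISTIC = "non_deterministic"
USER_INPUT = "user_input"


def determinism(kind):
    """Class decorator attaching determinism metadata to a node type."""
    if kind not in (DETERMINISTIC, NON_DETERMINISTIC, USER_INPUT):
        raise ValueError(f"unknown determinism kind: {kind!r}")

    def annotate(cls):
        cls.determinism = kind
        return cls

    return annotate


def determinism_of(node_type):
    # Nodes nobody curated fall back to the safe side: always run, never cached.
    return getattr(node_type, "determinism", NON_DETERMINISTIC)


@determinism(DETERMINISTIC)
class AddNode:
    def update(self, a, b):
        return a + b


@determinism(NON_DETERMINISTIC)
class TimeNode:
    pass


@determinism(USER_INPUT)
class UserValueNode:
    pass


class UncuratedNode:
    pass
```

The evaluator would then read `determinism_of(type(node))` when it propagates flags through the graph.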


Compiling graphs as flattened structures can feel tempting at first sight, but the graph data structure can hold a large amount of "knowledge" which is then lost. So before deciding to do so, it can be interesting to take a step back and see if it's worthwhile at all (that, for sure, is another debate, and there will be pros and cons in each case).

Since this post probably feels long to digest, I'll keep the next section for later, which will include shaders in the mix, so stay tuned.
