Catflier: November 2013

Wednesday 20 November 2013

Execution Path and IO

In the previous post I explained how you can build simple functions and convert them into IL code.

What is interesting is that you can easily optimize those functions using simple assumptions, the kind of things you would easily forget while writing code by hand.

You can consider two things:

Immediate optimizations: for elements that you know at compile time.
Code path: chosen at runtime, depending on the size of the inputs.

Let's see a few immediate optimizations:

Code Snippet

[SlicewiseMethod(Name = "SimpleLerp", Category = "Value")]
public static void SimpleLerp(
    [Input("Input 1")] double d1,
    [Input("Input 2")] double d2,
    [Input("Amount")] double amt,
    [Output("Output")] out double result)
{
    result = VMath.Lerp(d1, d2, amt);
}

Here we just process a simple lerp.

Since in slicewise operations our output size is always SpreadMax, the generator automatically selects a pointer output and skips the mod operator.

This is a no-brainer: outputs never use mod and write directly to storage.

Second test:

Code Snippet

[SlicewiseMethod(Name = "Sin", Category = "Value")]
public static void Sin(
    [Input("Input")] double d, 
    [Output("Output")] out double dout)
{
    dout = Math.Sin(d);
}

Here we have one input and one output. The compiler can detect that, and automatically sets the input read to safely skip the mod operator too, since with a single input SpreadMax is the input length.
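As a rough illustration of what this means for the loop, here is a hand-written sketch. It uses arrays instead of the pointers the generator works with, and the names SinLoops, SinGeneric and SinSingleInput are made up for this example; it is not the generated IL.

```csharp
using System;

public static class SinLoops
{
    // General form: the input index wraps around with a mod on every read.
    public static void SinGeneric(double[] input, double[] output, int spreadMax)
    {
        int len = input.Length;
        for (int i = 0; i < spreadMax; i++)
        {
            output[i] = Math.Sin(input[i % len]);
        }
    }

    // Single-input form: the loop count equals the input length,
    // so both the read and the write can index directly.
    public static void SinSingleInput(double[] input, double[] output)
    {
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = Math.Sin(input[i]);
        }
    }
}

public static class SinLoopsDemo
{
    public static void Main()
    {
        double[] input = { 0.0, 0.5, 1.0 };
        double[] a = new double[input.Length];
        double[] b = new double[input.Length];

        SinLoops.SinGeneric(input, a, input.Length);
        SinLoops.SinSingleInput(input, b);

        // Both loops compute the same values; only the indexing differs.
        for (int i = 0; i < input.Length; i++)
        {
            Console.WriteLine(a[i] == b[i]);
        }
    }
}
```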

Now let's look at our lerp above, and let's say that we want only one lerp value for all elements:

Code Snippet

[SlicewiseMethod(Name = "SimpleLerp2", Category = "Value")]
public static void SimpleLerp2(
    [Input("Input 1")] double d1,
    [Input("Input 2")] double d2,
    [Input("Amount", IsSingle=true)] double amt,
    [Output("Output")] out double result)
{
    result = VMath.Lerp(d1, d2, amt);
}

Here we set IsSingle on the Amount input attribute. The compiler detects that and reads the variable into a local once, before the for loop.
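A minimal sketch of the loop shape this leads to, written by hand with arrays. The Lerp helper is a stand-in for VMath.Lerp so the example compiles on its own, and LerpLoops and LerpSingleAmount are hypothetical names.

```csharp
using System;

public static class LerpLoops
{
    // Stand-in for VMath.Lerp (assumed to be a plain linear interpolation).
    private static double Lerp(double a, double b, double t)
    {
        return a + (b - a) * t;
    }

    public static void LerpSingleAmount(double[] d1, double[] d2, double[] amt, double[] output)
    {
        // IsSingle: the amount is read once, before the loop.
        double amount = amt[0];

        int l1 = d1.Length;
        int l2 = d2.Length;
        for (int i = 0; i < output.Length; i++)
        {
            // The two spread inputs still wrap around, the output never does.
            output[i] = Lerp(d1[i % l1], d2[i % l2], amount);
        }
    }
}

public static class LerpLoopsDemo
{
    public static void Main()
    {
        double[] output = new double[2];
        LerpLoops.LerpSingleAmount(
            new double[] { 0.0, 10.0 },
            new double[] { 10.0, 20.0 },
            new double[] { 0.5 },
            output);
        Console.WriteLine(output[0]); // 5
        Console.WriteLine(output[1]); // 15
    }
}
```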

Now let's say we want to multiply a spread by a constant value:

Code Snippet

[SlicewiseMethod(Name = "MultiplyFixed", Category = "Value")]
public static void MultiplyFixed(
    [Input("Input")] double d,
    [Input("Amount", IsSingle = true)] double amt,
    [Output("Output")] out double result)
{
    result = d * amt;
}

Here, as before, our Amount variable is stored into a local.
But now we also know that our Input variable is the only spread input (same as the Sin case).
So the compiler removes the mod operator on the input value as well!

Now let's see a slightly more complex case:

Code Snippet

[Selector(Name = "Function", Category = "Value Inverse")]
public static Dictionary<string, Func<double, double>> TrigonometryInv
{
    get
    {
        var result = new Dictionary<string, Func<double, double>>();
        result.Add("Asin", (x) => Math.Asin(x));
        result.Add("Acos", (x) => Math.Acos(x));
        result.Add("Atan", (x) => Math.Atan(x));
        return result;
    }
}

Here we basically build a node that has one input, one output, and an enum to select the function.

Equivalent hand-written code (shown here with Sin, Cos and Tan rather than the inverse functions):

Code Snippet

[PluginInfo(Name = "Function", Category = "Value", Version = "Simple")]
public unsafe class DummyFunc : IPluginEvaluate
{
    public enum eFunc { Sin, Cos, Tan }
    public delegate double FuncDelegate (double input);
 
    [Input("Input")]
    private ValueInput input;
    [Input("Function", IsSingle = true)]
    private ISpread<eFunc> function;
    [Output("Output")]
    private ValueOutput output;
    private Dictionary<eFunc, FuncDelegate> functable = new Dictionary<eFunc, FuncDelegate>();
 
    public DummyFunc()
    {
        functable.Add(eFunc.Sin, (x) => Math.Sin(x));
        functable.Add(eFunc.Cos, (x) => Math.Cos(x));
        functable.Add(eFunc.Tan, (x) => Math.Tan(x));
    }
 
    public void Evaluate(int SpreadMax)
    {
        FuncDelegate f = functable[function[0]];
        output.Length = SpreadMax;
        double* iptr = input.Data;
        double* optr = output.Data;
        for (int i = 0; i < SpreadMax; i++)
        {
            optr[i] = f(iptr[i]);
        }
    }
}

Please note that we could create a loop delegate, like this:

Code Snippet

[PluginInfo(Name = "Function", Category = "Value", Version = "Loopable")]
public unsafe class DummyFunc2 : IPluginEvaluate
{
    public enum eFunc { Sin, Cos, Tan }
    public delegate void FuncLoopDelegate(int cnt, double* input, double* output);
    [Input("Input")]
    private ValueInput input;
 
    [Input("Function", IsSingle = true)]
    private ISpread<eFunc> function;
 
    [Output("Output")]
    private ValueOutput output;
 
    private Dictionary<eFunc, FuncLoopDelegate> functable = new Dictionary<eFunc, FuncLoopDelegate>();
    public DummyFunc2()
    {
        functable.Add(eFunc.Sin, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Sin(input[i]); } });
        functable.Add(eFunc.Cos, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Cos(input[i]); } });
        functable.Add(eFunc.Tan, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Tan(input[i]); } });
    }
 
    public void Evaluate(int SpreadMax)
    {
        FuncLoopDelegate f = functable[function[0]];
        output.Length = SpreadMax;
        double* iptr = input.Data;
        double* optr = output.Data;
        f(SpreadMax,iptr,optr);
    }
}

This doesn't give much of a gain though, and the first option is easier to build.

Now let's look at the code path.

We can also choose our loop at runtime. Let's take our Vector2 join again.

We have 5 cases we want to consider:

Length(x) == Length(y) : We can ignore mod on each vector.
Length(x) == 1 : We store x before the loop and sample y at full speed.
Length(y) == 1 : Same as above, just the other way round.
Length(x) > Length(y) : Sample x at full speed, mod on y.
Length(y) > Length(x) : The opposite, sample y at full speed, mod on x.

Expressed as code:

Code Snippet

public delegate void VectorLoop(int cnt, double* d1, int i1, double* d2,int i2, Vector2D* output);
 
private Dictionary<eFuncSelector, VectorLoop> functable = new Dictionary<eFuncSelector, VectorLoop>();
 
public DummyVectorDynamic()
{
    functable.Add(eFuncSelector.XMax, (cnt, x, i1,y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = y[i % i2]; } });
    functable.Add(eFuncSelector.YMax, (cnt, x, i1, y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i % i1]; o[i].y = y[i]; } });
    functable.Add(eFuncSelector.FullSpeed, (cnt, x, i1, y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = y[i]; } });
    functable.Add(eFuncSelector.SingleX, (cnt, x, i1, y, i2, o) => { double xval = x[0]; for (int i = 0; i < cnt; i++) { o[i].x = xval; o[i].y = y[i]; } });
    functable.Add(eFuncSelector.SingleY, (cnt, x, i1, y, i2, o) => { double yval = y[0]; for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = yval; } });
}

Now in our evaluate method:

Code Snippet

public void Evaluate(int SpreadMax)
{
    dout.Length = SpreadMax * 2;
    Vector2D* d = (Vector2D*)dout.Data;
 
    double* pt1 = d1.Data;
    double* pt2 = d2.Data;
    int l1 = d1.Length;
    int l2 = d2.Length;
 
    if (l1 == l2) {
        functable[eFuncSelector.FullSpeed](SpreadMax, pt1, l1, pt2, l2,d);
    } else if (l1 == 1) {
        functable[eFuncSelector.SingleX](SpreadMax, pt1, l1, pt2, l2, d);
    } else if (l2 == 1) {
        functable[eFuncSelector.SingleY](SpreadMax, pt1, l1, pt2, l2, d);
    } else if (l1 > l2) {
        functable[eFuncSelector.XMax](SpreadMax, pt1, l1, pt2, l2, d);
    } else {
        functable[eFuncSelector.YMax](SpreadMax, pt1, l1, pt2, l2, d);
    }
}

Now here are a few benchmarks (considering different cases):

X = 50000, Y = 20
Native : 0.35 ms
Zip : 0.28 ms
Mutable : 0.2 ms

This is one of the 2 worst cases (X != Y), but we saved a mod on the biggest spread, which shows in the timing.

X = 25000, Y = 25000
In that case the node will use FullSpeed.
Native : 0.17 ms
Zip : 0.12 ms
Mutable : 0.06 ms

In that case we simply ignore all the mod operators, and Mutable clearly wins.

X = 50000, Y = 1
Native : 0.36 ms
Zip : 0.25 ms
Mutable : 0.16 ms

We can also see that storing one value in a local and reading the other at full speed gives a decent gain. So choosing the path at runtime makes a significant difference.

So technically, that gives us this information:

let function (p1,p2,p3,p4) = dosomething

Now, to generalize what we did for our vector, we have the following rule (for each parameter):
if length == 1 -> set a local before the loop and use that local
if length == SpreadMax -> read without mod
if length < SpreadMax -> read modded
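The per-parameter rule above can be sketched as a small classification function. ReadMode and ReadModes.Classify are names invented for this example, not part of the generator.

```csharp
using System;

// The three ways a parameter can be read inside the loop.
public enum ReadMode
{
    Local = 0,   // length == 1: read once before the loop
    Direct = 1,  // length == SpreadMax: index without mod
    Modded = 2   // length < SpreadMax: index with mod
}

public static class ReadModes
{
    public static ReadMode Classify(int length, int spreadMax)
    {
        // The single-value check comes first, so a spread of one slice
        // is always hoisted, even when SpreadMax is also 1.
        if (length == 1) return ReadMode.Local;
        if (length == spreadMax) return ReadMode.Direct;
        return ReadMode.Modded;
    }
}

public static class ReadModesDemo
{
    public static void Main()
    {
        int spreadMax = 50000;
        Console.WriteLine(ReadModes.Classify(1, spreadMax));     // Local
        Console.WriteLine(ReadModes.Classify(50000, spreadMax)); // Direct
        Console.WriteLine(ReadModes.Classify(20, spreadMax));    // Modded
    }
}
```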

Of course we can clearly see that we have a combinatorial explosion: with 3 cases per parameter, a function with 8 parameters has 3^8 = 6561 different cases.

A few options here:
Generate only the worst case scenario (eg: mod everything).
Or build the paths on demand. When you enter Evaluate:

Build a mini hash from the parameters: (length == 1) -> 0, (length == SpreadMax) -> 1, (length < SpreadMax) -> 2.

Ask the runtime if we already have the function; if not, compile it and give it back.

Run the function.
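Here is a minimal sketch of that lookup, under a few assumptions: the key is built as a base-3 number from the per-parameter codes, the cache is a plain Dictionary, and Compile is only a stand-in that returns a worst-case loop instead of emitting IL. LoopCache, AddLoop and CompileCount are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public delegate void AddLoop(double[] a, double[] b, double[] output);

public class LoopCache
{
    private readonly Dictionary<int, AddLoop> cache = new Dictionary<int, AddLoop>();

    // Counts how often we had to "compile", to make cache hits visible.
    public int CompileCount { get; private set; }

    private static int Mode(int length, int spreadMax)
    {
        if (length == 1) return 0;
        if (length == spreadMax) return 1;
        return 2;
    }

    // One base-3 digit per parameter.
    public static int Key(int spreadMax, params int[] lengths)
    {
        int key = 0;
        foreach (int length in lengths)
        {
            key = key * 3 + Mode(length, spreadMax);
        }
        return key;
    }

    public AddLoop Get(int key)
    {
        AddLoop loop;
        if (!cache.TryGetValue(key, out loop))
        {
            loop = Compile(key);
            cache[key] = loop;
        }
        return loop;
    }

    // Stand-in for the IL generator: ignores the key and mods everything.
    private AddLoop Compile(int key)
    {
        CompileCount++;
        return (a, b, o) =>
        {
            for (int i = 0; i < o.Length; i++) { o[i] = a[i % a.Length] + b[i % b.Length]; }
        };
    }
}

public static class LoopCacheDemo
{
    public static void Main()
    {
        var cache = new LoopCache();
        cache.Get(LoopCache.Key(2000, 2000, 1));   // first call compiles
        cache.Get(LoopCache.Key(1000, 1000, 1));   // same key, cache hit
        cache.Get(LoopCache.Key(2000, 2000, 500)); // new key, compiles again
        Console.WriteLine(cache.CompileCount);     // 2
    }
}
```

In this sketch a spread that shrinks from 2000 to 1000 slices while still defining SpreadMax maps to the same key, so only a change of category (single, full, modded) triggers a new compile.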

Yes, clearly as you've seen, we have nodes that self compile ;)

So this is of course to be used with caution: if your node's input lengths vary a lot, you might get a lot of recompiles.
If you have reasonably steady spread counts, you'll only have a little overhead on startup. Please note that something like:

let function (a,b) => a+b
a = 2000, b = 1
next frame , a = 1000, b= 1

will not cause a recompile, since a is still at SpreadMax and b is still a single value.

On a more specific note, compiling a node takes about 0.5 milliseconds, which is pretty fast, but if you have 1000 nodes rebuilding on the fly, that's not too good ;)

What is good is that our node is now a basic container, so we save some IL code. Second very cool thing: our logic becomes a delegate, which is much easier to wrap into a Task, but more on that later.

Monday 18 November 2013

Into Code to IL to Node

I changed the title a bit, but yes, I'll be speaking about code generation again.

I remember that some time ago devvvvs showed a prototype of a simpler way to make nodes. Basically you create a function and it generates the boilerplate code around it.
Sadly it never really made it through (maybe they keep it for their new version).

Basically you build a function and you generate the node to fit it.

I started to work on this, I called the project MicroNodes ;)

Here is a simple example

Code Snippet

[SlicewiseMethod(Name = "+", Category = "Value")]
public static void Add([Input("Input 1")] double d1, [Input("Input 2")] double d2, [Output("Output")] out double dout)
{
    dout = d1 + d2;
}

OK, that's the infamous Add (without group), but you get the idea.

You basically have your operator, and then I build a plugin in intermediate language that prepares data for each slice, then calls the function.
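To give an idea of what "building the loop in intermediate language" can look like, here is a small sketch using System.Reflection.Emit.DynamicMethod. It is not the MicroNodes generator: it works on plain arrays instead of pins, always mods both inputs, and LoopEmitter, SliceLoop and EmitModdedRead are names made up for this example.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class SliceWise
{
    // The user-written slicewise function.
    public static void Add(double d1, double d2, out double dout)
    {
        dout = d1 + d2;
    }
}

public delegate void SliceLoop(double[] a, double[] b, double[] output, int count);

public static class LoopEmitter
{
    // Emits: for (i = 0; i < count; i++) op(a[i % a.Length], b[i % b.Length], out output[i]);
    public static SliceLoop Build(MethodInfo op)
    {
        var method = new DynamicMethod("SliceLoop", typeof(void),
            new[] { typeof(double[]), typeof(double[]), typeof(double[]), typeof(int) },
            typeof(LoopEmitter).Module);
        ILGenerator il = method.GetILGenerator();
        LocalBuilder i = il.DeclareLocal(typeof(int));
        Label check = il.DefineLabel();
        Label body = il.DefineLabel();

        il.Emit(OpCodes.Ldc_I4_0);                 // i = 0
        il.Emit(OpCodes.Stloc, i);
        il.Emit(OpCodes.Br, check);

        il.MarkLabel(body);
        EmitModdedRead(il, OpCodes.Ldarg_0, i);    // a[i % a.Length]
        EmitModdedRead(il, OpCodes.Ldarg_1, i);    // b[i % b.Length]
        il.Emit(OpCodes.Ldarg_2);                  // address of output[i], no mod
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldelema, typeof(double));
        il.Emit(OpCodes.Call, op);                 // call the static function

        il.Emit(OpCodes.Ldloc, i);                 // i++
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, i);

        il.MarkLabel(check);                       // loop while i < count
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldarg_3);
        il.Emit(OpCodes.Blt, body);
        il.Emit(OpCodes.Ret);

        return (SliceLoop)method.CreateDelegate(typeof(SliceLoop));
    }

    private static void EmitModdedRead(ILGenerator il, OpCode loadArray, LocalBuilder i)
    {
        il.Emit(loadArray);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(loadArray);
        il.Emit(OpCodes.Ldlen);
        il.Emit(OpCodes.Conv_I4);
        il.Emit(OpCodes.Rem);
        il.Emit(OpCodes.Ldelem_R8);
    }
}

public static class LoopEmitterDemo
{
    public static void Main()
    {
        SliceLoop loop = LoopEmitter.Build(typeof(SliceWise).GetMethod("Add"));
        double[] result = new double[4];
        loop(new double[] { 1, 2, 3, 4 }, new double[] { 10 }, result, result.Length);
        Console.WriteLine(string.Join(", ", result));
    }
}
```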

So a dummy implementation looks like this:

Code Snippet

[PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Simple")]
public class DummyAdd3 : IPluginEvaluate
{
    [Input("Input 1")]
    private ISpread<double> d1;
 
    [Input("Input 2")]
    private ISpread<double> d2;
 
    [Output("Output")]
    private ISpread<double> dout;
 
    public void Evaluate(int SpreadMax)
    {
        dout.SliceCount = SpreadMax;
        double dou;
        for (int i = 0; i < SpreadMax; i++)
        {
            SliceWise.Add(d1[i], d2[i],out dou);
            dout[i] = dou; 
        }
    }
}

You can notice I use an output parameter in that case; that will fit some other parts. Here is the same implementation using the operator:

Code Snippet

[PluginInfo(Name = "Add", Category="Value",Version="Operator Simple")]
public class DummyAdd : IPluginEvaluate
{
    [Input("Input 1")]
    private ISpread<double> d1;
 
    [Input("Input 2")]
    private ISpread<double> d2;
 
    [Output("Output")]
    private ISpread<double> dout;
 
    public void Evaluate(int SpreadMax)
    {
        dout.SliceCount = SpreadMax;
        for (int i = 0; i < SpreadMax; i++)
        {
            dout[i] = d1[i] + d2[i];
        }
    }
}

You can notice that it's very easy to generate intermediate language for slicewise methods.

The first interesting bit is that both of those run at the same speed. The reason is simple: when the JIT converts the intermediate language into machine code, it inlines the function call, so assembly-wise the two are pretty similar.

That's good news, since it means I can freely use functions and the call has no overhead.

Now let's see how we can optimize this.
First, for any slicewise operation, the output slice count equals SpreadMax.
Writing dout[i] does the following in intermediate language:
callvirt ISpread<T>._setcyclic()

Setting a slice like this does the following (for every single write):

Mark the spread as dirty (eg: it needs to flush data back into vvvv)
Access the internal data store (an array, basically)
Do a mod to store: output[i % icount] = data

That's actually a lot.
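To make those three steps concrete, here is a toy model of such a write. It is only an illustration of the steps listed above, not the actual vvvv spread implementation; ToySpread and its members are invented for this sketch.

```csharp
using System;

// Toy model of a spread, NOT the vvvv implementation.
public class ToySpread
{
    private double[] store = new double[1];

    public bool Dirty { get; private set; }

    public int SliceCount
    {
        get { return store.Length; }
        set { if (value != store.Length) Array.Resize(ref store, value); }
    }

    // A virtual indexer: every write pays for the call plus the three steps.
    public virtual double this[int i]
    {
        get { return store[i % store.Length]; }
        set
        {
            Dirty = true;              // 1. mark the spread as dirty
            double[] s = store;        // 2. reach the internal store
            s[i % s.Length] = value;   // 3. mod, then store
        }
    }

    // Direct access to the array skips all of the above,
    // so the caller has to signal the change itself.
    public double[] Buffer { get { return store; } }
}

public static class ToySpreadDemo
{
    public static void Main()
    {
        var spread = new ToySpread();
        spread.SliceCount = 3;

        spread[4] = 7.0;                      // wraps around to index 1
        Console.WriteLine(spread.Buffer[1]);  // 7
        Console.WriteLine(spread.Dirty);      // True

        spread.Buffer[2] = 9.0;               // direct write, no mod, no flag
        Console.WriteLine(spread[2]);         // 9
    }
}
```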

So instead we can access the internal store ourselves; this is exposed on the Spread type.

Here is a better version:

Code Snippet

[PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Buffer")]
public class DummyAdd4 : IPluginEvaluate
{
    [Input("Input 1")]
    private ISpread<double> d1;
 
    [Input("Input 2")]
    private ISpread<double> d2;
 
    [Output("Output")]
    private ISpread<double> dout;
 
    public void Evaluate(int SpreadMax)
    {
        dout.SliceCount = SpreadMax;
        double[] b = dout.Stream.Buffer;
        for (int i = 0; i < SpreadMax; i++)
        {
            SliceWise.Add(d1[i], d2[i], out b[i]);
        }
        dout.Flush(true);
    }
}

Now, since we access the buffer (and we know we never need mod for the output), we have access to the array and can pass the indexed element directly as the out argument. This produces the following IL:

IL_0039: ldloc.0 (This is our array local variable)
IL_003a: ldloc.1 (This is index)
IL_003b: ldelema [mscorlib]System.Double

So instead of storing through the spread, it loads the address of the array element and passes it to the function.

Please note that I call Flush(true): since I now write directly into the internal store, the dirty flag is never set.

So it's easy to add this to the IL generator, and it gives you a decent boost.

For 5000 elements:
ISpread virtual : 0.1 ms
ISpread buffer : 0.06 ms

For 50000 elements
ISpread virtual : 1 ms
ISpread buffer : 0.6 ms

These are pretty low values, but they can easily add up.

So now we can use a pointer output, since we can write directly back into memory.
This doesn't give a big speed improvement, since a memcpy on 5000 elements is blazing fast, but at least we save some memory (40 kilobytes for a spread of 5000 doubles).

It seems kinda minimal, but in a reasonably large patch, it also adds up quickly.

So now the obvious next part is to modify our input. We also do a virtual call for each read, which also does a mod.

Operator version:

Code Snippet

[PluginInfo(Name = "Add", Category = "Value", Version = "Operator Ptr Full")]
public unsafe class DummyAdd8 : IPluginEvaluate
{
    [Input("Input 1")]
    private ValueInput d1;
 
    [Input("Input 2")]
    private ValueInput d2;
 
    [Output("Output")]
    private ValueOutput dout;
 
    public void Evaluate(int SpreadMax)
    {
        dout.Length = SpreadMax;
        double* d = dout.Data;
 
        double* pt1 = d1.Data;
        double* pt2 = d2.Data;
 
        int l1 = d1.Length;
        int l2 = d2.Length;
 
        for (int i = 0; i < SpreadMax; i++)
        {
            d[i] = pt1[i%l1] + pt2[i%l2];
        }
    }
}

And Static function version:

Code Snippet

[PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Ptr Full")]
public unsafe class DummyAdd10 : IPluginEvaluate
{
    [Input("Input 1")]
    private ValueInput d1;
 
    [Input("Input 2")]
    private ValueInput d2;
 
    [Output("Output")]
    private ValueOutput dout;
 
    public void Evaluate(int SpreadMax)
    {
        dout.Length = SpreadMax;
        double* d = dout.Data;
        double* pt1 = d1.Data;
        double* pt2 = d2.Data;
        int l1 = d1.Length;
        int l2 = d2.Length;
        for (int i = 0; i < SpreadMax; i++)
        {
            SliceWise.Add(pt1[i % l1], pt2[i % l2], out d[i]);
        }
    }
}

As before, operator vs static function shows pretty much zero difference, which is still good news, since it means that we can still easily generate that IL code.

It still yields a difference compared to ISpread. Here is the table:

For 5000 elements:
ISpread virtual : 0.1 ms
ISpread buffer : 0.06 ms
Full Pointer : 0.03 ms

For 50000 elements
ISpread virtual : 1 ms
ISpread buffer : 0.6 ms
Full Pointer : 0.3 ms

Please note that we also save memory for each pin. Let's say we have 5000 elements in the first spread and 500 in the second: we gain (5000*8*2) + (500*8) = 84000 bytes, so 84 kilobytes.

That's pretty decent, especially considering that we also have a node that is much faster.

So technically speaking, one very fun thing is that by allowing the user to build this first function as above:

Code Snippet

public static void Add(double d1, double d2, out double dout)
{
    dout = d1 + d2;
}

Then, by choosing a route in our IL generator, we can make the node 3 times faster than the dummy implementation that comes from the template.

Also please note that it's much easier to reuse that function or to optimize it afterwards.
If you have 50 hand-written nodes like this, applying a new speedup is a decent amount of refactoring; here, none is needed.

Now I'll give a second example. Since devvvvs have presented their new 5x to 6x faster Vector Join and Split, I might as well put them to the test :)

First thing: compared to the last version they are indeed much faster, no-brainer, which is great!

Please note that there's also Zip (Value), which is normally pretty optimized.

So let's see, Vector (2d Join):

5000 Elements
Native : 0.035 ms
Zip : 0.035 ms

50000 Elements
Native : 0.37 ms
Zip : 0.32 ms

So now I decided to test 3 techniques; some are more code generation friendly, some are more raw plugin. Here we go.

First, a Vector Join function simply looks like this:

Code Snippet

[SlicewiseMethod(Name = "Vector", Category = "2d Join")]
public static void V2dJoin(
    [Input("X")] double x,
    [Input("Y")] double y,
    [Output("Output")] out Vector2D result)
{
    result = new Vector2D(x, y);
}

Really simple no?

Now here are the testers.

Method one:

Code Snippet

  [PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr")]
  public unsafe class DummyVector: IPluginEvaluate
  {
      [Input("Input 1")]
      private ValueInput d1;
 
      [Input("Input 2")]
      private ValueInput d2;
 
      [Output("Output")]
      private ValueOutput dout;
 
      public void Evaluate(int SpreadMax)
      {
          dout.Length = SpreadMax*2;
          double* d = dout.Data;
 
          double* pt1 = d1.Data;
          double* pt2 = d2.Data;
 
          int l1 = d1.Length;
          int l2 = d2.Length;
 
          for (int i = 0; i < SpreadMax; i++)
          {
              *d = pt1[i % l1];
              d++;
              *d = pt2[i % l2];
              d++;
          }
      }
  }

Pretty simple, we push per component.

Method 2:

Code Snippet

[PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr Struct")]
public unsafe class DummyVector2 : IPluginEvaluate
{
    [Input("Input 1")]
    private ValueInput d1;
 
    [Input("Input 2")]
    private ValueInput d2;
 
    [Output("Output")]
    private ValueOutput dout;
 
    public void Evaluate(int SpreadMax)
    {
        dout.Length = SpreadMax * 2;
        Vector2D* d = (Vector2D*)dout.Data;
 
        double* pt1 = d1.Data;
        double* pt2 = d2.Data;
 
        int l1 = d1.Length;
        int l2 = d2.Length;
 
        for (int i = 0; i < SpreadMax; i++)
        {
            d->x = pt1[i % l1];
            d->y = pt2[i % l2];
            d++;
        }
    }
}

This is almost the same, but we build as struct.

Now actually we can also simply use our slicewise:

Code Snippet

[PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr Out")]
public unsafe class DummyVector3 : IPluginEvaluate
{
    [Input("Input 1")]
    private ValueInput d1;
 
    [Input("Input 2")]
    private ValueInput d2;
 
    [Output("Output")]
    private ValueOutput dout;
 
    public void Evaluate(int SpreadMax)
    {
        dout.Length = SpreadMax * 2;
        Vector2D* d = (Vector2D*)dout.Data;
        double* pt1 = d1.Data;
        double* pt2 = d2.Data;
        int l1 = d1.Length;
        int l2 = d2.Length;
        for (int i = 0; i < SpreadMax; i++)
        {
            Joiner.Join(pt1[i % l1], pt2[i % l2], out d[i]);
        }
    }
}

So now the results:

5000 Elements
Native : 0.035 ms
Zip : 0.035 ms
Per Component: 0.035 ms
As Struct: 0.031 ms
Function Output: 0.038 ms

50000 Elements
Native : 0.37 ms
Zip : 0.32 ms
Per Component: 0.35 ms
As Struct: 0.33 ms
Function Output: 0.34 ms

Please note that values fluctuate a little bit.

But basically that means a few things:

Devvvvs did a good job at optimizing vectors.
On a very low spread count (50), Zip seems slower and native is of course fastest (since you have a COM object call). It is not as obvious though.
A generated version from a function is as fast as the optimized native ;)

So to sum up, IL code generation is pretty cool.

Now there are a lot of other use cases besides slicewise.

What I decided was to build different templates, so for example:

Code Snippet

[ReductorMethod(Name = "+", Category = "Value Spectral")]
public static void SpectralAdd([Input("Input")] IEnumerable<double> data,[Output("Result")] out double result)
{
    result = 0.0;
    foreach (double d in data) { result += d; }
}

You can also do Struct Join/Split like this:

Code Snippet

[Compactor(Name="Struct",Category="Test")]
public struct MyStruct2
{
    [CompactorProperty(Name = "Hello")]
    public double Hello;
 
    [CompactorProperty(Name = "Element2")]
    public double Element2;
 
    [CompactorProperty(Name="Vector",Flatten=true)]
    public Vector3 Vector;
}

Or have a visitor class in case you want to use an existing struct.

There are some other types/parts, but that will be for another post.