Sunday 15 December 2013

SharpDX, refactoring and complexity

I was finally able to spend some proper time porting the SlimDX DX11 code over to SharpDX (I'm still on quite a few stressful projects, so I try to lock down the little spare time I have to test new goodies ;)

The main thing: I didn't redo everything from scratch. Here is my workflow in that case:

  • Port the core API: This is quite minimal, so I got rid of a lot of the boilerplate, reduced the number of types and split some parts. The API is now much smaller and easier to work with.
  • Port the core: At first I just replaced namespaces/names. Then I started to refactor parts.
  • Nodes: Since I don't want to break all the naming, I keep it all very simple. I copy/paste all the nodes and remove the code inside ;) That means I don't get random compile errors, but I have all the ins/outs up front (which is kinda similar to test driven development). When I think my core API is ready, I push the code back in.
So lately I was also able to add improvements. Dropping multi device helps a lot in cleaning the codebase: it would help one person in a lifetime but makes all the others suffer. Now DX nodes are simpler and easier to write.
There's still a bit of work left polishing the API, but the foundation seems much nicer for now.

The shader node finally got the improvement it deserved, and is much lighter API-wise (and also decently faster in many cases).

I also started porting a few of my high level nodes and gave the layer system some decent improvements, and all of this is promising too. The plan for the next release is to provide more low level access to advanced users, while having more high level nodes for general patchers. For me this is a good step forward, but balancing the two is hard ;)

Over the weekend I was able to test the API a bit, and the results are pretty promising:




Now, one major issue with the move to SharpDX is the following:

  • You want to support win7
  • You want access to latest DirectX (eg : 11.2)
Luckily, SharpDX makes it relatively easy; the only painful part is file load/save. (On win7 you have D3DX, which makes it easy; on 8/8.1 you have to do it yourself, or port DirectXTex from C++ to C#.)

But that already makes 3 builds to maintain (ok, I'd say 2, since 11.1 and 11.2 are not much different for now).
And since in 4v you need to differentiate x86/x64, that makes 6 builds. Uff, just a right pain.

So I looked a bit more into the 4v core again, and oh man, everything is so overcomplex.

SlimDX is really tightly tied to the core, and it's also the only assembly that forces this 32/64 bit build (90% of the rest could happily be AnyCPU).
But since it's tied to the core, every plugin must choose.

This is a right pain.

I did a few tests replacing SlimDX with SharpDX for DX9 as well, but it doesn't scale well, since you then tie yourself to the win7 assembly, which kinda sucks (assembly loading order can create some... interesting errors).

So the easiest way ended up being brute force, i.e. just get rid of SlimDX, and basically break any plugin with DX9 mesh/layer/texture outputs.

For me, right now it is the best and most sensible solution. I don't have access to the full 4v core, so I'm not able to split a few interfaces so that both of them can live together.

Since I'm really not into dx9/dx11 working in the same 4v instance, it's really not a biggie, but it can bring a few maintenance issues.

So what is the plan now?

Well, it would be great to work with the devvvvs in order to properly split standard pin logic from render pins, and properly isolate SlimDX (and any architecture dependent code) from the main core. I think it's essential, but if I have to ship a custom core to avoid this nonsense, then so be it, I will gladly do it.

All those matrices in 4v are also nonsense. I forgot some of the exact bits, but they add so much complexity to the system, and the thought that a simple array of 16 floats creates such a mess feels kinda bad.
So since I've covered mostly all of them with a much simpler custom type, I might introduce that in the next build.
That would be a big thing since it would break backwards compatibility, but in the end I tend to plan for the future, not the past.

The rest is to continue SIMPLIFYING the system. I think most people tend to think too much about a problem, and the more you think about it the more twisted your mind becomes, until you produce a system 10 times more complex than it should be. This is wrong. I'm not saying your system shouldn't be flexible, but it needs to be easily testable/debuggable. More complexity never brings any good; over design is bad.
For example, I looked a lot at code generation, and in the end I will only use it for fx->C# generation.
Doing it for all my types sounds good, but in the end I ported all my geometry nodes in 20 minutes, while it would have taken 2 days to build a generator. For effects with reflection it stays a pretty good idea, since any user can write their own fx, so in that case it makes some sense.

The funniest part: you complain about a system being overcomplex, decide to rewrite it all, and make something 5 times more complex up front (while repeating a lot of mistakes and adding new ones). Not worth it; refactor and improve.

Stop using crappy defaults, and inform the user when something is wrong. Crappy defaults sound like a good idea at first, but they also make your system more complex, and when your crappy defaults don't work anymore, you have to work out whether the problem comes from the user or from yourself, and your user didn't learn anything. A few well placed defaults (with proper information, eg: "Ok, you didn't provide this info, so I used this instead, but please be wary") are a step in the right direction. Choosing silent defaults is a NO GO.
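To make the "well placed default with proper information" idea concrete, here is a minimal C# sketch. Everything in it (the Defaults class, OrWarn, the warn callback) is hypothetical and only illustrates the principle; it is not part of vvvv, SlimDX or SharpDX.

```csharp
using System;

// Hypothetical helper, not part of any vvvv or SharpDX API.
public static class Defaults
{
    // Returns 'value' when it is present. Otherwise returns 'fallback' and
    // reports the substitution through 'warn' instead of staying silent.
    public static T OrWarn<T>(T value, T fallback, string name, Action<string> warn)
        where T : class
    {
        if (value != null)
        {
            return value;
        }
        warn(name + " was not provided, using '" + fallback + "' instead.");
        return fallback;
    }
}
```

A caller would pass its own logger as the warn callback, so the user always gets told which default kicked in and why.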

Take more painful decisions if I need to ;) I know users hate changes in some ways (myself included), but you have to decide and take the risk if you think a change is future proof, rather than staying where you were ten years ago. I know this is also a problem on the programmer side, but in the end, changing a library when you need to is also taking care of users. I could just take the example of AddFlow (the UI library used in 4v), which is now so crap that it's impossible to use vvvv in a live environment anymore (move a few nodes and all your rendering freezes). I wish they would take the painful path and just switch libraries; it doesn't take that much time to do (it's just a bit boring).

Ok stop mumbling, lot of new goodness on the way, be happy ;)







Monday 2 December 2013

Last DX11 Release...

Using SlimDX ;)

http://vvvv.org/blog/directx-11-beta-31.2-update

Beta31.2 marks a little bit of history, as the plan now is to make a full move to SharpDX.

Some people might ask what the benefits are, or if I'm some kind of masochist who likes to redo everything again :)

First, a move to SharpDX sounds like the smart option so far, and offers quite a few cool bits:

  • It's quite actively maintained
  • Support for DX11.2
  • Generally API calls are better for performance
  • No more dreaded 32/64 bits builds
Now, I'm not rewriting everything, but I feel a lot of things which are already nicer than in the old DirectX9 can be done in an even smarter way. Some years of programming in DX taught me a lot as well, and there are definitely some parts that I'm looking forward to improving or reworking.

First of all, writing an API is hard. It takes a lot of time, and trial and error, to find a decent compromise between performance/features/ease of programming. So far I consider the first round a success. There are still a few bugs and missing features, but hey, I'm more or less on my own writing the core, even tho I of course want to thank the people who contributed nodes/shaders ;)

So besides moving to SharpDX, there's of course many plans to improve what is there, in no particular order.

1/Wrapper

The new wrapper is almost as fully featured as the previous one, but has quite a decent amount of improvements:
  • Clear separation between Device/DeviceContext (with multi threaded rendering in mind).
  • Resources are much thinner, with far fewer abstract classes/overrides, which should help a fair bit in places where it was just a right pain to link 2 elements.
  • Most of the runtime (not finished yet) is now unit tested. So I can quickly see if a change breaks resource creation, pools... This is such a time saver.
  • Resources also have much easier to use creation methods, and copies are much more streamlined (only one method to copy to a dynamic texture now).
  • The wrapper will also manage most of the addons (vlc/kinect/geometries...), so most of this can be independently tested in a much cleaner way.
  • Input layout handling is still one of the open areas for me. It's a bit of a pain to find the right model (for a game engine you can safely build a small hash, but in the 4v case there are so many permutations that it's really not easy).

2/VVVV

Now on the vvvv side, there are also quite some (drastic) changes to the main core. Please don't be afraid as a patcher, on your side there shouldn't be any change.

Death of multi device. Having to handle a resource dictionary per device sucks (it's a pain to do anything thread safe), it never got used, and a single device works across multiple graphics cards anyway (tested on a decently fat project). Maybe a 0.00001% scenario will pop up where it has a use, but to be honest I prefer to ease the pain of the other 99.9999% of people ;)

That will mean some performance improvement (especially for large patches) and more streamlined coding (less messing about with interfaces).

The layer system is also getting a little improvement, with better stacks for cameras, reserved cbuffers and easier rebinding.

A scheduler improvement is also on the way, and small Task based rendering is an area I'm actively looking at and experimenting with.

3/Shader management

That's the main area I want to work on. I find shader management in 4v sucks at the moment: you have gazillions of little pieces of code lying around that you can way too easily modify.

Not that modifying a shader to fit your needs is bad, but it's just a pain for standard ones. That involves 2 big changes.

Shader package:

Basically you have a pack folder and you can compile a library with the following:
  • Precompiled shaders (namespaced as in folder structure)
  • Json content file
So instead of having 200 shaders lying around, you can pack all that lot and easily distribute.

FX Projects

I always found FX projects to be a bit of an issue (an fx project is just a single file, so technically it's not even a project ;)

Also, having an extension for each shader type brings some limitations (you don't want to provide 50 extensions, but you still want to be able to give a shader a context). Context is for me what is most missing in vvvv; most stuff is just... stuff.

Giving context to a shader, such as selecting the host via the GUI, would allow different ways of interpreting it (geometry generator/particle emitter...) in a much easier way, improving the quality of contributions.

Ah and on the loop, shader will compile in background, no more massive freeze when you press Ctrl+S on this big fatty compute shader blur :)

I'm not sure if this one will be ready for the first release, but it's definitely fairly high on the priority list.

4/Shader API footprint

When you batch like a nutter, shader API footprint doesn't matter much (you end up with <100 draw calls anyway).

But now with compute you can also easily build data structures and manage a decent amount of logic directly on your GPU (on our last project we ended up having most of the processing hosted there).

So API footprint starts to matter again, especially on the compute side.

There are already some prototypes/experiments done on that side. Stay tuned, it rocks, believe me ;)

5/Be high (No connotations with any type of substance... )

For now I consider 4v to be fairly low level. 

The first part of the plan for DirectX11 was to build a backbone and have people get used to it. The plan to have high level nodes was always there, but they're not that useful without a decent backbone. Now the time is getting there.

Many more high level nodes are needed, where people can really do things out of the box.
  • Deferred Renderers
  • Better Light equations
  • Easier to use materials
  • Pluggable particle systems
  • Geometry processors
  • All that wrapped in proper plugins (sandboxed) to also ensure quality API usage.
Some people will claim that it's less tweakable when you sandbox, but many users don't write shaders anyway, so some defaults that let you do quality rendering are definitely needed.

Resource management and smart ordering are where most users will normally fail, so giving them a decent start is not a bad thing (and of course you still have access to the low level API if you feel up for it ;)


And please note that if you want to contribute you are more than welcome, since doing all that lot takes time and I'm more or less on my own, and I also do projects ;)

That mostly includes:
  • One person to help manage GitHub/Builds (really, that would be a godsend)
  • People to make help patches/examples
  • People to write some nodes (even tho with the new system it will be a bit different).

    So that's it for this post, would say, one chapter closes, one new chapter opens, exciting times ;)






    Wednesday 20 November 2013

    Execution Path and IO

    I explained in the previous post how you can build simple functions and convert them into IL code.

    Now what is interesting is that you can easily optimize those using simple assumptions, the kind of things you would easily forget while writing code by hand.

    You can consider 2 things:
    • Immediate optimizations: for elements that you know at compile time
    • Code path: for elements you need to decide at runtime, depending on input sizes.
    Let's see a few immediate optimizations:

    Code Snippet
    [SlicewiseMethod(Name = "SimpleLerp", Category = "Value")]
    public static void SimpleLerp(
        [Input("Input 1")] double d1,
        [Input("Input 2")] double d2,
        [Input("Amount")] double amt,
        [Output("Output")] out double result)
    {
        result = VMath.Lerp(d1, d2, amt);
    }

    Here we just process a simple lerp.

    Since in slicewise operations our output size is always SpreadMax, the generator will automatically select a pointer output and skip the mod operator.

    This is a no brainer, easy as that, outputs never use mod and do direct storage.

    Second test:

    Code Snippet
    [SlicewiseMethod(Name = "Sin", Category = "Value")]
    public static void Sin(
        [Input("Input")] double d,
        [Output("Output")] out double dout)
    {
        dout = Math.Sin(d);
    }

    Here we have one input and one output. The compiler can detect that, and automatically set the input read to safely ignore the mod operator too.

    Now let's look at our lerp above, and let's say that we want only one lerp value for all elements:

    Code Snippet
    [SlicewiseMethod(Name = "SimpleLerp2", Category = "Value")]
    public static void SimpleLerp2(
        [Input("Input 1")] double d1,
        [Input("Input 2")] double d2,
        [Input("Amount", IsSingle=true)] double amt,
        [Output("Output")] out double result)
    {
        result = VMath.Lerp(d1, d2, amt);
    }

    Here we set the IsSingle property on our input attribute. The compiler will detect that and read the variable into a local once, before the for loop.

    Now let's say we want to multiply a spread by a constant value:

    Code Snippet
    [SlicewiseMethod(Name = "MultiplyFixed", Category = "Value")]
    public static void MultiplyFixed(
        [Input("Input")] double d,
        [Input("Amount", IsSingle = true)] double amt,
        [Output("Output")] out double result)
    {
        result = d * amt;
    }

    Here, as before, our Amount variable will be stored in a local.
    But now we also know that our Input variable is the only spread input (same as the Sin case).
    So the compiler will remove the mod operator on the input value as well!
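To make these immediate optimizations concrete, here is a hand-written sketch of the two loops for MultiplyFixed: the worst case that mods every read, and the specialized form described above. The real generator emits IL, which this post doesn't show; the names LoopSketch, MultiplyWorstCase and MultiplySpecialized are hypothetical, and plain arrays stand in for the pin buffers.

```csharp
using System;

// Hypothetical illustration only: plain arrays stand in for vvvv pin data.
public static class LoopSketch
{
    // Worst case: every input read goes through the mod operator.
    public static void MultiplyWorstCase(double[] input, double[] amount, double[] output)
    {
        for (int i = 0; i < output.Length; i++)
        {
            output[i] = input[i % input.Length] * amount[i % amount.Length];
        }
    }

    // Specialized: "Amount" is IsSingle, so it is read once into a local.
    // "Input" is the only spread input, so its length equals the output
    // length and it can be read without mod.
    public static void MultiplySpecialized(double[] input, double[] amount, double[] output)
    {
        double amt = amount[0];
        for (int i = 0; i < output.Length; i++)
        {
            output[i] = input[i] * amt;
        }
    }
}
```

Both loops produce the same values when the assumptions hold; the second one simply drops two mod operations and one array read per slice.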

    Now let's see a slightly more complex case:

    Code Snippet
    [Selector(Name = "Function", Category = "Value Inverse")]
    public static Dictionary<string, Func<double, double>> TrigonometryInv
    {
        get
        {
            var result = new Dictionary<string, Func<double, double>>();
            result.Add("Asin", (x) => Math.Asin(x));
            result.Add("Acos", (x) => Math.Acos(x));
            result.Add("Atan", (x) => Math.Atan(x));
            return result;
        }
    }

    Here we basically build a node that has one input, one output, and an enum to select the function.

    Equivalent code:

    Code Snippet
    [PluginInfo(Name = "Function", Category = "Value", Version = "Simple")]
    public unsafe class DummyFunc : IPluginEvaluate
    {
        public enum eFunc { Sin, Cos, Tan }
        public delegate double FuncDelegate(double input);

        [Input("Input")]
        private ValueInput input;
        [Input("Function", IsSingle = true)]
        private ISpread<eFunc> function;
        [Output("Output")]
        private ValueOutput output;
        private Dictionary<eFunc, FuncDelegate> functable = new Dictionary<eFunc, FuncDelegate>();

        public DummyFunc()
        {
            functable.Add(eFunc.Sin, (x) => Math.Sin(x));
            functable.Add(eFunc.Cos, (x) => Math.Cos(x));
            functable.Add(eFunc.Tan, (x) => Math.Tan(x));
        }

        public void Evaluate(int SpreadMax)
        {
            // The selector is single, so look the delegate up once per frame.
            FuncDelegate f = functable[function[0]];
            output.Length = SpreadMax;
            double* iptr = input.Data;
            double* optr = output.Data;
            for (int i = 0; i < SpreadMax; i++)
            {
                optr[i] = f(iptr[i]);
            }
        }
    }

    Please note that we could create a loop delegate, like this:

    Code Snippet
    [PluginInfo(Name = "Function", Category = "Value", Version = "Loopable")]
    public unsafe class DummyFunc2 : IPluginEvaluate
    {
        public enum eFunc { Sin, Cos, Tan }
        public delegate void FuncLoopDelegate(int cnt, double* input, double* output);
        [Input("Input")]
        private ValueInput input;

        [Input("Function", IsSingle = true)]
        private ISpread<eFunc> function;

        [Output("Output")]
        private ValueOutput output;

        private Dictionary<eFunc, FuncLoopDelegate> functable = new Dictionary<eFunc, FuncLoopDelegate>();
        public DummyFunc2()
        {
            functable.Add(eFunc.Sin, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Sin(input[i]); } });
            functable.Add(eFunc.Cos, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Cos(input[i]); } });
            functable.Add(eFunc.Tan, (cnt, input, output) => { for (int i = 0; i < cnt; i++) { output[i] = Math.Tan(input[i]); } });
        }

        public void Evaluate(int SpreadMax)
        {
            // One delegate call per frame: the whole loop lives in the delegate.
            FuncLoopDelegate f = functable[function[0]];
            output.Length = SpreadMax;
            double* iptr = input.Data;
            double* optr = output.Data;
            f(SpreadMax, iptr, optr);
        }
    }

    This doesn't give much of a gain tho, and the first option is easier to build.


    Now let's look at code paths.

    We can also decide on our loop at runtime. Let's take our Vector2 join again:

    We have 5 cases we want to consider:
    • Length (x) == Length(y) : We can ignore mod on each vector.
    • Length (x) == 1 : We store x before the loop and sample y at full speed
    • Length (y) == 1 : Same as above, just the other way round.
    • Length (x) > Length(y) : Sample x at full speed, mod on y
    • Length (x) < Length (y) : The opposite
    Expressed as code:

    Code Snippet
    public delegate void VectorLoop(int cnt, double* d1, int i1, double* d2, int i2, Vector2D* output);

    private Dictionary<eFuncSelector, VectorLoop> functable = new Dictionary<eFuncSelector, VectorLoop>();

    public DummyVectorDynamic()
    {
        // X is the longest input: read x directly, mod on y.
        functable.Add(eFuncSelector.XMax, (cnt, x, i1, y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = y[i % i2]; } });
        // Y is the longest input: mod on x, read y directly.
        functable.Add(eFuncSelector.YMax, (cnt, x, i1, y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i % i1]; o[i].y = y[i]; } });
        // Same length: no mod at all.
        functable.Add(eFuncSelector.FullSpeed, (cnt, x, i1, y, i2, o) => { for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = y[i]; } });
        // Single x or single y: store it in a local before the loop.
        functable.Add(eFuncSelector.SingleX, (cnt, x, i1, y, i2, o) => { double xval = x[0]; for (int i = 0; i < cnt; i++) { o[i].x = xval; o[i].y = y[i]; } });
        functable.Add(eFuncSelector.SingleY, (cnt, x, i1, y, i2, o) => { double yval = y[0]; for (int i = 0; i < cnt; i++) { o[i].x = x[i]; o[i].y = yval; } });
    }

    Now in our evaluate method:

    Code Snippet
    public void Evaluate(int SpreadMax)
    {
        dout.Length = SpreadMax * 2;
        Vector2D* d = (Vector2D*)dout.Data;

        double* pt1 = d1.Data;
        double* pt2 = d2.Data;
        int l1 = d1.Length;
        int l2 = d2.Length;

        // Pick the loop that matches this frame's input lengths.
        if (l1 == l2) {
            functable[eFuncSelector.FullSpeed](SpreadMax, pt1, l1, pt2, l2, d);
        } else if (l1 == 1) {
            functable[eFuncSelector.SingleX](SpreadMax, pt1, l1, pt2, l2, d);
        } else if (l2 == 1) {
            functable[eFuncSelector.SingleY](SpreadMax, pt1, l1, pt2, l2, d);
        } else if (l1 > l2) {
            functable[eFuncSelector.XMax](SpreadMax, pt1, l1, pt2, l2, d);
        } else {
            functable[eFuncSelector.YMax](SpreadMax, pt1, l1, pt2, l2, d);
        }
    }

    Now here are a few benchmarks (considering different cases):

    X = 50000, Y = 20
    Native: 0.35 ms
    Zip : 0.28ms
    Mutable : 0.2 ms

    This is one of the 2 worst cases (when X != Y), but we save a mod on the biggest spread, and that shows in the timings.

    X = 25000, Y = 25000
    In that case the node will use : FullSpeed
    Native: 0.17ms
    Zip : 0.12ms
    Mutable : 0.06 ms

    In that case we simply ignore all the mod operators, and Mutable clearly wins.

    X = 50000, Y = 1
    Native : 0.36 ms
    Zip : 0.25ms
    Mutable : 0.16ms

    We can also see that storing one input in a local and reading the other at full speed gives a decent gain. So selecting the path at runtime makes a significant difference.

    So technically, that gives us this information:

    let function (p1,p2,p3,p4) = dosomething

    Now, to generalize our vector case, what we have is the following (for each parameter):
    if length == 1 -> set local before loop and use that local
    if length == SpreadMax -> read without mod
    if length < SpreadMax -> read modded
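The three rules above can be sketched as a small classification function. The ReadMode enum and the ReadModes.Classify helper are hypothetical names used for illustration; the post only describes the rule, not the types behind it.

```csharp
using System;

// Hypothetical names: the post describes the three cases but not the types.
public enum ReadMode
{
    Single,  // length == 1: copy into a local before the loop
    Direct,  // length == SpreadMax: read element i without mod
    Modded   // length < SpreadMax: read element i % length
}

public static class ReadModes
{
    // Applies the rules in the order they are listed, so a single-slice
    // input is always treated as Single, even when SpreadMax is 1.
    public static ReadMode Classify(int length, int spreadMax)
    {
        if (length == 1) return ReadMode.Single;
        if (length == spreadMax) return ReadMode.Direct;
        return ReadMode.Modded;
    }
}
```

For the first benchmark above (X = 50000, Y = 20), X classifies as Direct and Y as Modded.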

    Of course we can clearly see that we have a combinatorial explosion: a function with 8 parameters has a LOT of different cases.

    A few options here:
    generate the worst case scenario (eg: mod everything).

    When you enter Evaluate:
    build a mini hash (length == 1) -> 0, (length == SpreadMax) -> 1, (length < SpreadMax) -> 2

    Ask the runtime if we already have the function, if not compile it and give it back.

    Run the function.
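The steps above can be sketched as a small cache keyed by that mini hash. This is only a sketch under assumptions: PathCache, BuildKey and GetOrCompile are hypothetical names, the key is a simple base-3 number with one digit per input, and the compile callback stands in for the actual IL generator, which is not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: PathCache and its members are not from the post.
public sealed class PathCache<TLoop> where TLoop : class
{
    private readonly Dictionary<int, TLoop> compiled = new Dictionary<int, TLoop>();
    private readonly Func<int, TLoop> compile;

    // 'compile' stands in for the IL generator; it runs once per new key.
    public PathCache(Func<int, TLoop> compile)
    {
        this.compile = compile;
    }

    // Number of times the compile callback has been invoked.
    public int CompileCount { get; private set; }

    // Base-3 key, one digit per input:
    // 0 = single, 1 = length equals SpreadMax, 2 = needs mod.
    public static int BuildKey(int[] lengths, int spreadMax)
    {
        int key = 0;
        foreach (int length in lengths)
        {
            int digit = length == 1 ? 0 : (length == spreadMax ? 1 : 2);
            key = key * 3 + digit;
        }
        return key;
    }

    // Returns the cached loop for this combination, compiling it on first use.
    public TLoop GetOrCompile(int[] lengths, int spreadMax)
    {
        int key = BuildKey(lengths, spreadMax);
        TLoop loop;
        if (!compiled.TryGetValue(key, out loop))
        {
            loop = compile(key);
            compiled.Add(key, loop);
            CompileCount++;
        }
        return loop;
    }
}
```

With this keying, a function with 8 parameters has at most 3^8 = 6561 distinct keys, and only the combinations that actually occur at runtime get compiled.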

    Yes, as you've clearly seen, we have nodes that self compile ;)

    So this is of course to be used with caution: if your node's input lengths vary a lot, you might get a lot of recompiles.
    If you have reasonably steady spread counts, you'll just have a little overhead on startup. Please note that something like:

    let function (a,b) => a+b
    a = 2000, b = 1
    next frame , a = 1000, b= 1

    will not cause a recompile, since a is still at SpreadMax.

    On a specific note, a node compile takes about 0.5 milliseconds, which is pretty fast, but if you have 1000 nodes rebuilding on the fly, that's not too good ;)

    What is good is that our node is now a basic container, so we save some IL code. Second very cool thing: our logic becomes a delegate, which is much easier to wrap into a Task, but more on that later.

    Monday 18 November 2013

    Into Code to IL to Node

    I changed the title a bit, but yes, I'll kinda speak about code generation again.

    I remember some time ago the devvvvs showed a prototype for a simpler way to make nodes. Basically you create a function and it generates the boilerplate code around it.
    Sadly it never really made it through (maybe they're keeping it for their new version).

    Basically you build a function and you generate the node to fit it.

    I started to work on this, I called the project MicroNodes ;)

    Here is a simple example

    Code Snippet
    [SlicewiseMethod(Name = "+", Category = "Value")]
    public static void Add([Input("Input 1")] double d1, [Input("Input 2")] double d2, [Output("Output")] out double dout)
    {
        dout = d1 + d2;
    }

    OK, that's the infamous Add (without group), but you get the idea.

    You basically have your operator, and then I build a plugin in intermediate language that prepares data for each slice, then calls the function.

    So a dummy implementation looks like this:

    Code Snippet
    [PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Simple")]
    public class DummyAdd3 : IPluginEvaluate
    {
        [Input("Input 1")]
        private ISpread<double> d1;

        [Input("Input 2")]
        private ISpread<double> d2;

        [Output("Output")]
        private ISpread<double> dout;

        public void Evaluate(int SpreadMax)
        {
            dout.SliceCount = SpreadMax;
            double dou;
            for (int i = 0; i < SpreadMax; i++)
            {
                SliceWise.Add(d1[i], d2[i], out dou);
                dout[i] = dou;
            }
        }
    }

    You'll notice I use an out parameter in that case; that will fit some other parts later. Here is the same implementation using the operator directly:

    Code Snippet
    [PluginInfo(Name = "Add", Category="Value",Version="Operator Simple")]
    public class DummyAdd : IPluginEvaluate
    {
        [Input("Input 1")]
        private ISpread<double> d1;

        [Input("Input 2")]
        private ISpread<double> d2;

        [Output("Output")]
        private ISpread<double> dout;

        public void Evaluate(int SpreadMax)
        {
            dout.SliceCount = SpreadMax;
            for (int i = 0; i < SpreadMax; i++)
            {
                dout[i] = d1[i] + d2[i];
            }
        }
    }

    You'll notice that it's very easy to generate intermediate language for slicewise methods.

    The first interesting bit is that both of those run at the same speed. The reason is simple: when the JIT converts the intermediate language into machine code it inlines the function call, so assembly-wise the two are pretty similar.

    That's good news, since it means I can easily use functions and the call has no overhead.

    Now let's see how we can optimize this.
    First, for any slicewise operation, the output count equals SpreadMax.
    Writing dout[i] does the following in Intermediate Language:
    callvirt ISpread<T>._setcyclic()

    Setting a slice like this does the following (for every single write):

    • Mark the Spread as dirty (eg: it needs to flush data back into vvvv) 
    • Go access the internal DataStore (an array basically)
    • Do a Mod to store, output[i%icount] = data
    That's actually a lot.
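As a rough picture of that per-write cost, here is a minimal stand-in for a cyclic setter. CyclicStore is a hypothetical class written for this illustration; it is not the actual vvvv spread implementation, it only mirrors the three steps listed above.

```csharp
using System;

// Hypothetical stand-in for a spread; not the actual vvvv implementation.
public sealed class CyclicStore
{
    private readonly double[] data;

    public CyclicStore(int count)
    {
        data = new double[count];
    }

    // Set by every indexed write; a host would use it to know it must flush.
    public bool IsDirty { get; private set; }

    public double this[int index]
    {
        get { return data[index % data.Length]; }
        set
        {
            IsDirty = true;                       // 1. mark the spread as dirty
            double[] store = data;                // 2. reach the internal store
            store[index % store.Length] = value;  // 3. mod, then write
        }
    }
}
```

None of these steps is expensive on its own, but in a slicewise loop they are paid again for every single slice.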

    So now we can access the internal store ourselves; this is exposed on the Spread type.

    Here is a better version:

    Code Snippet
    [PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Buffer")]
    public class DummyAdd4 : IPluginEvaluate
    {
        [Input("Input 1")]
        private ISpread<double> d1;

        [Input("Input 2")]
        private ISpread<double> d2;

        [Output("Output")]
        private ISpread<double> dout;

        public void Evaluate(int SpreadMax)
        {
            dout.SliceCount = SpreadMax;
            // Write straight into the internal array, bypassing the indexer.
            double[] b = dout.Stream.Buffer;
            for (int i = 0; i < SpreadMax; i++)
            {
                SliceWise.Add(d1[i], d2[i], out b[i]);
            }
            // The dirty flag was never set, so force the flush.
            dout.Flush(true);
        }
    }

    Now, since we access the buffer (and we know we never have to use mod for the output), we get access to the array and can pass the indexed element as the output directly. This is what it does in IL:

      IL_0039:  ldloc.0 (This is our array local variable)
      IL_003a:  ldloc.1  (This is index)
      IL_003b:  ldelema    [mscorlib]System.Double

    So instead it loads the address of the array element and passes it to the function.

    Please note that I call Flush(true): since I now write directly into the internal store, the dirty flag is never set.

    So it's easy to add this to the IL generator, and it gives you a pretty substantial boost.

    For 5000 elements:
    ISpread virtual : 0.1 ms
    ISpread buffer : 0.06 ms

    For 50000 elements
    ISpread virtual : 1 ms
    ISpread buffer : 0.6 ms

    These are pretty low values, but they can easily add up.

    So now we can use a pointer output, since we can write directly back into memory.
    This doesn't give much of a speed improvement, since a memcpy on 5000 elements is blazing fast, but at least we save some memory (40 kilobytes for a spread of 5000).

    It seems kinda minimal, but in a reasonably large patch it also adds up quickly.

    So now the obvious next part is to modify our inputs. We also do a virtual call for each access there, which also does a mod.

    Operator version:

    Code Snippet
    [PluginInfo(Name = "Add", Category = "Value", Version = "Operator Ptr Full")]
    public unsafe class DummyAdd8 : IPluginEvaluate
    {
        [Input("Input 1")]
        private ValueInput d1;

        [Input("Input 2")]
        private ValueInput d2;

        [Output("Output")]
        private ValueOutput dout;

        public void Evaluate(int SpreadMax)
        {
            dout.Length = SpreadMax;
            double* d = dout.Data;

            double* pt1 = d1.Data;
            double* pt2 = d2.Data;

            int l1 = d1.Length;
            int l2 = d2.Length;

            for (int i = 0; i < SpreadMax; i++)
            {
                d[i] = pt1[i%l1] + pt2[i%l2];
            }
        }
    }

    And the static function version:

    Code Snippet
    [PluginInfo(Name = "Add", Category = "Value", Version = "Slicewise Ptr Full")]
    public unsafe class DummyAdd10 : IPluginEvaluate
    {
        [Input("Input 1")]
        private ValueInput d1;

        [Input("Input 2")]
        private ValueInput d2;

        [Output("Output")]
        private ValueOutput dout;

        public void Evaluate(int SpreadMax)
        {
            dout.Length = SpreadMax;
            double* d = dout.Data;
            double* pt1 = d1.Data;
            double* pt2 = d2.Data;
            int l1 = d1.Length;
            int l2 = d2.Length;
            for (int i = 0; i < SpreadMax; i++)
            {
                SliceWise.Add(pt1[i % l1], pt2[i % l2], out d[i]);
            }
        }
    }

    As before, operator vs static function shows pretty much 0 difference, which is still good news, since it means that we can still easily generate that IL code.

    It still yields a difference compared to ISpread. Here is the table:

    For 5000 elements:
    ISpread virtual : 0.1 ms
    ISpread buffer : 0.06 ms
    Full Pointer : 0.03 ms

    For 50000 elements:
    ISpread virtual : 1 ms
    ISpread buffer : 0.6 ms
    Full Pointer : 0.3 ms

    Please note that we also save memory on each pin. Say we have 5000 elements in the first spread and 500 in the second: we gain (5000*8*2) + (500*8) = 84000 bytes, so about 84 kilobytes.

    That's pretty decent, especially considering that the node is also much faster.
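    The memory figure can be checked with a few lines of code. This is only a sketch of the arithmetic quoted above: the helper `PinMemorySketch.SavedBytes` is hypothetical, and the factor of 2 on the first spread is taken as-is from the formula (one possible reading is that both an input and the output have that length, but the post does not spell this out).

    ```csharp
    public static class PinMemorySketch
    {
        // A double takes 8 bytes.
        public const int BytesPerDouble = sizeof(double);

        // Mirrors the formula from the text: (first * 8 * 2) + (second * 8).
        public static long SavedBytes(int firstCount, int secondCount)
        {
            long first = (long)firstCount * BytesPerDouble * 2;
            long second = (long)secondCount * BytesPerDouble;
            return first + second;
        }
    }
    ```

    With 5000 and 500 elements this gives 80000 + 4000 = 84000 bytes.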

    So technically speaking, one very fun thing is that by allowing the user to build this first function as above:

    Code Snippet
    public static void Add(double d1, double d2, out double dout)
    {
        dout = d1 + d2;
    }

    By choosing a route in our IL generator, we can make a node 3 times faster than the dummy implementation that comes from the template.

    Also please note that it's much easier to reuse that function or to optimize it afterwards.
    If you have 50 hand-written nodes like this, rolling out a new speedup is a decent amount of refactoring; here, none is needed.
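    As an illustration of what such a generator route could look like, here is a minimal sketch using `System.Reflection.Emit.DynamicMethod`. It is not the actual generator from this project: `SlicewiseEmitter` and `Build` are hypothetical names, `SliceWise` is a stand-in class, and the emitted loop works on managed arrays rather than raw pin pointers to keep the example safe and self-contained.

    ```csharp
    using System;
    using System.Reflection;
    using System.Reflection.Emit;

    // Stand-in for the user-written slicewise function.
    public static class SliceWise
    {
        public static void Add(double d1, double d2, out double dout)
        {
            dout = d1 + d2;
        }
    }

    public static class SlicewiseEmitter
    {
        // Emits: for (i = 0; i < count; i++) slice(a[i % a.Length], b[i % b.Length], out result[i]);
        // Assumes a and b are non-empty and result holds at least count elements.
        public static Action<double[], double[], double[], int> Build(MethodInfo slice)
        {
            var method = new DynamicMethod(
                "SlicewiseLoop",
                typeof(void),
                new[] { typeof(double[]), typeof(double[]), typeof(double[]), typeof(int) },
                typeof(SlicewiseEmitter).Module,
                true);

            ILGenerator il = method.GetILGenerator();
            LocalBuilder i = il.DeclareLocal(typeof(int));
            Label body = il.DefineLabel();
            Label check = il.DefineLabel();

            // i = 0; jump to the loop condition
            il.Emit(OpCodes.Ldc_I4_0);
            il.Emit(OpCodes.Stloc, i);
            il.Emit(OpCodes.Br, check);

            // loop body: push both wrapped inputs, then the address of result[i]
            il.MarkLabel(body);
            EmitWrappedLoad(il, OpCodes.Ldarg_0, i);
            EmitWrappedLoad(il, OpCodes.Ldarg_1, i);
            il.Emit(OpCodes.Ldarg_2);
            il.Emit(OpCodes.Ldloc, i);
            il.Emit(OpCodes.Ldelema, typeof(double));
            il.Emit(OpCodes.Call, slice);

            // i++
            il.Emit(OpCodes.Ldloc, i);
            il.Emit(OpCodes.Ldc_I4_1);
            il.Emit(OpCodes.Add);
            il.Emit(OpCodes.Stloc, i);

            // if (i < count) goto body
            il.MarkLabel(check);
            il.Emit(OpCodes.Ldloc, i);
            il.Emit(OpCodes.Ldarg_3);
            il.Emit(OpCodes.Blt, body);
            il.Emit(OpCodes.Ret);

            return (Action<double[], double[], double[], int>)method.CreateDelegate(
                typeof(Action<double[], double[], double[], int>));
        }

        // Pushes array[i % array.Length] as a double.
        private static void EmitWrappedLoad(ILGenerator il, OpCode loadArray, LocalBuilder i)
        {
            il.Emit(loadArray);
            il.Emit(OpCodes.Ldloc, i);
            il.Emit(loadArray);
            il.Emit(OpCodes.Ldlen);
            il.Emit(OpCodes.Conv_I4);
            il.Emit(OpCodes.Rem);
            il.Emit(OpCodes.Ldelem_R8);
        }
    }
    ```

    Usage would be along the lines of `var loop = SlicewiseEmitter.Build(typeof(SliceWise).GetMethod("Add"));` followed by `loop(a, b, result, result.Length);`. The user function stays untouched, and a faster loop only requires changing the emitter.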

    Now I'll give a second example: since the devvvvs have presented their new 5x to 6x faster Vector Join and Split, I might as well put them to the test :)

    First thing: compared to the last version they are indeed much faster, no question, which is great!

    Please note that there's also Zip (Value), which is normally pretty optimized.

    So let's see, Vector (2d Join):

    5000 Elements
    Native : 0.035 ms
    Zip : 0.035 ms

    50000 Elements
    Native : 0.37 ms
    Zip : 0.32 ms

    So now I decided to test 3 techniques; some are more code generation friendly, some are more raw plugin. Here we go.

    First, a Vector Join function simply looks like this:

    Code Snippet
    [SlicewiseMethod(Name = "Vector", Category = "2d Join")]
    public static void V2dJoin(
        [Input("X")] double x,
        [Input("Y")] double y,
        [Output("Output")] out Vector2D result)
    {
        result = new Vector2D(x, y);
    }

    Really simple, no?
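    For a generator to turn such a function into a node, it first has to find the pins. Here is a rough sketch of how that discovery could be done with reflection. The attribute classes and the `Vector2D` struct below are simplified stand-ins declared only so the example compiles on its own, and `PinScanner` is a hypothetical helper, not the real implementation.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Reflection;

    // Stand-in attributes, declared here only for the sketch.
    [AttributeUsage(AttributeTargets.Method)]
    public sealed class SlicewiseMethodAttribute : Attribute
    {
        public string Name { get; set; }
        public string Category { get; set; }
    }

    [AttributeUsage(AttributeTargets.Parameter)]
    public sealed class InputAttribute : Attribute
    {
        public string Name { get; private set; }
        public InputAttribute(string name) { Name = name; }
    }

    [AttributeUsage(AttributeTargets.Parameter)]
    public sealed class OutputAttribute : Attribute
    {
        public string Name { get; private set; }
        public OutputAttribute(string name) { Name = name; }
    }

    // Stand-in vector type.
    public struct Vector2D
    {
        public double x;
        public double y;
        public Vector2D(double x, double y) { this.x = x; this.y = y; }
    }

    public static class Joiner
    {
        [SlicewiseMethod(Name = "Vector", Category = "2d Join")]
        public static void V2dJoin(
            [Input("X")] double x,
            [Input("Y")] double y,
            [Output("Output")] out Vector2D result)
        {
            result = new Vector2D(x, y);
        }
    }

    public static class PinScanner
    {
        // Lists the pins a generator would have to create for the method.
        public static List<string> Describe(MethodInfo method)
        {
            var pins = new List<string>();
            foreach (ParameterInfo p in method.GetParameters())
            {
                // An out parameter has a by-ref type; unwrap it to get the slice type.
                Type slice = p.ParameterType.IsByRef
                    ? p.ParameterType.GetElementType()
                    : p.ParameterType;

                var input = p.GetCustomAttribute<InputAttribute>();
                var output = p.GetCustomAttribute<OutputAttribute>();

                if (input != null)
                    pins.Add("in " + input.Name + " : " + slice.Name);
                else if (output != null && p.IsOut)
                    pins.Add("out " + output.Name + " : " + slice.Name);
            }
            return pins;
        }
    }
    ```

    Calling `PinScanner.Describe(typeof(Joiner).GetMethod("V2dJoin"))` lists two double inputs named X and Y and one Vector2D output named Output, which is all the information needed to declare the pins and choose a loop template.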

    Now here are the testers.

    Method 1:
    Code Snippet
    [PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr")]
    public unsafe class DummyVector : IPluginEvaluate
    {
        [Input("Input 1")]
        private ValueInput d1;

        [Input("Input 2")]
        private ValueInput d2;

        [Output("Output")]
        private ValueOutput dout;

        public void Evaluate(int SpreadMax)
        {
            dout.Length = SpreadMax * 2;
            double* d = dout.Data;

            double* pt1 = d1.Data;
            double* pt2 = d2.Data;

            int l1 = d1.Length;
            int l2 = d2.Length;

            for (int i = 0; i < SpreadMax; i++)
            {
                *d = pt1[i % l1];
                d++;
                *d = pt2[i % l2];
                d++;
            }
        }
    }

    Pretty simple, we push per component.

    Method 2:

    Code Snippet
    [PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr Struct")]
    public unsafe class DummyVector2 : IPluginEvaluate
    {
        [Input("Input 1")]
        private ValueInput d1;

        [Input("Input 2")]
        private ValueInput d2;

        [Output("Output")]
        private ValueOutput dout;

        public void Evaluate(int SpreadMax)
        {
            dout.Length = SpreadMax * 2;
            Vector2D* d = (Vector2D*)dout.Data;

            double* pt1 = d1.Data;
            double* pt2 = d2.Data;

            int l1 = d1.Length;
            int l2 = d2.Length;

            for (int i = 0; i < SpreadMax; i++)
            {
                d->x = pt1[i % l1];
                d->y = pt2[i % l2];
                d++;
            }
        }
    }

    This is almost the same, but we write each element as a struct.

    Method 3: we can also simply use our slicewise function:
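    The struct trick works because a vector of two doubles has the same memory layout as two consecutive doubles in the output buffer. As a sketch of the same idea without `unsafe`, the snippet below swaps the pointer cast for `MemoryMarshal.Cast` over a `Span<double>` (this needs a runtime with `System.Memory`, which is newer than the code in this post). `StructViewSketch` and the local `Vector2D` declaration are stand-ins for illustration only.

    ```csharp
    using System;
    using System.Runtime.InteropServices;

    // Stand-in for the vector type: two doubles laid out one after the other.
    [StructLayout(LayoutKind.Sequential)]
    public struct Vector2D
    {
        public double x;
        public double y;
    }

    public static class StructViewSketch
    {
        // Fills 'output' (2 doubles per vector) through a Vector2D view of the same memory.
        public static void Join(double[] xs, double[] ys, double[] output)
        {
            // Guard against empty inputs, since i % 0 would throw.
            if (xs.Length == 0 || ys.Length == 0) return;

            // Reinterpret the double buffer as vectors; no copy is made.
            Span<Vector2D> vectors = MemoryMarshal.Cast<double, Vector2D>(output.AsSpan());

            for (int i = 0; i < vectors.Length; i++)
            {
                vectors[i].x = xs[i % xs.Length];
                vectors[i].y = ys[i % ys.Length];
            }
        }
    }
    ```

    After the call, `output` holds the interleaved x and y values, the same layout the per-component version writes one double at a time.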

    Code Snippet
    [PluginInfo(Name = "Vector", Category = "2d Join", Version = "Ptr Out")]
    public unsafe class DummyVector3 : IPluginEvaluate
    {
        [Input("Input 1")]
        private ValueInput d1;

        [Input("Input 2")]
        private ValueInput d2;

        [Output("Output")]
        private ValueOutput dout;

        public void Evaluate(int SpreadMax)
        {
            dout.Length = SpreadMax * 2;
            Vector2D* d = (Vector2D*)dout.Data;
            double* pt1 = d1.Data;
            double* pt2 = d2.Data;
            int l1 = d1.Length;
            int l2 = d2.Length;
            for (int i = 0; i < SpreadMax; i++)
            {
                Joiner.Join(pt1[i % l1], pt2[i % l2], out d[i]);
            }
        }
    }

    So now the results:

    5000 Elements
    Native : 0.035 ms
    Zip : 0.035 ms
    Per Component: 0.035 ms
    As Struct: 0.031 ms
    Function Output:  0.038 ms

    50000 Elements
    Native : 0.37 ms
    Zip : 0.32 ms
    Per Component: 0.35 ms
    As Struct: 0.33 ms
    Function Output:  0.34 ms

    Please note that values fluctuate a little bit.

    But basically that means a few things:

    • Devvvvs did a good job at optimizing vectors
    • On very low spread counts (50), Zip seems slower and native is of course fastest (since you have a COM object call). The difference is not that obvious though.
    • A version generated from the function is as fast as the optimized native one ;)

    So to sum up, IL code generation is pretty cool.

    Now there are a lot of other use cases besides SliceWise.
    What I decided was to build different templates, for example:

    Code Snippet
    [ReductorMethod(Name = "+", Category = "Value Spectral")]
    public static void SpectralAdd(
        [Input("Input")] IEnumerable<double> data,
        [Output("Result")] out double result)
    {
        result = 0.0;
        foreach (double d in data) { result += d; }
    }

    You can also do Struct Join/Split like this:

    Code Snippet
    [Compactor(Name = "Struct", Category = "Test")]
    public struct MyStruct2
    {
        [CompactorProperty(Name = "Hello")]
        public double Hello;

        [CompactorProperty(Name = "Element2")]
        public double Element2;

        [CompactorProperty(Name = "Vector", Flatten = true)]
        public Vector3 Vector;
    }

    Or you can have a visitor class in case you want to use an existing struct.

    There are some other types/parts, but that will be for another post.