About fast merge algorithms

There's surprisingly little information on the internet about how to merge two sorted arrays. Yes, there are probably thousands of research papers about sorting, including merge sort, but what if you just want to merge two arrays? Sorting is often done in place and assumes that the arrays are similar in size, which wasn't suitable for my specific use case. I did, however, find this C++ approach. It's an impressive solution that uses AVX2 and lookup tables to reduce execution time to about 40%. Let's start at the same baseline and see if we can get lower than that.

The following samples are in C#, not C++, and therefore not comparable. A direct comparison would be interesting but this blog post is quite long already.

A Baseline

Let's take a look at the trivial algorithm first. We start at zero for all indices, compare two elements each iteration and write the smaller one to the result array. Increment either idxA or idxB, and always increment the result index. Repeat until one input has no elements left, then copy the remainder of the other array.

static void MergeScalar<T, TComparer>(
    Span<T> a,
    Span<T> b,
    Span<T> result,
    TComparer comparer)
    where TComparer : IComparer<T>
{
    if (a.Length + b.Length > result.Length)
        throw new ArgumentException("...");

    var idxA = 0;
    var idxB = 0;
    var idxR = 0;

    while (idxA < a.Length && idxB < b.Length)
    {
        var c = comparer.Compare(a[idxA], b[idxB]);

        if (c <= 0)
        {
            result[idxR] = a[idxA];
            idxA++;
        }
        else
        {
            result[idxR] = b[idxB];
            idxB++;
        }

        idxR++;
    }

    // One of the inputs is exhausted; copy the remainder of the other one.
    if (idxA < a.Length)
    {
        a.Slice(idxA).CopyTo(result.Slice(idxR));
    }
    else
    {
        b.Slice(idxB).CopyTo(result.Slice(idxR));
    }
}

Not too difficult. But is it even worth writing a merge algorithm in 2024 when you could just concatenate two arrays and sort them? Quicksort and all its variations are probably the most researched and optimized topic in computer science.
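For reference, the lazy "Sorted" baseline measured below could look roughly like this (a sketch; the exact benchmark code isn't shown here):

static void MergeBySorting(Span<int> a, Span<int> b, Span<int> result)
{
    // Concatenate both inputs and let the runtime's sort do the work.
    a.CopyTo(result);
    b.CopyTo(result.Slice(a.Length));
    result.Slice(0, a.Length + b.Length).Sort();
}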

Random integers

Method | n      | Mean      | Error    | StdDev   | Ratio
-------|--------|-----------|----------|----------|------
Sorted | 100000 | 6938.9 us | 76.12 us | 71.21 us | 1.00
Scalar | 100000 |  992.1 us | 12.22 us | 11.43 us | 0.14

Neat! While the results aren't too surprising, it's nice to see a simple but specialized algorithm beat the optimized but lazy method. Can we do better?

Branchless SIMD Merge

I'm quite proud of this one. I don't claim to be the first to find it, but I came up with it completely on my own. It's an extension of the scalar version that exploits the fact that the larger array tends to contribute several consecutive elements at a time. The code from this point on is specific to integers but should be simple to extend to any data type that supports vectorization.

The base structure is similar to the scalar one, except that we operate on raw pointers, which is required by Vector256.Load and should also eliminate a couple of bounds checks.

static void MergeVectorized(Span<int> a, Span<int> b, Span<int> result)
{
    if (a.Length + b.Length > result.Length)
        throw new ArgumentException("...");

    if (a.Length < b.Length)
    {
        MergeVectorized(b, a, result);
        return;
    }

    var idxA = 0;
    var idxB = 0;
    var idxR = 0;
    var countVec = Vector256<int>.Count;

    unsafe
    {
        fixed (int* ptrA = a)
        fixed (int* ptrB = b)
        fixed (int* ptrR = result)
        {
            while ((idxA + countVec) < a.Length && idxB < b.Length)
            {
                // ... see next section ...
            }
        }
    }

    MergeScalar(
        a.Slice(idxA),
        b.Slice(idxB),
        result.Slice(idxR),
        ...);
}

Now to the interesting part. We load eight elements from the first argument into a SIMD register and broadcast a single element from the second argument into a second register. As an example, vecA might contain [1, 2, 3, 5, 6, 7, 8, 9] and vecB [4, 4, 4, 4, 4, 4, 4, 4] (4 is missing in vecA).

Compare both lanes, [-1, -1, -1, 0, 0, 0, 0, 0], and select the elements based on that comparison: [1, 2, 3, 4, 4, 4, 4, 4]. This part is unconditionally written to result.

The comparison result is then used again to figure out how much work we actually did. vecCmp.ExtractMostSignificantBits() packs one bit per lane and yields 0b00000111 here (lanes 0 to 2 are set), so PopCount gives 3. incrementB is always one, except when all elements in vecA were less than vecB.

var vecA = Vector256.Load(ptrA + idxA);
var vecB = Vector256.Create(*(ptrB + idxB));

var vecCmp = Vector256.LessThan(vecA, vecB);
vecA = Vector256.ConditionalSelect(vecCmp, vecA, vecB);
vecA.Store(ptrR + idxR);

// storeB: how many elements of A were smaller and therefore actually stored.
var cmp = vecCmp.ExtractMostSignificantBits();
var storeB = BitOperations.PopCount(cmp);
// incrementB is 1 unless every element of A was smaller, i.e. the broadcast
// element of B was not placed yet (the highest bit of the mask is set).
var incrementB = 1 - (int)(cmp >> 7);

idxA += storeB;
idxB += incrementB;
// Only the valid prefix counts; the duplicated B values are overwritten next iteration.
idxR += storeB + incrementB;

Random integers, .NET 7

Method | n      | Mean      | Error   | StdDev  | Ratio
-------|--------|-----------|---------|---------|------
Scalar | 262144 | 2588.8 us | 4.40 us | 3.90 us | 1.00
Vector | 262144 | 1050.5 us | 4.40 us | 3.90 us | 0.41

Same ratio as the other blog post, with a lot less code. They're different compilers, even different programming languages, but surely the ratio between naive and complex stays similar... and surely it doesn't change when we do something small, like an upgrade from .NET 7 to .NET 8. Right?

Random integers, .NET 8

Method | n      | Mean      | Error   | StdDev  | Ratio
-------|--------|-----------|---------|---------|------
Scalar | 262144 | 2133.8 us | 1.43 us | 1.27 us | 1.00
Vector | 262144 | 1056.0 us | 0.21 us | 0.18 us | 0.49

The .NET developers did their homework. These are the measurements from my desktop machine; my laptop's results are 15% worse because the scalar code allows it to boost higher. Sigh. And the constructed worst case went from tolerable to poor.

Last element of A is less than first element of B, .NET 7

Method          | n      | Mean      | Error   | StdDev  | Ratio
----------------|--------|-----------|---------|---------|------
Scalar          | 262144 |  714.5 us | 2.11 us | 1.88 us | 1.00
Vector          | 262144 |  148.4 us | 0.15 us | 0.13 us | 0.21
Vector Reversed | 262144 | 1046.7 us | 2.52 us | 2.36 us | 1.47

Last element of A is less than first element of B, .NET 8

Method          | n      | Mean      | Error   | StdDev  | Ratio
----------------|--------|-----------|---------|---------|------
Scalar          | 262144 |  353.7 us | 0.07 us | 0.07 us | 1.00
Vector          | 262144 |  202.2 us | 0.04 us | 0.04 us | 0.57
Vector Reversed | 262144 | 1100.8 us | 0.17 us | 0.15 us | 3.11

Bitonic Merge

Back to the drawing board. I knew that Bitonic sorters were an option but it took me some time to wrap my head around it. At least we can skip half of the algorithm because we're only interested in the merge part.

My first implementation was 4 times slower than the scalar variant. Bitonic merge takes an ascending and a descending sequence and computes the minimum/maximum between both. This results in two new sequences where all elements of the min sequence are less than all elements of the other. The principle is then applied recursively to each half, until there's nothing left to compare. The algorithm only works with input lengths that are a power of two, but many of the comparisons can be done in parallel.
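To make the principle concrete, here's a minimal scalar sketch (illustrative only, not the benchmarked code). It assumes the buffer holds a bitonic range: an ascending run followed by a descending one.

static void BitonicMergeScalar(int[] buffer, int offset, int length)
{
    if (length <= 1)
        return;

    var half = length / 2;
    for (var i = 0; i < half; i++)
    {
        // Compare-and-swap: afterwards every element of the first half is
        // less than or equal to every element of the second half, and both
        // halves are bitonic again.
        var lo = Math.Min(buffer[offset + i], buffer[offset + i + half]);
        var hi = Math.Max(buffer[offset + i], buffer[offset + i + half]);
        buffer[offset + i] = lo;
        buffer[offset + i + half] = hi;
    }

    // Recurse into both halves until single elements remain.
    BitonicMergeScalar(buffer, offset, half);
    BitonicMergeScalar(buffer, offset + half, half);
}

Reversing the second input, which is exactly what the first shuffle in the vectorized code below does, is what turns two ascending runs into one bitonic sequence.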

At one point it just clicked and everything fell into place, but until that moment it wasn't intuitive at all. If you're interested, I would recommend writing the algorithm yourself, with one unit test for every shuffle. And if you ever meet someone who doesn't believe in unit tests, make them implement this thing.

var vecA = Load(ptrA + idxA);
var vecB = Load(ptrB + idxB);
var copyVecA = vecA;
// Reverse B so that A followed by B forms a bitonic sequence.
vecB = Shuffle(vecB, Create(7, 6, 5, 4, 3, 2, 1, 0));

vecA = Min(vecA, vecB);
// Count how many of the minimums came from A; that's how far A advances.
var incrementA = BitOperations.PopCount(
    Equals<int>(vecA, copyVecA).ExtractMostSignificantBits());

// Bitonic merge network: compare-and-swap the halves, shuffle, repeat.
vecA = CompareAndSwap128(vecA);
vecA = Shuffle(vecA, Create(0, 1, 4, 5, 2, 3, 6, 7));
vecA = CompareAndSwap128(vecA);
vecA = Shuffle(vecA, Create(0, 2, 4, 6, 1, 3, 5, 7));
vecA = CompareAndSwap128(vecA);
vecA = Shuffle(vecA, Create(0, 4, 2, 6, 1, 5, 3, 7));

vecA.Store(ptrR + idxR);

idxA += incrementA;
idxB += countVec - incrementA;
idxR += countVec;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<int> CompareAndSwap128(Vector256<int> vec)
{
    var lower = Vector128.Min(vec.GetLower(), vec.GetUpper());
    var upper = Vector128.Max(vec.GetLower(), vec.GetUpper());

    return Create(lower, upper);
}

Except for the shuffles, this only uses full Vector256 operations at the very start. I hesitated to improve this further: using two full Vector256 registers requires more shuffles and ConditionalSelect, so doubling the amount of data without AVX-512 support may not be worth it. Luckily, you don't have to build a correct solution to benchmark it, you just need the right instruction count. The results were promising, so I implemented it. It's ugly and long but doesn't introduce any new concepts.

vecA = Load(ptrA + idxA);
vecB = Load(ptrA + countVec + idxA);
vecC = Load(ptrB + countVec + idxB);
vecD = Load(ptrB + idxB);
copyVecA = vecA;
copyVecB = vecB;
vecC = Shuffle(vecC, Create(7, 6, 5, 4, 3, 2, 1, 0));
vecD = Shuffle(vecD, Create(7, 6, 5, 4, 3, 2, 1, 0));

vecA = Min(vecA, vecC);
vecB = Min(vecB, vecD);
var incrementA = // same as above with both vecA/vecB and copyVecA/copyVecB

vecC = Min(vecA, vecB);
vecD = Max(vecA, vecB);

vecA = ConditionalSelect(Create(-1, -1, -1, -1, 0, 0, 0, 0), vecC, vecD);
vecB = ConditionalSelect(Create(-1, -1, -1, -1, 0, 0, 0, 0), vecD, vecC);
vecB = Shuffle(vecB, Create(4, 5, 6, 7, 0, 1, 2, 3));

vecC = Min(vecA, vecB);
vecD = Max(vecA, vecB);
// ... and so on ...

Final Benchmarks

Desktop Ryzen 7 5700, .NET 8

Random integers

Method      | n      | Mean      | Error   | StdDev  | Ratio
------------|--------|-----------|---------|---------|------
Scalar      | 262144 | 2133.8 us | 1.43 us | 1.27 us | 1.00
Vector      | 262144 | 1056.0 us | 0.21 us | 0.18 us | 0.49
Bitonic 128 | 262144 |  412.3 us | 0.35 us | 0.29 us | 0.19
Bitonic 256 | 262144 |  240.7 us | 0.05 us | 0.04 us | 0.11

Same input

Method      | n      | Mean      | Error   | StdDev  | Ratio
------------|--------|-----------|---------|---------|------
Scalar      | 262144 | 1148.3 us | 0.58 us | 0.51 us | 1.00
Vector      | 262144 | 1046.6 us | 0.30 us | 0.26 us | 0.91
Bitonic 128 | 262144 |  403.4 us | 0.14 us | 0.16 us | 0.35
Bitonic 256 | 262144 |  239.0 us | 0.18 us | 0.15 us | 0.21

Second Argument is tiny (8 elements)

Method        | n      | Mean      | Error   | StdDev  | Ratio
--------------|--------|-----------|---------|---------|------
Scalar        | 262144 | 396.28 us | 0.56 us | 0.50 us | 1.00
Vector        | 262144 | 144.36 us | 0.05 us | 0.04 us | 0.36
Bitonic 128   | 262144 | 148.09 us | 0.04 us | 0.04 us | 0.37
Bitonic 256   | 262144 | n.a.      | n.a.    | n.a.    | n.a.
Binary Search | 262144 |  19.03 us | 0.04 us | 0.03 us | 0.05

Alternating stair steps

Method      | n      | Mean      | Error   | StdDev  | Ratio
------------|--------|-----------|---------|---------|------
Scalar      | 262144 | 1217.7 us | 0.19 us | 0.18 us | 1.00
Vector      | 262144 | 1015.7 us | 0.32 us | 0.28 us | 0.83
Bitonic 128 | 262144 |  393.3 us | 0.08 us | 0.07 us | 0.32
Bitonic 256 | 262144 |  234.0 us | 0.04 us | 0.03 us | 0.19

Concatenated (last element of A is less than first of B)

Method          | n      | Mean      | Error   | StdDev  | Ratio
----------------|--------|-----------|---------|---------|------
Scalar          | 262144 |  353.7 us | 0.07 us | 0.07 us | 1.00
Vector          | 262144 |  202.2 us | 0.04 us | 0.03 us | 0.57
Vector Reversed | 262144 | 1100.8 us | 0.17 us | 0.15 us | 3.11
Bitonic 128     | 262144 |  267.9 us | 0.13 us | 0.13 us | 0.76
Bitonic 256     | 262144 |  191.0 us | 0.09 us | 0.07 us | 0.54

Conclusion

Remember just concatenating and sorting the array at the beginning of this post? The final solution is not only a hundred times more complex, it's also that much faster, which I'd call a win. Although there are probably still a few unoptimized spots left, I'm happy with the results. If someone wants to take it further, there's always AVX-512 support. Happy coding!

Making D3 readable

Over the last few years I've worked on multiple projects that used Data Driven Documents, also known as D3.js or just D3. While I think it's one of the best charting libraries out there, especially if you don't know all of your requirements yet, the resulting code can be frustratingly difficult to maintain. It's very much write once and pray that you never have to update it. This shouldn't be the case, and recently I really tried to get it right. These are my biggest takeaways.

Refactor early and often

Your first draft shouldn't be your last. Tell your higher ups that this is a proof of concept which needs revision after it's done, especially if it is your first time with D3. If you can't do that, use a different library.

Use layers

You will run into rendering issues when SVG elements overlap. Connecting points should be drawn over lines, lines over bars and bars over the background, stuff like that. You can either make sure to always get the order of operations right, or you can use a layer system similar to what you find in image manipulation programs.

function appendLayer(targetSvg, layerName) {
  return targetSvg.append('g').attr('id', layerName);
}

function selectLayer(sourceSvg, layerName) {
  return sourceSvg.select(`#${layerName}`);
}

function createSvgTemplate(...) {
  // ... initialization ...

  appendLayer(svg, 'background');
  appendLayer(svg, 'plot_0');
  appendLayer(svg, 'plot_1');
  appendLayer(svg, 'plot_2');
  appendLayer(svg, 'headline');

  // ... returns ...
}

Coordinates and transforms

Like the DOM, SVG coordinates start at the top left and grow towards the bottom right. It may become an issue if you end up doing a lot of 2D math, where you probably want (0/0) to be at the center or bottom left. It's relatively simple to adjust via viewBox but difficult to change later.

It's tempting to append the axes of your graph to the left and bottom edges and then scale that layer to a smaller size to make room for labels, headlines and everything else. This will lead to subtle issues much later in the project: circles appear oval once the stretch is too extreme, and now you have to handle a second coordinate system on a per-layer basis. If your total SVG area is (0, 0, 100, 100) and you transform one layer by translate(10, 10) scale(0.5, 0.7), your layer's internal coordinates are still (0, 0, 100, 100) but they start at (10, 10) and end at (60, 80) in relation to the viewBox.
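To make that concrete, a minimal sketch (assuming an existing svg selection):

const layer = svg.append('g')
  .attr('transform', 'translate(10, 10) scale(0.5, 0.7)');

// Inside the layer this rect still uses the "full size" coordinates...
layer.append('rect')
  .attr('x', 0).attr('y', 0)
  .attr('width', 100).attr('height', 100);
// ...but in viewBox coordinates it covers (10, 10) to (60, 80),
// and any circle drawn in it is visibly stretched.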

Turn D3 into D1

This one took me way too long to realize. Every modern web framework tries to solve two things: state management and data binding. D3 can do both, it's called Data Driven Documents after all, but stop for a second and ask yourself: should you use these features? Your entire team is accustomed to a certain style and now you introduce a new way of doing things. One that's probably worse than what you're already using. The only thing you really care about in the end is an SVG Document. So do yourself a favor and ignore the Data Driven part.

If you don't, one issue you will almost certainly run into is incomplete data, probably encoded as undefined. If such a value infects your data, it gets turned into NaN and your SVG will break. But sometimes you really want to signal the absence of something, like in line graphs. You can't just replace undefined with 0: a line with incomplete data should be broken up into multiple pieces, and D3's fluent interface will not help you. Now you're mixing application logic with D3 logic.

Just draw the stuff you want to draw in a loop, piece by piece, and assign a unique id to each element. If you need to update stuff, query it by id. Delete it? Query by id. Yes, this does mean that you may have to diff your current state with the previous one but it's a small price to pay.
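The deletion part can be as simple as remembering how many elements you drew last time. A sketch with made-up names, reusing selectLayer from above and the id scheme of the updatePoint example below:

function removeStalePoints(target, metadata, previousCount, currentCount) {
  for (let index = currentCount; index < previousCount; index++) {
    selectLayer(target, metadata.layerName)
      .select(`circle#${metadata.id}_${index}`)
      .remove();
  }
}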

Focus on primitives

For a long time I used to separate my functions into steps that made sense from a business perspective, like initLineGraph, initBarGraph, updateBarGraph etc. but this falls apart with reasonably complicated requirements. Does a new element belong to the initialization function or is it an update? What about deletes? How do I even detect elements ready to be deleted? A better way to do things, even if it sounds worse at first, is to have one function per primitive that does everything:

function isPoison(nbr) {
  return nbr == null || isNaN(nbr) || !isFinite(nbr);
}

function updatePoint(target, index, coordinates, metadata) {
  const id = `${metadata.id}_${index}`;
  let selection = selectLayer(target, metadata.layerName)
    .select(`circle#${id}`);
  const x = coordinates.x(index);
  const y = coordinates.y(index);

  const poison = isPoison(x) || isPoison(y);
  if (poison && selection.empty()) return;
  if (poison) {
    selection.remove();
    return;
  }

  if (selection.empty()) {
    selection = selectLayer(target, metadata.layerName)
      .append('circle')
      .attr('id', id)
      .attr('r', 3);
    // more customization
  }

  selection
    .transition()
    .duration(500)
    .attr('cx', x)
    .attr('cy', y);
}

This is reasonably short, robust, reusable and completely independent of your business requirements. Note that I'm using an index and x/y accessor functions instead of actual data. This not only decouples the interface from the implementation, it also makes your primitive functions more similar: updateLine would use x(index - 1) in addition to x(index) and while an updateRectangle function needs two more accessors to actually make a rectangle, they can just be an extension of your Coordinate interface.
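For reference, the coordinates argument can be as simple as a pair of index-based accessors built on D3 scales (an assumption; the exact shape isn't shown here):

const coordinates = {
  x: (index) => xScale(data[index].timestamp),
  // undefined values come out as NaN and are caught by isPoison above
  y: (index) => yScale(data[index].value),
};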

Sidenote: if you're having trouble making a correct rect in combination with D3's scaling functions, use a polygon instead.

Hover effects and tooltips

Tooltips have no place in the actual SVG. They shouldn't be limited to the bounds of the graph, you shouldn't have to rewrite your corporation's style guide in SVG, and you don't want to run into export issues where a tooltip happens to randomly show up in one graph. So don't do it. Use plain HTML.

SVG has limited support for pointer events, but you don't want them anyway once your project becomes sufficiently complex. Imagine someone with a shaky hand trying to hover over a point that is 3 pixels wide. Not a great user experience. You want something that selects the point closest to the user's mouse. Larger, invisible circles with bubbling pointer events don't help either, because they'll overlap and you still have to somehow figure out which one is closest. Bite the bullet, attach the mouseover and mouseleave events to the SVG's container elements, translate the local mouse coordinates into SVG coordinates and query all points via Signed Distance Functions (SDFs). For most applications the simplest one, a circle, length(point - center) - radius, should be enough. Inigo Quilez has you covered if you need more.
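A minimal sketch of that query (names are made up; it assumes the mouse coordinates have already been translated into the same space as the points):

function circleSdf(px, py, cx, cy, radius) {
  return Math.hypot(px - cx, py - cy) - radius;
}

function closestPoint(points, mouseX, mouseY) {
  let best = null;
  let bestDistance = Infinity;
  for (const point of points) {
    const distance = circleSdf(mouseX, mouseY, point.cx, point.cy, point.radius);
    if (distance < bestDistance) {
      bestDistance = distance;
      best = point;
    }
  }
  return best;
}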

Future improvements

So far I haven't had the time or opportunity to implement it, but I think projection matrices would be a great addition. As an example, consider the data point (1, 1_000_000). Your SVG is probably not 1 million units high, and D3 has different scaling functions to make sure the point is displayed adequately. The scaled plot is then scaled again because you don't want to display only your data points; headlines, axes and the legend all need space, too.

Assuming the scale is linear, this could be done much more elegantly with two matrix multiplications: one to scale your data into (-1, -1), (1, 1) space and one to scale it to part of the SVG container, e.g. to (350, 590), (50, 50). When the mouse pointer hovers over a chart element, a ray cast would be enough to trace all the way from the affected SVG container in the DOM back to the source data that was used to draw a specific element. It could handle click, enter, move and exit events all in one go. You could also discard (almost) everything that's not in the range of (-1, 1) before you do your update, and now you have the math to make it work with canvas and WebGPU, too.
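A sketch of what those two transforms could look like (everything here is made up for illustration, none of it is D3 API):

// Row-major 3x3 affine matrix for 2D points.
function affine(sx, sy, tx, ty) {
  return [sx, 0, tx, 0, sy, ty, 0, 0, 1];
}

function apply(m, [x, y]) {
  return [m[0] * x + m[1] * y + m[2], m[3] * x + m[4] * y + m[5]];
}

// Data extent -> (-1, -1), (1, 1).
function toNdc(xMin, xMax, yMin, yMax) {
  return affine(
    2 / (xMax - xMin), 2 / (yMax - yMin),
    -(xMax + xMin) / (xMax - xMin), -(yMax + yMin) / (yMax - yMin));
}

// (-1, -1), (1, 1) -> a sub-rectangle of the SVG, flipping y so "up" points up.
function toPixels(left, top, width, height) {
  return affine(width / 2, -height / 2, left + width / 2, top + height / 2);
}

// const pixel = apply(toPixels(50, 50, 300, 540), apply(toNdc(0, 10, 0, 1_000_000), [1, 1_000_000]));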

I'm not sure about what's left of D3 after this. Maybe it's time for a new library...

Vectorization has come a long way in .NET

Disclaimer: I originally wrote this blog post for my employer's website, Tekaris. To my knowledge, that website no longer exists.


Vectorization, or Single Instruction Multiple Data (SIMD), is the art of performing the same operation simultaneously on a small block of different pieces of data. While many compilers do support some kind of auto-vectorization, it's often fragile and even small changes can break everything. But SIMD code written by hand is difficult to read and has to be implemented multiple times for multiple platforms, with fallbacks if the customer's CPU doesn't support it.


Over the Christmas holidays I had the chance to revisit and port a personal project from .NET 6 to .NET 7 and gave vectorization a second look. I don't think the above statement is 100% true anymore. In some way, it turned out to be the opposite. Let me give you an example, a Color struct with two methods: Equals as an example for an operator and AlphaBlend, which is a simple textbook implementation of layering two pixels on top of each other.

[StructLayout(LayoutKind.Sequential)]
public readonly struct Color
{
    public float A { get; }
    public float R { get; }
    public float G { get; }
    public float B { get; }

    public Color AlphaBlend(Color top) => new(
        top.A + A * (1 - top.A),
        top.R + R * (1 - top.A),
        top.G + G * (1 - top.A),
        top.B + B * (1 - top.A));

    public bool Equals(Color other) =>
        A == other.A
        && R == other.R
        && G == other.G
        && B == other.B;
}

While both methods are four lines long, the actual logic is only one line, repeated four times. As the name suggests, Single Instruction Multiple Data is meant for these kinds of tasks.

C#'s Vector in the past

C# provides four different types that help us with vectorization: Vector64<T>, Vector128<T>, Vector256<T> and Vector<T>. Vector<T> has an unspecified size and will use one of the other three types internally. The others are fixed size, as their name suggests, but can be split in different ways. One Vector128 can hold two 64 bit long, four 32 bit int or eight 16 bit short. Floating point numbers are supported as well but custom structs will fail at runtime.
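To illustrate, a few concrete examples of those splits (these lines aren't from the original article):

Vector128<long> twoLongs = Vector128.Create(1L, 2L);      // 2 x 64 bit
Vector128<int> fourInts = Vector128.Create(1, 2, 3, 4);   // 4 x 32 bit
Vector128<short> eightShorts = Vector128<short>.Zero;     // 8 x 16 bit
Console.WriteLine(Vector<int>.Count);                     // hardware dependent, e.g. 8 with AVX2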

I did implement a vectorized version of AlphaBlend prior to C# 11 for performance and curiosity. It wasn't pretty. Color is a struct of four 32 bit floats, which is 128 bit in total, the same size as Vector128<float>. To perform operations on a Vector128, you had to use the functions defined in the Sse/Avx static classes. These will fail at runtime if the host's CPU doesn't support them and it's up to you to cover the possibility.

public Vector128<float> Vec128 =>
    Unsafe.As<Color, Vector128<float>>(ref Unsafe.AsRef(in this));

public Color AlphaBlend(Color top)
{
    if (Sse.IsSupported)
    {
        var vResult = Sse.Multiply(Vec128, Vector128.Create(1.0f - top.A));
        vResult = Sse.Add(vResult, top.Vec128);
        return new(vResult);
    }
    else
    {
        return new(
            a: top.A + A * (1 - top.A),
            r: top.R + R * (1 - top.A),
            g: top.G + G * (1 - top.A),
            b: top.B + B * (1 - top.A));
    }
}

Are those unsafe casts safe? The unit tests pass but I'm not sure I would risk it in production. It doesn't support ARM CPUs and looks a lot scarier in languages that only provide cryptic names or have to fall back to metaprogramming to eliminate the if-else branch. At least C#'s JIT is smart enough to detect such simple patterns at runtime and rewrite the function without branches (depending on what the host's CPU supports).

C# 11's Vector

With C# 11 you don't have to write the code above anymore, because the Vector classes now have their own operators and utility functions! It's a small change that has a massive impact on readability and maintainability. The vectorized version is now on par with the scalar version in terms of readability, and even a little shorter. Since it was so easy to make the entire class vectorized, I also changed the ARGB values to be of type Vector128<float> internally and provided getters instead, which yielded additional performance improvements; the Unsafe methods don't seem to be zero cost like I initially thought. Generic Math (also new in C# 11, not shown here) was a breeze to implement.

public readonly struct Color
{
    private readonly Vector128<float> argb;

    public Vector128<float> Vector => argb;

    public float A => argb.GetElement(0);
    public float R => argb.GetElement(1);
    public float G => argb.GetElement(2);
    public float B => argb.GetElement(3);

    public Color AlphaBlend(Color top) =>
        new(Vector * Vector128.Create(1.0f - top.A) + top.Vector);

    public bool Equals(Color other) => Vector == other.Vector;
}

Are there still functions that cannot be implemented this way? Would it be even faster to MemoryMarshal.Cast an array of colors to a span of Vector256<float>? Of course. But the next time you write a simple container of homogeneous data, stop and think for a second whether you could express it as a Vector internally. Maybe someone will thank you for your extra care someday, or at least learn something new while looking through your source code.
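As a footnote, that MemoryMarshal.Cast idea could look roughly like this (a sketch, unbenchmarked; it assumes the internal layout above, equal lengths that are a multiple of two and AVX support, and a real version needs a scalar tail plus a fallback path):

static void AlphaBlendMany(Span<Color> bottom, ReadOnlySpan<Color> top, Span<Color> result)
{
    // Reinterpret two 4-float colors as one Vector256<float>.
    var vBottom = MemoryMarshal.Cast<Color, Vector256<float>>(bottom);
    var vTop = MemoryMarshal.Cast<Color, Vector256<float>>(top);
    var vResult = MemoryMarshal.Cast<Color, Vector256<float>>(result);

    for (var i = 0; i < vBottom.Length; i++)
    {
        // Broadcast each color's alpha (lanes 0 and 4) across its own four lanes.
        var alpha = Vector256.Shuffle(vTop[i], Vector256.Create(0, 0, 0, 0, 4, 4, 4, 4));
        vResult[i] = vBottom[i] * (Vector256.Create(1.0f) - alpha) + vTop[i];
    }
}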

Haskell was right

It's somewhat ironic that many programmers think the (IO) monad is overly complicated and solves problems that could only ever exist in a pure, lazy functional programming language, yet at the same time are convinced that the following code should be considered good practice:

public interface IProcessor
{
    Task<...> ProcessData(int id);
}

public class Processor : IProcessor
{
    private readonly IDbConnector dbConnector;

    public async Task<...> ProcessData(int id)
    {
        var data = await dbConnector.QueryAsync(id);
        return TransformData(data);
    }
}

... where IO is explicitly marked in three different places.

- async because you do not want your app to lag or your web server run out of threads. There are very few cases where async does not imply IO.

- IDbConnector because you might want to mock IO away for testing. Or use a different database later on... because that always works without breaking changes...

- IProcessor because Dependency Injection is the best thing since sliced bread and correctly mocking a database connector is hard.


Of course, all of this is less general than monads, and you will be changing the IProcessor interface every time you add or remove a method from Processor. This code is not and never was about composability or API stability; it's a mess of band-aids for the underlying problems that come with IO. You just accepted it as "clean code" because everyone else does the same thing.

Academia has good reasons for handling things in a certain way and we should build on their foundations, not ignore them.

About

Hi! I'm Philipp, a software developer based in Munich, Germany. While my daily work is mostly full stack development, my interests reach far beyond that, including programming language research, low level optimizations and game engines. Outside of computer science, painting and photography are huge hobbies of mine, too.