Algorithm and example for boolean per pair draw order parallel GPU?

Well, my merge sort did not work and I am not exactly an expert on this. What would be a good algorithm for like from initializing a draw order with all it will hold to per each pair possible there is a boolean of whether before or behind arbitrarily and sort by that? Trying to reduce memory as that will be much bigger than normal 3d for this. Also, a good example of whatever?

Like start with an array of shapes, I can arbitrarily say shape 0 is before or after shape 1 and any other unique pair could have a similar value. I have the failed algorithm if anyone wants to help. It is currently like on average, z and distance from center in 2d. This is for an any dimensional algorithm so I would like to keep my points data in place and maybe shapes data too. Draw order, move at will. Radix sort will not work because these are not numbers. I am thinking of Bitonic, Merge, or Sample sorts now but like I know. Note that draw order will have to be initialized to hold indices that will be all shapes. X E.

I think merge sort did not work for me, bitonic sort used half resources, sample sort has no examples, quicksort seems viable, any good open source parallel on GPU example with numbers or something like that? Note this is for open source, if you can’t find a license I can’t use it. If it works on numbers other than Radix sort it will probably work. X E.

No quicksort, it is much slower, I guess Bitonic it is, Bitonic Sort on CUDA. On a quick benchmark it was 10x faster than the CPU version. · GitHub,
This version is 2 times faster then the original one. After a study based on project for high performance computing sorting in cuda we made this version. João Silva, Tiago Ribeiro and Nuno Duarte {2120905,2120907,2120909}@my.ipleiria.pt · GitHub, and I think this is solved. Unless of course you people can do better. X E.

Unfortunately I tried this and it still did not work, anyone want to review? Bitonic sort as should work did not. I guess best way might just be to copy RinomXE on CPU, or not, that one was sequential but I know it works. Any advice? I know my previous merge sort was not enough sorting, Bitonic at least was consistent. Getting closer. Try again tomorrow. X E.

I am thinking of making my own binary insertion sort parallel GPU thing because I tried Bitonic sort and it didn’t sort correctly despite being a correct algorithm. Based on RinomXE CPU versions. Basically like sort initial 2, have min and max to start, binary whether middle rounded is before or after, move one of bounds, if they are one off, move same one but to equal other, then once they are equal, compare with that index, if before then bump up one, otherwise, keep it, then move all down that would be after it by one inserting in empty space it results in. Deadlock so only one can input and shift at once. All that would be comparing not in shifted or after already shifted can continue, shift also based on being in bounds that would shift. I know mostly how to do this.

So basically it would be atomic what current one would be to shift, and atomic which is currently next to do, each complete iteration a check whether that next to do is after the last index, return if it is. For shifting bounds, if currently going for shifting then shift if applicable, maybe a short log of the last ones done and a number per thread like which one was the last input. Like start with input to log, then increment a counter of things, have each thread while waiting have its own counter and count off log while waiting counting up as it goes, might have to shift log or something though. For efficiency I could go by the size of log counting up on a modulus like each next up is modulus of log size. I know, lots of stuff but this should work. Also as an error condition if the counter is too far off and past one full log then redo with that index and update the counter to be current.

One optimization I could do is immediately place invalid indices at the end with no comparisons. Any improvements or anything I should try? Radix sort is not an option. X E.

I was doing Bitonic wrong, we shall see if it works fixed. X E.

//Bitonic-Sort/BitonicSort.cu at main · Mooix/Bitonic-Sort · GitHub
//7/2/2025, 6:10PM, done and started. This based on above link and translated by Grok made by xAI.
//6:11PM, done. X E.
//6:23PM, done. X E.
//7:09PM, done. X E.
//7/3/2025, 3:54PM, start making it after
//Bitonic Sort on CUDA. On a quick benchmark it was 10x faster than the CPU version. · GitHub
//3:54PM, X E.
//4:15PM done. X E. 4:24PM, done? X E.
//4:36PM, done. X E.
//7:57PM, original was under License: BSD 3
//7:57PM, X E.
//7/4/2025, 3:43PM, like done. X E.
/*
7/26/2023, 12:32PM, start.
Started by Norvel M. IV, Josiah.
Started on Ubuntu Touch using uText.
I define “X E” and “XE” ending things to mean:
“All that is near and with this me and since like my last adequate uncertainty to this me as per one might be input maybe or might not be input maybe or might be output maybe or might not be output maybe, maybe or maybe not, maybe.”.
In a sentence, “A” may be lower case and if a period is after “X E” then period may be omitted from that definition.
This Rinom is intended to allow drawing in 3 or more dimensions using dimension axes on a 2 dimensional screen.
X E.
*/
// Define uniform parameters: ij for bitonic sort phase, N for buffer size
// Define the input and output buffers
RWStructuredBuffer points;
RWStructuredBuffer points2d;
RWStructuredBuffer shapes;
RWStructuredBuffer shapeEnds;
RWStructuredBuffer zDists;
RWStructuredBuffer drawOrder;
// Constant buffer for parameters
cbuffer ComputeParams {
uint3 xyz; // 3 integers for x, y, z
uint unitSize; // Number of floats per unit
uint pointNumber; // Number of points
uint shapeNumber; // Number of shapes
float3 constants; // 3 float constants
float2 offset2d; // 2d offsets
uint shouldRound; // Whether to round
float roundWith; // What to round with
float precPad; // Precision padding
//X E
};
//X E
uniform uint4 ij; // ij.x is j, ij.y is k
//X E

// Bitonic compute shader entry point
[shader(“compute”)]
[numthreads(256, 1, 1)]
void bitonicSortMain(uint3 dispatchThreadID : SV_DispatchThreadID) {
//Variable holding counterpart
uint temp = dispatchThreadID.x^ij.x;
// Exit early if index is out of bounds or buffer is empty
//Implied: dispatchThreadID.x >= shapeNumber ||
if (shapeNumber < 2 || temp >= shapeNumber || temp<=dispatchThreadID.x) {
//X E
return;
//X E
}
// Only perform comparison if both indices are within bounds
if ((dispatchThreadID.x & ij.y) == 0) {
if (zDists[drawOrder[dispatchThreadID.x]][0]<zDists[drawOrder[temp]][0]||(zDists[drawOrder[dispatchThreadID.x]][0]==zDists[drawOrder[temp]][0]&&zDists[drawOrder[dispatchThreadID.x]][1]<zDists[drawOrder[temp]][1])) {
temp = drawOrder[temp];
drawOrder[dispatchThreadID.x^ij.x]=drawOrder[dispatchThreadID.x];
drawOrder[dispatchThreadID.x] = temp;
}
} else {
if (zDists[drawOrder[dispatchThreadID.x]][0]>zDists[drawOrder[temp]][0]||(zDists[drawOrder[dispatchThreadID.x]][0]==zDists[drawOrder[temp]][0]&&zDists[drawOrder[dispatchThreadID.x]][1]>zDists[drawOrder[temp]][1])) {
temp = drawOrder[temp];
drawOrder[dispatchThreadID.x^ij.x]=drawOrder[dispatchThreadID.x];
drawOrder[dispatchThreadID.x] = temp;
}
}
//X E
}
//X E

Well, anyone want to say what I did wrong? I am trying to reverse order from as in link. X E.

It is Bitonic, powers of 2 ONLY. Padding is not good for this. Back to merge sort, I hear Nvidia had one, where can I find it? X E.

I decided to try Bitonic powers of 2 with my own algorithm to sort in the rest. X E.

Forget Bitonic, for large data sets every AI I use is saying merge sort, I hear Nvidia has one, where can I find it? Hint: Thrust::sort. X E.

cccl/cub/cub/device/device_merge_sort.cuh at main · NVIDIA/cccl · GitHub, I don’t understand it though. X E.

I now know how to implement radix sort and merge sort, all I need. X E.