>Do you know how to specifically target multiple CPUs in a sort in order to maximize their use?
I don't understand your question. Do you want to specialize an algorithm for high-performance machines or make it usable on more machines? You seem to be asking for both.
>sort of a divide and conquer by setting multiple threads sorting one portion at a time ?
Perhaps in a huge distributed sort across multiple machines, but that's typically where one would run an external sort in each process (rather than dividing the work of a single algorithm among threads) and then merge the results into one. Otherwise, no: individual algorithms are typically optimized by making them cache-aware and taking advantage of branch prediction.
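To make the "sort each portion, then merge" idea concrete, here is a minimal sketch in Python. It is an in-memory stand-in for illustration, not a real external (on-disk) sort, and the names `split`, `chunked_sort`, and the `map_fn` parameter are my own assumptions rather than anything from a particular library. Each chunk is sorted independently, and the sorted runs are then combined with a k-way merge.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor


def split(data, n_chunks):
    """Split data into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunked_sort(data, n_chunks=4, map_fn=map):
    """Sort each chunk independently, then k-way merge the sorted runs.

    map_fn is a hypothetical hook: pass an executor's map method to sort
    the chunks in separate processes, or keep the built-in map to run
    everything serially.
    """
    runs = map_fn(sorted, split(data, n_chunks))
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    import random

    values = [random.randrange(1_000_000) for _ in range(200_000)]
    # One worker process per CPU by default; each sorts one chunk.
    with ProcessPoolExecutor() as pool:
        result = chunked_sort(values, n_chunks=8, map_fn=pool.map)
    print(result == sorted(values))
```

Whether this beats a single well-tuned sort depends on the data size: shipping chunks to other processes and merging them back has its own cost, so for small inputs the serial version will likely win.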