Difference between the OpenMPI versions 4.1.5 and 4.1.7

Hi,
I am not sure if this is the correct thread, however, I would like to ask a quick question regarding OpenMPI in nvhpc_sdk version 24.7.
If I run my executable with mpirun (distributing over different hosts in a heterogeneous cluster, simple CPU test for the moment) located in “24.7/comm_libs/12.5/openmpi4/openmpi-4.1.5/bin/” I get errors of the kind:

btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier

However, if I use the version e.g. located in “24.7/comm_libs/mpi/bin/” it runs without issues.
Checking the versions of mpirun shows that the former is 4.1.5 and the latter is 4.1.7. Since the newer version works for me but the older one does not, I was wondering: what is the difference between these two versions that could explain the reported behavior?

Many thanks.
Reto

Hi Reto,

Since the newer version works for me but the older one does not, I was wondering: what is the difference between these two versions that could explain the reported behavior?

While I’m not an expert here, my best guess is that the difference is due to UCX. The default “comm_libs/mpi/bin” points to the HPC-X install rather than the OpenMPI 4.1.5 install. While HPC-X is based on OpenMPI, it uses UCX, and UCX is most likely the key difference rather than 4.1.5 vs 4.1.7.
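One quick way to check which install a given mpirun resolves to, and whether it was built with UCX support, is a sketch like the following (the explicit path below is an assumption based on the layout described above; adjust it to your installation root):

```shell
# Check which mpirun is first on PATH and report its version,
# e.g. "mpirun (Open MPI) 4.1.7"
which mpirun
mpirun --version

# List the components a specific install was built with; an
# HPC-X based build should show UCX components such as pml/ucx
/path/to/24.7/comm_libs/mpi/bin/ompi_info --parsable | grep -i ucx
```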

As for what the error itself means, I’m not sure, but a web search turns up others hitting similar errors over the TCP transport that go away when running with UCX.
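If you want to test that hypothesis directly, Open MPI lets you pick the messaging layer at run time via MCA parameters. A minimal sketch (my_app, the host names, and the process count are placeholders, and forcing UCX on the 4.1.5 install only works if that build was compiled with UCX support):

```shell
# Force the UCX PML (what HPC-X selects by default); if this also
# works under the 4.1.5 install, UCX is indeed the differentiator
mpirun --mca pml ucx -np 4 -host host1,host2 ./my_app

# Force the ob1 PML with the plain TCP BTL, i.e. the code path that
# produced the "received unexpected process identifier" error
mpirun --mca pml ob1 --mca btl tcp,self -np 4 -host host1,host2 ./my_app
```

Comparing the two runs on the same hosts should isolate whether the failure is tied to the TCP BTL rather than to the OpenMPI version number.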

-Mat