Large transfers in ROMIO
Let’s say you had a fat node with lots of memory, and you wanted to write 2 GiB of data out to a file with MPI-IO. How would you do that? You would naturally look at MPI_File_write() or one of its variants. This is the prototype for MPI_File_write_all:
int MPI_File_write_all(MPI_File fh, const void *buf, int count, MPI_Datatype datatype, MPI_Status *status)
Perfect! Even though C ‘int’ types are 32 bits on most platforms, 2^31 - 1 (that is, anything less than 2147483648) will fit into a signed int. But until recently, when we tried to do this in ROMIO, we failed.
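To make the limit on the ‘int count’ argument concrete, here is a small sketch in C. The helper names fits_in_int_count and to_int_count are made up for this example; they are not part of MPI or ROMIO. The sketch assumes a platform where int is 32 bits, so INT_MAX is 2^31 - 1.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: can this many elements be passed as an MPI-style
 * `int count`?  True when nelems <= INT_MAX (2147483647 with 32-bit int). */
int fits_in_int_count(uint64_t nelems)
{
    return nelems <= (uint64_t)INT_MAX;
}

/* Hypothetical helper: convert an element count to int, refusing to
 * silently truncate a value that does not fit. */
int to_int_count(uint64_t nelems)
{
    assert(fits_in_int_count(nelems));
    return (int)nelems;
}
```

Under that assumption, a byte count of exactly 2 GiB (2147483648) is one too many for the count argument, while the same 2 GiB described as 2^29 four-byte elements fits comfortably.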
It turns out that, despite ROMIO being 15 years old, we were using POSIX system calls incorrectly. The write(2) system call doesn’t actually have to write out all the data you asked it to. It’s perfectly legal for it to return success but write only a few bytes of your request. For most transfers, though, a successful write and a “full write” were the same thing, and so we got by for years without testing for “short writes”. Until recently.
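As a rough illustration of what testing for short writes looks like, here is a minimal sketch in C. This is not ROMIO’s actual code; the helper name write_fully and its return convention are assumptions made for this example. The idea is simply to look at how many bytes write(2) reports and keep going until the whole buffer is out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not ROMIO's implementation): call write(2)
 * repeatedly until all `len` bytes are written or a real error occurs.
 * Returns 0 on success, -1 on error with errno left as set by write(2). */
int write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t remaining = len;

    assert(buf != NULL || len == 0);

    while (remaining > 0) {
        ssize_t n = write(fd, p, remaining);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* genuine error */
        }
        /* n may be smaller than `remaining`: that is a short write.
         * Advance past what was written and ask for the rest. */
        p += n;
        remaining -= (size_t)n;
    }
    return 0;
}
```

A caller that used to treat any non-negative return from write(2) as “done” would instead call write_fully(fd, buf, len) and check for 0.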
I recently fixed this in git, so folks who were having difficulty with large-ish transfers (in 2013, 2 GiB isn’t that large) should enjoy the next MPICH release.
Note that this fix is not the same as transferring more than 2 GiB of data. That work requires a bit more attention to the MPI type system. I’ll write up a bit about that some other time.