Lustre driver story
ROMIO has a general-purpose file system driver we call “UFS” (for Unix File System). UFS contains no file-system-specific optimizations: just data sieving and two-phase collective buffering.
The generic approach works, in that it gives correct answers, but it has two big problems when writing to Lustre:
- When assigning file domains, UFS simply takes the start and end of the aggregate request and divides that range evenly among the I/O aggregators. On Lustre those evenly-split domains straddle stripe boundaries, so several aggregators end up touching the same stripe and the servers spend their time shuttling locks around. We wrote about Wei-keng’s SC 2008 paper in this area earlier; there is a small sketch of the two approaches after this list.
- The collective buffering algorithm will do a read-modify-write if there are any holes or gaps in the request. There is a crossover point (specific to each file system deployment) beyond which data sieving no longer wins: two large writes, for example, would be better than one big read-modify-write. The hint example after this list shows how users can experiment with that tradeoff.
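To make the first problem concrete, here is a rough sketch comparing an even split of the access range with one that rounds each domain boundary down to a Lustre stripe boundary. This is not ROMIO's actual code, and the names and sizes are illustrative; Wei-keng's group-cyclic scheme goes further than simple alignment, but even this comparison shows how an even split scatters aggregators across stripes.

```c
/* A rough sketch, not ROMIO internals: compare an even file-domain split
 * with one that rounds each boundary down to a Lustre stripe boundary.
 * All names and sizes here are illustrative. */
#include <stdio.h>

/* UFS-style split: carve [start, end) evenly across aggregators,
 * paying no attention to where stripe boundaries fall. */
static long long even_start(long long start, long long end, int naggr, int rank)
{
    long long span = (end - start + naggr - 1) / naggr;
    return start + (long long)rank * span;
}

/* Stripe-conscious split: round each domain boundary down to a stripe
 * boundary so a single stripe is never split between two aggregators. */
static long long aligned_start(long long start, long long end, int naggr,
                               int rank, long long stripe)
{
    long long s = even_start(start, end, naggr, rank);
    return (s / stripe) * stripe;
}

int main(void)
{
    long long start = 0, end = 10LL * 1048576 + 4096; /* ~10 MiB request */
    long long stripe = 1048576;                       /* 1 MiB Lustre stripe */
    int naggr = 4;

    for (int r = 0; r < naggr; r++)
        printf("aggregator %d: even start %lld, stripe-aligned start %lld\n",
               r, even_start(start, end, naggr, r),
               aligned_start(start, end, naggr, r, stripe));
    return 0;
}
```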
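As for the second problem, data sieving is something users can experiment with today through ROMIO hints passed at open time. The sketch below uses the standard hint names romio_ds_write and striping_unit, but exactly which code paths a hint affects depends on your MPI library, the driver, and the ROMIO version; the path and sizes are placeholders.

```c
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Ask ROMIO to skip data sieving on writes: issue the pieces
     * directly instead of a read-modify-write over the holes. */
    MPI_Info_set(info, "romio_ds_write", "disable");
    /* Tell ROMIO the Lustre stripe size, in bytes, so file domains can
     * line up with it (1 MiB here is just an example). */
    MPI_Info_set(info, "striping_unit", "1048576");

    MPI_File fh;
    /* "/mnt/lustre/out.dat" is a placeholder path */
    MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes one 4 KiB block at a 1 MiB-aligned offset,
     * so the overall pattern has holes between ranks' data. */
    char buf[4096];
    memset(buf, rank, sizeof(buf));
    MPI_File_write_at_all(fh, (MPI_Offset)rank * 1048576,
                          buf, sizeof(buf), MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```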
We rely on the community to contribute many of the fs-specific drivers (e.g. PanFS, XFS), and through 2009 and 2010 the Lustre community did just that. Weikuan Yu did some early work while he was at ORNL. Sun’s developers contributed more improvements, including an independently-developed version of Wei-keng’s group-cyclic distribution. End-users Martin Pokorny at NRAO and Pascal Deveze at BULL contributed additional testing and patching. As a result, ROMIO ended up with a Lustre driver that addresses both of the problems discussed above.
Lustre users should still let us know how things are going: is collective MPI-IO working well? Working poorly? The more community involvement we get, the better we can make things.