In my previous post, I discussed how Linux will silently truncate a big write (> 2 GB) for you. That is allowed by the contract of write(), which may write fewer bytes than you asked for. The problem is that this behavior also applies when you use IO_Uring.
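As a quick recap, the same thing is easy to see with a plain write() call (a minimal sketch; the file path is made up, and the ~2 GB cap is the kernel's MAX_RW_COUNT, INT_MAX rounded down to a page boundary):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Ask for a single 3 GB write; the kernel caps it at MAX_RW_COUNT (about 2 GB),
// reports the short count, and leaves the rest to the caller.
size_t size = 3ULL * 1024 * 1024 * 1024;
char *buffer = malloc(size);
int fd = open("/tmp/big-write.bin", O_CREAT | O_WRONLY, 0644);
ssize_t written = write(fd, buffer, size);
printf("requested %zu bytes, wrote %zd bytes\n", size, written);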
Take a look at the following code:
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) {
    return 1;
}
io_uring_prep_write(sqe, fd, buffer, BUFFER_SIZE, 0);
io_uring_submit(&ring);

struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
    return 2;
}
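If we check the completion at this point, the short write shows up directly in cqe->res (a minimal sketch, continuing with the same ring, cqe, and BUFFER_SIZE as above):

// cqe->res holds the number of bytes actually written (or a negative errno value).
// For a 3 GB request it will be about 2,147,479,552 bytes - the kernel's
// MAX_RW_COUNT with 4 KB pages.
if (cqe->res >= 0 && (size_t)cqe->res < BUFFER_SIZE) {
    // partial write: only cqe->res bytes out of BUFFER_SIZE reached the file
}
io_uring_cqe_seen(&ring, cqe);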
If BUFFER_SIZE is 3 GB, then this will write about 2 GB to the file. The number of bytes written is correctly reported, but the complexity this generates is huge. Consider the following function:
int32_t rvn_write_io_ring(
    void *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code);
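struct page_to_write itself isn't shown here; based on the fields the code below uses (ptr, count_of_pages, page_num), it is roughly something like this (the exact types are my assumption):

// Assumed shape, inferred from how the fields are used below; the real
// definition may differ.
struct page_to_write
{
    char    *ptr;            // start of the in-memory data for this run of pages
    int64_t  page_num;       // position in the file, in units of VORON_PAGE_SIZE
    int32_t  count_of_pages; // how many consecutive pages to write
};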
There is a set of buffers that I want to write, and the natural way to do that is:
int32_t rvn_write_io_ring(
    void *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code)
{
    struct handle *handle_ptr = handle;
    for (size_t i = 0; i < count; i++)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(
            &handle_ptr->global_state->ring);
        io_uring_prep_write(sqe,
            handle_ptr->file_fd,
            buffers[i].ptr,
            buffers[i].count_of_pages * VORON_PAGE_SIZE,
            buffers[i].page_num * VORON_PAGE_SIZE);
    }
    return _submit_and_wait(&handle_ptr->global_state->ring,
        count, detailed_error_code);
}
int32_t _submit_and_wait(
    struct io_uring *ring,
    int32_t count,
    int32_t *detailed_error_code)
{
    int32_t rc = io_uring_submit_and_wait(ring, count);
    if (rc < 0)
    {
        *detailed_error_code = -rc;
        return FAIL_IO_RING_SUBMIT;
    }
    struct io_uring_cqe *cqe;
    for (int i = 0; i < count; i++)
    {
        rc = io_uring_wait_cqe(ring, &cqe);
        if (rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_NO_RESULT;
        }
        if (cqe->res < 0)
        {
            *detailed_error_code = -cqe->res;
            return FAIL_IO_RING_WRITE_RESULT;
        }
        io_uring_cqe_seen(ring, cqe);
    }
    return SUCCESS;
}
In other words, send all the data to the IO Ring, then wait for all those operations to complete. We verify complete success and can then move on. However, because we may have a write that is greater than 2 GB, and because the interface allows the IO Uring to write less than we thought it would, we need to handle that with retries.
After thinking about this for a while, I came up with the following implementation:
int32_t _submit_writes_to_ring(
    struct handle *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code)
{
    struct io_uring *ring = &handle->global_state->ring;
    off_t *offsets = handle->global_state->offsets;
    memset(offsets, 0, count * sizeof(off_t));

    while (true)
    {
        int32_t submitted = 0;
        for (size_t i = 0; i < count; i++)
        {
            off_t offset = offsets[i];
            if (offset == buffers[i].count_of_pages * VORON_PAGE_SIZE)
                continue; // this buffer has already been written in full

            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            if (sqe == NULL) // the ring is full, flush it...
                break;

            io_uring_sqe_set_data64(sqe, i);
            io_uring_prep_write(sqe, handle->file_fd,
                buffers[i].ptr + offset,
                buffers[i].count_of_pages * VORON_PAGE_SIZE - offset,
                buffers[i].page_num * VORON_PAGE_SIZE + offset);
            submitted++;
        }
        if (submitted == 0)
            return SUCCESS;

        int32_t rc = io_uring_submit_and_wait(ring, submitted);
        if (rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_SUBMIT;
        }

        struct io_uring_cqe *cqe;
        uint32_t head = 0;
        uint32_t i = 0;
        bool has_errors = false;
        io_uring_for_each_cqe(ring, head, cqe) {
            i++;
            uint64_t index = io_uring_cqe_get_data64(cqe);
            int result = cqe->res;
            if (result < 0)
            {
                has_errors = true;
                *detailed_error_code = -result;
            }
            else
            {
                offsets[index] += result;
                if (result == 0)
                {
                    // there shouldn't be a scenario where a write returns 0;
                    // we may want to retry here, but figuring out whether this
                    // is a one-off or whether we need to keep retrying the
                    // operation is complex enough to treat it as an error for now
                    has_errors = true;
                    *detailed_error_code = EIO;
                }
            }
        }
        io_uring_cq_advance(ring, i);

        if (has_errors)
            return FAIL_IO_RING_WRITE_RESULT;
    }
}
That is a lot of code, but it is mostly because of how C works. What we do here is scan through the buffers we need to write, alongside an array of offsets that tracks how much of each buffer has already been written.
If a buffer's offset shows that it hasn't been fully written yet, we queue it on the ring and keep going until we either fill the entire ring or run out of buffers to work with. The next step is to submit the work and wait for it to complete, then run through the results, check for errors, and update the written offset for the relevant buffer.
Then, we scan the buffers array again to find either partial writes that we have to complete (we didn’t write the whole buffer) or buffers that we didn’t write at all because we filled the ring. In either case, we submit the new batch of work to the ring and repeat until we run out of work.
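To make the flow concrete, calling the function might look like this (a hypothetical usage sketch; handle, first_run, and second_run are placeholders that would come from the rest of the system):

// Hypothetical usage: write two runs of pages in a single call.
struct page_to_write pages[2] =
{
    { .ptr = first_run,  .page_num = 128, .count_of_pages = 16 },
    { .ptr = second_run, .page_num = 512, .count_of_pages = 4 },
};
int32_t detailed_error = 0;
int32_t rc = rvn_write_io_ring(handle, 2, pages, &detailed_error);
if (rc != SUCCESS)
{
    // rc identifies the failing stage (submit, wait, or the write itself),
    // detailed_error holds the errno value reported by the kernel
}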
This code assumes that we cannot have a non-error state where we write 0 bytes to the file and treats that as an error. We also assume that an error in writing to the disk is fatal, and the higher-level code will discard the entire IO_Uring if that happens.
The Windows version, by the way, is somewhat simpler. Windows explicitly limits the size of the buffer you can pass to the write() call (and its IO Ring equivalent). It also ensures that it will write the whole thing, so partial writes are not an issue there.
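For context, the classic WriteFile() call makes that limit visible right in its signature: the byte count is a 32-bit DWORD, so a single call cannot even ask for more than 4 GB (a sketch; hFile, buffer, and size are placeholders):

#include <windows.h>

// nNumberOfBytesToWrite is a DWORD, so the API itself bounds a single write;
// a synchronous WriteFile on a regular file then either writes it all or fails.
DWORD bytes_written = 0;
BOOL ok = WriteFile(hFile, buffer, (DWORD)size, &bytes_written, NULL);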
It is interesting to note that the code above will effectively stripe writes if you send very large buffers. Let’s assume that we send it two large buffers (4 GB and 6 GB), like so:
| Buffer   | Offset | Size |
|----------|--------|------|
| Buffer 1 | 1 GB   | 4 GB |
| Buffer 2 | 10 GB  | 6 GB |
The pattern of writes that will actually be executed is:
- 1 GB .. 3 GB, 10 GB .. 12 GB
- 3 GB .. 5 GB, 12 GB .. 14 GB
- 14 GB .. 16 GB
I can “fix” that by never issuing writes that are larger than 2 GB and issuing separate writes for each 2 GB range, but that leads to other complexities (e.g., tracking state if I split a write and then hit a full ring, etc.). At those sizes, it doesn’t actually matter in terms of efficiency or performance.
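If I did go that route, the core of it would be clamping each SQE to the kernel's single-write limit, roughly like this (a sketch of the alternative I decided against, not what the code above does):

#include <stddef.h>

// Sketch only: cap one write at what the kernel accepts in a single call
// (MAX_RW_COUNT, INT_MAX rounded down to a 4 KB page boundary). The caller
// would then have to track per-buffer progress across several SQEs, which is
// exactly the extra state mentioned above.
#define MAX_SINGLE_WRITE ((size_t)2147479552)

static size_t clamp_write_len(size_t remaining)
{
    return remaining < MAX_SINGLE_WRITE ? remaining : MAX_SINGLE_WRITE;
}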
Partial writes are almost always a sign of either very large writes that were broken up or some underlying issue that is already problematic, so I don’t care that much about that scenario in general. For the vast majority of cases, this will always issue exactly one write for each buffer.
What is really interesting from my point of view, however, is how even a pretty self-contained feature can get pretty complex internally. On the other hand, this behavior allows me to push a whole bunch of work directly to the OS and have it send all of that to the disk as fast as possible.
In our scenarios, under load, we may call that with thousands to tens of thousands of pages (each 8 KB in size) spread all over the file. The buffers are actually sorted, so ideally, the kernel will be able to take advantage of that, but even if not, just reducing the number of syscalls will result in performance savings.