Ticket #81 (closed defect: fixed)

Opened 2 years ago

Last modified 18 months ago

[PATCH] Race condition between libusb_submit_transfer() and usbi_handle_disconnect()

Reported by: arne Owned by: pbatard
Milestone: Component: libusb-1.0
Keywords: Cc:
Blocked By: Blocks:

Description

Package: libusb 1.0.8
OS: linux 2.6.31 (OpenSUSE 11.2)

I am using libusb in a multithreaded application with my own dedicated libusb event handler thread. All USB I/O operations (either blocking or asynchronous) are running in other threads.

Running my program through valgrind I got the following error, when I unplugged the device:

==21338== Invalid free() / delete / delete[]
==21338== at 0x40268A6: free (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==21338== by 0x817E72E: submit_bulk_transfer (linux_usbfs.c:1412)
==21338== by 0x817C56C: libusb_submit_transfer (io.c:1236)
[...]
==21338== Address 0xf555920 is 0 bytes inside a block of size 44 free'd
==21338== at 0x40268A6: free (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==21338== by 0x817DB07: op_clear_transfer_priv (linux_usbfs.c:1774)
==21338== by 0x817CC16: usbi_handle_disconnect (io.c:2303)
==21338== by 0x817DC8B: op_handle_events (linux_usbfs.c:2125)
==21338== by 0x817C248: handle_events (io.c:1867)
==21338== by 0x817C41E: libusb_handle_events_locked (io.c:2014)
[...]

It seems that just after submit_bulk_transfer() submitted the urb, libusb detected a removed device through the event handler and freed the urb in op_clear_transfer_priv(). Erroneous return from the ioctl() caused the submit function to free the same urb once more, because it was the first (and actually the only) urb of the transfer.

Possible solution: locking the transfer with pthread_mutex_lock(&itransfer->lock) inside op_clear_transfer_priv() solved the problem for me.

Attachments

device_disconnect_race_condition_fix.patch (2.0 KB) - added by arne 2 years ago.
linux-prevent-double-free-of-urbs-on-device-disconnect.patch (1.1 KB) - added by pbatard 2 years ago.

Download all attachments as: .zip

Change History

comment:1 Changed 2 years ago by arne

Upgraded to libusb-pbatard in the meantime.

Some additional info: actually there is a race condition between libusb_submit_transfer() and usbi_handle_disconnect(). libusb_submit_transfer() calls add_to_flying_list() which adds the transfer to the list and releases the list lock. If transfer fails (as in this case, because the device was just removed), this lock is taken again to remove the transfer from the list. usbi_handle_disconnect() actually can be called in the meantime from the event handling thread even before the transfer is submitted to the system. It will search the list and try to free and complete all transfers that are still pending, including the one that was just added. This would actually cause a situation where a completion upcall would be called for the transfer even if libusb_submit_transfer() actually fails.

The only way to fix this, is to guarantee, that if a transfer is in the list, then it is also successfully submitted. This will also fix the problem in the original ticket. Easiest possible fix: move the flying_transfers_lock out of add_to_flying_list() and lock it explicitly in libusb_submit_transfer():

	usbi_mutex_lock(&ctx->flying_transfers_lock);
	first = add_to_flying_list(itransfer);
	r = usbi_backend->submit_transfer(itransfer);
	if (r) {
		list_del(&itransfer->list);
	}
#ifdef USBI_TIMERFD_AVAILABLE
	[...]
#endif
	usbi_mutex_unlock(&ctx->flying_transfers_lock);

comment:2 Changed 2 years ago by stuge

Thanks for tracking this down. I think your fix seems like the right move. Could you make a patch?

I'd recommend using libusb-stuge.git for platforms other than Windows. -pbatard has all the latest Windows work while -stuge master currently is the merge queue for libusb.git, and it includes some fixes for Linux and Darwin. (Note that -stuge master is rebased when neccessary, until patches are applied to libusb.git.)

comment:3 Changed 2 years ago by arne

  • Summary changed from Device removal causes double free() in Linux usbfs backend to [PATCH] Race condition between libusb_submit_transfer() and usbi_handle_disconnect()

Attached the patch. Sorry for the long delay.

I originally chose the -pbatard repository because it had the libusb_handle_events_timeout() race condition fixed. I'm not very familiar with git. This patch is created against -pbatard repository, but should I post the one against -stuge as well? Anyway they are different only in the index.

comment:4 Changed 2 years ago by pbatard

  • Owner set to pbatard
  • Status changed from new to accepted

comment:5 follow-up: Changed 2 years ago by pbatard

The proposed patch breaks concurrency on Windows, as our poll emulation there requires a call to usbi_fd_notification() before returning from the transfer submission call (so as not to require a wakeup of poll() listeners in order for them to resubmit an updated fd pool, which is an alternative that was also toyed with with the DYNAMIC_FDS option from poll_windows.h, but which we don't see as a desirable option).

I have confirmed breakage of Windows with the proposed patch using the multithreaded libusbDotNet benchmark app.

Since the analysis of the problem indicates that the issue is really with a double free of URBs on Linux (haven't checked OSX), I think we should aim at addressing it with the URBs themselves, rather than by locking the whole transfer submission, though this may require adding URB locks.

As a reminder, the original concurrency problem is as follows:

If a device is removed right after submit_bulk_transfer() is initiated, and consequently that call detects an error, which results in the freeing of URB, usbi_handle_disconnect() may still concurrently attempt to call clear_transfer_priv(), which tries to free() the the same URB.

I will investigate an alternative patch/workaround, focusing on URBs only, but comments are welcome.

comment:6 Changed 2 years ago by stuge

This does seem to relate to #82, which also deals with cleaning up devices disappearing while other stuff is going on. Current libusb-stuge.git master has both this patch and the suggestions in #82 included. It would be interesting to know if the #82 changes are enough. Could test reverting this patch with libusb-stuge master checked out:

git revert 08057817cfc97b784c5cfd416baabe564b231ed5

I fear the problem is not neccessarily specific to Linux, but I also haven't compared the various transfer cleanup paths between backends.

comment:7 Changed 2 years ago by pbatard

Yeah, #82 is in the same area, but I don't think the patches for #82 alone will fix this one, as they apply to explicit transfer cancellation, whereas what occurs here is that linux_usbfs.c's submit_bulk_transfer() may decide to free the urbs on its own, which is the root of the issue:

/* if the first URB submission fails, we can simply free up and

  • return failure immediately. */

if (i == 0) {

usbi_dbg("first URB failed, easy peasy");
free(urbs);
tpriv->urbs = NULL;
return r;

}

I think a better approach might be to do what Darwin does an wait till we find problematic fds in handle_events() to declare the device attached to those as missing and then call usbi_handle_disconnect(), which will do the job of freeing the urbs (though Nathan comments that this approach is a bit heavy handed). I don't really see the immediate need to free the urbs in submit_transfer, but I am however wondering if we could run into a situation where the fd for a removed device is reassigned to another before we have a chance to check for orphans.

I should also point out that the Windows backend is not calling usbi_handle_disconnect() at all, mostly because we work with per-transfer OVERLAPPED in our emulated poll(), so we can just rely on the transfer timeout for cleanup.

Now, with regards to the current libusb-stuge.git, it does indeed fail in the same fashion for the benchmark test, but with #81 reverted things are back to normal.

comment:8 in reply to: ↑ 5 Changed 2 years ago by stuge

Replying to pbatard:

I think we should aim at addressing it with the URBs themselves,

Yes it would indeed be much nicer to avoid locking the whole list of transfers.

though this may require adding URB locks.

Maybe taking itransfer->lock before making any change to the tpriv is enough?

comment:9 follow-up: Changed 2 years ago by arne

I remember that this was a little more tricky. Problem is not only with the URB-s, but actually with the transfer itself. If submit_transfer() fails, then the user may immediately call libusb_free_transfer(), but usbi_handle_disconnect() may already have got the same transfer from the flying list. Locking the itransfer->lock (as it was in the original ticket) did get me a situation where completion upcall was called after the transfer itself was already freed.

comment:10 in reply to: ↑ 9 Changed 2 years ago by pbatard

Replying to arne:

I remember that this was a little more tricky. Problem is not only with the URB-s, but actually with the transfer itself. If submit_transfer() fails, then the user may immediately call libusb_free_transfer(), but usbi_handle_disconnect() may already have got the same transfer from the flying list. Locking the itransfer->lock (as it was in the original ticket) did get me a situation where completion upcall was called after the transfer itself was already freed.

From what I see in the code, only the private data of the transfer should have been freed then, and while usbi_handle_disconnect() can indeed result in a user callback, if you check the transfer status in that callback, you will get a LIBUSB_TRANSFER_NO_DEVICE, so the expectation there is that you will not attempt to process data, but abort/free the transfer.

If needed, I can see a way to free the private data after the callback is called in usbi_handle_disconnect(), but this relies on the transfer and the previous dev_handle not both having been reassigned by the time we're done with callback, which is still potentially problematic.

For now, I will produce a patch that prevents the double freeing of Linux urbs, so that we know where we stand on that. If accesses to the freed private data in user callbacks that are expected to abort only is seens as an issue, we can look into it.

comment:11 Changed 2 years ago by pbatard

New patch is now attached. Could use some testing as I don't have an application that can reproduce the issue. Produced against the latest -pbatard, but should apply against -stuge as well.

Last edited 2 years ago by pbatard (previous) (diff)

comment:12 follow-up: Changed 2 years ago by arne

I have not been able to test it yet, but by looking at it I can see, that it might also need a lock around free_iso_urbs() in op_clear_transfer_priv(), because submit_iso_transfer() is also freeing everything when URB submission fails.

About the transfer problem: in the worst case it may actually happen that usbi_handle_disconnect() will access the private transfer data when it is already freed by libusb_free_transfer() in the other thread. Highly unlikely, but still possible. That said, I don't see it as a very big problem for the normal libusb user. We are currently using libusb for Linux with an application that listens to hotplug events, and allows the user to attach and detach devices at will. As I understand, this is not actually the intended usage, and most users will not do this, so this problem will be quite rare for them. But I can actually say that with this transfer problem, and the ticket #70 (device removal breaks internal state, making later libusb_open calls always fail) fixed, we have not had any problems with this approach. I cannot comment on other platforms though.

Another possible idea for the transfer fix: add a condition variable and counter to struct libusb_device_handle, that will be incremented in libusb_submit_transfer() before adding the transfer to the flying list, and decremented before libusb_submit_transfer() returns. Protect it with flying list lock. Set condition variable, when counter reaches 0. In usbi_handle_disconnect() check the list only if counter is 0, otherwise wait for the condition. Problem with this fix though is that when usbi_handle_disconnect() is not used (as in Windows) then this will just create unnecessary overhead for each transfer...

comment:13 in reply to: ↑ 12 ; follow-up: Changed 2 years ago by pbatard

Replying to arne:

I have not been able to test it yet, but by looking at it I can see, that it might also need a lock around free_iso_urbs() in op_clear_transfer_priv(), because submit_iso_transfer() is also freeing everything when URB submission fails.

Good point, I overlooked that one. I have replaced the previous patch to include this, as well the simplifications discussed on the list.

About the transfer problem: in the worst case it may actually happen that usbi_handle_disconnect() will access the private transfer data when it is already freed by libusb_free_transfer() in the other thread.

I think the current patch should cover that one. usbi_handle_disconnect() calls clear_transfer_priv() which should now properly wait, before it reads the tpriv->urbs pointer with the intent of freeing it, that any other thread is done with it. If tpriv->urbs has been freed in another thread, clear_transfer_priv() should only get a NULL pointer now. The transfer is locked during the whole submit process, so in effect, this patch should not be that different from the one you proposed, in that the free can no longer occur while a transfer is still being submitted.

comment:14 in reply to: ↑ 13 Changed 2 years ago by arne

Tested libusb-pbatard with and without the patch with valgrind. Looks OK for the URB-s, no multiple frees in the patched version anymore.

About the transfer problem: in the worst case it may actually happen that usbi_handle_disconnect() will access the private transfer data when it is already freed by libusb_free_transfer() in the other thread.

I think the current patch should cover that one. usbi_handle_disconnect() calls clear_transfer_priv() which should now properly wait, before it reads the tpriv->urbs pointer with the intent of freeing it, that any other thread is done with it. If tpriv->urbs has been freed in another thread, clear_transfer_priv() should only get a NULL pointer now. The transfer is locked during the whole submit process, so in effect, this patch should not be that different from the one you proposed, in that the free can no longer occur while a transfer is still being submitted.

Seems that I have confused different terms. With "transfer private data" I actually meant "itransfer". The possible (though very rare) transfer problem is a different thing, but fixing this actually fixed the URB problem also:
1) User submits the the allocated transfer to libusb_submit_transfer()
2) Device gets disconnected
3) usbi_handle_disconnect() gets called in the maintenance thread, finds the same transfer in the flying list, but gets scheduled out by system just after that.
4) libusb_submit_transfer() fails
5) user calls libusb_free_transfer() on failure, which destroys itransfer lock and frees everything. There is no locking in libusb_free_transfer().
6) maintenance thread gets scheduled by the system and accesses the itransfer, which is already freed.

Actually the maintenance thread in the patched version probably starts to wait on itransfer lock. libusb_submit_transfer() releases it when it returns, but now there will be a race between the user thread and the maintenance thread, with the former trying to free the transfer and the latter to clear the URB-s and handle completion.

But as I said earlier, for the typical user this should not be a very big concern. After the first fix addressing the URB-s (which looked a lot like your current patch) I did get this situation exactly once in about a month, and while I was specifically testing by attaching and detaching devices. The URB problem was much more frequent.

comment:15 Changed 2 years ago by pbatard

Many thanks for the test report. Much appreciated!

From the comments in usbi_handle_disconnect() I think we are aware that our current approach there is still potentially racy.

One way I see to fix this would be to add another level of indirection (with a mutex/lock) to access the transfer. Instead of returning a direct pointer to the transfer, the flying list operations would return a pointer to a pointer to the transfer (pp). Then when any concurrent call frees that transfer, it sets a NULL for the pointer to the transfer, which will be detected by whoever comes next. We can also move the lock from the transfer into this pp+lock struct, which would be used to mutex both transfer access and free operations.

Now of course, we can't allocate and free a pp+lock struc on the fly without running into exactly the same kind of issues as the one we're trying to avoid, so we'd have to change our flying list implementation to maintain a growable table of pp+lock structs, with reusable elements, that would only be freed on libusb_exit.

This seems fairly heavy handed work for the 1.0.9 release, and there probably exist other/better ways to address the problem (we can probably do away with the added indirection if we just add a flag in the pp+lock struct), so I'd go for release with what we have above, and wait for 1.0.10 to introduce a non trivial rework of how libusb handles the transfer list.

comment:16 Changed 21 months ago by pbatard

In [09ed2fe00817d0282b932282e4802e85439f8687/libusb-stuge]:

Linux: Prevent URB double free race condition on device disconnect

References #81. A submitted transfer that has just been failed by the
kernel could be picked up by an event handler to be cleaned up, where
freeing of URB memory would race with the submit function doing it's
own cleanup and freeing as a result of the submit failing.

libusb_submit_transfer() always holds itransfer->lock, so the race is
avoided by taking the lock in the cleanup path and checking that URB
memory has not already been freed before freeing it there.

[stuge: Add check in cleanup path that URBs have not already been freed]

comment:17 Changed 18 months ago by pbatard

  • Resolution set to fixed
  • Status changed from accepted to closed

In [737ba04ea40f19564b445fbb489907529e75edc9/libusb]:

Linux: Fix #81 URB double free race condition on device disconnect

A submitted transfer that has just been failed by the kernel could be
picked up by an event handler to be cleaned up, where freeing of URB
memory would race with the submit function doing it's own cleanup and
freeing as a result of the submit failing.

libusb_submit_transfer() always holds itransfer->lock, so the race can
be avoided by taking that lock also in the cleanup path and checking
that the URB memory has not already been freed before freeing it there.

As http://libusb.org/ticket/81#comment:14 notes there is still another
possible, but unlikely, race condition between libusb_submit_transfer()
and an event handling thread. That will require more work to solve.

[stuge: Add check in cleanup path that URBs have not already been freed]

Note: See TracTickets for help on using tickets.