Add `host_read_async` interfaces to datasource #18018
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
LGTM
Turns out I forgot to implement |
Lgtm. Only a few small changes suggested. Looking forward to the S3 performance after this update!
Co-authored-by: Tianyu Liu <[email protected]>
LGTM
Found another spot where the new APIs need to be implemented - |
Depends on #18018

When reading multiple files, all data (i.e., pages) IO is performed in the same "batch", allowing parallel IO operations (provided by kvikIO). However, footers are read serially, leading to poor performance when reading many files. This is especially pronounced for IO that benefits from a high level of parallelism.

This PR performs footer reading/parsing asynchronously using an internal thread pool. The pool size can be controlled with the environment variable `LIBCUDF_NUM_HOST_WORKERS`.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Paul Mattione (https://github.com/pmattione-nvidia)
- Bradley Dice (https://github.com/bdice)

URL: #17957
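For readers unfamiliar with the pattern, the footer-reading change described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the libcudf implementation: `num_host_workers`, `run_parallel`, and the fallback worker count of 4 are hypothetical, and only the environment variable name `LIBCUDF_NUM_HOST_WORKERS` comes from the message itself.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: reads the worker count from the environment.
// The fallback value is an assumption made for this sketch.
inline std::size_t num_host_workers(std::size_t fallback = 4)
{
  char const* env = std::getenv("LIBCUDF_NUM_HOST_WORKERS");
  if (env == nullptr) return fallback;
  char* end    = nullptr;
  long const v = std::strtol(env, &end, 10);
  return (end != env && v > 0) ? static_cast<std::size_t>(v) : fallback;
}

// Hypothetical helper: runs task(i) for every i in [0, count) on up to
// `workers` threads and returns the results in index order.
// Result must be default-constructible; tasks are assumed not to throw.
template <typename Result>
std::vector<Result> run_parallel(std::size_t count,
                                 std::size_t workers,
                                 std::function<Result(std::size_t)> const& task)
{
  assert(task);
  std::vector<Result> results(count);
  std::atomic<std::size_t> next{0};

  // Each worker repeatedly claims the next unprocessed index.
  auto worker = [&] {
    for (std::size_t i = next.fetch_add(1); i < count; i = next.fetch_add(1)) {
      results[i] = task(i);
    }
  };

  auto const n = std::min(std::max<std::size_t>(workers, 1), std::max<std::size_t>(count, 1));
  std::vector<std::thread> pool;
  pool.reserve(n);
  for (std::size_t t = 0; t < n; ++t) { pool.emplace_back(worker); }
  for (auto& th : pool) { th.join(); }
  return results;
}
```

In this sketch, a reader would call something like `run_parallel<Footer>(files.size(), num_host_workers(), parse_footer)`, where `Footer` and `parse_footer` are placeholders for whatever per-file metadata parsing is needed. Each footer is then parsed on a worker thread instead of serially.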
Now that #17957 is merged, should this PR be merged as well?
I'll proceed with merging this PR; these APIs will be required to eliminate redundant copies when host compression is used.
/merge
1420ef2 into rapidsai:branch-25.04
Description
kvikIO supports asynchronous host reads, but we don't utilize them to optimize host reads such as metadata access.
This PR adds async versions of the `host_read` APIs to allow efficient use of the kvikIO pool for host reads. The `datasource`s that are not backed by kvikIO implement these as deferred calls to the synchronous versions.
Checklist
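As a rough illustration of the "deferred calls to the synchronous versions" fallback, here is a minimal sketch. It uses simplified stand-in classes, not the actual libcudf `datasource` interface: `simple_source`, `memory_source`, and the exact signatures are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a datasource (not the libcudf class).
class simple_source {
 public:
  virtual ~simple_source() = default;

  // Synchronous read: copies up to `size` bytes starting at `offset` into
  // `dst` and returns the number of bytes actually copied.
  virtual std::size_t host_read(std::size_t offset, std::size_t size, std::uint8_t* dst) = 0;

  // Default async version: a deferred call to the synchronous read.
  // No work happens until the caller invokes get() or wait() on the future.
  virtual std::future<std::size_t> host_read_async(std::size_t offset,
                                                   std::size_t size,
                                                   std::uint8_t* dst)
  {
    return std::async(std::launch::deferred,
                      [this, offset, size, dst] { return host_read(offset, size, dst); });
  }
};

// A source with no async backend; it inherits the deferred fallback.
class memory_source : public simple_source {
 public:
  explicit memory_source(std::vector<std::uint8_t> data) : data_(std::move(data)) {}

  std::size_t host_read(std::size_t offset, std::size_t size, std::uint8_t* dst) override
  {
    assert(dst != nullptr);
    if (offset >= data_.size()) { return 0; }
    auto const n = std::min(size, data_.size() - offset);
    std::memcpy(dst, data_.data() + offset, n);
    return n;
  }

 private:
  std::vector<std::uint8_t> data_;
};
```

With `std::launch::deferred`, the read runs on the calling thread when the future is waited on, so sources without an async backend keep their existing behavior while callers can use a single future-based code path. A source backed by a real thread pool would override `host_read_async` to submit the read to that pool instead.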