Skip to content

Add host_read_async interfaces to datasource #18018

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Mar 4, 2025

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Feb 14, 2025

Description

kvikIO supports asynchronous host reads, but we don't utilize them to optimize host reads such as metadata access.
This PR adds the async versions of the host_read APIs to allow efficient use of the kvikIO pool for host reads. The datasources that are not backed by kvikIO implement these as deferred calls to the synchronous versions.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Feb 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 14, 2025
@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Feb 14, 2025
@vuule
Copy link
Contributor Author

vuule commented Feb 14, 2025

/ok to test

@vuule vuule changed the title Add host_read_async interface to datasource Add host_read_async interfaces to datasource Feb 18, 2025
@vuule vuule marked this pull request as ready for review February 18, 2025 21:47
@vuule vuule requested a review from a team as a code owner February 18, 2025 21:47
@vuule vuule requested a review from PointKernel February 18, 2025 21:47
Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vuule
Copy link
Contributor Author

vuule commented Feb 19, 2025

Turns out I forgot to implement host_read_asyncs in the remote datasource, and the deferred async implementation is not great for remote IO. I'll update the PR to include this crucial feature.

@vuule vuule added the 0 - Waiting on Author Waiting for author to respond to review label Feb 19, 2025
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm. Only a few small changes suggested. Looking forward to the S3 performance after this update!

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vuule vuule added the DO NOT MERGE Hold off on merging; see PR for details label Feb 21, 2025
@vuule
Copy link
Contributor Author

vuule commented Feb 21, 2025

Found another spot where the new APIs need to be implemented - user_datasource_wrapper. Will address this tomorrow. Overall I won't merge this until it performs well here.

rapids-bot bot pushed a commit that referenced this pull request Feb 24, 2025
)

Depends on #18018

When reading multiple files, all data(i.e. pages) IO is performed in the same "batch", allowing parallel IO operations (provided by kvikIO). However, footers are read serially, leading to poor performance when reading many files. This is especially pronounced for IO that benefits from high level of parallelism.

This PR performs footer reading/parsing asynchronously using an internal thread pool. The pool size can be controlled with an environment variable `LIBCUDF_NUM_HOST_WORKERS`.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Bradley Dice (https://github.com/bdice)

URL: #17957
@vyasr
Copy link
Contributor

vyasr commented Feb 26, 2025

Now that #17957 is merged, should this PR be merged as well?

@vuule
Copy link
Contributor Author

vuule commented Feb 26, 2025

Now that #17957 is merged, should this PR be merged as well?

#17957 does not depend on this PR any more, I chose to merge the version that works well and will look into the version that uses host_read_async at a later point.

@GregoryKimball GregoryKimball moved this to Burndown in libcudf Mar 3, 2025
@vuule
Copy link
Contributor Author

vuule commented Mar 4, 2025

I'll proceed with merging this PR; these APIs will be required to eliminate redundant copies when host compression is used.

@vuule vuule removed the DO NOT MERGE Hold off on merging; see PR for details label Mar 4, 2025
@vuule
Copy link
Contributor Author

vuule commented Mar 4, 2025

/merge

@rapids-bot rapids-bot bot merged commit 1420ef2 into rapidsai:branch-25.04 Mar 4, 2025
105 of 106 checks passed
@GregoryKimball GregoryKimball moved this from Burndown to Landed in libcudf Mar 4, 2025
@GregoryKimball GregoryKimball removed this from libcudf May 4, 2025
@vuule vuule deleted the fea-host_read_async branch May 22, 2025 00:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants