[SPARK-37728][SQL] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException #35002
@c21 Could you take a look when you are free? Thanks! Looking forward to your feedback.
@yym1995 thank you for submitting a fix! Could you help add a unit test case as well?
ok to test
cc @dongjoon-hyun FYI
Kubernetes integration test starting
Kubernetes integration test status failure
@c21 I just added a unit test case. Please take a look, thanks!
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #146545 has finished for PR 35002 at commit
Test build #146550 has finished for PR 35002 at commit
Thank you @yym1995 for the fix! I have some comments on the code structure, but the fix is valid.
@@ -31,26 +32,27 @@
   */
  public class OrcArrayColumnVector extends OrcColumnVector {
    private final OrcColumnVector data;
-   private final long[] offsets;
-   private final long[] lengths;
+   private ListColumnVector listData;

    OrcArrayColumnVector(
I think we don't need to store another copy of vector here. We can change `OrcColumnVector.baseData` from `private` to `protected`, and just use `baseData` to get offset and length. So how about defining the constructor like below:
    OrcArrayColumnVector(
        DataType type,
        ListColumnVector vector,
        OrcColumnVector data) {
      super(type, vector);
      this.data = data;
    }
@@ -32,28 +33,30 @@
  public class OrcMapColumnVector extends OrcColumnVector {
    private final OrcColumnVector keys;
    private final OrcColumnVector values;
-   private final long[] offsets;
-   private final long[] lengths;
+   private MapColumnVector mapData;

    OrcMapColumnVector(
Similarly, we can define the `OrcMapColumnVector` constructor as below:
    OrcMapColumnVector(
        DataType type,
        MapColumnVector vector,
        OrcColumnVector keys,
        OrcColumnVector values) {
      super(type, vector);
      this.keys = keys;
      this.values = values;
    }
    }

    @Override
    public ColumnarMap getMap(int ordinal) {
-     return new ColumnarMap(keys, values, (int) offsets[ordinal], (int) lengths[ordinal]);
+     return new ColumnarMap(keys, values, (int) mapData.offsets[ordinal],
    public ColumnarMap getMap(int ordinal) {
      int offset = (int) ((MapColumnVector) baseData).offsets[ordinal];
      int length = (int) ((MapColumnVector) baseData).lengths[ordinal];
      return new ColumnarMap(keys, values, offset, length);
    }
    }

    @Override
    public ColumnarArray getArray(int rowId) {
-     return new ColumnarArray(data, (int) offsets[rowId], (int) lengths[rowId]);
+     return new ColumnarArray(data, (int) listData.offsets[rowId], (int) listData.lengths[rowId]);
We can just use `baseData` here:
    public ColumnarArray getArray(int rowId) {
      int offset = (int) ((ListColumnVector) baseData).offsets[rowId];
      int length = (int) ((ListColumnVector) baseData).lengths[rowId];
      return new ColumnarArray(data, offset, length);
    }
@c21 Thank you for the feedback! I have already changed the code structure.
@dongjoon-hyun This PR fixes a bug in the ORC vectorized reader. @c21 has reviewed this PR, and I have improved the code according to the feedback. I was wondering if you could merge this PR, thanks!
LGTM with pending CI tests. cc @viirya and @cloud-fan as well.
retest this please
The fix LGTM. Please click the failed GitHub Action and follow the instructions to fix the issues.
5cd704b to ee293e3
ee293e3 to 0805ece
The fix looks good. Pending CI.
GA seems unstable now. You can submit an empty commit to re-trigger it.
As you added a unit test, you can modify the PR description too.
+1, LGTM (Pending CI)
Now all checks have passed. cc @cloud-fan @dongjoon-hyun @HyukjinKwon @viirya
I revised the PR description. Merged to master.
Welcome to the Apache Spark community, @yym1995.
OK, will do.
@dongjoon-hyun - thanks for merging. We only need to backport to branch-3.2, right? The related code was only introduced in the 3.2 branch (in the PR for the ORC vectorized reader of nested columns).
Yes, we only need to backport to the branches where the test case fails. BTW, @yym1995, if
…can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35002 from yym1995/fix-nested.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of #35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin Yang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin Yang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 5f9b92c)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

When an `OrcColumnarBatchReader` is created, method `initBatch` is called only once. In `initBatch`:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of `toOrcColumnVector` is a `ListColumnVector` or `MapColumnVector`, `orcVectorWrappers[i]` is initialized with that vector's `offsets` and `lengths` arrays.

However, when `nextBatch` of `OrcColumnarBatchReader` is called, `ensureSize` of `ColumnVector` (and its subclasses, like `MultiValuedColumnVector`) may be called, after which the `offsets` and `lengths` fields of the `ListColumnVector`/`MapColumnVector` may refer to new array objects. The wrapper still reads the old arrays, which can result in an `ArrayIndexOutOfBoundsException`.

This PR makes `OrcArrayColumnVector.getArray` and `OrcMapColumnVector.getMap` always get offsets and lengths from the underlying `ColumnVector`, which resolves this issue.
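
The stale-reference problem described above can be reproduced in isolation, without Spark or ORC on the classpath. The following is only a minimal sketch: `FakeListVector`, `CachingWrapper`, and `LiveWrapper` are hypothetical stand-ins invented for illustration (they are not Spark or ORC classes), and the `ensureSize` here merely imitates a reallocation that swaps in new array objects.

```java
import java.util.Arrays;

// Hypothetical stand-in for a list-style column vector: only the two arrays matter here.
class FakeListVector {
    long[] offsets;
    long[] lengths;

    FakeListVector(int size) {
        offsets = new long[size];
        lengths = new long[size];
    }

    // Imitates a resize: when growing, the fields are pointed at NEW array objects.
    void ensureSize(int size) {
        if (size > offsets.length) {
            offsets = Arrays.copyOf(offsets, size);
            lengths = Arrays.copyOf(lengths, size);
        }
    }
}

// Pattern with the bug: captures the array references once, at construction time.
class CachingWrapper {
    private final long[] offsets;

    CachingWrapper(FakeListVector vector) {
        this.offsets = vector.offsets;
    }

    long offsetAt(int rowId) {
        return offsets[rowId]; // may index into an outdated, smaller array
    }
}

// Pattern used by the fix: keeps the vector and re-reads its field on every call.
class LiveWrapper {
    private final FakeListVector vector;

    LiveWrapper(FakeListVector vector) {
        this.vector = vector;
    }

    long offsetAt(int rowId) {
        return vector.offsets[rowId]; // always sees the current array
    }
}

class StaleOffsetsDemo {
    public static void main(String[] args) {
        FakeListVector vector = new FakeListVector(2);
        CachingWrapper cached = new CachingWrapper(vector);
        LiveWrapper live = new LiveWrapper(vector);

        // A later batch needs more rows, so the vector reallocates its arrays.
        vector.ensureSize(4);
        vector.offsets[3] = 42L;

        System.out.println("live read: " + live.offsetAt(3)); // prints "live read: 42"
        try {
            cached.offsetAt(3);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("cached read failed: " + e.getMessage());
        }
    }
}
```

In this sketch the cached wrapper fails only because row 3 lies beyond the old array's length; for a row index that still fits, it would silently return outdated values instead of throwing.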
### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.