Skip to content

[SPARK-37728][SQL] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException #35002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

yimin-yang
Copy link
Contributor

@yimin-yang yimin-yang commented Dec 23, 2021

What changes were proposed in this pull request?

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

Why are the changes needed?

Bugfix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the CIs with the newly added test case.

@github-actions github-actions bot added the SQL label Dec 23, 2021
@yimin-yang
Copy link
Contributor Author

@c21 Could you take a look when you are free? Thanks! Looking forward to your feedback.

@c21
Copy link
Contributor

c21 commented Dec 23, 2021

@yym1995 thank you submitting a fix! Could you help add a unit test case as well?

@HyukjinKwon HyukjinKwon changed the title [SPARK-37728][SQL] reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException [SPARK-37728][SQL] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException Dec 24, 2021
@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

cc @dongjoon-hyun FYI

@SparkQA
Copy link

SparkQA commented Dec 24, 2021

@SparkQA
Copy link

SparkQA commented Dec 24, 2021

@yimin-yang
Copy link
Contributor Author

yimin-yang commented Dec 24, 2021

@yym1995 thank you submitting a fix! Could you help add a unit test case as well?

@c21 I just added a unit test case. Please take a look, thanks!

@SparkQA
Copy link

SparkQA commented Dec 24, 2021

@SparkQA
Copy link

SparkQA commented Dec 24, 2021

@SparkQA
Copy link

SparkQA commented Dec 24, 2021

Test build #146545 has finished for PR 35002 at commit f975af9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 24, 2021

Test build #146550 has finished for PR 35002 at commit ec2e0db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yym1995 for the fix! Having some comments for code structure, but the fix is valid.

@@ -31,26 +32,27 @@
*/
public class OrcArrayColumnVector extends OrcColumnVector {
private final OrcColumnVector data;
private final long[] offsets;
private final long[] lengths;
private ListColumnVector listData;

OrcArrayColumnVector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to store another copy of vector here. We can change OrcColumnVector.baseData from private to protected, and just use baseData to get offset and length. So how about defining the constructor like below:

OrcArrayColumnVector(
    DataType type,
    ListColumnVector vector,
    OrcColumnVector data) {

  super(type, vector);

  this.data = data;
}

@@ -32,28 +33,30 @@
public class OrcMapColumnVector extends OrcColumnVector {
private final OrcColumnVector keys;
private final OrcColumnVector values;
private final long[] offsets;
private final long[] lengths;
private MapColumnVector mapData;

OrcMapColumnVector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly we can define OrcMapColumnVector constructor as below:

OrcMapColumnVector(
    DataType type,
    MapColumnVector vector,
    OrcColumnVector keys,
    OrcColumnVector values) {

  super(type, vector);

  this.keys = keys;
  this.values = values;
}

}

@Override
public ColumnarMap getMap(int ordinal) {
return new ColumnarMap(keys, values, (int) offsets[ordinal], (int) lengths[ordinal]);
return new ColumnarMap(keys, values, (int) mapData.offsets[ordinal],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public ColumnarMap getMap(int ordinal) {
  int offset = (int) ((MapColumnVector) baseData).offsets[ordinal];
  int length = (int) ((MapColumnVector) baseData).lengths[ordinal];
  return new ColumnarMap(keys, values, offset, length);
}

}

@Override
public ColumnarArray getArray(int rowId) {
return new ColumnarArray(data, (int) offsets[rowId], (int) lengths[rowId]);
return new ColumnarArray(data, (int) listData.offsets[rowId], (int) listData.lengths[rowId]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just use baseData here:

public ColumnarArray getArray(int rowId) {
  int offset = (int) ((ListColumnVector) baseData).offsets[rowId];
  int length = (int) ((ListColumnVector) baseData).lengths[rowId];
  return new ColumnarArray(data, offset, length);
}

@yimin-yang
Copy link
Contributor Author

@c21 Thank you for the feedback! I have already changed the code structure.

@yimin-yang
Copy link
Contributor Author

@dongjoon-hyun This PR fixed a bug in ORC vectorized reader. @c21 has reviewed this PR, and I have improved the code according to the feedback. I was wondering if you could merge this PR, thanks!

@c21
Copy link
Contributor

c21 commented Dec 27, 2021

LGTM with pending CI tests. cc @viirya and @cloud-fan as well.

@c21
Copy link
Contributor

c21 commented Dec 27, 2021

retest this please

@cloud-fan
Copy link
Contributor

The fix LGTM. Please click the failed Github Action, follow the instructions to fix the issues.

@yimin-yang yimin-yang force-pushed the fix-nested branch 2 times, most recently from 5cd704b to ee293e3 Compare December 27, 2021 12:21
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good. Pending CI.

@viirya
Copy link
Member

viirya commented Dec 27, 2021

GA seems unstable now. You can submit an empty commit to re-trigger it.

@viirya
Copy link
Member

viirya commented Dec 27, 2021

Tested manually. Just follow the bug reproduction steps in https://issues.apache.org/jira/browse/SPARK-37728 .

As you added an unit test, you can modify the PR description too.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CI)

@yimin-yang
Copy link
Contributor Author

Now all checks have passed. cc @cloud-fan @dongjoon-hyun @HyukjinKwon @viirya

@dongjoon-hyun
Copy link
Member

I revised the PR description. Merged to master.
Thank you, @yym1995 and all!

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 28, 2021

Welcome to the Apache Spark community, @yym1995 .
I added you to the Apache Spark contributor group and assigned SPARK-37728 to you.
Could you make three backporting PRs to branch-3.2 and branch-3.1 and branch-3.0? We need to pass the UTs on those branches.

@yimin-yang
Copy link
Contributor Author

Welcome to the Apache Spark community, @yym1995 . I added you to the Apache Spark contributor group and assigned SPARK-37728 to you. Could you make three backporting PRs to branch-3.2 and branch-3.1 and branch-3.0? We need to pass the UTs on those branches.

OK, will do.

@c21
Copy link
Contributor

c21 commented Dec 28, 2021

@dongjoon-hyun - thanks for merging. We only need backport to branch-3.2, right? The related code was only introduced in 3.2 branch (in PR for ORC vectorized reader of nested column)

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 28, 2021

Yes, we only need to backport to the branches where the test case fails.

BTW, @yym1995 , if branch-3.2 is the only branch, you should remove Affected Version of SPARK-37728. Currently, it has 3.0.2 and 3.1.2.

yimin-yang added a commit to yimin-yang/spark that referenced this pull request Dec 28, 2021
…can cause ArrayIndexOutOfBoundsException

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

Bugfix

No

Pass the CIs with the newly added test case.

Closes apache#35002 from yym1995/fix-nested.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@yimin-yang yimin-yang deleted the fix-nested branch December 29, 2021 06:56
dongjoon-hyun pushed a commit that referenced this pull request Jan 4, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of #35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin Yang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin Yang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin Yang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin Yang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <[email protected]>
Co-authored-by: Yimin Yang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 5f9b92c)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants