-
Notifications
You must be signed in to change notification settings - Fork 343
feat(scanner): Merge duplicate scan results that share a provenance #10502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
038c082
to
40190e3
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #10502 +/- ##
=========================================
Coverage 56.86% 56.86%
Complexity 1606 1606
=========================================
Files 337 337
Lines 12512 12512
Branches 1179 1179
=========================================
Hits 7115 7115
Misses 4945 4945
Partials 452 452
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
40190e3
to
ed1df5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @maennchen,
thanks for your contribution!
Do you mind adding a test that demonstrates the issue with the duplicated scan results and that your changes fix these issues?
Edit: As there is currently no existing test for the scan function that use an ORT result, I won't insist on this change.
Edit 2: I Found these existing tests which already test the scan()
function:
ort/scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Lines 63 to 102 in c086b3b
"Scanning all packages corresponding to a single VCS" should { | |
val analyzerResult = createAnalyzerResult(pkg0, pkg1, pkg2, pkg3, pkg4) | |
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap()) | |
"return the expected ORT result" { | |
val expectedResult = readResource("/scanner-integration-all-pkgs-expected-ort-result.yml") | |
patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe | |
patchExpectedResult(expectedResult) | |
} | |
"return the expected (merged) scan results" { | |
val expectedResult = readResource("/scanner-integration-expected-scan-results.yml") | |
val scanResults = ortResult.getScanResults().toSortedMap() | |
patchActualResult(scanResults.toYaml(), patchStartAndEndTime = true) shouldBe | |
patchExpectedResult(expectedResult) | |
} | |
"return the expected (merged) file lists" { | |
val expectedResult = readResource("/scanner-integration-expected-file-lists.yml") | |
val fileLists = ortResult.getFileLists().toSortedMap() | |
fileLists.toYaml() shouldBe patchExpectedResult(expectedResult) | |
} | |
} | |
"Scanning a subset of the packages corresponding to a single VCS" should { | |
"return the expected ORT result" { | |
val analyzerResult = createAnalyzerResult(pkg1, pkg3) | |
val expectedResult = readResource("/scanner-integration-subset-pkgs-expected-ort-result.yml") | |
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap()) | |
patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe | |
patchExpectedResult(expectedResult) | |
} | |
} |
Can you have a look at them and extend them to test your use-case?
@MarcelBochtler I added a test to test the deduplication. Running the same test on
$ ./gradlew scanner:funTest --tests "org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest"
|
4432de1
to
263be95
Compare
263be95
to
276bc2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test. This really helps to understand the necessity of this change.
The change looks good to me. Any other opinions @oss-review-toolkit/kotlin-devs ?
scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Outdated
Show resolved
Hide resolved
message: "IOException: Could not resolve provenance for package 'Dummy::project:1.0.0'\ | ||
\ for source code origins [VCS, ARTIFACT]." | ||
severity: "ERROR" | ||
package_provenance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious to me at a quick glance that this is a good change. Isn't it intended for this dummy project to have a resolution issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on this? I have moved the test into a separate test in the meantime. Does your comment still apply?
scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Outdated
Show resolved
Hide resolved
When the SpdxDocumentFile package manager is used, the *project* and all contained *packages* often resolve to the **same VCS provenance** (e.g. the root of the Git repository). Before this change ORT stored two separate `ScanResult`s for such a provenance – one keyed to the project, one keyed to the package. That caused two follow-on problems: * Both results appeared in the `OrtResult`, so evaluators saw **duplicate findings** for the *same* source tree. * Because projects and packages are handled by different rules the package result was additionally **padded with a `SpdxConstants.NONE` finding** whenever `includeFilesWithoutFindings` was enabled. The evaluator therefore compared *real* license findings from the project result with `NONE` from the package result and failed with a violation. This patch * groups scan results by the pair `(provenance, scanner)` and folds them into a single `ScanResult`, * unions the inner finding sets to avoid duplicates, and * performs the "pad with NONE" step only **after** deduplication, so every path is represented exactly once. As a consequence the evaluator now receives one consistent set of license findings per provenance / scanner, eliminating the false mismatch. Signed-off-by: Jonatan Männchen <[email protected]>
276bc2f
to
dfb5f87
Compare
The code can be seen in action here: https://github.com/elixir-lang/elixir/actions/runs/16086802339/job/45398947306 |
@@ -34,6 +34,7 @@ import org.ossreviewtoolkit.model.Package | |||
import org.ossreviewtoolkit.model.PackageReference | |||
import org.ossreviewtoolkit.model.PackageType | |||
import org.ossreviewtoolkit.model.Project | |||
import org.ossreviewtoolkit.model.Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit-message:
I believe, from the commit message it should be understandable what exact issues happened before this
change, which this change is supposed to fix. I still need help understanding these, and I believe commit
message should be adjusted so that it becomes clearer. Unclear to me:
- "so evaluators saw duplicate findings for the same source tree."
- Rules operate on a per package / project basis. If multiple originate from same source tree,
it is expected that same source tree is processed mutliple times. Would it maybe help to
explain this with an example?
- Rules operate on a per package / project basis. If multiple originate from same source tree,
- "Because projects and packages are handled by different rules the package
result was additionally padded with aSpdxConstants.NONE
finding
wheneverincludeFilesWithoutFindings
was enabled."- I don't why this padding happened, and why it is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of a result before this change. As you can see in the file, there's multiple projects in the lib
directory and the project contains everything. The project also contains files that belong to no package (for example the whole bin
directory or the .formatter.exs
file).
Taking the .formatter.exs
file as an example, it does not belong to any packages but it belongs to the project. Before this change, it would be correctly listed in scan #1 as Apache-2.0
and incorrectly listed as NONE
in scan result #2.
When the evaluator or reporter runs on this result, they will use both scan results since they share the provenance and since NONE
is not allowed in our rules, they will report an error.
} | ||
} else { | ||
deduplicatedScanResults | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a commit prior to this commit which factors out line 150-152 and lines 155 - 180 into a function which is responsible for adding these LicenseFinding(SpdxConstants.NONE, TextLocation(it, TextLocation.UNKNOWN_LINE))
, so that the diff of this commit becomes simpler?
e.g. idea is to separate the de-duplication code from the adding of these NONE findings.
|
||
"Scanning a project with the same provenance as packages" should { | ||
val analyzerResult = createAnalyzerResultWithProject(project0, pkg0) | ||
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move lines 106+107 below line 109.
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap()) | ||
|
||
"not have duplicated scan results" { | ||
val expectedResult = readResource("/scanner-integration-shared-project-package-provenance.yml") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be sufficient (instead of asserting the expected result) to assert
OrtResult.scanner.scanResults
has only one entry per(provenance, scannerDetails)
.OrtResult.scanner.files
has only one entry perprovenance
.
I believe this what this commit ensures. Looking at these above proposed assertions, we could instead
add the as invariant checks into the constructor of ScannerRun
.
(Another consideration could be to make this de-duplication unit test-able, so that the test does not need to do scanning at all, @oss-review-toolkit/core-devs what do you think?)
When the SpdxDocumentFile package manager is used, the project and all contained packages often resolve to the same VCS provenance (e.g. the root of the Git repository).
Before this change ORT stored two separate
ScanResult
s for such a provenance – one keyed to the project, one keyed to the package.That caused two follow-on problems:
OrtResult
, so evaluators saw duplicate findings for the same source tree.SpdxConstants.NONE
finding wheneverincludeFilesWithoutFindings
was enabled. The evaluator therefore compared real license findings from the project result withNONE
from the package result and failed with a violation.This patch
(provenance, scanner)
and folds them into a singleScanResult
,As a consequence the evaluator now receives one consistent set of license findings per provenance / scanner, eliminating the false mismatch.
This is the first time for me writing Kotlin. Sorry if the code is not up to the usual standards.