feat(scanner): Merge duplicate scan results that share a provenance #10502

Open
wants to merge 1 commit into
base: main
from

Conversation

maennchen

When the SpdxDocumentFile package manager is used, the project and all contained packages often resolve to the same VCS provenance (e.g. the root of the Git repository).
Before this change ORT stored two separate ScanResults for such a provenance – one keyed to the project, one keyed to the package.

That caused two follow-on problems:

  • Both results appeared in the OrtResult, so evaluators saw duplicate findings for the same source tree.
  • Because projects and packages are handled by different rules, the package result was additionally padded with a SpdxConstants.NONE finding whenever includeFilesWithoutFindings was enabled. The evaluator therefore compared real license findings from the project result with NONE from the package result and failed with a violation.

This patch

  • groups scan results by the pair (provenance, scanner) and folds them into a single ScanResult,
  • unions the inner finding sets to avoid duplicates, and
  • performs the "pad with NONE" step only after deduplication, so every path is represented exactly once.

As a consequence the evaluator now receives one consistent set of license findings per provenance / scanner, eliminating the false mismatch.
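
The grouping and union steps described above can be sketched roughly as follows. This is a minimal illustration, not the code of this PR: the data classes are simplified stand-ins for ORT's model classes, and `mergeByProvenanceAndScanner` is a hypothetical name chosen for this example.

```kotlin
// Simplified stand-ins for ORT's model classes; names and fields are assumptions for illustration.
data class Provenance(val url: String, val revision: String)
data class Scanner(val name: String, val version: String)
data class LicenseFinding(val license: String, val path: String)
data class ScanResult(val provenance: Provenance, val scanner: Scanner, val findings: Set<LicenseFinding>)

// Fold all results that share the pair (provenance, scanner) into a single result
// whose findings are the union of the individual finding sets.
fun mergeByProvenanceAndScanner(results: List<ScanResult>): List<ScanResult> =
    results
        .groupBy { it.provenance to it.scanner }
        .map { (key, group) ->
            val (provenance, scanner) = key
            ScanResult(provenance, scanner, group.flatMapTo(mutableSetOf<LicenseFinding>()) { it.findings })
        }
```

Because the findings are collected into a set, a finding reported by both the project-keyed and the package-keyed result appears only once in the merged result.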

This is the first time for me writing Kotlin. Sorry if the code is not up to the usual standards.

@maennchen maennchen requested a review from a team as a code owner June 19, 2025 18:38
@maennchen maennchen force-pushed the jm/scan_deduplication branch from 038c082 to 40190e3 Compare June 19, 2025 18:42
@maennchen maennchen changed the title scanner: Merge duplicate scan results that share a provenance feat(scanner): Merge duplicate scan results that share a provenance Jun 19, 2025

codecov bot commented Jun 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.86%. Comparing base (7e8d90c) to head (dfb5f87).

Additional details and impacted files
@@            Coverage Diff            @@
##               main   #10502   +/-   ##
=========================================
  Coverage     56.86%   56.86%           
  Complexity     1606     1606           
=========================================
  Files           337      337           
  Lines         12512    12512           
  Branches       1179     1179           
=========================================
  Hits           7115     7115           
  Misses         4945     4945           
  Partials        452      452           
Flag Coverage Δ
funTest-docker 70.95% <ø> (ø)
funTest-non-docker 33.00% <ø> (ø)
test-ubuntu-24.04 41.22% <ø> (ø)
test-windows-2022 41.20% <ø> (ø)

@maennchen maennchen force-pushed the jm/scan_deduplication branch from 40190e3 to ed1df5f Compare June 19, 2025 19:37
Member

@MarcelBochtler MarcelBochtler left a comment

Hi @maennchen,
thanks for your contribution!
Do you mind adding a test that demonstrates the issue with the duplicated scan results and that your changes fix these issues?

Edit: As there is currently no existing test for the scan function that uses an ORT result, I won't insist on this change.

Edit 2: I found these existing tests, which already test the scan() function:

"Scanning all packages corresponding to a single VCS" should {
    val analyzerResult = createAnalyzerResult(pkg0, pkg1, pkg2, pkg3, pkg4)
    val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap())

    "return the expected ORT result" {
        val expectedResult = readResource("/scanner-integration-all-pkgs-expected-ort-result.yml")

        patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe
            patchExpectedResult(expectedResult)
    }

    "return the expected (merged) scan results" {
        val expectedResult = readResource("/scanner-integration-expected-scan-results.yml")
        val scanResults = ortResult.getScanResults().toSortedMap()

        patchActualResult(scanResults.toYaml(), patchStartAndEndTime = true) shouldBe
            patchExpectedResult(expectedResult)
    }

    "return the expected (merged) file lists" {
        val expectedResult = readResource("/scanner-integration-expected-file-lists.yml")
        val fileLists = ortResult.getFileLists().toSortedMap()

        fileLists.toYaml() shouldBe patchExpectedResult(expectedResult)
    }
}

"Scanning a subset of the packages corresponding to a single VCS" should {
    "return the expected ORT result" {
        val analyzerResult = createAnalyzerResult(pkg1, pkg3)
        val expectedResult = readResource("/scanner-integration-subset-pkgs-expected-ort-result.yml")
        val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap())

        patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe
            patchExpectedResult(expectedResult)
    }
}

Can you have a look at them and extend them to test your use-case?

@maennchen
Author

maennchen commented Jun 25, 2025

@MarcelBochtler I added a test for the deduplication. Running the same test on main produces duplicated information and fails:

$ ./gradlew scanner:funTest --tests "org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest"

Parallel Configuration Cache is an incubating feature.
Calculating task graph as configuration cache cannot be reused because a build logic input of type 'SemInfoVersionValueSource' has changed.
Type-safe project accessors is an incubating feature.

> Configure project :
Building ORT version 61.1.0.

> Task :scanner:funTest

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected ORT result STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected ORT result PASSED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) scan results STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) scan results PASSED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) file lists STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) file lists PASSED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning a subset of the packages corresponding to a single VCS should > return the expected ORT result STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning a subset of the packages corresponding to a single VCS should > return the expected ORT result FAILED
    io.kotest.assertions.AssertionFailedError: expected:<[Deletion at line 298]           end_line: -1
      scanners:
        Dummy::pkg1:1.0.0:
        - "Dummy"
        Dummy::pkg3:1.0.0:
        - "Dummy"
        Dummy::project:1.0.0:
        - "Dummy"
      files:
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo.git"
            revision: "a732695e03efcbd74539208af98c297ee86e49d5"
            path: ""
          resolved_revision: "a732695e03efcbd74539208af98c297ee86e49d5"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s1/pkg-s1.txt"
          sha1: "e5fb17f8f4f4ef0748bb5ba137fd0e091dd5a1f6"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo2.git"
            revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
            path: ""
          resolved_revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s2/pkg-s2.txt"
          sha1: "37996d13eceb6b29db43a381ce8df375b5eee8e9"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: ".gitmodules"
          sha1: "d7f070ddbe0b6dd8a173714d565a1240dd96eacd"
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "82cfc115138054ce5b5e6839f38687c9d7186710"
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg2/pkg2.txt"
          sha1: "cc8f97cebe1dc0ed889a31f504bcf491d5241aaa"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
        - path: "pkg4/pkg4.txt"
          sha1: "3cba29011be2b9d59f6204d6fa0a386b1b2dbd90"
    advisor: null
    evaluator: null
    resolved_configuration: {}


    [Deletion at line 386] > but was:<[Deletion at line 298]           end_line: -1
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        scanner:
          name: "Dummy"
          version: "1.0.0"
          configuration: ""
        summary:
          start_time: "1970-01-01T00:00:00Z"
          end_time: "1970-01-01T00:00:00Z"
          licenses:
          - license: "NOASSERTION"
            location:
              path: "LICENSE"
              start_line: -1
              end_line: -1
          - license: "NOASSERTION"
            location:
              path: "pkg1/pkg1.txt"
              start_line: -1
              end_line: -1
          - license: "NOASSERTION"
            location:
              path: "pkg3/pkg3.txt"
              start_line: -1
              end_line: -1
      scanners:
        Dummy::pkg1:1.0.0:
        - "Dummy"
        Dummy::pkg3:1.0.0:
        - "Dummy"
        Dummy::project:1.0.0:
        - "Dummy"
      files:
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo.git"
            revision: "a732695e03efcbd74539208af98c297ee86e49d5"
            path: ""
          resolved_revision: "a732695e03efcbd74539208af98c297ee86e49d5"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s1/pkg-s1.txt"
          sha1: "e5fb17f8f4f4ef0748bb5ba137fd0e091dd5a1f6"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo2.git"
            revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
            path: ""
          resolved_revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s2/pkg-s2.txt"
          sha1: "37996d13eceb6b29db43a381ce8df375b5eee8e9"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: ".gitmodules"
          sha1: "d7f070ddbe0b6dd8a173714d565a1240dd96eacd"
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "82cfc115138054ce5b5e6839f38687c9d7186710"
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg2/pkg2.txt"
          sha1: "cc8f97cebe1dc0ed889a31f504bcf491d5241aaa"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
        - path: "pkg4/pkg4.txt"
          sha1: "3cba29011be2b9d59f6204d6fa0a386b1b2dbd90"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
    advisor: null
    evaluator: null
    resolved_configuration: {}


    [Deletion at line 356]         path: ""
          resolved_revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s2/pkg-s2.txt"
          sha1: "37996d13eceb6b29db43a381ce8df375b5eee8e9"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: ".gitmodules"
          sha1: "d7f070ddbe0b6dd8a173714d565a1240dd96eacd"
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "82cfc115138054ce5b5e6839f38687c9d7186710"
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg2/pkg2.txt"
          sha1: "cc8f97cebe1dc0ed889a31f504bcf491d5241aaa"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
        - path: "pkg4/pkg4.txt"
          sha1: "3cba29011be2b9d59f6204d6fa0a386b1b2dbd90"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
    advisor: null
    evaluator: null
    resolved_configuration: {}
    >

16:50:31.855 [ForkJoinPool-1-worker-1] DEBUG org.eclipse.jgit.internal.util.ShutdownHook - Cleanup org.eclipse.jgit.util.FS$FileStoreAttributes$$Lambda/0x00007eabf0386c10@e239dec during JVM shutdown

> Task :scanner:funTest FAILED

4 tests completed, 1 failed

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':scanner:funTest'.
> There were failing tests. See the report at: file:///workspaces/ort/scanner/build/reports/tests/funTest/index.html

* Try:
> Run with --scan to get full insights.

BUILD FAILED in 53s
81 actionable tasks: 52 executed, 29 up-to-date
Configuration cache entry stored.

@maennchen maennchen force-pushed the jm/scan_deduplication branch from 4432de1 to 263be95 Compare June 25, 2025 17:06
@maennchen maennchen force-pushed the jm/scan_deduplication branch from 263be95 to 276bc2f Compare June 25, 2025 18:59
MarcelBochtler
MarcelBochtler previously approved these changes Jun 25, 2025
Member

@MarcelBochtler MarcelBochtler left a comment

Thanks for adding the test. This really helps to understand the necessity of this change.

The change looks good to me. Any other opinions @oss-review-toolkit/kotlin-devs ?

message: "IOException: Could not resolve provenance for package 'Dummy::project:1.0.0'\
\ for source code origins [VCS, ARTIFACT]."
severity: "ERROR"
package_provenance:
Member

It's not obvious to me at a quick glance that this is a good change. Isn't it intended for this dummy project to have a resolution issue?

Author

@maennchen maennchen Jul 5, 2025

Can you elaborate on this? I have moved the test into a separate test in the meantime. Does your comment still apply?

When the SpdxDocumentFile package manager is used, the *project* and all
contained *packages* often resolve to the **same VCS provenance** (e.g. the
root of the Git repository).
Before this change ORT stored two separate `ScanResult`s for such a
provenance – one keyed to the project, one keyed to the package.

That caused two follow-on problems:

* Both results appeared in the `OrtResult`, so evaluators saw **duplicate
  findings** for the *same* source tree.
* Because projects and packages are handled by different rules the package
  result was additionally **padded with a `SpdxConstants.NONE` finding**
  whenever `includeFilesWithoutFindings` was enabled.
  The evaluator therefore compared *real* license findings from the project
  result with `NONE` from the package result and failed with a violation.

This patch

* groups scan results by the pair `(provenance, scanner)` and folds them
  into a single `ScanResult`,
* unions the inner finding sets to avoid duplicates, and
* performs the "pad with NONE" step only **after** deduplication, so every
  path is represented exactly once.

As a consequence the evaluator now receives one consistent set of license
findings per provenance / scanner, eliminating the false mismatch.

Signed-off-by: Jonatan Männchen <[email protected]>
@maennchen
Author

The code can be seen in action here: https://github.com/elixir-lang/elixir/actions/runs/16086802339/job/45398947306
(See the action artifacts for details.)

@@ -34,6 +34,7 @@ import org.ossreviewtoolkit.model.Package
import org.ossreviewtoolkit.model.PackageReference
import org.ossreviewtoolkit.model.PackageType
import org.ossreviewtoolkit.model.Project
import org.ossreviewtoolkit.model.Repository
Member

commit-message:

I believe the commit message should make clear which exact issues existed before this change
and are fixed by it. I still need help understanding these, and I believe the commit
message should be adjusted to make them clearer. Unclear to me:

  • "so evaluators saw duplicate findings for the same source tree."
    • Rules operate on a per-package / per-project basis. If multiple packages or projects originate
      from the same source tree, it is expected that the same source tree is processed multiple times.
      Would it maybe help to explain this with an example?
  • "Because projects and packages are handled by different rules the package
    result was additionally padded with a SpdxConstants.NONE finding
    whenever includeFilesWithoutFindings was enabled."
    • I don't understand why this padding happened, and why it is wrong.

Author

Here's an example of a result before this change. As you can see in the file, there are multiple projects in the lib directory, and the project contains everything. The project also contains files that belong to no package (for example the whole bin directory or the .formatter.exs file).

Taking the .formatter.exs file as an example: it does not belong to any package, but it belongs to the project. Before this change, it was correctly listed as Apache-2.0 in scan result #1 and incorrectly listed as NONE in scan result #2.

When the evaluator or reporter runs on this result, it uses both scan results since they share the provenance. Since NONE is not allowed in our rules, an error is reported.
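
To make the failure mode concrete, here is a small sketch under simplifying assumptions: `Finding` and `violations` are hypothetical names, and the "rule" is reduced to "a NONE finding is a violation", which is only a stand-in for the actual evaluator rules.

```kotlin
// Hypothetical, reduced model: a finding is just a license for a path.
data class Finding(val license: String, val path: String)

// Stand-in for a rule that does not allow NONE: it inspects all scan results
// that share the provenance and reports every NONE finding.
fun violations(scanResultsForProvenance: List<Set<Finding>>): List<Finding> =
    scanResultsForProvenance.flatten().filter { it.license == "NONE" }

// Before the change: two results for the same provenance disagree about the same file.
val projectResult = setOf(Finding("Apache-2.0", ".formatter.exs"))
val packageResult = setOf(Finding("NONE", ".formatter.exs"))
```

With both results present, `violations(listOf(projectResult, packageResult))` reports `.formatter.exs`, while `violations(listOf(projectResult))` reports nothing, even though the file carries a real license finding in both cases.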

scan-result.json

}
} else {
deduplicatedScanResults
}
Member

Could we add a commit prior to this one that factors out lines 150-152 and lines 155-180 into a function responsible for adding these LicenseFinding(SpdxConstants.NONE, TextLocation(it, TextLocation.UNKNOWN_LINE)) entries, so that the diff of this commit becomes simpler?

I.e., the idea is to separate the de-duplication code from the adding of these NONE findings.
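
A factored-out padding function could look roughly like the sketch below. The name `padWithNone`, the simplified `TextLocation` and `LicenseFinding` classes, and the two constants are assumptions made for this illustration. They only stand in for ORT's real types and for `SpdxConstants.NONE` and `TextLocation.UNKNOWN_LINE`.

```kotlin
// Simplified stand-ins; ORT's real classes carry more fields.
data class TextLocation(val path: String, val line: Int)
data class LicenseFinding(val license: String, val location: TextLocation)

const val NONE = "NONE"      // stands in for SpdxConstants.NONE
const val UNKNOWN_LINE = -1  // stands in for TextLocation.UNKNOWN_LINE

// Add a NONE finding for every scanned path that has no license finding yet.
// Paths that already have a finding are left untouched.
fun padWithNone(findings: Set<LicenseFinding>, scannedPaths: Set<String>): Set<LicenseFinding> {
    val pathsWithFindings = findings.mapTo(mutableSetOf<String>()) { it.location.path }
    val padding = (scannedPaths - pathsWithFindings).map { LicenseFinding(NONE, TextLocation(it, UNKNOWN_LINE)) }
    return findings + padding
}
```

Keeping this step in its own function lets the de-duplication run first and the padding run once on the merged findings, which is the ordering the PR description asks for.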


"Scanning a project with the same provenance as packages" should {
val analyzerResult = createAnalyzerResultWithProject(project0, pkg0)
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap())
Member

Please move lines 106+107 below line 109.

val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap())

"not have duplicated scan results" {
val expectedResult = readResource("/scanner-integration-shared-project-package-provenance.yml")
Member

@fviernau fviernau Jul 7, 2025

Would it be sufficient (instead of asserting the expected result) to assert that

  1. OrtResult.scanner.scanResults has only one entry per (provenance, scannerDetails).
  2. OrtResult.scanner.files has only one entry per provenance.

I believe this is what this commit ensures. Given these proposed assertions, we could instead
add them as invariant checks in the constructor of ScannerRun.

(Another consideration could be to make this de-duplication unit-testable, so that the test does not need to do any scanning at all. @oss-review-toolkit/core-devs, what do you think?)
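
As a rough sketch of what such invariant checks might look like: `ScannerRunSketch` and the reduced data classes below are hypothetical and do not mirror the real `ScannerRun` class. They only show the two uniqueness checks expressed with `require` in an `init` block.

```kotlin
// Reduced stand-ins for the model classes; fields are assumptions for illustration.
data class Provenance(val url: String, val revision: String)
data class ScannerDetails(val name: String, val version: String)
data class ScanResult(val provenance: Provenance, val scanner: ScannerDetails)
data class FileList(val provenance: Provenance, val paths: Set<String>)

// Hypothetical scanner run that rejects duplicate entries at construction time.
class ScannerRunSketch(val scanResults: List<ScanResult>, val files: List<FileList>) {
    init {
        // Invariant 1: at most one scan result per (provenance, scanner details).
        val duplicateResults = scanResults
            .groupingBy { it.provenance to it.scanner }
            .eachCount()
            .filterValues { it > 1 }
        require(duplicateResults.isEmpty()) { "Duplicate scan results for: ${duplicateResults.keys}" }

        // Invariant 2: at most one file list per provenance.
        val duplicateFiles = files.groupingBy { it.provenance }.eachCount().filterValues { it > 1 }
        require(duplicateFiles.isEmpty()) { "Duplicate file lists for: ${duplicateFiles.keys}" }
    }
}
```

With checks like these, constructing a run that contains two results for the same provenance and scanner throws an `IllegalArgumentException`, so a test would not need to compare against a full expected YAML file.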

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants