harbor-framework/harbor — security scan

Repository: harbor-framework/harbor (2.5k★, Apache-2.0), a framework for running agent evaluations across benchmarks. It is a monorepo: harbor’s own core (src/harbor/, packages/) plus 84 vendored benchmark adapters under adapters/*, each a packaged upstream benchmark (swe-bench, gaia2, mmau, swesmith, …) with its own uv.lock and Dockerfile environment.

Commit scanned: 387625f07b4b (HEAD of main at scan time)

Scan date: 2026-06-15

Disclosure status: Public courtesy issue filed. It asks a single structural question about the sandbox-isolation threat model for vendored environments rather than enumerating 570 findings. Harbor’s own core is well-built; the entire critical surface lives in the vendored benchmark adapters.

Summary

| Severity | Count |
| --- | --- |
| Critical | 10 |
| High | 333 |
| Medium | 227 |
| Low | 0 |
| Info | 0 (filtered) |

570 findings, and the number means almost the opposite of what it looks like. The ownership split is the entire story: 464 of 570 findings (81%) are in vendored adapters/* (84 packaged upstream benchmarks), and all 10 criticals sit in those vendored adapter environments (adapters/*/uv.lock dependency pins, plus the cooperbench Dockerfile rows in the table below). Harbor’s own code (src/harbor/ + packages/, 34 findings) has zero criticals and, on inspection, does the dangerous things correctly. The headline “10 critical / 333 high” is a property of the benchmark environments harbor bundles, not of harbor.

Harbor’s own core: built right

Before the vendored pile, the part that matters most — harbor’s own code — is clean on every pattern that fired:

So harbor-core earns a clean bill on the exploitability-shaped patterns. That reframes the other 536 findings.

The vendored surface: where the criticals actually are

All 10 criticals live under adapters/*, in the vendored upstream benchmarks: pinned dependencies in adapters/*/uv.lock, plus the two cooperbench Dockerfile findings in the last row:

| CVE | Package | Adapter(s) | Class |
| --- | --- | --- | --- |
| CVE-2026-7304 | sglang | swesmith | Unauthenticated RCE via --enable-custom-logit-processor |
| CVE-2026-3059 | sglang | swesmith | Multimodal-generation vuln |
| CVE-2026-3060 | sglang | swesmith | Encoder parallel-disaggregation vuln |
| CVE-2026-42208 | litellm | ml_dev_bench, mlgym-bench, mmau | Proxy SQL data access (proxy-surface; see reachability note) |
| CVE-2025-14009 | nltk | dacode, kramabench | Zip Slip → code execution |
| (none) | Dockerfile | cooperbench ×2 | Multiple ENTRYPOINT instructions |

Plus the high/medium tail: 86 “image runs as root” Dockerfile findings spread across ~40 adapter environments, 35 apt-get installs missing --no-install-recommends, 20 :latest-tag pins, a urllib3/aiohttp/requests/starlette web tail, 14 pickle.load sites (all in vendored adapter code), and 10 tarfile-extractall hits (the vendored call sites; the core ones were already guarded with filter="data").
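To make the container-hygiene rule names concrete, here is a minimal sketch of the kind of check each one performs. It is not the scanner's implementation: `lint_dockerfile` is a hypothetical helper, the warning strings are made up for illustration, and the heuristics ignore multi-stage builds and other cases that a real Dockerfile parser handles.

```python
import re


def lint_dockerfile(text: str) -> list[str]:
    """Return hygiene warnings for Dockerfile text (rough heuristics, not a parser)."""
    # Fold backslash-continued lines so each instruction is one logical line.
    folded = re.sub(r"\\[ \t]*\n", " ", text)
    lines = [ln.strip() for ln in folded.splitlines()]
    instructions = [ln for ln in lines if ln and not ln.startswith("#")]

    warnings = []

    # No USER instruction, or a final USER of root/0, means the image runs as root.
    users = [ln.split()[1] for ln in instructions if ln.upper().startswith("USER ")]
    if not users or users[-1].split(":")[0] in ("root", "0"):
        warnings.append("image runs as root")

    # Docker honours only the last ENTRYPOINT, so more than one is suspicious.
    if sum(ln.upper().startswith("ENTRYPOINT") for ln in instructions) > 1:
        warnings.append("multiple ENTRYPOINT instructions")

    for ln in instructions:
        words = ln.split()
        if words[0].upper() == "FROM":
            # Skip flags such as --platform to find the image reference.
            image = next((w for w in words[1:] if not w.startswith("--")), "")
            if image.endswith(":latest"):
                warnings.append(f"{image} pins the :latest tag")
        # Match "apt-get install" and "apt-get -y install", but not "apt-get update".
        if (re.search(r"apt-get\s+(?:-\S+\s+)*install\b", ln)
                and "--no-install-recommends" not in ln):
            warnings.append("apt-get install without --no-install-recommends")
    return warnings
```

Running this over a Dockerfile with no USER line, a `FROM python:latest` base, a bare `apt-get install`, and two ENTRYPOINT lines returns one warning per rule; a Dockerfile that pins a tag, drops to a non-root user, and passes --no-install-recommends returns an empty list. None of these checks says anything about exploitability, which is why they sit in the tail rather than in the table above.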

The structural reality: harbor vendors 84 third-party benchmark environments, and inherits each one’s dependency drift and container-hygiene debt. These environments run inside harbor’s eval sandboxes — which, for an agent-evaluation harness, is a trust boundary that runs arbitrary agent code by design. So the applicability of “an unauthenticated SGLang RCE in swesmith’s env” depends entirely on harbor’s isolation model: if the sandbox is the security boundary and a benchmark environment is assumed potentially-hostile, these are contained; if anything trusts a benchmark env’s integrity, they’re a real escalation surface. That’s the one question worth asking the maintainers — and it’s what the courtesy issue leads with, rather than enumerating 570 findings (the dstack-rejection failure mode).

Patterns observed

“570 findings, 10 critical” is the most misleading raw headline in the series — and ownership analysis is what corrects it. This scan is the strongest argument yet that the raw scanner count is not just an over- or under-count (the SCA-reachability and completeness-sweep lessons) but can be misattributed: 81% of findings and 100% of criticals belong to vendored third-party code, not to the project being scanned. A curated report that didn’t separate src/harbor/ from adapters/* would have told harbor’s maintainers they have 10 critical vulnerabilities — when their own code has none. The first curation step on any monorepo must be an ownership split (first-party vs vendored/third-party paths), before any severity reasoning.
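The ownership split can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's code: the `Finding` record and its two fields are hypothetical, and the path prefixes are the ones named in this report (src/harbor/ and packages/ as first-party, adapters/ as vendored).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical minimal record; a real scanner report carries more fields.
    path: str
    severity: str


# Path prefixes taken from this report's description of the repository layout.
FIRST_PARTY = ("src/harbor/", "packages/")
VENDORED = ("adapters/",)


def owner(path: str) -> str:
    """Classify a repository-relative path by who owns the code."""
    if path.startswith(VENDORED):
        return "vendored"
    if path.startswith(FIRST_PARTY):
        return "first-party"
    return "other"


def split_by_owner(findings: list[Finding]) -> dict[str, Counter]:
    """Count findings per severity, grouped by owner, before any severity reasoning."""
    table = {"first-party": Counter(), "vendored": Counter(), "other": Counter()}
    for finding in findings:
        table[owner(finding.path)][finding.severity] += 1
    return table
```

Reading `split_by_owner(findings)["first-party"]["critical"]` before looking at the overall total is the whole point: a zero there changes how every other number in the report should be read.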

An eval harness’s vendored benchmark environments are a genuinely novel surface shape. Unlike a docs/ lockfile (deepteam) or an examples/ subtree (Klavis), which are out-of-runtime, these vendored environments are run, but inside a sandbox whose entire purpose is to contain untrusted agent behaviour. The right framing is neither “out of scope” (they execute) nor “in scope” (they’re third-party and sandboxed); it is “what does your isolation model assume?” This is a threat-model question, not a finding, and it’s the highest-value thing a security review adds on a target like this.

Harbor-core is a second “what right looks like” reference for tar extraction. Where pixeltable had an unguarded extractall (a real finding) and zotero-mcp had a hand-rolled member-validation guard, harbor uses the modern filter="data" everywhere — the cleanest of the three approaches, with a comment explaining why. The recurring tarfile-extractall-traversal rule has now fired across the full spectrum from “exploitable” to “exemplary,” which is the entire case for reading the call site rather than the rule count.

Notes on the tool

Disclosure timeline

Reproduce

git clone https://github.com/elfrost/ai-patchlab
cd ai-patchlab
pip install -e ".[dev]"
python scanner/run_scan.py \
  --from-git-url "https://github.com/harbor-framework/harbor" \
  --reports-dir reports/harbor-framework-harbor \
  --min-severity medium \
  --ignore-samples

External tools (Semgrep, Gitleaks, Trivy, pip-audit) need to be installed separately — see the project README.