Abstract
Visual Studio Code (VS Code) extensions enhance productivity but pose serious security risks by inheriting full IDE privileges, including access to the file system, network, and system processes. We present the first automated solution that profiles, sandboxes, and enforces least-privilege execution policies for VS Code extensions at runtime. The system begins with a multi-layered static risk assessment that combines metadata inspection, supply chain auditing, AST-based code analysis, and LLM inference to identify sensitive behaviors and assign a risk category: reject, unrestricted, or sandbox. For sandboxed extensions, it generates fine-grained policies by mapping required Node.js APIs and dynamically constructed resources such as paths, endpoints, and shell commands. The combination of static analysis and dynamic runtime observation offers a complete view of an extension's behavior. Enforcement is handled via a custom in-process sandbox that isolates extension behavior using dynamic require patching and proxy-based wrappers, without modifying VS Code's core or affecting other extensions running within the same Extension Host. In a study of 377 extensions, 26.5% were high-risk, reinforcing the need for sandboxing. To evaluate the methodology described in the paper, the top 25 trending extensions were assessed through static and dynamic analysis: 64% ran successfully while enforcing the sandbox policies generated through static analysis only, while 100% of them worked while enforcing the policy generated through a combination of static and dynamic analysis. Static analysis captured 95.9% of permissions required for an extension to function; dynamic monitoring covered the rest, ensuring that extensions do not break during use.
Keywords: VS Code extension, sandboxing, AST, static analysis, threat modeling, LLM, software supply chain, dynamic behavior modeling
Introduction
Visual Studio Code (VS Code) is a lightweight, open-source code editor developed by Microsoft that has become one of the most popular development environments worldwide due to its speed and customizability. Its power is significantly amplified by a vast ecosystem of extensions. However, this extensibility introduces a critical security challenge: VS Code extensions typically inherit the full privileges of the IDE. This unrestricted access allows extensions to interact with the file system, network, and system processes, meaning a single compromised extension can jeopardize the entire development environment. Several studies have confirmed that malicious VS Code extensions can exfiltrate credentials, modify files, and act as spyware by abusing trusted APIs [1, 2]. VS Code's security team has also been removing malicious extensions found on the marketplace; however, by the time an extension is removed, it has already had the opportunity to do harm [3].
The core problem, identified by multiple security analyses, is VS Code's lack of granular security controls for extensions. This paper addresses an important question: how can developers safely leverage VS Code extensions without unacceptable security exposure? Our solution enables secure execution by first rigorously assessing an extension's risk profile. Malicious extensions are identified for rejection. For those deemed highly vulnerable but not malicious, our solution focuses on auto-generating precise sandboxing policies which can then be enforced during runtime, protecting the developer from vulnerable actions. This is achieved by identifying data sinks: the points where extensions interact with the file system, network, and sensitive data.
The Security Challenge and Need for Automated Extension Risk Management
VS Code extensions are implemented as Node.js applications, primarily written in JavaScript or TypeScript. As such, they inherit the full capabilities of the Node.js runtime, including access to the filesystem, network, and OS-level processes, without sandboxing or permission enforcement. This execution model exposes the IDE to severe security risks: a compromised extension can act with full local privileges, compromising the entire local system. A systematic analysis revealed that 8.5% of 27,261 analyzed real-world VS Code extensions expose sensitive developer data, such as access tokens and configuration files [4].
Extensions operate with unrestricted access to VS Code's runtime environment. This allows a single malicious extension to exfiltrate data, modify user files, or establish persistence. Supply chain attacks through nested npm dependencies often go undetected, and developers lack tools to inspect or reason about extension behavior. Existing endpoint or antivirus tools are insufficient [5, 6]: they do not account for the unique structure and privilege model of IDE-based plugins, and they place too much trust in VS Code itself.
Manual security auditing of the tens of thousands of VS Code extensions is impractical. The rapid pace of extension updates and the complexity of modern JavaScript/TypeScript codebases make human-only analysis insufficient.
An automated system addresses these challenges by enabling scalable and real-time analysis. Such a system can apply uniform criteria across the ecosystem, quickly triage new submissions, and detect security vulnerabilities introduced in updates. Beyond simple detection, automation enables the enforcement of fine-grained, least-privilege sandboxing policies. These policies can be informed by both static and dynamic analysis, offering a nuanced alternative to the current all-or-nothing trust model.
This approach transforms the security model from all-or-nothing to nuanced risk management. Researchers have expressed concern over the fact that extensions execute with the same privileges as the host IDE, without sandboxing or visibility [7]. Automated risk profiling enables a shift from binary trust models to nuanced, policy-driven enforcement. Developers can safely adopt powerful extensions under enforceable constraints. Extension authors receive actionable feedback, improving ecosystem hygiene. This architecture provides a scalable path to securing VS Code without sacrificing extensibility or productivity.
The Three-Step Solution
Our proposed solution for securing VS Code extensions is structured as a three-stage pipeline. Each stage builds upon the preceding one, enabling the system to move from coarse-grained risk identification to precise behavior confinement.
Figure 1 | High-level diagram of the three-step solution's parts.
Methodology
High-Level Overview of the Three-Step Solution
Step 1: Extension Evaluation and Risk Profiling
The first stage of the pipeline conducts a multi-layered risk assessment that incorporates metadata, supply chain dependencies, and static code analysis. The initial screening examines the extension's publisher profile and declared metadata, using indicators such as publisher reputation, install base, and frequency of updates. In documented cases, attackers have published look-alike extensions that mimicked trusted tools to carry out credential theft [8]. Reviewing the publisher profile helps detect such impersonation attacks.
Next, the solution evaluates the extension’s dependency tree to identify vulnerable or suspicious packages. This includes scanning for known CVEs, unmaintained modules, and indicators of malicious installation behavior, such as post-install scripts or unexpected network requests.
Finally, a deep static analysis phase is performed on the extension’s JavaScript or TypeScript code using an Abstract Syntax Tree (AST) based engine augmented by Large Language Models (LLMs). This analysis maps sensitive operations such as filesystem interactions, process invocations, and network communications into a behavioral profile. Based on this composite view, the system categorizes the extension into one of three buckets: rejected (if malicious), unrestricted (if clearly benign), or sandboxing candidate (if highly vulnerable).
Step 2: Automated Sandbox Policy Generation
For extensions deemed suitable for sandboxing, the solution proceeds to automatically generate fine-grained, least-privilege sandboxing policies. This begins with the construction of a complete dependency graph, mapping all direct and transitive imports used throughout the extension’s codebase. This graph allows the system to determine exactly which Node.js core modules and APIs are required for execution.
Function-level analysis is then conducted using AST traversal techniques. The engine identifies the precise functions invoked within sensitive modules, such as fs.readFile, child_process.exec, or http.request. This produces a whitelist of API calls that constitute the extension’s behavior.
LLMs [9, 10] have demonstrated impressive reasoning abilities on natural-language and programming tasks via few-shot [11] and chain-of-thought [12] prompting. We exploit this ability to resolve dynamic runtime targets, such as file paths assembled from user input or URLs constructed through template strings, by invoking an LLM with a refined prompt. The model analyzes the first-party code (excluding third-party dependencies) and produces a structured output describing all accessed resources, with a justification for each resolution.
The system then transitions to dynamic monitoring. In this mode, the extension is observed in a non-enforcing sandbox for a limited period (e.g., a 7-day period). All files accessed, commands executed, and network endpoints contacted are logged, forming an empirical baseline. This hybrid model of static analysis plus observed dynamic behavior ensures coverage across all extension functionality. It is similar to Chestnut [13], which generates per-app seccomp policies for native OS applications via a compiler pass and optionally refines them with dynamic tracing.
Step 3: Runtime Policy Enforcement
The final stage of this solution involves runtime enforcement of the generated policies. This is accomplished via a custom sandboxing architecture implemented directly within the VS Code Extension Host process. While Microsoft has implemented VS Code process sandboxing [14], this paper extends the principles to extensions, ensuring compatibility across platforms and not requiring changes to the VS Code core.
The enforcement layer operates by intercepting Node.js’s module loading system using dynamic require patching. When an extension attempts to load a module, the system verifies whether it is operating within a sandboxed context. If so, it provides a proxy-wrapped version of the module that enforces resource access policies at the function level.
From this point onwards, the extension runs under an explicit allow-list (“default-deny”) runtime policy. All operations that are explicitly permitted by the policy execute; any operation not explicitly listed is blocked and an error is logged.
The enforcement is restricted to the specific extension under evaluation, preventing interference with other extensions and ensuring seamless integration that does not affect other extensions' functionality.
In-Depth Extension Evaluation and Risk Profiling
Our extension evaluation system employs a multi-layered, automated approach, with each layer providing increasingly detailed risk assessment data. This is accomplished through a suite of analysis services that examine the extension’s metadata, supply chain, and code, resulting in a granular risk score.
Layer 1: Metadata and Publisher Analysis
This initial screening provides a rapid risk assessment based on observable characteristics, namely the metadata and permissions required by the extension.
Publisher and Marketplace analysis: This analysis uses data from the extension-info.json file, which contains marketplace data. It assesses publisher reputation by checking for domain verification and analyzes community engagement through metrics like install counts and average ratings. A low install count (<500) combined with a low rating count (<10) is flagged as a low-popularity risk factor, suggesting limited community scrutiny. The analysis also checks for outdated extensions, flagging any that have not been updated in over two years as a medium risk. As AquaSec has shown [8], masquerading as popular extensions is a known attack vector, and this analysis is intended to surface such extensions.
Declared Intent and Permissions analysis: The Permission analysis statically examines the package.json manifest to evaluate an extension's requested permissions and capabilities, which are explicitly defined in the extension manifest [15].
Activation Events: This analysis identifies the use of broad activation events like * or onStartupFinished, which cause an extension to load early or remain persistently active. These are flagged as medium risk because they unnecessarily increase the extension's exposure.
Sensitive Contributions: The manifest's contributes section is scanned for high-risk contribution points. For instance, contributing debuggers, terminal profiles, or taskDefinitions is considered medium risk, as these capabilities can execute arbitrary code. Contributing authentication providers is also flagged as medium risk because it involves handling credentials.
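As an illustration, the two manifest checks above can be sketched as a small scanner. The function name and the exact set of contribution keys are our assumptions, not the paper's code.

```javascript
// Broad activation events and high-risk contribution points flagged by the
// Permission analysis (contribution key names are assumed manifest keys).
const BROAD_ACTIVATION = new Set(['*', 'onStartupFinished']);
const SENSITIVE_CONTRIBUTIONS = new Set(['debuggers', 'terminal', 'taskDefinitions', 'authentication']);

function analyzeManifest(manifest) {
  const findings = [];
  for (const event of manifest.activationEvents || []) {
    if (BROAD_ACTIVATION.has(event)) {
      findings.push({ check: 'broad-activation', detail: event, risk: 'medium' });
    }
  }
  for (const point of Object.keys(manifest.contributes || {})) {
    if (SENSITIVE_CONTRIBUTIONS.has(point)) {
      findings.push({ check: 'sensitive-contribution', detail: point, risk: 'medium' });
    }
  }
  return findings;
}
```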
Layer 2: Supply Chain Security Assessment
The solution also performs a deep dive into the software supply chain.
Known Vulnerability Detection:
Dependency Auditing: The Dependency analysis executes npm audit [16] on the extension's package.json. If a package-lock.json file is present, it is used for the most accurate audit. If not, the service creates a temporary directory, runs npm install, and then performs the audit. Vulnerabilities are mapped from npm's severity levels (moderate, high, critical) to our internal risk scores. As Node.js packages operate with high privileges, ensuring that the specific package versions in use are free of known vulnerabilities is essential [17].
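A minimal sketch of this severity mapping, assuming the `npm audit --json` report shape (npm v7+), where `vulnerabilities` is keyed by package name and each entry carries a `severity` field; the function and scale names are ours.

```javascript
// Map npm audit severities onto the internal risk scale.
const SEVERITY_TO_RISK = { low: 'low', moderate: 'medium', high: 'high', critical: 'critical' };

function riskFromAudit(auditJson) {
  const order = ['low', 'medium', 'high', 'critical'];
  const risks = Object.values(auditJson.vulnerabilities || {})
    .map(v => SEVERITY_TO_RISK[v.severity] || 'low');
  // Module risk is the highest severity seen across all dependencies.
  return risks.reduce((acc, r) => (order.indexOf(r) > order.indexOf(acc) ? r : acc), 'low');
}
```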
OSSF Scorecard analysis: The OSSF Score [18] analysis provides a security health check for the project's repository and its direct dependencies. It automatically discovers the repository URL by first parsing the VSIX extension.vsixmanifest file for source-code links, falling back to the repository field in package.json. The OpenSSF Scorecard tool assesses the repository on various security checks, including code review practices, vulnerability disclosure policies, and testing. A score below 3 is considered high risk, while a score between 3 and 6 is medium risk. This score is a key indicator of the project's overall security maturity. However, it is important to note that the OSSF Scorecard analysis is tuned for GitHub workflows, so non-GitHub projects may receive a lower score [19].
Malicious Behavior and Sensitive Data Detection:
Malware Scanning: We use the VirusTotal [20] service, an enterprise-grade service from Google, to scan the extension package (.vsix) by first calculating its SHA256 hash and checking for existing reports. If no report is found, the file is uploaded for analysis. The risk is assessed based on the detection ratio from about 70 antivirus engines, which keeps false negatives low. Moreover, we apply a threshold (e.g., requiring more than one engine to flag the extension as malicious) before treating an extension as malicious, which keeps false positives low.
Sensitive Information Leakage: The Sensitive Info analysis uses the ggshield [21] command-line tool to perform a recursive scan of the extension's source code. It executes ggshield secret scan path --recursive --json and processes the JSON output to identify hardcoded secrets like API keys, private credentials, and tokens. Each finding is categorized, and its value is redacted for safe reporting. These exposed secrets are vulnerabilities that attackers can use to gain access to sensitive information or key operations.
Code Obfuscation: The Obfuscation detection analyzes the code for signs of intentional obfuscation, exposing malicious extensions that use obfuscated code to bypass review processes. It calculates the Shannon entropy [22] of files to detect packed or encrypted data, identifies the use of hexadecimal or Unicode encoding, and flags the presence of long, minified lines of code. For smaller files, it performs a full AST analysis to detect techniques like control-flow flattening.
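The entropy check can be sketched as follows; the flagging threshold mentioned in the comment is illustrative, not the system's tuned value.

```javascript
// Shannon entropy over a file's bytes; packed or encrypted payloads trend
// toward ~8 bits/byte, while ordinary source code sits much lower.
// A threshold such as > 7.5 could then map to the Medium risk bucket.
function shannonEntropy(buffer) {
  const counts = new Array(256).fill(0);
  for (const byte of buffer) counts[byte]++;
  let entropy = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / buffer.length;
    entropy -= p * Math.log2(p);
  }
  return entropy; // bits per byte, in [0, 8]
}
```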
Network Communication Anomalies: The Network Endpoint analysis parses the AST to extract all domain names and IP addresses. It validates domains via DNS checks and then uses the VirusTotal API to check the reputation of all public endpoints. This ensures that extensions are not sending data to, or receiving data from, dangerous network endpoints.
Layer 3: Deep Code Analysis and Behavioral Profiling
The AST analysis is the core of this layer, performing static analysis of the code's behavior similar to ODGen [23], but optimized for speed by incorporating concepts from GraphJS [24] and FAST [25]. It parses all JavaScript/TypeScript files into Abstract Syntax Trees (ASTs) and walks them with the @babel/traverse library, which uses a depth-first traversal algorithm. To avoid rebuilding the AST repeatedly, multiple rule-based scans run as each node is visited during a single traversal, with each scan highly specialized and looking for a single class of vulnerability or threat. LLMs then adjudicate the findings to reduce the number of false positives, inspired in large part by research into using LLMs to adjudicate static-analysis alerts [26].
Code Execution and Injection:
Command Injection: This phase tracks usage of the child_process module. It flags any use of exec or spawn where the command argument is constructed dynamically (e.g., through string concatenation or template literals), a classic command injection vector. The rules also check whether the shell: true option is used, which is a high-risk practice.
Unsafe Code Patterns: This rule set detects the use of eval(), new Function(), and setTimeout with string arguments, all of which can fetch and execute code at runtime, bypassing any static security analysis.
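As a rough illustration only, the command-injection patterns above can be approximated textually; the production engine works on Babel ASTs rather than regular expressions, so this sketch deliberately over-simplifies.

```javascript
// Naive textual approximation of the command-injection rules (the real
// engine walks CallExpression nodes; this only illustrates the patterns).
function flagCommandInjection(source) {
  const findings = [];
  // exec/spawn whose first argument is not a plain string literal
  const dynamicCall = /\b(exec|spawn)\s*\(\s*(?!['"])/g;
  if (dynamicCall.test(source)) findings.push('dynamic-command-argument');
  // template literals passed to exec/spawn
  if (/\b(exec|spawn)\s*\(\s*`/.test(source)) findings.push('template-literal-command');
  // the high-risk shell option
  if (/shell\s*:\s*true/.test(source)) findings.push('shell-true-option');
  return findings;
}
```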
Filesystem and Data Security:
Sensitive Path Access: This rule set checks for filesystem operations that interact with sensitive file locations. It uses a predefined list of high-risk path patterns, which includes browser profiles, SSH keys (~/.ssh), and cloud credentials (~/.aws/credentials). Extensions accessing sensitive paths are riskier for developers to use.
Cryptographic Weaknesses: This rule set identifies the use of weak hashing algorithms (MD5, SHA1), insecure ciphers (DES, RC4), static or hardcoded initialization vectors (IVs), and insufficient iteration counts in PBKDF2.
Prototype Pollution: This rule set detects common prototype pollution patterns [27], such as direct assignment to __proto__ or unsafe object merging with functions like Object.assign.
Risk Scoring and Aggregation
Our solution employs a quantitative risk scoring system to provide a consistent and objective assessment, similar to UntrustIDE, which labeled over 700 extensions by exploitability and behavior [1]. Each finding from the analysis services is assigned a risk level (low, medium, high, or critical). These levels are mapped to numerical values for aggregation: low = 0, medium = 1, high = 2, and critical = 3.
The overallRisk for a given analysis module (e.g., dependencies, metadata) is determined by the highest risk score found within that module. For example, the Dependency analysis assigns an initial risk of low to each dependency and then updates it based on the highest severity vulnerability found by npm audit. Similarly, the Permission analysis aggregates risk factors from the manifest; a single high risk factor will elevate the module’s overall risk to high.
Finally, these risk scores are aggregated to produce a comprehensive risk profile for the extension. This informs the decision to accept, reject, or sandbox it. This granular, bottom-up approach to scoring ensures that even a single critical issue in one area is not overlooked.
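The aggregation logic can be sketched as a max-reduction over the numeric scale; the function names are ours.

```javascript
// Bottom-up aggregation: low=0, medium=1, high=2, critical=3; a module's
// overallRisk is the maximum score among its findings.
const RISK_VALUES = { low: 0, medium: 1, high: 2, critical: 3 };

function overallRisk(findings) {
  const max = findings.reduce((m, f) => Math.max(m, RISK_VALUES[f.risk] ?? 0), 0);
  return Object.keys(RISK_VALUES).find(k => RISK_VALUES[k] === max);
}

// Extension-level profile: aggregate per-module scores the same way, so a
// single critical issue in one area is never averaged away.
function extensionRisk(moduleRisks) {
  return overallRisk(moduleRisks.map(risk => ({ risk })));
}
```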
| Analysis Module | Sub-Component / Check | Risk Score / Severity Logic | Notes |
| --- | --- | --- | --- |
| Publisher and Marketplace | Low Install Count & Ratings | installCount < 500 AND ratingCount < 10 -> Medium | |
| | Unverified Publisher | isVerified: false -> Medium | |
| | Outdated (> 2 years) | lastUpdated is over 2 years ago -> Medium | |
| Intent and Permissions | Activation Events | Presence of broad activation (*) or onStartupFinished -> Medium | |
| | Sensitive Contributions | Presence of task definitions, terminal profiles, or debugger contributions -> Medium | |
| Dependency Auditing | npm audit vulnerabilities | Direct mapping from npm audit severity (moderate -> Medium, etc.) | Severity depends on the specific CVE found. |
| OSSF Scorecard | Repository score of each dependency | score <= 3.0 -> High; score <= 6.0 -> Medium | |
| Malware Scanning | Malicious detections | 2+ marked malicious -> Critical; 1 marked malicious and 1+ marked suspicious -> High | Based on ratings of about 70 antivirus engines from VirusTotal. |
| | Suspicious detections | 2+ marked suspicious -> Medium | |
| Sensitive Information Leakage | Hardcoded secrets (ggshield) | Any validated secret finding -> High | LLM is used to filter false positives. |
| Code Obfuscation | Unicode or hexadecimal encoding | Presence -> Medium | |
| | Long, minified lines | Presence -> Medium | |
| | Control-flow flattening | Presence -> High | e.g., detects while(true){switch(...)} evasion patterns |
| | High file entropy | High Shannon entropy (packed/encrypted code) -> Medium | |
| Network Communication Anomalies | Malicious endpoint per VirusTotal | 2+ marked malicious -> Critical; 1 marked malicious and 1+ marked suspicious -> High | LLM is used to filter out bogus endpoints before checking. |
| | Suspicious endpoint per VirusTotal | 2+ marked suspicious -> Medium | |
| AST: Code Execution | Dynamic child_process | Presence -> Critical | exec(variable) or spawn(variable) |
| | Dynamic eval / Function | Presence -> Critical | eval(variable) or new Function(variable) |
| AST: Filesystem | Access to sensitive paths | Presence -> High | e.g., fs.readFile('~/.ssh/id_rsa') |
| AST: Webview XSS | innerHTML assignment | Presence -> High | e.g., element.innerHTML = variable |
| AST: Crypto | Hardcoded encryption key | Presence -> High | e.g., createCipheriv(..., 'static_key', ...) |
| | Weak cipher (e.g., DES) | Presence -> High | e.g., createCipher('des', ...) |

Table 1 | Comprehensive risk scoring matrix
In-Depth Automated Sandbox Policy Generation
The cornerstone of our solution is the automated generation of an accurate and least-privilege sandboxing policy. This is achieved through a hybrid analysis pipeline that combines high-speed code analysis with advanced semantic reasoning powered by LLMs.
Novel Methodology Overview
Our approach goes beyond traditional static analysis by combining four distinct components. Deterministic code analysis is used to systematically map function calls through AST traversal. This initial analysis is inspired by the HODOR system, which constructs call graphs to record required system calls for Node.js applications [28]. Semantic understanding is incorporated via Large Language Models (LLMs), enabling the resolution of dynamically constructed behavior not easily captured by static techniques. The technique of combining AST-based resolution with LLM-driven reasoning has also been shown to be effective in IRIS for inferring dynamic behaviors [29]. The pipeline also optimizes for context window constraints by minimizing irrelevant input and ensuring scalability to large codebases. The results are then refined with dynamic monitoring of extensions to get complete coverage. Finally, the results of all analyses are merged into enforceable sandbox rules without requiring manual intervention.
Step 1: Comprehensive Entrypoint Identification and Dependency Graphing
The analysis begins by identifying the extension's primary and secondary entry points. This includes determining the main script from the package.json manifest [30], inferring secondary entrypoints through activation events, and enumerating user-triggered commands and WebView or worker thread initializers.
A bundler, such as esbuild [31], is then employed to construct a full dependency graph. The system is configured to perform a comprehensive analysis by starting at the extension's main entry points and recursively following all import and require statements. This graph captures both first-party and third-party modules, as well as conditional and dynamic imports, resulting in a detailed metafile that enumerates every required Node.js core module and external dependency. In the rare case that the esbuild-based static pass fails or cannot produce a graph, the system emits an empty allow-list policy (i.e., no static grants). Execution then relies entirely on the dynamic analysis.
Step 2: Coarse-Grained Module Restriction with Security Baseline
To establish an initial security boundary, we first classify Node.js modules based on their inherent risk. Sensitive modules such as fs, http, and child_process are flagged due to their potential to perform input/output, networking, or command execution.
In particular, we track core Node.js modules by functionality. For file access, we track fs, fs/promises, path, and stream. For network access, we track http, https, http2, net, dgram, tls, dns, and url. For process execution, we track child_process and process. We also track os, which falls into its own miscellaneous category. We restrict tracking to core Node.js modules because any third-party module that performs such sensitive operations must ultimately, in its own source code, call one of the core Node.js modules to do so; protecting against the core modules therefore protects against all modules, third-party ones included. The modules listed above are not an exhaustive list of the core Node.js modules capable of these sensitive operations.
Any module not present in the extension’s dependency graph is disabled entirely in the generated sandbox policy. This approach reduces the attack surface, avoids false positives, and eliminates unnecessary runtime checks, therefore improving both security and performance.
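A sketch of this coarse-grained step, using the sensitive-module categories from the preceding paragraph; the policy shape is illustrative.

```javascript
// Sensitive core modules tracked by category (not exhaustive).
const SENSITIVE_MODULES = [
  'fs', 'fs/promises', 'path', 'stream',                          // file access
  'http', 'https', 'http2', 'net', 'dgram', 'tls', 'dns', 'url',  // network
  'child_process', 'process',                                     // process execution
  'os',                                                           // miscellaneous
];

// Modules absent from the dependency graph are disabled entirely; modules
// present are passed on for fine-grained function whitelisting.
function coarsePolicy(usedModules) {
  const policy = { allowed: [], denied: [] };
  for (const mod of SENSITIVE_MODULES) {
    (usedModules.has(mod) ? policy.allowed : policy.denied).push(mod);
  }
  return policy;
}
```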
Step 3: Fine-Grained Function Whitelisting via Advanced AST Analysis
To determine exactly which functions an extension uses within sensitive modules, we parse its codebase into an Abstract Syntax Tree (AST) using a tool such as Babel. The AST allows systematic traversal to identify all CallExpression nodes, enabling us to extract function usage patterns at a granular level.
Our analysis supports a variety of equivalent JavaScript idioms. It detects direct calls like fs.readFile(), as well as destructured assignments such as const { readFile } = require('fs'); readFile(). It also handles aliased references (const myRead = fs.readFile; myRead()) and method chaining patterns, including require('fs').promises.readFile().
Dynamic property access, such as fs[method](), is also considered. If the method name can be resolved statically, it is included in the sandbox policy; otherwise, it is flagged for dynamic verification. Conditional imports, like const fs = condition ? require('fs') : require('fs/promises'), are also fully analyzed to ensure coverage of all code paths.
The result is a whitelist of only those sensitive functions the extension actually uses. This whitelist is directly translated into enforcement rules, blocking access to any function not explicitly observed, while ensuring essential functionality is preserved.
Step 4: Hybrid Runtime Target Resolution: Large Language Models and Dynamic Profiling
The next aspect of our pipeline is resolving dynamic runtime behavior that traditional AST analysis cannot determine. To ensure comprehensive coverage for all extensions, we employ a dual-approach strategy.
LLM-Based Resolution
We use Large Language Models (LLMs) to perform semantic analysis that complements AST-based static techniques. For the majority of extensions, whose bundled first-party code is 2 megabytes or less, the bundled file is given directly to the LLM. For larger extensions, the code is broken into 2-megabyte chunks so as not to exceed the LLM's context window. Each chunk then undergoes the following process, and the results are compiled at the end.
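The chunking step can be sketched as below; it approximates the 2-megabyte limit by string length, which is exact for ASCII-dominant source.

```javascript
// Split a bundled source string into ~2 MB chunks so each piece fits the
// model's context window; per-chunk results are merged afterwards.
const CHUNK_BYTES = 2 * 1024 * 1024;

function chunkSource(source, chunkBytes = CHUNK_BYTES) {
  const chunks = [];
  for (let i = 0; i < source.length; i += chunkBytes) {
    chunks.push(source.slice(i, i + chunkBytes));
  }
  return chunks;
}
```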
To reduce token usage and focus the model on core logic, we construct a lean version of the extension by stripping all third-party code under node_modules, preserving only the extension’s own source files.
This lean bundle is provided to the LLM with a structured prompt instructing it to act as a security auditor. The model is tasked with identifying sensitive runtime behaviors, including file system paths, network endpoints, and shell commands, especially those constructed dynamically. It returns these results in a structured JSON format, along with reasons that explain how each target was inferred from code context.
There are cases where an LLM request fails because the service is busy or unavailable. In such cases, exponential backoff is used to retry the request until it eventually goes through. Even a total failure of the LLM analysis is not fatal, as the dynamic analysis that follows still generates a sandboxing policy.
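The retry loop can be sketched with standard exponential backoff; the attempt count and base delay here are illustrative, not the system's tuned values.

```javascript
// Retry a flaky async call (e.g., an LLM request) with exponential backoff.
// On total failure, return null: the caller falls back to dynamic analysis.
async function withBackoff(fn, { attempts = 5, baseMs = 500 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) return null; // give up; dynamic analysis covers the gap
      await new Promise(r => setTimeout(r, baseMs * 2 ** i)); // 1x, 2x, 4x, ...
    }
  }
}
```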
The LLM excels at resolving patterns traditionally hidden from static tools. It can infer configuration-based paths built from variables, constants, or user settings (e.g., path.join(os.homedir(), '.config', extensionName)), reconstruct URLs composed from template literals or conditional logic, and identify command strings assembled from multiple sources. Additionally, it handles runtime branching logic, such as conditional execution paths or ternary expressions, and can trace deeply nested variable assignments or cross-function data flow. Studies confirm that advanced LLMs can analyze code and generate correct, detailed explanations with reasonable accuracy as long as the code is not heavily obfuscated [32].
Beyond primitive resolution, the LLM also supports enhanced function-level analysis. In cases where static traversal cannot definitively resolve which sensitive functions are invoked due to aliasing, dynamic imports, or abstraction through utility layers, the LLM can inspect broader context and trace indirect references. For example, it can recognize that a function passed to executeShellCommand() eventually resolves to a call to child_process.exec(), even if obscured by layers of indirection or callback wrapping.
By incorporating this semantically-aware reasoning, the system can generate more complete and accurate whitelists of accessed resources and called operations. This reduces false negatives during enforcement and eliminates the need for overly permissive fallback policies. The LLM’s analysis thus plays a critical role in ensuring the sandbox enforces strict boundaries without disrupting legitimate functionality.
Dynamic Behavioral Baselining for All Extensions
To complement our AST- and LLM-based static analysis, and to account for the limits of ASTs' and LLMs' ability to accurately generate data-flow graphs [33], we implement a dynamic monitoring system for all extensions, regardless of their size. For all extensions, this dynamic baselining serves as a verification layer that confirms the static analysis findings.
This process leverages the same in-process sandboxing architecture detailed in Section 4. However, during an initial monitoring period, the sandbox is configured for observation rather than enforcement. Instead of blocking operations, the proxy-based interception layer meticulously logs all interactions with sensitive Node.js modules.
Our system goes beyond simply recording that a function was called. By inspecting the parameters passed to each function at runtime, we can capture detailed operational data as specified below:
File System Access: We record the full paths of all files and directories the extension attempts to read, write, or modify.
Network Communication: We log the specific URLs and IP addresses of all outgoing network requests.
Process Execution: We capture the exact commands and arguments the extension attempts to execute.
This rich runtime log constitutes an empirical baseline of the extension’s behavior. After the monitoring period, the data is analyzed to automatically synthesize a precise, least-privilege sandbox policy. The sandbox then transitions to auto-enforcement mode, in which any action that falls outside the observed baseline is blocked by default. This ensures that even the largest and most complex extensions can be securely managed, providing a practical and robust solution.
Step 5: Intelligent Policy Synthesis
The final step in our solution is the intelligent synthesis of a comprehensive and enforceable sandbox policy, which is achieved by combining the findings from both our static and dynamic analyses. While each approach provides valuable insights individually, they work best together to create a sandbox policy that is both secure and functional.
This hybrid approach ensures maximum coverage and accuracy.
Dynamic analysis is crucial for capturing the core, day-to-day functionality of the extension. By monitoring the extension’s behavior during the observation period, we ensure that the functions and resources essential to the user’s typical workflow are permitted. This guarantees that the user’s experience is not interrupted by overly restrictive rules.
Static analysis, however, addresses an important gap left by dynamic monitoring as is similarly addressed in the Confine system34. A user is unlikely to access every single feature of an extension within the monitoring period. Static analysis examines the entire codebase, accounting for rarely accessed functionalities like annual update checking or obscure import/export features. By including these legitimate but infrequently used code paths in the sandbox policy, static analysis prevents them from being incorrectly blocked later, ensuring the extension remains fully functional in the long term.
The combined data from these two modes of analysis is then used to generate the sandbox policy structure. The findings (the list of used modules, the whitelist of functions from both static and dynamic checks, and the LLM-resolved or dynamically observed paths, URLs, and commands) are merged into a single, structured sandbox policy file, typically in JSON format. This final document serves as the complete sandboxing policy for the extension. It contains metadata about the analysis, rules to enable or disable entire modules, a granular list of permitted functions, and precise access control rules for file paths, network endpoints, and commands. This comprehensive sandbox policy becomes the rulebook for the sandboxing enforcement layer.
The generated sandbox policies support several advanced enforcement features to accommodate real-world extension behavior. These include pattern-based rules, allowing the specification of file paths and URLs using glob patterns or regular expressions for flexible yet bounded access control.
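A synthesized policy might take a form like the following; the field names and schema are illustrative assumptions, not the paper's exact format:

```json
{
  "extension": "publisher.example-extension",
  "source": { "static": true, "dynamic": true },
  "modules": { "fs": "restricted", "https": "restricted", "net": "deny" },
  "functions": ["fs.readFile", "fs.writeFile", "https.request"],
  "paths": ["${workspaceFolder}/**", "~/.config/example/**"],
  "endpoints": ["https://api.example.com/*"],
  "commands": ["git status"]
}
```

Note the glob patterns in the path and endpoint rules, which give flexible yet bounded access control as described above.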
This automated, dual-analysis process results in policies that are both precisely tailored to the extension’s legitimate needs and maximally restrictive of potentially malicious behavior.
In-Depth Policy Enforcement Implementation
The enforcement of our auto-generated policies requires a sophisticated sandboxing architecture that operates within VS Code’s unique extension execution environment while maintaining compatibility and performance.
Understanding VS Code’s Extension Host Architecture
Below, we describe three core components that shape the constraints and affordances of any security mechanism implemented within this environment.
Figure 2 | VS Code Extension Host architecture before adding sandboxing technology
Extension Host Process:
VS Code runs extensions inside one or more dedicated Extension Host processes15, which are separate from the core renderer and main processes. Each Extension Host is an isolated Node.js runtime, responsible for executing extension code and facilitating communication with the main editor process via a JSON-RPC-like protocol. All extensions within the same host process share the same memory space and Node.js global state.
Extension Lifecycle:
The lifecycle of a Visual Studio Code extension begins with the loading of its manifest file (package.json), which declares metadata, activation events, and the entry point script. When a specified activation condition is met, such as opening a file of a particular type or executing a registered command, the Extension Host process instantiates the extension by importing its main module. This triggers execution of the extension’s activate() function, which registers all relevant commands, providers, and event handlers with the editor environment. From this point onward, the extension operates with full access to the Node.js runtime and its global APIs.
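An illustrative manifest fragment showing the pieces involved in this lifecycle (the extension name and command identifiers are hypothetical):

```json
{
  "name": "example-extension",
  "main": "./out/extension.js",
  "activationEvents": ["onLanguage:python", "onCommand:example.run"],
  "contributes": {
    "commands": [{ "command": "example.run", "title": "Run Example" }]
  }
}
```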
Shared Runtime Challenges:
VS Code’s extension architecture presents significant challenges for runtime isolation due to its shared execution model. All extensions within a single Extension Host process operate in the same Node.js runtime, sharing the global object space, module cache, and core API surface. The require() function and its associated module resolution cache are common across all extensions, which means that any modification to a core module, intentional or accidental, can affect every other extension loaded in the same process. Furthermore, global Node.js APIs such as fs, http, and child_process are not compartmentalized per extension and can be monkey-patched or proxied by any extension in the same Extension Host. This lack of strict module and global object isolation complicates the implementation of security boundaries, as extensions can potentially tamper with or eavesdrop on the behavior of others. Consequently, any sandboxing mechanism must not only confine a target extension’s access to sensitive resources but also defend against cross-extension influence within the shared runtime environment.
Extension-Scoped Sandboxing Requirements
Effective sandboxing within the VS Code Extension Host requires extension-scoped enforcement that confines only the target extension without impacting others. The system must apply restrictions selectively, allowing trusted extensions to operate normally while preventing sandboxed ones from bypassing controls via shared globals or indirect invocation. It must maintain runtime transparency, requiring no code changes by extension authors, and preserve compatibility with VS Code APIs and expected workflows. Security isolation must be robust, blocking unauthorized access to resources and preventing extensions from using others as proxies to circumvent restrictions. These constraints necessitate a finely scoped, low-overhead enforcement mechanism that integrates seamlessly with the shared Node.js runtime.
Our Novel In-Process Sandboxing Method
We implement a custom sandboxing architecture that operates entirely within the Node.js runtime. A foundational step in this process is the preemptive caching of pristine, original Node.js modules. Before any sandboxing hooks are installed, the sandbox loader script explicitly loads all sensitive modules (e.g., fs, https, child_process) and stores them in a private, static cache. This “clean cache” ensures that the sandbox’s enforcement layer always has a secure reference to the untampered, original module functionality, preventing any possibility of a sandboxed extension poisoning the module cache for other extensions or the sandbox itself.
Figure 3 | VS Code Extension Host architecture with sandboxing technology
Extension Entry Point Redirection
The enforcement mechanism is initiated by cleanly redirecting how VS Code starts the extension. This involves programmatically modifying the extension’s package.json manifest file. The main property, which normally points to the extension’s startup script, is changed to point to our custom sandbox loader script (sandbox.js). This loader becomes the first code to run when the extension is activated. Its job is to (1) access the clean cache of original modules, (2) install all necessary sandboxing hooks and patch require, and then (3) load and execute the extension’s original startup script, identified by a realEntryPoint property we add to the manifest. This ensures the extension runs within the secured, policy-controlled environment from its first moment of execution. Because other extensions in the Extension Host may already have been sandboxed, step (2) first checks whether the necessary hooks and patches have already been installed by another extension; if so, the step is skipped, avoiding unnecessary duplication of hooks.
Figure 4 | Steps sandboxing technology takes from starting an extension to getting it running
Dynamic Require Patching with Caller Verification
The core of the sandbox relies on a technique called dynamic require patching. The system intercepts Node.js’s fundamental require() function by overwriting Module.prototype.require. The custom interception logic is designed to be extension-scoped, which is achieved through a caller verification check. For every require() call, the function inspects the this.filename property available within the require function’s execution context. This property contains the absolute path of the script file making the call. The sandbox compares this callerPath against the sandboxed extension’s root directory path. If the caller is not within the extension’s directory, the call is considered to be from an external source (another extension or VS Code itself), and our hook immediately invokes the original, cached require() function, ensuring zero performance impact or interference. If the caller is verified to be part of the target extension, the policy enforcement logic proceeds. It first checks whether the module being requested is a sensitive Node.js module that we want to control (e.g., fs or https). If not, the original, cached module is returned; if it is, the module-wrapping logic proceeds.
Advanced Module Wrapping with Proxy-Based Interception
When a sandboxed extension’s require() call is granted access to a sensitive module, it does not