A Deep Dive into China’s 500GB+ Censorship Data Breach
Introduction
In a historic breach of China’s censorship infrastructure (September 2025), over 500 gigabytes of internal data were leaked from Chinese infrastructure firms associated with the Great Firewall (GFW). Researchers now estimate the full dump is closer to ~600 GB, with a single archive comprising around 500 GB alone.
The material reportedly includes more than 100,000 documents, along with internal source code, work logs, configuration files, emails, technical manuals, and operational runbooks. (WIRED) Other reports put the number of files in the thousands; exact totals vary by source. (Bitdefender)
Among the revealed artifacts are:
- RPM packaging server files (the packaging infrastructure used for distributing software artifacts)
- Project management data (Jira, Confluence) showing internal tickets, feature requests, bug reports, and deployment histories
- Communications and engineering documents showing how censorship tools are tested against VPNs, Tor, and other circumvention methods, for example deep packet inspection (DPI), SSL fingerprinting, and filtering logic. (Tom’s Hardware)
- Deployment records indicating both domestic use (provinces like Xinjiang, Fujian, and Jiangsu) and export of censorship or surveillance systems to other countries, including Myanmar, Pakistan, Ethiopia, and Kazakhstan.
This report is the first in a three-part series which aims to document the dump’s contents, analyze its technical implications, and assess the geopolitical fallout stemming from the exposure of these sensitive tools and architectures.
Evidence of Failure and Oversight
The leaked IP logs and packet captures expose critical moments where the censorship apparatus faltered, revealing the inherent fragility of the Great Firewall’s distributed enforcement model. In multiple instances, cross-border leakage routes allowed foreign IPs to establish unfiltered sessions for extended periods, suggesting delays in rule propagation, temporary policy gaps, or the failure of heuristic detection systems. These lapses demonstrate that while the system is highly surveillant, it remains reactive and inconsistently enforced across regions.
Additionally, misconfigured mirrors inadvertently exposed internal blacklist data to external interfaces. These exposures included leaked regional UUIDs and configuration files, offering rare insight into the naming conventions and structural logic of localized rule deployment. Simultaneously, honeypot deployments on high-risk ports attracted and logged adversary interactions, including traceroutes and detailed packet-level reconnaissance, suggesting that foreign entities were already probing China’s defensive perimeter. These incidents, likely overseen by regional engineers or testbed maintainers, underscore the bureaucratic brittleness of a censorship regime built on siloed enforcement layers, inconsistent rule application, and latency in central-to-edge command synchronization.
The Nature of the Dump
The dataset is a sprawling, multifaceted archive that lays bare the technical scaffolding of China’s digital surveillance regime. It includes raw IP access logs from state-run telecom providers such as China Telecom, China Unicom, and China Mobile, revealing real-time traffic monitoring and endpoint interaction.
Note: downloading and researching this data should be handled by professionals in protected environments, due to potential malware and the information it contains.
Packet captures (PCAPs) and routing tables are paired with blackhole sinkhole exports, detailing how traffic is intercepted, redirected, or silently dropped. A trove of Excel spreadsheets enumerates known VPN IP addresses, DNS query patterns, SSL certificate fingerprints, and behavioral signatures of proxy services, offering insight into identification and blocking heuristics. Visio diagrams (.vsd/.vsdx) map out the internal firewall architecture, from hardware deployments to logical enforcement chains spanning various ministries and provinces. Application-layer logs dissect tools like Psiphon, V2Ray, Shadowsocks, and corporate proxy gateways, capturing how these are tested, fingerprinted, and throttled. The dataset also contains databases of FQDNs, SNI strings, application telemetry, and “sketch logs”, showing serialized behavioral data scraped from mobile apps. System-level monitoring exports reveal server CPU usage, memory utilization, stream session logs, and real-time user states. Crucially, metadata leaked from Word, Excel, and PowerPoint files exposes the usernames, organizational affiliations, and edit trails of engineers and bureaucrats working on censorship infrastructure. Finally, OCR-processed screenshots illustrate the UI panels of traffic control dashboards, logging mechanisms, and internal tooling, offering a visual window into how the Great Firewall is operated in practice.
The dataset includes:
- Raw IP access logs from state-run service providers (e.g., China Telecom, Unicom, Mobile)
- Packet captures (PCAPs), routing tables, and blackhole sinkhole exports
- Excel spreadsheets listing VPN IPs, DNS logs, SSL certs, and proxy service patterns
- Visio (.vsd/.vsdx) files mapping internal firewall topology and logical enforcement chains
- Application-layer analyses of tools like Psiphon, V2Ray, Shadowsocks, and enterprise proxies
- Databases of FQDNs (fully qualified domain names), SNI patterns, app telemetry, and app “sketch” logs
- Monitoring exports for CPU usage, system state, user sessions, and stream logs
- Metadata leaks from Word, Excel, and PowerPoint documents exposing usernames, organizations, and edit histories
- OCR’d screenshots showing UI interfaces of control panels and logging dashboards
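For researchers triaging a copy of such an archive inside an isolated lab, a first pass usually means counting what is there before opening anything. The following is a minimal sketch of that step, not a description of how the dump was actually processed. The category map and function names are hypothetical and based only on the file types listed above.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to the categories named in this report.
CATEGORIES = {
    ".pcap": "packet capture", ".pcapng": "packet capture",
    ".xls": "spreadsheet", ".xlsx": "spreadsheet",
    ".vsd": "visio diagram", ".vsdx": "visio diagram",
    ".doc": "office document", ".docx": "office document",
    ".ppt": "office document", ".pptx": "office document",
    ".log": "log", ".txt": "log",
    ".png": "screenshot", ".jpg": "screenshot",
}

def categorize(path):
    """Classify a file by extension only; the file itself is never opened."""
    return CATEGORIES.get(Path(path).suffix.lower(), "other")

def inventory(root):
    """Walk a directory tree and count files per category."""
    counts = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            counts[categorize(p)] += 1
    return counts
```

Classifying by extension alone is deliberately conservative: no file content is parsed, which matters when the archive may contain malware.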
The Implications of a 500GB Breach
The leak of over 500 gigabytes of internal data from China’s censorship infrastructure constitutes one of the most consequential exposures in the history of digital authoritarianism. Encompassing more than 7,000 files, the dataset provides not merely an isolated glimpse but an extended, multi-dimensional forensic cross-section of the Great Firewall’s operational anatomy, revealing system telemetry, logic flows, user sessions, document metadata, application analyses, and network schematics. Far from being an accidental disclosure of logs, this archive represents a curated corpus likely compiled over a prolonged period, indicating either a trusted insider with comprehensive access or a methodical and externally orchestrated data exfiltration campaign.
Two plausible breach pathways emerge from the data. First, a deep internal compromise likely stems from an operator with privileged access, potentially a systems administrator, subcontractor, or disillusioned insider, working from a centralized infrastructure hub. The breadth of materials, including internal routing tables, packet captures, monitoring exports, and user-generated documents, suggests systemic access to both operational and administrative layers of the censorship stack. Metadata uniformity and filename consistency point to deliberate organization, likely done incrementally and with operational awareness. Alternatively, the diversity of systems accessed hints at a second possibility: a coordinated external exfiltration effort carried out by a sophisticated threat actor, such as a nation-state or specialized red team. In this scenario, misconfigurations in firewalls, insecure admin panels, and segmented network seams may have been exploited to gain footholds and siphon data over time. PCAP captures, CPU load logs, and Visio diagram exports suggest persistent access and automated tooling were in play.
Regardless of the breach mechanism, the consequences are profound. Technically, the leak has rendered much of China’s detection arsenal obsolete: VPN heuristics, DPI rule sets, SNI-based fingerprinting algorithms, and application proxy classifiers are now open to scrutiny, replication, and evasion. Operationally, usernames, hostnames, and file authorship data risk exposing government contractors, telecom engineers, and researchers, increasing their vulnerability to naming and shaming, targeted sanctions, or exploitation by rival intelligence services. The documentation of flawed infrastructure, such as packet loss under scan load, looped sinkhole rules, and session state anomalies, presents ripe opportunities for adversarial exploitation. Strategically, this dataset arms censorship circumvention communities, policy advocates, and red teams with the ability to simulate and reverse-engineer enforcement logic, undermining the efficacy of centralized control. In sum, this breach collapses the asymmetry between censor and censored, offering, for the first time, a detailed blueprint of China’s digital surveillance leviathan.
Mapping the Human-Technical Interface
The organizational fingerprints uncovered within the leaked dataset provide a remarkably detailed view into the inner workings of the Great Firewall (GFW) and the ecosystem of actors that maintain and enforce it. Rather than a monolithic structure, the GFW emerges as a multi-tiered apparatus with clearly delineated, yet overlapping, spheres of responsibility. At the top are national censorship policy architects, likely operating under the auspices of the Ministry of State Security (MSS) or the Ministry of Industry and Information Technology (MIIT), who define strategic goals and traffic classification directives. These directives cascade down to regional enforcement units embedded within state-run ISPs like China Telecom, China Unicom, and China Mobile, where they are operationalized at backbone routers and internet exchange points. Academic collaborators, often based in state-linked institutions such as Tsinghua, USTC, or the Chinese Academy of Sciences, serve as technical force multipliers, crafting fingerprinting algorithms, traffic classifiers, and AI-driven detection heuristics. Finally, a shadow layer of software engineers and infrastructure operators maintain the technical systems, dashboards, scheduling agents, and rule propagation mechanisms that implement censorship policy at scale.
Screenshot from the dump: management console
Drawing from Excel logs, packet captures, and Visio topology diagrams, a clearer human and technical map is emerging. Dozens of usernames and hostnames traced across file metadata tie specific individuals to roles such as hardware engineering, data center administration, and network research. Internal monitoring logs document the real-time execution of regional scanning scripts; app-layer inspection routines flagging encrypted VPN protocols; and automated classification of TLS handshakes through SNI fingerprinting. Further network telemetry reveals sophisticated TCP/UDP port scanning patterns, clearly aligned with foreign traffic signature identification. Notably, even as these systems operate with impressive precision, lapses are evident: logs show instances of cross-border traffic escaping inspection, internal blacklist mirrors exposed through misconfiguration, and honeypots receiving foreign reconnaissance traffic. These data points not only reinforce the highly compartmentalized structure of GFW enforcement, but also highlight critical seams in its defensive perimeter, seams that adversaries could exploit with careful targeting.
Metadata Exposure: Attribution Through Digital Breadcrumbs
One of the most revealing and strategically valuable components of the GFW data dump lies not in the structured log files or architectural diagrams, but in the metadata accidentally embedded across thousands of files. These residual traces, often overlooked in threat modeling, offer a rare glimpse into the human and organizational machinery behind China’s censorship apparatus.
The dump exposes dozens of unique usernames, many of which follow consistent naming conventions indicative of internal departmental hierarchies. These include system-level account names (e.g., admin-jw, it_ops_lh, yunwei-wang) and author tags in Office documents, enabling correlation to individual operators. In many cases, authorship data and revision histories link technical documents, such as server topology diagrams, SQL queries, and application configuration logs, to specific personnel across government agencies, telecom subsidiaries, and third-party contractors.
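To illustrate how author tags of this kind are recovered, the sketch below reads the core properties part that modern Office files (.docx, .xlsx, .pptx) carry inside their ZIP container. It is an illustration of the general technique, not the tooling used in this analysis. The helper `make_sample_package` and the sample names are synthetic. Legacy binary formats such as .doc or .vsd store metadata differently and are not covered.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in OOXML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
}

def read_core_properties(source):
    """Return authorship fields from an OOXML file (path or file-like object)."""
    with zipfile.ZipFile(source) as zf:
        try:
            xml_bytes = zf.read("docProps/core.xml")
        except KeyError:
            return {}  # package has no core properties part
    # Note: for untrusted files, a hardened XML parser is the safer choice.
    root = ET.fromstring(xml_bytes)
    out = {}
    for key, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            out[key] = node.text
    return out

def make_sample_package(creator, editor):
    """Build a tiny synthetic package in memory (illustration only)."""
    core = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}" xmlns:dcterms="{dcterms}">'
        "<dc:creator>{creator}</dc:creator>"
        "<cp:lastModifiedBy>{editor}</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    ).format(creator=creator, editor=editor, **NS)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", core)
    buf.seek(0)
    return buf
```

Calling `read_core_properties(make_sample_package("user_a", "user_b"))` returns a dictionary with the `creator` and `last_modified_by` fields. These are the same fields that, across thousands of files, allow documents to be grouped by author.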
Cross-referencing these metadata fields with known Chinese corporate entities and state-linked research institutes has enabled the construction of preliminary attribution clusters. These clusters show clear ties to China Telecom, China Unicom, and China Mobile, as well as connections to academic partners (including digital forensics labs), MSS-linked infrastructure vendors such as Tietong and CETC, and provincial branches of the MIIT.
Notably, multiple files retain internal IP address references and machine hostnames mapped to sandbox and testbed environments used for evaluating censorship evasion tools. These include systems tagged for Psiphon, V2Ray, and Shadowsocks analysis. Some remote server addresses and reverse-proxy logs point to GFW staging zones used to pilot domain interdiction and traffic shaping prior to national rollout.
This corpus of metadata, when enriched through Whois pivots, OSINT facial recognition, and password reuse enumeration, allows for the development of organizational maps and adversary role modeling. These in turn can inform future red-team operations targeting the GFW’s human operators, backend infrastructure, and chain-of-command logic. With metadata drawn from Word, Excel, Visio, and network logs, researchers now hold the building blocks for a relational understanding of censorship personnel and policy execution, from engineers and system admins to project managers and analysts.
This is not just a technical leak; it is a rare unmasking of the people behind the policy.
Among the most valuable aspects of this dump are the accidental leaks of metadata that revealed:
- Dozens of usernames tied to internal departments
- System usernames and document authorship tied to technical operators and analysts
- Organizational affiliations across telecoms, research labs, and suspected MSS-linked infrastructure vendors
- Tracebacks to IP addresses tied to GFW testbed deployments and server farms
A correlation of this data has begun to yield early attribution clusters and organizational modeling, laying the groundwork for adversarial red teaming against censorship controls.
Organizational Fingerprints: Mapping the Bureaucracy Behind the Great Firewall
Beyond the technical evidence of censorship and traffic manipulation, the leaked dataset offers a rare opportunity to construct a socio-technical map of the Great Firewall (GFW) apparatus, not just how it works, but who builds it, who maintains it, and how China’s censorship ecosystem is organizationally compartmentalized.
The metadata extracted from over 7,000 documents, spreadsheets, Visio network maps, text logs, dashboards, and software configuration files reveals a complex lattice of state-linked entities operating in tightly controlled silos. Through usernames, author tags, internal IP assignments, system banners, and internal routing headers, we’ve begun to correlate individuals to functional roles and institutional affiliations.
The internal architecture of the Great Firewall is supported by a network of organizations ranging from state-owned enterprises to elite research institutions and private sector vendors. Core traffic monitoring and enforcement responsibilities are handled by China Telecom, China Unicom, and China Mobile, whose infrastructure appears repeatedly in PCAP logs, IP registries, and system-level telemetry. Metadata from Visio diagrams and scanning scripts links regional enforcement activities to provincial branches such as 广东联通 (Guangdong Unicom) and 河北电信 (Hebei Telecom), indicating decentralized operational cells. At the academic and research level, contributors from the Chinese Academy of Sciences, CNCERT, Tsinghua University, and USTC are implicated in traffic modeling, VPN fingerprinting, and algorithmic SNI detection, functioning in a science-to-policy pipeline. Additional entities like Huaxin, Venustech, and Topsec, believed to have ties to the Ministry of State Security (MSS), appear responsible for developing packet inspection hardware, “smart gateways,” and modular control interfaces. System topology files suggest regional hubs under provincial control, with metadata pointing to a tiered model of command: central rule authors in Beijing and localized operators managing disruptions and resets.
Supporting this infrastructure is a suite of internal tools, including web dashboards for traffic classification, rule propagation, and keyword blacklisting, many of which rely on LDAP-based access and appear to be integrated with institutional Single Sign-On systems. Screenshots and logs expose dynamic control capabilities such as automated session disruption and region-specific enforcement thresholds. Crucially, the dataset reveals extensive metadata leakage: usernames and computer hostnames link individuals to telecom offices and technical roles; document authorship trails help establish personal and institutional attribution. The documents further expose how responsibilities are compartmentalized, illustrating a strict vertical segmentation between engineering, monitoring, and enforcement functions. Overlapping IP clusters, authorship patterns, and PCAP exports across regions hint at interagency coordination, albeit scoped and isolated. Together, these findings allow for the construction of an emerging socio-technical map of the GFW’s human infrastructure, forming the groundwork for attribution modeling and adversarial counter-censorship strategy.
Technical Overview: Core Mechanisms of the GFW Architecture
The leaked dataset exposes a highly modular and deeply integrated censorship architecture underlying the Great Firewall of China. Rather than operating as a single centralized filter, the GFW is revealed to be a distributed system of surveillance and control spanning national, regional, and local network layers. Its enforcement mechanisms include everything from DPI inspection at major internet exchange points to application-layer behavioral analysis and live session manipulation through web-based dashboards. Across the dataset, there is a recurring pattern of siloed technical roles operating under central orchestration, with regional enforcement nodes acting as both detection points and policy executors.
Network Topology Diagram (Five Rings Network, 五环网络): This image from the dump is a logical and physical network topology map of a segmented enterprise or academic network referred to as 五环核心 (Five Rings Core Network). It displays VLAN segmentation, inter-switch trunking, DHCP assignments, and guest/staff/IPv6/WiFi zones, possibly reflecting real-world infrastructure used in Chinese internal IT or censorship-research testbeds.
At the core of traffic interception are the state-run ISPs, China Telecom, China Unicom, and China Mobile, which serve as both service providers and surveillance intermediaries. Logs from these providers document the interception and classification of traffic based on packet content, using deep packet inspection (DPI) techniques. These techniques target TLS/HTTPS session metadata, such as Server Name Indication (SNI) fields, and distinguish potentially suspicious connections based on protocol anomalies, including entropy, timing patterns, and payload structures. The infrastructure supports detection of known circumvention tools such as Shadowsocks, V2Ray, and Psiphon. Visio network diagrams show these DPI modules deployed at key peering points, especially in major metropolitan areas and provincial backbones, suggesting a tiered control model.
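SNI inspection works because the hostname in a standard TLS ClientHello travels in cleartext. The sketch below shows how any passive observer can read it from a single TLS record. It is a generic illustration of the technique and is not code from the dump. `build_client_hello` is a hypothetical helper that produces a synthetic handshake for demonstration.

```python
import struct

def extract_sni(record: bytes):
    """Return the SNI hostname from one TLS ClientHello record, or None."""
    try:
        if record[0] != 0x16:              # record type: handshake
            return None
        pos = 5                            # skip record header (type, version, length)
        if record[pos] != 0x01:            # handshake type: ClientHello
            return None
        pos += 4                           # handshake type + 3-byte length
        pos += 2 + 32                      # client version + random
        pos += 1 + record[pos]             # session id
        (cs_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + cs_len                  # cipher suites
        pos += 1 + record[pos]             # compression methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:              # server_name extension
                # layout: list length (2), name type (1), name length (2), name
                if record[pos + 2] != 0:   # only host_name entries are handled
                    return None
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                name = record[pos + 5 : pos + 5 + name_len]
                if len(name) != name_len:
                    return None            # truncated name
                return name.decode("ascii", errors="replace")
            pos += ext_len
        return None
    except (IndexError, struct.error):
        return None                        # truncated or malformed input

def build_client_hello(hostname: str) -> bytes:
    """Build a minimal synthetic ClientHello carrying only an SNI extension."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    ext_data = struct.pack("!H", len(entry)) + entry
    extensions = struct.pack("!HH", 0, len(ext_data)) + ext_data
    body = (
        b"\x03\x03" + bytes(32)                 # version + zeroed random
        + b"\x00"                               # empty session id
        + struct.pack("!H", 2) + b"\x13\x01"    # one cipher suite
        + b"\x01\x00"                           # null compression
        + struct.pack("!H", len(extensions)) + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake
```

The sketch assumes the ClientHello fits in one record. Real inspection systems also have to reassemble fragmented handshakes. Clients using Encrypted Client Hello do not expose the inner hostname in this way.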
Application-level analysis is conducted using fingerprinting heuristics derived from both raw network characteristics and behavioral modeling. Various Excel spreadsheets and telemetry exports include references to TLS fingerprinting rules, heuristic classifiers for VPN/proxy traffic, and statistical models used to flag encrypted tunnels. These analyses rely on databases of SNI patterns, handshake behaviors, and traffic volume profiles. Simpler applications are captured through static indicators, while more sophisticated obfuscated traffic is subjected to sketch-based detection, a form of lightweight signature modeling. This reveals a layered approach to detection, with different modules specializing in different levels of granularity and evasiveness.
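One of the simplest statistical signals for flagging encrypted tunnels is byte entropy: fully encrypted payloads look close to uniformly random, while plaintext protocols do not. The sketch below is a generic illustration of that idea, not a reconstruction of any classifier in the dump. The threshold and minimum length are hypothetical values chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(payload: bytes) -> float:
    """Entropy in bits per byte: 0 for constant data, 8 for uniformly distributed bytes."""
    if not payload:
        return 0.0
    n = len(payload)
    return -sum((c / n) * math.log2(c / n) for c in Counter(payload).values())

def looks_fully_encrypted(payload: bytes, threshold: float = 7.0, min_len: int = 256) -> bool:
    """Flag payloads that are long enough and nearly uniform in byte distribution.

    threshold and min_len are illustrative assumptions, not values from the leak.
    """
    return len(payload) >= min_len and shannon_entropy(payload) >= threshold
```

Entropy alone produces false positives, since compressed media and ordinary TLS application data also score high. That is consistent with the layered approach described above, in which such a signal would be combined with handshake behavior, timing, and volume profiles.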
Diagram title (online translation): Anonymous DNS Resolution System via Tor Network with DoH (DNS-over-HTTPS) Encryption
Routing logic and censorship enforcement are governed by automated scripts and control schemas that appear to be distributed from centralized locations to regional nodes. Python and shell scripts uncovered in the dataset automate the scanning of IP ranges, the classification of foreign nodes, and the deployment of routing directives. Routing tables, sinkhole IP lists, and blackhole redirects provide insight into how traffic is rerouted or silently dropped based on the policy logic defined upstream. Several control files appear to be distributed on a schedule or in response to live triggers, showing both manual and autonomous enforcement methods. This system likely allows Beijing-based control centers to push directives to provincial-level enforcement arms, where localized engineers and systems perform filtering or inspection with scoped authority.
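The way sinkhole lists and blackhole redirects turn into per-packet decisions can be sketched as a longest-prefix lookup against a policy table. This is a simplified model under stated assumptions, not the routing logic found in the dump. The prefixes are reserved documentation ranges, and the action names (`sinkhole`, `drop`, `forward`) are hypothetical labels.

```python
import ipaddress

# Hypothetical policy entries; prefixes come from reserved documentation ranges.
POLICY = [
    ("192.0.2.0/24", "sinkhole"),
    ("192.0.2.128/25", "drop"),
    ("198.51.100.0/24", "drop"),
]

def compile_policy(entries):
    """Parse prefixes and order them so the most specific match is found first."""
    table = [(ipaddress.ip_network(prefix), action) for prefix, action in entries]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table

def decide(table, address, default="forward"):
    """Return the action of the longest matching prefix, or the default action."""
    ip = ipaddress.ip_address(address)
    for network, action in table:
        if ip.version == network.version and ip in network:
            return action
    return default
```

With `table = compile_policy(POLICY)`, an address inside the more specific /25 resolves to `drop`, while the rest of the /24 resolves to `sinkhole`. In the tiered model described above, a table like this would be authored centrally and pushed to regional nodes. A linear scan is used here for clarity; production routers use tries or hardware lookup tables.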
Operational state is maintained through a robust internal monitoring ecosystem. Included in the leak are comprehensive exports of CPU usage, memory performance, service uptime logs, and stream-based telemetry. These system-wide diagnostics provide not only visibility into the technical health of enforcement systems, but also allow higher-level auditing of session disruptions, filtering efficacy, and infrastructure stability. Screenshots from management interfaces and logs from web-based control dashboards suggest that operators are provided with real-time analytics, interactive filtering toggles, and user/session views. Most of these systems rely on enterprise-grade authentication mechanisms, such as LDAP-based Single Sign-On (SSO), indicating tight coupling between enforcement tooling and institutional IT frameworks.
System Status Network Topology Diagram. Organization: China Information and Communication Design Institute Co., Ltd. (中讯邮电咨询设计院有限公司)
An unexpected but critical component of the breach is the metadata embedded within documents and logs. Authorship tags, file paths, and computer hostnames have linked hundreds of documents to individual users, systems, and organizations. These human fingerprints offer unprecedented visibility into the organizational structure behind the GFW’s operation. Engineers, data analysts, lab researchers, and regional technicians are all traceable by name or system alias. Many entries refer to known ISPs, national labs, or university-affiliated nodes, suggesting that the enforcement apparatus spans a wide constellation of public-private partnerships, military-academic collaborations, and centralized policy deployment.
Together, these findings constitute a unique technical cross-section of the Chinese censorship-industrial complex, revealing not just what is filtered or how, but who enforces it, who maintains the infrastructure, and how decisions flow through the layered topology of digital control.
What Comes Next
This report is the first installment in a three-part investigative series into the unprecedented breach of China’s censorship apparatus. Part 1 has centered on exposing the dataset’s contents and evaluating its technical, organizational, and strategic significance, but it is only the beginning. The sheer scale and complexity of the leak, over 500GB of internal GFW infrastructure data, demand a methodical, layered approach to fully grasp its implications. The next two parts in this series will delve deeper, uncovering the architecture of China’s censorship regime and examining the wider consequences for global digital governance.
Part 2 – The Architecture will offer a forensic reconstruction of how the Great Firewall actually works at the technical level. Leveraging the internal Visio network diagrams, log schematics, scanning schedules, app fingerprinting routines, and heuristic rule exports uncovered in the dump, we will map the core design of the censorship stack. This includes how packets are intercepted, filtered, redirected, or dropped; how apps like Psiphon and V2Ray are detected at the protocol level; and how traffic shaping is deployed based on geography, ISP, or session context. The analysis will also break down the GFW’s modular enforcement structure, highlighting regional control points, the roles of telecom and research institutions, and the likely contribution of vendors with MSS affiliations in building out control interfaces and automated classifiers.
Part 3 – Geopolitics and The Fallout will address the broader implications. This breach does more than reveal technical controls; it changes the strategic calculus of censorship resistance. We will assess how the exposure reshapes China’s ability to sustain its domestic information control and international cyber operations, and how it informs countermeasures by VPN developers, privacy advocates, and democratic governments. Ethical and legal questions will also be raised: what does responsible engagement with such data look like? And how should open societies use this moment to harden digital rights, strengthen transparency norms, and resist the spread of authoritarian control models abroad? With this series, we aim to present not just the most complete picture yet of the GFW, but a roadmap for pushing back against the machinery of state censorship.