How Static Analysis Can Expose Personal Data Hidden in Source Code

Table of Links

Abstract
1 Introduction
2 Background
3 Privacy-Relevant Methods
4 Identifying API Privacy-relevant Methods
5 Labels for Personal Data Processing
6 Process of Identifying Personal Data
7 Data-based Ranking of Privacy-relevant Methods
8 Application to Privacy Code Review
9 Related Work
Conclusion, Future Work, Acknowledgement and References

6 Process of Identifying Personal Data

Before delving into the approach, it is crucial to differentiate between personal data and personally identifiable information (PII). Both relate to an individual, but PII is the narrower category: data that directly identifies a person. Examples include account information, contact details, personal IDs, and national IDs. Not all of the 10 categories of personal data we consider below fall under PII. The exposure of PII is especially concerning because it can lead to personal or psychological harm, such as identity theft.

Our primary aim is to identify the flow of personal data within a codebase, focusing on its implications for privacy. To achieve this, we use a pattern-matching technique inspired by Tang et al. [?]. This technique identifies data from 10 categories, including Account, Contact, Personal ID, Location, and National ID. We employ Semgrep, a tool tailored for pattern matching in code, to facilitate this process. The Semgrep rules are designed specifically for Java and JavaScript.

6.1 Static Analysis for Personal Data Identification

The initial phase of our approach uses static analysis to locate code fragments that contain personal data. We use Semgrep for this task because it is efficient and flexible when analyzing large codebases. We rely on Semgrep's support for multiple languages and its capabilities for local data flow analysis.

6.2 Defining Sources of Personal Data

In our analysis, sources are the places in the code where personal data appears.
We identify personal data in two ways: (1) as literal text present in the source code, and (2) as variables, based on their identifier names. Our identification rules support Java, JavaScript, and TypeScript and can be extended to other languages that Semgrep supports.

6.3 Rule Crafting for Identification

To pinpoint literal personal data, we use regular expression (regex) matching, for example to recognize the format of national ID numbers. For variable sources, we maintain a default list of identifiers that correspond to the 10 categories of personal data. These identifiers help us formulate Semgrep rules. To reduce false positives, we impose specific conditions on these regex rules. For instance, to capture all human names in the code, we use a regex pattern that accommodates variations like first, last, and full names:

(?i).*(?:first|given|full|last|sur(?!geon))[\s/(;)|,=!>]*name.*

The negative lookahead in sur(?!geon) is one such condition: it keeps an identifier containing "surgeon" from being mistaken for "surname".

:::info
Authors: Feiyang Tang and Bjarte M. Østvold
:::

:::info
This paper is available on arXiv under the CC BY-NC-SA 4.0 license.
:::
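The two kinds of sources described in Sections 6.2 and 6.3 can be illustrated with a minimal Python sketch. It is not the authors' Semgrep rule set: it only exercises the regex-matching idea outside Semgrep. The name pattern is a reconstruction of the one quoted in Section 6.3 (the quantifiers and the `\s` escape are assumed), the 11-digit national ID format is a hypothetical stand-in because the text does not specify one, and the function names are invented for illustration.

```python
import re

# Reconstruction of the name pattern quoted in Section 6.3.
# Assumption: the `*` quantifiers and the `\s` escape are inferred.
NAME_IDENTIFIER = re.compile(
    r"(?i).*(?:first|given|full|last|sur(?!geon))[\s/(;)|,=!>]*name.*"
)

# Hypothetical literal pattern: a standalone 11-digit number standing in
# for a national ID format. Real formats differ by country.
NATIONAL_ID_LITERAL = re.compile(r"\b\d{11}\b")


def is_name_identifier(identifier: str) -> bool:
    """Return True if the whole identifier matches the name pattern."""
    return NAME_IDENTIFIER.fullmatch(identifier) is not None


def find_national_id_literals(source_line: str) -> list:
    """Return every substring of a source line that matches the ID format."""
    return NATIONAL_ID_LITERAL.findall(source_line)


if __name__ == "__main__":
    for ident in ["firstName", "surname", "surgeonName", "filename"]:
        print(ident, is_name_identifier(ident))
    print(find_national_id_literals('String id = "01017012345";'))
```

As reconstructed here, the separator class between the prefix and `name` does not include an underscore, so a snake_case identifier such as `last_name` is not matched. A rule set aimed at codebases that use that naming style would probably need to extend the class.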
