Table Of Links
Abstract
1 Introduction
2 Background
3 Privacy-Relevant Methods
4 Identifying API Privacy-relevant Methods
5 Labels for Personal Data Processing
6 Process of Identifying Personal Data
7 Data-based Ranking of Privacy-relevant Methods
8 Application to Privacy Code Review
9 Related Work
Conclusion, Future Work, Acknowledgement And References
\
Related Work
Research in source code analysis for privacy is extensive, yet approaches aimed specifically at identifying personal data processing are limited. Ullah et al. [13] introduced an approach for extracting control and data dependencies in source code, which could be applied to locating methods that process personal data but was not designed for this purpose. Hjerppe et al. [2] proposed an annotation-based static analysis for data protection, but its effectiveness depends on accurate developer annotations, which are hard to maintain in large projects.
\
Dynamic analysis has been explored for sensitive data flow detection, with DAISY [15] focusing on Android apps and ConDySTA [16] combining dynamic taint analysis with static analysis. However, these methods have limitations, such as platform specificity or the need to execute the project under analysis. Automated assistance in code review has been explored by Li et al. [3] with their pre-trained model CodeReviewer, but it does not focus on personal data processing.
\
SWANAssist [5] offers a semi-automated approach for identifying security-relevant methods in Java code, which could potentially be adapted for privacy purposes. Other studies, such as [1, 12], attempt to align static analysis with GDPR compliance. Novikova et al. [4] provided insights into privacy-enhancing technologies but did not focus on personal data processing in source code.
\
These studies mark substantial progress in source code analysis, yet a gap remains in the automated identification and categorization of personal data processing.
Our work addresses this by proposing an automated approach for identifying personal data processing in real-world applications, enhancing efficiency in privacy code reviews.
\
Conclusion
Our study introduces a method for identifying and categorizing privacy-relevant methods in source code, focusing on personal data processing. Across 100 popular open-source applications, we narrowed the analysis scope to just 4.2% of methods, offering a practical starting point for developers, data protection officers, and reviewers.
\
This approach not only simplifies code reviews but also facilitates compliance with data protection regulations like GDPR, helping organizations align their software development with legal requirements. For future work, we aim to enhance the precision of our privacy-relevant method identification algorithms, possibly integrating machine learning for more accurate predictions of personal data processing activities.
\
Expanding our approach to additional programming languages and integrating it into common development tools for real-time feedback are also key goals. These advancements will broaden the impact and applicability of our approach. Ultimately, our research paves the way for more focused and efficient privacy assessments in software development, contributing to the creation of software that is efficient, robust, and respectful of user privacy.
\
Acknowledgement
This work is part of the Privacy Matters (PriMa) project. The PriMa project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 860315.
\
References
[1] Ferrara, P., Olivieri, L., Spoto, F.: Tailoring taint analysis to GDPR. In: Privacy Technologies and Policy: 6th Annual Privacy Forum, APF 2018, Barcelona, Spain, June 13-14, 2018, Revised Selected Papers 6. pp. 63–76. Springer (2018)
[2] Hjerppe, K., Ruohonen, J., Leppänen, V.: Annotation-based static analysis for personal data protection. In: Privacy and Identity Management. Data for Better Living: AI and Privacy, pp. 343–358. Springer International Publishing (2020)
[3] Li, Z., Lu, S., Guo, D., Duan, N., Jannu, S., Jenks, G., Majumder, D., Green, J., Svyatkovskiy, A., Fu, S., Sundaresan, N.: Automating code review activities by large-scale pre-training (2022)
[4] Novikova, E., Fomichov, D., Kholod, I., Filippov, E.: Analysis of privacy-enhancing technologies in open-source federated learning frameworks for driver activity recognition. Sensors 22(8), 2983 (2022)
[5] Piskachev, G., Do, L.N.Q., Johnson, O., Bodden, E.: SWANAssist: Semi-Automated Detection of Code-Specific, Security-Relevant Methods. In: Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering. pp. 1094–1097. ASE’19, IEEE Press (2020). https://doi.org/10.1109/ASE.2019.00110
[6] van der Plas, N.: Detecting PII in Git commits (2022), http://resolver.tudelft.nl/uuid:fe195c17-ecf5-4811-a987-89f238a6802f
[7] Ren, J., Rao, A., Lindorfer, M., Legout, A., Choffnes, D.: ReCon: Revealing and Controlling PII Leaks in Mobile Network Traffic. In: Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. pp. 361–374. MobiSys ’16, Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2906388.2906392
[8] Tang, F., Østvold, B.M.: Assessing Software Privacy Using the Privacy Flow-Graph. In: Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security. pp. 7–15. MSR4P&S 2022, Association for Computing Machinery, New York, NY, USA (2022)
[9] Tang, F., Østvold, B., Bruntink, M.: Identifying Personal Data Processing for Code Review. In: Proceedings of the 9th International Conference on Information Systems Security and Privacy - ICISSP. pp. 568–575. INSTICC, SciTePress (2023). https://doi.org/10.5220/0011725700003405
[10] Tang, F., Østvold, B.M., Bruntink, M.: Helping Code Reviewer Prioritize: Pinpointing Personal Data and Its Processing. IOS Press (Sep 2023). https://doi.org/10.3233/faia230228
[11] Thongtanunam, P., Hassan, A.E.: Review dynamics and their impact on software quality. IEEE Transactions on Software Engineering 47(12), 2698–2712 (2020)
[12] Tokas, S., Owe, O., Ramezanifarkhani, T.: Static checking of GDPR-related privacy compliance for object-oriented distributed systems. Journal of Logical and Algebraic Methods in Programming 125, 100733 (2022)
[13] Ullah, F., Wang, J., Jabbar, S., Al-Turjman, F., Alazab, M.: Source code authorship attribution using hybrid approach of program dependence graph and deep learning model. IEEE Access 7, 141987–141999 (2019)
[14] Vallée-Rai, R., Co, P., Gagnon, E., Hendren, L., Lam, P., Sundaresan, V.: Soot: A Java bytecode optimization framework. In: CASCON First Decade High Impact Papers, pp. 214–224 (2010)
[15] Zhang, X., Heaps, J., Slavin, R., Niu, J., Breaux, T., Wang, X.: DAISY: Dynamic-Analysis-Induced Source Discovery for Sensitive Data. ACM Trans. Softw. Eng. Methodol. 32(4) (May 2023)
[16] Zhang, X., Wang, X., Slavin, R., Niu, J.: ConDySTA: Context-Aware Dynamic Supplement to Static Taint Analysis. In: 2021 IEEE Symposium on Security and Privacy (SP). pp. 796–812 (2021). https://doi.org/10.1109/SP40001.2021.00040
\
:::info
Authors:
Feiyang Tang
Bjarte M. Østvold
:::
:::info
This paper is available on arXiv under CC BY-NC-SA 4.0 license.
:::
\