PhD Position F/M Constructing Cyber Security Knowledge Encoding and Reasoning from Heterogeneous Sources with Large Language Models
Inria
Rennes
The job description below is in English
Contract type: Fixed-term contract (CDD)
Required degree level: Master's degree (Bac + 5) or equivalent
Position: PhD student
About the centre or functional department
The Inria Centre at Rennes University is one of Inria's eight centres and has more than thirty research teams. The Inria Centre is a major and recognized player in the field of digital sciences. It is at the heart of a rich R&D and innovation ecosystem: highly innovative SMEs, large industrial groups, competitiveness clusters, research and higher education players, laboratories of excellence, technological research institutes, etc.
Context and advantages of the position
This PhD position is part of the ANR project "CKRISP".
Assignment
Background
AI-based cyber-attack detection models face several bottlenecks that need to be addressed. First, AI models require large amounts of training data to cover as many attack behaviours as possible. In cybersecurity practice, however, the deployed probes cannot guarantee comprehensive coverage of the different attack behaviours, especially emerging attacks. This lack of coverage, i.e. the issue of out-of-distribution samples in attack detection [Yang21], makes it challenging to accurately categorise the threats being faced. Second, AI models may generate a large number of false positives that are difficult to distinguish from real intrusions. This issue stems from the lack of interpretability of AI models with complex architectures, e.g. Deep Neural Networks. The difficulty of comprehending the detection logic of AI models leads to mistrust and a lack of accountability in AI-based detection technologies. In light of these limitations of AI-based attack detection, we believe that representing cyber security knowledge, e.g. encoding and inferring attack behaviours with AI models, is the key to delivering verifiable and accountable AI-based cyberattack prediction systems.
The objective of this thesis
We will design automated attack knowledge fusion from heterogeneous attack information sources, such as dynamic analysis traces of malware samples and high-level threat reports on malware families written by human analysts, by combining Large Language Models (LLMs) with Knowledge Graph (KG) based AI models. First, we will focus on identifying the entities of security incidents (Named Entity Recognition, NER) and the relations between these entities (Relation Extraction, RE) [Rastogi20] in logs and reports using LLMs. LLMs are deployed as generative models to produce and match candidates for meaningful entities in security incidents [Rastogi20, Das22]. Furthermore, LLMs can be used to tokenise low-level behavioural logs and to reason across logs, exploring possible causal relations via text embedding techniques. The extracted entities and the relations between them will be used to build the security knowledge graph.
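As a rough illustration of the last step above, the sketch below assembles a small knowledge graph from (head, relation, tail) triples. It assumes the NER and relation-extraction stage has already produced such triples; the entity names, relation labels and the build_knowledge_graph helper are all hypothetical and are not part of the project description.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples standing in for the output of
# an LLM-based NER / relation-extraction step. Illustrative values only.
triples = [
    ("malware_X", "drops", "payload.dll"),
    ("malware_X", "connects_to", "198.51.100.7"),
    ("payload.dll", "modifies", "registry_run_key"),
    ("malware_X", "drops", "payload.dll"),  # duplicate extraction
]


def build_knowledge_graph(triples):
    """Return an adjacency map: head entity -> set of (relation, tail) edges."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        # Using a set merges duplicate extractions of the same fact.
        graph[head].add((relation, tail))
        # Register the tail as a node even if it has no outgoing edges.
        graph.setdefault(tail, set())
    return dict(graph)


kg = build_knowledge_graph(triples)
for entity, out_edges in kg.items():
    print(entity, sorted(out_edges))
```

Storing edges in sets is one simple way to fuse repeated mentions of the same fact coming from different sources; a real system would also need entity normalisation, which this sketch leaves out.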
Over the constructed knowledge graphs, we propose to perform causal-reasoning-based attack prediction via reinforcement learning (RL)-based exploration of cyber security knowledge graphs. We will further integrate human analysts' feedback to refine the reasoning capability of the RL-based exploration models. To achieve this goal, we plan to develop few-shot learning methods [Hejna22], such as few-shot preference learning with human feedback, to tailor the exploration policy learnt from the knowledge graphs. Furthermore, we will focus on continual learning [Zhangxk22] to dynamically update the contents of the cyber security knowledge graphs. Similar to active learning, this approach also allows human experts to encode verified and corrected causal relations as additional supervision for updating links and entities in the knowledge graphs.
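To make the idea of RL-based exploration over a knowledge graph more concrete, here is a minimal sketch using plain tabular Q-learning on a toy attack graph. It is a deliberately simplified stand-in, not the method the thesis will develop: the graph, the node names, the reward of 1 for reaching a target node and all hyperparameters are assumptions, and human feedback and few-shot preference learning are left out.

```python
import random

# Hypothetical toy attack graph: node -> list of successor nodes.
edges = {
    "phishing_email": ["macro_execution", "benign_open"],
    "macro_execution": ["payload_download"],
    "benign_open": [],
    "payload_download": ["c2_beacon"],
    "c2_beacon": [],
}
TARGET = "c2_beacon"  # assumed attack outcome we want to predict


def train_q_table(edges, start, target, episodes=200, alpha=0.5,
                  gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning where an action is the choice of the next node."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s, succ in edges.items() for a in succ}
    for _ in range(episodes):
        state = start
        # An episode ends at the target or at a node with no outgoing edge.
        while state != target and edges[state]:
            actions = edges[state]
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            reward = 1.0 if action == target else 0.0
            future = max((q[(action, n)] for n in edges[action]), default=0.0)
            q[(state, action)] += alpha * (
                reward + gamma * future - q[(state, action)]
            )
            state = action
    return q


def greedy_path(edges, q, start, target):
    """Follow the highest-valued edge from start until no move is possible."""
    path = [start]
    while path[-1] != target and edges[path[-1]]:
        state = path[-1]
        path.append(max(edges[state], key=lambda a: q[(state, a)]))
    return path


q_table = train_q_table(edges, "phishing_email", TARGET)
predicted = greedy_path(edges, q_table, "phishing_email", TARGET)
print(" -> ".join(predicted))
```

In this toy setting the learnt values rank the edge towards macro_execution above the dead-end edge, so the greedy walk yields an explicit path that an analyst could inspect. Analyst feedback could, for example, be folded in as an additional reward term, but that design choice is outside the scope of this sketch.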
Expectations
The candidate for this thesis is expected to have completed courses on Machine Learning and/or to have experience implementing Machine Learning algorithms with Python toolboxes, e.g. PyTorch, for practical data mining problems. Knowledge of cyber security, such as malware classification and/or network intrusion detection, is required. Candidates with experience in security data analysis will be preferred.
References
[Yang21] Yang, L.M., Guo, W.B., Hao, Q.Y., et al., CADE: Detecting and Explaining Concept Drift Samples for Security Applications, In 30th USENIX Security Symposium, pp. 2327-2344, 2021.
[Rastogi20] Rastogi, N., Dutta, S., Zaki, M.J., MalOnt: An ontology for malware threat intelligence, International Workshop on Deployable Machine Learning for Security Defense, pp. 28-44, 2020.
[Das22] Das, S.S., Dutta, A., Purohit, S., Serra, E., Halappanavar, M., Pothen, A., Towards Automatic Mapping of Vulnerabilities to Attack Patterns using Large Language Models, In IEEE International Symposium on Technologies for Homeland Security, pp. 1-7, 2022.
[Hejna22] Hejna, J. and Sadigh, D., Few-Shot Preference Learning for Human-in-the-loop RL, In Annual Conference on Robot Learning, 2022.
[Zhangxk22] Zhang, X.K. et al., CGLB: Benchmark Tasks for Continual Graph Learning, In NeurIPS 2022, 2022.
Main activities
This thesis will be conducted at Inria Rennes and co-supervised with EURECOM researchers at Sophia-Antipolis. The monthly gross salary for the PhD candidate is around 2000 euros. Applicants should submit their resume, cover letter and letters of recommendation online.
Skills
Technical skills and level required: Machine Learning, Statistics, Information theory, PyTorch, Intrusion Detection
Languages: English
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Possibility of teleworking (90 days per year) and flexible organization of working hours
- Partial payment of insurance costs
Remuneration
Monthly gross salary amounting to 2082 euros for the first and second years and 2190 euros for the third year