17/01/2024
Join us this time next week for our first CeADAR Tech Talk of 2024, 'GPT-like Pre-Training on Unlabelled System Logs for Malware Detection', with guest speaker Dmitrijs Trizna of Microsoft.
In recent years, self-supervised language modeling techniques, such as those used in GPT-like language models, have shown great success in natural language processing tasks, without requiring supervision from domain experts to learn language semantics. In this talk, we explore the transferability of these techniques to system logs and share our methodology for pre-training a Transformer model on unlabeled logs for malware detection.
Infrastructures generate vast amounts of system logs suitable for cybersecurity needs, but only a fraction of these logs is labeled and annotated for specific events or anomalies. Our experiments demonstrate that pre-training the model on unlabeled system logs leads to improved performance on the task of malware detection, compared to training on labeled data alone. Moreover, we show that the pre-trained model learns patterns similar to those a human engineer would consider relevant when detecting malware.
These findings highlight the potential of pre-training GPT-like models on system logs for cybersecurity applications, and demonstrate the benefits of self-supervised learning approaches in domains where labeled data is scarce. Overall, our work contributes to the growing body of literature on applying language modeling techniques beyond natural language processing and opens up new avenues for research in the field of cybersecurity.
Register using the link below:
https://ucd-ie.zoom.us/webinar/register/WN_WZ-MAtOJQii8cs3Bmi0fUw#/registration