The Verrazzano Voyager: Detecting Anomalies in Network Traffic Data

Isabel Loci, Verrazzano Class of 2026, completed major in Computer Science and minor in Mathematics

Sometime in the last year, I fell down a rabbit hole of learning about the best-known cyber-attacks in history. These attacks date back decades, to when the online world was still fairly new, and they have grown more complex as it developed. I have always been concerned about the safety and privacy of any information I use online, so it got me thinking: how do people mitigate attacks at such a large scale? At the time, I was participating in a year-long data science boot camp, and we were focusing on building AI models using machine learning. The basic idea of machine learning is a system that learns from previously collected data, detects patterns, and then makes decisions on its own without being explicitly programmed for each case. This allows a trained system to handle data it has never seen before. Neat, right?

In theory, I thought, it should be possible to create a machine learning-based AI model that can distinguish normal network traffic from malicious network traffic. So a few friends and I decided to bring this idea to life. Our project was built in Python and then deployed through Hugging Face. The model's final accuracy was 92.94%. Good, but it could've been way better.

When I was brainstorming ideas with my mentor, Dr. Huo, I mentioned my project. I talked about how the methodology we chose had a lot of room for improvement, and how I would have liked to go back and start over if only I knew what approach to try next. Then she suggested a simple but brilliant idea: what if my capstone was a deep dive into the research on anomaly-based intrusion detection?

It all clicked then. I had previously read one or two papers by people who had attempted the same project, but not much beyond that. What I hadn't realized was just how many ways data scientists had solved this problem in the past, not just with traditional machine learning methods but also with deep learning ones. My work revolved around studying as many research papers as I could, summarizing my findings, and presenting them in the form of a survey paper.

What I found hardest (and a little funny) was that whenever I came across a concept I hadn't heard of before, four more fundamental concepts were attached to it that I absolutely could not leave out of the survey paper. I spent hours upon hours studying these concepts, breaking down complex formulas, and comparing the results of different methodologies simply to understand the hidden connections between them.

There is a general step-by-step process that data scientists follow when building an ML model: sourcing the data to be used for training, preprocessing the data, selecting an appropriate algorithm, training the model, and finally evaluating it. Data preprocessing is important because a dataset can have issues such as missing values, duplicate columns, and bad or inconsistent column names, all of which can hurt the accuracy of the model. The dataset can also have too many samples of one attack and not enough of another, in which case it needs to be balanced. The algorithms I surveyed fall into two categories, traditional machine learning and deep learning, and each category contains several distinct algorithms. After the model is trained, it needs to be evaluated using industry-standard performance metrics that estimate how well it is doing. If the methods are implemented correctly, the metrics will reflect the true performance of the model. Just be careful: a score that looks too good to be true is often misleading! We strive to get close, but not too close, to a 100% accuracy score.
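To make the preprocessing and evaluation points concrete, here is a minimal Python sketch. It is not the code from our project or from any paper in the survey. The function names `clean_rows` and `evaluate`, the column names, and the 95-to-5 split between normal and attack traffic are all made up for illustration, and dropping duplicate rows is my own addition to the cleanup steps mentioned above.

```python
from collections import Counter


def clean_rows(rows):
    """Normalize column names, then drop rows with missing values and exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        # Hypothetical naming rule: " Flow Duration " becomes "flow_duration".
        row = {k.strip().lower().replace(" ", "_"): v for k, v in row.items()}
        if any(v is None or v == "" for v in row.values()):
            continue  # skip rows with a missing value
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip rows we have already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


def evaluate(y_true, y_pred, positive="attack"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy records (invented): one duplicate row and one row with a missing value.
rows = [
    {" Flow Duration ": 120, "Label": "normal"},
    {" Flow Duration ": 120, "Label": "normal"},
    {" Flow Duration ": None, "Label": "attack"},
    {" Flow Duration ": 9000, "Label": "attack"},
]
cleaned = clean_rows(rows)
print(cleaned)

# An imbalanced label set and a "model" that always predicts normal traffic.
y_true = ["normal"] * 95 + ["attack"] * 5
y_pred = ["normal"] * 100
print(Counter(y_true))
scores = evaluate(y_true, y_pred)
print(scores)
```

The always-normal "model" reaches 95% accuracy while catching none of the attacks, so its recall is 0. That is why an imbalanced dataset makes accuracy alone misleading, and why I look at precision, recall, and F1 alongside it.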

This has been the most extensive research I have gotten to work on throughout my undergraduate years, and it has sparked a love in me to do more. Already I wish to delve into more papers written for different fields. I want to study Literature, Psychology, Physics, the Arts, and so much more. It has changed the way I interact with the world around me and how I decide to embrace and utilize new information.

Working on a thesis might not have been my first idea for completing my capstone, but I am so glad it's what I chose in the end.

Monday, March 23, 2026
