Code and datasets for my bachelor's thesis.
The code in this project is licensed under the GNU GPLv3. The data that I have collected and processed (e.g. in the `data_*` folders) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Note that the `domains` folder contains the list of domains as used in the experiment, courtesy of the Tranco list.
If you use this code or data, please cite the accompanying paper:

```bibtex
@misc{darwinkel2024fingerprinting,
    title={Fingerprinting web servers through Transformer-encoded HTTP response headers},
    author={Patrick Darwinkel},
    year={2024},
    month={mar},
    eprint={2404.00056},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url={https://doi.org/10.48550/arXiv.2404.00056}
}
```
We investigated the feasibility of using state-of-the-art deep learning, big data, and natural language processing techniques to improve the detection of vulnerable web server versions. Because knowledge of specific version information is crucial for vulnerability analysis, we attempted to improve the accuracy and specificity of web server fingerprinting compared with existing rule-based systems.

To investigate this, we made various ambiguous and non-standard HTTP requests to 4.77 million domains and collected the HTTP response status lines for each request. We then trained a byte-level BPE tokenizer and a RoBERTa encoder to represent these individual status lines through unsupervised masked language modelling. To represent the web server of a single domain, we dimensionality-reduced each encoded response line and concatenated the results. A Random Forest and a multilayer perceptron classified these encoded, concatenated samples. The classifiers achieved macro F1-scores of 0.94 and 0.96, respectively, on detecting five of the most popular publicly available origin web servers, which together make up roughly half of the World Wide Web. On the task of classifying 347 major-type and minor-version pairs, the multilayer perceptron achieved a weighted F1-score of 0.55. Analysis of Gini impurity suggests that our test cases are meaningful discriminants of web server types. Our high F1-scores are unprecedented and demonstrate that the proposed method is a viable alternative to traditional rule-based systems. In addition, this method opens up avenues for future work, many of which will likely result in even greater classification performance.
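For intuition, here is a minimal, self-contained sketch of that pipeline built from off-the-shelf components. It is not the code in this repository: `roberta-base` stands in for the custom-trained tokenizer and encoder, the naive mean pooling is an illustrative choice, and the toy PCA size replaces the 64 components used in the thesis.

```python
# Hypothetical sketch of the method described above; model, pooling, and
# sizes are illustrative assumptions, not the repository's actual code.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in for the custom BPE tokenizer
encoder = AutoModel.from_pretrained("roberta-base")        # stand-in for the custom-trained encoder

def encode_lines(lines):
    """Encode each status line and mean-pool the last hidden state (naive: includes padding)."""
    batch = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (n_lines, seq_len, 768)
    return hidden.mean(dim=1).numpy()

# One list of status lines per domain, one line per probe request.
domains = [
    ["HTTP/1.1 200 OK", "HTTP/1.1 400 Bad Request", "HTTP/1.1 505 HTTP Version Not Supported"],
    ["HTTP/1.1 200 OK", "HTTP/1.1 404 Not Found", "HTTP/1.0 400 Bad Request"],
]
labels = ["nginx", "Apache"]

flat = encode_lines([line for lines in domains for line in lines])
pca = PCA(n_components=2)  # the thesis uses 64 components; 2 keeps this toy example valid
reduced = pca.fit_transform(flat)

# Concatenate the per-request vectors into one fixed-length sample per domain.
n_probes = len(domains[0])
samples = reduced.reshape(len(domains), n_probes * pca.n_components_)

clf = RandomForestClassifier().fit(samples, labels)
print(clf.predict(samples))
```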
The code was originally written in Python 3.10, guided by `isort`, `black`, `mypy`, `pylint`, `vulture`, and `eradicate`. It's a bit of a mess, and the documentation is subpar. I haven't bothered to clean it up properly, since it's unlikely that anyone will try to replicate my research. If you are interested in using the code and data, don't hesitate to contact me so I can help you get started. If desired, I can send my own processed files for reference.

You should probably read the thesis itself first.
- Run `collector.py` through `harbinger.sh` to create raw output files as `batch_*.tsv`.
- Concatenate the generated batch files (including headers) with Unix utilities into a single file (a Python equivalent is sketched after this list).
- Clean the data and select viable samples with `preprocess_filter_by_bad_samples.py`, from `concatenated_data.tsv` into `preprocessed_*.tsv`:

  ```
  python3 code/create_dataset/preprocess_filter_by_bad_samples.py data_processed/concatenated_data_withheaders.tsv
  ```

- Filter the data by target classes with `preprocess_filter_by_target_labels.py`, from `preprocessed_*.tsv` into `preprocessed_filtered_*.tsv`:

  ```
  python3 code/create_dataset/preprocess_filter_by_target_labels.py preprocessed_*.tsv
  ```

- Generate a list of unique values for unsupervised training with `prepare_embeddings_list.py`, from `concatenated_data.tsv` into `embeddings_list_*.tsv`:

  ```
  python3 code/create_support_data/prepare_embeddings_list.py data_processed/concatenated_data_withheaders.tsv
  ```

- Remove some redundant HTML-only lines from `embeddings_list.tsv` with Unix utilities (also covered by the sketch after this list).
- Generate a list of unique classes for HuggingFace's datasets library:

  ```
  python3 code/create_support_data/prepare_classes_lists.py preprocessed_filtered_*.tsv
  ```

- Train a new tokenizer with `train_tokenizer.py` from an `embeddings_list.tsv`:

  ```
  python3 code/create_embedding_model/train_tokenizer.py embeddings_list.tsv
  ```

- Train/fine-tune a (new) model with `train_embeddings_scratch.py` or `train_embeddings_finetune.py` from an `embeddings_list.tsv`:

  ```
  python3 code/create_embedding_model/train_embeddings_*.py embeddings_list.tsv
  ```

- Run `prepare_encoded_inputs.py` on `preprocessed_filtered.tsv` and `embeddings_list.tsv` to extract features from the raw text columns, reduce them to 64 dimensions, and save them as a HuggingFace dataset:

  ```
  python3 code/create_embedding_model/prepare_encoded_inputs.py preprocessed_filtered.tsv embeddings_list.tsv
  ```
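The two "Unix utilities" steps above (concatenating the batch files, and stripping HTML-only lines) can be done however you like; for reference, here is a rough Python equivalent. The exact filenames and the heuristic for what counts as an "HTML-only" line (anything starting with a tag) are my assumptions, so adapt them to your data.

```python
# Rough Python equivalent of the two "Unix utilities" steps above.
# Filenames and the "HTML-only" heuristic are assumptions, not the thesis setup.
import glob

# Concatenate the batch files, writing the header row only once.
with open("data_processed/concatenated_data_withheaders.tsv", "w", encoding="utf-8") as out:
    for i, path in enumerate(sorted(glob.glob("batch_*.tsv"))):
        with open(path, encoding="utf-8") as f:
            header = f.readline()
            if i == 0:
                out.write(header)  # keep the header from the first batch only
            out.writelines(f)      # copy the remaining data rows

# Drop lines from the embeddings list that are just HTML markup.
with open("embeddings_list.tsv", encoding="utf-8") as f:
    kept = [line for line in f if not line.lstrip().startswith("<")]
with open("embeddings_list.tsv", "w", encoding="utf-8") as f:
    f.writelines(kept)
```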
Run any classifier from `code/evaluate_embeddings` after `datasets/http-header-split-embedded-data-v1` has been created.
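If you want a starting point before diving into `code/evaluate_embeddings`, something along these lines should work. The split and column names (`train`, `test`, `features`, `label`) are guesses about the dataset layout, so check them against the generated dataset first.

```python
# Hypothetical starting point; split and column names are guesses.
import numpy as np
from datasets import load_from_disk
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

ds = load_from_disk("datasets/http-header-split-embedded-data-v1")
X_train = np.asarray(ds["train"]["features"])
y_train = np.asarray(ds["train"]["label"])
X_test = np.asarray(ds["test"]["features"])
y_test = np.asarray(ds["test"]["label"])

clf = RandomForestClassifier(n_jobs=-1).fit(X_train, y_train)
print("macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```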
For dataset analysis, see the assorted files in `code/analyze_dataset`.

To experiment with the trained model, see the assorted files in `code/play_with_model`.
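As one example of the kind of experiment possible there, a model trained with masked language modelling can be queried directly. The model path below is a placeholder for wherever your checkpoint was saved, and the `<mask>` token assumes the RoBERTa convention.

```python
# Querying the trained MLM directly; the model path is a placeholder.
from transformers import pipeline

fill = pipeline("fill-mask", model="path/to/your/trained-model")
for pred in fill("HTTP/1.1 404 <mask>"):
    print(pred["token_str"], round(pred["score"], 3))
```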