A new AI language model trained on data from the dark web is showing interesting results, according to a paper recently published by researchers from the Korea Advanced Institute of Science and Technology (KAIST). Following in the footsteps of popular language models like ChatGPT, DarkBERT is trained to recognize associations between words and patterns in language. Unlike ChatGPT, it is trained solely on text gathered from the darknet, a corpus of around 6.1 million pages.
Results thus far include DarkBERT’s ability to accurately identify and classify ransomware leak sites, monitor noteworthy threads in carding or hacking forums, and identify keywords related to cyber threats and even illicit drug sales on various darknet markets. In these tasks it outperformed other language models, including those on which it is based (BERT and RoBERTa).
Illustration of the DarkBERT pretraining process and potential use cases. Source: KAIST paper
Unlike its predecessors, DarkBERT is capable of unraveling the highly coded language often used on the dark web, tying together words and phrases that would normally have no contextual similarity on the clear web. Its creators believe this ability will help cybersecurity experts and even law enforcement detect new data leaks and other threats as they emerge on the dark web.
The paper illustrates DarkBERT’s advantage over other language models with an example in which each model was asked to suggest words related to a popular Dutch MDMA (Ecstasy) pill known as Philipp Plein. With the term “MDMA” hidden, DarkBERT was more likely to fill the gap with semantically meaningful words, including “pills,” “import,” “speed,” “up,” “oxy,” “script,” and “champagne” (another popular term for Ecstasy pills). The classic language model, however, suggested words that were more likely to come after the term “Dutch” on the clearnet, including “man,” “champion,” “singer,” “producer,” and “citizen.”
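The effect of the training corpus on such word suggestions can be shown with a toy sketch. This is not DarkBERT or BERT, and it is not the method used in the paper: it is a simple counter of which words follow “dutch” in two tiny, invented corpora. The sentences and the names `CLEARNET`, `DARKNET`, `next_word_counts`, and `fill_mask` are all assumptions made for illustration.

```python
from collections import Counter

# Two tiny, made-up corpora (hypothetical; not taken from the KAIST datasets).
CLEARNET = [
    "the dutch singer released a new album",
    "a dutch champion won the race",
    "the dutch producer signed the singer",
]
DARKNET = [
    "dutch pills import quality guaranteed",
    "dutch speed and pills in stock",
    "import dutch pills fast shipping",
]


def next_word_counts(corpus, context_word):
    """Count which words directly follow `context_word` in the corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            if current == context_word:
                counts[following] += 1
    return counts


def fill_mask(corpus, context_word, top_k=3):
    """Return up to `top_k` of the most frequent words seen after `context_word`."""
    ranked = next_word_counts(corpus, context_word).most_common(top_k)
    return [word for word, _ in ranked]


# The same prompt word yields different suggestions depending on the corpus.
clearnet_guesses = fill_mask(CLEARNET, "dutch")
darknet_guesses = fill_mask(DARKNET, "dutch")
print("clearnet:", clearnet_guesses)
print("darknet:", darknet_guesses)
```

A real masked language model conditions on the whole sentence rather than on one preceding word, but the sketch captures the same idea: the suggestions reflect whatever text the model has seen.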
“We show that DarkBERT outperforms existing language models with evaluations on Dark Web domain tasks, as well as introduce new datasets that can be used for such tasks,” wrote the KAIST authors in their research paper, concluding that “DarkBERT shows promise in its applicability on future research in the Dark Web domain and in the cyber threat industry.”