The dataset was systematically developed by augmenting and refining the Aposemat-Bot-IoT-23 dataset to address limitations in class imbalance, labeling consistency, and feature representation. Unlike prior datasets that include limited or uneven distributions of malware families, this dataset focuses on high-quality botnet traffic and benign behavior, ensuring reliable and scalable modeling of IoT botnet activities. It captures detailed network and transport-layer behaviors using protocol-aware flow representations, enabling comprehensive analysis of bot-driven cyber threats in IoT environments.
Captured and labeled:
Data source: TCP/IP-based network traffic converted into bidirectional flow representations
Testbed: Real-world IoT network traffic scenarios from the Aposemat-Bot-IoT benchmark environment
Attacks Profile: Multiple botnet families including Mirai, Gagfyt, IRCBot, Kenjiro, Torii, Linux Mirai, Okiru, and others, alongside benign traffic
Data size: Hundreds of millions of network flow records derived from large-scale PCAP files
Data records: Over 235 million malicious bot records in addition to benign traffic samples
Data capturing: Derived from labeled PCAP files with Zeek logs and flow reconstruction
Extracted Features: 315 flow-based features capturing packet-level, statistical, temporal, and bidirectional traffic characteristics.
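To make the idea of a bidirectional flow representation concrete, the following minimal sketch groups packets from both directions of a conversation into one flow record and keeps a few per-direction counters. It is an illustration only, not the extraction pipeline used for this dataset: the names `flow_key` and `aggregate_flows`, the packet tuple layout, and the handful of statistics are assumptions, and the sketch ignores flow timeouts and TCP state.

```python
def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent key: both directions of a conversation map to it."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    lo, hi = (a, b) if a <= b else (b, a)
    return (lo, hi, proto)


def aggregate_flows(packets):
    """Group packets into bidirectional flows with simple per-direction statistics.

    Each packet is a tuple (hypothetical layout):
    (timestamp, src_ip, src_port, dst_ip, dst_port, proto, length).
    The first packet seen for a flow defines its "forward" direction.
    """
    flows = {}
    for ts, sip, sport, dip, dport, proto, length in sorted(packets):
        key = flow_key(sip, sport, dip, dport, proto)
        flow = flows.get(key)
        if flow is None:
            flow = flows[key] = {
                "initiator": (sip, sport),
                "first_ts": ts, "last_ts": ts,
                "fwd_packets": 0, "bwd_packets": 0,
                "fwd_bytes": 0, "bwd_bytes": 0,
            }
        direction = "fwd" if (sip, sport) == flow["initiator"] else "bwd"
        flow[direction + "_packets"] += 1
        flow[direction + "_bytes"] += length
        flow["last_ts"] = ts
    for flow in flows.values():
        flow["duration"] = flow["last_ts"] - flow["first_ts"]
    return flows


# Toy example: three packets of one Telnet conversation.
packets = [
    (0.00, "10.0.0.5", 40000, "10.0.0.9", 23, "tcp", 60),
    (0.02, "10.0.0.9", 23, "10.0.0.5", 40000, "tcp", 60),
    (0.05, "10.0.0.5", 40000, "10.0.0.9", 23, "tcp", 120),
]
flows = aggregate_flows(packets)
```

In this toy example all three packets end up in a single flow whose forward and backward counters are tracked separately, which is the property the flow-based features in the dataset rely on.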
This dataset introduces a robust flow-based representation framework using the NTLFlowLyzer analyzer, enabling the extraction of bidirectional, time-dependent behavioral features across the network and transport layers. A novel and precise labeling methodology was applied by aligning flow records with Zeek-generated logs using IP-port matching, thereby ensuring accurate binary and multi-class annotations. To address significant class imbalance and scalability challenges, a cluster-based undersampling (CBUS) strategy was employed to preserve the data's structural characteristics while maintaining computational feasibility. Furthermore, careful preprocessing steps, including the removal of ambiguous “suspicious” samples, normalization, and proportional sampling, ensure high-quality, reliable training data. This dataset supports the development of advanced AI and LLM-based intrusion detection systems, enabling behavior-centric, scalable, and realistic modeling of IoT botnet threats in complex network environments.
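The labeling and undersampling steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the dictionary field names (`src_ip`, `src_port`, `dst_ip`, `dst_port`, `label`), and the rule of sampling each cluster in proportion to its size are all hypothetical. Cluster assignments are assumed to come from any clustering algorithm run beforehand, and a real pipeline would likely also compare timestamps, since ports can be reused.

```python
import random
from collections import defaultdict


def endpoint_key(src_ip, src_port, dst_ip, dst_port):
    """Order-independent key so a flow matches a log entry in either direction."""
    return frozenset([(src_ip, src_port), (dst_ip, dst_port)])


def label_flows(flows, zeek_entries, drop_labels=("suspicious",)):
    """Attach the label of the Zeek entry that shares a flow's IP-port pair.

    Flows without a match, or with a label in drop_labels, are discarded.
    """
    lookup = {}
    for e in zeek_entries:
        key = endpoint_key(e["src_ip"], e["src_port"], e["dst_ip"], e["dst_port"])
        lookup[key] = e["label"]
    labeled = []
    for f in flows:
        key = endpoint_key(f["src_ip"], f["src_port"], f["dst_ip"], f["dst_port"])
        label = lookup.get(key)
        if label is not None and label not in drop_labels:
            labeled.append({**f, "label": label})
    return labeled


def undersample_by_cluster(records, cluster_ids, target_size, seed=0):
    """Keep about target_size records, sampling each cluster proportionally."""
    if target_size >= len(records):
        return list(records)
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for rec, cid in zip(records, cluster_ids):
        clusters[cid].append(rec)
    sampled = []
    for cid in sorted(clusters):
        members = clusters[cid]
        # Every cluster keeps at least one record so small clusters survive.
        quota = max(1, round(target_size * len(members) / len(records)))
        sampled.extend(rng.sample(members, min(quota, len(members))))
    return sampled


# Toy example: one flow matches a Mirai entry (in reverse direction),
# the other matches an ambiguous "suspicious" entry and is dropped.
zeek_entries = [
    {"src_ip": "10.0.0.5", "src_port": 40000, "dst_ip": "10.0.0.9",
     "dst_port": 23, "label": "Mirai"},
    {"src_ip": "10.0.0.7", "src_port": 51000, "dst_ip": "8.8.8.8",
     "dst_port": 53, "label": "suspicious"},
]
toy_flows = [
    {"src_ip": "10.0.0.9", "src_port": 23, "dst_ip": "10.0.0.5", "dst_port": 40000},
    {"src_ip": "10.0.0.7", "src_port": 51000, "dst_ip": "8.8.8.8", "dst_port": 53},
]
labeled = label_flows(toy_flows, zeek_entries)

# 100 records in two clusters (80 and 20), reduced to about 10.
records = list(range(100))
cluster_ids = [0] * 80 + [1] * 20
subset = undersample_by_cluster(records, cluster_ids, target_size=10)
```

Sampling within clusters rather than uniformly at random is one way to keep rare traffic patterns represented in the reduced set, which is the motivation given for cluster-based undersampling here.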
The full research paper outlining the details of the dataset and its underlying principles:
"Unveiling Intruders' Behaviors: Explainable AI-Based Profiling of Malicious Bot Activities in IoT Networks", Sepideh Niktabe, Dilli Sharma, and Arash Habibi Lashkari, Journal of Supercomputing, Volume 82, April 2026
Download Dataset:
