Dataset Information

Malicious and Benign Webpages Dataset.

ABSTRACT: Web security is a challenging task amid ever-rising threats on the Internet. With billions of websites active on the Internet, and hackers developing new techniques to trap web users, machine learning offers promising techniques for detecting malicious websites. The dataset described in this manuscript is meant for such machine-learning-based analysis of malicious and benign webpages. The data was collected from the Internet using a specialized focused web crawler named MalCrawler [1]. The dataset comprises various extracted attributes as well as raw webpage content, including JavaScript code. It supports both supervised and unsupervised learning. For supervised learning, class labels for malicious and benign webpages were added to the dataset using the Google Safe Browsing API. The most relevant attributes within the scope have already been extracted and included in this dataset. However, the raw web content, including JavaScript code, supports further attribute extraction if desired. This raw content and code can also be used as unstructured input for text-based analytics. The dataset consists of data from approximately 1.5 million webpages, which makes it suitable for deep learning algorithms. This article also provides code snippets used for data extraction and analysis.
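
The abstract notes that the raw web content, including JavaScript, supports further attribute extraction. A minimal sketch of what such extraction could look like is given here, using only the Python standard library. The function name `extract_attributes`, the attribute names, and the sample URL and page are illustrative assumptions; they are not the dataset's actual columns or the author's extraction code.

```python
import re
from urllib.parse import urlparse


def extract_attributes(url: str, content: str) -> dict:
    """Derive a few simple attributes from a URL and its raw page content.

    The attribute names below are hypothetical and chosen for illustration.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Bodies of <script> elements, matched case-insensitively across newlines.
    scripts = re.findall(
        r"<script\b[^>]*>(.*?)</script>",
        content,
        flags=re.IGNORECASE | re.DOTALL,
    )
    return {
        "url_len": len(url),
        "uses_https": parsed.scheme == "https",
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "content_len": len(content),
        "script_count": len(scripts),
        "js_len": sum(len(body) for body in scripts),
    }


# Made-up example record, not taken from the dataset.
sample = extract_attributes(
    "http://example.com/index.html",
    "<html><script>var a = 1;</script><p>hello</p></html>",
)
```

A regular expression is a rough way to locate script blocks; for real pages an HTML parser would be more robust, and the choice of attributes would depend on the analysis at hand.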

SUBMITTER: Singh AK

PROVIDER: S-EPMC7648114 | biostudies-literature | 2020 Oct

REPOSITORIES: biostudies-literature

Publications

Malicious and Benign Webpages Dataset.

Singh AK

Data in Brief, 2020-09-12

PMID: 33204771

Similar Datasets

DNS dataset for malicious domains detection.
S-EPMC8437788 | biostudies-literature

Dataset of anomalies and malicious acts in a cyber-physical subsystem.
S-EPMC5536820 | biostudies-literature

The effect of malicious and benign envy on career plateau in nurses: a cross-sectional study.
S-EPMC12403576 | biostudies-literature

CPDR tumor-benign 80 genechip dataset
2011-09-29 | GSE32448 | GEO

CPDR tumor-benign 80 genechip dataset
2011-09-28 | E-GEOD-32448 | biostudies-arrayexpress

NGS-dataset of putative driver mutations associated with benign peritoneal strumosis.
S-EPMC6122335 | biostudies-literature

Detection of malicious nodes based on consortium blockchain.
S-EPMC11232615 | biostudies-literature

AndroAnalyzer: android malicious software detection based on deep learning.
S-EPMC8157142 | biostudies-literature

Electrophysiological correlates of aesthetic processing of webpages: a comparison of experts and laypersons.
S-EPMC5463973 | biostudies-literature

Phishing detection on webpages in European non-English languages based on machine learning.
S-EPMC12559260 | biostudies-literature