Problem Statement : Malicious Webpage Detection using Machine Learning.

Solution :

Our team built a holistic solution that combines Google APIs with a deep learning model to classify a webpage as malicious or safe. We also built a machine learning model that parses the embedded JavaScript to identify any malicious content.

The first phase is data collection: URLs are gathered from multiple sources and labeled as malicious or safe by taking into account the opinions of various security vendors.

The second phase is pre-processing. Each collected URL is parsed and tokenized to extract features such as domain name, protocol and WHOIS updates. The source HTML of the address is then fetched, parsed and tokenized. Because the tokens vary in length, they are bucketed using hashing. The source HTML is divided into 16 equal parts, each of length (length of document / 16). These 16 code blocks are converted into 16 vectors of length 1024, which are then aggregated hierarchically to obtain 31 vectors that represent the same document at multiple levels of granularity.

Features extracted from the URL are fed into a classifier to generate the result. For an in-depth analysis, the features extracted from the document's source are fed into the deep learning model. The 31 vectors are fed as inputs into a CNN, which yields 1024 feature maps for each input vector. These feature maps are max-pooled to identify the strongest activation at each level of granularity. The pooled output is then fed into a dense network with 2 hidden layers, which classifies the URL as either malicious or safe.

The front end was developed using HTML and CSS, and it displayed features such as IP addresses, geospatial data, and graphical representations of the output of our model.
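
The pre-processing of the source HTML described above (tokenizing, hash bucketing, 16 blocks, 31 vectors) can be illustrated with a minimal sketch. It is not our actual implementation, and several details are assumptions made only for illustration: the regex tokenizer, the use of CRC32 as the hash function, splitting by token count rather than by character length, and merging adjacent vectors by element-wise sum. Under those assumptions, the 31 vectors arise as 16 + 8 + 4 + 2 + 1 levels of a pairwise merge. All function names are hypothetical.

```python
import re
import zlib

NUM_BLOCKS = 16     # number of document parts, from the description
VECTOR_SIZE = 1024  # length of each hashed vector, from the description


def tokenize(html):
    """Split HTML source into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[A-Za-z0-9_]+", html)


def split_blocks(tokens, num_blocks=NUM_BLOCKS):
    """Split the token list into num_blocks contiguous, near-equal parts."""
    n = len(tokens)
    return [
        tokens[i * n // num_blocks:(i + 1) * n // num_blocks]
        for i in range(num_blocks)
    ]


def hash_vector(tokens, size=VECTOR_SIZE):
    """Bucket variable-length tokens into a fixed-size count vector.

    CRC32 is an assumed stand-in for whatever hash the real system uses.
    """
    vec = [0] * size
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % size] += 1
    return vec


def merge_pairs(level):
    """Merge adjacent vectors by element-wise sum (assumed aggregation)."""
    return [
        [a + b for a, b in zip(level[i], level[i + 1])]
        for i in range(0, len(level), 2)
    ]


def document_vectors(html):
    """Return 31 vectors: 16 block vectors plus 8, 4, 2 and 1 merged ones."""
    level = [hash_vector(block) for block in split_blocks(tokenize(html))]
    vectors = list(level)
    while len(level) > 1:
        level = merge_pairs(level)
        vectors.extend(level)
    return vectors
```

The first 16 vectors describe the document at the finest granularity and the last one summarizes the whole page, so a downstream model can look at the same document at several scales.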