How to Classify Web Pages Using Machine Learning?

Kajal Kumari 10 Mar, 2023 • 17 min read

Introduction

A web page is a document or information resource that is accessible through the World Wide Web. It is typically made up of HTML (Hypertext Markup Language), which provides the structure and content of the page, and CSS (Cascading Style Sheets), which provides the styling information for how the page should be presented to the user.

A web page can contain text, images, videos, audio, and other multimedia content. It can also contain interactive elements, such as forms, links, and buttons, that allow users to interact with the page and access additional information or services.

Web pages are hosted on web servers and accessed through web browsers, such as Google Chrome, Mozilla Firefox, or Apple Safari. A web page’s URL (Uniform Resource Locator) provides a unique identifier that can be used to access the page from anywhere in the world.

Web pages play a central role in the World Wide Web, providing the primary means of accessing information and services online. They are used for a wide range of purposes, including personal communication, commerce, education, entertainment, and information dissemination.

In this article, we will build a classifier that identifies in advance which URLs are relevant to what you are looking for.

This article was published as a part of the Data Science Blogathon.

Uses of Web Pages

  1. Information Dissemination: Web pages are used to provide information to a large audience. This includes educational information, news, and other content intended to inform or educate users.
  2. E-Commerce: Businesses use web pages to sell products and services online. This includes both physical goods, such as books and electronics, and digital goods, such as software and music.
  3. Communication: Web pages are used for personal communication, such as email and instant messaging, as well as for professional communication, such as video conferencing and team collaboration tools.
  4. Entertainment: Web pages are used for entertainment, such as streaming video and audio, playing games, and accessing social media sites.
  5. Government Services: Government agencies use web pages to provide information and services to citizens, such as paying taxes and renewing licenses.
  6. Job Searching: Job seekers use web pages to find job opportunities, research employers, and apply for jobs online.
  7. Personal Branding: Individuals use web pages to build their personal brand and establish their online presence.

Web Page Classification

Web page classification refers to the process of categorizing web pages into predefined classes or categories based on their content and structure. This can be useful for a variety of purposes, such as organizing web pages for easier searching and browsing, filtering out irrelevant or malicious web pages, and improving the accuracy of search engine results.

Classification of web pages can be performed using machine learning techniques, such as decision trees, random forests, support vector machines, and neural networks. These algorithms take as input a set of features extracted from the web pages, such as the frequency of specific HTML tags, the presence of certain keywords in the text, or the structure of the links between web pages. The algorithms then learn to map these features to class labels based on a training dataset of web pages that have been manually annotated with their correct class labels.

In order to obtain high accuracy in web page classification, it is important to have a large and diverse training dataset and effective feature engineering to capture the relevant information about the web pages. The choice of machine learning algorithm will also impact performance and should be selected based on the nature of the problem and the available training data.

Why do we Need Web Page Classification?

There are several reasons why web page classification is important:

  1. Information Organization: By categorizing web pages into different classes, organizing and navigating large amounts of online content becomes easier. This can help users find the information they are looking for more quickly and easily.
  2. Search Engine Optimization: Search engines use web page classification to rank and sort the results of a user’s search query. By accurately categorizing web pages, search engines can return more relevant results and improve the user experience.
  3. Content Filtering: Web page classification can be used to filter out malicious or irrelevant content, such as spam or adult websites. This can help protect users from exposure to inappropriate or harmful material and improve the quality of online content.
  4. Ad Targeting: Web page classification can also be used to target ads more effectively by determining the topic or subject matter of a web page and displaying ads that are relevant to that content.
  5. Personalization: By analyzing a user’s web browsing history, web page classification can be used to personalize the user’s online experience by recommending content that is relevant to their interests.

In summary, web page classification is a crucial component in the organization, discovery, and analysis of online content and has numerous applications in fields such as search engines, online advertising, and user personalization.

Understanding the Problem Statement

The problem of web page classification is to accurately categorize a given web page into one or more predefined classes based on its content and structure.

When you search Google for breast cancer symptoms and effects, you get around 130 million results, and going through each of them is impossible.

Instead, you can build a classifier that identifies in advance which of the URLs are relevant to what you are looking for.

Problem Statement:

In this case study, we are provided with the URLs of more than 53,000 web pages. The objective is to build a classifier that assigns each web page to its class (each web page belongs to exactly one class).

Data Description

Below is the list of the classes that we have in the target variable; note that each of the URLs in our data set will belong to only one class. So this will be a multi-class classification problem.

Basically, given the complete URL, predict the tag a web page belongs to out of 9 predefined tags as given below:

  • People Profile: Pages that look like an individual’s profile, for example listing their education details.
  • Conference: Pages that share the details of a particular conference or event.
  • Forums: Different kinds of forums, such as discussion and support forums.
  • News Article: News articles, which may fall into different categories.
  • Publication: Pages that may relate to research papers.
  • Others: Web pages or URLs that do not fit any of the other tags.
  • Clinical Trials, Thesis, and Guidelines: The remaining three tags, which appear in the Tag column explored below.

The dataset contains the following features:

  • Web page_ID: Unique ID for the web page (1, 2, 3, …)
  • Domain: Domain of the web page
  • Url: Complete URL of the web page
  • Tag: (Target) Tag (class) of the web page

The objective here is to predict the class of the web page from the above-mentioned 9 classes. So, let’s quickly go through the components of a URL.

Components of a URL

Let us have a look at each component of a URL to have a better understanding of the data.

[Figure: components of a URL]

The example URL in the figure is made up of the following parts:

  • http: The protocol, here the Hypertext Transfer Protocol.
  • www: The World Wide Web subdomain.
  • mck: The domain name, here mck (in other URLs it could be, for example, facebook or analytics).
  • edu: The top-level domain, which indicates the type of site (edu means an educational site, gov means a government site).
  • au: The geographical domain, which tells you the website’s country of origin. In this example the source is Australia, so it is written as au.
  • File path: Unique to every page; it identifies the specific page within the domain.
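The components listed above can be pulled apart with the standard library’s urlparse. The following is a minimal sketch; the URL is made up to mirror the example (it is an assumption, not a row from the dataset).

```python
from urllib.parse import urlparse

# Hypothetical URL, made up to mirror the components listed above.
url = "http://www.mck.edu.au/research/thesis-2020.html"

parts = urlparse(url)
print(parts.scheme)  # the protocol
print(parts.netloc)  # the full host name
print(parts.path)    # the file path

# Split the host name into its dot-separated labels.
labels = parts.netloc.split(".")
print(labels)
```

Splitting the host on dots is only a rough way to see www, the domain name, the top-level domain, and the geographical domain side by side; real host names do not always follow this four-part layout.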

Let’s move on to a practical example. First, import all the libraries.

Import Packages

In the section above, we learned about the problem statement of web page classification. We’ll start by importing the required packages, and then we’ll load the data set.

# Standard library
import re
import warnings
from urllib.parse import urlparse

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from scipy.sparse import hstack

# Feature extraction, models, and evaluation from scikit-learn
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_predict, GroupKFold
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

# Notebook display settings
color = sns.color_palette()
pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings('ignore')
%matplotlib inline

Load Datasets

We have the data set saved as a CSV file named webpage data.csv.

df = pd.read_csv("webpage data.csv")

Basic Exploration

Let’s look at the shape of the dataset.

df.shape

So we have 53,229 rows and 4 columns. Let’s see what these columns look like.

The web page ID is a unique ID for each row. Then we have the domain, which for the first few rows is fiercepharma.com, followed by the URL and, finally, the target, or tag, for each URL.

To confirm, let’s see how many unique classes there are in the variable Tag. We can do this by calling the unique function on that column of the data frame.

df['Tag'].unique()

This returns all the unique categories in the Tag column, and we can see news, clinical trials, conferences, and so on.

As mentioned in the problem statement, there are 9 separate categories. Let us look at a few samples from each category.

Public Profiles

To start with, we have the class profile. Let’s look at a few examples; the domains here are healthcare sites with pages about people.

df[df['Tag'] == 'profile'].head(2)

Now let’s look at some other examples.

Conferences/Congress

Conferences are quite self-explanatory. The word events appears within the URL, which suggests a page that shares the details of a particular conference held by some organization.

df[df['Tag'] == 'conferences'].head(2)

Let’s look at some other examples.

Forum

Here is another example, for the class forum. We can see words such as community and forum within the URL.

df[df['Tag'] == 'forum'].head(2)

Others

df[df['Tag'] == 'others'].head(2)

  • Looking at the samples, we can see that a few words appear consistently in the URLs of each category. For example, the word forum appears in URLs tagged forum, gid in URLs tagged guidelines, and CFM in URLs tagged conferences.
  • This suggests that it would be a good idea to find each word’s frequency and use that as a feature.

So that’s a brief about the data set and the target classes.

Target Exploration

Now, let’s look at the distribution of the target classes. We’ll use the value_counts function and plot the result.

cnt_tag = df['Tag'].value_counts()
plt.figure(figsize=(12,6))
# Pass x and y as keywords; recent seaborn versions do not accept them positionally
sns.barplot(x=cnt_tag.index, y=cnt_tag.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.show()

As the plot shows, most of the URLs belong to the class others. Next come news and publication, each with a frequency of approximately 7,500.

Other classes, such as thesis and guidelines, have a much lower frequency than others, news, and publication. So this is clearly an imbalanced classification problem.

Understanding the Common Words Used in the URLs: WordCloud

A word cloud, also known as a tag cloud, is a visual representation of the most frequent words in a text corpus. In the context of web pages, a word cloud can be generated for the URLs of a set of web pages in order to gain insight into the common words used in the URLs.

Now I want to see how the words are distributed across the URLs in the dataset. One way to do this is to look at the most common words by plotting word clouds.

A word cloud is a visualization where the most frequent words appear in large sizes, and the less frequent words appear in smaller sizes.
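The frequencies that a word cloud turns into font sizes can be sketched with a plain counter. This is only an illustration: the URLs are made up (they are not from the dataset), and the tokenization rule, splitting on any non-alphanumeric character, is an assumption rather than what the WordCloud library does internally.

```python
import re
from collections import Counter

# Hypothetical URLs, made up for illustration (not taken from the dataset).
urls = [
    "http://www.example.edu/thesis/handle/123",
    "http://www.example.edu/thesis/library",
    "http://forum.example.com/community/thread-42",
]

# Split each URL on runs of non-alphanumeric characters and drop empty tokens.
tokens = [
    token
    for url in urls
    for token in re.split(r"[^A-Za-z0-9]+", url.lower())
    if token
]

word_counts = Counter(tokens)
print(word_counts.most_common(5))
```

Words with a higher count here are the ones that would be drawn larger in the cloud.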

Let’s visualize all the words in our data using the word cloud plot.

all_words = ' '.join([text for text in df['Url']])
from wordcloud import WordCloud
# Keep the chained generate() call inside the same statement
wordcloud = WordCloud(width=800, height=500, random_state=21,
             max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

The word cloud represents the frequency of the words: the bigger a word appears, the higher its frequency. Here, biomedcentral and articles are the two most frequent words.

Overall, as the word cloud shows, most of the URLs point to healthcare pages. There are also words such as thesis and edu, which again suggests that word frequency should be an important feature for prediction.

A good idea would be to create a word cloud for each category. Let’s do this for the category thesis.

Word Cloud for the Tag: Thesis

In this case, I have set a condition that takes the words in the Url column only from rows where the tag is thesis.

The general parameters, such as width and height, are the same as before.

all_words = ' '.join([text for text in df[df['Tag'] == 'thesis']['Url']])
wordcloud = WordCloud(width=800, height=500, random_state=21,
             max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

This is the word cloud for the category thesis. The most common words are edu, handle, kernel, thesis, library, etc.

Similarly, you can plot word clouds for the other categories to get an idea of the most common words in their URLs; those words should help in determining the final class.

Feature Extraction

Here we simply have domain and URL, which are neither numeric nor categorical variables, as each URL is unique.

Each URL in the dataset can be considered a single string, since the words in a URL are not separated by spaces. Instead, there are two types of separators here: ‘/’ and ‘-’. We can replace these with spaces to get the individual words.

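def clean_url(df):
    # Turn the two separators into spaces so each URL becomes a sequence of words
    df["Url"] = df["Url"].str.replace("/", " ", regex=False)
    df["Url"] = df["Url"].str.replace("-", " ", regex=False)
    # Drop the protocol prefix, which is the same for most URLs
    df["Url"] = df["Url"].str.replace("https:", "", regex=False)
    df["Url"] = df["Url"].str.replace("http:", "", regex=False)
    return df
df = clean_url(df)
df.head(5)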

Look at the head of the newly cleaned data.

As you can see, we now have separate words instead of one complete string.

Now we can move on to the feature extraction part. We have split the URLs into words and done the necessary cleaning. It is time to convert these into features. The common options are:

  • Bag of Words
  • TFIDF (Term Frequency Inverse Document Frequency)
  • Word Vectors

Here, we will start with BOW features. Sklearn provides functionality for both BOW and TF-IDF features. Let’s use it to create these features.

If you look at the Sklearn documentation for feature extraction, you will see that:

  • ngram_range defines the range of n for the n-grams.
  • min_df sets a threshold: when building the vocabulary, terms with a document frequency strictly lower than min_df are ignored.
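To see what these two parameters do, here is a small sketch on a toy corpus. The three “cleaned URLs” are made up for illustration, and the thresholds are far smaller than the ones used on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned URLs, made up for illustration.
docs = [
    "www example edu thesis handle",
    "www example edu thesis library",
    "forum example com community thread",
]

# Unigrams and bigrams; keep only terms that occur in at least 2 documents.
vec = CountVectorizer(ngram_range=(1, 2), min_df=2)
X_demo = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))
print(X_demo.shape)
```

Terms such as handle or forum occur in a single document, so min_df=2 drops them, while bigrams such as edu thesis survive because they occur in two documents.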

Bag-of-Words Features

The URLs contain many abbreviations of the same words, so it could be a good idea to create bag-of-words features from characters as well.

Let’s start by creating the bag of words with the count vectorizer. We give the count vectorizer the parameter ngram_range, which goes from 1 to 3.

# Word BOW on URLs
vec_bow = CountVectorizer(ngram_range=(1, 3), min_df=400)
vec_bow.fit(df['Url'])
Url_bow = vec_bow.transform(df['Url'])

We’ll keep adding features as we go along.

Let’s build a first model and check its score. To build a model, we first need to create a train and validation set.

Train Test Split

We will not randomly shuffle the data set into the train and validation set in this particular problem.

Randomly splitting the dataset into train and test and then checking performance is not the right approach for this problem. Here’s why.

Let’s say we have the domain ecommons.cornell.edu. This is Cornell University’s digital repository, and it predominantly contains pages of the class thesis. Now, suppose this domain (ecommons.cornell.edu) and class (thesis) combination appears in both train and test. In that case, the model can predict the class thesis from the domain alone, but such a model would not be useful: it would not generalize to a new thesis hosted on a different domain.

Let me explain with a simple example. Say we have a subset of the data set with two domains, agelab.mit.edu and aac.asm.org, each of which comes with its own tags.

The Solution?

The train and test split should be done on the Domain-Tag combination, so that URLs with the same domain and class never end up in both train and test. Otherwise, the domain could be mapped directly to the tag, and that would be a leakage.

Model Building and Model Validation

This is a multi-class classification problem, and the metric we will use here is the weighted F1 score. The weighted F1 score averages the per-class F1 scores, with weights proportional to each class’s frequency among the true labels.
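The weighted F1 score described above can be worked out by hand. The sketch below uses plain Python and a handful of made-up labels; the helper names are hypothetical, and in the article itself the score comes from Sklearn’s f1_score with average="weighted".

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights equal to class support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_per_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )

# Toy labels, made up for illustration.
y_true = ["others", "others", "others", "news", "thesis"]
y_pred = ["others", "others", "news", "news", "others"]

score = weighted_f1(y_true, y_pred)
print(round(score, 3))
```

Because the weights follow class frequency, a frequent class such as others dominates the score, which is worth keeping in mind on an imbalanced data set like this one.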

To implement the domain-tag split described above, we will use Group K-Fold from Sklearn. In Group K-Fold, the same group never appears in two different folds. The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.

Here we create a new variable called target_str, which is the domain and the tag joined together. We’ll use Group K-Fold to create the train and validation sets, with the number of folds set to five.

The Group K-Fold split creates five different folds of the complete dataset df.

# Replicate train/test split strategy for cross validation
df["target_str"] = df["Domain"].astype(str) + '_' + df["Tag"].astype(str)
cvlist = list(GroupKFold(5).split(df, groups=df["target_str"]))

Here we pass the new variable target_str as the groups, so the Group K-Fold split ensures that all rows sharing the same target_str value land in the same fold.
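The guarantee that a group never straddles train and validation can be checked on a toy example. The group names below are made up in the same Domain_Tag style, and the feature matrix is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical Domain_Tag groups, made up for illustration.
groups = np.array([
    "a.edu_thesis", "a.edu_thesis", "b.org_news", "b.org_news",
    "c.com_forum", "c.com_forum", "d.net_others", "d.net_others",
])
X_demo = np.zeros((len(groups), 1))  # placeholder features

folds = list(GroupKFold(n_splits=4).split(X_demo, groups=groups))
for train_idx, valid_idx in folds:
    # Groups that appear on both sides of the split (should be none).
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    print(sorted(set(groups[valid_idx])), "shared groups:", len(shared))
```

Every fold reports zero shared groups, which is exactly the property that prevents the domain-to-tag leakage discussed earlier.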

df["target_str"].head()
cvlist

Now, the variable X stores all the bag-of-words features created by the count vectorizer above.

X = Url_bow
TAG_DICT = {"others":1, "news": 2, "publication":3, "profile": 4,
            "conferences": 5, "forum": 6, "clinicalTrials": 7,
            "thesis": 8, "guidelines": 9}
df['target'] = df.Tag.map(TAG_DICT)
y = df["target"].values

The code above converts the strings in the target variable to numbers and stores them in the variable y.

def cv_score(ml_model, X):
    """Fit ml_model on each Group K-Fold split and return the weighted F1 per fold."""
    cv_scores = []

    # Custom cross validation based on group KFold
    for i, (train_index, valid_index) in enumerate(cvlist, start=1):
        print('\n{} of Group kfold {}'.format(i, len(cvlist)))
        xtr, xvl = X[train_index], X[valid_index]
        ytr, yvl = y[train_index], y[valid_index]

        # Fit the model on the training part of this fold
        model = ml_model
        model.fit(xtr, ytr)
        pred_probs = model.predict_proba(xvl)
        # Map the most probable column back to its class label
        label_preds = model.classes_[np.argmax(pred_probs, axis=1)]

        # Calculate and print the score for this fold
        score = f1_score(yvl, label_preds, average="weighted")
        print("Weighted F1 Score: {}".format(score))

        # Save the score
        cv_scores.append(score)
    return cv_scores

This function takes the model we want to build and the features we have. Let’s now build our first model, a multinomial Naive Bayes.

Naive Bayes

We pass the word features we created to the function cv_score.

cv_score(MultinomialNB(alpha=.01), Url_bow)

This gives us the scores for all five folds. We have a score of 0.59 for the first fold, then 0.63, and so on. The highest score is 0.68.

Now we’ll create some more features. Let’s look at how we can create features using characters.

Character N-Grams

Previously, we created bag-of-words features from individual words, and then from groups of words: we used bigrams and trigrams, taking two and three words together. Now we can do the same for characters.

These scores are low. Since URLs are not regular sentences, it would be a good idea to also build features using character n-grams.

For sequences of characters, the first few 3-grams that can be generated from “good morning” are “goo”, “ood”, “od ”, “d m”, “ mo”, “mor”, and so on.
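The “good morning” example can be reproduced with a sliding window over the string. This is a plain-Python sketch with a hypothetical helper name, not the code CountVectorizer runs internally.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = char_ngrams("good morning", 3)
print(trigrams)
```

A string of 12 characters yields 10 trigrams, and the space counts as a character, which is why n-grams such as “od ” and “d m” appear.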

Let’s do that and check performance again.

# Word and character BOW on URLs
vec1 = CountVectorizer(analyzer='char', ngram_range=(1, 5), min_df=500)
vec2 = CountVectorizer(analyzer='word', ngram_range=(1, 3), min_df=400)
vec_bow = FeatureUnion([("char", vec1), ("word", vec2)])
vec_bow.fit(df['Url'])
Url_bow = vec_bow.transform(df['Url'])

These will be our new features, combining character and word n-grams. Now we’ll build the same multinomial Naive Bayes model again and see if there is any improvement in the score.
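FeatureUnion simply places the columns produced by each vectorizer side by side. The sketch below checks this on two made-up cleaned URLs; the variable names are hypothetical and the n-gram ranges are kept small for readability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical cleaned URLs, made up for illustration.
docs = ["www example edu thesis", "forum example com thread"]

# Fit each vectorizer on its own to learn how many columns it produces.
X_char = CountVectorizer(analyzer="char", ngram_range=(1, 2)).fit_transform(docs)
X_word = CountVectorizer(analyzer="word", ngram_range=(1, 2)).fit_transform(docs)

# Fit the same two vectorizers together through a FeatureUnion.
union = FeatureUnion([
    ("char", CountVectorizer(analyzer="char", ngram_range=(1, 2))),
    ("word", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
])
X_union = union.fit_transform(docs)

print(X_char.shape[1], X_word.shape[1], X_union.shape[1])
```

The combined matrix has as many columns as the character and word matrices together, so the model sees both views of each URL at once.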

cv_score(MultinomialNB(alpha=.01), Url_bow)

The scores have improved. The score for the first fold is now 0.67, and the best is 0.72. We created the bag-of-words features using the count vectorizer, and we see a significant improvement from adding the character n-grams. Now, let’s try the TF-IDF features.

TF-IDF Features

We have a TF-IDF vectorizer here, and we’ll use both analyzers, characters and words. We’ll create features with an n-gram range of 1 to 5 for characters and 1 to 3 for words.

# Word and character TFIDF on URLs
vec1 = TfidfVectorizer(analyzer='char', ngram_range=(1, 5), min_df=500)
vec2 = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=400)
vec_tfidf = FeatureUnion([("char", vec1), ("word", vec2)])
vec_tfidf.fit(df['Url'])
Url_tfidf = vec_tfidf.transform(df['Url'])
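To make the TF-IDF weighting concrete, here is a small hand computation in plain Python. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that TfidfVectorizer documents as its defaults; the documents and helper names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical cleaned URLs, made up for illustration.
docs = [
    "www example edu thesis",
    "www example edu library",
    "www forum com thread",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term occurs in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first document, thesis occurs nowhere else and gets the largest weight, while www occurs in every document and gets the smallest. This is how TF-IDF plays down tokens that are common to almost all URLs.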

Again, let’s build the same model and see whether these new features improve the score or leave it unchanged.

nb = cv_score(MultinomialNB(alpha=.01), Url_tfidf)

The scores have again improved significantly. So far, we have built Naive Bayes models on the different feature sets we created.

Let’s try a few different models. First, logistic regression, which takes the same features we just created with the TF-IDF vectorizers.

Logistic Regression

We are getting better performance from the TF-IDF features, so let’s keep them and try logistic regression. Here, I have set class_weight to balanced. This weights the samples inversely proportional to their class frequency, meaning that classes with fewer samples get more weight.
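The balanced setting can be illustrated with the heuristic Sklearn documents for it, n_samples / (n_classes * class_count). The sketch below applies it in plain Python to made-up labels; the class sizes are illustrative only.

```python
from collections import Counter

# Toy labels, made up for illustration: one frequent class and two rarer ones.
labels = ["others"] * 6 + ["news"] * 3 + ["thesis"] * 1

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Balanced heuristic: n_samples / (n_classes * class_count)
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}
print(class_weights)
```

The rarest class, thesis, receives the largest weight, so a mistake on a thesis URL costs the model more than a mistake on a URL from the class others.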

log_reg = cv_score(LogisticRegression(C=0.1,class_weight="balanced"), Url_tfidf)

In this case, the logistic regression model has outperformed our Naive Bayes models. For the first few folds, the scores are similar to those of the Naive Bayes models, but for the other two folds they are higher, with a best score of 0.8.

Let’s now try a tree-based model.

Decision Tree

Now, we will try tree-based methods to check performance. Here I’m building a decision tree model on the same set of features.

dtree = cv_score(DecisionTreeClassifier(min_samples_leaf=25,
             min_samples_split=25), Url_tfidf)

It looks like the decision tree model hasn’t performed well on this data set and has not been able to classify the URLs very well.

Let’s give the random forest model a try.

Random Forest

Let’s see whether a random forest model can classify our URLs better using the same features. We set the number of estimators to 100 and the maximum depth to 50.

rf_params = {'random_state': 0, 'n_jobs': -1, 'n_estimators': 100,
                'max_depth': 50}
rf = cv_score(RandomForestClassifier(**rf_params), Url_tfidf)

The random forest has shown a performance similar to that of the decision tree model. Although the performance has improved slightly, we still cannot say that the random forest has performed well on this data set.

So let’s compare the performance of all the models we have built so far.

results_df = pd.DataFrame({'Random Forest':rf, 'Decision Tree': dtree, 
                'Logistic Regression': log_reg, 'Naive Bayes':nb})
results_df.plot(y=["Random Forest", "Decision Tree","Logistic Regression",
                "Naive Bayes"], kind="bar")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Here, each model is shown in a different color, and on the x-axis we have the folds.

On the y-axis, we have the weighted F1 scores. We can clearly see that logistic regression has performed best in all the folds. The next best is Naive Bayes, followed by the random forest. We can also add more features to improve the model’s performance.

Conclusion

A web page is a document that is accessible through the World Wide Web and typically contains text, images, and other multimedia content, along with interactive elements that allow users to interact with the page. Web pages play a central role in the World Wide Web and are used for various purposes.

Classification of web pages is an important part of web mining, because the first step of web mining is to sort web pages into different classes.

Web page classification uses machine learning algorithms like decision trees, random forests, support vector machines, and neural networks. These algorithms take as input features extracted from the web pages and learn to map these features to class labels based on a training dataset.

Web page classification is a supervised learning problem that categorizes a web page into predefined categories based on labeled training data. We observed the following key points while building a web page classifier with different machine learning algorithms:

  • Logistic Regression gives the best performance.
  • Interestingly, tree-based methods perform badly.
  • The scores are unstable across folds due to the many classes and few samples.

The accuracy of web page classification depends on the quality and diversity of the training data and the choice of the machine learning algorithm. Word clouds can be used to gain insight into the common words used in the URLs of a set of web pages and can be useful for understanding the most common themes or topics.

If you want to read my previous blogs, you can read Previous Data Science Blog posts here. Connect with me on Linkedin.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Responses From Readers

elena 26 Sep, 2023

Hello in this article you say "We have the data set saved as a web page data.csv file". how i can collect web page data automatically?
