Abstract: Personalized web page recommendation is much needed these
days. Web page recommendation systems are generally implemented in Web
servers. They use data obtained implicitly, as a collection of the users'
Web browsing patterns, to recommend web pages. The existing system collects
the Web logs, generates clusters of similar users and recommends pages to the
user by analyzing the logs online. However, the time complexity of this online
analysis is high. To reduce it and to improve the correctness of the
recommendations, we propose applying a Firefly based algorithm for recommending
Web pages along with Naïve Bayes clustering. The Web logs are clustered offline
using the Naïve Bayes clustering technique. A Firefly algorithm based
similarity measure is used to find the similarity between the active user's
query and the other users in the cluster. The proposed approach uses probability
based clustering, which eliminates outlier records while forming clusters.
The Firefly algorithm searches the web logs present in the cluster of the
active user and recommends the top pages. Because the Firefly algorithm uses
time efficiently, it can be used for online processing. Once pages are
obtained, they are ranked and the top pages most relevant to the query are
recommended. The efficiency of the proposed system can be evaluated using
measures such as precision, recall score, Matthews correlation coefficient and
fallout rate. The proposed approach is expected to improve time utilization in
the online process and to recommend more accurate web pages.

Introduction: A Web page recommendation system is a sub-domain of recommendation
systems that recommends a set of Web pages to users based on their past
browsing patterns. It does so by applying mining techniques to data previously
gathered from the users, which in turn discover and extract information from
Web documents and services. The major concern is finding the most accurate
recommendation algorithms. A recommendation system typically produces its
results in one of two ways: collaborative filtering or content based
filtering.

A.    Collaborative Filtering

Most recommendation systems make wide use of collaborative filtering for
recommending items. This method relies on collecting and processing information
on users' behaviors or activities and then predicting items based on their
similarity to other users. Collaborative filtering approaches build a model
from a user's past behavior and the decisions of other similar users. This
model is then used to predict items that the user may have an interest in.
Since collaborative filtering does not rely on machine analyzable content, it
is capable of accurately recommending complex items without “understanding”
the item itself.
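
The idea of predicting from similar users can be sketched as follows. This is a minimal illustration, not the system described in this paper: the rating dictionaries, the choice of cosine similarity and the function names are all assumptions made for the example.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = cosine_similarity(ratings[user], theirs)
        num += sim * theirs[item]
        den += sim
    return num / den if den else None

# Hypothetical ratings: u1 has not rated page_c yet.
ratings = {
    "u1": {"page_a": 5, "page_b": 3},
    "u2": {"page_a": 5, "page_b": 3, "page_c": 4},
    "u3": {"page_a": 1, "page_c": 2},
}
print(predict(ratings, "u1", "page_c"))
```

Because u1 resembles u2 more than u3, the prediction lands closer to u2's rating, without any inspection of the page content itself.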

B.    Content Based Filtering

Content based filtering is another common approach when designing
recommendation systems. This technique is based on a description of the item
and a profile of the user's preferences. In a content based recommendation
system, keywords are used to represent the user's interests. Content based
filtering approaches use a set of distinct properties of an item in order to
find and recommend items with similar properties. These algorithms try to
recommend items similar to those that a user liked in the past or is examining
in the present. In general, various candidate items are compared with items
previously rated by the user and the best matching items are recommended.
Collaborative and content based approaches are often combined as Hybrid
Recommendation Systems.
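
A small sketch of matching candidate items against a keyword profile is given below. It assumes a bag-of-words representation and cosine similarity, neither of which is prescribed by the text; the page names and texts are made up for the example.

```python
from collections import Counter
from math import sqrt

def keyword_vector(text):
    """Bag-of-words keyword counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(profile_text, candidates, k=2):
    """Return up to k candidate names whose keywords best match the profile."""
    profile = keyword_vector(profile_text)
    scored = [(cosine(profile, keyword_vector(text)), name)
              for name, text in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]

# Hypothetical candidate pages described by their keywords.
candidates = {
    "p1": "firefly algorithm optimization",
    "p2": "cooking pasta recipes",
    "p3": "firefly swarm optimization tutorial",
}
print(recommend("firefly optimization", candidates))
```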

                                                                                                                                           
II.    Literature survey

 

Recommendation systems play a vital role in recommending personalized items to
the users of a web service based on their interests. The web contains rich and
dynamic information. The amount of information on the web is growing rapidly,
as are the number of web sites and the number of web pages per web site.
Predicting the needs of a web user as she visits web sites has gained
importance. Many web page recommendation systems have been developed in the
past; since they compute recommendations online, their time utilization should
be efficient.

A system [4] that uses a support vector machine (SVM) learning based model was
developed for computing the similarity between two items; it performed better
than the latent factor approach for group recommendations. Since a matrix
representation was followed, the data sparsity problem was solved. However, the
system was not able to scale stably when the size of the group increased
dynamically.

Hybrid recommender systems that combine two or more recommendation techniques
were designed [5]. They aim to eliminate the weaknesses that exist when only
one recommender system is used. There are several ways in which the systems can
be combined, such as a weighted hybrid recommender, where the score of a
recommended item is computed from the results of all of the recommendation
techniques present in the system. However, data sparseness was still a problem:
the system may generate weak recommendations if few users have rated the same
items, and the system does not overcome the cold start problem.

Hyperspectral sensors can acquire hundreds of contiguous bands over a wide
electromagnetic spectrum for each pixel. The rich spectral information allows
materials with subtle spectral discrepancies to be distinguished, but it
usually leads to the “curse of dimensionality”. To address this, an improved
Firefly algorithm based band selection method [8] was used. The Firefly
algorithm is an evolutionary optimization algorithm proposed by Yang [13].
After the parameters are initialized, the brightness is calculated with the
objective function (2.1), where t is the maximum number of iterations, α is the
step size and γ is the light absorption coefficient for m fireflies. The
movement states are then evaluated and the bands are selected. To avoid
employing an actual classifier within the band searching process, and thereby
greatly reduce the computational cost, criterion functions that can gauge class
separability are preferred; these provided better results. The Firefly
algorithm also converged faster even when the data size was larger.

To improve the accuracy of the similarity measure, a nature inspired algorithm
based on the behaviour of fireflies was introduced [10]. It considers separate
effects for the ratings of users with similar opinions and of users with
conflicting opinions. To generate the initial population of fireflies, half of
the population is randomly generated and the other half is also randomly
generated. Mean absolute error, the difference between the predicted rating and
the real rating, was chosen as the objective function to measure recommendation
accuracy. An optimal similarity measure obtained via a simple linear
combination of values and ratios of ratings for user-based collaborative
filtering provided better results. It increased the speed of finding the
nearest neighbours of the active user and reduced the computation time. The
similarity function based on the Firefly algorithm was simpler than the
equations used in traditional metrics; therefore, the proposed method provided
recommendations faster than traditional metrics.

Graph colouring problems are generally discrete, and algorithms for discrete
problems are quite complex. A new algorithm based on similarity, which
discretizes the Firefly algorithm directly without any other hybrid algorithm,
was developed [11]. It was adaptable to dynamic graph sizes.

A system for assigning an electronic document to one or more predefined
categories or classes based on its textual content, using an agglomerative
clustering algorithm, was developed [6]. This type of clustering, along with
the sample correlation coefficient as similarity measure, allowed a high
indexing term space reduction factor together with a gain in classification
accuracy.

In order to minimize noise and outlier data, a modified DBSCALE algorithm using
Naïve Bayes has been designed [7]. This algorithm is essentially a probability
based utility function. The function is used to estimate the outlier cluster
data and to increase the correctness rate of the algorithm for a given
threshold value. Since Naïve Bayes is a probability based function, it removes
outlier cluster data and increases the correctness rate according to the
threshold value. It also computes the maximum a posteriori hypothesis for
outlier data.

The memory based collaborative system uses matrix based computation and solves
the data sparsity problem, but the system cannot scale stably when the size of
the group increases dynamically. A hybrid system could help overcome the
scalability issue, but it still suffers from the cold start problem. To
eliminate outliers as well as to overcome the other two problems, Naïve Bayes
clustering, a probability based method, has been used in the past. The Firefly
algorithm converges faster and searches all possible subsets with better time
utilization. Thus, to design an efficient recommendation system, the Naïve
Bayes method can be used for offline clustering. Since the online time
complexity should be low, the Firefly algorithm, which is more efficient in
terms of time utilization, can be used for calculating similarity online. The
combination of these two techniques might increase the accuracy of the
recommendation system and result in efficient time utilization.

                                                                                                                   
III.   Overview of the proposed work

Initially, the web log files are obtained from America Online Inc. [1]. The log
files consist of five fields: an anonymous ID for each individual user, the
query of each user, the query time, the URL that the user clicked and its rank
in the result list. These logs are collected and grouped based on anonymous ID.
The URLs visited by all the users are obtained and their content is downloaded
and processed. The processing includes removal of stop words from the URL
content and keyword extraction. Similar users are clustered based on the
fetched keywords using the Naïve Bayes clustering technique, which provides
more efficient clusters than clustering with association rules.
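
The stop word removal and keyword extraction step might look like the sketch below. The stop word list is a tiny illustrative subset, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer such as Porter's; both are assumptions, not the components used in the proposed system.

```python
import re
from collections import Counter

# Illustrative subset only; a real system would use a full stop word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a proper stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text, top_n=5):
    """Return the most frequent stemmed, non-stop-word tokens of a page."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(stems).most_common(top_n)]

page_text = "Clustering the users and recommending links to the users"
print(extract_keywords(page_text))
```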

The created clusters are passed to the online component. In the online process,
when an active user submits a query, the keywords are extracted from the query.
The similarity between the extracted keywords and the other users in the active
user's cluster is calculated using the Firefly similarity measure. The
similarity values are sorted along with the web pages browsed by similar users
in the cluster. The top k web pages are recommended to the active user as the
result.
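
One plausible reading of the top k step is sketched here. It assumes that each page is scored by summing the similarity values of the cluster members who browsed it; the text does not fix the aggregation rule, so this choice and all names below are hypothetical.

```python
import heapq
from collections import defaultdict

def top_k_pages(similarity, pages_by_user, k=3):
    """Score pages by the summed similarity of users who browsed them."""
    scores = defaultdict(float)
    for user, sim in similarity.items():
        for page in pages_by_user.get(user, ()):
            scores[page] += sim
    # Ties are broken by page name so the result is deterministic.
    return heapq.nlargest(k, scores, key=lambda page: (scores[page], page))

# Hypothetical similarity of each cluster member to the active user.
similarity = {"u2": 0.9, "u3": 0.4, "u4": 0.1}
pages_by_user = {"u2": ["p1", "p2"], "u3": ["p2", "p3"], "u4": ["p4"]}
print(top_k_pages(similarity, pages_by_user, k=2))
```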

                                                                                                                                                
IV.   Proposed work

The proposed system follows a linear process: the web logs are first collected
and processed, similar users are then clustered with the Naïve Bayes clustering
technique, and finally recommendations are generated based on a similarity
measure derived from the Firefly algorithm.

A.    Preprocessing of Web Logs

The web logs are collected from AOL Inc. [1]. The data set consists of 20
million web queries from 650 thousand real users over 3 months. It includes
the anonymous ID, query, query time, item rank and click URL. The log file
contains a large number of users along with the web pages visited by them. It
is validated and split into one file per user based on the anonymous ID. The
content of each URL is fetched and downloaded. The text then undergoes stop
word removal and stemming, and the final keywords are extracted. Features such
as keywords, timing, frequency, click URL and revisit are derived, and the user
profile is constructed from these features taken from the user log files.
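
Splitting the log by anonymous ID can be sketched as below. The tab-separated layout with a header row, the column names and the sample rows are assumptions made for the example; only the five field meanings come from the description above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed tab-separated layout with a header row.
SAMPLE_LOG = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "101\tfirefly algorithm\t2006-03-01 10:00:00\t1\thttp://example.com/firefly\n"
    "101\tnaive bayes\t2006-03-01 10:05:00\t2\thttp://example.com/bayes\n"
    "202\tpasta recipe\t2006-03-02 18:30:00\t1\thttp://example.org/pasta\n"
    "202\tpasta recipe\t2006-03-02 18:31:00\t\t\n"
)

def group_by_user(log_text):
    """Group clicked log rows by anonymous ID, skipping rows without a click."""
    reader = csv.DictReader(io.StringIO(log_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    per_user = defaultdict(list)
    for row in reader:
        if not row["ClickURL"]:
            continue  # query without a clicked result
        per_user[row["AnonID"]].append(row)
    return dict(per_user)

users = group_by_user(SAMPLE_LOG)
for anon_id, rows in users.items():
    print(anon_id, [r["ClickURL"] for r in rows])
```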

·      Timing: The time that the user spent on a particular URL

·      Frequency: The number of times the user visited the URL

·      Clickstream: The number of clickstream links visited by the user

·      Revisit: Whether the user revisited the web page

Keywords are generated from the data fetched from the URL. The timing for each
URL is estimated from the given date and time by calculating the difference
between the timestamps of the URLs visited in a single day, subject to some
time constraints. Frequency is the number of times the user clicked the URL.
Clickstreams are the links clicked by the user for additional information. The
timing of a revisit is used to decide whether the user strongly preferred the
page or not. The information from the URL is thus collected and processed to
obtain the features of the user.
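
The timing, frequency and revisit features could be derived roughly as follows. The sketch assumes that the time spent on a URL is the gap to the user's next click on the same day, capped at 30 minutes; the cap value and the "next click" rule are assumptions standing in for the unspecified time constraints.

```python
from collections import defaultdict
from datetime import datetime

MAX_DWELL_SECONDS = 30 * 60  # assumed cap on the time credited to one visit

def build_profile(clicks):
    """clicks: list of (timestamp string, url) pairs for a single user."""
    events = sorted((datetime.strptime(t, "%Y-%m-%d %H:%M:%S"), url)
                    for t, url in clicks)
    profile = defaultdict(lambda: {"frequency": 0, "duration": 0.0})
    for idx, (when, url) in enumerate(events):
        profile[url]["frequency"] += 1
        if idx + 1 < len(events):
            next_when = events[idx + 1][0]
            if next_when.date() == when.date():  # same-day clicks only
                gap = (next_when - when).total_seconds()
                profile[url]["duration"] += min(gap, MAX_DWELL_SECONDS)
    for stats in profile.values():
        stats["revisit"] = stats["frequency"] > 1
    return dict(profile)

clicks = [
    ("2006-03-01 10:00:00", "a"),
    ("2006-03-01 10:05:00", "b"),
    ("2006-03-01 11:00:00", "a"),
    ("2006-03-02 09:00:00", "c"),
]
print(build_profile(clicks))
```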

 

B.    Naïve Bayes Clustering

Clustering, also known as unsupervised classification, is a descriptive task
with many applications. Clustering is the decomposition or partition of a data
set into groups in such a way that the objects in one group are similar to each
other but as different as possible from the objects in other groups. The three
main approaches to clustering data are partition based clustering, hierarchical
clustering and probabilistic model based clustering. Probabilistic model based
clustering is a soft clustering where an object can belong to more than one
cluster following a probability distribution. A clustering is useful if it
produces some interesting insight into the problem being analysed. Naïve Bayes
clustering is a probabilistic clustering technique based on Bayes' theorem with
a strong independence assumption between features. The feature variables can be
discrete or continuous. This probabilistic clustering relies on the nominal and
numeric variables in the data set, and its novelty lies in the use of mixtures
of truncated exponentials (MTE) densities to model the numeric variables. In
Naïve Bayes clustering the class is the only root variable and all the
attributes are conditionally independent given the class.

The clustering problem reduces to taking a data set of instances and a
previously specified number of clusters (k), and working out each cluster's
distribution and the population distribution between the clusters. The
expectation maximization (EM) algorithm is used to obtain these parameters.
Since Naïve Bayes clustering is a probability based technique, an item belongs
to a cluster if and only if it has a relation to it. This helps in eliminating
outlier data during clustering. It also provides proper clustering with fewer
computations. The given data set is divided into two parts, one for training
and the other for testing. For each record in the test and train databases, the
distribution of the class variable is computed. According to the obtained
distribution, a value for the class variable is simulated and the record is
inserted in the corresponding cluster. The log-likelihood of the new model is
computed. If it is higher than that of the initial model, the process is
repeated. Otherwise, the process is stopped and the obtained clusters are
returned.
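
The EM loop with its log-likelihood stopping rule can be illustrated with a simplified sketch. It is not the MTE based method described above: it assumes binary keyword features (a Bernoulli naive Bayes mixture), uses soft responsibilities instead of simulating the class value, and seeds the clusters from evenly spaced rows. All names are hypothetical.

```python
import math

def naive_bayes_em(rows, k, max_iter=100, tol=1e-6):
    """Soft-cluster binary feature rows with a naive Bayes mixture via EM."""
    n, d = len(rows), len(rows[0])
    weights = [1.0 / k] * k
    # Assumed initialisation: seed cluster c from row c*n//k, softened.
    theta = [[0.75 if rows[c * n // k][j] else 0.25 for j in range(d)]
             for c in range(k)]
    best_ll, resp = -math.inf, None
    for _ in range(max_iter):
        # E-step: cluster membership probabilities for every record.
        new_resp, ll = [], 0.0
        for x in rows:
            logs = [math.log(weights[c])
                    + sum(math.log(theta[c][j] if x[j] else 1.0 - theta[c][j])
                          for j in range(d))
                    for c in range(k)]
            top = max(logs)
            total = top + math.log(sum(math.exp(v - top) for v in logs))
            ll += total
            new_resp.append([math.exp(v - total) for v in logs])
        if ll <= best_ll + tol:
            break  # log-likelihood no longer improves: stop
        best_ll, resp = ll, new_resp
        # M-step with add-one smoothing so no probability reaches 0 or 1.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = (mass + 1.0) / (n + k)
            theta[c] = [(sum(r[c] * x[j] for r, x in zip(resp, rows)) + 1.0)
                        / (mass + 2.0) for j in range(d)]
    return resp, best_ll

# Hypothetical keyword-presence vectors for six users.
rows = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
resp, ll = naive_bayes_em(rows, k=2)
labels = [max(range(2), key=r.__getitem__) for r in resp]
print(labels)
```

Under this sketch, a record whose largest membership probability stays below a chosen threshold could be treated as an outlier, which is one way to read the outlier elimination mentioned above.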

 

C.    Optimisation Using Firefly Algorithm

The Firefly algorithm is an evolutionary algorithm based on the behaviour of
fireflies. Fireflies live in colonies and cooperate for the survival of the
colony. To model the behaviour of fireflies, three assumptions are generally
made: all fireflies are homogeneous; the attractiveness of each firefly is
related to its level of brightness; and the brightness of a firefly is
determined by an exponential objective function. Each firefly emits light that
attracts other fireflies. The amount of light received depends on parameters
such as the distance and the absorption coefficient of the surroundings. The
longer the distance, the less light is received. In surroundings with a high
light absorption coefficient, such as foggy weather, the intensity of the light
also decreases. Every firefly, regardless of its gender, is attracted to and
moves toward a brighter firefly. Each firefly has a light intensity of its own.
The key concept is that a firefly with low light intensity is always attracted
to a firefly with high light intensity. This concept can be incorporated for
calculating similarity. By using a Firefly based similarity measure, unique and
distinguished results can be obtained, which is a useful feature for ranking.

The algorithm can deal with highly non-linear, multi-modal optimization
problems naturally and efficiently. It does not use velocities, so it avoids
the problems associated with velocity in PSO. Its speed of convergence is high,
with a high probability of finding the global optimum. It can be flexibly
integrated with other optimization techniques to form hybrid tools. It does not
require a good initial solution to start its iteration process.
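
The attraction rule can be made concrete with a generic sketch of the standard Firefly algorithm minimizing a toy function. It is not the similarity measure proposed here; the sphere objective, the parameter values, the step size decay and the names are assumptions chosen for illustration.

```python
import math
import random

def sphere(x):
    """Toy objective: lower cost means a brighter firefly."""
    return sum(v * v for v in x)

def firefly_minimize(objective, dim=2, n=15, iterations=60,
                     alpha=0.2, beta0=1.0, gamma=1.0, bound=2.0, seed=0):
    rng = random.Random(seed)
    swarm = [[rng.uniform(-bound, bound) for _ in range(dim)]
             for _ in range(n)]
    cost = [objective(p) for p in swarm]
    best_cost = min(cost)
    best = list(swarm[cost.index(best_cost)])
    history = [best_cost]
    for _ in range(iterations):
        for i in range(n):
            for j in range(n):
                if cost[j] < cost[i]:  # firefly j is brighter than i
                    r2 = sum((a - b) ** 2
                             for a, b in zip(swarm[i], swarm[j]))
                    # Attractiveness decays with distance and absorption.
                    beta = beta0 * math.exp(-gamma * r2)
                    swarm[i] = [
                        min(bound, max(-bound, a + beta * (b - a)
                                       + alpha * (rng.random() - 0.5)))
                        for a, b in zip(swarm[i], swarm[j])
                    ]
                    cost[i] = objective(swarm[i])
                    if cost[i] < best_cost:
                        best_cost, best = cost[i], list(swarm[i])
        alpha *= 0.97  # assumed decay: shrink the random step over time
        history.append(best_cost)
    return best, best_cost, history

best, best_cost, history = firefly_minimize(sphere)
print(best, best_cost)
```

Note that no velocity term appears: each move depends only on the current positions and brightness values.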

Each web page visited by user i is considered a firefly. The number of users
who visited a particular page is taken as the light intensity of the firefly.
The objective function is formulated based on frequency and duration. Frequency
is calculated as the ratio of the number of visits to a page to the average
number of visits over all pages. Duration is the ratio of the time spent on a
page to the total time spent on all the pages visited by the user. Thus, the
objective function can be defined as in equation (5.1):

$$\mathrm{Interest}(i) = \frac{2 \cdot \mathrm{Frequency}(i) \cdot \mathrm{Duration}(i)}{\mathrm{Frequency}(i) + \mathrm{Duration}(i)} \qquad (5.1)$$
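
As a worked illustration of equation (5.1), which is the harmonic mean of the two ratios, the sketch below computes the interest of each page from visit counts and durations. The input dictionaries and the zero guards are assumptions added for the example.

```python
def interest_scores(visits, durations):
    """Interest per page following equation (5.1)."""
    mean_visits = sum(visits.values()) / len(visits)
    total_duration = sum(durations.values())
    scores = {}
    for page in visits:
        frequency = visits[page] / mean_visits
        duration = durations[page] / total_duration if total_duration else 0.0
        denom = frequency + duration
        # Guard against 0/0 when a page has neither visits nor duration.
        scores[page] = 2 * frequency * duration / denom if denom else 0.0
    return scores

# Hypothetical user: page "a" visited 3 times, page "b" once, 60 s on each.
scores = interest_scores({"a": 3, "b": 1}, {"a": 60, "b": 60})
print(scores)
```

With these numbers, Frequency is 1.5 and 0.5 while Duration is 0.5 for both pages, so the interest is 0.75 for "a" and 0.5 for "b".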

 

The interest of all users in the cluster is calculated. The pages to be
recommended are then found by applying the PageRank algorithm [2] to the
obtained result. The pages returned by the PageRank algorithm are given to the
user as the recommended web pages.
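
For reference, PageRank itself can be sketched with a short power iteration. How the link graph is built from the interest values is not specified in the text, so the graph below is purely hypothetical, and the damping factor of 0.85 is the conventional default rather than a value taken from this work.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values()
                                 for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling page: spread its rank evenly over all pages
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical link graph between three candidate pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(sorted(rank, key=rank.get, reverse=True))
```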

 

 

D.    Ranking the Web Pages

The resulting set of web pages should be ranked so that the pages in which the
user is likely to have a higher interest come first. Thus, they are sorted
based on the interest of the active user. The association rule checks the
maximum possible number of combinations, which yields more accurate pages.

E.    Recommendation Process

The URLs to be recommended are identified based on the ranking and the
similarity measure. The similarity measure is calculated among the users by
comparing their interests. From the obtained set of pages, the PageRank
algorithm is used to rank the pages most relevant to the user. The resulting
URLs are then recommended to the user, so the recommended web pages are more
relevant. The use of Naïve Bayes clustering eliminates the outliers, and the
Firefly based similarity calculation checks all the subsets of the clusters.