admin 管理员组文章数量: 1087139
2023年12月19日发(作者:海淀html5培训)
Fast Webpage Classification Using URL Features
Min-Yen Kan Hoang Oanh Nguyen Thi
Department of Computer Science, School of Computing
3 Science Drive 2, Singapore 117543
{kanmy,nguyent6}@
ABSTRACT
We demonstrate the usefulness of the uniform resource locator
(URL) alone in performing web page classification. This approach
is faster than typical web page classification, as the pages do not
have to be fetched and analyzed. Our approach segments the URL
into meaningful chunks and adds component, sequential and
orthographic features to model salient patterns. The resulting
features are used in supervised maximum entropy modeling. We
analyze our approach's effectiveness on two standardized
domains. Our results show that in certain scenarios, URL-based
methods approach the performance of current state-of-the-art full-text and link-based methods.
Categories and Subject Descriptors: H.3.1 [Information
Storage and Retrieval]: Content Analysis and Indexing –
Linguistic Processing
General Terms: Algorithms, Experimentation.
Keywords: Uniform resource locator, webpage classification.
2. FEATURE INVENTORY
In our system, a baseline segmentation using whitespace
delimitation and case change [B] is augmented with entropy-based segmentation [S]. We distill the following features from
the URL and discuss why they contribute to better accuracy:
• URL Components [C] + Length [L]: A token that occurs in
different parts of URLs may contribute differently to
classification ( vs. /protocols/).
The absence of certain components can influence classification.
URLs that underlie an advertising image often have a long
query component. URL length may also influence
classification. For example, departmental staff listings are
usually not deeply nested, while software drivers usually are.
As such, token inventories per component and component
lengths are added as features (Rows 2 and 3 in Table 1).
• Orthographic [O]: Using the surface form of a token also
presents challenges for generalization. For example, tokens
2002 and 2003 are distinct tokens. We add orthographic
features that generalize tokens with capitalized letters and/or
numbers that differentiate these tokens by their length.
• Sequential n-grams [N] and Precedence [P]: URL trees [5]
showed that token sequences are effective in classification. In
their work, a tree rooted at the leftmost token (usually http) is
created, in which successive tokens are inserted as children. A
tree structure emerges after many URLs are inserted. The
intuition is that subtrees have similar classification. However,
this approach does not generalize over multiple websites, as
websites appear as separate subtrees. Recurring patterns in
different websites are not captured.
We note it is the sequential order of nodes (from general to
specific) in the URL tree that is of import, and not the rooted
path. We thus reverse the order the components within the
server hostname, as hostnames are written specific-to-general.
We use n-gram token sequences on the resulting URL to
capture this phenomenon. Furthermore, consider the sequences
states georgia cities atlanta and georgia altanta. Modeling
these as token sequences fails to capture their similarity; as the
precedence between georgia and atlanta are missed. We can
capture this by introducing features that model left-to-right
precedence between tokens (rows 5 and 6).
1. INTRODUCTION
Current webpage classification techniques use a variety of
information to classify a target page: the text of the page itself, its
hyperlink structure, the link structure and anchor text from pages
pointing to the target page and its uniform resource locator
(URL). Of this information, a web page's URL is the least
expensive to obtain and one of the more informative sources with
respect to classification. As the URL is short, ubiquitous (all web
pages, whether or not they are accessible or even exist, have
URLs) and is largely content-bearing, it seems logical to expend
more effort in making full use of this resource.
We approach this problem by considering a classifier that is
restricted to using the URL as the sole source of input. Such a
classifier is of interest as it is magnitudes faster than traditional
approaches as it does not require fetching pages or parsing the
text. Our implementation uses a two-step approach, in which a
URL is first segmented into meaningful tokens, which are then
analyzed as features for classification. We use a recursive, entropy
reduction based technique to derive tokens from the URL for the
first step. We focus here on the second step: deriving useful
features for suitable for classification. A more complete report of
these experiments and others are discussed in [2]. These features
model sequential dependencies between tokens, their orthographic
patterns, length, and originating URI component. A key result is
that the combination of quality URL segmentation and feature
extraction results in a significant improvement in classification
accuracy over baseline approaches.
3. EVALUATION
Here we apply maximum entropy (ME) based learning [1] on our
features on two standardized datasets. We show that our approach
compares to or outperforms previous methods, even when only
using the URL alone. All results here reported in this section are
significant at the 95% confidence level, largely due to the size of
the datasets (which are noted in the table captions).
Copyright is held by the author/owner(s).
CIKM’05, October 31-November 5, 2005, Bremen, Germany.
ACM 1-59593-140-6/05/0010.
325
Feature Class (class tag)
0. Baseline (B)
1. Entropy Reduction (S)
2. URI Components (C)
3. Length (L)
4. Orthographic (O)
5. Sequential n-grams (N)
6. Precedence Bigram (P)
/services/?source=cnn&id=203&value=hurricane+isabel
http audience cnn com services activatealert jsp source cnn id 203 value hurricane Isabel
http audience cnn com services activate alert jsp source cnn id 203 value hurricane isabel
scheme:http extHost:audience dn:cnn tld:com ABSENT:port path:services … ABSENT:fragment
chars:total:42 segs:total:8 chars:scheme:4 segs:scheme:1 chars:extHost:8 segs: segs:extn:1
Numeric:3 Numeric:queryVal:3
com|cnn cnn|audience audience|services services|activate cnn|audience|services ... services|activate|alert|jsp
com>services com>activate com>alert com>jsp cnn>services cnn>activate cnn>alert cnn>jsp ... activate>jsp
Table 1: URL feature classes and examples.
In link recommendation, the goal is to build a classifier to
recommend useful links given a current webpage in a browser. In
[5][5], such a dataset was created from 176 users that examined
five news web pages. We follow their published experimental
procedure to extend their experiments.
Table 2 shows the results of the experiment in which the classifier
recommends the best 1, 3, 5, or 10 links on a page with the
highest probability of similarity to user clicks on the training
pages. The specialized tree learning algorithm (row TL-URL)
using their URL features performs best at recommending the
single most probable link, but is outperformed on the top 3, 5 and
10 metrics. Better classification is achieved by better URL feature
extraction, and outweighs the gains made by using a specialized
learner. This is exemplified in rows 2 and 3, where the same
learner is used (Support Vector Machines (SVM), with a linear
kernel) but using different features.
Table 2: Number of correct recommendations (2 classes, 182K
data points). Bolded numbers indicate top performers.
Learner Configuration
TL-URL (from [5])
SVM-URL (from [5])
SVM (our features)
ME (our features)
Top 1
385
308
363
365
Top 3
839
996
Top 5
1268
1456
Top 10
1953
2412
979 1388 2149
Table 3: WebKB performance (4 classes, 4.1K data points).
Past results re-printed on top half. ‘NR’ = not reported.
Learner Configurations [Cite]
SVM w/ URL (Kan, [3])
SVM w/ Full Text (Sun et al.[7])
SVM w/ Anchor Text (also [7])
ME w/ our URL features
ME w/ Full Text
ME w/ Full Text + our URL features
Accuracy
--
--
--
76.18%
78.39%
80.98%
Macro F1
.338
.492
.582
--
.525
.603
.627
ME w/ Full Text (Nigam et al. [4]) 92.08%
4. CONCLUSIONS AND FUTURE WORK
Given that the URL is a ubiquitous feature of web pages, we study
how they can be maximally leveraged for classification tasks. In
this report, we concentrate on URL feature extraction and have
added features to model URL component length, content,
orthography, token sequence and precedence. A full disclosure of
these experiments and per-feature class analyses are reported in
[2]. Results indicate that these extra features significantly improve
over existing URL features and suggest that they may also
improve full text methods.
1100 1682 2775
5. REFERENCES
[1] A. L. Berger, S. D. Pietra, and V. J. D. Pietra. A maximum
entropy approach to natural language processing.
Computational Linguistics, 22(1):39-71, 1996.
[2] M.-Y. Kan and H. O. Nguyen Thi. Fast Webpage
Classification Using URL Features. NUS Tech. Rpt. TRC
8/05.
[3] M.-Y. Kan. Web page classification without the web page.
In Proc. of WWW ‘04, 2004. Poster paper.
[4] K. Nigam, J. Lafferty, and A. McCallum. Using maximum
entropy for text classification. In IJCAI-99 Workshop on
Machine Learning for Information Filtering, 1999.
[5] L. K. Shih and D. Karger. Using URLs and table layout for
web classification tasks. In Proc. of WWW ‘04, 2004.
[6] S. Slattery and M. Craven. Combining statistical and
relational methods for learning in hypertext domains. In 8th
Int'l Conf. on Inductive Logic Programming, 1998.
[7] A. Sun, E.-P. Lim, and W.-K. Ng. Web classification using
support vector machine. In 4th Int'l Workshop on Web
Information and Data Management (WIDM 2002), 2002.
For multi-class classification, we employ the standardized subset
of the WebKB corpus (ILP 98 [6]), in which each page is also
associated with its anchor text. The task is identical to earlier
published experiments: pages are classified as student, faculty,
course and project pages, and leave-one-university-out cross-validation is done. Previous work using the full text have
employed SVMs [7], maximum entropy [4], and inductive logic
programming [6]. Our results are shown alongside past results.
Performance is measured both by instance accuracy and macro F1,
as both metrics are used previously. Our new URL features
perform very well, boosting performance over URL-only previous
work by over 30% in the best case, resulting in 76% accuracy.
This is impressive as our URL-only method achieve about 95% of
the performance of full text methods. Also, our URL features can
supplement full text methods, as a small performance gain is
observed when the two methods are combined.
Note that our experiment show a best performance of ~78%
accuracy using full text in contrast with [4] which showed 92%
accuracy. The difference may be due to our use of leave-one-university-out cross validation, which we feel is more fair.
326
版权声明:本文标题:Fast Webpage Classification Using URL Features 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://roclinux.cn/b/1702951815a437156.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论