Fast Webpage Classification Using URL Features

Min-Yen Kan and Hoang Oanh Nguyen Thi

Department of Computer Science, School of Computing

3 Science Drive 2, Singapore 117543

{kanmy,nguyent6}@

ABSTRACT

We demonstrate the usefulness of the uniform resource locator

(URL) alone in performing web page classification. This approach

is faster than typical web page classification, as the pages do not

have to be fetched and analyzed. Our approach segments the URL

into meaningful chunks and adds component, sequential and

orthographic features to model salient patterns. The resulting

features are used in supervised maximum entropy modeling. We

analyze our approach's effectiveness on two standardized

domains. Our results show that in certain scenarios, URL-based

methods approach the performance of current state-of-the-art full-text and link-based methods.

Categories and Subject Descriptors: H.3.1 [Information

Storage and Retrieval]: Content Analysis and Indexing –

Linguistic Processing

General Terms: Algorithms, Experimentation.

Keywords: Uniform resource locator, webpage classification.

2. FEATURE INVENTORY

In our system, a baseline segmentation using whitespace

delimitation and case change [B] is augmented with entropy-based segmentation [S]. We distill the following features from

the URL and discuss why they contribute to better accuracy:

• URL Components [C] + Length [L]: A token that occurs in

different parts of URLs may contribute differently to

classification ( vs. /protocols/).

The absence of certain components can influence classification.

URLs that underlie an advertising image often have a long

query component. URL length may also influence

classification. For example, departmental staff listings are

usually not deeply nested, while software drivers usually are.

As such, token inventories per component and component

lengths are added as features (Rows 2 and 3 in Table 1).

• Orthographic [O]: Using the surface form of a token also

presents challenges for generalization. For example, 2002 and

2003 are distinct tokens, although both are four-digit numbers.

We add orthographic features that generalize tokens containing

capitalized letters and/or numbers, and that differentiate these

tokens by their length (Row 4 in Table 1).

• Sequential n-grams [N] and Precedence [P]: URL trees [5]

showed that token sequences are effective in classification. In

their work, a tree rooted at the leftmost token (usually http) is

created, in which successive tokens are inserted as children. A

tree structure emerges after many URLs are inserted. The

intuition is that subtrees have similar classification. However,

this approach does not generalize over multiple websites, as

websites appear as separate subtrees. Recurring patterns in

different websites are not captured.

We note it is the sequential order of nodes (from general to

specific) in the URL tree that is of import, and not the rooted

path. We thus reverse the order of the components within the

server hostname, as hostnames are written specific-to-general.

We use n-gram token sequences on the resulting URL to

capture this phenomenon. Furthermore, consider the sequences

states georgia cities atlanta and georgia atlanta. Modeling

these as token sequences fails to capture their similarity, as the

precedence of georgia over atlanta is missed. We can

capture this by introducing features that model left-to-right

precedence between tokens (Rows 5 and 6 in Table 1).
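The hostname reversal, n-gram, and precedence features described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `_` joiner for n-grams, the n-gram range (2 to 3), and the choice to pair every earlier token with every later token are all assumptions; only the `a>b` notation is taken from Table 1.

```python
from itertools import combinations


def reverse_hostname(hostname: str) -> list[str]:
    """Reorder hostname labels from general to specific (com, cnn, audience)."""
    return hostname.lower().split(".")[::-1]


def ngrams(tokens: list[str], n_max: int = 3) -> list[str]:
    """Contiguous token n-grams for n = 2..n_max (range is an assumption)."""
    return [
        "_".join(tokens[i:i + n])
        for n in range(2, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def precedence_bigrams(tokens: list[str]) -> list[str]:
    """One feature per ordered pair (a before b), adjacent or not."""
    return [f"{a}>{b}" for a, b in combinations(tokens, 2)]


long_path = ["states", "georgia", "cities", "atlanta"]
short_path = ["georgia", "atlanta"]

# The two sequences share no n-gram, but they do share a precedence feature.
shared_ngrams = set(ngrams(long_path)) & set(ngrams(short_path))
shared_precedence = set(precedence_bigrams(long_path)) & set(precedence_bigrams(short_path))
```

On the georgia/atlanta example from the text, `shared_ngrams` is empty while `shared_precedence` contains `georgia>atlanta`, which is the similarity the precedence features are meant to recover.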

1. INTRODUCTION

Current webpage classification techniques use a variety of

information to classify a target page: the text of the page itself, its

hyperlink structure, the link structure and anchor text from pages

pointing to the target page and its uniform resource locator

(URL). Of this information, a web page's URL is the least

expensive to obtain and one of the more informative sources with

respect to classification. As the URL is short, ubiquitous (all web

pages, whether or not they are accessible or even exist, have

URLs) and is largely content-bearing, it seems logical to expend

more effort in making full use of this resource.

We approach this problem by considering a classifier that is

restricted to using the URL as the sole source of input. Such a

classifier is of interest as it is orders of magnitude faster than traditional

approaches as it does not require fetching pages or parsing the

text. Our implementation uses a two-step approach, in which a

URL is first segmented into meaningful tokens, which are then

analyzed as features for classification. We use a recursive, entropy

reduction based technique to derive tokens from the URL for the

first step. We focus here on the second step: deriving useful

features suitable for classification. A more complete report of

these experiments and others is given in [2]. These features

model sequential dependencies between tokens, their orthographic

patterns, length, and originating URI component. A key result is

that the combination of quality URL segmentation and feature

extraction results in a significant improvement in classification

accuracy over baseline approaches.
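As an illustration of the first step, the sketch below implements only a simple baseline segmentation: it splits the URL at non-alphanumeric characters and at lower-to-upper case changes. It does not implement the recursive entropy reduction technique, whose details are not given here. The function name, the regular expressions, and the example URL are assumptions made for illustration.

```python
import re


def baseline_segment(url: str) -> list[str]:
    """Split a URL at non-alphanumeric characters and at case changes."""
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9]+", url):
        if not chunk:
            continue
        # Insert a break where a lowercase letter is followed by an uppercase one,
        # so that "staffList" becomes "staff" and "List".
        parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk).split()
        tokens.extend(part.lower() for part in parts)
    return tokens


# Hypothetical URL, used only to exercise the function.
tokens = baseline_segment("http://www.example.edu/staffList/index.html")
```

A token such as `activatealert`, which has no case change, would be left whole by this baseline; splitting it is the job of the entropy-based segmentation.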

3. EVALUATION

Here we apply maximum entropy (ME) based learning [1] on our

features on two standardized datasets. We show that our approach

is comparable to or outperforms previous methods, even when using

only the URL. All results reported in this section are

significant at the 95% confidence level, largely due to the size of

the datasets (which are noted in the table captions).
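For readers unfamiliar with maximum entropy learning, the following minimal sketch trains a conditional maximum entropy model (equivalently, multinomial logistic regression) over binary string features with plain stochastic gradient ascent. It is not the toolkit or training procedure used in the experiments; the function names, learning rate, epoch count, and toy training data are all assumptions.

```python
import math
from collections import defaultdict


def predict_proba(weights, feats):
    """Softmax over per-class sums of the weights of the active features."""
    scores = {c: sum(w.get(f, 0.0) for f in feats) for c, w in weights.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def train_maxent(examples, labels, epochs=50, lr=0.5):
    """examples: list of feature-name lists; labels: parallel list of classes."""
    weights = {c: defaultdict(float) for c in sorted(set(labels))}
    for _ in range(epochs):
        for feats, gold in zip(examples, labels):
            probs = predict_proba(weights, feats)
            for c, w in weights.items():
                # Log-likelihood gradient: observed count minus expected count.
                delta = lr * ((1.0 if c == gold else 0.0) - probs[c])
                for f in feats:
                    w[f] += delta
    return weights


# Toy, hypothetical training data in the style of the Table 1 feature strings.
examples = [
    ["path:staff", "segs:path:1"],
    ["path:faculty", "segs:path:1"],
    ["path:drivers", "segs:path:4"],
    ["path:download", "segs:path:4"],
]
labels = ["staff", "staff", "software", "software"]
model = train_maxent(examples, labels)
probs = predict_proba(model, ["path:drivers", "segs:path:4"])
```

Because every feature class in Section 2 is expressed as a string-valued indicator, all of them can be fed to such a learner without further encoding.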

CIKM’05, October 31-November 5, 2005, Bremen, Germany.

ACM 1-59593-140-6/05/0010.

URL: /services/?source=cnn&id=203&value=hurricane+isabel

Feature Class (class tag)    Example
0. Baseline (B)              http audience cnn com services activatealert jsp source cnn id 203 value hurricane isabel
1. Entropy Reduction (S)     http audience cnn com services activate alert jsp source cnn id 203 value hurricane isabel
2. URI Components (C)        scheme:http extHost:audience dn:cnn tld:com ABSENT:port path:services … ABSENT:fragment
3. Length (L)                chars:total:42 segs:total:8 chars:scheme:4 segs:scheme:1 chars:extHost:8 segs: segs:extn:1
4. Orthographic (O)          Numeric:3 Numeric:queryVal:3
5. Sequential n-grams (N)
6. Precedence Bigram (P)     com>services com>activate com>alert com>jsp cnn>services cnn>activate cnn>alert cnn>jsp ... activate>jsp

Table 1: URL feature classes and examples.
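A rough sketch of how component, length, and orthographic features in the style of Table 1 might be derived is given below. It relies on Python's standard `urllib.parse.urlsplit`. The component names used here (`host`, `path`, `query`, and so on) are coarser than the `extHost`/`dn`/`tld` split shown in the table, and the `Capitalized` tag, the function names, and the example URL are assumptions rather than the authors' feature set.

```python
import re
from urllib.parse import urlsplit


def component_features(url: str) -> list[str]:
    """Token, presence, and length features per URL component."""
    parts = urlsplit(url)
    components = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "port": str(parts.port) if parts.port else "",
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    feats = []
    for name, value in components.items():
        tokens = [t for t in re.split(r"[^A-Za-z0-9]+", value) if t]
        if not tokens:
            feats.append(f"ABSENT:{name}")  # a missing component is itself a feature
            continue
        feats.extend(f"{name}:{t.lower()}" for t in tokens)
        feats.append(f"chars:{name}:{sum(len(t) for t in tokens)}")
        feats.append(f"segs:{name}:{len(tokens)}")
    return feats


def orthographic_feature(token: str):
    """Generalize a token to its shape plus length, or None if it has no special shape."""
    if token.isdigit():
        return f"Numeric:{len(token)}"
    if any(ch.isupper() for ch in token):
        return f"Capitalized:{len(token)}"  # hypothetical tag name
    return None


# Hypothetical URL, used only to exercise the functions.
feats = component_features("http://www.example.edu/drivers/v2/setup.exe?id=203")
```

With this encoding, `2002` and `2003` both map to `Numeric:4`, so a model can generalize over years it has never seen as surface tokens.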

In link recommendation, the goal is to build a classifier to

recommend useful links given a current webpage in a browser. In

[5], such a dataset was created from 176 users who examined

five news web pages. We follow their published experimental

procedure to extend their experiments.

Table 2 shows the results of the experiment in which the classifier

recommends the best 1, 3, 5, or 10 links on a page with the

highest probability of similarity to user clicks on the training

pages. The specialized tree learning algorithm (row TL-URL)

using their URL features performs best at recommending the

single most probable link, but is outperformed on the top 3, 5 and

10 metrics. Better URL feature extraction yields better

classification, and its gains outweigh those of using a specialized

learner. This is exemplified in rows 2 and 3, where the same

learner is used (Support Vector Machines (SVM), with a linear

kernel) but using different features.
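The top-k evaluation can be sketched as follows, assuming that each link on a page has a predicted probability and that a recommendation counts as correct when the recommended link is among those the user clicked. This reading of "correct recommendation", the function names, and the toy data are assumptions; the published procedure in [5] should be consulted for the exact protocol.

```python
def top_k_links(link_probs: dict, k: int) -> list:
    """Return the k links with the highest predicted probability."""
    return sorted(link_probs, key=link_probs.get, reverse=True)[:k]


def count_correct(pages: list, k: int) -> int:
    """pages: list of (link_probs, clicked_links) pairs, one per page view."""
    return sum(
        len(set(top_k_links(link_probs, k)) & clicked)
        for link_probs, clicked in pages
    )


# Toy, hypothetical page with four links, two of which were clicked.
page = ({"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.1}, {"c", "d"})
correct_at_1 = count_correct([page], 1)
correct_at_3 = count_correct([page], 3)
```

Summing such counts over all test pages for k = 1, 3, 5, and 10 would give one row of figures in the style of Table 2.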

Table 2: Number of correct recommendations (2 classes, 182K

data points). An asterisk (*) marks the top performer in each column.

Learner Configuration    Top 1    Top 3    Top 5    Top 10
TL-URL (from [5])        385*     979      1388     2149
SVM-URL (from [5])       308      839      1268     1953
SVM (our features)       363      996      1456     2412
ME (our features)        365      1100*    1682*    2775*

Table 3: WebKB performance (4 classes, 4.1K data points).

Past results are reprinted in the top half. 'NR' = not reported.

Learner Configuration [Cite]          Accuracy    Macro F1
SVM w/ URL (Kan, [3])                 NR          .338
SVM w/ Full Text (Sun et al. [7])     NR          .492
SVM w/ Anchor Text (also [7])         NR          .582
ME w/ Full Text (Nigam et al. [4])    92.08%      NR

ME w/ our URL features                76.18%      .525
ME w/ Full Text                       78.39%      .603
ME w/ Full Text + our URL features    80.98%      .627

4. CONCLUSIONS AND FUTURE WORK

Given that the URL is a ubiquitous feature of web pages, we study

how it can be maximally leveraged for classification tasks. In

this report, we concentrate on URL feature extraction and have

added features to model URL component length, content,

orthography, token sequence and precedence. A full account of

these experiments, with per-feature-class analyses, is reported in

[2]. Results indicate that these extra features significantly improve

over existing URL features and suggest that they may also

improve full text methods.

5. REFERENCES

[1] A. L. Berger, S. D. Pietra, and V. J. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.

[2] M.-Y. Kan and H. O. Nguyen Thi. Fast Webpage Classification Using URL Features. NUS Tech. Rpt. TRC 8/05.

[3] M.-Y. Kan. Web page classification without the web page. In Proc. of WWW '04, 2004. Poster paper.

[4] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.

[5] L. K. Shih and D. Karger. Using URLs and table layout for web classification tasks. In Proc. of WWW '04, 2004.

[6] S. Slattery and M. Craven. Combining statistical and relational methods for learning in hypertext domains. In 8th Int'l Conf. on Inductive Logic Programming, 1998.

[7] A. Sun, E.-P. Lim, and W.-K. Ng. Web classification using support vector machine. In 4th Int'l Workshop on Web Information and Data Management (WIDM 2002), 2002.

For multi-class classification, we employ the standardized subset

of the WebKB corpus (ILP 98 [6]), in which each page is also

associated with its anchor text. The task is identical to earlier

published experiments: pages are classified as student, faculty,

course and project pages, and leave-one-university-out cross-validation is done. Previous work using the full text has

employed SVMs [7], maximum entropy [4], and inductive logic

programming [6]. Our results are shown alongside past results.

Performance is measured both by instance accuracy and macro F1,

as both metrics were used previously. Our new URL features

perform very well, boosting performance over URL-only previous

work by over 30% in the best case, resulting in 76% accuracy.

This is impressive as our URL-only method achieves about 95% of

the performance of full text methods. Also, our URL features can

supplement full text methods, as a small performance gain is

observed when the two methods are combined.

Note that our experiments show a best performance of ~78%

accuracy using full text, in contrast with [4], which showed 92%

accuracy. The difference may be due to our use of leave-one-university-out cross-validation, which we feel is fairer.
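The evaluation protocol can be sketched as below: each university is held out in turn as the test set, and performance is summarized by instance accuracy and by macro F1 (the unweighted mean of per-class F1). The function names and the toy labels are assumptions for illustration; the university names simply stand in for group identifiers.

```python
def leave_one_group_out(groups: list):
    """Yield (held_out, train_indices, test_indices), one split per distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def accuracy(gold: list, pred: list) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: list, pred: list) -> float:
    """Average the per-class F1 scores, weighting every class equally."""
    scores = []
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Toy, hypothetical labels and group assignments.
gold = ["student", "student", "faculty", "course"]
pred = ["student", "faculty", "faculty", "course"]
splits = list(leave_one_group_out(["cornell", "texas", "cornell", "wisconsin"]))
```

Holding out a whole university prevents site-specific tokens seen in training from reappearing at test time, which is one plausible reason such a protocol yields lower figures than a random split.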
