Please use this identifier to cite or link to this item: http://dspace.vgtu.lt/handle/1/1730

Title: Unsupervised structured data extraction from template-generated web pages
Authors: Grigalis, Tomas
Čenys, Antanas
Keywords: Deep Web
Data extraction
Structured web data
Wrapper induction
Issue Date: 2014
Publisher: Graz University of Technology
Citation: Grigalis, T.; Čenys, A. 2014. Unsupervised structured data extraction from template-generated web pages, Journal of Universal Computer Science (J.UCS) 20(2): 169-192
Series/Report no.: Vol 20; iss. 2 (2014)
Abstract: This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This encourages industry and academia to seek automatic solutions. To tackle the problem of automatic structured Web data extraction, we present a new approach: structured data extraction based on clustering visually similar Web page elements. Our method, called ClustVX, combines visual and pure HTML features of a Web page to cluster visually similar Web page elements and then extract structured Web data. ClustVX can extract structured data from Web pages where more than one data record is present. With extensive experimental evaluation on three benchmark datasets, we demonstrate that ClustVX achieves better results than other state-of-the-art automatic structured Web data extraction methods.
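
The idea of clustering visually similar page elements can be illustrated with a minimal sketch. It is not the ClustVX algorithm itself: it assumes the page has already been rendered and each element reduced to a dictionary holding an HTML tag path and a few visual properties. The field names (`tag_path`, `font_size`, `font_weight`, `color`), the exact-match grouping key, and the position-based alignment of clusters into records are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_elements(elements):
    """Group elements that share the same tag path and visual style.

    The grouping key is a hypothetical stand-in for a combined
    "HTML + visual" feature; a real system would derive it from
    a rendered page.
    """
    clusters = defaultdict(list)
    for el in elements:
        key = (el["tag_path"], el["font_size"], el["font_weight"], el["color"])
        clusters[key].append(el["text"])
    # Keep only keys that repeat: a cluster of one cannot come from
    # a page that holds more than one data record.
    return {k: v for k, v in clusters.items() if len(v) > 1}


# Hypothetical elements from a product listing page, in document order.
elements = [
    {"tag_path": "html/body/div/h2", "font_size": 18, "font_weight": "bold",
     "color": "#000", "text": "Camera A"},
    {"tag_path": "html/body/div/span", "font_size": 12, "font_weight": "normal",
     "color": "#c00", "text": "$199"},
    {"tag_path": "html/body/div/h2", "font_size": 18, "font_weight": "bold",
     "color": "#000", "text": "Camera B"},
    {"tag_path": "html/body/div/span", "font_size": 12, "font_weight": "normal",
     "color": "#c00", "text": "$249"},
    {"tag_path": "html/body/p", "font_size": 10, "font_weight": "normal",
     "color": "#666", "text": "About us"},
]

clusters = cluster_elements(elements)

# Assumption: the i-th item of every cluster belongs to the i-th record.
records = list(zip(*clusters.values()))
print(records)
```

Each surviving cluster acts as one column of the extracted table, and the singleton "About us" paragraph is dropped as non-repeating content. Exact key matching is the simplest possible similarity test; a tolerance-based comparison of visual features would be a natural refinement.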
URI: http://dspace1.vgtu.lt/handle/1/1730
ISSN: 0948-695X
Appears in Collections: Moksliniai straipsniai / Research articles

Files in This Item:

File: j.ucs_Vol20_Iss2_169-192_grigaitis.pdf
Size: 411.37 kB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
