To complete my Master of Science in Artificial Intelligence (2006), I wrote the thesis 'An analysis of the tree-edit-distance for wrapper induction of HTML-trees' (pdf in Dutch).

Abstract:

This paper discusses how the tree-edit-distance may be used for the problem of wrapper induction. The tree-edit-distance is used to find a minimal-cost mapping between the tree representations of HTML pages. From this mapping a template may be constructed in which only the elements common to the trees are kept, while the parts specific to each tree are represented as wildcards. The template is thus the most specific generalization of the pages and may be used to recognize other pages of the same semantic type. Applying the template to similar pages retrieves the instance-specific information.

This paper shows that the domain of automatically generated HTML pages has a number of characteristics that allow the tree-edit-distance to be approximated and computed faster. A number of post-processing steps are considered to make the templates more compact and to protect them from overfitting. It is found that pruning always gives good results.
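The template construction described in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm from the thesis: instead of the general tree-edit-distance it uses a simpler, restricted top-down alignment in which nodes are matched only when their labels and their parents match. The tree encoding as `(label, children)` tuples, the `WILDCARD` marker, and the function names `common`, `align` and `template` are assumptions made for this example.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, tuple_of_children); text is a leaf label.
WILDCARD = ("*", ())  # hypothetical marker for page-specific content


@lru_cache(maxsize=None)
def common(a, b):
    """Count the nodes kept by a top-down alignment of trees a and b."""
    if a[0] != b[0]:
        return 0
    score, _ = align(a[1], b[1])
    return 1 + score


def align(xs, ys):
    """Align two child sequences; return (kept node count, matched index pairs)."""
    n, m = len(xs), len(ys)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j], best[i][j + 1],
                             common(xs[i], ys[j]) + best[i + 1][j + 1])
    # Trace back through the table to recover which children were matched.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        c = common(xs[i], ys[j])
        if c > 0 and best[i][j] == c + best[i + 1][j + 1]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif best[i][j] == best[i + 1][j]:
            i += 1
        else:
            j += 1
    return best[0][0], pairs


def template(a, b):
    """Keep what a and b share; replace page-specific parts with a wildcard."""
    if a[0] != b[0]:
        return WILDCARD
    xs, ys = a[1], b[1]
    _, pairs = align(xs, ys)
    kids, i, j = [], 0, 0
    for pi, pj in pairs:
        if pi > i or pj > j:  # unmatched children before this matched pair
            kids.append(WILDCARD)
        kids.append(template(xs[pi], ys[pj]))
        i, j = pi + 1, pj + 1
    if i < len(xs) or j < len(ys):  # unmatched children after the last pair
        kids.append(WILDCARD)
    return (a[0], tuple(kids))


page_a = ("html", (("body", (("h1", (("Alice", ()),)),
                             ("p", (("likes trees", ()),)))),))
page_b = ("html", (("body", (("h1", (("Bob", ()),)),
                             ("div", (("ad", ()),)),
                             ("p", (("likes graphs", ()),)))),))

result = template(page_a, page_b)
print(result)
```

For these two toy pages the resulting template keeps `html`, `body`, `h1` and `p`, and puts a wildcard where the heading text, the extra `div` and the paragraph text differ. Matching the template against a third page of the same type would then bind each wildcard to that page's instance-specific content.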

 

