Mining Deep Web Repositories
@ ECML-PKDD 2012

Gautam Das, University of Texas at Arlington and Qatar Computing Research Institute
Nan Zhang, George Washington University

Abstract: With the proliferation of online repositories (e.g., databases or document corpora) hidden behind proprietary web interfaces, such as keyword- or form-based search and hierarchical or graph-based browsing interfaces, efficient ways of enabling machine learning and data mining tasks over the contents of such hidden repositories are of increasing importance. There are two key challenges: properly understanding the interfaces, and learning or mining over a properly understood interface. There are three ways to enable efficient machine learning and data mining over deep web data: (1) crawling the deep web repository before applying conventional mining techniques; (2) sampling the deep web repository and then learning from or mining the retrieved samples, at the expense of additional error introduced by sampling; and (3) estimating aggregates over slices of data in the deep web repository, and then using the estimated aggregates to support machine learning or data mining tasks. In this tutorial, we focus on the fundamental developments in the field, including web interface understanding, crawling, sampling, and aggregate estimation over web repositories with various types of interfaces and containing structured or unstructured data. We also discuss the changes that machine learning and data mining algorithms may require should one choose the second or third approach described above. Our goal is to encourage the audience to initiate their own research in these exciting areas.