Wrapper Inference with the Power of the Crowd


Tuesday, 10 September 2013
14:30-15:30
Meeting Room (Sala Riunioni), Via Anzani 42 (3rd floor) - Como

Speaker: Valter Crescenzi, Researcher at Università Roma Tre

Scaling the extraction of data from web sources is still a challenging issue. Supervised approaches capable of producing highly accurate wrappers require expensive training data, i.e., annotations over a set of sample pages, and this cost limits their scalability. Crowdsourcing platforms offer an opportunity to make the manual annotation process more affordable, even at large scale. However, crowdsourcing suffers from some limitations: (i) the tasks submitted to these platforms should be extremely simple, so that non-expert workers can perform them; (ii) the number of tasks should be minimized, to contain costs; (iii) suitable strategies should be implemented to deal with the varying accuracy of workers and the possible presence of adversarial spammers. We introduce a framework to support a wrapper inference system supervised by the crowd. Our framework aims to seize the opportunities of crowdsourcing while overcoming its limitations: (i) the training data are labeled values generated by means of membership queries, the simplest form of query; (ii) the proposed inference algorithm minimizes the number of membership queries by means of an original active learning approach that chooses the expressiveness of the wrapper formalism at runtime and actively selects the queries; (iii) an original probabilistic model that takes the workers' mistakes into account is used to decide at runtime how many workers to engage.
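
To make point (iii) more concrete, the following is a minimal illustrative sketch, not the speaker's actual model. It assumes that each worker answers a yes/no membership query independently and errs with the same fixed probability, and it keeps engaging workers until the posterior belief is confident either way. The function names, the error rate of 0.2, and the confidence threshold of 0.95 are all hypothetical choices made for illustration.

```python
def posterior_yes(answers, error_rate=0.2, prior=0.5):
    """Posterior probability that the true answer is 'yes'.

    Assumes (for illustration only) independent workers who each
    give the wrong answer with probability `error_rate`.
    """
    p_yes = prior
    p_no = 1.0 - prior
    for answer in answers:
        if answer:
            p_yes *= 1.0 - error_rate
            p_no *= error_rate
        else:
            p_yes *= error_rate
            p_no *= 1.0 - error_rate
    return p_yes / (p_yes + p_no)


def ask_until_confident(worker_answers, error_rate=0.2,
                        threshold=0.95, max_workers=10):
    """Consume worker answers one at a time and stop as soon as the
    posterior is confident in either direction, or the (hypothetical)
    budget of `max_workers` is exhausted.

    Returns (decided_label, posterior_yes, workers_engaged).
    """
    collected = []
    p = 0.5  # belief before any worker has answered
    for answer in worker_answers:
        collected.append(answer)
        p = posterior_yes(collected, error_rate)
        confident = p >= threshold or p <= 1.0 - threshold
        if confident or len(collected) >= max_workers:
            break
    return p >= 0.5, p, len(collected)


if __name__ == "__main__":
    # Unanimous workers need fewer answers than a crowd that disagrees.
    print(ask_until_confident([True, True, True, True]))
    print(ask_until_confident([True, False, True, True, True]))
```

Under these assumptions, the number of workers engaged per query is not fixed in advance: agreement among workers ends the query early, while disagreement causes more workers to be consulted.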