Talk title: Data Glitches = Constraint Violations – Empirical Explanations
Speaker: Divesh Srivastava, ACM Fellow, the head of Database Research,
Host: Professor Xuemin Lin
Data glitches are unusual observations that do not conform to data quality expectations, be they semantic or syntactic, logical or statistical. Naively applying integrity constraints can flag potentially large amounts of data as violations. Ignoring or repairing significant amounts of the data could fundamentally bias the results and conclusions drawn from analyses. In the context of Big Data, where large volumes and varieties of data from disparate sources are integrated, it is likely that significant portions of these violations are actually legitimate, usable data. We conjecture that empirical glitch explanations, concise characterizations of subsets of violating data, could be used to (a) identify legitimate data and release them back into the pool of clean data, thereby reducing cleaning-related statistical distortion of the data; and (b) refine existing integrity constraints and generate improved domain knowledge. We present a few real-world case studies in support of our conjecture, outline scalable techniques to address the challenges of discovering explanations, and demonstrate the utility of the explanations in reclaiming over 99% of the violating data.
Divesh Srivastava is the head of Database Research at AT&T Labs-Research. He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). His research interests and publications span a variety of topics in data management. He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.