Conventional databases are deterministic  An item is either in the database or not
 Database is a “complete world”
 A tuple either is in the query answer or is not
In many current applications, data are uncertain  Sensor networks
 Information extraction
 Moving objects tracking
 Data cleaning
“A tuple belongs to the database” is a probabilistic event
“A tuple is part of the query answer” is a probabilistic event
Data models, e.g., relational, XML, etc., need to be extended to capture uncertainty
New semantics for ranking and aggregation queries on uncertain data  Studying the interaction between scores and probabilities
 Integrating scoring and uncertainty dimensions into the same query definitions
Efficient support of probabilistic ranking and aggregation queries  Incremental and optimized query processing
 Unified framework for handling both ranking and uncertainty
 Finding the most probable answers
Intensional Semantics [Dalvi et al., VLDB’04]  Independent base tuples
 Relational processing induces correlations among intermediate results
We treat tuples as probabilistic events associated with membership probabilities  Later, we discuss modeling uncertain attribute values
Each tuple has a score computed using a query-specified scoring function
We assume a general uncertainty model that allows for computing the joint probability of an arbitrary combination of tuple events
Computing this probability is the only interface between the uncertainty model and our processing framework
Example Models  Independent Tuples [Dalvi et al., VLDB’04]
 Correlated Tuples with Simple Rules [Sarma et al., ICDE06]
 General Inference Model [Sen et al. ICDE07]
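As a minimal illustration of that interface, here is a sketch of the joint-probability computation under the simplest example model (independent tuples); the function name and data layout are illustrative, not the framework's actual API:

```python
def joint_probability(events, probs):
    """Joint probability of a combination of tuple events.

    events: dict tuple_id -> True (tuple appears) / False (tuple is absent)
    probs:  dict tuple_id -> membership probability
    Assumes independent base tuples; a correlated model (rule-based or a
    general inference model) would replace this single function.
    """
    p = 1.0
    for t, present in events.items():
        p *= probs[t] if present else (1.0 - probs[t])
    return p

# Example: probability that t1 and t2 appear while t3 does not
probs = {"t1": 0.9, "t2": 0.5, "t3": 0.4}
joint_probability({"t1": True, "t2": True, "t3": False}, probs)  # ≈ 0.27
```

Swapping in a different uncertainty model only changes this one function; the processing framework above it is unchanged.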
Tuple Access  Tuples are consumed incrementally, i.e., one by one, from the output of a query executed by the relational engine in the Tuple Access Layer
Group Access  We assume an interface that allows accessing tuples from specific groups incrementally
Available Dependency Information  Dependencies among query output tuples are only known when these tuples are consumed by the Probabilistic Ranking Layer
Group Cardinality Information  The Tuple Access Layer provides (bounds on) the size of each group
Event Probability Computation  We assume the Rule Engine responds with exact probabilities to the submitted questions (combinations of tuple events)
Top-l State  A top-l vector in one or more possible worlds
State Probability  The joint probability of involved tuple events
Search Goal  UTop-k: A state of length k with the maximum probability
Probability Reduction Property  Extending a state can only reduce its probability
State Comparability (Space Pruning)  Complete states prune partial ones
 With independence, partial states prune each other
Tuple Access Layer executes a score-based top-k query plan
Probabilistic Ranking Layer consumes tuples (when needed) in score order  Necessary to compute non-trivial bounds on state probabilities
 Other retrieval orders necessitate fully materializing the space
Our main cost metric is the number of consumed tuples n  Space size is in O(n^k)
 We adopt lazy space materialization to avoid space explosion
Query processing algorithms aim at  Minimizing the number of consumed tuples, and
 Minimizing the number of materialized space states
Optimality guarantees on the size of the materialized space and the number of consumed tuples  Based on the optimality of A* search
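The search described above can be sketched as a best-first (A*-style) search over partial top-k states under the independent-tuples model. This is a self-contained toy, with tuples assumed pre-sorted by score and consumed in that order, not the optimized lazy implementation:

```python
import heapq

def utopk(tuples, k):
    """Toy UTop-k search: find the length-k tuple vector with the maximum
    probability of being the top-k across possible worlds.

    tuples: list of (score, prob), sorted by score descending; tuple
    events are assumed independent. Because extending a state only
    multiplies its probability by values <= 1 (probability reduction),
    the first complete state popped from the max-heap is optimal.
    Returns (probability, vector of tuple indices).
    """
    # state: (-probability, index of next tuple to consume, partial vector)
    heap = [(-1.0, 0, ())]
    while heap:
        neg_p, i, vec = heapq.heappop(heap)
        p = -neg_p
        if len(vec) == k:
            return p, list(vec)        # complete state: optimal answer
        if i >= len(tuples):
            continue                   # ran out of tuples for this branch
        prob = tuples[i][1]
        # branch 1: tuple i appears and takes the next rank in the vector
        heapq.heappush(heap, (-(p * prob), i + 1, vec + (i,)))
        # branch 2: tuple i is absent in this world
        heapq.heappush(heap, (-(p * (1 - prob)), i + 1, vec))
    return 0.0, []

utopk([(100, 0.9), (90, 0.5), (80, 0.8)], 2)  # -> (0.45, [0, 1])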
Procedural definition  Expand possible worlds
 Group each world based on the grouping attributes
 Score each group in each world based on the group-aggregate function
 Rank worlds based on groups’ scores
 Aggregate the probabilities of top-ranked groups across worlds
 Report the most probable answers
Very expensive!
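To make the expense concrete, a brute-force sketch of this procedural definition (restricted to the top-1 group, with independent tuples and a sum aggregate; all names illustrative) has to enumerate all 2^n possible worlds:

```python
from itertools import product
from collections import defaultdict

def most_probable_top_group(tuples, agg=sum):
    """Naive procedural semantics: for every possible world, group the
    tuples, score each group with the aggregate, and credit the world's
    probability to the top-ranked group.

    tuples: list of (group, score, prob); independence assumed, and ties
    on the aggregate are broken arbitrarily. Exponential in len(tuples):
    for illustration of the definition only.
    Returns dict: group -> probability of being the top-scoring group.
    """
    answer = defaultdict(float)
    for world in product([False, True], repeat=len(tuples)):
        p = 1.0
        groups = defaultdict(list)
        for (g, s, pr), present in zip(tuples, world):
            p *= pr if present else (1 - pr)
            if present:
                groups[g].append(s)
        if not groups:
            continue
        best = max(groups, key=lambda g: agg(groups[g]))
        answer[best] += p
    return dict(answer)

most_probable_top_group([("A", 10, 0.5), ("B", 6, 1.0)])
# group A tops the worlds where its tuple appears (prob 0.5), B otherwise
```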
2-Level Space Model
Group World  A possible world restricted to a specific group
 Projects the space of possible worlds on grouping attributes
 Aggregating tuple scores in a group world gives a possible group aggregate value
Global World  A ranking of one or more group worlds, coming from different groups, based on their aggregate scores
 Represents a candidate query answer
Goal: Materialize the necessary group worlds, and combine them into global worlds, while looking for the global worlds with the highest probabilities
Intra-group search  Applying the aggregate function to group worlds defines a probability distribution over the possible aggregates of each group
 We are interested in exploring the aggregate space in score order to satisfy query’s ranking requirements
 Basic Idea:
 Incrementally scan group tuples in score order, and construct partitions of group aggregate space
 Each partition has a score range and a probability
 Partitions can be ordered if they have non-overlapping score ranges
Assume a group g = {t1, t2, …}
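The object the partitions discretize is the group's aggregate distribution. A brute-force sketch of that distribution for one group, assuming independent tuples (the real algorithm builds ordered partitions incrementally in score order rather than enumerating group worlds):

```python
from itertools import product
from collections import defaultdict

def group_aggregate_distribution(group_tuples, agg=sum):
    """Distribution over the possible aggregate values of one group.

    Each group world is one subset of the group's tuples; applying the
    aggregate to it yields one possible value, weighted by the world's
    probability. group_tuples: list of (score, prob); independence
    assumed; empty worlds contribute an aggregate of 0 here.
    """
    dist = defaultdict(float)
    for world in product([False, True], repeat=len(group_tuples)):
        p = 1.0
        scores = []
        for (s, pr), present in zip(group_tuples, world):
            p *= pr if present else (1 - pr)
            if present:
                scores.append(s)
        if p:
            dist[agg(scores) if scores else 0] += p
    return dict(dist)

group_aggregate_distribution([(10, 0.5), (5, 1.0)])
# two group worlds with non-zero probability: sum 5 and sum 15
```

Grouping adjacent aggregate values of this distribution into score ranges gives exactly the partitions described above, each with a score range and a probability.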
Inter-group search  Aggregate partitions coming from different groups are combined into an across-groups ranking
 Candidate answers are constructed incrementally based on probability
k-Nearest Neighbor (k-NN) Query reports the closest k objects to a query point q
Objects are points in an n-dimensional space
Many applications:  Pattern recognition
 Statistical classification
 Location Tracking
In real applications, objects have imprecise descriptions We propose a novel approach to efficiently answer NN queries over uncertain data
Locations of cell phones are probabilistically defined based on signal strength
Moving objects have:  Uncertain locations
 Uncertain membership
Example query: Locate the nearest witnesses/police cars to an accident site User’s location is anonymized to preserve privacy Example query: locate the closest gas stations to a given user
An Uncertain Object has:  A membership probability
 An uncertain attribute represented as a PDF defined on an uncertainty region
NN Probability  The probability of an object Oi to be the NN, Pnn(Oi, Q), is computed using nested integration over all objects
Top-k-PNN Query  Report the top-k probable NNs based on Pnn(Oi, Q)
Two cost factors:  I/O cost due to object retrieval
 CPU cost due to integral evaluation
Main Idea:  Incremental retrieval model
 Progressive (lazy) tightening of bounds computed on Pnn until the stopping criterion is met
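As a simple stand-in for the nested integration, Pnn can be estimated by Monte-Carlo sampling. The sketch below is one-dimensional with uniform uncertainty regions and certain membership (all assumptions mine); the actual framework instead computes bounds on Pnn and tightens them lazily:

```python
import random

def estimate_pnn(objects, q, samples=20000):
    """Monte-Carlo estimate of each object's probability of being the
    nearest neighbor of query point q.

    objects: list of (lo, hi) uniform 1-D uncertainty intervals. In each
    sample we draw one location per object and credit the closest one.
    """
    wins = [0] * len(objects)
    for _ in range(samples):
        locs = [random.uniform(lo, hi) for lo, hi in objects]
        nearest = min(range(len(objects)), key=lambda i: abs(locs[i] - q))
        wins[nearest] += 1
    return [w / samples for w in wins]

# Object 0's region is far closer to q = 0 than object 1's,
# so it is the NN with probability 1
estimate_pnn([(0.0, 1.0), (10.0, 11.0)], 0.0)
```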
A tuple’s score is uncertain  Data entry errors
 Privacy concerns
 Integrating heterogeneous data sources
 Presentation style
Uncertain scores induce a partial order
Multiple valid rankings conform to the partial order
Enumerating possible rankings is infeasible
Multiple challenges  Ranking Model
 Query Semantics
 Query Processing
A tuple’s score is a PDF defined on a real interval
Tuples’ scoring PDFs are independent
Probabilistic Partial Order Model  Non-intersecting intervals are totally ordered
 Intersecting intervals have a Probabilistic Dominance Relationship
A probabilistic process generates a probability distribution on the space of linear extensions
Record-Rank Queries
A UTop-Rank(i, j) query can be used to find the most probable athlete to end up in a range of ranks in some competition given a partial order of competitors
A UTop-Rank(1, k) query can be used to find the most likely location to be in the top-k hottest locations based on uncertain sensor readings represented as intervals
A UTop-Prefix query can be used in market analysis to find the most likely product ranking based on fuzzy evaluations in users’ reviews
A UTop-Set query can be used to find a set of products that are most likely to be ranked higher than all other products.
Monte-Carlo Integration  Transform the complex space of linear extensions to the space of all possible score combinations
 Can be easily sampled at random
 Sample from each [lo_i, up_i] a score value biased by f_i(x)
 Allows estimating the summation of the probabilities of linear extensions having tuple t in the rank range i … j
Monte-Carlo integration allows us to avoid computing complex multidimensional integrals with dependent limits
Monte-Carlo integration gives the expected value of the multidimensional integral
The accuracy depends only on the number of drawn samples  The huge number of possible linear extensions does not affect accuracy
 Error in O(1/√s), where s is the number of samples
We need to evaluate a linear number of integrals (one Monte-Carlo integral per tuple)
Given a sample size, the total cost is linear in the database size
Using a heap of size l, we can easily compute the l most probable answers on the fly
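The per-tuple Monte-Carlo integral can be sketched as follows, assuming uniform score PDFs on the intervals (names and the 1-based rank convention are illustrative):

```python
import random

def rank_range_probability(intervals, t, i, j, samples=20000):
    """Estimate Pr(tuple t ends up at a rank in [i, j]) by sampling one
    score per tuple from its interval.

    intervals: list of (lo, hi) per tuple, with uniform PDFs assumed and
    ranks 1-based (higher score = better rank). Each sample is one point
    in the space of score combinations, i.e., one linear extension.
    """
    hits = 0
    for _ in range(samples):
        scores = [random.uniform(lo, hi) for lo, hi in intervals]
        rank = 1 + sum(1 for s in scores if s > scores[t])
        if i <= rank <= j:
            hits += 1
    return hits / samples

# Non-overlapping intervals: tuple 0 is always at rank 1
rank_range_probability([(10, 11), (0, 1)], t=0, i=1, j=1)
```

Running this once per tuple and keeping a size-l heap of the largest estimates yields the l most probable Record-Rank answers in one linear pass.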
The answer space is exponential  We need to compute a nested integral for each possible permutation of k tuples, and report the most probable permutations
 Exact computation cannot scale with respect to k
Computing approximate answers using Markov Chain Monte-Carlo (MCMC) methods by sampling the space biased by prefix probability:  Start from an initial linear extension with prefix p0
 Apply a transformation to move to another linear extension with Prefix p1
 Accept the new linear extension biased by the probability of p1 with respect to p0
 Repeat from the transformation step
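The loop above, sketched with adjacent-transposition moves and Metropolis acceptance on prefix probabilities; `prefix_prob` and `valid` are hypothetical placeholders for the model's prefix-probability computation and partial-order check:

```python
import random

def mcmc_prefix_sampler(n, prefix_prob, valid, steps=1000, k=3, seed=0):
    """Sketch of an MCMC walk over linear extensions of n tuples.

    Move: swap two adjacent elements if the partial order allows it
    (checked by the illustrative callback 'valid'); accept the move
    biased by the ratio of the new and old length-k prefix probabilities
    ('prefix_prob', also illustrative). Returns the visited prefixes.
    """
    rng = random.Random(seed)
    current = list(range(n))             # initial linear extension
    visited = []
    for _ in range(steps):
        i = rng.randrange(n - 1)
        candidate = current[:]
        candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
        if valid(candidate):
            p0 = prefix_prob(current[:k])
            p1 = prefix_prob(candidate[:k])
            # Metropolis acceptance: move with probability min(1, p1/p0)
            if p0 == 0 or rng.random() < min(1.0, p1 / p0):
                current = candidate
        visited.append(tuple(current[:k]))
    return visited
```

Counting how often each prefix is visited approximates the prefix distribution; the l most frequently visited (most probable) prefixes are the approximate answers.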
To avoid biased estimation of the approximation error, we use multiple independent Markov chains that simulate the same prefix space independently
Using the Gelman-Rubin diagnostic to test the convergence of all chains to the true prefix distribution  The MCMC method converges to the stationary distribution with arbitrary state transitions, provided that each state has a non-zero probability of being visited and the chain is aperiodic
The l most probable visited states across all chains are the approximate query answers
Optimal rank aggregation under the Spearman footrule distance  Measures the distance between two rankings ωi and ωj as F(ωi, ωj) = ∑c |ωi(c) − ωj(c)|, where ωi(c) is the rank of element c in ωi
 For a set of rankings {ω1, …, ωm}, find the permutation ω* that minimizes (1/m) ∑i F(ωi, ω*)
 Immediate applications in metasearching
Polynomial-time algorithm [Dwork et al., 2001]  Bipartite weighted graph model
 The first vertex set is for elements
 The second vertex set is for ranks
 Connect each element to each rank with a weighted edge
 Given m different rankings {ω1, …, ωm}, the weight of the edge connecting element c to rank r is ∑i |ωi(c) − r|
 Find the minimum cost perfect matching of the graph
We view each linear extension (LE) as a voter, where we need to find the ranking that minimizes the disagreements among all LE’s  Main challenge: the number of voters is huge
Given λi(t) (the probability of each tuple t to appear at rank i), we compute the edge weights of the bipartite graph efficiently  Overall polynomial cost to compute the λi’s (using Monte-Carlo integration) and solve the graph matching problem
For each rank i and each tuple t, compute λi(t) using Monte-Carlo integration (similar to Record-Rank queries)
The edge weight w(t, r) ∝ ∑i λi(t)·|i − r|
Apply a polynomial graph matching algorithm
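A small end-to-end sketch of this aggregation step, with the λ values given directly rather than estimated, and a brute-force minimum-cost matching standing in for a polynomial (e.g., Hungarian) matching algorithm:

```python
from itertools import permutations

def footrule_aggregate(lam):
    """Optimal footrule aggregation from rank probabilities.

    lam[t][i] = probability that tuple t appears at rank i (0-based).
    Edge weight w(t, r) = sum_i lam[t][i] * |i - r|; the optimal ranking
    is the minimum-cost perfect matching of tuples to ranks. Brute force
    over permutations is shown here for clarity only.
    Returns a list: position r -> tuple assigned to rank r.
    """
    n = len(lam)
    w = [[sum(lam[t][i] * abs(i - r) for i in range(n)) for r in range(n)]
         for t in range(n)]
    # perm[t] = rank assigned to tuple t; pick the cheapest assignment
    best = min(permutations(range(n)),
               key=lambda perm: sum(w[t][perm[t]] for t in range(n)))
    ranking = [None] * n
    for t, r in enumerate(best):
        ranking[r] = t
    return ranking

# Tuple 0 is always at rank 0 and tuple 1 at rank 1, so the
# zero-cost matching recovers exactly that ranking
footrule_aggregate([[1.0, 0.0], [0.0, 1.0]])
```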
Optimal rank aggregation under Kendall’s tau distance  NP-Hard in general (reduction from minimum feedback arc set)
 We define a key property that allows identifying easier instances of the problem
 Weak Stochastic Transitivity:
 [Pr(x>y) ≥ 0.5 and Pr(y>z) ≥ 0.5] → [Pr(x>z) ≥ 0.5]
 A PPO with non-uniform densities:
 If Weak Stochastic Transitivity holds, then computing the optimal rank aggregation has a polynomial-time algorithm
 Else the problem is NP-Hard
 A PPO with uniform score densities
 The PPO is provably Weak Stochastic Transitive, and the optimal rank aggregation can be computed in O(n log n).
A PPO with uniform score densities  Theorem: [E(fi) ≥ E(fj)] ↔ [Pr(ti > tj) ≥ 0.5]
 The optimal rank aggregation ω*: [Pr(ti > tj) ≥ 0.5] → [ω*(ti) < ω*(tj)]
 Guaranteed to be a valid ranking (no cycles)
 Can be computed by sorting on expected scores
 Must belong to the set of LE’s
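For uniform densities given as score intervals, the theorem reduces the whole aggregation to a sort on expected scores; a minimal sketch (function name illustrative):

```python
def optimal_aggregation_uniform(intervals):
    """Optimal rank aggregation for a PPO with uniform score densities.

    intervals: list of (lo, hi) per tuple. By the theorem above,
    E(fi) >= E(fj) iff Pr(ti > tj) >= 0.5, so sorting on the expected
    score (the interval midpoint) yields the optimal aggregation in
    O(n log n). Returns tuple indices, best rank first.
    """
    expected = [(lo + hi) / 2 for lo, hi in intervals]
    return sorted(range(len(intervals)), key=lambda t: -expected[t])

# Midpoints are 1, 6, 2, so the optimal ranking is tuple 1, then 2, then 0
optimal_aggregation_uniform([(0, 2), (5, 7), (1, 3)])
```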
We build on top-k query processing optimizations to integrate scores and probabilities as two interacting ranking dimensions  Current top-k processing methods are not designed to treat probability as an additional ranking dimension
We analyze the implication of incorporating probability as an additional ranking dimension  Several potential interpretations of query results arise from this incorporation
Uncertainty models  The interpretation of tuple scores is challenging due to their interaction with probabilities
 Need for new semantics that integrate both scores and probabilities
Supported queries are mainly Boolean. Current techniques are insufficient to handle probabilistic top-k queries efficiently  Proposed algorithms are primarily tuple-based, while top-k queries may ask for tuple vectors
 Conventional scorebased ranking criteria are not considered in current algorithms
Top-k Query Processing in Uncertain Databases. M. A. Soliman, I. F. Ilyas, K. C.-C. Chang. ICDE’07
Finding Skyline and Top-k Bargaining Solutions (poster paper). M. A. Soliman, I. F. Ilyas, N. Koudas. ICDE’07
URank: Formulation and Efficient Evaluation of Top-k Queries in Uncertain Databases (demo paper). M. A. Soliman, I. F. Ilyas, K. C.-C. Chang. SIGMOD’07
Probabilistic Top-k and Ranking-Aggregate Queries. M. A. Soliman, I. F. Ilyas, K. C.-C. Chang. TODS’08
Efficient Search for the Top-k Probable Nearest Neighbors in Uncertain Databases. G. Beskales, M. A. Soliman, I. F. Ilyas. VLDB’08
Ranking with Uncertain Scores. M. A. Soliman, I. F. Ilyas. ICDE’09
Supporting Ranking Queries on Uncertain and Incomplete Data. M. A. Soliman, I. F. Ilyas, Shalev Ben-David. VLDBJ’10
MashRank: Towards Uncertainty-aware and Rank-aware Mashups (demo paper). M. A. Soliman, Mina Saleeb, I. F. Ilyas. ICDE’10
