Conventional databases are deterministic

    • An item either is in the database or not in the database
    • Database is a “complete world”
    • A tuple either is in the query answer or is not
  • In many current applications, data are uncertain

    • Sensor networks
    • Information extraction
    • Moving objects tracking
    • Data cleaning
  • “A tuple belongs to the database” is a probabilistic event

  • “A tuple is part of query answer” is a probabilistic event

  • Data models (e.g., relational, XML) need to be extended to capture uncertainty
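The possible-worlds semantics above can be made concrete. Below is a minimal Python sketch (the tuples and probabilities are made up) that enumerates the possible worlds of an independent-tuples table and recovers a tuple's membership probability as the total probability of the worlds containing it:

```python
from itertools import product

# A hypothetical independent-tuples table: (tuple id, membership probability).
tuples = [("t1", 0.9), ("t2", 0.6), ("t3", 0.3)]

def possible_worlds(tuples):
    """Enumerate all 2^n possible worlds with their probabilities."""
    worlds = []
    for bits in product([True, False], repeat=len(tuples)):
        prob, world = 1.0, []
        for (tid, p), present in zip(tuples, bits):
            prob *= p if present else 1 - p
            if present:
                world.append(tid)
        worlds.append((tuple(world), prob))
    return worlds

worlds = possible_worlds(tuples)
# "t2 belongs to the database" is a probabilistic event: its probability is
# the total probability of the worlds that contain t2.
pr_t2 = sum(p for w, p in worlds if "t2" in w)
```

A query answer gets its probability the same way: sum the probabilities of the worlds in which the tuple appears in the answer.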


New semantics for ranking and aggregation queries on uncertain data

    • Studying the interaction between scores and probabilities
    • Integrating scoring and uncertainty dimensions into the same query definitions
  • Efficient support of probabilistic ranking and aggregation queries

    • Incremental and optimized query processing
    • Unified framework for handling both ranking and uncertainty
    • Finding the most probable answers




Intensional Semantics [Dalvi et al., VLDB’04]

    • Independent base tuples
    • Relational processing induces correlations among intermediate results






We treat tuples as probabilistic events associated with membership probabilities

    • Later, we discuss modeling uncertain attribute values
  • Each tuple has a score computed using a query-specified scoring function

  • We assume a general uncertainty model that allows for computing the joint probability of an arbitrary combination of tuple events

  • Computing this probability is the only interface between the uncertainty model and our processing framework

  • Example Models

    • Independent Tuples [Dalvi et al., VLDB'04]
    • Correlated Tuples with Simple Rules [Sarma et al., ICDE'06]
    • General Inference Model [Sen et al., ICDE'07]
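Under the simplest of these models (independent tuples), the joint-probability interface reduces to a product. A small illustrative sketch, with hypothetical tuple probabilities:

```python
# Hypothetical membership probabilities under the independent-tuples model.
prob = {"t1": 0.9, "t2": 0.6, "t3": 0.3}

def joint_probability(present, absent):
    """P(all tuples in `present` are in the DB and none in `absent` is).

    Under independence this factorizes into a product; a correlated model
    would answer the same question through inference instead."""
    p = 1.0
    for t in present:
        p *= prob[t]
    for t in absent:
        p *= 1 - prob[t]
    return p

p = joint_probability({"t1"}, {"t2"})   # 0.9 * (1 - 0.6)
```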



Tuple Access

    • Tuples are consumed incrementally, i.e., one by one, from the output of a query executed by the relational engine in the Tuple Access Layer
  • Group Access

    • We assume an interface that allows accessing tuples from specific groups incrementally
  • Available Dependency Information

    • Dependencies among query output tuples are only known when these tuples are consumed by the Probabilistic Ranking Layer
  • Group Cardinality Information

    • The Tuple Access Layer provides (bounds on) the size of each group
  • Event Probability Computation

    • We assume the Rule Engine responds with exact probabilities to the submitted questions (combinations of tuple events)


Top-l State

    • Top-l vector in one or more possible worlds
  • State Probability

    • The joint probability of involved tuple events
  • Search Goal

    • UTop-k: A state of length k with the maximum probability
  • Probability Reduction Property

  • State Comparability (Space Pruning)

    • Complete states prune partial ones
    • With independence, partial states prune each other




Tuple Access layer executes a score-based top-k query plan

  • Probabilistic Ranking Layer consumes tuples (when needed) in score order

    • Necessary to compute non-trivial bounds on state probabilities
    • Other retrieval orders necessitate fully materializing the space
  • Our main cost metric is the number of consumed tuples n

    • Space size is in O(n^k)
    • We adopt lazy space materialization to avoid space explosion
  • Query processing algorithms aim at

    • Minimizing the number of consumed tuples, and
    • Minimizing the number of materialized space states
  • Optimality guarantees on the size of materialized space, and the number of consumed tuples

    • Based on optimality of A* Search
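For the independent-tuples case, the search over states can be sketched as a best-first (A*-style) procedure: each state fixes in/out decisions for a prefix of the score-ordered tuples, extending a state only multiplies its probability by a factor ≤ 1, so the first complete state popped from a max-heap is the most probable top-k vector. A simplified Python sketch on made-up data (not the authors' exact implementation):

```python
import heapq

# Hypothetical tuples already retrieved in score order (highest score first),
# each with a membership probability; independent-tuples model assumed.
tuples = [("t1", 0.3), ("t2", 0.9), ("t3", 0.8), ("t4", 0.6)]

def utop_k(tuples, k):
    """Best-first search for the most probable top-k vector (UTop-k)."""
    heap = [(-1.0, 0, ())]            # (-probability, tuples consumed, vector)
    while heap:
        neg_p, d, vec = heapq.heappop(heap)
        if len(vec) == k:
            return vec, -neg_p        # first complete state is optimal
        if d == len(tuples):
            continue                  # world has fewer than k tuples
        tid, p = tuples[d]
        # Branch on the next tuple being present or absent in the world.
        heapq.heappush(heap, (neg_p * p, d + 1, vec + (tid,)))
        heapq.heappush(heap, (neg_p * (1 - p), d + 1, vec))
    return (), 0.0

best, best_p = utop_k(tuples, 2)
```

States are materialized lazily: only the states the heap actually pops are ever expanded, which mirrors the lazy space materialization described above.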








Procedural definition

    • Expand possible worlds
    • Group each world based on the grouping attributes
    • Score each group in each world based on group-aggregate function
    • Rank worlds based on groups’ scores
    • Aggregate the probabilities of top-ranked groups across worlds
    • Report the most probable answers
  • Very expensive!
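To see why, here is a naive Python sketch of the procedural definition on a toy table (hypothetical data, SUM as the group aggregate): it expands all 2^n worlds, so it cannot scale.

```python
from collections import defaultdict
from itertools import product

# Hypothetical uncertain table: (group, score, membership probability).
rows = [("A", 10, 0.6), ("A", 5, 0.8), ("B", 12, 0.5)]

def most_probable_top_group(rows):
    """Literal reading of the procedural definition: expand all worlds,
    aggregate each group per world, rank groups by aggregate, then add up
    each group's probability of ranking first across worlds."""
    top_prob = defaultdict(float)
    for bits in product([True, False], repeat=len(rows)):
        p, groups = 1.0, defaultdict(int)
        for (g, score, rp), present in zip(rows, bits):
            p *= rp if present else 1 - rp
            if present:
                groups[g] += score           # SUM aggregate
        if groups:                           # skip the empty world
            winner = max(groups, key=groups.get)
            top_prob[winner] += p
    return max(top_prob.items(), key=lambda kv: kv[1])

group, prob = most_probable_top_group(rows)
```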



2-Level Space Model

  • Group World

    • A possible world restricted to a specific group
    • Projects the space of possible worlds on grouping attributes
    • Aggregating tuple scores in a group world gives a possible group aggregate value
  • Global World

    • A ranking of one or more group worlds, coming from different groups, based on their aggregate scores
    • Represents a candidate query answer
  • Goal: Materialize the necessary group worlds, and combine them into global worlds, while looking for the global worlds with the highest probabilities





Intra-group search

    • Applying the aggregate function to group worlds defines a probability distribution over the possible aggregates of each group
    • We are interested in exploring the aggregate space in score order to satisfy query’s ranking requirements
    • Basic Idea:
      • Incrementally scan group tuples in score order, and construct partitions of group aggregate space
      • Each partition has a score range and a probability
      • Partitions can be ordered if they have non-overlapping score ranges


Assume a group g = {t1, t2, …}



Inter-group search

    • Aggregate partitions coming from different groups are combined into across-groups ranking
    • Candidate answers are constructed incrementally based on probability






k-Nearest Neighbor Query reports the closest k objects to a query point q

  • Objects are points in an n-dimensional space

  • Many applications:

    • Pattern recognition
    • Statistical classification
    • Location Tracking
  • In real applications, objects have imprecise descriptions

  • We propose a novel approach to efficiently answer NN queries over uncertain data



Locations of cell phones are probabilistically defined based on signal strength

  • Moving objects have:

    • Uncertain locations
    • Uncertain membership
  • Example query: Locate the nearest witnesses/police cars to an accident site

  • User’s location is anonymized to preserve privacy

  • Example query: locate the closest gas stations to a given user



An Uncertain Object has:

    • A membership probability
    • An uncertain attribute represented as a PDF defined on an uncertainty region
  • NN Probability

  • The probability of an object Oi being the NN, Pnn(Oi, Q), is computed using nested integration over all objects

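The nested integration can be approximated by Monte-Carlo sampling. The sketch below (hypothetical 1-D objects with uniform uncertainty regions) samples both membership and location in each world and counts how often a target object is the nearest neighbor of q:

```python
import random

# Hypothetical uncertain objects in 1-D: membership probability plus a uniform
# location PDF over an uncertainty region [lo, hi]; q is the query point.
objects = {"O1": (0.9, (0.0, 2.0)), "O2": (0.7, (1.0, 3.0))}
q = 0.5

def pnn(objects, q, target, samples=20_000, seed=7):
    """Monte-Carlo estimate of Pnn(target, q): in each sampled world, draw
    membership and location for every object and check who is nearest to q."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = {}
        for oid, (mp, (lo, hi)) in objects.items():
            if rng.random() < mp:                  # uncertain membership
                world[oid] = rng.uniform(lo, hi)   # uncertain location
        if world and min(world, key=lambda o: abs(world[o] - q)) == target:
            hits += 1
    return hits / samples

p1 = pnn(objects, q, "O1")   # analytically ≈ 0.82 for this toy instance
```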


Topk-PNN Query

    • Report the top-k probable NNs based on Pnn(Oi,Q)
  • Two cost factors:

    • I/O cost due to object retrieval
    • CPU cost due to integral evaluation
  • Main Idea:

    • Incremental retrieval model
    • Progressive (lazy) tightening of bounds computed on Pnn until a stopping criterion is met








A tuple’s score is uncertain

    • Data entry errors
    • Privacy concerns
    • Integrating heterogeneous data sources
    • Presentation style


  • Uncertain scores induce a partial order

  • Multiple valid rankings conform with the partial order

  • Enumerating possible rankings is infeasible

  • Multiple challenges

    • Ranking Model
    • Query Semantics
    • Query Processing


A tuple’s score is a PDF defined on a real interval

  • Tuples’ scoring PDFs are independent

  • Probabilistic Partial Order Model

    • Non-intersecting intervals are totally ordered
    • Intersecting intervals have a Probabilistic Dominance Relationship
  • A probabilistic process generates a probability distribution on the space of linear extensions



Record-Rank Queries

    • UTop-Rank(i,j)


A UTop-Rank(i, j) query can be used to find the most probable athlete to end up in a range of ranks in some competition given a partial order of competitors

  • A UTop-Rank(1, k) query can be used to find the most-likely location to be in the top-k hottest locations based on uncertain sensor readings represented as intervals

  • A UTop-Prefix query can be used in market analysis to find the most-likely product ranking based on fuzzy evaluations in users’ reviews

  • A UTop-Set query can be used to find a set of products that are most-likely to be ranked higher than all other products.



Monte-Carlo Integration

    • Transform the complex space of linear extensions to the space of all possible score combinations
      • Can be easily sampled at random
        • Sample from each interval [loi, upi] a score value biased by the density fi(x)
      • Allows estimating the summation of the probabilities of linear extensions having tuple t in the rank range i … j


Monte-Carlo integration allows us to avoid computing complex multidimensional integrals with dependent limits

  • Monte-Carlo integration gives the expected value of the multidimensional integral

  • The accuracy depends only on the number of drawn samples

    • The huge size of possible linear extensions does not affect accuracy
    • Error is in O(1/√s), where s is the number of samples
  • We need to evaluate a linear number of integrals (one Monte-Carlo integral per tuple)

  • Given a sample size, the total cost is linear in database size

  • Using a heap of size l, we can easily compute the l most probable answers on the fly
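The whole scheme fits in a few lines for uniform score PDFs. The sketch below (made-up intervals) estimates, for every tuple, the probability of landing in a rank range [i, j], i.e., a UTop-Rank(i, j) answer:

```python
import random

# Hypothetical uncertain scores: tuple id -> uniform PDF over [lo, up].
scores = {"a": (7.0, 9.0), "b": (5.0, 8.0), "c": (4.0, 6.0)}

def utop_rank(scores, i, j, samples=20_000, seed=1):
    """Estimate Pr(tuple ranks within [i, j]) for every tuple by sampling one
    score per interval and ranking the sample by descending score."""
    rng = random.Random(seed)
    hits = dict.fromkeys(scores, 0)
    for _ in range(samples):
        drawn = {t: rng.uniform(lo, up) for t, (lo, up) in scores.items()}
        ranking = sorted(drawn, key=drawn.get, reverse=True)
        for r in range(i - 1, min(j, len(ranking))):
            hits[ranking[r]] += 1
    return {t: h / samples for t, h in hits.items()}

p_top1 = utop_rank(scores, 1, 1)   # UTop-Rank(1,1): Pr(rank 1) per tuple
```

Each tuple costs one Monte-Carlo integral, so the total cost stays linear in the database size for a fixed sample size, and a size-l heap over these estimates yields the l most probable answers.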



The answer space is exponential

    • We need to compute a nested integral for each possible permutation of k tuples, and report the most probable permutations
    • Exact computation cannot scale with respect to k
  • Computing approximate answers using Markov Chain Monte-Carlo (MCMC) methods by sampling the space biased by prefix probability:

    1. Start from an initial linear extension with prefix p0
    2. Apply a transformation to move to another linear extension with prefix p1
    3. Accept the new linear extension biased by the probability of p1 relative to p0
    4. Repeat from (2)
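A minimal Metropolis-style sketch of this loop is shown below. For tractability it uses a tiny PPO with discrete score distributions (hypothetical data), so prefix probabilities can be computed exactly by enumeration; in the real setting those probabilities would themselves be estimated.

```python
import random
from functools import lru_cache
from itertools import product

# Tiny hypothetical PPO: each tuple's score has a small discrete distribution.
score_dist = {"a": {3.0: 0.7, 1.0: 0.3},
              "b": {2.0: 0.6, 0.8: 0.4},
              "c": {2.5: 0.6, 0.5: 0.4}}

@lru_cache(maxsize=None)
def prefix_prob(prefix):
    """Exact Pr(a random linear extension starts with `prefix`, in order),
    by enumerating all score combinations (feasible only at this toy scale)."""
    total, ids = 0.0, sorted(score_dist)
    for combo in product(*[list(score_dist[t].items()) for t in ids]):
        drawn = {t: s for t, (s, _) in zip(ids, combo)}
        p = 1.0
        for _, pr in combo:
            p *= pr
        ranking = sorted(drawn, key=drawn.get, reverse=True)
        if tuple(ranking[:len(prefix)]) == prefix:
            total += p
    return total

def mcmc_utop_prefix(k, steps=100_000, seed=3):
    """Metropolis sampler over linear extensions, biased by prefix probability:
    propose an adjacent swap, accept with probability min(1, p1/p0)."""
    rng = random.Random(seed)
    state = ["a", "b", "c"]                           # initial linear extension
    w = prefix_prob(tuple(state[:k]))
    visits = {}
    for _ in range(steps):
        cand = state[:]
        i = rng.randrange(len(cand) - 1)
        cand[i], cand[i + 1] = cand[i + 1], cand[i]   # adjacent transposition
        w_cand = prefix_prob(tuple(cand[:k]))
        if rng.random() < min(1.0, w_cand / w):       # w stays > 0 throughout
            state, w = cand, w_cand
        key = tuple(state[:k])
        visits[key] = visits.get(key, 0) + 1
    return {p: c / steps for p, c in visits.items()}

freqs = mcmc_utop_prefix(k=2)
# The most frequently visited prefixes approximate the UTop-Prefix answers.
```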






To avoid biased estimation of approximation error, we use multiple independent Markov chains that simulate the same prefix space independently

  • Using the Gelman-Rubin diagnostic to test the convergence of all chains to the true prefix distribution

    • The MCMC method converges to the stationary distribution, with arbitrary state transitions, provided that each state has a non-zero probability of being visited (irreducibility) and the chain is aperiodic
  • The l most probable visited states across all chains are the approximate query answers



Optimal rank aggregation under Spearman Footrule distance

    • Measures the distance between two rankings ωi and ωj as F(ωi,ωj) = ∑c |ωi(c) - ωj(c)|, where ωi(c) is the rank of element c in ωi
    • For a set of rankings {ω1,..,ωm}, find the permutation ω* that minimizes 1/m ∑i F(ωi,ω*)
    • Immediate applications in meta-searching


Polynomial time algorithm [Dwork et al., 2001]

    • Bipartite weighted graph model
      • The first vertex set is for elements
      • The second vertex set is for ranks
      • Connect each element to each rank with a weighted edge
    • Given m rankings {ω1, …, ωm}, the weight of the edge connecting element c to rank r is ∑i |ωi(c) - r|
    • Find the minimum cost perfect matching of the graph
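A compact sketch of this construction (with three made-up input rankings): the edge weights follow the formula above, and the perfect matching is brute-forced since n is tiny (Dwork et al. solve it with a polynomial matching algorithm):

```python
from itertools import permutations

# Three hypothetical input rankings over elements x, y, z (rank 1 is best).
rankings = [["x", "y", "z"], ["y", "x", "z"], ["x", "z", "y"]]

def footrule_aggregate(rankings):
    """Bipartite formulation of footrule aggregation: the cost of assigning
    element c to rank r is sum_i |rank_i(c) - r|; pick the assignment of
    elements to ranks (a perfect matching) with minimum total cost."""
    elems = sorted(rankings[0])
    pos = [{c: w.index(c) + 1 for c in w} for w in rankings]
    def weight(c, r):
        return sum(abs(p[c] - r) for p in pos)
    best, best_cost = None, float("inf")
    for perm in permutations(elems):          # brute-force matching, tiny n
        cost = sum(weight(c, r + 1) for r, c in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

agg, cost = footrule_aggregate(rankings)
```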


We view each linear extension (LE) as a voter, and we need to find the ranking that minimizes the disagreements among all LEs

    • Main challenge: the number of voters is huge.
  • Given λi(t) (the probability of each tuple t to appear at rank i), we compute the edge weights of the bipartite graph efficiently

    • Overall polynomial cost to compute λi’s (using Monte-Carlo integration) and solve the graph matching problem


For each rank i and each tuple t, compute λi(t) using Monte-Carlo integration (similar to Record-Rank queries)

  • The edge weight w(t,r) ∝ ∑i λi(t)·|i - r|

  • Apply polynomial graph matching algorithm



Optimal rank aggregation under Kendall’s tau distance

    • NP-Hard in general (reduction from minimum feedback arc set)
    • We define a key property that allows identifying easier instances of the problem
      • Weak Stochastic Transitivity: [Pr(x>y) ≥ 0.5 and Pr(y>z) ≥ 0.5] → [Pr(x>z) ≥ 0.5]
    • A PPO with non-uniform densities:
      • If weakly stochastic transitive, the optimal rank aggregation can be computed in polynomial time
      • Otherwise, the problem is NP-Hard
    • A PPO with uniform score densities:
      • The PPO is provably weakly stochastic transitive, and the optimal rank aggregation can be computed in O(n log n)


A PPO with uniform score densities

    • Theorem: [E(fi) ≥ E(fj)] ↔ [Pr(ti > tj) ≥ 0.5]
    • The optimal rank aggregation ω*: [Pr(ti > tj) ≥ 0.5] →[ω*(ti) < ω*(tj)]
      • Guaranteed to be a valid ranking (no cycles)
      • Can be computed by sorting on expected scores
      • Must belong to the set of linear extensions
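The O(n log n) claim is then just a sort. A minimal sketch with hypothetical uniform intervals, ranking by expected (midpoint) scores:

```python
# Hypothetical uniform score intervals [lo, up] per tuple.
intervals = {"t1": (3.0, 9.0), "t2": (4.0, 6.0), "t3": (1.0, 5.0)}

def optimal_aggregation(intervals):
    """With uniform densities, Pr(ti > tj) >= 0.5 iff E(fi) >= E(fj), so the
    optimal rank aggregation is a sort on expected (midpoint) scores."""
    return sorted(intervals,
                  key=lambda t: -(intervals[t][0] + intervals[t][1]) / 2)

ranking = optimal_aggregation(intervals)
```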


We build on top-k query processing optimizations to integrate scores and probabilities as two interacting ranking dimensions

    • Current top-k processing methods are not designed to treat probability as an additional ranking dimension
  • We analyze the implication of incorporating probability as an additional ranking dimension

    • Several potential interpretations of query results arise from this incorporation


Uncertainty models

    • The interpretation of tuple scores is challenging due to their interaction with probabilities
    • Need for new semantics that integrate both scores and probabilities
  • Supported queries are mainly Boolean. Current techniques are insufficient to handle probabilistic top-k queries efficiently

    • Proposed algorithms are primarily tuple-based, while top-k queries may ask for tuple vectors
    • Conventional score-based ranking criteria are not considered in current algorithms


Top-k Query Processing in Uncertain Databases M. A. Soliman, I. F. Ilyas, K. C-C. Chang, ICDE’07

  • Finding Skyline and Top-k Bargaining Solutions (poster paper) M. A. Soliman, I. F. Ilyas, N. Koudas, ICDE’07

  • URank: Formulation and Efficient Evaluation of Top-k Queries in Uncertain Databases (demo paper) M. A. Soliman, I. F. Ilyas, K. C-C. Chang, SIGMOD’07 

  • Probabilistic Top-k and Ranking-Aggregate Queries M. A. Soliman, I. F. Ilyas, K. C-C. Chang, TODS’08

  • Efficient Search for the Top-k Probable Nearest Neighbors in Uncertain Databases G. Beskales, M. A. Soliman, I. F. Ilyas, VLDB’08

  • Ranking with Uncertain Scores M. A. Soliman, I. F. Ilyas, ICDE’09

  • Supporting Ranking Queries on Uncertain and Incomplete Data M. A. Soliman, I. F. Ilyas, Shalev Ben-David, VLDBJ’10

  • MashRank: Towards Uncertainty-aware and Rank-Aware Mashups (demo paper) M. A. Soliman, Mina Saleeb, I. F. Ilyas, ICDE’10


