Distributed Database System Lecture Note

Mind Map

Instructor’s name: Ali Safari
Textbook: Principles of Distributed Database Systems, 4th edition, M. Tamer Özsu and Patrick Valduriez, Springer, 2020, ISBN 978-3-030-26252-5

Distributed and Parallel Database Design

fragmentation

  • correctness

    • completness

      • each data in relation can also be found after fragmentation
    • reconstruction

      • by JOIN, the fragment can recovery to the original relation
    • disjointness

      • data in one fragment should not also be in other fragment
  • type

    • horizontal fragmentation (HF

      • primary horizontal (PHF

        • key points

          • simple prdicate

            • predicate: key + operator + value

              • eg. salary > 1000
          • minterm predicate

            • all possible combination of predicate

              • eg. loc = “France” ^ salary > 1000
          • minterm selectivities, sel(mi)

            • the percentage of records that minterm selected
          • access frequency, acc(qi)

            • how many times the same query asked by different user
          • cardinality, card(R)

            • number of rows
        • COM_MIN algorithm

          • input: a relation R, a set of simple predicates Pr

          • output: a “complete”, “minimal” set of simple predicates Pr'

        • PHORIZONTAL Algorithm

          • input: a relation R, a set of predicates Pr

          • output: a set of minterm predicates M according to which relation R is to be fragmented

      • derived horizontal (DHF

        • Based on the fragments created by PHF, apply similar fragmentation to other related relations.

        • eg. after PHF, we divide “PAY” to 2 fragments. There is also a relation “EMP” related with “PAY”. We can also divide “EMP” by the same rule

    • vertical fragmentation (VF

      • affinity matrix

        • calculate by access frequency matrix and usage matrix
      • bond energy algorithm (BEA

        • input: the AA matrix (attribute affinity)

        • output: the CA matrix(clustered affinity matrix)

        • by changing the order what’s the most contribution I can get?

        • find the best order for columns

    • hybrid fragmentation

      • apply both horizontal and vertical

      • reconstruction

        • vertical: join

        • horizontal: union

data distribution

  • allocation alternatives

    • non-replicated

      • each fragment resides at only one site
    • replicated

      • fully replicated

        • each fragment at each site
      • partially replicated

        • each fragment at some of the sites
      • if read-only queries » update queries, replication is good

  • fragment allocation

    • problem: fragments, network, application

    • find the optimal distribution

      • minimal cost

      • performance

      • constraint

        • response time

        • storage

        • processing

    • decision variable

      • Xij. 1 if fragment i store in at Site j. 0 otherwise
    • both FAP and DAP are NP-complete

      • heuristic based on. about finding the best combination

combined approach

transaction

all operations as one unit. whole or nothing

ACID

  • Atomicity

    • one unit
  • Consistency

  • Isolation

  • Durability

concurrent execution

  • increase processor and disk utilization (I/O no need CPU)

  • reduced average response time

  • important: multi tasks run in the but the result should be the same as serial running

  • validation

    • except read - read, all the others are conflict

    • try to move commands to see if they can be restored to the serial running format

    • if the commands are conflict, it should not be moved

serializability

  • view serializability

    • not strict

    • the initial, update, final result should be the same as serial schedule

    • check a schedule is serializable is NP-Complete problem

  • conflict serializability

    • more strict

      • conflict serializable is the sub set of serializable
    • there is no any conflict between transactions(R/W, W/W)

    • test method

      • swap non-conflicting instruction

        • If a schedule S can be transformed into a schedule S’ by a series of swaps of non-conflicting instructions, we say that S and S’ are conflict equivalent

        • a schedule S is conflict serializable if it is conflict equivalent to a serial schedule

      • precedence graph

        • transaction => node

        • conficts => edge

        • if graph has cycle, means not serializable

          • can do topological sorting

failure

  • rollbacks

    • cascading rollback

      • 1 transaction failure, all the other transactions rollback
  • recoverable schedule

    • ensure data consistency

    • reading transaction can read data which not commit yet, but cannot commit before the writing transaction

  • cascadeless schedules

    • enhace recoverable schedule

      • is the subset of recoverable
    • transaction can only read data which is commited

    • the schedule which try to avoid cascading rollbacks

concurrency control

concept

  • the mechanism provided by the db system

  • serial schedule is recoverable and cascadeless

  • have to trade off between serial schedule and concurrent schedule

  • ensure schedule is conflict or view serializable

  • ensure the schedule is rcoverable and preferably cascadeless

  • to achieve these purpose, it needs a “protocol” to assure serializability

protocols

  • lock-based protocols

    • exclusive (X) mode

      • cannot add any other lock

      • only one transaction can R/W data

    • shared (S) mode

      • can add more shared lock

      • multiple read transaction can read the data at the same time

    • two-phase locking protocol

      • grow-lockpoint-shrink

        • grow

          • the transaction acquire all the lock before access without release

          • can convert lock-S to lock-X (upgrade)

        • shrink

          • start to releasing locks, cannot acquire any new lock

          • can convert lock-X to lock-S (downgrade)

      • type

        • strict two-phase locking

          • keep all the X-lock till commit/abort
        • rigorous two-phase locking

          • keep all the locks till commit/abort
      • ensure conflict-serializable

      • cannot avoid deadlock

      • startegy

        • read: if lock: read() else: if lock-X: wait() grant lock-S read()

        • write: if lock-X: write() else: if other locks: wait() if lock-S: upgrade to lock-X else: grant lock-X write()

    • lock table

      • maintain by lock manager

      • lock table, record the type of lock granted or requested

      • like hash table

      • validate before grant new lock

        • if there are multiple locks, the last one can only be lock-X
  • Graph based protocol

    • alternative to two phase locking

    • Tree protocol

      • Only exclusive locks are allowed

      • once unlock, cannot relock

      • conflict serializable

      • not gurantee recoverability

      • no deadlock

deadlock prevention strategies

  • wait-die

    • older may wait for younger release

    • younger never wait, rolled back instead

  • wound-wait

    • older can force rollback younger

    • younger may wait for older

    • fewer rollback than wait-die

  • timeout-based schemes

deadlock detection

  • wait-for graph

    • Ti -> Tj: Ti is waiting for a lock held by Tj

    • deadlock if there is a cycle

deadlock recovery

  • total rollback

  • partial rollback

    • difficult

multiple granularity

  • can be represented as a tree

  • locks a node, also locks all the children node

  • Fine granularity: high concurrency, lower in tree

  • Coarse granularity: low concurrency, higher in tree

  • intention lock modes

    • 3 more lock mode than S, X

    • intention-shared (IS)

      • same as S, but locking at a lower level
    • intention-exclusive (IX)

    • shared and intention-exclusive (SIX)

    • allow a higher level node to be locked without having to check all descendent nodes

    • the compatibility matrix for all lock modes

Timestamp-based protocols

  • timestamp order = serializability order

  • Timestamp-ordering protocol

    • WTS(Q) (W-timestamp): the largest timestamp of any transaction that executed “write(Q)”

    • RTS(Q) (R-timestamp): the largest timestamp of any transaction that executed “read(Q)”

    • algorithm

      • Ti = Read(Q)

        • if TS(Ti) < WTS(Q), Reject (Ti needs the value that was already overwritten)

        • if TS(Ti) >= WTS(Q), execute, RTS(Q) update to max(RTS(Q), TS(Ti)) (Read after latest update is accepted)

      • Ti = Write(Q)

        • if TS(Ti) < RTS(Q), Reject (Ti produce a value that was needed previously)

        • if TS(Ti) < WTS(Q), Reject (Ti try to write an obsolete value)

        • else, WTS(Q) update to TS(Ti)

transaction processing-2

Distributed TM Architecture

serializability

  • the condition that global transaction is serializable

    • each local history should be serializable

    • two conflicting operations should be in the same relative order in all of the local histories where they appear together

concurrency control algorithms

  • Pessimistic

    • Two-Phase Locking-based (2PL)

      • centralized 2PL

        • only one 2PL scheduler in the distributed system

        • lock requests are issued to the central scheduler

        • pros: Simple

        • cons: reliability, bottle neck

      • distributed 2PL

Deadlock

  • locking-based algorithm may cause deadlocks

  • TO based algorithm that involve waiting may cause deadlocks

  • wait-for graph

    • Ti waits for Tj

    • Ti –> Tj

Query Processing

for one query, there may be several strategies

  • optimization: calculate the cost, then choose the lowest one

    • access cost

    • transfer cost

  • example

  • problem

  • Cost of Alternatives

Complexity of relational operations

  • Select Project

    • O(n)
  • Project (eliminate duplicate) Group

    • O(n * log n)

    • sorting + check the array sequentially

  • Join Semi-Join Division Set

    • O(n * log n)
  • Cartesian Product

    • O(n^2)

Query Processing Methodology

    1. Query Decomposition
    • input: Calculus query on global relations

    • Normalization

    • Analysis

    • Simplification

    • Restructing

    • output: Algebraic query

    1. Data Localization
    • input: Algebraic query on distributed relations

    • Localization program

    • Reduction based on the fragmentation strategy

      • PHF

        • Select

          • Because we have already divided the relations base on some rule. Only have to access the relations that have intersection with the query
        • Join

          • Distribute join over union

          • (R1 U R2)⋈S => (R1⋈S) U (R2⋈S)

          • by distribute 1 join to multiple join, we can eliminate some of them that have no intersection with the query

      • VF

        • find useless intermiediate relations
      • DHF

        • mix the PHF-Select and PHF-Join

        • example

        • query

          1. eliminate by Selection
          1. join over union
          1. eliminate the empty intermediate relations
      • Hybrid Fragmentation

        • remove empty relations by selection on HF

        • remove useless relations by projection on VF

        • distribute joins over unions to isolate and remove useless joins

    • output: Fragment query

    1. Distributed Query Optimization
    • input: Fragment query

    • Find the best global schedule

    • query optimization process

      • Search Space

        • The set of equivalent alebra expressions

        • Join Trees

          • Linear join tree

          • Bushy join tree

      • Cost Model

        • I/O cost + CPU cost + communication cost
      • Search Algorithm

        • exhaustive search / heuristic algorithm

        • how to “move” in the search space

        • Deterministic

          • start from base relations and build plans by adding one relation at each step

          • DP: BFS

          • Greedy: DFS

        • Randomized

          • trade optimization time for execution time

          • iterative improvement

使用 Hugo 建立
主題 StackJimmy 設計