Friday, November 5, 2010

Locking Schemes for Replicated Data Updating

Data distribution is commonly used in high-performance computing (HPC). There are two fundamental data distribution topologies: one is replication; the other is partitioning.

With data partitioning, you can process a large amount of data in parallel.
With data replication, you can achieve load balancing and high availability (HA).

Even though a data item has several replicas in a data replication environment, it should have some degree of transparency and appear as a single virtual global item to end users.
The biggest challenge in using data replication is finding the proper trade-off between data coherence and performance based on your business requirements.

Some kind of locking scheme is usually employed to achieve data coherence.
I will list some replication and locking schemes I have experienced using Oracle 10g Advanced Replication, Oracle 10g RAC, Oracle TimesTen and Gigaspaces XAP 7.1.

Before we go into details, let's suppose you have a distributed airline ticketing system (DATS hereafter) which has two databases: one in NY and the other in LA. Depending on your replication scheme, data can be updated either at one site only and replicated to the other, or at both sites and replicated to each other.
Further suppose the following actions take place in time sequence:
  1. There is only one ticket left in the database. Because both local database replicas are synchronized at this time, the only ticket can be taken by either NY or LA;
  2. A NY customer buys this ticket. This action is recorded in the local NY database and will be replicated to LA in a way that depends on the replication scheme;
  3. Depending on your replication scheme, the LA database can show the same ticket either as still available or as already taken by the NY customer. If the same ticket still appears available in the LA database, it will be sold to an LA customer. This will create an oversold situation.

1. Synchronous Replication Using Distributed Locks and Local Transactions

Oracle RAC (formerly OPS in version 8i and prior) allows the same physical database to be accessed at multiple instance sites. In order to allow users to read and write any data at any time at any instance site, Oracle RAC uses "Cache Fusion" to ensure data coherence.

"Cache Fusion" basically uses synchronous replication with a distributed locking manager (DLM). DLM acts as a distributed lock (DL) coordinator among other functions so that the same resource - e.g. a table row - can only be granted to one instance site for changing at a time while other sites have to wait.

The DLM also acts as a global resource directory. For example, when instance site 1 updates a row, it doesn't need to actively push the new version of the data to all other instance sites. When instance site 2 later requests the same row, the DLM can tell it to get the latest version from instance site 1.
Also, instance site 1 doesn't need to use any distributed transaction, thanks to the DLM and the fact that there is still only one physical database (so far I haven't seen any synchronous replication that uses both distributed locks and distributed transactions).
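As a rough illustration of the two DLM roles described above (lock coordinator plus global resource directory), here is a minimal Python sketch. All class and method names are hypothetical, not Oracle's actual implementation:

```python
# Sketch of a DLM-style lock coordinator and resource directory.
# Names (DlmSketch, acquire, locate_latest, ...) are illustrative only.

class DlmSketch:
    def __init__(self):
        self.owner = {}        # resource -> instance currently holding the lock
        self.latest_copy = {}  # resource -> instance holding the newest version

    def acquire(self, resource, instance):
        """Grant the lock only if no other instance holds it."""
        holder = self.owner.get(resource)
        if holder is not None and holder != instance:
            return False               # caller must wait for the holder
        self.owner[resource] = instance
        return True

    def release(self, resource, instance):
        if self.owner.get(resource) == instance:
            del self.owner[resource]

    def record_update(self, resource, instance):
        """Directory role: remember which instance has the latest version."""
        self.latest_copy[resource] = instance

    def locate_latest(self, resource):
        """Tell a requester where to fetch the newest version from."""
        return self.latest_copy.get(resource)
```

Note how site 1 never pushes its new row version anywhere: it only records itself in the directory, and a later requester is redirected to it.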

Benefits include a very high degree of data coherence and load balancing for both reads and writes.
Drawbacks include poor write performance and the requirement of a high-speed interconnect due to distributed locks.
Distributed locks usually consist of quite a few daemon processes and data structures at each site, whose coordination performs poorly over a low-speed interconnect such as a LAN or WAN. For Oracle "Cache Fusion", distributed locks are implemented with the Global Cache Service (GCS), Global Enqueue Service (GES) and Global Resource Directory (GRD).

If we apply this scheme to DATS, assuming the poor interconnect performance is tolerable, step 3 has to wait for the DL to be released by step 2. When step 3 gets the DL, the same ticket will show as already taken by step 2.
(In order to have good performance, most multi-tiered applications use optimistic locking, which can create the lost update problem. For example, if we use optimistic locking in both DATS databases, the application tier in step 3 can first read the LA database before step 2 and then sell the same ticket to an LA customer after step 2.
The application must use "optimistic locking with version checking" to fix this issue. One form of version checking is just a version number that increases whenever there is any corresponding data change.
Suppose the version is 0 at step 1. Step 2 updates it to 1. The version read by the application tier at step 3 is also 0. But when the application tier tries to sell the same ticket, it will fail because it will find the version has changed to 1 from its cached value of 0.
All our following arguments assume "optimistic locking with version checking".)
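A minimal sketch of "optimistic locking with version checking" for the DATS ticket, using a hypothetical in-memory row; a real application would issue something like UPDATE ... SET available = 0, version = version + 1 WHERE id = :id AND version = :cached_version instead:

```python
# Sketch of optimistic locking with version checking for the DATS example.
# The in-memory "database" and function names are illustrative only.

tickets = {"T1": {"available": True, "version": 0}}

def read_ticket(ticket_id):
    """Return the availability flag plus the version the caller should cache."""
    row = tickets[ticket_id]
    return row["available"], row["version"]

def sell_ticket(ticket_id, cached_version):
    """Succeed only if the row version is unchanged since it was read."""
    row = tickets[ticket_id]
    if row["version"] != cached_version or not row["available"]:
        return False  # someone else changed the row: retry or report sold out
    row["available"] = False
    row["version"] += 1
    return True
```

Replaying the scenario: both NY and LA read version 0; NY's sale bumps the version to 1, so LA's attempt with its cached version 0 fails instead of overselling.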

2. Synchronous Replication Using Local Locks and Distributed Transactions
Oracle's multimaster replication (also called peer-to-peer or n-way replication) has two data coherence protocols.
One of them is synchronous replication, which applies any changes or executes any replicated procedures at all sites participating in the replication environment as part of a single distributed transaction. If the DML statement or procedure fails at any site, the entire transaction rolls back.

The distributed transaction ensures data consistency at all sites in real time. However, it doesn't use any distributed locking; instead it uses only local locks in the participating local transactions.
This is what happens when an application performs a synchronous update to a replicated table: Oracle first locks the local row and then uses an after-row trigger to lock the corresponding remote row. Oracle releases the locks when the transaction commits at each site.
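The lock-local-then-lock-remote flow can be sketched as follows; the Site class and synchronous_update function are illustrative stand-ins, not Oracle internals:

```python
# Sketch of synchronous multimaster replication: lock the local row, then the
# corresponding remote row, and apply the change at both sites or at neither.
# All names are illustrative, not Oracle's implementation.

class Site:
    def __init__(self, name):
        self.name = name
        self.rows = {}      # row id -> value
        self.locks = set()  # row ids currently locked at this site

    def lock(self, row_id):
        if row_id in self.locks:
            raise RuntimeError(f"{row_id} already locked at {self.name}")
        self.locks.add(row_id)

    def unlock(self, row_id):
        self.locks.discard(row_id)

def synchronous_update(local, remote, row_id, value):
    """Apply the change at both sites as one all-or-nothing unit."""
    local.lock(row_id)              # local lock first
    try:
        remote.lock(row_id)         # then the remote lock (trigger-style);
        try:                        # if it is busy, the whole update fails
            local.rows[row_id] = value
            remote.rows[row_id] = value
        finally:
            remote.unlock(row_id)
    finally:
        local.unlock(row_id)
```

If the remote lock cannot be taken, nothing has been written yet, so "rolling back" is trivial here; the sketch also hints at the deadlock risk noted below, since two sites locking in opposite orders can block each other.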

Benefits include a high degree of data coherence, simple implementation, easy administration, and a fit for both high-speed interconnects and low-speed LANs and WANs (implementing distributed locks over a low-speed LAN or WAN is much harder than using local locks in such environments).
Drawbacks include possible deadlocks due to the temporal behavior of local and remote locking, high availability requirements on the network, and poor write performance.

If we apply this scheme to DATS, assuming the poor interconnect performance is tolerable, step 3 has to wait for step 2 to release the local and remote locks. When step 3 gets the locks, the same ticket will show as already taken by step 2.

3. Synchronous Replication Using Local Locks and Local Transactions
TimesTen's unidirectional active-standby pair configuration only uses the so-called "return twosafe replication". It provides fully synchronous replication between the master (the active site) and the subscriber (the standby site).
No distributed transaction or distributed lock is involved; only local transactions and local locks are used. Specifically, the local transaction on the subscriber is committed before the one on the master. If the subscriber can't commit, the master will not commit either.
At any time, only the active site can be updated, which greatly simplifies data updating (otherwise local locks and local transactions would be insufficient) and ensures fast fail-over to the standby site in case of an active site failure.
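The subscriber-commits-first ordering can be sketched like this, with hypothetical Replica and twosafe_update names rather than TimesTen's real API:

```python
# Sketch of "return twosafe" commit ordering: the standby (subscriber)
# commits first; the active site (master) commits only if the standby
# succeeded. All names are illustrative.

class Replica:
    def __init__(self, name, fail_commit=False):
        self.name = name
        self.data = {}
        self.fail_commit = fail_commit  # simulate a commit failure

    def commit(self, key, value):
        if self.fail_commit:
            return False
        self.data[key] = value
        return True

def twosafe_update(active, standby, key, value):
    """Commit on the standby first, then on the active site."""
    if not standby.commit(key, value):
        return False            # standby failed: the active site never commits
    return active.commit(key, value)
```

The sketch also shows the residual inconsistency discussed below: if the active site's own commit fails after the standby has committed, the two copies diverge.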

This scheme has similar benefits and drawbacks to the previous scheme in section 2.
However, its performance is even better thanks to avoiding the two-phase commit (2PC) required by a distributed transaction. It also eliminates deadlocks because only the active site allows updates.
Although the standby site seems to be a waste of capacity, you can collocate it with another active site as shown in figure 1 (by "another" I mean it holds different data from the collocated standby site).
This scheme has a lower degree of data coherence due to the inconsistency that results when the master fails to commit even though the subscriber has successfully committed (the root cause is that it doesn't use distributed transactions, as you can guess; but you should also know that data inconsistency can still result from a failure in the second "commit" phase of a 2PC process).

TimesTen's practice in this scenario is consistent with its configurations for high performance in other areas, such as asynchronous logging and data caching with a write-behind policy.

Gigaspaces IMDG has a very similar topology called primary-backup replication. The only difference is that it uses distributed transactions instead of only local transactions, so it has a higher degree of data coherence than TimesTen.
Another advantage is that fail-over in Gigaspaces IMDG happens transparently to end users, while TimesTen customers need to resort to a third-party cluster manager or some custom software.

If we apply this scheme to DATS, either the NY or the LA site will be the active site, and the other (standby) site has to connect to the active site for data updating (in reality, active-standby is often used with a high-speed interconnect). The local lock at the active site prevents the oversold situation.

This scheme, along with data partitioning as shown in figure 1, is strongly recommended over the two previous synchronous schemes if it can meet your business requirements.
Although the two previous synchronous schemes allow updating anywhere, updating the same resource entails costly lock coordination over the network. Scalable updating is usually achieved by data partitioning.
Although the two previous synchronous schemes allow distributed and scalable reads, you can fine-tune your partitions to allow more concurrent reads.
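Assuming hypothetical class names, the primary-backup partitioning recommended above can be sketched as routing each key to one partition, whose primary synchronously copies every write to its backup:

```python
# Sketch of primary-backup partitioning (figure 1 style): each key hashes to
# exactly one partition; the primary applies the write and synchronously
# copies it to the backup instance. All names are illustrative.

class Partition:
    def __init__(self):
        self.primary = {}
        self.backup = {}

    def write(self, key, value):
        self.primary[key] = value
        self.backup[key] = value   # synchronous copy to the backup instance

class PartitionedGrid:
    def __init__(self, n_partitions):
        self.partitions = [Partition() for _ in range(n_partitions)]

    def route(self, key):
        """Every key maps to exactly one partition, so there is no
        cross-partition lock coordination for an update."""
        return self.partitions[hash(key) % len(self.partitions)]

    def write(self, key, value):
        self.route(key).write(key, value)

    def read(self, key):
        return self.route(key).primary.get(key)
```

Because any two updates to the same key land on the same primary, only that node's local lock is needed, which is exactly why partitioned updates scale.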

Figure 1: The Primary-Backup Partition in Gigaspaces

4. Asynchronous Replication with Update Anywhere
Another data coherence protocol in Oracle's multimaster replication is asynchronous replication, which allows users to update data at any participating site.
This scheme is also used in Oracle's updatable materialized view replication and in TimesTen's bidirectional master-subscriber replication for general distributed workloads.

With this scheme, the data changes at one site are committed locally and queued for propagation to the other sites. The queued changes are propagated in batches in a separate transaction. So it doesn't use any distributed lock or distributed transaction; it only uses the local locks required by the corresponding local transaction.

Benefits include great read and write performance, easy implementation, simple administration, a fit for low-speed interconnects such as LANs and WANs, and support for disconnected updating.
Drawbacks include a limited degree of data coherence, depending on the data refresh frequency, and possible data change conflicts.
Because no distributed lock or distributed transaction is involved, a replication conflict occurs if two transactions originating from different sites update the same row at nearly the same time (when the queued changes are propagated to the other site, that site will have two versions of the data change; which one should take effect?).

A conflict resolution method must be provided to resolve the data inconsistency. Both Oracle and TimesTen have a prebuilt "latest timestamp" resolution method that makes the change with the latest timestamp the winner. Oracle also allows you to customize a resolution method based on your business requirements.
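A "latest timestamp" resolution method can be sketched in one function; the change-tuple layout here is an assumption for illustration, not Oracle's or TimesTen's actual format:

```python
# Sketch of a "latest timestamp" conflict resolution method: when two sites
# update the same row independently, the change carrying the newest
# timestamp wins. The (timestamp, site, value) layout is illustrative.

def resolve_conflict(change_a, change_b):
    """Return the winning change: the one with the latest timestamp."""
    return max(change_a, change_b, key=lambda change: change[0])
```

A custom method, as Oracle allows, would simply replace the key function with business-specific logic (for example, site priority or additive merging of numeric columns).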

This scheme can't be applied to DATS if oversold situations are not allowed, because the changes at the NY and LA sites can be committed independently in two different transactions, resulting in the same ticket being sold to two customers.
If occasional oversold situations are allowed, the NY and LA sites can sell tickets at different times thanks to the three-hour time zone difference. If a replication conflict does occur, the relevant information should be recorded in the database so that your front-end application can take proper actions (in reality, a reservation system like DATS doesn't use this scheme).

5. Asynchronous Replication with Update on Master Site only

This scheme is used in Oracle's read-only materialized view replication, TimesTen's unidirectional master-subscriber replication and Gigaspaces IMDG's master-local replication.

This scheme basically has similar benefits and drawbacks to the previous scheme in section 4. However because it only allows updates at the master, it eliminates the notorious replication conflicts, which most of the time proves to be a very sound design in an asynchronous replication environment.

If we apply this scheme to DATS and suppose NY is the master site (or a third site is the master), NY has to wait if LA first gets the local lock at the master site. The local lock at the master site prevents the oversold situation.

Life is much easier with Gigaspaces IMDG's master-local topology, as shown in figure 2, because it automatically delegates your local cache updates to the master, which then propagates the same updates to the other local caches. Gigaspaces IMDG also supports optimistic locking with versioning.
You must do both yourself if you use Oracle's read-only materialized view replication or TimesTen's unidirectional master-subscriber replication.
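The master-local delegation can be sketched with hypothetical MasterSpace and LocalCache classes (not the real Gigaspaces API): a local cache forwards writes to the master, and the master pushes each update to every registered local cache.

```python
# Sketch of a master-local topology: local caches delegate writes to the
# master, which propagates each update to all registered local caches.
# All names are illustrative, not the Gigaspaces API.

class MasterSpace:
    def __init__(self):
        self.data = {}
        self.local_caches = []

    def register(self, cache):
        self.local_caches.append(cache)

    def write(self, key, value):
        self.data[key] = value
        for cache in self.local_caches:   # push the update everywhere
            cache.data[key] = value

class LocalCache:
    def __init__(self, master):
        self.data = {}
        self.master = master
        master.register(self)

    def write(self, key, value):
        self.master.write(key, value)     # delegate the write to the master

    def read(self, key):
        return self.data.get(key)         # served from the local copy
```

A write through one local cache thus becomes visible in every other local cache without the application doing any propagation itself, which is the "life is much easier" part.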

Figure 2: Gigaspaces Master-Local Topology where Master can be Figure 1


  1. Yongjun,
    This is a very good write-up. To summarize, here are the GigaSpaces supported topologies when having an IMDG across distributed systems:
    - Partitioned-sync-replicated IMDG. Each partition has only one primary instance and one backup instance. Distributed transaction used. Good for LAN.
    - Partitioned-Async-replicated IMDG. Each partition has only one primary instance and one backup instance. Distributed transaction used. Good for WAN.
    - Sync-replicated IMDG – all nodes are active. Read/write allowed from all nodes. Local transaction used. Good for LAN.
    - Async-replicated IMDG - all nodes are active. Read/write allowed from all nodes. Local transaction used. Good for WAN.
    - Multi IMDGs – synchronization done via custom Mirror implementation. Each site running its own Partitioned-sync-replicated IMDG. Good for WAN with many independent sites.

    In most cases a partition will have a single backup since, in case of a failure, the existing backup will be voted to be the primary and a new backup will be created automatically.

    With all the options above, the application may have a client-side cache running in its own memory address space. We support 2 flavors of client cache:
    - Local cache – lazy data load approach (based on client activity)
    - Local view – pre-fetch data load approach (based on SQL Query)
    Both use versioning with optimistic locking.

    See more details here:

    Shay Hassidim
    Deputy CTO