Wednesday, November 2, 2011

Floating-Point Number Equality Comparison in Oracle, Sybase / SQL Server, Java and JMS

As you know, computers can't store all floating-point numbers exactly, primarily due to finite memory. A double value such as 0.1 is actually represented as 0.1000000000000000055511151231257827021181583404541015625 in memory. Due to such rounding errors, equality comparisons on floating-point numbers will often fail.
In this post I will show how SQL and Java store exact floating-point numbers (that is, decimal numbers), and how floating-point literals are represented.
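Before diving into SQL, a minimal Java sketch makes the rounding problem concrete (the long expansion below is the exact binary value behind the double 0.1):

```java
import java.math.BigDecimal;

public class DoubleRounding {
    public static void main(String[] args) {
        // 0.1 and 0.2 are both stored inexactly, so their sum is not exactly 0.3
        System.out.println(0.1 + 0.2 == 0.3);  // false
        System.out.println(0.1 + 0.2);         // 0.30000000000000004

        // new BigDecimal(double) exposes the exact binary value behind 0.1
        System.out.println(new BigDecimal(0.1));
        // 0.1000000000000000055511151231257827021181583404541015625
    }
}
```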

SQL99 uses FLOAT, REAL and DOUBLE PRECISION for approximate floating-point numbers, and NUMERIC(p,s) / DECIMAL(p,s) for exact decimal numbers. A floating-point literal is treated as an approximate type if it uses the E (scientific) notation; otherwise it is treated as an exact type, provided the literal fits the precision and scale of the exact types.
Because NUMERIC(p,s) / DECIMAL(p,s) stores your decimal numbers without loss of precision, it should be used whenever you want to compare a floating-point column (in the database) or variable (in Java) for equality against a decimal literal such as 0.1.
Different database vendors deviate slightly from this specification.

Oracle 10g uses NUMBER(p,s) for exact types, in addition to BINARY_FLOAT, BINARY_DOUBLE and FLOAT(p) for approximate types. Make sure you always specify the precision (p) for the exact type.
If a floating-point literal is suffixed by f or F (for BINARY_FLOAT), or d or D (for BINARY_DOUBLE), it is of an approximate type; otherwise it is of the exact NUMBER(p,s) type.

Sybase ASE 15 and MS SQL Server 2008 both comply with the above SQL99 standard very well. Both also have MONEY (and SMALLMONEY) for exact decimal numbers.

Java provides BigDecimal to store decimal values exactly. For example, new BigDecimal("0.1") stores 0.1 exactly.
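A small sketch of exact comparison with BigDecimal; note that equals() also compares scale, so compareTo() is usually what you want for numeric equality:

```java
import java.math.BigDecimal;

public class ExactDecimal {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("0.1");
        BigDecimal b = new BigDecimal("0.2");

        // exact decimal arithmetic: 0.1 + 0.2 is exactly 0.3
        System.out.println(a.add(b).compareTo(new BigDecimal("0.3")) == 0); // true

        // beware: equals() compares scale too, so 0.10 is not equals() to 0.1
        System.out.println(new BigDecimal("0.10").equals(a));    // false
        System.out.println(new BigDecimal("0.10").compareTo(a)); // 0
    }
}
```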

JMS's message selector is a string whose syntax is based on a subset of SQL92. Unfortunately, JMS doesn't support exact decimal numbers: its exact numeric literals are restricted to those without a decimal point.
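One workaround (my own suggestion, not something the JMS spec provides) is to publish the value as a scaled integer property, e.g. cents instead of dollars, so that the selector only ever compares exact numeric literals:

```java
public class ScaledInteger {
    public static void main(String[] args) {
        // instead of setting a double property price = 0.1 and selecting
        // "price = 0.1", scale to an exact integer unit (here: cents)
        double price = 0.1;
        long priceCents = Math.round(price * 100);

        // the selector then uses an exact numeric literal: "priceCents = 10"
        System.out.println(priceCents == 10); // true
    }
}
```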

Wednesday, August 17, 2011

Key Concepts to Understand Spring's Consistent Transaction Handling

Class FooBarService shows simplified pseudo code for different transaction handling paradigms using JDBC, Hibernate and JTA.
Class FooBarServiceSpring shows simplified pseudo code for consistent transaction handling using Spring.
Obviously Spring's transaction handling is quite different from JDBC and Hibernate, and is closer to JTA. However, by no means does Spring only support JTA. Instead, it is only a matter of configuring a different PlatformTransactionManager (PTM hereafter) in your application context in order to support a different transaction handling paradigm. For example, Spring's DataSourceTransactionManager, HibernateTransactionManager and JtaTransactionManager are for JDBC, Hibernate and JTA, respectively.
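To make the configuration point concrete, here is a hedged sketch of the JDBC case (the bean ids and the dataSource bean are my assumptions, and the tx namespace is assumed to be declared):

```xml
<!-- enables @Transactional processing -->
<tx:annotation-driven transaction-manager="transactionManager"/>

<!-- the PTM for plain JDBC; swap in HibernateTransactionManager or
     JtaTransactionManager to switch paradigms without touching service code -->
<bean id="transactionManager"
      class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
   <property name="dataSource" ref="dataSource"/>
</bean>
```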

This discussion can help readers understand some internals of Spring's  PTM implementations on heterogeneous native transaction platforms.


//different transaction paradigms using JDBC,Hibernate and JTA
class FooBarService {
  //local transaction paradigm using JDBC
  public void foo() {
    //creates a session with the backend database
    Connection conn = dataSource.getConnection();
    //starts a new local transaction.
    conn.setAutoCommit(false);

    updates-data-through-this-conn

    //commits the local transaction
    conn.commit();
    //closes the session
    conn.close();
  }

  //local transaction paradigm using Hibernate
  public void bar() {
    //creates a session with the backend database
    Session session = sessionFactory.openSession();
    //starts a new local transaction.
    Transaction tx = session.beginTransaction();

    updates-data-through-this-session

    //commits the local transaction
    tx.commit();
    //closes the session
    session.close();
  }

  //Global transaction paradigm using JTA
  public void fooBar() {
    userTransaction.begin();

    //creates a session with the backend database1
    Connection conn1 = dataSource1.getConnection();
    updates-data-through-this-conn1

    //creates another session with the backend database2
    Connection conn2 = dataSource2.getConnection();
    updates-data-through-this-conn2

    //closes the two sessions
    conn1.close();
    conn2.close();

    userTransaction.commit();
  }
}


//consistent transaction handling paradigms using Spring
class FooBarServiceSpring {
  @Transactional
  public void foo() {
    //updates data through some data source connection
  }
 
  @Transactional(propagation=Propagation.REQUIRES_NEW)
  public void bar() {
    //updates data through some data source connection
  }
  @Transactional
  public void fooBar() {
    foo();
    bar();
  }
}


1. Key Transaction Abstraction
The following 3 interfaces are key to understanding Spring's transaction handling:
  • TransactionDefinition
    It allows you to define transaction requirements (isolation level, propagation behavior, timeout and read-only status) before your transaction is started.
    In FooBarServiceSpring, you specify your TransactionDefinition in @Transactional.
  • TransactionStatus
    Once your transaction is started, this interface allows you to query its current status, including the rollback-only flag and the new-vs-existing flag, and to mark the transaction as rollback-only (for example, when you encounter an exception).
    Spring's default TransactionStatus implementation also wraps the underlying native transaction, such as a connection for JDBC, a session for Hibernate and a user transaction for JTA.
    In FooBarServiceSpring, @Transactional allows you to specify your rollback rules. Other status information you either implicitly know (such as isNewTransaction and isCompleted) or can retrieve from the underlying native transaction bound to the thread (Spring binds the JDBC connection and Hibernate Session to the thread; JTA's UserTransaction is also bound to a standard JNDI name). Please note that Spring doesn't bind TransactionStatus to the thread, because the PTM can pass it around in your AOP-proxied methods.
  • PlatformTransactionManager
    It wires together the above two interfaces and acts as the overall coordinator.
    Specifically its getTransaction() method takes your TransactionDefinition and creates a TransactionStatus representing either a new or existing transaction.  Its commit() and rollback() methods take the returned TransactionStatus and commit and rollback the target transaction, respectively.
    Because it looks up the transaction bound to the current thread, a single PTM instance (a singleton) can demarcate transactions for all threads.
    In FooBarServiceSpring, @Transactional uses whatever transaction manager you configured in your application context such as DataSourceTransactionManager for JDBC.
When we say Spring transaction, we really mean TransactionDefinition and TransactionStatus. The PTM is more like a JTA UserTransaction / TransactionManager. But once again, it handles all types of native transaction APIs behind the scenes.
    2. Physical Transaction and Logical Transaction
    FooBarServiceSpring's fooBar() calls foo() and bar(). Suppose there is no transaction before you call fooBar(). Then fooBar() creates a new physical transaction, while foo() starts a nested logical transaction because its default propagation is REQUIRED; the logical transaction ends when foo() returns.
    But bar() suspends the current transaction and creates a new physical transaction because its propagation is REQUIRES_NEW. The new transaction ends and the suspended transaction is resumed when bar() exits.
    Because each (physical or logical) transaction always has a begin and an end, we sometimes use the term transaction scope.

    For data persistence, only the outer physical transaction can commit or roll back. But at the level of each individual inner logical transaction (method), you always get your own TransactionStatus and can mark the transaction as rollback-only. The native transaction embedded in the TransactionStatus is either new (if it is a physical one) or existing (if it participates in an existing one).
    You can also turn on the "validateExistingTransaction" flag in the PTM so that the inner logical transaction will reject participation if its transaction definition is not compatible with the outer transaction's.

    3. Resource, Connection, Session, Transaction and Resource Manager (RM)
    Resource can mean any valuable data, but in the context of PlatformTransactionManager it is either a connection (for JDBC, JMS etc.) or a session (for Hibernate, JMS etc.).

    Both connection and session represent a communication link with some backend resource manager (RM for short). In the database world, the two terms can be used interchangeably. Although Hibernate uses sessions, it uses a connection behind the scenes.
     
    A transaction represents a unit of work. We mentioned in Section 1 above that Spring's abstract transaction wraps an underlying native transaction.
    For JDBC and Hibernate, you create local transactions using connections / sessions, and because they have a one-to-one relationship, you can sometimes interchange the two terms. In other words, if the inner logical transaction participates in the outer physical transaction, it must reuse the same connection / session.
    For JTA, you create global transactions using transaction managers; multiple connections can be enlisted in a global transaction as transaction branches. One thing that is very different from JDBC and Hibernate is that you can make DML changes using different connections to the same backend RM instance in the same global transaction. (This is because when a connection is enlisted, an XID is created to represent this branch both in the JTA transaction manager and in the RM. Even if you use a different connection, the RM can still identify the same transaction using the passed-in XID.)

    The RM manages resources (transactions) on your behalf. The term is commonly used in the JTA/XA world. For most of us, databases are the most common RMs.

    4. Transaction and Session -- Which Comes First?
    In the foo() and bar() of FooBarService, you first create a connection / session, then start a transaction. Before you close the connection / session, you first commit / roll back the transaction.
    But in the fooBar() of FooBarService, you first start a transaction, then create a connection / session. Before you commit / roll back the transaction, you first close the connection / session.
    In FooBarServiceSpring, all you have is @Transactional, and you are not allowed to open or close any connection / session explicitly. So how can Spring hide the differences between JDBC, Hibernate and JTA, and who is taking care of connections / sessions anyway?
    Data access experience tells us that we always need a connection / session, and its lifecycle methods (open, close etc.) are just boilerplate; the resource will leak if you forget to call the close method (which actually happens very often). But we must define transaction requirements ourselves. For example, only you know that the bar() in FooBarServiceSpring requires REQUIRES_NEW propagation based on your business rules.

    So Spring takes care of connections / sessions behind the scenes. You just need to specify your transaction requirements using @Transactional. The next section tells you more details.
    5. Transaction and Session -- Resource Synchronization with Transaction
    Because you always explicitly specify your transaction requirements, Spring synchronizes the needed resource (connection or session) with the transaction at the transaction scope boundaries, i.e. at transaction start and end.
     

    For JDBC and Hibernate, Spring binds the connection / session (hence also the local transaction) to the thread so that it is always available on the execution call stack.
    Spring's PTM automatically either retrieves an existing connection / session from the thread (if an outer transaction is already there) or creates a new connection / session and binds it to the thread before starting a transaction.
    Right before the transaction is committed or rolled back, Spring's PTM also automatically closes the connection / session and unbinds it from the thread.


    For JTA, Spring gets the UserTransaction from the standard JNDI name, so the global transaction is also available on the execution call stack.
    When you retrieve a connection / session, it is automatically enlisted in the global transaction (Spring doesn't help here; the underlying XADataSource usually does the magic).

    Right before the JTA transaction is committed or rolled back, Spring's PTM once again automatically closes the connection / session.

    6. Nested Transaction and Transaction Suspend / Resume
    • In a nested transaction, both the outer and inner transactions are live in your thread, whereas when a transaction is suspended, your thread works on another new transaction;
    • Nested transactions are dependent, i.e. only the outer transaction is physical and can commit the changes, but you can roll back the inner logical transaction.
      The suspended transaction and the new transaction are two physical transactions and are independent, i.e. they can each commit or roll back independently;
    • Suspending a transaction is deadlock-prone, because the backend RM still holds the resource locks of the suspended transaction even though your JTA has switched to another new transaction. You should be especially careful if your application locks resources in the suspended transaction that are also needed by the new transaction;
    • Nested transactions are only supported by JDBC through the savepoint feature. Because XA doesn't support nested transactions, neither can JTA (this seems ironic because most databases support savepoints);
    • Suspending a JDBC transaction in Spring means first unbinding it from the thread, then creating a new connection and binding that to the thread. For JTA, Spring just delegates suspend / resume to the underlying JTA TransactionManager, in addition to its own bookkeeping for resource synchronizations.
    7. Propagation.SUPPORTS and No Transaction
    No transaction means your data changes are committed at the level of each individual SQL statement (autocommit in JDBC terms) instead of grouping several SQL statements into one transaction.

    Propagation.SUPPORTS also commits data changes at the individual SQL statement level if there is no existing transaction. But if you set transaction synchronization in the PTM to ALWAYS, it also supports resource synchronization on an empty transaction scope (an empty transaction scope means there is no underlying native transaction).
    Concretely, the same JDBC connection or Hibernate Session will be bound to the thread inside the empty transaction scope for reuse along the execution call stack until the scope ends. Be aware that the connection is of course in autocommit mode.

    Please remember that if your empty transaction scope has active resource synchronization, don't nest Propagation.REQUIRED or Propagation.REQUIRES_NEW inside it, because the resource synchronizations for the two transaction scopes will conflict (unfortunately you have to read the PTM source code to get to the bottom of this).



    Wednesday, August 10, 2011

    Dynamically Creating Quartz Schedulers using Spring

    Here are my requirements: I need to dynamically create Quartz cron triggers and add them to my scheduler. All my triggers schedule the same job, but with trigger-specific job map data.

    Before I show how to dynamically create Quartz schedulers using Spring, let's quickly review Spring's static configuration. The following excerpt comes from "Chapter 23 Scheduling and Thread Pooling" in the Spring 2.5.6 reference documentation and shows how to statically create a Quartz scheduler:

    <bean name="exampleJob" class="org.springframework.scheduling.quartz.JobDetailBean">
       <property name="jobClass" value="example.ExampleJob"/>
    </bean>

    <bean id="cronTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean">
       <property name="jobDetail" ref="exampleJob" />
       <!-- run every morning at 6 AM -->
       <property name="cronExpression" value="0 0 6 * * ?" />
    </bean>

    <bean class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
       <property name="triggers">
          <list>
             <ref bean="cronTrigger" />
          </list>
       </property>
    </bean>

    Spring's JobDetailBean extends Quartz's JobDetail and provides the following added functions:
    1. Sets up sensible defaults in its life-cycle init method afterPropertiesSet(), e.g. sets the job's name to the bean name and the job's group name to "DEFAULT" if they are empty (remember that the Quartz scheduler uniquely identifies a job by its name and group);
    2. Supports any job class that implements the Runnable interface, besides Quartz's regular Job interface;
    3. If your actual Job class extends Spring's QuartzJobBean, Spring also automatically injects the job map data into the job's properties;
    Spring's CronTriggerBean extends Quartz's CronTrigger and provides the following added functions:
    1. Sets up sensible defaults in its life-cycle init method afterPropertiesSet(), e.g. sets the trigger's name to the bean name and the trigger's group name to "DEFAULT" if they are empty (remember that the Quartz scheduler uniquely identifies a trigger by its name and group);
    2. Associates the trigger with the provided JobDetail in afterPropertiesSet();
    3. Sets the trigger's timezone to the default one and its start time to the current time in afterPropertiesSet();
    4. Sets trigger-specific job map data through Java's Map interface;
    Spring's SchedulerFactoryBean is a Quartz Scheduler factory bean and wires jobs and triggers together. It implements many functions behind the scenes that you would have to do yourself if you used the Quartz API directly:
    1. Overrides default Quartz properties, e.g. thread pool parameters and JMX export;
    2. Registers triggers;
    3. Registers jobs. If you have already assigned a job in your Spring trigger bean, you don't need this;
    4. Sets the JobFactory to Spring's default AdaptableJobFactory if you haven't set one. AdaptableJobFactory supports Item 2 in the JobDetailBean's added functions above. AdaptableJobFactory's subclass SpringBeanJobFactory also supports the function mentioned in Item 3 of the JobDetailBean's added functions;
    5. Schedules your jobs with Quartz's scheduler;
    All of the above functions are implemented in the bean's init method afterPropertiesSet().

    Here are the steps to dynamically configure a Quartz scheduler using Spring to meet my requirements above.
    1. Still configure my JobDetailBean, and my actual Job still extends Spring's QuartzJobBean in order to take advantage of the added functions;
    2. Still configure my CronTriggerBean using Spring in order to take advantage of the added functions. But its scope should be "prototype" so that I can dynamically retrieve a new trigger bean instance and set its cronExpression and its job map data.
      Here is the new trigger configuration:

      <bean id="cronTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean" scope="prototype">
      <property name="jobDetail" ref="exampleJob"/>
      </bean>

      Here is the pseudo code:
      CronTrigger ct = (CronTrigger)appCtx.getBean("cronTrigger");
      ct.setName(ct.getName() + some-distinguisher);
      ct.getJobDataMap().put(some-key, some-value);
      ct.setCronExpression(some-exp); 

    3. Still configure my SchedulerFactoryBean. But I need to add my triggers to the scheduler dynamically. Because I create the above trigger after SchedulerFactoryBean's afterPropertiesSet(), the trigger's associated JobDetail is not added to the scheduler automatically. But I can explicitly register the JobDetail:

      <bean id="scheduler" class="org.springframework.scheduling.quartz.SchedulerFactoryBean" lazy-init="false">
         <property name="jobDetails">
            <list>
              <ref bean="exampleJob"/>
            </list>
         </property>
         <property name="triggers"> <!-- will dynamically add more triggers -->
            <list/>
         </property>
         <property name="autoStartup" value="true"/>
         <property name="quartzProperties">
            <props>
              <prop key="org.quartz.scheduler.jmx.export">true</prop>
              <prop key="org.quartz.threadPool.threadCount">2</prop>
            </props>
         </property>
      </bean>
      Here is the pseudo code:
      Scheduler sched = (Scheduler)appCtx.getBean("scheduler");
      sched.scheduleJob(ct);

    Wednesday, June 15, 2011

    Will class literals (some-named-type.class) cause class initialization?

    Suppose you have the following classes:

    class Foo {
        static int cls_var = 10;
        static {
            System.out.println("cls_var:" + cls_var);
        }
       
        int inst_var = 20;
    }

    class Bar {
        public static void main(String arg[]) {
            System.out.println(Foo.class.getName());
        }
    }


    Will the class literal Foo.class cause Foo to be initialized (assign 10 to cls_var and run the static initializer)?
    On the console you will only see "Foo"; you will not see "cls_var:10" as you might expect.

    In order to find out why, you need to know that the JVM subjects a class to the following 3 processes, in order:

    1. Loading 
    2. Linking: Verification, Preparation, and Resolution   
    3. Initialization
    The class literal evaluates to the Class object for the named type (or for void), as defined by the defining class loader of the class of the current instance; this information is provided by Step 1 above.

    Foo's static variable initialization and the execution of its static initializer are done in Step 3. Step 3 can be executed lazily by the JVM after Step 1, which means Step 3 may not immediately follow Step 1.

    This is why you will only see "Foo" on the console. For more details, please see the JVM spec and the Java language spec.
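    Here is a hedged, self-contained demonstration (the wrapper class and the initialized flag are mine) that makes the lazy Step 3 observable:

```java
public class ClassLiteralDemo {
    static boolean initialized = false;

    static class Foo {
        static int cls_var = 10;
        static {
            initialized = true; // runs only when Foo is initialized
        }
    }

    public static void main(String[] args) {
        // the class literal loads Foo but does NOT initialize it
        String name = Foo.class.getName();
        System.out.println(initialized); // false

        // reading a non-constant static field triggers initialization (Step 3)
        int v = Foo.cls_var;
        System.out.println(initialized); // true
    }
}
```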

    Tuesday, April 12, 2011

    A GPGPU Article Published in DZone.com

    Large data set processing needs parallel computing, which can be hosted on commodity processors ranging from low-end multi-core CPUs to high-end many-core GPUs. OpenCL can program both CPUs and GPUs, and even future parallel processors. Readers can learn from the article what workloads each processor is good at.

    The article first discusses why OpenCL is a better programming paradigm for parallel computing on heterogeneous processors than the alternatives (OpenMP, Pthreads, CUDA and ATI Stream); it then compares the major hardware differences between CPU and GPU, and the different workloads each is good at; finally, a brief example shows how to parallelize an originally single-threaded algorithm across CPU and GPU using OpenCL.

    This article was based on coding experience with Lili Zhou, a medical physics resident at NY Presbyterian Hospital / Weill Cornell Medical Center. More discussions are expected as we migrate more single-threaded programs to multi-core CPUs and GPGPUs.
    This article is basically a rerun of my previous post.

    Friday, April 1, 2011

    Throwing a RuntimeException in a Java Class Static Initializer?

    When you want to run some logic during class initialization, you put it into a so-called static initializer, e.g. something like this:

    public class MyTest {   
        static {
            if (some-condition)
               throw new RuntimeException("something wrong");
        }
       
        public void test() {
            System.out.println("throw RTE in static initializer");
        }
    }

    However there is a caveat. If the RuntimeException is thrown, class MyTest will still be loaded, but in an erroneous state. So you can't handle the following scenario.
    Assume that in the static initializer you want to retrieve some external property values, and the RTE is thrown if some property value is not configured yet. MyTest is also supposed to be deployed on the server side.
    When the JVM first attempts to initialize MyTest, the RTE is thrown because of a missing property value. The user sees the error and configures the missing property values, and now hopes to be able to access this class.
    However, when MyTest is accessed again, he/she will get NoClassDefFoundError.

    This error is especially confusing in this case, because the binary code of MyTest was indeed found by the JVM classloader; the class is just in an erroneous state.
    To find the details, you need to read section 2.17 Execution in the JVM Specification. Basically, a class like MyTest must go through 3 stages in order before it can be put to use: Loading, Linking and Initialization. The static initializer is executed during Initialization. If any exception is thrown during class initialization, the JVM throws ExceptionInInitializerError and puts the class in an erroneous state (since Initialization comes after Loading, the class was indeed found and loaded by the JVM).
    Based on Step 5 in section 2.17.5 of the above reference, the JVM will refuse to initialize the class again, throwing NoClassDefFoundError even though the class has been found and loaded.
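    The whole error sequence can be reproduced with a small self-contained sketch (the wrapper class name is mine):

```java
public class ErroneousInit {
    static class MyTest {
        static {
            if (true) throw new RuntimeException("something wrong");
        }
        static int marker = 1;
    }

    public static void main(String[] args) {
        // first access: static initializer throws -> ExceptionInInitializerError
        try {
            int v = MyTest.marker;
        } catch (Throwable t) {
            System.out.println(t.getClass().getSimpleName()); // ExceptionInInitializerError
        }
        // second access: the class is now in an erroneous state -> NoClassDefFoundError
        try {
            int v = MyTest.marker;
        } catch (Throwable t) {
            System.out.println(t.getClass().getSimpleName()); // NoClassDefFoundError
        }
    }
}
```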

    Because of this caveat, you may have to throw the exception in a class constructor instead, and use a class variable as a flag to avoid repeated execution of logic that is supposed to run only once.

    Tuesday, March 1, 2011

    A Data Replication Article Published in InfoQ.com

    Although data replication is an old technology (you can find it in Oracle Advanced Replication and Oracle RAC, formerly OPS), it is also widely used in newer products such as TimesTen, GigaSpaces and NoSQL stores.

    This technical article is based on my previous posting "Locking Schemes for Data Replication Updating" and was reviewed by Ron Bodkin, CEO of Think Big Analytics.

    It discusses and compares several schemes commonly found in commercial and open-source products.
    It helps you understand the pros and cons of different data replication schemes and assists you in choosing the right one for your business requirements.

    Monday, February 28, 2011

    Welcome to the World!

    My baby, Bryan Zhou Jiao, finally came into the world on Feb 15, 2011. Now I truly understand how hard the ten-month-long pregnancy is, especially considering that both my wife and I have passed the prime time for having a baby.
    Fortunately the whole pregnancy basically went smoothly, and both Mom and the baby are safe and healthy.
    I am also a bit surprised so far. I am very afraid of baby crying, and originally I thought babies would cry very often. Luckily my little one is not worrisome: he only cries when he wakes up and feels hungry, which seems to be his only requirement of us.

    Here are some pictures. 



    More photos and videos are available at Flickr.


    Saturday, February 26, 2011

    GPGPU High Performance Computing using OpenCL -- Parallel Computing on Heterogeneous Platforms

    The following technical discussion was co-authored with Lili Zhou, who is a medical physics resident at NY Presbyterian Hospital / Weill Cornell Medical Center.
    The discussion can help you understand the major hardware differences between multi-core CPUs and many-core GPUs (GPGPUs) and the different types of parallel workloads that CPUs and GPUs are each suited for.
    Here it is.


    Processing of large data sets entails parallel computing in order to achieve high throughput. Traditionally, high throughput was only achievable through either proprietary hardware (Application-Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA), or even Cell BE) or multi-node clusters.
    However, as throughput-oriented multi-core processors become more pervasive, the same performance is now achievable through these commodity parallel architectures that range from multi-core CPU for low parallelism to many-core Graphics Processing Unit (GPU) for massive parallelism.
    This article focuses on the major hardware differences between CPU and GPU, which in turn determine the workloads each processor is suited for. At the end, it gives a data partition example using OpenCL across CPU and GPU, and compares the performance on the two platforms.

    1 OpenCL – a New and Better Parallel Computing Paradigm

    Because you most probably have different workloads that can be better processed using CPU than GPU or vice versa, you need to master parallel computing for both CPU and GPU.
    However, parallel computing for heterogeneous processors is challenging as traditional programming paradigms for CPU and GPU are very different.
    For example, you usually use POSIX Threads or Pthreads or OpenMP for CPU.
    Although the GPU typically only handles graphics computations, its gigaflops or even teraflops of floating-point performance and high memory bandwidth make it suited to some non-graphics applications, including data-intensive scientific and engineering computations; this falls into the scope of so-called General-Purpose Computing on Graphics Processing Units (GPGPU).
    However, GPGPU programming originally required programmers to possess intimate knowledge of graphics APIs (OpenGL or DirectX) and GPU architecture and to make their non-graphics applications look like graphics ones by mapping them into problems that drew triangles and polygons. This placed a great constraint for non-graphics domain scientists and engineers.
    This old approach was adopted by Fang et al in the CT reconstruction algorithm.
    With the advent of Unified Shaders, dedicated general-purpose GPUs such as the Tesla series from NVIDIA and the FireStream series from AMD, and high-level programming paradigms such as CUDA from NVIDIA and ATI Stream from AMD, CPU-based non-graphics applications can now directly access the tremendous performance of GPUs without the above limitations.
    Unfortunately, all of the above programming paradigms have very sharp learning curves, and some are vendor-specific. Worse yet, none of them may be supported by future parallel processors.
    OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPUs, GPGPUs, and other processors such as DSPs and the Cell/B.E., giving software developers portable and efficient access to the power of these heterogeneous processing platforms.
    Although OpenCL is very new, it is a better choice because of the following factors:
    • It utilizes a subset of ISO C99 with extensions for parallelism. You are probably already familiar with C programming;
    •  You don’t need to learn anything else in order to take advantage of the parallel structures of current and even future parallel processors;
    •  You just need to analyze your algorithms for data or tasks decomposition without even knowing the underlying complex multi-threaded programming mechanism, not to mention any graphics API or GPU internals.
    We will use OpenCL terminologies in the following discussion. If you are not familiar with them, you are encouraged to read the first three chapters in the OpenCL specification documentation.

    2 Major Differences between CPU and GPGPU in Parallel Computing

    You need to know the major hardware differences to be able to select either CPU or GPGPU for different workloads. This discussion uses Dell Precision T7500 Tower Workstation (it has 2 Intel E5507 quad-core processors without Hyper-threading) and NVIDIA Tesla C2050 for reference. You can click their hyperlinks to get introductions for their capabilities. 
    C2050 is based on NVIDIA's next-generation CUDA architecture codenamed "Fermi" and has 14 CUDA 32-core multi-processors, one of which is shown below (courtesy of this resource):

    2.1 GPGPU Supports Many More Hardware Threads than CPU

    The T7500 supports only 8 hardware threads (4 compute units, or cores in CPU terms, per processor × 2 processors), while the C2050 supports 21,504 (14 compute units, or multiprocessors in CUDA terms, × 1,536 resident threads each), which puts the C2050 in a much better position to process massive amounts of data in parallel.
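The two capacities above are just products of the per-device figures quoted in this article; spelled out:

```c
/* Hardware thread capacity, using the figures quoted in this article. */
int t7500_hw_threads(void) { return 2 * 4; }     /* 2 sockets x 4 cores, no Hyper-Threading */
int c2050_hw_threads(void) { return 14 * 1536; } /* 14 compute units x 1536 resident threads */
```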

    2.2 GPGPU’s Threads Are Lightweight

    By “hardware threads” above, we mean that the context, or state, of each thread on a processing element (one of the 4 cores in the T7500, or one of the 32 CUDA cores in a C2050 compute unit) is saved in registers allocated to that thread, so that context switching can be done in just one cycle.
    If you create more than 8 threads on the T7500, context switching involves software operations, which take many more cycles and are accordingly expensive and heavyweight.
    Because each C2050 compute unit has 32,768 registers, the device can support as many as 21,504 in-flight hardware threads.
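One practical consequence, assuming those 32,768 registers belong to each compute unit as on Fermi: the register file caps how many registers a kernel may use per thread before the number of resident threads drops. A quick sketch of that budget:

```c
/* Upper bound on registers per thread at full occupancy: the per-compute-unit
   register file divided by the resident threads it must hold. */
int regs_per_thread_at_full_occupancy(int regfile_size, int resident_threads) {
    return regfile_size / resident_threads;  /* integer floor */
}
```

With 32,768 registers shared by 1,536 resident threads, each thread gets at most 21 registers; a register-hungry kernel therefore trades occupancy for registers.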

    2.3 GPGPU’s SIMD and Global Memory Interfaces Are Much Wider than CPU’s

    Single-Instruction Multiple-Data (SIMD) is suited to processing large data sets in batch mode. The T7500 implements SIMD with SSE, which is 128 bits wide, while the C2050 implements SIMD as Single-Instruction Multiple-Thread (SIMT): the smallest execution unit, a group of 32 threads, runs a single instruction, making it effectively 1,024 bits wide.
    The T7500 has a 192-bit global memory (system memory in CPU terms) interface through 3 parallel channels per processor, while the C2050’s memory interface is 384 bits wide, through 6 channels.
    This means C2050 is better suited to large data sets that possess spatial locality.

    2.4 CPU Has Larger Cache and Lower Latency than GPGPU

    The T7500 has 32K of L1 and 256K of L2 per compute unit plus a 4M unified L3, while the C2050 has 16K of L1 per compute unit and a 768K unified L2. This means the CPU is more forgiving of poor spatial locality than the GPGPU, and also has lower memory access latency.
    However, the GPGPU is optimized for high throughput: it hides the latency of memory accesses and other instructions by fast context switching to ready threads in the scheduling queue. This assumes not only a large number of in-flight threads but also a high number of arithmetic operations per memory operation (if you perform only one arithmetic operation per memory load, context switching doesn’t buy much).

    2.5 GPGPU Gives Developers Direct Control over Its Fast On-chip Local Memory

    GPGPU’s local memory (shared memory in CUDA terms) roughly corresponds to CPU’s L1 cache with respect to latency.
    This memory region is shared by all work items (parallel threads) in a work group (a thread block), and can be used to boost effective bandwidth and scalability by reducing or eliminating redundant loads from global memory and by coalescing strided accesses to global memory.
    Coalescing strided accesses is possible only when the memory accesses of all work items in a work group fall within the local memory; in other words, the work group exhibits spatial locality in the local memory.
    The C2050 has 48K of local memory per compute unit. Without it, any redundant global memory loads from tens of thousands of threads cause heavy memory bus contention, and un-coalesced strided accesses waste memory bandwidth.
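To see how much local memory can save, consider a hypothetical 3-point stencil (not Conebeam's kernel) where every work item reads its own element and its two neighbors. A load-counting sketch: without local memory, each of the g work items in a group issues 3 global loads; if the group first stages a tile, it issues only g loads plus a 2-element halo:

```c
/* Global-memory load counts for one work group of size g running a
   hypothetical 3-point stencil (illustration only). */
int loads_without_local_mem(int g) { return 3 * g; } /* every item loads 3 elements */
int loads_with_local_mem(int g)    { return g + 2; } /* one tile of g plus 2 halo elements */
```

For a 256-item work group that is 768 global loads versus 258 — roughly a 3x reduction in global-memory traffic, obtained entirely in software.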

    2.6 Data Needs Transfer Between GPGPU and Host

    The T7500 uses Gen2 PCIe x16 for data transfer. Its bandwidth of 8GBps is much lower than the C2050’s memory bandwidth of 144GBps.
    So the number of compute operations per data element transferred in your parallel algorithms should be high; otherwise the work might be better handled by the CPU.
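A back-of-envelope way to state this: the time to move one element over PCIe, times the device's arithmetic rate, gives the number of operations needed before compute even matches transfer time. A hedged sketch; the ~1 TFLOP/s peak rate assumed below is an illustrative figure, not from this article:

```c
/* Ops per transferred element needed before compute time matches
   PCIe transfer time. peak_flops is an assumed device peak rate. */
double breakeven_ops_per_element(double bytes_per_element,
                                 double pcie_bytes_per_sec,
                                 double peak_flops) {
    double transfer_time_sec = bytes_per_element / pcie_bytes_per_sec;
    return transfer_time_sec * peak_flops;
}
```

With 4-byte floats at 8GBps and an assumed 1 TFLOP/s peak, the break-even point is about 500 operations per element; algorithms far below that are probably better left on the CPU.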

    3 Different Workload Processing for CPU and GPGPU

    A fundamental requirement for parallel computing on both CPU and GPGPU is that your data or tasks can be decomposed for concurrent processing. However, CPU and GPGPU are suited to different types of workloads.
    If your data or tasks exhibit little or no parallelism, you should leave them to the CPU for serial processing. OpenCL allows you to parallelize parts of your CPU-based applications by putting the parallel parts into an OpenCL kernel while leaving the rest untouched.

    3.1 GPGPU Is Better Suited to Process Large Data Sets

    Small data sets can’t justify the time incurred by GPGPU’s slow data transfer over PCIe and other setup costs. Large data sets also spawn the large number of lightweight threads required to take full advantage of GPGPU’s massively parallel structure and to hide the latency of memory accesses and other instructions through fast context switching.

    3.2 GPGPU Requires the Number of Arithmetic Operations per Memory Operation Be High

    This is to hide GPGPU’s long latency for memory accesses and other instructions. Otherwise the CPU is a better choice thanks to its larger cache.

    3.3 Data Transfer on PCIe Should Be Minimized

    In other words, the number of compute operations performed by the GPGPU per data element transferred should be high; otherwise the CPU is a better choice due to the low PCIe bandwidth.

    3.4 Memory Access Should Have Spatial Locality

    This matters more for GPGPU than for CPU: the CPU has a larger cache, while without spatial locality GPGPU’s wider SIMD and memory interface only waste bandwidth and limit scalability across its compute units.
    Note that you can relax this requirement a bit for GPGPU if your parallel algorithm meets requirement 3.2 very well, because a large number of arithmetic operations does hide memory access latency. This relaxation is actually a winning point: to use the CPU’s SIMD feature, you must explicitly pack your data into vectors, which requires your data to have perfect spatial locality.

    4 Partition Your Data for Data Parallelism

    The key to OpenCL programming is to partition your data or tasks so that they can be processed in parallel. Task parallelism requires you to vectorize your data and/or submit multiple small kernels.
    Since you face data parallelism most of the time, and it needs more data design consideration, this discussion deals only with data parallelism.
    Data parallelism allows you to concurrently apply the same OpenCL kernel (the function to run on the GPU in parallel) to multiple independent work items (data elements) in an OpenCL index space (a memory object decomposed into data elements). OpenCL doesn’t require a one-to-one mapping between work items and data elements.
    When you design a new parallel algorithm, you should take the requirements in Section 3 into consideration. Good candidates for parallelism are FOR or WHILE loops with many iterations, and large data arrays.
    One loop or one array can correspond to one dimension of an OpenCL index space, with the loop iterations or array elements mapping to work items. Nested loops or multi-dimensional arrays can correspond to up to three dimensions of an OpenCL index space.
    When you parallelize an existing single-threaded algorithm, you should likewise target its loops and data arrays, and analyze its data access pattern to decide which processor is better suited: CPU or GPGPU.
    If you are not sure, which is especially likely when your existing algorithm doesn’t clearly fit the requirements in Section 3 one way or the other (for example, it doesn’t possess much spatial locality), you can first parallelize it for the CPU. Then you can port it to the GPGPU with slight modifications (remember that OpenCL is a general-purpose parallel programming paradigm across CPU, GPGPU, and other processors) and compare the two.
    This is the approach we took for our CT image reconstruction program, Conebeam.
    Conebeam runs on a Dell Precision T7500 tower workstation equipped with an NVIDIA Tesla C2050. It has a one-dimensional input array, a one-dimensional output array, and three nested FOR loops that result in about 100M innermost iterations. The loops dynamically calculate the indexes into both arrays and compute the output array values from the input array.
    There are more than 200 arithmetic operations per memory access, and only the input and output arrays need transferring to and from the GPGPU, respectively.
    Here is pseudo-code for the method to parallelize on the GPU (the loop bounds, written here as numAngles, numBinsX and numBinsY, are placeholders for the real values):

    void someMethod(float *inBuf, float *outBuf) {
        for (int angle = 0; angle < numAngles; angle++) {
            for (int binX = 0; binX < numBinsX; binX++) {
                for (int binY = 0; binY < numBinsY; binY++) {
                    // the number of innermost iterations here is about 100M
                    int idxIn  = /* calculated from angle, binX and binY */;
                    int idxOut = /* calculated from angle, binX and binY */;

                    outBuf[idxOut] = /* calculated from inBuf[idxIn] */;
                }
            }
        }
    }
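To make the loop-to-work-item mapping concrete, here is a hedged sketch in plain C: the loop body becomes a "kernel" function invoked once per (angle, binX, binY) point of the index space, exactly the role get_global_id(0..2) plays in a real OpenCL kernel. The bounds and index formulas below are made-up placeholders, not Conebeam's real math:

```c
/* Emulation of a 3-D OpenCL index space in plain C: each call to the
   "kernel" is one work item. Bounds and index math are placeholders. */
enum { NA = 4, NX = 8, NY = 8 }; /* hypothetical loop bounds */

static void kernel_body(const float *inBuf, float *outBuf,
                        int angle, int binX, int binY) {
    int idxIn  = (angle * NX + binX) * NY + binY; /* placeholder formula */
    int idxOut = (angle * NY + binY) * NX + binX; /* placeholder formula */
    outBuf[idxOut] = 2.0f * inBuf[idxIn];         /* placeholder compute  */
}

void run_index_space(const float *inBuf, float *outBuf) {
    /* On the GPU these three loops disappear: the runtime launches
       NA*NX*NY work items and each reads its ids with get_global_id(). */
    for (int angle = 0; angle < NA; angle++)
        for (int binX = 0; binX < NX; binX++)
            for (int binY = 0; binY < NY; binY++)
                kernel_body(inBuf, outBuf, angle, binX, binY);
}
```

Because every (angle, binX, binY) triple writes a distinct output index, the work items are independent and the order of execution doesn't matter — which is precisely why the loops can become an index space.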
    Because the three FOR loops calculate the values of the output array independently, they can be parallelized as the three dimensions of an OpenCL index space.
    Because the loops have many iterations, and the two arrays are large, this program seems to be a good candidate for the GPGPU.
    However, because the order in which the two arrays are indexed is not aligned with the loop iteration order, the program has very bad spatial locality and local memory can’t be used. Fortunately, because it performs many arithmetic operations per memory access, it should hide memory access latency to some degree on the GPGPU.
    Before we compare the two OpenCL implementations with real tests, here is an estimate:
    Since the T7500 has 8 compute units running at 2.2GHz while the C2050 has 14 compute units running at 1.1GHz, the T7500 is approximately equivalent to 16 C2050 compute units.
    The T7500 has a 192-bit global memory interface through 3 parallel channels per processor, so its overall memory interface is 384 bits; the C2050’s memory interface is also 384 bits, through 6 channels.
    The significant differences are: (a) the T7500 has just one processing element per compute unit while the C2050 has 32; (b) the T7500 has a much larger cache than the GPGPU; and (c) the T7500’s memory bandwidth is 230GBps while the C2050’s is 144GBps (a GPU usually has higher memory bandwidth than a CPU, but the E5507’s effective memory speed is 4.8GHz while the C2050’s is only 3.0GHz).
    Putting aside the cache factor, and given that the CPU version doesn’t use SSE, the GPGPU version should be approximately 32 times faster than the CPU version (the algorithm’s spatial locality is so bad that difference (c) shouldn’t give the CPU version any advantage).
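These estimates follow from simple arithmetic. A minimal sketch using the article's own model — bus width in bytes times effective memory clock; real sustained bandwidth will be lower:

```c
/* The estimation model used above, expressed as arithmetic. */
double equivalent_compute_units(int units, double clock_ghz, double target_clock_ghz) {
    return units * clock_ghz / target_clock_ghz;       /* scale by clock ratio */
}

double mem_bandwidth_gbps(int bus_width_bits, double effective_clock_ghz) {
    return bus_width_bits / 8.0 * effective_clock_ghz; /* bytes/transfer * GHz */
}
```

For example, equivalent_compute_units(8, 2.2, 1.1) gives 16, while mem_bandwidth_gbps(384, 4.8) and mem_bandwidth_gbps(384, 3.0) reproduce the 230.4GBps and 144GBps figures above.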
    Here are the real test results:
    The original single-threaded version ran about 70 minutes. The CPU version, which didn’t use SSE so that it could be ported to the NVIDIA GPGPU easily, ran about 12 minutes. The GPGPU version ran about 1 minute.
    The CPU version didn’t achieve an 8-fold improvement over the single-threaded version because Conebeam has no spatial locality and the 4 cores per processor incur memory bus contention.
    The GPGPU version didn’t achieve a 32-fold improvement over the CPU version for the same reason, and because the CPU has a much larger cache.
    To get linear scalability on the GPGPU, the algorithm would need to be enhanced to exhibit spatial locality, if that is possible.

    5 Further Discussion

    A major challenge in GPU programming is preserving GPU performance levels while increasing the generality and expressiveness of application interfaces.
    Although OpenCL, CUDA, and ATI Stream relieve programmers from learning graphics jargon, they may not let you make good use of all GPU resources. The authors are not sure whether this concern still holds true with NVIDIA’s Tesla series and AMD’s FireStream series.
    And while OpenCL provides programming portability, it doesn't necessarily provide performance portability.

    Wednesday, February 2, 2011

    GPGPU High Performance Computing using OpenCL -- Work with Multiple Vendor Platforms using ICD

    Sometimes you have different workloads that are better handled by different vendors' OpenCL implementations.
    For example, if your data exhibits access locality or coherence, and both the number of calculations per data element transferred over PCIe and the number of calculations per memory access are high, you can use NVIDIA's OpenCL implementation on its GPGPU.
    On the other hand, if your data doesn't possess all of those characteristics, you can use ATI/AMD's OpenCL implementation on the CPU, so that you can at least take advantage of all your CPU cores.

    OpenCL has an extension, "cl_khr_icd", formally the OpenCL ICD (Installable Client Driver), which allows multiple OpenCL implementations to co-exist and lets applications select between them at runtime.

    In my case, since I have both CPU- and GPGPU-friendly data, I installed on a 64-bit Windows 7 workstation both the NVIDIA GPU Computing SDK 3.2, which supports OpenCL 1.0, and the ATI Stream SDK 2.2, which supports OpenCL 1.1.

    1. Find out whether your Vendor's SDK Supports the OpenCL ICD
    You can compile and run NVIDIA's oclDeviceQuery program. If you see "cl_khr_icd" in the "CL_DEVICE_EXTENSIONS" section, NVIDIA supports the OpenCL ICD.
    You can also run ATI's "clinfo" program, which prints out all OpenCL platforms. Again, if you see "cl_khr_icd" in a platform's "Extensions" section, that platform supports the OpenCL ICD.
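Under the hood, both checks are just a substring search on the extensions string. A minimal sketch of the same test in host code; in a real program the string would come from clGetPlatformInfo with CL_PLATFORM_EXTENSIONS, or clGetDeviceInfo with CL_DEVICE_EXTENSIONS:

```c
#include <string.h>

/* Returns 1 if the space-separated extensions string lists cl_khr_icd. */
int supports_icd(const char *extensions) {
    return extensions != NULL && strstr(extensions, "cl_khr_icd") != NULL;
}
```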

    2. How does ICD Work?
    Although I set up all my Visual Studio projects with the NVIDIA GPU Computing SDK, I can still dynamically load ATI's CPU OpenCL implementation.
    The project setup needs two configurations. One is to specify the OpenCL include directory, which contains vendor-neutral headers such as cl.h and cl_ext.h, and some vendor-specific headers such as cl_agent_amd.h for AMD.
    Unless your SDKs target different OpenCL versions, all vendor-neutral headers should be the same. If you need vendor-specific headers, you must of course set up your project with that vendor's SDK.

    The other configuration is to specify the OpenCL library directory where a vendor-neutral stub library, "OpenCL.lib", is located. This "OpenCL.lib" is called the ICD loader and is responsible for dynamically loading a vendor-specific ICD library.
    When you install an OpenCL SDK that supports the ICD, the SDK registers its ICD DLL under the registry key
    HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors on Windows.
    The ICD loader can then scan that registry key and dynamically load your vendor's DLL.

    For more information, please refer to this resource. 

    3. How does ICD affect your Code?
    Your application is now responsible for selecting which of the OpenCL platforms present on a system it wishes to use, instead of just requesting the system default.
    This means using the clGetPlatformIDs() and clGetPlatformInfo() functions to examine the list of available OpenCL implementations and selecting the one that best suits your requirements.
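The platform choice can be kept in one small function. A sketch, assuming the vendor strings were already fetched with clGetPlatformInfo(..., CL_PLATFORM_VENDOR, ...) after clGetPlatformIDs() enumerated the platforms; the vendor names in the usage note are typical but not guaranteed spellings:

```c
#include <string.h>

/* Returns the index of the first platform whose vendor string contains
   `want`, or -1 if no platform matches. */
int pick_platform(const char *vendors[], int count, const char *want) {
    for (int i = 0; i < count; i++)
        if (vendors[i] != NULL && strstr(vendors[i], want) != NULL)
            return i;
    return -1;
}
```

For example, with vendor strings {"Advanced Micro Devices, Inc.", "NVIDIA Corporation"}, pick_platform(..., "NVIDIA") selects the NVIDIA platform for GPGPU-friendly workloads, and pick_platform(..., "Advanced Micro Devices") selects the CPU implementation.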

    For example, previously you could call the following function without specifying the specific platform value:
    context = clCreateContextFromType(
         0, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);

    With ICD, you must specify a specific platform:

    cl_context_properties cps[3] = {
                  CL_CONTEXT_PLATFORM,
                  (cl_context_properties)platform,
                  0};
    context = clCreateContextFromType(
         cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);