Mysterious (dead?)locking

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Mysterious (dead?)locking

nigelm
Hi.

When we use HSQLDB for some system tests of a deployed webapp, about 60% of the time the system freezes without completing.

We're on 2.3.0, and have moved to using MVCC.

On examination of the locked sessions, it would seem that everything is waiting on a 'Checkpoint' operation (see below) by looking at INFORMATION_SCHEMA.SYSTEM_SESSIONS - Everything is waiting on this, but this is allegedly not waiting on anything itself.

The thread dump shows the checkpoint is coming from a regular timed task.

Is there anything I can do to further diagnose this, as I can't understand why it's reporting connection 10 isn't waiting on anything - but the system is locked solid.




INFORMATION_SCHEMA.SYSTEM_SESSIONS:
-----------------------------------------------------
[3, 2013-12-24 10:39:06.0, SA, true, true, false, 0, PUBLIC, false, 0, , 10, SELECT VERSION FROM KMS_VERSION WHERE ID='KMS', 1]
[6, 2013-12-24 10:39:17.0, SA, true, false, false, 0, PUBLIC, false, 0, , 10, SELECT * FROM QRTZ_SCHEDULER_STATE WHERE SCHED_NAME = 'schedulerFactoryBean', 1]
[7, 2013-12-24 10:39:34.0, SA, true, false, false, 0, PUBLIC, false, 0, , 10, SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'schedulerFactoryBean' AND TRIGGER_STATE = ? AND NEXT_FIRE_TIME <= ? AND (MISFIRE_INSTR = -1 OR (MISFIRE_INSTR != -1 AND NEXT_FIRE_TIME >= ?)) ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC, 1]
[10, 2013-12-24 10:41:53.0, _SYSTEM, true, false, false, 0, PUBLIC, false, 0, 3,7,6,11, , CHECKPOINT, 1]
[11, 2013-12-24 10:42:13.0, SA, true, false, false, 0, PUBLIC, false, 0, , 10, SELECT COUNT(TRIGGER_NAME) FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'schedulerFactoryBean' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < ? AND TRIGGER_STATE = ?, 1]
[13, 2013-12-24 11:26:17.0, SA, true, false, false, 0, PUBLIC, false, 0, , , select * FROM INFORMATION_SCHEMA.SYSTEM_SESSIONS, 0]

Thread dump
----------------
Thread t@31: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=811 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=55, line=969 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1281 (Interpreted frame)
 - java.util.concurrent.CountDownLatch.await() @bci=5, line=207 (Interpreted frame)
 - org.hsqldb.lib.CountUpDownLatch.await() @bci=12 (Compiled frame)
 - org.hsqldb.Session.executeCompiledStatement(org.hsqldb.Statement, java.lang.Object[], int) @bci=289 (Compiled frame)
 - org.hsqldb.Database$CheckpointRunner.run() @bci=25 (Interpreted frame)
 - org.hsqldb.lib.HsqlTimer$TaskRunner.run() @bci=19 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

------------------------------------------------------------------------------
Rapidly troubleshoot problems before they affect your business. Most IT
organizations don't have a clear picture of how application performance
affects their revenue. With AppDynamics, you get 100% visibility into your
Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
_______________________________________________
Hsqldb-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/hsqldb-user
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Mysterious (dead?)locking

nigelm
I think I may have solved this issue, and I believe it boils down to a couple of issues in the implementation of CountUpDownLatch.java.

Firstly, the latch declares a

    volatile int            count;

And increments/decrements it in countUp / countDown by doing

    count--;
and
    count++;

Unfortunately, this is not thread-safe, as the former is equivalent to

int localCount = count;
localCount = localCount - 1;
count = localCount;

So increments / decrements can be 'lost', causing the database to wedge. The first fix is to replace the count with an AtomicInteger, which makes the situation better but still locks on occasion.

The second issue is more subtle, and is an interaction between await() and countUp(). 

52: public void await() throws InterruptedException {
53:
54:        if (count == 0) {
55:            return;
56:        }
57:
58:        latch.await();
59:    }
...
74: public void countUp() {
75:
76:      if (latch.getCount() == 0) {
77:            latch = new CountDownLatch(1);
78:        }
79:
80:        count++;
81:    }

The scenario starts with the latch having been used once, and latch.getCount() == 0.

Thread 1 enters countUp(), executes line 76, then gets suspended.
Thread 2 enters countUp(), and executes all statements (creating a new latch, setting count to 1).
Thread 3 enters await(). count != 0, so it executes through and suspends in the latch.await();
Thread 1 re-awakes. It starts executing from line 77. This over-writes the latch that thread3 is waiting on.

Thus Thread3 will never be released, because the countdown latch it is waiting in has been overwritten. 

I've patched the implementation locally to synchronize around the checking and creation of a new latch, preventing the two latches from being created. I'm not entirely sure that getLatch() is required, but it was a belt & braces.

My builds that were failing roughly 75% of the time with a wedged database now seem to be fine.

The full re-worked latch is here; it'd be great if it could be applied to the next release.

HTH,
Nigel

--
public class CountUpDownLatch {

    volatile CountDownLatch latch;
    AtomicInteger count;

    public CountUpDownLatch() {
        latch = new CountDownLatch(1);
        count = new AtomicInteger(0);
    }

    public void await() throws InterruptedException {

        if (count.get() == 0) {
            return;
        }

        CountDownLatch theLatch = getLatch();
        theLatch.await();
    }

    public void countDown() {

        int value = count.decrementAndGet();

        if (value == 0) {
            latch.countDown();
        }
    }

    public long getCount() {
        return count.get();
    }

    public void countUp() {

        makeNewLatchIfNeccessary();

        count.incrementAndGet();
    }

    public void setCount(int count) {

        if (count == 0) {
            if (latch.getCount() != 0) {
                latch.countDown();
            }
        } else {
            makeNewLatchIfNeccessary();
        }

        this.count.set(count);
    }

    private void makeNewLatchIfNeccessary() {
        synchronized(this) {
            if (latch.getCount() == 0) {
                latch = new CountDownLatch(1);
            }
        }
    }

    private CountDownLatch getLatch() {
        CountDownLatch value;
        synchronized(this) {
            value = this.latch;
        }
        return value;
    }
}




On Tue, Dec 24, 2013 at 1:19 PM, Nigel Magnay <[hidden email]> wrote:
Hi.

When we use HSQLDB for some system tests of a deployed webapp, about 60% of the time the system freezes without completing.

We're on 2.3.0, and have moved to using MVCC.

On examination of the locked sessions, it would seem that everything is waiting on a 'Checkpoint' operation (see below) by looking at INFORMATION_SCHEMA.SYSTEM_SESSIONS - Everything is waiting on this, but this is allegedly not waiting on anything itself.

The thread dump shows the checkpoint is coming from a regular timed task.

Is there anything I can do to further diagnose this, as I can't understand why it's reporting connection 10 isn't waiting on anything - but the system is locked solid.




INFORMATION_SCHEMA.SYSTEM_SESSIONS:
-----------------------------------------------------
[3, 2013-12-24 10:39:06.0, SA, true, true, false, 0, PUBLIC, false, 0, , 10, SELECT VERSION FROM KMS_VERSION WHERE ID='KMS', 1]
[6, 2013-12-24 10:39:17.0, SA, true, false, false, 0, PUBLIC, false, 0, , 10, SELECT * FROM QRTZ_SCHEDULER_STATE WHERE SCHED_NAME = 'schedulerFactoryBean', 1]
[7, 2013-12-24 10:39:34.0, SA, true, false, false, 0, PUBLIC, false, 0, , 10, SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'schedulerFactoryBean' AND TRIGGER_STATE = ? AND NEXT_FIRE_TIME <= ? AND (MISFIRE_INSTR = -1 OR (MISFIRE_INSTR != -1 AND NEXT_FIRE_TIME >= ?)) ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC, 1]
[10, 2013-12-24 10:41:53.0, _SYSTEM, true, false, false, 0, PUBLIC, false, 0, 3,7,6,11, , CHECKPOINT, 1]
[11, 2013-12-24 10:42:13.0, SA, true, false, false, 0, PUBLIC, false, 0, , 10, SELECT COUNT(TRIGGER_NAME) FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'schedulerFactoryBean' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < ? AND TRIGGER_STATE = ?, 1]
[13, 2013-12-24 11:26:17.0, SA, true, false, false, 0, PUBLIC, false, 0, , , select * FROM INFORMATION_SCHEMA.SYSTEM_SESSIONS, 0]

Thread dump
----------------
Thread t@31: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=811 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=55, line=969 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1281 (Interpreted frame)
 - java.util.concurrent.CountDownLatch.await() @bci=5, line=207 (Interpreted frame)
 - org.hsqldb.lib.CountUpDownLatch.await() @bci=12 (Compiled frame)
 - org.hsqldb.Session.executeCompiledStatement(org.hsqldb.Statement, java.lang.Object[], int) @bci=289 (Compiled frame)
 - org.hsqldb.Database$CheckpointRunner.run() @bci=25 (Interpreted frame)
 - org.hsqldb.lib.HsqlTimer$TaskRunner.run() @bci=19 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)


------------------------------------------------------------------------------
Rapidly troubleshoot problems before they affect your business. Most IT
organizations don't have a clear picture of how application performance
affects their revenue. With AppDynamics, you get 100% visibility into your
Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
_______________________________________________
Hsqldb-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/hsqldb-user
Loading...